I’m using pcap to capture TCP packets for which I would like to parse the payload. My strategy is as follows:
- Get the Ethernet header and check that its type is ETHERTYPE_IP (IP packet).
- Check that the IP packet has protocol IPPROTO_TCP (TCP packet).
- Check for payload size > 0 (size = ntohs(ip_header->total_length) - ip_header->header_length*4 - sizeof(struct tcp_header)).
- Parse the payload (grab the host URL).
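One thing worth checking in the size calculation above: a fixed sizeof(struct tcp_header) does not account for TCP options, so the payload offset is safer to derive from the TCP data-offset field. The following is a minimal sketch of that calculation, not the code used for the capture. The function name `tcp_payload_len` and its raw-byte interface are assumptions made for illustration; it expects a pointer to the start of the IPv4 header, just past the Ethernet header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the TCP payload length of an IPv4 packet,
 * or -1 if the headers are malformed, truncated, or not TCP.
 * `ip` points at the first byte of the IPv4 header and `caplen` is the
 * number of captured bytes available from that point. On success,
 * *payload (if non-NULL) is set to the first payload byte. */
static long tcp_payload_len(const uint8_t *ip, size_t caplen,
                            const uint8_t **payload)
{
    if (caplen < 20)
        return -1;                                   /* too short for IPv4 */

    size_t ip_hlen  = (size_t)(ip[0] & 0x0F) * 4;    /* IHL, in 32-bit words */
    size_t ip_total = ((size_t)ip[2] << 8) | ip[3];  /* total length, big-endian */

    if (ip_hlen < 20 || ip_total < ip_hlen + 20 || ip_total > caplen)
        return -1;
    if (ip[9] != 6)                                  /* 6 == IPPROTO_TCP */
        return -1;

    const uint8_t *tcp = ip + ip_hlen;
    size_t tcp_hlen = (size_t)(tcp[12] >> 4) * 4;    /* data offset, in words */

    if (tcp_hlen < 20 || ip_hlen + tcp_hlen > ip_total)
        return -1;

    if (payload)
        *payload = tcp + tcp_hlen;
    return (long)(ip_total - ip_hlen - tcp_hlen);
}
```

Using the IP total length rather than the captured frame length also keeps Ethernet padding on short frames from being counted as payload.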
I haven’t begun parsing the payload yet because I am getting discrepancies. Below is a printout of the payload of 10 captured TCP packets, using filter "host = www.google.com".
packet number: 3 : TCP Packet: Source Port: 80 Dest Port: 58723
No Data in packet
packet number: 4 : TCP Packet: Source Port: 58723 Dest Port: 80
No Data in packet
packet number: 5 : TCP Packet: Source Port: 58723 Dest Port: 80 Payload :
GET / HTTP/1.1
Host: http://www.google.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Cookie: THICNT=25; SID=DQAAAKIAAAB2ktMrEftADifGm05WkZmlHQsiy1Z2v-
Connection: keep-alive
packet number: 6 : TCP Packet: Source Port: 80 Dest Port: 58723
No Data in packet
packet number: 7 : TCP Packet: Source Port: 80 Dest Port: 58723 Payload:
\272نu\243\255\375\375}\336H\221\227\206\312~\322\317N\236\255A\343#\226\370֤\245[\327`\306ըnE\263\204\313\356\3268 )p\344\301_Y\255\267\240\222x\364
packet number: 8 : TCP Packet: Source Port: 58723 Dest Port: 80
No Data in packet
packet number: 9 : TCP Packet: Source Port: 80 Dest Port: 58723 Payload:
HTTP/1.1 200 OK
Date: Mon, 29 Nov 2010 10:11:36 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Server: gws
Content-Length: 8806
X-XSS-Protection: 1; mode=block
\213
Why is there a discrepancy in the payloads and the ports? Ideally I would like to only parse packets like packet 5. How do I ignore packets like 7 and 9?
Only by analyzing the content. Nothing in the IP or TCP header marks a packet as an HTTP request. Even treating the first data packet in a connection as the request would not work, because persistent connections carry several requests over one connection.
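As a rough illustration of what analyzing the content could look like, the sketch below checks whether a payload begins with an HTTP method token followed by a space. The function name `looks_like_http_request` and the method list are assumptions chosen for this example. It is only a heuristic and says nothing about whether the rest of the request is well formed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the payload starts with a common
 * HTTP/1.x method followed by a space, 0 otherwise. */
static int looks_like_http_request(const unsigned char *data, size_t len)
{
    static const char *const methods[] = {
        "GET ", "POST ", "HEAD ", "PUT ", "DELETE ",
        "OPTIONS ", "TRACE ", "CONNECT ", "PATCH "
    };

    for (size_t i = 0; i < sizeof methods / sizeof methods[0]; i++) {
        size_t n = strlen(methods[i]);
        /* Compare raw bytes; the payload is not NUL-terminated. */
        if (len >= n && memcmp(data, methods[i], n) == 0)
            return 1;
    }
    return 0;
}
```

With a check like this, packet 5 in the question would match, while the response in packet 9 (which starts with "HTTP/1.1") and the binary data in packet 7 would not.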
Also, to be sure of catching every URI, you need to reassemble the TCP stream and parse the HTTP request: a URI can be split across two or more packets.
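A heavily simplified sketch of the buffering side of this is shown below. It assumes one buffer per connection and direction, and that segments arrive in order with no loss or retransmission. Real reassembly has to track sequence numbers, which this does not attempt. The names `req_buf`, `req_buf_append`, and `req_buf_target` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define REQ_BUF_MAX 8192

/* Hypothetical per-connection buffer; must start zero-initialized. */
struct req_buf {
    char   data[REQ_BUF_MAX];
    size_t len;
};

/* Appends one in-order TCP payload. Returns 1 once the full header
 * block (terminated by an empty line) has been buffered, 0 if more
 * data is needed, and -1 if the buffer would overflow. */
static int req_buf_append(struct req_buf *rb, const void *payload, size_t n)
{
    if (n > REQ_BUF_MAX - 1 - rb->len)
        return -1;
    memcpy(rb->data + rb->len, payload, n);
    rb->len += n;
    rb->data[rb->len] = '\0';   /* keep it usable with string functions */
    /* Assumes the header bytes contain no embedded NUL. */
    return strstr(rb->data, "\r\n\r\n") != NULL;
}

/* Copies the request target (second token of the request line) into
 * `out`. Returns 0 on success, -1 if it is missing or does not fit. */
static int req_buf_target(const struct req_buf *rb, char *out, size_t outsz)
{
    const char *sp = strchr(rb->data, ' ');
    if (sp == NULL)
        return -1;
    const char *start = sp + 1;
    size_t n = strcspn(start, " \r\n");
    if (n == 0 || n >= outsz)
        return -1;
    memcpy(out, start, n);
    out[n] = '\0';
    return 0;
}
```

The idea is to call `req_buf_append` for every non-empty payload on a connection and to read the request target only once it returns 1, so a URI that straddles a packet boundary is still seen whole.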