-----Original Message-----
From: wireshark-users-bounces@xxxxxxxxxxxxx [mailto:wireshark-users-bounces@xxxxxxxxxxxxx] On Behalf Of Jeffs
Sent: Wednesday, 11 August 2010 15:07
To: Community support list for Wireshark
Subject: Re: [Wireshark-users] filter for ONLY initial get request
>
> This formula, however, only returns results minus the links and images
> embedded in the web page:
>
> tshark -r test.cap -T fields -e http.host | sed 's/?.*$//' | sed -n
> '/www./p' | sort | uniq -c | sort -rn | head -n 100
>
> 15 www.propertyshark.com
> 8 www.nytimes.com
> 2 www.google-analytics.com
> 1 www.facebook.com
>
>
> However, I am new to regex so I'm sure I may be missing something or
> losing some links.
>
It is a common mistake to assume that every website has its main
address on a "www" subdomain. If you want a generic filter, you cannot
rely on it. If you want relevant results, you'll have to build a
non-restrictive regexp and manually filter out inappropriate results,
possibly adding some rules to exclude well-known advertising sites.
A fully automatic solution would be to parse the data and check that it
is a well-formed HTML (or XML or plain-text) document. This will purge
videos and images from your results.
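
As a rough sketch of the first suggestion (not a tested recipe for your
capture), the pipeline below keeps every host name instead of only those
containing "www", and drops hosts that match an exclusion pattern. The
function name count_hosts and the two domains in EXCLUDE_RE are my own
illustrative assumptions; you would have to build the real exclusion list
by looking at your results. It does not attempt the content-parsing
approach.

```shell
# Hypothetical exclusion pattern: hosts to drop from the tally.
# The two domains are examples only, not a complete advertising list.
EXCLUDE_RE='(^|\.)(google-analytics\.com|doubleclick\.net)$'

# Read one host name per line on stdin, drop empty lines (packets with
# no http.host field) and excluded hosts, then print the 100 most
# frequent hosts with their counts, most frequent first.
count_hosts() {
    grep -v '^$' \
        | grep -Ev "$EXCLUDE_RE" \
        | sort | uniq -c | sort -rn | head -n 100
}

# Usage against the capture from the thread:
#   tshark -r test.cap -T fields -e http.host | count_hosts
```

Because nothing here requires "www", sites served from a bare domain or
another subdomain are counted too; the price is that you have to maintain
the exclusion pattern by hand.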