Wireshark · Ethereal-users: [Ethereal-users] Script to search TCP packet payloads for arbitrary strings in arbitrary locations.

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Follower <follower@xxxxxxxxxxxxx>

Date: Sun, 17 Nov 2002 18:13:28 +1300

Hi,

Ethereal's been a great help to me, but the one piece of functionality I've needed which it doesn't have (as far as I could tell) is the ability to search TCP packet payloads (for me, HTTP transactions specifically) for arbitrary strings located anywhere in the payload.

I needed this for some troubleshooting I was doing, so I wrote a Python script to do it for me. It outputs an Ethereal-style display filter containing the TCP sequence numbers of the matching packets, which can then be pasted into a display filter entry box. (As this works on the raw 'pcap' file it could be used for any files of that format.)
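As a rough illustration of that output step, here is a minimal sketch in present-day Python 3. It is not the attached script, and the function name build_display_filter is made up for this example; it only shows how a list of sequence numbers could be turned into an OR-ed filter string with duplicates dropped.

```python
def build_display_filter(seq_nums):
    """Join TCP sequence numbers into an OR-ed display filter string.

    Duplicates are skipped and first-seen order is kept.
    (Hypothetical helper, for illustration only.)
    """
    unique = []
    for num in seq_nums:
        if num not in unique:
            unique.append(num)
    return " || ".join("(tcp.seq == %d)" % num for num in unique)


# Two distinct sequence numbers, one of them reported twice.
print(build_display_filter([2215321831, 2219942059, 2215321831]))
```

The resulting string is what would be pasted into the filter entry box.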

Thought someone else might find it useful so I've tidied it up and attached it to this email. (My apologies if that's against list etiquette; looking at the list archive I couldn't determine if it was or not...) Further documentation on usage and implementation is included internally. It probably requires Python 2.2 but may work on lower versions.

Eventually I'll put it up on http://www.rancidbacon.com/ as well, but who knows how long that will take! :-)


Hope this is useful,

Phil.

#!/usr/bin/python
#
# Name:
#   find_text
#
# Description:
#   A script to search an (uncompressed) 'pcap' format capture file and find
#   arbitrary strings at arbitrary offsets within IP v4 TCP packet payloads.
#
#   Once all the matches are found, an Ethereal-style display filter containing
#   the TCP sequence numbers of the matching packets is output; this can then
#   be copied and pasted into an Ethereal filter entry box.
#
#   Usage Example 1:
#     find_text data-dump.pcap myString 192.168.96.200 192.168.61.223
#
#   Usage Example 2:
#     find_text data-dump.pcap myString
#
#   Where:
#     'data-dump.pcap' is the source file (uncompressed)
#     'myString' is the string to search for. (Case sensitive, no white space.)
#      '192.168.96.200'
#         & '192.168.61.223' are the ip addresses in the source or destination
#                            fields, which are used to find the start of the
#                            packet. (If they are not supplied a different,
#                            potentially less reliable, but more convenient,
#                            method is used.)
#
#   Example Output:
#     Searching file 'data-dump.pcap' for packets which contain 'myString'...
#     Matches found: 2
#     Ethereal filter string:
#     (tcp.seq == 2215321831) || (tcp.seq == 2219942059)
#
#   False positives:
#     If a match is found in a packet header (rather than the payload) or
#     in a non-TCP packet the sequence number for that match will be arbitrary.
#
# How it works (short):
#   This script searches the capture file for the string requested, and then,
#   working back from that point, it looks for the start of the packet that
#   contains the match (using the source/dest IP pair to mostly-reliably
#   find the start of the IP packet header, although another method can be
#   used instead), then it finds the sequence number of the TCP packet that
#   contains the string and records it.
#
# How it works (long):
#   Rather than treating the capture file as a series of packets we ignore that
#   fact and simply treat it as a bunch of bytes to search to find a match with
#   our string. Only when we find a match do we start worrying about what
#   packet we're in.
#
#   The advantage of this is that we don't initially need to know anything
#   about the format of the captured packets, thus we can still find matches
#   even if the packets are buried amongst a bunch of non-TCP/IP packets.
#
#   The downside is that we need a method to locate the TCP sequence number
#   of the packet which contains the match from an arbitrary location in
#   its payload.
#
#   This script currently offers two different approaches to determining this:
#   (If there are other methods that work more reliably they could be added.)
#
#   Method One:
#   Use known source and destination IP addresses to find the start of the IP
#   packet and then find the TCP sequence number within the TCP header.
#
#   Once we've found a match we search backwards from it to locate the first
#   sequence of 8-bytes which matches either <IP1><IP2> or <IP2><IP1>.
#
#   This method tends to be reliable as the likelihood of the 8-byte sequence
#   (i.e. two 4-byte IP addresses) being found outside of the packet headers
#   is low (although I guess it would depend on what you were capturing).
#
#   The downside of this approach is that you have to specify the IP addresses.
#
#   Method Two:
#   Find the start of the IP header by searching for known values at known
#   offsets within it.
#
#   Because we restrict this script to matches within IP v4 and TCP packets
#   there are certain things we know about the content of the IP header:
#
#   * The value of the protocol byte is: 0x06 (TCP)
#   * The protocol byte is a static offset (9 bytes) from the start of the
#     IP header.
#   * The first byte of the IP header contains the IP version nibble.
#   * The IP version nibble is always 0x4 (since we restrict it to that).
#
#   Thus, to find the start of the IP header, we locate the first occurrence
#   (searching backwards from our match) of a byte with the value 0x06 that
#   is 9 bytes away from a byte with the top nibble equal to 0x4.
#
#   Unfortunately this isn't enough to reliably detect the start of the
#   IP header, thus, to improve the reliability, we also need to know:
#
#   * The other nibble in the byte containing the IP version nibble is the
#     IP header length (the number of 32-bit words) which, to be valid,
#     is a minimum of 5.
#
#   This additional restriction increases the reliability.
#
#   Needless to say, this method isn't foolproof but it's worked well
#   enough for me to be useful. YMMV. (I've only used it to search payloads
#   from HTTP transactions however...)
#
#   (There are at least two situations where a false match will occur:
#    1: Your data happens to have a sequence of bytes which exactly
#       match the above restrictions.
#    2: The Source IP address has a '6' as the last byte, the FLAGS nibble
#       of the IP header has a value of 4 (common) and the top four bits
#       of the fragment offset are greater than or equal to 5 (unlikely?).
#   )
#
#  Implementation Notes:
#    This implementation reads the capture file in chunks of CHUNK_SIZE bytes,
#    which reduces its memory requirements when searching large captures.
#
#    * Matches which are split over chunk boundaries *ARE* found BUT...
#    * If the IP header is in one chunk and the match is in the next then
#      the header WILL NOT be found. (You'll get a message
#      "Error: Can't find start...".)
#
#    I've never hit this problem, but you might. (A workaround is to set a
#    different chunk-size.) There's no reason why the code couldn't be
#    changed to allow this to work though.
#
# Author:
#   follower@xxxxxxxxxxxxxxx
#
# Version history:
#   v0.0.3 -- 17 November 2002
#             First public release, with added documentation.
#
#   v0.0.2 -- Added second method of finding IP header start.
#
#   v0.0.1 -- Initial version.
#
import sys

import socket

import struct

import string

CHUNK_SIZE = 1024*1024

# Number of bytes from start of ip packet to src ip
IP_PACKET_SRC_DEST_OFFSET = 12

# Number of bytes from start of ip packet to packet length
IP_PACKET_LENGTH_OFFSET = 2

# Number of bytes from start of ip packet to protocol type indicator
IP_PACKET_PROTOCOL_OFFSET = 9

IP_PACKET_PROTOCOL_TYPE_TCP = 0x6
IP_VERSION = 0x4

# Number of bytes from start of tcp packet to sequence number
TCP_PACKET_SEQ_NUM_OFFSET = 4


def getIPHeaderStart_viaIpPair(buffer, payloadOffset, ip1, ip2):
    #
    # Tries to find the start of the IP header for this packet by
    # searching for known IP addresses within it.
    #
    # Returns -1 if start not found.
    #
    ip1Toip2 = ip1 + ip2
    ip2Toip1 = ip2 + ip1
    
    srcDestPair1 = buffer.rfind(ip1Toip2, 0, payloadOffset)
    srcDestPair2 = buffer.rfind(ip2Toip1, 0, payloadOffset)

    srcDestPair = max(srcDestPair1, srcDestPair2)

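    # The pair must sit at least IP_PACKET_SRC_DEST_OFFSET bytes into the
    # buffer, otherwise the header start would fall before the buffer.
    if (srcDestPair >= IP_PACKET_SRC_DEST_OFFSET):
        ipPacketStart = srcDestPair - IP_PACKET_SRC_DEST_OFFSET
    else:
        ipPacketStart = -1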

    return ipPacketStart        


def getIPHeaderStart_viaProtocolType(buffer,
                                     payloadOffset,
                                     protocolType = \
                                     IP_PACKET_PROTOCOL_TYPE_TCP):
    #
    # Tries to find the start of the IP header for this packet by
    # searching for known values in the IP header.
    #
    # Returns -1 if start not found.
    #
    possibleIpProtocolTypeOffset = payloadOffset

    while (possibleIpProtocolTypeOffset > -1):    
    
        possibleIpProtocolTypeOffset = \
                                     buffer.rfind(chr(protocolType), 0,
                                                  possibleIpProtocolTypeOffset)

        # No more candidate protocol bytes before the match: give up.
        if (possibleIpProtocolTypeOffset == -1):
            break

        possibleIpPacketStart = possibleIpProtocolTypeOffset - \
                                IP_PACKET_PROTOCOL_OFFSET

        # A candidate this close to the start of the buffer can't have a
        # complete header before it (and a negative index would wrap around
        # to the end of the buffer), so skip it.
        if (possibleIpPacketStart < 0):
            continue

        possibleIpProtocolVersion = ord(buffer[possibleIpPacketStart]) >> 4

        # Note: If the header length is less than 5 x 32 bits then it's
        # invalid...
        # (This check reduces the likelihood of false positives somewhat.
        # Coincidentally, if you have a source IP address ending in '.6' and
        # the 'flags'+'fragment offset' (first 4 bits) of '0x40' then
        # they are the same offset apart as the one we're looking for...)
        # (Even with this check I guess it's still possible if the first four
        # bits of the 'fragment offset' are >= 5 anyway...
        # TODO: Can we calculate any other checks?)
        headerLength = ord(buffer[possibleIpPacketStart]) & 0xf

        # print ">> ver&len", hex(ord(buffer[possibleIpPacketStart]))

        if ((possibleIpProtocolVersion == IP_VERSION) and
            (headerLength >= 5)):
            return possibleIpPacketStart
        
    return -1
        
    
def getIPHeaderStart(buffer, payloadOffset, ip1 = None, ip2 = None):
    #
    # Tries to find the start of the IP header for this packet by
    # one of two methods, depending on the arguments supplied.
    #
    if ((ip1 is None) or (ip2 is None)):
        return getIPHeaderStart_viaProtocolType(buffer, payloadOffset)

    return getIPHeaderStart_viaIpPair(buffer, payloadOffset, ip1, ip2)

# Check for the correct arguments...
if ((len(sys.argv) != 5) and (len(sys.argv) != 3)):
    print "Usage : find_text <pcap file> <string> [<ip1> <ip2>]"
    sys.exit(1)


sourceFile = sys.argv[1]

targetStr = sys.argv[2]

targetStrLen = len(targetStr)

if (len(sys.argv) == 5):
    ip1 = socket.inet_aton(sys.argv[3])
    ip2 = socket.inet_aton(sys.argv[4])
else:
    ip1 = None
    ip2 = None

# TODO: Check file exists...
hSource = open(sourceFile, "rb")

print "Searching file '%s' for packets which contain '%s'..." \
      % (sourceFile, targetStr)

buffer = hSource.read(CHUNK_SIZE)

prevBuffer = ""

matchedSeqNums = []

# Find the matches...
# TODO: Document this better...
matchCount = 0
while (len(buffer) > (targetStrLen - 1)):
    # Search this buffer...
    # print "[%s]" % buffer
    location = 0
    nextStart = 0
    while (location > -1):
        location = buffer.find(targetStr, nextStart)

        if (location > -1):
            # Do stuff...
            # print "Found match"
            # Uncomment the following lines to display details as we find them.
            # print buffer[location:location+targetStrLen], 
            # print hex(hSource.tell()), \
            #       hex(hSource.tell()-(len(buffer) - location)), 

            matchCount+=1

            nextStart = location +  targetStrLen

            # Now that we have a possible data point we need to find the start
            # of the TCP packet so we can get the sequence number...
            # To do that we first find the start of the IP packet, using the
            # source/destination addresses if they were supplied, or the
            # known header values if not.
            # TODO: allow it to start in a previous buffer...
            ipPacketStart = getIPHeaderStart(buffer, location, ip1, ip2)

            if (ipPacketStart > -1):            

                # TODO: This check is redundant for find viaProtocolType...
                if (ord(buffer[ipPacketStart + IP_PACKET_PROTOCOL_OFFSET]) == \
                    IP_PACKET_PROTOCOL_TYPE_TCP):

                    # IP Header length in bytes...
                    ipHeaderLen = (ord(buffer[ipPacketStart]) & 0x0f) * 4 
                    
                    tcpPacketStart = ipPacketStart + ipHeaderLen

                    tcpSeqNumOffset = tcpPacketStart + \
                                      TCP_PACKET_SEQ_NUM_OFFSET

                    tcpSeqNumStr = buffer[tcpSeqNumOffset:tcpSeqNumOffset + 4]

                    # Note: This isn't correct if it's SYN--
                    #       but don't think that is likely...           
                    tcpSeqNum = struct.unpack("!L", tcpSeqNumStr)[0]
                    
                    #print "Sequence number: %d (0x%x)" % \
                    #      (tcpSeqNum, tcpSeqNum)

                    if (tcpSeqNum not in matchedSeqNums):
                        matchedSeqNums.append(tcpSeqNum)             

                else:
                    print "Error: Packet is not TCP"
    

            else:
                # TODO: get previous buffer if needed...
                print "Error: Can't find start..."
                                        
        
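    # get last 'n - 1' bytes of the buffer
    # (So we don't miss matches over chunk borders)
    # Note: a slice of [-0:] would keep the whole buffer and loop forever,
    # so a one-character search string carries nothing over.
    if (targetStrLen > 1):
        buffer = buffer[-(targetStrLen - 1):]
    else:
        buffer = ""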

    # append next chunk to buffer
    buffer += hSource.read(CHUNK_SIZE)

# Tidy up...
hSource.close()

# Show results...
print "Matches found: %s" % (matchCount)

if (matchCount > 0):
    print "Ethereal filter string:"
    print " || ".join(["(tcp.seq == %s)" % x for x in matchedSeqNums])
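
For anyone reading this on a current Python, the two header-location methods described in the script's comments can be sketched as below. This is an illustrative Python 3 rewrite under stated assumptions, not the script above: the function names are invented for the example, the capture data is treated as bytes, and the test packet is hand-built with 14 filler bytes standing in for a link-layer header.

```python
import socket
import struct

IP_PROTO_OFFSET = 9   # offset of the protocol byte within the IPv4 header
IP_ADDR_OFFSET = 12   # offset of the source address within the IPv4 header
TCP_SEQ_OFFSET = 4    # offset of the sequence number within the TCP header
PROTO_TCP = 6


def header_start_via_ip_pair(buf, match_at, ip1, ip2):
    """Method One: nearest <ip1><ip2> or <ip2><ip1> pair before the match."""
    pair_at = max(buf.rfind(ip1 + ip2, 0, match_at),
                  buf.rfind(ip2 + ip1, 0, match_at))
    if pair_at < IP_ADDR_OFFSET:
        return -1
    return pair_at - IP_ADDR_OFFSET


def header_start_via_signature(buf, match_at):
    """Method Two: a 0x06 byte nine bytes after a version-4, IHL >= 5 byte."""
    end = match_at
    while True:
        proto_at = buf.rfind(bytes([PROTO_TCP]), 0, end)
        if proto_at < 0:
            return -1
        start = proto_at - IP_PROTO_OFFSET
        if start >= 0 and buf[start] >> 4 == 4 and buf[start] & 0x0F >= 5:
            return start
        end = proto_at  # keep looking further back


def tcp_seq_at(buf, ip_start):
    """Read the 32-bit TCP sequence number, given the IPv4 header start."""
    ip_header_len = (buf[ip_start] & 0x0F) * 4
    seq_at = ip_start + ip_header_len + TCP_SEQ_OFFSET
    return struct.unpack("!L", buf[seq_at:seq_at + 4])[0]


# Hand-built test data: filler + IPv4 header + TCP header + payload.
src = socket.inet_aton("192.168.96.200")
dst = socket.inet_aton("192.168.61.223")
payload = b"GET /myString HTTP/1.0"
ip_hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload),
                     0x1234, 0x4000, 64, PROTO_TCP, 0, src, dst)
tcp_hdr = struct.pack("!HHLLBBHHH", 1025, 80, 2215321831, 0,
                      0x50, 0x18, 8192, 0, 0)
buf = b"\x00" * 14 + ip_hdr + tcp_hdr + payload

match_at = buf.find(b"myString")
for start in (header_start_via_ip_pair(buf, match_at, src, dst),
              header_start_via_signature(buf, match_at)):
    print(start, tcp_seq_at(buf, start))
```

Both methods should agree on this packet; on real captures Method Two remains a heuristic with the false-match cases listed in the comments.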
