Wireshark-users: Re: [Wireshark-users] TCP question: retransmission or prodding the peer?

From: Bill Meier <wmeier@xxxxxxxxxxx>
Date: Thu, 20 Feb 2014 14:47:39 -0500
On 2/20/2014 2:21 PM, Bill Meier wrote:
On 2/20/2014 2:05 PM, Bill Meier wrote:
On 2/20/2014 6:04 AM, netztier@xxxxxxxxxx wrote:
Hi all

I am trying to track down a problem with an embedded device (card
reader, attached to a printer/copier) which is part of a "follow me
printing" solution:  User starts print job, walks to the next available
print machine, inserts card/badge, gets shown the list of his/her queued
jobs, selects one or more and prints it, and his card gets billed
by-the-page, etc etc.

This is usually done using a sequence of TCP sessions between card
reader and card server. Eventually, the card server will notify the
print server to push the selected print job to the printer, and will
maintain a flow of packets to the card reader, sending a billing
notification for every single page printed.

Every so often, there seems to be a stall in communication between the
card reader and the server, but only during the very first TCP session.

After the full three way handshake and two or three more packets, there
is a stall of ca. 2.5 seconds. This delay is noticeable to the user -
and this is what we're trying to track down.

After that delay, the card server sends a new packet with:
   - 1 byte payload (and 1 byte less of padding in the IP header)
   - PSH set
   - the **same** SEQ/ACK numbers as the packet before the delay (see
frames 9 and 10).

A similar effect can be observed in frames 5 and 6, but there the
"delay" is only 7.5ms. This time, the card reader resends a packet to
the server.
   - 1 byte payload (and 1 byte less of padding in the IP header)
   - PSH set
   - the **same** SEQ/ACK numbers as the packet before the delay (see
frames 9 and 10).

The capture was done on a passive 10Mbit/s hub between the card reader
and its switch port (Cisco 2960S), using the onboard Intel NIC of a
Lenovo T520.

I was considering that the card reader's ACK might have got lost
somewhere or had CRC errors, and that the Intel NIC might still have
forwarded them to libpcap.

However, I doubt that there are any invalid frames at all. During the
months we spent tracking down the issue, the Cisco's switch port never
saw any invalid incoming frame (CRC, undersize etc). During the capture
with the 10Mbit/s hub, there wasn't even a single collision on that
given port, although it was running "10-half" at the time.

Upstream bandwidth from the access switch is plentiful, and we have no
indication that quality suffers anywhere in the network - and they're
doing VoIP and all.

QUESTIONS
=========
a) can these observations be called "retransmissions"?

No: see discussion below

b) if yes, is there a reason why Wireshark's  [ Version 1.10.5 (SVN Rev
54262 from /trunk-1.10) ] SEQ/ACK analysis would not detect them as
such?

N/A

c) are there any knobs to turn in Wireshark to make this form of
"retransmissions" show up ?


N/A

d) is sending "same SEQ/ACK plus PSH" a known form of "cattle prodding a
lagging TCP peer"?

N/A

e) is 2.5sec a known "wait time" or "timeout" in common TCP
implementations? (from which I will conclude that there must've been
some packet loss all the same)




Discussion:

1. It might be useful if you could provide a short capture of a good
sequence (without the 2.5 sec delay).

2. I have several observations:

    a. The basic request/response sequence is as follows:

       Time     A ................        B ..............
1.   0.000000 --> 1 byte: seq:1
2.   0.200000                           <-- ack:2 seq:1 len:0
3.   0.200010 --> 90 bytes: seq:2
4.   0.400000                           <-- ack:92 seq:1 len:0
      (interval)
5.   2.900000                           <-- 1 byte: ack:92 seq:1 len:1
6.   3.100000 --> ack:2 seq:92 len:0
7.   3.100010                           <-- 10 bytes: ack:92 seq:2
8.   3.300000 --> ack:11 seq:92


So: The fact that the seq & ack in 4 and 5 are the same is
     just as expected.
     Packet 4 is just an "ack" with no data, so it uses no sequence space.
     Packet 5 is data (with the same seq/ack as the previous packet).
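
To make the sequence-number reasoning concrete, here is a minimal Python sketch. It is only an illustration, not how Wireshark's analysis is implemented. The Segment class and classify function are hypothetical names, and the values are B's packets (2, 4, 5 and 7) from the table above. It only judges what is visible at the capture point.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    number: int   # packet number in the table above
    seq: int      # relative sequence number
    length: int   # TCP payload bytes

def classify(segments):
    """Label segments from ONE sender as seen at the capture point."""
    next_expected = None  # highest (seq + length) seen so far
    labels = {}
    for s in segments:
        if s.length == 0:
            # A pure ACK carries no data and uses no sequence space.
            labels[s.number] = "pure ack"
        elif next_expected is not None and s.seq < next_expected:
            # Data that starts below what was already sent: seen before.
            labels[s.number] = "retransmission"
        else:
            labels[s.number] = "new data"
        end = s.seq + s.length
        if next_expected is None or end > next_expected:
            next_expected = end
    return labels

# B's side of the table: packets 2, 4, 5 and 7.
b_side = [Segment(2, 1, 0), Segment(4, 1, 0), Segment(5, 1, 1), Segment(7, 2, 10)]
labels = classify(b_side)
print(labels)
```

Packet 5 comes out as "new data": the two earlier packets with seq:1 had len:0, so seq 1 had not been used for data yet in this capture.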

However: for some reason, B took 2.5 secs to send (the start of)
          a response to packet 3 in packet 5.

          We know that B received packet 3 immediately because
          B sent an ack in packet 4 (after the usual 200 ms delay).

          So: The "B" application failed to respond immediately even
              though we know that "B" received the packet at the network
              level.

          I've no idea as to why. Does "only during the first TCP
          connection" suggest some kind of initial setup
          going on in "B"?
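
A quick way to spot such a stall mechanically is to look at the time deltas between consecutive packets. The sketch below is an assumption-laden illustration: it uses the schematic times from the table above (not raw capture timestamps), find_stalls is a hypothetical helper, and the 1.0 second threshold is an arbitrary choice.

```python
# (packet number, time in seconds, sender) taken from the table above.
packets = [
    (1, 0.000000, "A"), (2, 0.200000, "B"),
    (3, 0.200010, "A"), (4, 0.400000, "B"),
    (5, 2.900000, "B"), (6, 3.100000, "A"),
    (7, 3.100010, "B"), (8, 3.300000, "A"),
]

def find_stalls(packets, threshold=1.0):
    """Return (previous packet, next packet, gap) for gaps above threshold."""
    stalls = []
    for prev, cur in zip(packets, packets[1:]):
        gap = cur[1] - prev[1]
        if gap > threshold:
            stalls.append((prev[0], cur[0], round(gap, 6)))
    return stalls

stalls = find_stalls(packets)
print(stalls)
```

With these values the only gap above the threshold is the one between packets 4 and 5, and both of those packets come from B.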



That being said: there's another issue having to do with the
    "send 1 byte, wait for ack, send remaining bytes" pattern.

    Rather than me trying to explain: Do a web search on "Nagle
    algorithm" and TCP_NODELAY for an explanation.

    Basically: the software isn't programmed quite right (IMHO).
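
    As a rough illustration of the usual remedies (a sketch only; I have no
    idea what the vendor's code looks like): either hand the whole message to
    the stack in one send call, or set TCP_NODELAY so a small write is not
    held back while earlier data is unacknowledged. The function names
    open_nodelay and send_message are hypothetical, the 1 byte + 90 byte
    split is taken from the table above, and the loopback listener exists
    only to make the example runnable.

```python
import socket

def open_nodelay(host, port):
    """Connect and disable Nagle's algorithm on the new socket."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

def send_message(sock, header, body):
    """Send header and body with ONE call instead of two separate writes."""
    sock.sendall(header + body)

# Loopback demo: a throwaway listener on an OS-assigned port.
listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]

client = open_nodelay("127.0.0.1", port)
server_side, _ = listener.accept()

# 1-byte header plus 90-byte body, as in packets 1 and 3 of the table.
send_message(client, b"\x01", b"x" * 90)

received = b""
while len(received) < 91:
    chunk = server_side.recv(4096)
    if not chunk:
        break
    received += chunk

# Non-zero means TCP_NODELAY is set on the client socket.
nodelay = client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

for s in (client, server_side, listener):
    s.close()

print(len(received), bool(nodelay))
```

    Combining the writes is generally the better fix; TCP_NODELAY on its own
    still puts a 1-byte segment on the wire, it just stops the second write
    from waiting for the peer's (possibly delayed) ack.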

Another thing I find a bit interesting:

The window size advertised by B (card server?) just keeps decreasing as
data is received from A. Normally that would mean that the app isn't
taking the data from the network layer. However, that appears not to be
the case since the request/response sequence seems to complete OK.

What kind of system is the card server? Some kind of minimal system?



Actually: I see that the continually decrementing window size
advertisement applies to both the card reader and the card server.

Given that we're talking embedded devices, have you discussed this issue
with the vendor ?


Thinking about this a bit more:

It's certainly possible that the issue is lost data from the server to the reader.

IOW: packet 5 above is actually a retransmission which eventually makes it through. Depending upon the TCP implementation, it could be that the retransmission timeout is 2.5 secs.


I would guess that the first step would be to do a capture adjacent to the server to rule that possibility out.