Wireshark-users: Re: [Wireshark-users] TCP question: retransmission or prodding the peer?

Date: Fri, 21 Feb 2014 08:23:53 +0000 (GMT)
Hi Bill

Thanks for picking this up. I'll do my best to answer and comment on your statements.

>On 2/20/2014 2:21 PM, Bill Meier wrote:
>> On 2/20/2014 2:05 PM, Bill Meier wrote:
>>>
>>> Discussion:
>>>
>>> 1. It might be useful if you could provide a short capture of a good
>>> sequence (without the 2.5 sec delay).

I'll do that, later on.

>>> 2. I have several observations:
>>>
>>>     a. The basic request/response sequence as follows:
>>>
>>>     [ SEQ/ACK ] analysis snipped
>>>
>>> So: The fact that the seq & ack in 4 and 5 are the same is
>>>      just as expected.
>>>      packet 4 is just an "ack" with no data
>>>      packet 5 is data (with same seq/ack as the previous)
>>>
>>> However: for some reason, B took 2.5 secs to send (the start of)
>>>           a response to packet 3 in packet 5.
>>>
>>>           We know that B received packet 3 immediately because
>>>           B sent an ack in packet 4 (after the usual 200 ms delay).
>>>
>>>           So: The "B" application failed to respond immediately even
>>>               though we know that "B" received the packet at the network
>>>               level.

A (10.33.53.121) is the card reader and TCP initiator, while B (147.88.243.121) is the card server and TCP responder.

I beg to disagree here: (if we remove the two ARP packets from the capture and restart numbering from 1):

Packet 3 of the TCP session (packet 5 in the originally attached capture) was A's third packet of the three-way handshake.
Packet 4 is one of those suspicious "retransmissions", sent 7.5 ms later with a 1-byte payload (some kind of request?)
Packet 5 is B's response to packet 4, an empty ACK
Packet 6 is probably A's request to B, the first packet with some discernible payload, and probably is a "true" request of something. 
Packet 7 is B's ACK to packet 6, with a delay of 0.2 seconds
Packet 8 is B's ACK to ... (what?) with the delay of 2.5 seconds and a 1 byte payload.
Packet 9 is A's ACK to packet 8

etc., up to the FIN
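To make the timing in the list above easier to compare, here is a minimal sketch that flags unusually long inter-packet delays. Only the three delays quoted in this mail are used; the dictionary layout, the function name and the 1-second threshold are my own assumptions for illustration, not anything Wireshark provides.

```python
# Inter-packet delays in seconds, keyed by the (renumbered) packet that
# arrives after the delay. Only the three values quoted in this thread are
# listed; the timing of the other packets is not given here.
delays = {4: 0.0075, 7: 0.2, 8: 2.5}


def suspicious_gaps(delays, threshold=1.0):
    """Return the packet numbers whose delay exceeds `threshold` seconds.

    The 1.0 s default is an arbitrary assumption for this sketch.
    """
    return sorted(pkt for pkt, gap in delays.items() if gap > threshold)


flagged = suspicious_gaps(delays)
print("packets arriving after a long gap:", flagged)
```

With these numbers only packet 8 stands out; the 0.2 s in front of packet 7 looks like an ordinary delayed ACK.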


>>>           I've idea as to why. Does "Only the during the first TCP
>>>           connection" suggest some kind of initial setup
>>>           going on in "B" ?

That is what I assume. User swipes the card - card reader contacts server. 
This first TCP session is always this short, just 18 or 20 packets.  But a lot 
of them have this strange delay between packets 7 and 8.  The next TCP 
sessions follow immediately (retrieval of print job list), and while printing, 
there is a longish one, probably for the "live billing" to the card of each single 
page printed.

>>> That being said: there's another issue having to do with the
>>>     "send 1 byte", wait for ack, send remaining bytes" pattern.
>>>
>>>     Rather than me trying to explain: Do a web search on "Nagle
>>>     algorithm" and TCP_NODELAY for an explanation.
>>>
>>>     Basically: the software isn't programmed quite right (IMHO).

If I understand the bit about TCP_NODELAY correctly, setting this socket option on a
socket causes TCP to send every data chunk that gets "pushed down the stack" by the
application immediately.

Given the application's nature, and the requirement that every page printed must be 
reported back to the card reader (and written to the card), I think that disabling Nagle
is not quite wrong. 
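For reference, this is roughly what setting the option looks like. I do not know what language or socket API the card reader firmware uses, so this is only a sketch using Python's standard socket module; no connection is made.

```python
import socket

# Create a TCP socket; this sketch never connects it anywhere.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm: small writes are sent right away instead of
# being held back until previously sent data has been acknowledged.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY enabled:", bool(nodelay))

sock.close()
```

With the option set, an application that writes 1 byte and then the rest in a second call puts two small segments on the wire back to back, instead of waiting for the ACK of the first one.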

>>> Another thing I find a bit interesting:
>>> The window size advertised by B (card server ?) just keeps decreasing as
>>> data is received from A. Normally that would mean that the app isn't
>>> taking the data from the network layer. However, that appears not to be
>>> the case since the request/response sequence seems to complete OK.

The server's (B, 147.88.243.205) advertised window size starts at 8k in the SYN-ACK,
then jumps to almost 64240 bytes and decreases to 64102 bytes.

>>> What kind of system is the card server. Some kind of minimal system ?

Currently, I do not know. I assume it is a Windows server.

>> Actually: I see that the continually decrementing window size
>> advertisement applies to both the card reader and the card server.

Agreed, the card reader starts at 32k, increases to 33580 and decreases to 33551.
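To put numbers on how far the windows actually shrink, here is a small sketch using only the values quoted in this thread. The names are mine, and "almost 64240" is treated as exactly 64240 for the arithmetic, which is an assumption.

```python
# Advertised window sizes in bytes, as quoted in this thread:
# (value after the initial jump, last value seen).
# "almost 64240" is taken as 64240 here, which is an assumption.
windows = {
    "card server (B)": (64240, 64102),
    "card reader (A)": (33580, 33551),
}


def window_drop(peak, last):
    """Number of bytes by which the advertised window shrank."""
    return peak - last


drops = {host: window_drop(*sizes) for host, sizes in windows.items()}
for host, drop in drops.items():
    print(f"{host}: window shrank by {drop} bytes")
```

Both drops are tiny compared to the window itself. A receiver's advertised window normally shrinks by the amount of data that is buffered but not yet read by the application, so a check worth doing (not done here) is whether these drops match the number of payload bytes each side received.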

>> Given that we're talking embedded devices, have you discussed this issue
>> with the vendor ?

The issue in its early days had been tracked with the vendor. 

Back then, however, the network actually did have a problem:
restrictive port security timers aged out the card reader's MAC
address from the CAM table prematurely.

When the card reader had not been used for more than 5 min, it had
cleared its own ARP table and had to start by ARPing for its default gateway.

With port security enabled, MAC-learning on the Cisco 2960S is 
done in software on the CPU, not the port ASICs. The first ARP 
request was lost and the card reader had to retransmit it, 
sometimes even twice. 

We worked around this by increasing the port security timers and by
reducing the ARP timeout on the upstream L3 switch to 4 minutes.
Cisco L3 devices with CEF enabled perform active ARP cache maintenance.

One minute before expiry, they send a unicast ARP request to every
known entry for a given subnet/interface. This request/reply sequence
every 3 minutes keeps the CAM table entries alive.

No more lost ARPs since.
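The timer arithmetic behind this workaround, as a small sketch. The constant and function names are made up for illustration, and the actual port security aging value we configured is not given here, so it is left as a parameter.

```python
ARP_TIMEOUT_S = 4 * 60   # ARP timeout configured on the upstream L3 switch
REFRESH_LEAD_S = 60      # the switch re-ARPs one minute before expiry

# Interval between the unicast ARP refreshes sent to the card reader.
refresh_interval_s = ARP_TIMEOUT_S - REFRESH_LEAD_S


def keeps_entry_alive(aging_s, refresh_s=refresh_interval_s):
    """True if an entry that ages out after `aging_s` seconds of silence
    is refreshed before it expires (the refresh must come first)."""
    return refresh_s < aging_s


print("refresh interval:", refresh_interval_s, "seconds")
```

In other words, any CAM or port security aging time longer than 180 seconds is covered by the periodic ARP request/reply exchange, as long as the card reader answers it.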

>Thinking about this a bit more:
>
>It's certainly possible that the issue is lost data from the server to 
>the reader.
>
>IOW: packet 5 above is actually a retransmission which eventually makes 
>it through. Depending upon the TCP implementation, it could be that the 
>retransmission timeout is 2.5 secs.

I would agree that frame 8, with its 2.5 s delay, is a retransmission that
eventually gets through, but then again...

We've seen frame 7 get to the card reader.

If the card reader's (A's) reaction to frame 7 were lost somewhere upstream
in the network, the capture should still have seen it go from the card reader to
the switchport. The Ethernet hub I used for capturing sat between the card reader
and its usual switchport - and that switchport hasn't seen a malformed
frame in months.

So for some reason, there was no reaction from the card reader to frame 7. 
This might be perfectly ok - probably there is no reaction required.

And: If it were packet loss due to corruption (bad cabling) or congestion, 
I believe it would have to be random. Always missing a packet after 
"packet 7"  isn't random.

Of all the samples I have (more than a dozen), the 2.5 s delay is always
between packet 7 and 8.
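A back-of-the-envelope check of the "isn't random" argument. This is a deliberately crude model and all of it is assumption: exactly one lost packet per session, equally likely at any of the 18 positions of the shortest session, and 12 samples as the lower bound of "more than a dozen".

```python
from fractions import Fraction

PACKETS_PER_SESSION = 18  # shortest session length mentioned above
SAMPLES = 12              # "more than a dozen" captures, so at least 12

# Assumption: one loss per session, uniformly distributed over all positions.
# Probability that the loss hits one specific position in every sample.
p_same_slot = Fraction(1, PACKETS_PER_SESSION) ** SAMPLES

print(f"probability under the random-loss model: {float(p_same_slot):.1e}")
```

Under this model the chance is below 1 in 10^15, so purely random loss is not a plausible explanation for a delay that always sits in the same place.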

So if there actually is a packet missing from card reader (A) to Server (B), 
it never left the card reader.  Which brings us back to the vendor.

>I would guess that the first step would be to do a capture adjacent to 
>the server to rule that possibility out.

That's what I'll attempt next. I hope it is a server that runs on a platform
where I can capture directly on the server.

Capture of a non-delay-affected session will follow.

Best regards & thanks a lot

Marc