Ethereal-users: [Ethereal-users] GOT IT! (Viewing this capture to help find the problem)

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: "Mark Holloway" <mholloway@xxxxxxxxxxxxxxxxxxx>
Date: Fri, 23 May 2003 09:59:04 -0700
Thanks Ronnie and Richard - your input helped tremendously!  I'll take a moment to explain what it was and how both of you hit the nail on the head just by looking at the capture.  Most of all, I want you guys to know I learned something from this experience.  I learned how to better analyze an issue when the system administrators come to me with the old cliché, "there is something wrong with the network."  I've dealt with that before, but mostly in a Microsoft world.  Troubleshooting this gave me a better approach to verbal communication with the administrators, I learned how to better analyze a capture, and I really felt I was looking at the root of the data as opposed to relying on an "expert" system which can report false positives, since the expert is only as good as the thresholds it's configured with.

Breakdown - 

My company has several data partners (pharmacies) throughout the country from which we receive pharmacy prescription information over a large frame relay network.  The data is sent to a Stratus server.  The Stratus has an application called "Route" where it stores this information.  There is another application on the Stratus, independent of Route, called Agent, that has 20 listener connections to the AS400.  As data comes into Route and is fed into Agent, Agent delivers it to the AS400 on any one of the listener sockets.  If no data exists, a keep-alive from the Stratus to the AS400 maintains connectivity so the session is never torn down.  

Now, for the bottleneck.  Once the AS400 receives the data, it goes through a special formatting process so it can hand it off to a SQL database environment running on another server.  In addition, this AS400 server also runs the "PBM Claims Adjudication" application.  The AS400 administrators told us today that they have split the PBM process (effectively doubling the load) and ALSO have increased the amount of CPU resource dedicated to formatting data for the SQL server.  I insisted we look at the CPU and DISK performance of the AS400.  At 8:00 AM EST the system is running under extremely heavy load, but at 11:00 AM, when the Pacific companies come online, the system is pegged at 90% or greater.  As you can see, a server this heavily loaded produces the typical signs of a network problem: slow response, dropped packets, and overall poor performance.  

Had it not been for your help, which resulted in me going to the Stratus administrator (who, along with the AS400 people, thought it was definitely a network-related problem), I don't know if I would have had a valid way of stating we should revisit the AS400 and see if it has a bottleneck that is causing it to choke.  I will state that in the beginning I did ask one of the AS400 people to check CPU utilization, and whether they really did or not, I don't know.  They told me everything looked good.  I will say that I know why they cranked up processing on the AS400.  My company has sold one division (PBM) to another company and the handoff takes place in the next 60 days, so data is being replicated and backed up.  On the other side, the Stratus/AS400/SQL data format, we are a rapidly growing company and trying to squeeze everything we can into what we have.  We are currently evaluating Sun hardware to replace the AS400.  

Thanks again for all of your help.

Regards,
Mark Holloway
Sr. Network Engineer - Arclight Systems, LLC
702-253-3861 // mobile 702-349-6170


-----Original Message-----
From: Richard Urwin [mailto:] 
Sent: Friday, May 23, 2003 2:01 AM
To: Mark Holloway; ethereal-users@xxxxxxxxxxxx
Subject: RE: [Ethereal-users] Viewing this capture to help find the problem

Most/All of the conversations follow the following pattern absolutely
consistently:

AS400 sends data.
Very quickly (<2ms) the Stratus acknowledges and replies with its own
data.
The AS400 then takes up to 500ms to send more data. Its TCP stack
(correctly) times out after 20-40ms and sends an acknowledgement, but
the next data is not sent for 150-450ms after that.

(To see this, create a colourisation filter of "tcp.analysis.ack_rtt >
0.01", set the time display option to "seconds since previous frame",
follow any TCP stream and close the stream window, (this sets up the
correct display filter to see only one conversation.))

I would guess that this is a command-response protocol, and that the
AS400 is taking far too long to process data and return a result.
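
As a rough illustration of the measurement behind that guess, here is a
minimal Python sketch that computes, for each side, how long it takes to
send data after receiving data from the other side. The packet records
and the response_gaps helper are hypothetical: the timings are made up
to mirror the pattern described above, not taken from the capture, and
real records would have to be exported from the capture first.

```python
# Hypothetical packet records: (time_seconds, source_ip, dest_ip, payload_bytes).
# The values are invented to mirror the pattern described in this thread.
AS400 = "10.11.2.4"
STRATUS = "10.11.2.5"

packets = [
    (0.0000, AS400, STRATUS, 120),  # AS400 sends data
    (0.0015, STRATUS, AS400, 80),   # Stratus replies with data quickly
    (0.0300, AS400, STRATUS, 0),    # AS400 sends a bare ACK (no payload)
    (0.3000, AS400, STRATUS, 120),  # AS400 finally sends more data
    (0.3012, STRATUS, AS400, 80),
    (0.3300, AS400, STRATUS, 0),
    (0.7500, AS400, STRATUS, 120),
]


def response_gaps(records, responder):
    """Return delays (seconds) between the last data packet sent to
    `responder` and the next data packet that `responder` sends."""
    gaps = []
    last_request_time = None
    for time_s, src, dst, payload in records:
        if payload == 0:
            continue  # bare ACKs carry no application data
        if dst == responder:
            last_request_time = time_s
        elif src == responder and last_request_time is not None:
            gaps.append(time_s - last_request_time)
            last_request_time = None
    return gaps


as400_gaps = response_gaps(packets, AS400)
stratus_gaps = response_gaps(packets, STRATUS)

print("AS400 response gaps (s):  ", [round(g, 4) for g in as400_gaps])
print("Stratus response gaps (s):", [round(g, 4) for g in stratus_gaps])
```

If one side's gaps are consistently hundreds of milliseconds while the
other side's are a millisecond or two, the slow side is the likelier
place to look, which is the reasoning applied to the AS400 here.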

I don't have a lot of experience with the TCP graphs, but all the (port
22211) conversations have graphs of the same shape. This indicates to me
that there is some overlying factor that is affecting all of them.
(Processor load maybe?)

There is another conversation in the file between the AS400 and
10.11.100.58 that is strikingly different and seems to be working fine.
Unfortunately the capture only has half of the conversation, so we
cannot see the AS400 latency times.

--
Richard Urwin, Private
"No 9000 series computer has ever made a mitsake or corrubiteddatatato."

-----Original Message-----
From: Mark Holloway [mailto:mholloway@xxxxxxxxxxxxxxxxxxx]
Sent: 22 May 2003 20:03
To: ethereal-users@xxxxxxxxxxxx
Subject: [Ethereal-users] Viewing this capture to help find the problem


I am having HUGE problems with TCP communications between an AS400 and
Stratus (running VOS).  The Stratus has data it is sending to the AS400
on TCP port 22211.  I have confirmed the category 5 cable is good, the
switch ports are good, and everything is set for 100 Full Duplex.  
 
I have posted a capture that is available for download (11MB) at
http://www.markholloway.com/arclight.cap 
 
I set up a mirror port on my Foundry switch and filtered the capture for
host 10.11.2.4.  
 
10.11.2.4 = AS400
10.11.2.5 = STRATUS
 
I appreciate any help or feedback in regards to troubleshooting.  I have
been trying to resolve this but I am stuck!  I had another guy run
Sniffer on this capture and the Expert reporting shows long ack time and
retransmitted packets.  I'm new to Ethereal, so I'm not sure of the best
method to 'search' for anomalies. 
 
Regards,
Mark Holloway
Sr. Network Engineer - Arclight Systems, LLC
702-253-3861 // mobile 702-349-6170
 

________________________________________________________________________
This email has been scanned for all viruses by the MessageLabs Email
Security System. For more information on a proactive email security
service working around the clock, around the globe, visit
http://www.messagelabs.com
________________________________________________________________________
