Wireshark-users: Re: [Wireshark-users] Network disconnects

From: Hansang Bae <hbae@xxxxxxxxxx>
Date: Tue, 19 Feb 2008 22:15:59 -0500
Andy Alguire wrote:
Here are the symptoms:

- user workstations freeze and access to shared drives and email terminates
- interruption typically lasts 10 to 30 seconds
- interruptions occur rarely and randomly during the day, but consistently at day end (4:30 to 6PM) - most users have left the building by 5PM but interruptions continue until 6PM and sometimes later
- when users leave they logout and shut down their PCs
- Novell 6.5 network with several Windows 2000 application servers
- Novell GroupWise email
- Nortel Baystack workstation switches connected via switch backplanes
- Cisco 3750 core switch
- 1 server Vlan and 1 workstation Vlan

- to date we have upgraded Netware client on all workstations, upgraded firmware and software on switches, and eliminated legacy D-Link switches - network performance is excellent until the interruptions occur
- we are considering the possibility of an environmental cause but nothing obvious has come to light


I hope you didn't pay your consultants. It's time to divide and conquer. For example, on one PC create a batch file that repeatedly copies files *TO* the server and timestamps each copy. Another batch file can copy *FROM* the server and timestamp each copy. Use very small files: you're not testing throughput, just connectivity at the NCP level. On those same PCs, do the same for copying to and from the Windows servers. Now do the same for the mail server. Finally, have a pair of PCs copy files back and forth between themselves. These PCs should not be running GroupWise, in case *that* is the problem. Oftentimes lockups are not at the network layer; a problematic application can lock up the entire PC.
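
If a batch file is awkward to timestamp, the same copy test can be sketched in Python. This is only a rough sketch, not a finished tool: the function names, the log format, and the paths in the usage note are all made up for illustration, and it assumes the server share is already mapped or mounted on the test PC.

```python
import shutil
import time
from datetime import datetime


def timed_copy(src, dst):
    """Copy one small file and return a timestamped log line."""
    started = datetime.now().isoformat(timespec="milliseconds")
    t0 = time.monotonic()
    try:
        shutil.copyfile(src, dst)
        status = "OK"
    except OSError as exc:
        # Record the failure and keep going; the gap itself is the data.
        status = f"FAIL {exc.__class__.__name__}"
    elapsed = time.monotonic() - t0
    return f"{started} {status} {elapsed:.3f}s {src} -> {dst}"


def copy_loop(src, dst, log_path, count, interval=1.0):
    """Repeat the copy `count` times, appending one line per attempt."""
    with open(log_path, "a", encoding="utf-8") as log:
        for _ in range(count):
            log.write(timed_copy(src, dst) + "\n")
            log.flush()  # push each line out promptly in case the PC freezes
            time.sleep(interval)
```

Something like copy_loop("small.txt", "F:/test/small.txt", "to_server.log", count=36000) would cover a working day at one copy per second. Swap the two paths for the FROM direction, and keep the log on the local disk so it survives an outage.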

You will also need to create ping loops. These should ping the server every second and timestamp each result. If SMB/NCP (the file copy itself) is causing the problem, IP-only ping packets will not be affected.
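
A timestamped ping loop could look roughly like the sketch below. It shells out to the system ping command; the flags assume Windows ping (-n, -w in milliseconds) or Linux iputils ping (-c, -W in seconds), so adjust them for anything else. Treating the exit code as success or failure is a simplification, and the function names are just for illustration.

```python
import platform
import subprocess
import time
from datetime import datetime


def build_ping_command(host, system=None):
    """Return a one-shot ping command with a timeout of about one second."""
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-n", "1", "-w", "1000", host]  # -w is in milliseconds
    # Assumption: Linux iputils ping, where -W is in seconds.
    return ["ping", "-c", "1", "-W", "1", host]


def ping_once(host):
    """Send one echo request and return a timestamped OK/FAIL line."""
    stamp = datetime.now().isoformat(timespec="seconds")
    try:
        result = subprocess.run(
            build_ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
        ok = result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        ok = False
    return f"{stamp} {host} {'OK' if ok else 'FAIL'}"


def ping_loop(hosts, count, interval=1.0):
    """Ping every host once per pass and print one line per result."""
    for _ in range(count):
        for host in hosts:
            print(ping_once(host), flush=True)
        time.sleep(interval)
```

Redirect the output to a local file and line it up against the file copy logs. If the copies stall while the pings keep answering, the problem is above the IP layer.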

Knowing the exact outage duration can help you ferret out the problem as well. For example, switch ports going through spanning tree calculation have very specific timers involved.
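
To make the timer comparison concrete, here is a rough sketch that pulls outage durations out of a list of timestamped results and checks them against the classic IEEE 802.1D defaults (hello 2 s, max age 20 s, forward delay 15 s). Those are the standard defaults only; your switches may be configured differently or may run rapid spanning tree, which converges much faster. The function names and the tolerance are made up for illustration.

```python
from datetime import datetime, timedelta

# IEEE 802.1D default timers in seconds. Check the actual switch configuration.
HELLO, MAX_AGE, FORWARD_DELAY = 2, 20, 15


def find_outages(samples):
    """samples: list of (datetime, ok) tuples in time order.

    Returns (first_fail, first_ok_after, seconds) for each run of failures.
    A run still in progress at the end of the list is not reported.
    """
    outages = []
    start = None
    for stamp, ok in samples:
        if not ok and start is None:
            start = stamp
        elif ok and start is not None:
            outages.append((start, stamp, (stamp - start).total_seconds()))
            start = None
    return outages


def stp_hint(seconds, tolerance=3):
    """Compare an outage length with the default spanning tree delays."""
    port_transition = 2 * FORWARD_DELAY            # listening + learning
    reconvergence = MAX_AGE + 2 * FORWARD_DELAY    # max age expiry first
    if abs(seconds - port_transition) <= tolerance:
        return "about 30 s: matches listening + learning on one port"
    if abs(seconds - reconvergence) <= tolerance:
        return "about 50 s: matches max age + listening + learning"
    return "no match with default 802.1D timers"
```

An outage that keeps landing near 30 or 50 seconds points toward spanning tree. Outages scattered anywhere between 10 and 30 seconds point somewhere else.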

It's 10PM, do you know who your root bridge is?!? It's important because one rogue user bringing in a switch can cause your spanning tree to recalculate. But I would expect the outage to last more than 10 seconds in this case.

Don't forget to ping the routers connecting the server and user VLANs as well. By pinging at different levels of the network (PC to server, PC to PC, PC to server via the router, etc.) you may be able to spot a common failure point.
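
One way to reason about a common failure point, sketched under the assumption that you know which devices each probe passes through: take what the failed probes share, then drop anything a working probe also crossed. The probe and device names below are hypothetical placeholders, not your actual topology.

```python
def suspect_devices(paths, failed):
    """paths: probe name -> set of devices that probe traverses.
    failed: names of the probes that failed in the same time window.

    Returns devices common to every failed probe that no healthy probe used.
    """
    failed = set(failed)
    if not failed:
        return set()
    common = set.intersection(*(paths[name] for name in failed))
    healthy = set().union(*(paths[name] for name in paths if name not in failed))
    return common - healthy


# Hypothetical probe paths, for illustration only.
paths = {
    "pc-to-pc": {"workstation-switch"},
    "pc-to-router": {"workstation-switch", "core-switch"},
    "pc-to-server": {"workstation-switch", "core-switch", "server-port"},
}
print(suspect_devices(paths, ["pc-to-router", "pc-to-server"]))
```

With these made-up paths, the PC to PC probe still working rules out the workstation switch, which leaves the core switch as the only shared suspect.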

Do you have logs available on the Baystack and the Cisco switches? Do you see a common port that transitions from up to down right around the time of the outages?

If your earlier post is indicative of the problem, you may indeed be losing gobs of packets. If so, any of the above batch files will find it.

Good luck.


--

Thanks,
Hansang