
Network Troubleshooting Quick Reference Guide

If you think you have a network problem, try some of the commands in the table below.

Note that some of these assume the perfSONAR tools are installed on one or both endpoints, and some assume you have login access to both endpoints. For information on installing the perfSONAR tools, see the Network Test Tools page.

Also note that there is no best order to try these commands, but the more you run, the more you'll learn. Try to test the shortest network segment possible first. See the perfSONAR Directory Service for a list of public measurement hosts that might be along your path of interest.

Each test below lists sample commands, the problems it can detect, and where to find more information.

1) obvious packet loss problems
Sample Commands:
  • mtr hostname
  • ping -c 1000 -i .2 hostname
  • owping hostname
  • owping -c 10000 -i .01 hostname
  • pscheduler task latency --dest receive_host --packet-count 1000 --packet-interval .01
Problem Detected:
  • Congestion
  • Dirty connections
More Information:
  • mtr
  • owamp
  • pScheduler

2) path and MTU problems
Sample Commands:
  • traceroute hostname
  • tracepath hostname
  • ping -s 8900 -M do -c 4 hostname
  • pscheduler task mtu --dest recv_host --source send_host
  • pscheduler task --tool tracepath trace --source send_host --dest receive_host
Problem Detected:
  • Routing problems
  • MTU problems
More Information:
  • MTU Issues
  • pScheduler

3) host problems
Sample Commands:
  • pscheduler task throughput --dest hostname
  • pscheduler task throughput --dest hostname --source hostname2
  • pscheduler task throughput --dest hostname --tool iperf --window-size 64M
  • pscheduler task throughput --dest hostname --tool nuttcp
Problem Detected:
  • TCP tuning
  • Underpowered host
More Information:
  • Host Tuning
  • pScheduler

4) network buffer problems
Sample Commands:
  • nuttcp -u -Ri300m/100 -i 1 -T5 -w1m hostname
  • nuttcp -u -Ri300m/300 -i 1 -T5 -w1m hostname
Problem Detected:
  • Switch/router buffer issues
More Information:
  • Buffer Issues
  • Buffer Testing

5) subtle packet loss problems
Sample Commands:
  • pscheduler task throughput --dest hostname --tool iperf -u --bandwidth 500M
  • nuttcp -l8972 -T30 -u -w4m -R3G -i1 hostname
Problem Detected:
  • Bad fiber connections

What to look for:

Test 1) These commands will tell you if there are packet loss problems. mtr gives a hop-by-hop estimate of packet loss; ping and owamp give an end-to-end estimate. Some problems only show up with a larger number of packets, so try the default settings first, then increase the packet count and rate for ping and owamp. If you see lots of loss, it might be due to congestion, or it might be due to dirty connections, which include bad optics, kinked fibers, and a long list of other possible problems. These can be hard to track down, and you may need to log into every network device along the path and look at error counters. The best way to confirm congestion is to look at SNMP counters of the network devices, but this may not always be possible.
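
When comparing several ping runs, it can help to pull the loss figure out of the output programmatically. The sketch below is only an illustration: it assumes the summary line format printed by Linux iputils ping ("N packets transmitted, M received, ..."), and the helper name ping_loss_fraction is hypothetical, not part of any tool mentioned here.

```python
import re

# Assumed format: the summary line printed by Linux iputils ping, e.g.
# "1000 packets transmitted, 998 received, 0.2% packet loss, time 199800ms"
# Some ping variants print "packets received", so that word is optional.
_SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def ping_loss_fraction(ping_output: str) -> float:
    """Return the fraction of probes lost (0.0 to 1.0) from ping's summary line."""
    match = _SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("ping reported zero packets transmitted")
    return (sent - received) / sent


# Illustrative sample text, not output captured from a real host.
sample = "1000 packets transmitted, 998 received, 0.2% packet loss, time 199800ms"
print(f"loss: {ping_loss_fraction(sample):.2%}")
```

Working from the transmitted and received counts rather than the printed percentage keeps small loss rates from being hidden by rounding in the summary line.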

Test 2) These commands will help you find MTU issues and asymmetric paths. Some paths might be asymmetric by design, and if the latency is similar, it should not impact performance. But a path that is 100x longer in one direction than the other is a problem. If either of your hosts is configured to use jumbo frames, check for MTU issues along the path.
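
The size passed to ping -s is the ICMP payload, not the full packet, so a little arithmetic shows what a given test actually exercises. The sketch below assumes IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header); the function names are hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_packet_size(payload: int) -> int:
    """Size of the IPv4 packet produced by 'ping -s payload'."""
    return payload + ICMP_HEADER + IPV4_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest 'ping -s' value that fits in one unfragmented IPv4 packet."""
    return mtu - ICMP_HEADER - IPV4_HEADER


print(ping_packet_size(8900))   # packet size for the sample command
print(max_ping_payload(9000))   # payload that exactly fills a 9000-byte MTU
print(max_ping_payload(1500))   # payload that exactly fills a 1500-byte MTU
```

Under these assumptions, the sample command (-s 8900 with -M do) sends 8928-byte packets, which will pass on a jumbo-frame path and fail if any hop is limited to a standard 1500-byte MTU.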

Test 3) These commands will help determine if there is a host tuning issue. Look at the output for Cwnd for iperf3 to make sure you are not TCP window limited. Look at the output for nuttcp to see if you are CPU limited at the send or receive host.
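
To judge whether a transfer is TCP window limited, it helps to compare the window against the bandwidth-delay product (BDP) of the path. The sketch below uses the standard relation BDP = bandwidth x round-trip time; the 10 Gbps and 50 ms figures are hypothetical example values, as are the function names.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the path full."""
    return bandwidth_bps * rtt_seconds / 8


def window_limited_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


# Hypothetical path: 10 Gbps with a 50 ms round-trip time.
needed = bdp_bytes(10e9, 0.050)
print(f"window needed: {needed / 1e6:.1f} MB")

# A 1 MiB window on the same path caps throughput well below line rate.
cap = window_limited_throughput_bps(1024 * 1024, 0.050)
print(f"throughput cap with a 1 MiB window: {cap / 1e6:.0f} Mbps")
```

If the congestion window reported by the test tool stays well below the BDP, the host is likely window limited rather than limited by the network.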

Test 4) A common problem that leads to packet loss is a network device with buffers that are too small. A good way to test for that is using the 'burst mode' for nuttcp. If a burst of 100 packets is fine, but a burst of 300 has packet loss, then there is a good chance there is an under-buffered device somewhere along the path. Unfortunately, it can be quite difficult to figure out which device is the problem.
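
The burst comparison described for Test 4 can be written down as a small decision rule. This is a rough sketch: the packet size is a hypothetical parameter (check what your nuttcp run actually sends), and both function names are made up for illustration.

```python
def burst_bytes(packets: int, packet_size: int) -> int:
    """Approximate data a device must absorb when a burst arrives back to back."""
    return packets * packet_size


def suggests_small_buffers(loss_small_burst: float, loss_large_burst: float) -> bool:
    """True when the small burst is clean but the larger burst loses packets."""
    return loss_small_burst == 0 and loss_large_burst > 0


# Hypothetical 1500-byte packets; loss values are fractions from the two runs.
print(burst_bytes(100, 1500), burst_bytes(300, 1500))
print(suggests_small_buffers(loss_small_burst=0.0, loss_large_burst=0.02))
```

Loss in both runs points away from buffering alone and back toward the congestion or dirty-connection checks in Test 1.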

Test 5) For very fast networks such as 40G and 100G, the fiber connections must be perfectly clean. Often you'll only notice loss on very fast flows. The best way to test these links is with a UDP test of greater than 2 Gbps. Always be careful when doing high-speed UDP tests to ensure you won't impact production traffic.
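
A quick calculation shows why subtle loss only becomes visible at high rates: a fast test sends far more packets, so a tiny loss rate produces a measurable number of drops. The sketch below assumes the configured rate counts payload bits only, and the 1-in-100,000 loss rate is a hypothetical value; the function names are illustrative.

```python
def packets_sent(rate_bps: float, payload_bytes: int, seconds: float) -> float:
    """Approximate packet count for a constant-rate UDP test."""
    return rate_bps * seconds / (payload_bytes * 8)


def expected_losses(loss_rate: float, packets: float) -> float:
    """Average number of lost packets at a given per-packet loss rate."""
    return loss_rate * packets


loss_rate = 1e-5  # hypothetical: one packet in 100,000

# Values from the sample nuttcp command: 3 Gbps, 8972-byte payload, 30 seconds.
fast = packets_sent(3e9, 8972, 30)
print(f"fast UDP test: {fast:,.0f} packets, ~{expected_losses(loss_rate, fast):.1f} lost")

# A 1000-packet ping run at the same loss rate will usually see no loss at all.
print(f"1000-packet ping: ~{expected_losses(loss_rate, 1000):.2f} lost")
```

With these example numbers, the 30-second UDP test sends over a million packets, while a 1000-packet ping would need on the order of a hundred runs to see a single drop.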

More details on all these issues can be found under fasterdata's Troubleshooting Guide.