Troubleshooting Overview

A number of good, very detailed network troubleshooting guides have been written, but they are aimed at network or system administrators. The goal of this site is to help network users as well, not just network wizards.

ESnet Guide to Diagnosing Performance Problems

The following is an example step-by-step guide to tracking down a problem. These steps may or may not work for you, as they assume that you have access to both end systems, can install pScheduler and iperf, and have a few perfSONAR hosts along your path of interest. End-to-end troubleshooting is MUCH harder without intermediate test hosts.

1) Scope the problem

  • Identify the Data Source IP address & host system: OS, version, distribution, hardware, etc.
  • Identify the Destination IP address & host system: OS, version, distribution, hardware, etc.
  • Identify the data transfer tool used. Is there a better tool that could be used? Is sshd using the HPN patch?
  • Identify any known firewalls on the path
  • Determine the target performance level: What does the application need?
    • Look at the path and make sure it is reasonable, and capable of supporting the desired target rates.
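
To put a number on the target performance level, it can help to convert the amount of data and the available time window into a required average rate. The sketch below is illustrative only: the function name required_gbps is hypothetical, it assumes decimal units (1 TB = 8000 gigabits), and it ignores protocol overhead and disk speed.

```shell
# required_gbps DATA_TB HOURS
# Print the average rate, in Gbps, needed to move DATA_TB terabytes
# within HOURS hours. Assumes decimal units: 1 TB = 8000 gigabits.
required_gbps() {
    awk -v tb="$1" -v hours="$2" 'BEGIN {
        gigabits = tb * 8000
        seconds = hours * 3600
        printf "%.2f\n", gigabits / seconds
    }'
}
```

For example, `required_gbps 10 24` prints 0.93: moving 10 TB in 24 hours needs a sustained rate of a bit under 1 Gbps, so the path should comfortably support more than that.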

2) Verify the performance tuning on both end systems
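
On Linux end hosts, a first check is usually the kernel's TCP buffer limits and congestion control setting. The following is a minimal sketch, assuming Linux and the standard /proc/sys layout; the function name show_tcp_tuning and its optional root-directory argument are hypothetical, and the list of settings is a starting point rather than a complete tuning checklist.

```shell
# show_tcp_tuning [ROOT]
# Print a few TCP-related kernel settings. ROOT defaults to /proc/sys;
# the argument exists only so the function can be pointed at a copy.
# The body runs in a subshell so its variables do not leak.
show_tcp_tuning() (
    root="${1:-/proc/sys}"
    for key in net/core/rmem_max net/core/wmem_max \
               net/ipv4/tcp_rmem net/ipv4/tcp_wmem \
               net/ipv4/tcp_congestion_control; do
        # Show the name in the dotted form that sysctl uses.
        name=$(printf '%s' "$key" | tr / .)
        if [ -r "$root/$key" ]; then
            printf '%s = %s\n' "$name" "$(cat "$root/$key")"
        else
            printf '%s = (not available)\n' "$name"
        fi
    done
)
```

Run it on both end systems and compare the values against the tuning recommendations for your path length and link speed.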

3) Determine the current network throughput

4) Check the level of packet loss on the path. The owping tool, part of owamp, is the best way to do this, e.g.:

owping -c 10000 -i .01 hostname
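
owping ends each direction's report with a summary line giving the packets sent and lost. As a rough sketch, the helper below extracts the loss rate from that output. It assumes the summary line has the form "10000 sent, 3 lost (0.030%), 0 duplicates", which may differ between owamp versions; the function name owping_loss is hypothetical.

```shell
# owping_loss: read owping output on stdin and print the loss percentage
# for each summary line found (normally one per direction).
# Assumed summary format: "<sent> sent, <lost> lost (<pct>%), <n> duplicates"
owping_loss() {
    awk '$2 == "sent," && $4 == "lost" {
        if ($1 > 0) printf "%.3f\n", $3 * 100 / $1
    }'
}

# Usage (hostname is a placeholder):
#   owping -c 10000 -i .01 hostname | owping_loss
```

On long paths, even a very small non-zero loss rate can noticeably limit TCP throughput, so any value above 0.000 is worth following up.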

5) Verify that the problem isn't across the WAN.

  • Run pScheduler tests between capable servers closest to your end hosts on the WAN
    • See the perfSONAR Directory Service for a list of public pScheduler servers. Note: some of these might not allow tests to/from everywhere.
  • Look at historical test results between pScheduler hosts.
    • ESnet Tests (note: ESnet's regularly scheduled tests use scavenger service (tos=32) and may show low performance on heavily used links.)
    • Internet2 Tests

If the WAN tests show a problem, contact your primary WAN provider.

Otherwise, run tests from both end hosts to a test host on the WAN.

  • Start with a close test system, and then get further and further away.
  • For example, if you are troubleshooting the path from NERSC in Oakland to Texas A&M in College Station, TX, find some public pScheduler servers along the path.
  • From both end hosts, run bi-directional throughput tests to test hosts that are further and further away, e.g., from NERSC:
    pscheduler task throughput --dest nersc-endhost --source
    pscheduler task throughput --dest nersc-endhost --source
    pscheduler task throughput --dest nersc-endhost --source
    pscheduler task throughput --dest nersc-endhost --source
    pscheduler task throughput --dest nersc-endhost --source

Then reverse the source/dest hosts, and do the same thing from the TAMU host until you find the problem segment.
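
The sweep described above can be scripted. The sketch below only prints the pscheduler commands, one pair per test host, so they can be reviewed before being run. The function name throughput_sweep and the host names in the usage comment are hypothetical placeholders, not real test hosts.

```shell
# throughput_sweep END_HOST TEST_HOST...
# Print a throughput test in each direction between END_HOST and every
# TEST_HOST, in the order given (list the closest test host first).
throughput_sweep() {
    end_host=$1
    shift
    for test_host in "$@"; do
        echo "pscheduler task throughput --source $test_host --dest $end_host"
        echo "pscheduler task throughput --source $end_host --dest $test_host"
    done
}

# Usage (placeholder host names):
#   throughput_sweep nersc-endhost near.example.org far.example.org
# To actually run the tests, pipe the output to a shell:
#   throughput_sweep nersc-endhost near.example.org far.example.org | sh
```

The first test host at which throughput drops, compared with the one before it, brackets the segment that deserves a closer look.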

6) If performance is fine on segments where the RTT is less than about 20 ms but poor on longer paths, the problem is likely a switch or router without enough packet buffering. In some cases this can be fixed by changing the configuration (e.g., on Cisco 6509s); in other cases you may need to buy a better data cluster switch. A workaround is to use more parallel streams.
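
The RTT dependence follows from the bandwidth-delay product (BDP), the amount of data a TCP sender keeps in flight to fill the path. As a back-of-the-envelope sketch, the BDP and the per-stream share when using parallel streams can be estimated as shown below. The function name bdp_mbytes is hypothetical, units are decimal megabytes, and real buffer requirements depend on traffic mix and device architecture.

```shell
# bdp_mbytes GBPS RTT_MS [STREAMS]
# Print the bandwidth-delay product in megabytes for a path of GBPS
# gigabits per second and RTT_MS milliseconds round-trip time, divided
# evenly across STREAMS parallel streams (default 1).
bdp_mbytes() {
    awk -v gbps="$1" -v rtt_ms="$2" -v streams="${3:-1}" 'BEGIN {
        # bits in flight = rate * RTT; divide by 8 for bytes.
        # (gbps * 1e9) * (rtt_ms / 1000) / 8 / 1e6 simplifies to gbps * rtt_ms / 8.
        printf "%.2f\n", gbps * rtt_ms / 8 / streams
    }'
}
```

`bdp_mbytes 10 20` prints 25.00 and `bdp_mbytes 10 80` prints 100.00, so a single 10 Gbps flow on an 80 ms path can have four times as much data in flight as on a 20 ms path. With four parallel streams, `bdp_mbytes 10 80 4` gives 25.00 per stream, which is one way to see why parallel streams can help: each stream's window, and so its bursts, are smaller.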