Troubleshooting Overview

January 14, 2025

ESnet Guide to Diagnosing Performance Problems

The following is an example step-by-step guide for tracking down a performance problem. Not every step will apply to you: the guide assumes that you have access to both end systems, can install pScheduler and iperf, and have a few perfSONAR hosts along your path of interest. End-to-end troubleshooting is MUCH harder without intermediate test hosts.

1) Scope the problem

Identify the Data Source IP address & host system: OS, version, distribution, hardware, etc.
Identify the Destination IP address & host system: OS, version, distribution, hardware, etc.
Identify the data transfer tool used. Is there a better tool that could be used? For example, is ssh using the HPN patch?
Identify any known firewalls on the path.
Determine the target performance level: What does the application need?
- Look at the path and make sure it is reasonable, and capable of supporting the desired target rates.
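
One way to make the target performance level concrete is to convert a data volume and a deadline into a sustained rate. The sketch below is only illustrative: the function name `required_gbps` and the example figures (100 TB in 24 hours) are assumptions, not values from any particular application.

```python
def required_gbps(data_bytes: float, hours: float) -> float:
    """Sustained rate in Gbit/s needed to move data_bytes within the given hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    bits = data_bytes * 8          # bytes to bits
    seconds = hours * 3600         # hours to seconds
    return bits / seconds / 1e9    # bit/s to Gbit/s


if __name__ == "__main__":
    # Hypothetical example: 100 TB (decimal terabytes) in 24 hours.
    print(f"{required_gbps(100e12, 24):.2f} Gbit/s")
```

If the result is close to or above the capacity of the slowest link on the path, the target cannot be met no matter how well the hosts are tuned.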

2) Verify the performance tuning on both end systems

3) Determine the current network throughput

4) Check the level of packet loss on the path. The owping tool is the best way to do this. For example:

owping -c 10000 -i .01 hostname
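
In the command above, -c sets the number of test packets and -i the mean interval between them in seconds, so this run sends 10000 packets over roughly 100 seconds and can resolve loss down to 1 packet in 10000. To see why even that little loss matters on long paths, the sketch below applies the widely used Mathis et al. approximation for loss-limited TCP throughput, rate <= (MSS / RTT) * (1 / sqrt(loss)). This model is not part of the original guide; it is a rough estimate for a single standard TCP stream, the constant factor is taken as 1, and the function name and example values are assumptions.

```python
import math


def mathis_limit_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate upper bound in Mbit/s for one TCP stream (Mathis et al. model)."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    rtt_s = rtt_ms / 1000.0
    # (MSS / RTT) * (1 / sqrt(p)), with MSS converted from bytes to bits
    bits_per_second = (mss_bytes * 8 / rtt_s) / math.sqrt(loss_rate)
    return bits_per_second / 1e6


if __name__ == "__main__":
    # Assumed values: 1460-byte MSS, 50 ms RTT, 1 lost packet in 10000.
    print(f"{mathis_limit_mbps(1460, 50, 1e-4):.2f} Mbit/s")
```

Under these assumed values the bound is far below what a 1 Gbit/s or faster path could carry, which is why loss that looks negligible can still dominate performance.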

5) Verify that the problem isn't across the WAN.

Run pScheduler tests between capable servers closest to your end hosts on the WAN.
- See the perfSONAR Directory Service for a list of public pScheduler servers. Note: some of these might not allow tests to/from everywhere.
Look at historical test results between pScheduler hosts.
- ESnet Tests
- Internet2 Tests

If the WAN tests show a problem, contact your primary WAN provider.
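
The WAN check in this step amounts to a throughput test in each direction between two test servers. The sketch below only builds the command lines; it does not run them. It uses the `pscheduler task throughput` form with `--source` and `--dest` that appears elsewhere in this guide. The helper name `throughput_commands` is hypothetical, the two hostnames are placeholders for the servers nearest your own end hosts, and whether a given public server accepts such a test is an assumption you will need to check.

```python
import shlex
from typing import List


def throughput_commands(host_a: str, host_b: str) -> List[str]:
    """Build pscheduler throughput commands for both directions between two hosts."""
    commands = []
    for source, dest in ((host_a, host_b), (host_b, host_a)):
        argv = ["pscheduler", "task", "throughput",
                "--source", source, "--dest", dest]
        # shlex.join quotes arguments safely for pasting into a shell
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Placeholder hosts; substitute the servers closest to your end hosts.
    for cmd in throughput_commands("sunn-ps-tp.es.net", "kans-ps-tp.es.net"):
        print(cmd)
```

Testing both directions matters because a problem often affects only one direction of a path.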

6) Otherwise, run tests from both end hosts to a test host on the WAN.

Start with a close test system, and then get further and further away.
- For example, if you are troubleshooting the path from NERSC in Oakland to Texas A&M in College Station, TX, find some public pScheduler servers along the path:
  - sunn-ps-tp.es.net, kans-ps-tp.es.net, ps1-hardy-hstn.tx-learn.net
From both end hosts, run bi-directional throughput tests to test hosts that are further and further away. For example, from NERSC:

    pscheduler task throughput --dest nersc-endhost --source sunn-ps-tp.es.net
    pscheduler task throughput --dest nersc-endhost --source kans-ps-tp.es.net
    pscheduler task throughput --dest nersc-endhost --source ps1-hardy-hstn.tx-learn.net
    pscheduler task throughput --dest nersc-endhost --source psonar-tput.brazos.tamu.edu

Then swap the source and dest hosts to test the other direction, and repeat the whole set from the TAMU host until you find the problem segment.

7) If performance is fine on segments where the RTT is less than around 20 ms, but poor on longer paths, the problem may be a switch or router without enough packet buffering. In some cases this can be fixed by changing the router configuration; in other cases you may need to buy a better data cluster switch. A workaround is to use more parallel streams.
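
The link between RTT and buffering can be illustrated with the bandwidth-delay product (BDP), the amount of data a sender must keep in flight to fill a path. The sketch below is a rough illustration only: the function names and the 10 Gbit/s example rate are assumptions, and the buffer a real switch or router needs also depends on its traffic mix, which this calculation ignores.

```python
def bdp_bytes(rate_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes for a given rate and round-trip time."""
    bytes_per_second = rate_gbps * 1e9 / 8
    return bytes_per_second * rtt_ms / 1000


def per_stream_window_bytes(rate_gbps: float, rtt_ms: float, streams: int) -> float:
    """Data each stream keeps in flight if the load is split evenly across streams."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    return bdp_bytes(rate_gbps, rtt_ms) / streams


if __name__ == "__main__":
    # Assumed example: 10 Gbit/s target at the 20 ms RTT mentioned above.
    print(f"BDP: {bdp_bytes(10, 20) / 1e6:.2f} MB")
    print(f"Per stream with 4 streams: {per_stream_window_bytes(10, 20, 4) / 1e6:.2f} MB")
```

Because the BDP grows linearly with RTT, bursts on long paths are larger and more likely to overflow a shallow buffer. Splitting the transfer across parallel streams shrinks the window each stream needs, which is roughly why it can serve as a workaround.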