ESnet Guide to Diagnosing Performance Problems
The following is an example step-by-step guide for tracking down a performance problem. These steps may or may not work for you, as they assume you have access to both end systems, have the ability to install pScheduler and iperf, and have a few perfSONAR hosts along your path of interest. End-to-end troubleshooting is MUCH harder without intermediate test hosts.
1) Scope the problem
- Identify the Data Source IP address & host system: OS, version, distribution, hardware, etc.
- Identify the Destination IP address & host system: OS, version, distribution, hardware, etc.
- Identify the data transfer tool used. Is there a better tool that could be used? e.g.: is ssh using the HPN patch?
- Identify any known firewalls on the path
- Determine the target performance level: What does the application need?
- Look at the path and make sure it is reasonable, and capable of supporting the desired target rates.
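A quick way to look at the path is a traceroute from each end (the hostname below is a placeholder for your actual destination):

```shell
# Inspect the forward path and check that the route and RTT are sane
# for the distance involved.
traceroute destination-host

# tracepath also reports the path MTU, which should be 9000 end-to-end
# if you expect jumbo frames to work.
tracepath destination-host
```

Note that the reverse path can differ, so run the same check from the other end as well.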
2) Verify the performance tuning on both end systems
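On Linux, the TCP buffer and congestion control settings are the usual suspects; a quick sanity check (the recommended values themselves are on the ESnet host tuning pages, not shown here):

```shell
# Maximum socket buffer sizes the kernel will allow.
sysctl net.core.rmem_max net.core.wmem_max

# TCP autotuning min/default/max buffer sizes.
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# Which congestion control algorithm is in use.
sysctl net.ipv4.tcp_congestion_control
```

If the maximum buffer sizes are too small for the bandwidth-delay product of your path, TCP will never reach the target rate no matter what the network does.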
3) Determine the current network throughput
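For example, a baseline measurement between the two end hosts (hostnames are placeholders):

```shell
# With pScheduler installed on both ends:
pscheduler task throughput --source data-source-host --dest destination-host

# Or with plain iperf3: run "iperf3 -s" on the destination, then:
iperf3 -c destination-host -t 30 -i 1
```

Compare the result against the target performance level from step 1.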
4) Check the level of packet loss on the path. The owping tool is the best way to do this, e.g.:
owping -c 10000 -i .01 hostname
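If owamp is not running on both hosts, a rough fallback is a fast ping (hostname is a placeholder; intervals below 0.2s require root on Linux):

```shell
# Send 10000 pings at 10ms intervals and check the loss percentage
# in the summary line.
sudo ping -c 10000 -i 0.01 destination-host
```

Even a loss rate as small as 1 packet in 10000 (0.01%) is enough to badly hurt TCP throughput on a high-latency path.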
5) Verify that the problem isn't across the WAN.
- run pScheduler tests between capable servers closest to your end hosts on the WAN
- See the perfSONAR Directory Service for a list of public pScheduler servers. Note: some of these might not allow tests to/from everywhere.
- Look at historical test results between pScheduler hosts.
If the WAN tests show a problem, contact your primary WAN provider.
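For instance, to test a WAN segment by itself between two perfSONAR hosts (the es.net hosts here are just an illustration; any pair of public pScheduler servers works the same way):

```shell
# Throughput across the WAN segment only, no end hosts involved.
pscheduler task throughput --source sunn-pt1.es.net --dest chic-pt1.es.net

# Round-trip time and loss across the same segment.
pscheduler task rtt --source sunn-pt1.es.net --dest chic-pt1.es.net
```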
6) Otherwise, run tests from both end hosts to a test host on the WAN.
- Start with a close test system, and then get further and further away.
- For example, if you are troubleshooting the path from NERSC in Oakland to Texas A&M University in College Station, TX, find some public pScheduler servers along the path:
- nersc-pt1.es.net, sunn-pt1.es.net, chic-pt1.es.net, ps1-hardy-hstn.tx-learn.net
- From both end hosts, run bi-directional throughput tests to test hosts that are further and further away, e.g., from NERSC:
pscheduler task throughput --dest nersc-endhost --source nersc-pt1.es.net
pscheduler task throughput --dest nersc-endhost --source sunn-pt1.es.net
pscheduler task throughput --dest nersc-endhost --source chic-pt1.es.net
pscheduler task throughput --dest nersc-endhost --source ps1-hardy-hstn.tx-learn.net
pscheduler task throughput --dest nersc-endhost --source psonar-tput.brazos.tamu.edu
Then reverse the source/dest hosts, and do the same thing from the TAMU host until you find the problem segment.
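The reversed ladder from the TAMU end would look like this ("tamu-endhost" is a placeholder for the actual end host there):

```shell
# Start close to TAMU and work outward toward NERSC.
pscheduler task throughput --dest tamu-endhost --source psonar-tput.brazos.tamu.edu
pscheduler task throughput --dest tamu-endhost --source ps1-hardy-hstn.tx-learn.net
pscheduler task throughput --dest tamu-endhost --source chic-pt1.es.net
pscheduler task throughput --dest tamu-endhost --source sunn-pt1.es.net
```

The segment where throughput first drops off points at the problem domain.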
7) If performance is fine on segments where the RTT is less than around 20ms, but poor on longer paths, the problem is likely a switch or router along the path without enough packet buffering. In some cases this can be fixed with a configuration change (e.g.: on Cisco 6509s); in other cases you may need to buy a better data cluster switch. A workaround is to use more parallel streams.
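Parallel streams spread the load so that no single stream's bursts overrun a shallow buffer. With iperf3 the stream count is set with -P, and pScheduler exposes the same knob (hostnames are placeholders):

```shell
# 8 parallel TCP streams instead of 1.
iperf3 -c destination-host -t 30 -P 8

pscheduler task throughput --source source-host --dest destination-host --parallel 8
```

If 8 streams get close to the target rate while 1 stream does not, that is consistent with a buffer-limited device on the path.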