Sample Network Issues Discovered Using perfSONAR
ESnet is actively promoting the deployment of perfSONAR at sites with performance-critical data transfers, as well as in the networks along the end-to-end path, so that each segment of the path can be characterized.
Our experience is that on almost all paths where perfSONAR has been deployed, it has revealed significant, previously undetected bandwidth-limiting problems, many of which are relatively easy to resolve once identified. These all fall into the category of "soft failures," where the network is up but throughput on the path is 3-10x lower than expected.
In our experience the Internet is rife with such soft failures. The networking community is good at detecting hard failures, but not at detecting soft ones. perfSONAR (specifically bwctl, with regular tests scheduled and results published by perfSONAR-BUOY) is very good at detecting soft failures. It is still difficult to pinpoint the exact cause of a failure, but the more measurement points that exist, the easier it becomes to locate the problem.
Here are some examples of the types of soft failure that we have discovered only after bringing up a perfSONAR-based measurement host and collecting a few days' worth of active measurement data:
- multiple cases of bad fibers
- port-forwarding filter overloading router and causing packet drops
- under-powered firewalls
- router output buffer tuning needed
- previously unnoticed asymmetric routing causing poor performance
- under-powered host (doubled performance by switching to jumbo frames)
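Flagging these soft failures from regular bwctl measurements comes down to a simple comparison: the path is up, but measured throughput is several times below what it should be. The following is a minimal sketch of that check; it assumes the bwctl results have already been collected (e.g. by perfSONAR-BUOY) and reduced to Mbps values, and the function name and threshold parameter are illustrative, not part of perfSONAR itself.

```python
# Sketch of a soft-failure check over regular throughput measurements.
# "factor=3.0" reflects the low end of the 3-10x degradation range
# described above; tune it to the path's expected variability.

def is_soft_failure(measured_mbps, expected_mbps, factor=3.0):
    """Flag a path that is up (throughput > 0) but whose measured
    throughput is at least 'factor' times lower than expected."""
    return 0 < measured_mbps < expected_mbps / factor

# Example: a path expected to carry 800 Mbps measuring only 50-80 Mbps
# would be flagged on every sample.
samples = [52.0, 74.5, 61.3]
flags = [is_soft_failure(m, expected_mbps=800) for m in samples]
```

The value of running such tests regularly, rather than only when users complain, is that the degradation is visible in the measurement archive long before anyone files a trouble ticket.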
The US ATLAS project installed perfSONAR measurement servers at a number of sites and configured bwctl to run tests every few hours. After a couple of days they noticed that on the path from the University of Michigan to Brookhaven National Laboratory, throughput varied between 50 and 80 Mbps, although the path was expected to support 800 Mbps flows. The path traversed four networks (ESnet, Internet2, BNL, and UMich), any of which might have been the source of the trouble.
Luckily, there were several perfSONAR measurement hosts along the path, making it easy to eliminate potential sources of trouble. Regular tests from bnl-pt1.es.net (Brookhaven) to chic-pt1.es.net (Chicago) showed no problems, nor did tests from bnl-pt1.es.net to lhcmon.bnl.gov. However, tests from psum02.aglt2.org (Michigan) to chic-pt1.es.net showed that something was wrong with this segment of the path.
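The elimination logic above can be sketched as a comparison of per-segment results against the expected rate: segments that test clean are ruled out, and the remaining segment contains the fault. The host names below match the text, but the throughput numbers and the helper function are illustrative, not actual perfSONAR output.

```python
# Sketch of localizing a soft failure by comparing per-segment bwctl
# results, as in the UMich-to-BNL case. Values are hypothetical.

def suspect_segments(results, expected_mbps, factor=3.0):
    """Return the (src, dst) pairs whose measured throughput is far
    below expectation; the other segments are eliminated."""
    return [pair for pair, mbps in results.items()
            if mbps < expected_mbps / factor]

results = {
    ("bnl-pt1.es.net", "chic-pt1.es.net"): 910.0,   # no problem
    ("bnl-pt1.es.net", "lhcmon.bnl.gov"): 895.0,    # no problem
    ("psum02.aglt2.org", "chic-pt1.es.net"): 65.0,  # fault is on this segment
}
bad = suspect_segments(results, expected_mbps=800)
```

With more measurement hosts along a path, this narrows the fault to a shorter and shorter segment, which is exactly why the number of measurement points matters.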
This problem was not an easy one to find: no error counters were registering the issue. It turned out that Cisco Express Forwarding had an IPv4 fault status, probably due to a routing table overflow. A hard reset of the switch fixed the problem, and throughput increased to 900 Mbps, as shown in this plot.
While the current version of perfSONAR is very good at detecting the existence of problems, it can still be quite difficult to pinpoint their exact source. The more perfSONAR measurement points there are, the easier it is to locate a problem. We encourage all ESnet sites and DOE collaboration sites to deploy perfSONAR servers to help make troubleshooting easier.