Case Study: Improving performance for US Atlas
The US Atlas project installed perfSONAR measurement servers at a number of sites and configured bwctl to run tests every few hours. After a couple of days they noticed that on the path from the University of Michigan to Brookhaven National Laboratory, throughput varied between 50 and 80 Mbps, although the path was expected to support 800 Mbps flows. The path traversed four networks (ESnet, Internet2, BNL, and UMich), any of which might have been the source of the trouble.
Luckily, there are several perfSONAR measurement hosts along the path, so it was easy to eliminate potential sources of trouble. Regular tests from bnl-pt1.es.net (Brookhaven) to chic-pt1.es.net (Chicago) showed no problems, and neither did tests from bnl-pt1.es.net to lhcmon.bnl.gov. However, tests from psum02.aglt2.org (Michigan) to chic-pt1.es.net showed that something was wrong on that segment of the path.
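The segment-by-segment elimination described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the tooling the project used: the SegmentTest type, the find_suspect_segments function, and the 50% cutoff are hypothetical, and the throughput figures are placeholder values rather than measurements from the case study. Only the host names and the 800 Mbps expectation come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentTest:
    """One throughput test between two measurement hosts (hypothetical type)."""
    src: str
    dst: str
    measured_mbps: float  # placeholder value, not a real measurement


def find_suspect_segments(tests, expected_mbps, min_fraction=0.5):
    """Return the tests whose throughput is below min_fraction of expected_mbps.

    The 0.5 default is an arbitrary cutoff chosen for this sketch.
    """
    floor = expected_mbps * min_fraction
    return [t for t in tests if t.measured_mbps < floor]


# Host names are taken from the case study; the Mbps numbers are illustrative only.
tests = [
    SegmentTest("bnl-pt1.es.net", "chic-pt1.es.net", 900.0),
    SegmentTest("bnl-pt1.es.net", "lhcmon.bnl.gov", 900.0),
    SegmentTest("psum02.aglt2.org", "chic-pt1.es.net", 60.0),
]

suspects = find_suspect_segments(tests, expected_mbps=800.0)
for t in suspects:
    print(f"suspect segment: {t.src} -> {t.dst} ({t.measured_mbps:.0f} Mbps)")
```

With these placeholder inputs, only the Michigan to Chicago test falls below the cutoff, which mirrors how the healthy segments were ruled out before the faulty one was examined in detail.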
The problem was not an easy one to find, because no error counters were incrementing. It turned out that Cisco Express Forwarding on a switch along the path had an IPv4 fault status, probably due to a routing table overflow. A hard reset of the switch fixed the problem, and performance rose to 900 Mbps, as shown in this plot.