There are a number of possible causes of packet loss. Some are easy to address, and some are much more difficult. These include:
- network congestion
- under-buffered switch dropping packets
- under-powered firewall dropping packets
- dirty fibers or connectors
- overloaded or slow receive host dropping packets
It is easy to determine if there is packet loss on the path using iperf3 (if you see retransmits, then there is loss) or using owping. You can also use tcpdump/tcptrace to get a detailed view of what TCP sees.
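As a rough illustration of the iperf3 check, the sketch below runs a TCP test with JSON output (`-J`) and reads the sender-side retransmit count from the report. The function names are hypothetical, and it assumes `iperf3` is installed locally, an iperf3 server is listening on the remote host, and the sender is a platform where iperf3 reports retransmits (such as Linux).

```python
import json
import subprocess


def total_retransmits(report: dict) -> int:
    """Return the sender-side TCP retransmit count from an iperf3 JSON report.

    Assumes the report came from a TCP test; if the field is absent
    (for example on a platform that does not report it), 0 is returned.
    """
    return int(report["end"]["sum_sent"].get("retransmits", 0))


def run_iperf3(server: str, seconds: int = 30) -> dict:
    """Run `iperf3 -c SERVER -t SECONDS -J` and return the parsed JSON report."""
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True,
        text=True,
        check=True,  # raise if iperf3 exits with an error
    )
    return json.loads(result.stdout)
```

For example, `total_retransmits(run_iperf3("iperf.example.org"))` (a placeholder host name) returning a non-zero value suggests loss somewhere on the path.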
Once you determine that packet loss is the problem, finding the cause of the loss can be tricky. Run tests to other sites to isolate which end site is the source of the problem. Look for other perfSONAR nodes along the path and try to isolate the segment with loss. Contact site networking staff at each end to find out if the networks are congested, or if there is a firewall or low-end switch that might be the source of the problem.
The following are some things to try. Note that sometimes these techniques don't work or lead to the wrong conclusions. But often they work, and can be a good starting point.
The best way to determine if the path is congested is to look at SNMP plots. If you don't have access to SNMP data, look at long-term owamp plots. If you see regular, periodic loss events at the same time of day, there is probably congestion. An example is shown in this owamp plot.
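If plots are not available, the same time-of-day pattern can be looked for numerically. The sketch below is only an illustration: it assumes you have already exported loss measurements as `(unix_time, packets_sent, packets_lost)` tuples (a hypothetical format, not something owamp produces directly), and the function names and the 0.1% threshold are arbitrary choices.

```python
from collections import defaultdict
from datetime import datetime, timezone


def loss_by_hour(samples):
    """Aggregate (unix_time, sent, lost) samples into a loss fraction per UTC hour of day."""
    sent = defaultdict(int)
    lost = defaultdict(int)
    for unix_time, n_sent, n_lost in samples:
        hour = datetime.fromtimestamp(unix_time, tz=timezone.utc).hour
        sent[hour] += n_sent
        lost[hour] += n_lost
    # Skip hours with no packets sent to avoid dividing by zero.
    return {h: lost[h] / sent[h] for h in sorted(sent) if sent[h] > 0}


def lossy_hours(samples, threshold=0.001):
    """Return the hours of day whose aggregate loss fraction is at or above the threshold."""
    return [h for h, frac in loss_by_hour(samples).items() if frac >= threshold]
```

Loss that clusters in the same few hours across many days points toward congestion, while loss spread evenly across all hours points toward one of the other causes.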
The best way to find dirty fiber is to run a 2-3 Gbps UDP test using iperf3 or nuttcp. If you are not allowed to run UDP tests, look at the retransmit profile for iperf3 TCP tests. If there is a consistent low level of loss, this might be due to a dirty fiber. Check both directions, as dirty fiber issues are often in only one direction.
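A minimal sketch of the UDP test, run in both directions, might look like the following. It assumes `iperf3` is installed, a server is running on the far end, and you are permitted to send UDP at this rate. The function names and the 2G default are illustrative; `-u`, `-b`, `-R`, and `-J` are standard iperf3 client options.

```python
import json
import subprocess


def udp_test_command(server, rate="2G", seconds=30, reverse=False):
    """Build the iperf3 command line for a UDP test at the given target bitrate."""
    cmd = ["iperf3", "-c", server, "-u", "-b", rate, "-t", str(seconds), "-J"]
    if reverse:
        cmd.append("-R")  # reverse mode: the server sends, the client receives
    return cmd


def udp_loss_fraction(report: dict) -> float:
    """Return lost/total datagrams from the end-of-test summary of a UDP report."""
    summary = report["end"]["sum"]
    packets = summary.get("packets", 0)
    return summary.get("lost_packets", 0) / packets if packets else 0.0


def udp_loss_both_directions(server, rate="2G"):
    """Run the test in each direction and return the loss fraction for both."""
    results = {}
    for name, reverse in (("client_to_server", False), ("server_to_client", True)):
        out = subprocess.run(
            udp_test_command(server, rate, reverse=reverse),
            capture_output=True, text=True, check=True,
        )
        results[name] = udp_loss_fraction(json.loads(out.stdout))
    return results
```

A small but steady loss fraction in one direction only is the pattern to look for. Keep in mind that an overloaded receive host can also drop UDP packets at these rates.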
Under-buffered switch in the path
nuttcp burst mode testing is a good way to discover under-buffered switches/routers.
Under-powered firewall in the path
If you see a constant rate of TCP retransmits at all times of day in one direction, this might be due to an under-powered firewall (link to sample plot). It can be hard to distinguish this from under-buffered switch issues, so you'll need to talk to the end site network administrators to confirm.
Receive Host issues
If the command 'ethtool -S ethN | grep rx_over_errors' on the receive end shows increasing errors, then your host is not fast enough to handle the incoming packet rate.
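To tell whether the counter is actually increasing, it helps to sample it twice during a transfer. The sketch below is one way to do that. It assumes a Linux host with `ethtool` installed and a driver that exposes a counter whose name contains `rx_over_errors` (counter names vary by driver). The function names are hypothetical, and summing all matching lines is a design choice meant to cover drivers that report per-queue counters.

```python
import subprocess
import time


def parse_counter(ethtool_output: str, name: str = "rx_over_errors") -> int:
    """Sum every `ethtool -S` counter whose name contains `name` (like grep would match)."""
    total = 0
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and name in key and value.strip().isdigit():
            total += int(value.strip())
    return total


def read_counter(interface: str, name: str = "rx_over_errors") -> int:
    """Run `ethtool -S INTERFACE` and return the current counter total."""
    out = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_counter(out.stdout, name)


def counter_increase(interface: str, interval: float = 10.0,
                     name: str = "rx_over_errors") -> int:
    """Return how much the counter grew over `interval` seconds."""
    before = read_counter(interface, name)
    time.sleep(interval)
    return read_counter(interface, name) - before
```

Running `counter_increase("eth0")` (interface name is a placeholder) on the receiver while a test is in progress and getting a positive value suggests the host is dropping packets.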