Buffer Issues

A common source of packet loss is network devices that cannot handle large bursts of packets. These devices include switches, routers, and security appliances. If you see moderate loss (e.g., 300-1000 packets in a 10-second test), the likely cause is a small buffer somewhere in the path.

There are two ways to test if you have such a device in your path:

Use packet pacing to see if there is a rate at which the packet drops go away.
Use bursts of UDP packets to see if large bursts cause packet loss.

Knowing that there is a device in your end-to-end path that is dropping packets due to small buffers is only step one. Figuring out which device is dropping packets is much more difficult.

Note: Don't try to work around this type of packet loss by using multiple unpaced streams; you will likely just cause even more loss.

Method 1: FQ (fair queuing) based pacing

The easiest way to test is with FQ-based pacing. This requires an OS based on RHEL 7.2+, Fedora 20+, Debian 8+, or Ubuntu 13.10+.

FQ-based pacing does two things:

rate limit the traffic, and
smooth out the bursts of packets, so as not to overrun the (possibly) small buffer somewhere along the path.

Starting with iperf3 version 3.1.5, FQ-based pacing is supported via the "--fq-rate" flag. This version is part of the perfSONAR tools package and is used by default by pScheduler.

perfsonar-tools installation instructions: (CentOS) (Debian)

First run a test to confirm that your path is clean.

   iperf3 -c remote_host

Look at the "Retr" (retransmits) count in the test output. For example:

[ ID] Interval           Transfer     Bandwidth       Retr
[ 14]   0.00-10.00  sec  2.73 GBytes  2345 Mbits/sec  1536

You can use the perfSONAR directory service to find test endpoints.

If you see a lot of packet loss, try reducing the sending rate and see if there is a point where the loss goes away. For example:

   iperf3 -c remote_host --fq-rate 15G
   iperf3 -c remote_host --fq-rate 10G
   iperf3 -c remote_host --fq-rate 8G
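
The three commands above can be scripted as a sweep. The following is a minimal sketch, not part of iperf3 itself: the function name sweep_fq_rates is made up for illustration, and it assumes single-stream TCP tests with the default text output, where the sender summary line ends in "sender" and the Retr count is the field just before it.

```shell
# Run iperf3 once per pacing rate and print "<rate> <Retr>" for each test.
# Assumption: single-stream TCP, default text output (not -J).
sweep_fq_rates() {
    host="$1"; shift
    for rate in "$@"; do
        # The sender summary line ends with "sender"; Retr is the field before it.
        retr=$(iperf3 -c "$host" --fq-rate "$rate" |
               awk '$NF == "sender" { print $(NF-1) }')
        echo "$rate $retr"
    done
}
```

Calling `sweep_fq_rates remote_host 15G 10G 8G` prints one line per rate, which makes it easier to spot the rate at which the Retr count drops to zero or near zero.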

A graph showing sample results is on the right.

If you get loss at 15G but not at 8G, then there is probably a device somewhere in the path with buffers that cannot handle flows at the higher rate. You should pace all of your traffic to the loss-free rate, and then use enough parallel streams to fill the pipe to your desired transfer rate.
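
As a rough guide to "enough parallel streams", the count is the target rate divided by the loss-free per-stream rate, rounded up. The helper below is an illustrative sketch (streams_needed is a made-up name), and it assumes both rates are whole numbers in the same unit and that each stream actually reaches its pacing rate, which real transfers may not.

```shell
# Parallel streams needed to reach a target aggregate rate when each stream
# is paced to the loss-free rate. Both arguments are integers in the same
# unit (for example Gbit/s); the result is rounded up.
streams_needed() {
    target="$1"; per_stream="$2"
    echo $(( (target + per_stream - 1) / per_stream ))
}
```

For a hypothetical 20 Gbit/s target and an 8 Gbit/s loss-free rate, `streams_needed 20 8` gives 3, which would correspond to a test such as `iperf3 -c remote_host --fq-rate 8G -P 3`.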

This is how to configure your host to use the loss-free rate as a max sending rate:

/sbin/tc qdisc add dev ethN root fq maxrate 8gbit

This will pace all traffic to all destinations, which may only make sense if the device with small buffers is near the sender. More details on how to configure pacing are here.
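
When experimenting with different maxrate values, it helps to be able to change, inspect, and remove the setting. The wrappers below are a sketch only: the function names and the TC override variable are made up for illustration, ethN remains a placeholder for the sending interface, and the real commands need root.

```shell
# Set TC=echo to print the tc commands instead of executing them.
pace_iface() {     # usage: pace_iface ethN 8gbit
    # "replace" also works when an fq qdisc is already installed,
    # a case where "add" typically fails.
    ${TC:-/sbin/tc} qdisc replace dev "$1" root fq maxrate "$2"
}
show_qdisc() {     # usage: show_qdisc ethN
    ${TC:-/sbin/tc} qdisc show dev "$1"
}
unpace_iface() {   # usage: unpace_iface ethN (reverts to the default qdisc)
    ${TC:-/sbin/tc} qdisc del dev "$1" root
}
```

Running `TC=echo pace_iface ethN 8gbit` shows the command that would be executed without touching the interface.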

If you are still running CentOS 6, FQ-based packet pacing is not supported. You need to upgrade to CentOS 7 or later.

Method 2: use nuttcp UDP burst mode

This method requires a login on both endpoints, because pScheduler does not support it.

The idea is to try different burst sizes, and see if there is a burst size where packet loss starts.

For example, the following commands were used to identify a path with buffers that are too small:

If this has no packet loss:

   nuttcp -l8972 -T30 -u -w4m -Ri300m/100 -i1 server_hostname

And this has packet loss:

   nuttcp -l8972 -T30 -u -w4m -Ri300m/300 -i1 server_hostname

Then there is likely a device in the path with less than 32MB of per-port buffering.
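
Trying several burst sizes by hand gets tedious, so the two commands above can be wrapped in a loop. This is a minimal sketch: burst_sweep is a made-up name, the nuttcp options are copied from the commands above, and the byte count it prints is simply packets times the 8972-byte payload, not a measurement of any device's buffer.

```shell
# Run the nuttcp UDP burst test once per burst size (packets per burst).
burst_sweep() {
    host="$1"; shift
    for burst in "$@"; do
        # UDP payload bytes in one burst = packets x 8972-byte payload
        echo "burst=$burst packets ($(( burst * 8972 )) payload bytes)"
        nuttcp -l8972 -T30 -u -w4m -Ri300m/"$burst" -i1 "$host"
    done
}
```

For example, `burst_sweep server_hostname 100 200 300` runs three 30-second tests; the smallest burst size at which nuttcp starts reporting loss is the number of interest.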

Note that 8972 is the UDP payload size for a 9000-byte MTU: 9000 bytes minus the 20-byte IPv4 header and the 8-byte UDP header. Use smaller values for a standard 1500-byte MTU or for IPv6.
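
The value to pass to -l for other cases follows from the same subtraction. The helper below is an illustrative sketch (udp_payload is a made-up name); it assumes an IPv4 header without options (20 bytes) or a fixed IPv6 header without extension headers (40 bytes), plus the 8-byte UDP header.

```shell
# Largest UDP payload that fits in one packet for a given MTU and IP version.
udp_payload() {
    mtu="$1"; ipver="$2"
    case "$ipver" in
        4) ip_hdr=20 ;;   # IPv4 header, no options
        6) ip_hdr=40 ;;   # fixed IPv6 header, no extension headers
        *) echo "ipver must be 4 or 6" >&2; return 1 ;;
    esac
    echo $(( mtu - ip_hdr - 8 ))   # 8 = UDP header
}
```

With these assumptions, `udp_payload 9000 4` gives 8972, `udp_payload 1500 4` gives 1472, and `udp_payload 9000 6` gives 8952.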