Fair Queuing Scheduler
Packet Pacing, TSO (TCP Segmentation Offload) sizing, and the FQ (Fair Queuing) scheduler
Starting with Linux kernel 3.11 (available in Fedora 20, Debian 8, and Ubuntu 13.10), there is a new 'fair queuing' (FQ) scheduler that does a much better job of pacing packets out of a fast host. See https://lwn.net/Articles/564978/ for more details. For RHEL-based OSes, FQ has been backported to the 3.10.0-327 kernel in RHEL 7.2.
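To confirm that your kernel supports FQ before enabling it, you can check the kernel version and look for the sch_fq module. A minimal sketch ($ETH is whatever interface you are testing):
uname -r                      # need 3.11+, or the RHEL 7.2 backport
modinfo sch_fq | head -3      # present if FQ is built as a module
tc qdisc show dev $ETH        # shows the qdisc currently attached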
More information on configuring FQ is available here and here.
On some long paths (50-80ms RTT), we've seen TCP performance improvements of 2-4X, as shown below. More experimental results are available here.
In particular, FQ helps if there is a network device in the path with less than 32MB of per-port buffering.
To enable Fair Queuing (which is off by default), do:
tc qdisc add dev $ETH root fq
or to both pace and shape the bandwidth:
tc qdisc add dev $ETH root fq maxrate Ngbit
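Once FQ is attached, you can verify that it is installed and that packets are being paced, and optionally make it the system-wide default qdisc. A quick sketch (the sysctl is available on recent kernels; it is not required for the per-interface commands above):
tc -s qdisc show dev $ETH                 # fq statistics (throttled packets, pacing)
sysctl -w net.core.default_qdisc=fq       # optional: make fq the default for new interfaces
tc qdisc del dev $ETH root                # remove fq and revert to the previous default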

A plot of the tcpdump for these transfers clearly shows why throughput with FQ is better. The plot on the left is with FQ, and the plot on the right is without FQ.

Details on these results.
Test path: Fermi National Lab (near Chicago) to NERSC (Oakland CA)
FNAL Sender (40G) —> FNAL-S (100G) —> FNAL-R (100G) —> FNAL-BR (100G) —> STARLIGHT-R (100G) —> NERSC-R (100G) —> NERSC Receiver (40G)
S: switch; R: router; BR: border router
Both FNAL Sender and NERSC Receiver are configured with Mellanox 40GE NICs.
The FNAL-S 40GE line card has 4x10Gbps parallelism, instead of 1x40Gbps. Therefore, a single stream's throughput is limited to 10Gbps.
A different test on the ESnet 40G testbed produced the following results. First, with default settings:
iperf3 -c 10.20.1.20 -A2,2 -t50 -w512M
Connecting to host 10.20.1.20, port 5201
[ 4] local 10.20.1.8 port 52812 connected to 10.20.1.20 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 718 MBytes 6.02 Gbits/sec 0 43.5 MBytes
[ 4] 1.00-2.00 sec 1.99 GBytes 17.1 Gbits/sec 738 213 MBytes
[ 4] 2.00-3.00 sec 2.34 GBytes 20.1 Gbits/sec 0 213 MBytes
[ 4] 3.00-4.00 sec 2.16 GBytes 18.6 Gbits/sec 0 214 MBytes
[ 4] 4.00-5.00 sec 2.24 GBytes 19.3 Gbits/sec 0 215 MBytes
[ 4] 5.00-6.00 sec 2.19 GBytes 18.8 Gbits/sec 0 218 MBytes
[ 4] 6.00-7.00 sec 2.30 GBytes 19.8 Gbits/sec 0 221 MBytes
[ 4] 7.00-8.00 sec 2.25 GBytes 19.3 Gbits/sec 0 226 MBytes
[ 4] 8.00-9.00 sec 2.40 GBytes 20.6 Gbits/sec 0 231 MBytes
[ 4] 9.00-10.00 sec 2.36 GBytes 20.3 Gbits/sec 0 238 MBytes
[ 4] 10.00-11.00 sec 2.53 GBytes 21.7 Gbits/sec 0 245 MBytes
[ 4] 11.00-12.00 sec 2.51 GBytes 21.6 Gbits/sec 0 254 MBytes
[ 4] 12.00-13.00 sec 2.69 GBytes 23.1 Gbits/sec 0 263 MBytes
[ 4] 13.00-14.00 sec 2.72 GBytes 23.3 Gbits/sec 0 274 MBytes
[ 4] 14.00-15.00 sec 2.88 GBytes 24.8 Gbits/sec 0 285 MBytes
[ 4] 15.00-16.00 sec 2.96 GBytes 25.4 Gbits/sec 0 297 MBytes
[ 4] 16.00-17.00 sec 3.11 GBytes 26.7 Gbits/sec 0 309 MBytes
[ 4] 17.00-18.00 sec 3.22 GBytes 27.7 Gbits/sec 0 322 MBytes
[ 4] 18.00-19.00 sec 3.38 GBytes 29.0 Gbits/sec 0 334 MBytes
[ 4] 19.00-20.00 sec 3.48 GBytes 29.8 Gbits/sec 0 348 MBytes
[ 4] 20.00-21.00 sec 3.58 GBytes 30.7 Gbits/sec 0 360 MBytes
And here is the same test with FQ on:
iperf3 -c 10.20.1.20 -A2,2 -t50 -w512M
Connecting to host 10.20.1.20, port 5201
[ 4] local 10.20.1.8 port 52824 connected to 10.20.1.20 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 709 MBytes 5.95 Gbits/sec 0 35.0 MBytes
[ 4] 1.00-2.00 sec 2.50 GBytes 21.4 Gbits/sec 0 885 MBytes
[ 4] 2.00-3.00 sec 3.55 GBytes 30.5 Gbits/sec 0 885 MBytes
[ 4] 3.00-4.00 sec 3.55 GBytes 30.5 Gbits/sec 0 885 MBytes
[ 4] 4.00-5.00 sec 3.54 GBytes 30.4 Gbits/sec 0 885 MBytes
[ 4] 5.00-6.00 sec 3.54 GBytes 30.4 Gbits/sec 0 885 MBytes
[ 4] 6.00-7.00 sec 3.54 GBytes 30.4 Gbits/sec 0 885 MBytes
[ 4] 7.00-7.72 sec 2.54 GBytes 30.4 Gbits/sec 0 885 MBytes
Note that with FQ on, there is no burst of retransmits at the beginning, and it ramps up to full speed quickly.
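One way to watch FQ pacing on a live connection is to inspect TCP state with ss while the transfer is running; the output includes cwnd, retransmit counts, and the pacing rate. A sketch using the server address from the test above:
ss -tin dst 10.20.1.20        # per-flow TCP info: cwnd, retrans, pacing_rate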
Also note that for a 1500B MTU, just disabling TSO (more information available here) can lead to a 2x improvement on this path.
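Disabling TSO is done per interface with ethtool, and the current setting can be checked the same way. A minimal sketch (note that turning TSO off increases CPU load on the sender):
ethtool -k $ETH | grep tcp-segmentation-offload     # check the current TSO setting
ethtool -K $ETH tso off                             # disable TSO on the sending interface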

