UDP Tuning

UDP will not get a full 10Gbps (or more) without some tuning as well. The important factors are:

  • use jumbo frames: performance will be 4-5 times better using 9K MTUs
  • packet size: best performance is MTU size minus packet header size. For example, for a 9000-byte MTU, use 8972 for IPv4 and 8952 for IPv6.
  • socket buffer size: For UDP, buffer size is not related to RTT the way it is for TCP, but the defaults are still not large enough. Setting the socket buffer to 4M seems to help a lot in most cases.
  • core selection: UDP at 10G is typically CPU limited, so it's important to pick the right core. This is particularly true on Sandy/Ivy Bridge motherboards.
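
The packet size figures above come from subtracting the IP and UDP headers from the MTU. A minimal sketch of that arithmetic follows; the function name is hypothetical, and it assumes a plain 20-byte IPv4 or 40-byte IPv6 header (no IP options or extension headers) plus the 8-byte UDP header.

```shell
# Largest UDP payload that fits in a single packet for a given MTU.
# Assumption: fixed-size IP headers (20 bytes IPv4, 40 bytes IPv6),
# no IP options or extension headers, plus the 8-byte UDP header.
udp_payload_size() {
    mtu=$1
    ipver=${2:-4}
    case "$ipver" in
        4) ip_hdr=20 ;;
        6) ip_hdr=40 ;;
        *) echo "unknown IP version: $ipver" >&2; return 1 ;;
    esac
    echo $((mtu - ip_hdr - 8))
}

udp_payload_size 9000 4   # prints 8972
udp_payload_size 9000 6   # prints 8952
udp_payload_size 1500 4   # prints 1472
```

The result is the value to pass to the -l option of the test tools.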

Sample commands for nuttcp, iperf3, and iperf:

   nuttcp -l8972 -T30 -u -w4m -Ru -i1 -xc4/4 remotehost
   iperf3 -l8972 -t30 -u -w4m -b0 -A 4,4 -c remotehost
   numactl -C 4 iperf -l8972 -t30 -u -w4m -b10G -c remotehost

You may need to try different cores to find the best one for your host. You can use 'mpstat -P ALL 1' to figure out which core is being used for NIC interrupt handling, and then try a core in the same socket, but not the same core. Note that 'top' is not reliable for this.
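
The "same socket, but not the same core" rule can be sketched as a small shell helper. This is only an illustration: the function name is hypothetical, it assumes lscpu from util-linux is installed, and it works on logical CPU numbers as reported by 'lscpu -p=CPU,SOCKET'.

```shell
# List CPUs that share a socket with the given CPU, excluding that CPU.
# Reads "CPU,SOCKET" lines on stdin, as printed by: lscpu -p=CPU,SOCKET
same_socket_cores() {
    awk -F, -v irq="$1" '
        /^#/ { next }                 # skip the lscpu comment header
        { socket[$1] = $2 }           # remember the socket of each CPU
        END {
            if (!(irq in socket)) exit    # unknown CPU number: print nothing
            want = socket[irq]
            for (cpu in socket)
                if (cpu != irq && socket[cpu] == want) print cpu
        }
    ' | sort -n
}

# Example with a made-up layout: CPUs 0-3 on socket 0, CPUs 4-7 on socket 1.
printf '%s\n' '# CPU,Socket' 0,0 1,0 2,0 3,0 4,1 5,1 6,1 7,1 | same_socket_cores 6
# prints 4, 5 and 7, one per line
```

On a real host, run 'lscpu -p=CPU,SOCKET | same_socket_cores N', where N is the core that mpstat shows handling NIC interrupts, and try the listed cores with -xc, -A, or numactl -C.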

In general nuttcp seems to be the fastest for UDP. Note that you'll need nuttcp V7.1 or higher to get the "-xc" option.

Even with this tuning, you'll need fast cores to get a full 10Gbps. For example, a 2.9GHz Intel Xeon CPU can get the full 10Gbps line rate, but with a 2.5GHz Intel Xeon CPU, we see only 5.9Gbps. The 2.9GHz CPU gets 22 Gbps of UDP using a 40G NIC. 

Processor architectures are being designed to favor aggregate capacity (e.g. more cores) at the expense of clock rate. Pushing clock speeds higher has been problematic, while providing diminishing returns for some use cases. Many newer machines offer more cores, and this generally works well for VMs and for large numbers of small network flows. Single-stream performance testing is one use case that benefits from fewer cores and higher clock speeds.

If you want to see how much you can get with 2 UDP flows, each on a separate core, you can do something like this:

   nuttcp -i1 -xc 2/2 -Is1 -u -Ru -l8972 -w4m -p 5500 remotehost &
   nuttcp -i1 -xc 3/3 -Is2 -u -Ru -l8972 -w4m -p 5501 remotehost &
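
To try more than two flows, the same pattern can be generated rather than typed by hand. The sketch below only prints the commands so you can inspect them first. The function name is hypothetical, and it assumes you want consecutive core numbers and consecutive ports starting at 5500; adjust both to match your host and the ports your nuttcp server accepts.

```shell
# Print one nuttcp command per flow, each pinned to its own core and port.
# The nuttcp flags are the ones used above. Assumptions: consecutive cores
# starting at first_core, consecutive ports starting at first_port.
# Nothing is executed here; the commands are only printed.
nuttcp_flow_cmds() {
    host=$1
    first_core=$2
    nflows=$3
    first_port=${4:-5500}
    i=0
    while [ "$i" -lt "$nflows" ]; do
        core=$((first_core + i))
        port=$((first_port + i))
        echo "nuttcp -i1 -xc $core/$core -Is$((i + 1)) -u -Ru -l8972 -w4m -p $port $host &"
        i=$((i + 1))
    done
    echo "wait"
}

nuttcp_flow_cmds remotehost 2 2
```

With these arguments the output is the two-flow example above followed by 'wait', which keeps the shell from returning until both tests finish. Pipe the output to sh once you are happy with it.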

Determining CPU Limitations

If you are running the commands above but still don't see great performance, use 'mpstat -P ALL 1' to determine how much CPU is being used. For example, in one nuttcp test using the suggested command-line options, the result was 5.9 Gbps.

Note that nuttcp reports 99% CPU on the transmit host. mpstat on the receiving host confirms that core 6 is not saturated.

mpstat on the sending host confirms that core 6 is saturated.

For these hosts, running multiple nuttcp clients on different cores will increase total throughput.
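
The mpstat check described above can be scripted so that saturated cores stand out. This is a rough sketch, not a definitive tool: the function name and the 5% idle threshold are assumptions, and it expects the usual mpstat layout with a header line containing "CPU" and "%idle" columns and a "." decimal separator.

```shell
# Print CPUs whose %idle is below a threshold (default 5), reading
# mpstat output from stdin. Column positions are taken from each header
# line, so both the per-interval and the "Average:" sections work.
busy_cores() {
    awk -v limit="${1:-5}" '
        {
            c = 0; d = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   c = i
                if ($i == "%idle") d = i
            }
            # A line with both names is a header: remember the columns.
            if (c && d) { cpu_col = c; idle_col = d; next }
        }
        # Data line for a numbered CPU (skips the "all" summary row).
        cpu_col && NF >= idle_col && $cpu_col ~ /^[0-9]+$/ && $idle_col + 0 < limit {
            printf "core %s: %.1f%% busy\n", $cpu_col, 100 - $idle_col
        }
    '
}
```

For example, run 'mpstat -P ALL 1 5 | busy_cores 5' on the sending and receiving hosts while a test is in progress. A core that shows up on every interval is the likely bottleneck.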