DTN Tuning
Tuning your DTN host is extremely important. We have seen the overall I/O throughput of a DTN more than double with proper tuning.
Tuning can be as much an art as a science. Due to differences in hardware, it's hard to give concrete tuning advice. In general you should attempt to tune one thing at a time, and run some benchmarks to see if it made a difference. Some sample benchmark commands are shown here.
Here are some tuning settings that we have found do make a difference. This page assumes you are running a Red Hat-based Linux system, but other types of Unix should have similar tuning knobs. Note that you should always use the most recent version of the OS, as performance optimizations for new hardware are added to every release.
Additional information on tuning for 40/100G hosts can be found here.
Network
Network tuning is the most important thing to pay attention to. Be sure to follow the advice in our Linux Tuning Guide, and pay attention to our NIC tuning advice as well. We also recommend configuring FQ-based packet pacing. If you are trying to get as much bandwidth as possible out of your DTN, you'll also want to do Interrupt Binding.
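As a starting point, the Linux Tuning Guide covers increasing the maximum TCP buffer sizes so that TCP autotuning can fill a high bandwidth-delay-product path. Below is a minimal sketch of the kind of /etc/sysctl.conf entries involved; the exact values are illustrative for a 10G host, so use the values from the guide for your hardware:
# maximum socket buffer sizes, in bytes
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
# TCP autotuning limits: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432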
Packet Pacing
DTN testing by ESnet and several others has confirmed that you will reduce packet loss and maximize DTN throughput by using packet pacing to throttle network traffic to 80-90% of the NIC speed. For example, for a 10G NIC running GridFTP with 4 parallel streams, we strongly recommend using 'tc' to set the per-stream maximum to 2Gbps (4 x 2Gbps = 8Gbps, or 80% of the NIC speed). e.g.:
/sbin/tc qdisc add dev ethN root fq maxrate 2gbit
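On recent kernels (3.12 and later) you can also make fq the default qdisc for all interfaces via sysctl; note that the per-device maxrate still needs to be set with tc as above. A sketch, where ethN is a placeholder for your interface name:
/sbin/sysctl -w net.core.default_qdisc=fq
/sbin/tc qdisc show dev ethN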
BIOS
For PCI Gen3- and Gen4-based hosts, you should enable "turbo boost", and disable hyperthreading and node interleaving. More information on BIOS tuning can be found in these documents: AMD and Intel (2.1 BIOS Recommendations).
If enabling IOMMU is desired for virtualization, ensure that iommu=pt is configured as a kernel boot option to avoid mapping overheads on the native host.
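For example, assuming the same GRUB layout as the I/O scheduler example below, you would append the option to the kernel line in /boot/grub/grub.conf:
kernel /vmlinuz-2.6.35.7 ro root=/dev/VolGroup00/LogVol00 rhgb quiet iommu=pt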
I/O Scheduler
The default scheduler on some versions of Linux is the "fair" scheduler. For a DTN node, we recommend using the "deadline" scheduler instead. To enable deadline scheduling, add "elevator=deadline" to the end of the "kernel" line in your /boot/grub/grub.conf file, similar to this:
kernel /vmlinuz-2.6.35.7 ro root=/dev/VolGroup00/LogVol00 rhgb quiet elevator=deadline
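You can also check and change the scheduler for a single block device at runtime, without rebooting (sdb here is a placeholder for your data disk):
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler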
File System
We recommend using the ext4 file system in Linux for DTN nodes. XFS is also emerging as a viable option for DTNs, particularly when parallel I/O is important.
Increasing the amount of "readahead" usually helps on DTN nodes where the workflow is mostly sequential reads. However, you should definitely test this, as some RAID controllers do this already, and changing this may have adverse effects. Setting readahead should be done at system boot time. For example, add something like this to /etc/rc.local:
/sbin/blockdev --setra 262144 /dev/sdb
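To check the current readahead value for a device before changing it:
/sbin/blockdev --getra /dev/sdb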
More information on readahead is in this paper on Linux 2.6 performance improvement through readahead optimization.
EXT4 Tuning
To operate optimally on RAID systems, the file system should be tuned to the physical layout of the drives. Stride and stripe-width are used to align the volume according to the stripe size of the RAID.
- stride is calculated as Stripe Size / Block Size.
- stripe-width is calculated as stride * Number of Disks Providing Capacity.
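For example, assuming a RAID volume with a 256KB stripe size, a 4KB file system block size, and 12 data disks: stride = 256KB / 4KB = 64, and stripe-width = 64 * 12 = 768, which are the values used in the sample mkfs command below.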
Disabling journaling will also improve performance, but reduces reliability. More information on tuning ext4 for RAID can be found here.
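If the file system already exists, journaling can also be disabled after the fact with tune2fs; a sketch, assuming the file system on /dev/sdb1 is unmounted:
/sbin/tune2fs -O ^has_journal /dev/sdb1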
Sample mkfs command:
/sbin/mkfs.ext4 /dev/sdb1 -b 4096 -E stride=64,stripe-width=768 -O ^has_journal
There are also tuning settings that are done at mount time. Here are the ones that we have found improve DTN performance:
- data=writeback: this option forces ext4 to use journaling only for metadata. This gives a huge improvement in write performance.
- inode_readahead_blks=64: this specifies the number of inode blocks to be read ahead by ext4's readahead algorithm. The default is 32.
- commit=300: this parameter tells ext4 to sync its metadata and data every 300 seconds. This reduces the reliability of data writes, but increases performance.
- barrier=0: this disables write barriers, which improves write performance but risks file system corruption on power loss.
- noatime,nodiratime: these parameters tell ext4 not to write the file and directory access timestamps.
Sample fstab entry:
/dev/sdb1 /storage/data1 ext4 inode_readahead_blks=64,data=writeback,barrier=0,commit=300,noatime,nodiratime 0 2
More information on ext4 options can be found here.
XFS Configuration and Tuning
The mkfs defaults are recommended. However, there are guidelines for matching the RAID stripe unit and stripe width to optimize performance.
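For example, mkfs.xfs accepts the RAID geometry via the su (stripe unit) and sw (stripe width, as a number of data disks) options. A sketch, assuming a 256KB stripe across 12 data disks on /dev/sdb1:
/sbin/mkfs.xfs -d su=256k,sw=12 /dev/sdb1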
RAID Controller
Different RAID controllers provide different tuning controls. Check the documentation for your controller and use the settings recommended to optimize for large file reading. You will usually want to disable any "smart" built-in controller options, as they are typically designed for different workflows.
Here are some settings that we found increase performance on a 3ware RAID controller. These settings are in the BIOS, and can be entered by pressing Alt+3 when the system boots up.
- Write cache – Enabled
- Read cache – Enabled
- Continue on Error – Disabled
- Drive Queuing – Enabled
- StorSave Profile – Performance
- Auto-verify – Disabled
- Rapid RAID recovery – Disabled
Virtual Memory Subsystem
Setting dirty_background_bytes and dirty_bytes improves write performance. For our system, the settings that gave the best performance were:
echo 1000000000 > /proc/sys/vm/dirty_bytes
echo 1000000000 > /proc/sys/vm/dirty_background_bytes
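Note that these echo commands do not persist across reboots. To make the settings permanent, add the equivalent entries to /etc/sysctl.conf:
vm.dirty_bytes = 1000000000
vm.dirty_background_bytes = 1000000000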
On heavily used DTNs we've seen cases where the host will run out of memory and give an error such as:
SLUB: Unable to allocate memory on node
Reserving about 5% of the RAM for the VM subsystem using vm.min_free_kbytes seems to fix the problem.
For example, for a host with 96GB of RAM, add the following to /etc/sysctl.conf to set min_free to 4GB:
vm.min_free_kbytes = 4096000
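Then run the following to apply the new setting without rebooting:
/sbin/sysctl -p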
More information on the Linux VM subsystem is here.
SSD Tuning
Tuning your SSD is more about reliability and longevity than performance, as each flash memory cell has a finite lifespan that is determined by the number of "program and erase" (P/E) cycles. Without proper tuning, an SSD can perform worse than a traditional HD, and can die within months.
And remember: never run "write" benchmarks on an SSD, as this will quickly wear it out.
Modern SSD drives and modern OSes should all include TRIM support, which is important to prolong the life of your SSD. Only the newest RAID controllers include TRIM support (as of late 2012). This article explains why TRIM is important and how it works.
Swap
To prolong SSD lifespan, do not swap on an SSD. In Linux you can control this using the sysctl variable vm.swappiness. For example, add this to /etc/sysctl.conf:
vm.swappiness=1
This tells the kernel to avoid swapping pages to disk whenever possible.
RAMDISK
To avoid frequent re-writing of files (for example, when compiling code from source), use a ramdisk file system (tmpfs) for /tmp, /usr/tmp, etc., as shown in the sample entry below.
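A sketch of an /etc/fstab entry for a tmpfs-backed /tmp (the 2G size cap is an assumption; adjust it to your available RAM):
tmpfs /tmp tmpfs defaults,noatime,size=2g 0 0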
ext4 File System Tuning for SSD
These mount flags should be used for SSD partitions.
- noatime: Reading accesses to the file system will no longer result in an update to the atime information associated with the file. This eliminates the need for the system to make writes to the file system for files which are simply being read.
- discard: This enables the benefits of the TRIM command, as long as the kernel version is >= 2.6.33.
Sample /etc/fstab entry:
/dev/sda1 /home/data ext4 defaults,noatime,discard 0 1
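If your kernel or RAID setup does not handle the discard mount option well, an alternative is periodic trimming with the fstrim command (from util-linux), for example via a cron job:
/sbin/fstrim -v /home/data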
For more information see performance tuning for SSD in Linux.