I/O Benchmarking
Methodology
The goal of this benchmarking is to simulate, as realistically as possible, the workflow that the DTN is going to support, measure the performance, and tune each of the subsystems one by one until the expected performance is reached, effectively removing all bottlenecks. Once each subsystem is correctly configured, run the application and measure the performance again to check whether there are any performance conflicts between the subsystems.
I/O
The elbencho storage benchmark, along with the storage_sweep contribution, provides a robust way to understand your storage subsystem performance across a wide range of file distributions. A container image is available to run a full storage sweep test and generate plots of the results for each data set. The container image and example usage may be found here.
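If elbencho itself is installed, a single test point can also be run directly from the command line. The sketch below is only an illustration: the flags shown (-w write, -t threads, -b block size, -s file size, --direct to bypass the page cache) reflect elbencho's documented options but may differ between versions, so check "elbencho --help", and the path is just an example:

$ elbencho -w -t 8 -b 4M -s 20g --direct /storage/data1/elbencho-testfile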
Additionally, there are several other I/O benchmarking tools such as iometer and ioperf. While those tools are excellent and can simulate complex I/O workflows, they often require an infrastructure to run. Since DTN workflows are often quite simple (parallel sequential reads and writes), it can be easier and faster to use simple tools: the UNIX command "dd" to generate I/O operations and "vmstat" to measure performance. Note that if you run two tests back-to-back, the file from the first test will be cached. To flush the cache, do this (as root):
sync; echo 3 > /proc/sys/vm/drop_caches
Simulating a single thread sequentially writing a file can be done using dd as:
$ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
if=/dev/zero: "if" stands for "input file". The reason for using /dev/zero as a source of data is that this pseudo-file operates at memory-to-memory speed; in other words, it will not interfere with the measurement of the write operation on the file system.
of=/storage/data1/file1: "of" stands for "output file". The destination file is on the performance file system.
bs=4k: "bs" stands for "block size". Since a normal POSIX read/write on the file system uses 4k blocks by default, dd needs to use the same block size in order to correctly model the application. Note that dd will buffer an amount of data equal to the block size argument before writing it out to disk.
count=33554432: "count" is the number of blocks that will be written. In order to measure performance correctly, it is important to take the write cache into account: if the file being written is smaller than memory, then dd will just measure memory-to-memory performance. The rule of thumb is to always write a file that is at least twice the size of memory. 33554432 blocks of 4k corresponds to a 128 GB file (128*1024*1024/4); a sketch for sizing the file automatically is shown below.
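Rather than hard-coding the count, the "twice the memory" rule can be applied automatically. A minimal sketch, assuming the same 4k block size and destination path used above:

# Size the test file to twice the installed RAM
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)   # total memory in KB
COUNT=$(( MEM_KB * 2 / 4 ))                           # number of 4 KB blocks
dd if=/dev/zero of=/storage/data1/file1 bs=4k count=$COUNT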
Simulating a sequential read of the file is done by:
$ dd if=/storage/data1/file1 of=/dev/null bs=4k
This command will read the entire file sequentially, using 4k blocks, and will write the result to /dev/null, which has memory-to-memory performance.
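Putting the write test, the cache flush, and the read test together, a complete single-stream run looks like the sketch below (run as root so that the cache flush works; the file is the 128 GB file created above):

dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432    # write test
sync; echo 3 > /proc/sys/vm/drop_caches                         # drop the page cache
dd if=/storage/data1/file1 of=/dev/null bs=4k                   # read test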
Typical output of a write test is:
# dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
33554432+0 records in
33554432+0 records out
137438953472 bytes (137 GB) copied, 922.659 seconds, 149 MB/s
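As a quick sanity check, the reported rate can be recomputed from the byte count and the elapsed time (dd reports decimal megabytes); for example with bc:

$ echo "scale=2; 137438953472 / 922.659 / 1000000" | bc
148.95

which matches the 149 MB/s in dd's summary line.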
Performance can be measured using the UNIX command "vmstat". While the dd command is running, run "vmstat 1" in a different shell window. The "1" means that a measurement is taken every second, which makes the results easier to read. A typical output is:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd     free   buff    cache   si   so   bi     bo    in    cs us sy  id wa st
 1  1      0 27004232  23644 21547732    0    0    0    249   196    10  0  0 100  0  0
 1  1      0 26994760  23648 21549696    0    0    4 214096 15276    86  0  9  84  7  0
 0  1      0 27000588  23656 21551316    0    0    8 138160 15377    87  0  8  85  7  0
 2  0      0 26999348  23664 21552544    0    0    8 172032 15332   106  0  8  85  7  0
 2  0      0 26996372  23668 21555704    0    0    4 131072 14747   100  0  9  84  6  0
 1  1      0 26994140  23672 21557352    0    0    4 131072 14679   127  0  9  84  7  0
 1  1      0 26992784  23676 21558492    0    0    4 131072 14491    93  0  8  84  8  0
 1  1      0 26991576  23680 21559780    0    0    4 131072 14563   102  0  8  85  7  0
 1  1      0 26990096  23684 21561504    0    0    4 131072 14562   116  0 10  83  7  0
 1  1      0 26988608  23688 21563048    0    0    4 131072 14718   101  0  9  84  7  0
 1  1      0 26987656  23692 21561104    0    0    4 131072 16235  2942  0  9  82  8  0
 0  1      0 26982848  23700 21565788    0    0    8 221184 16508   115  0  9  84  7  0
 2  0      0 26981360  23704 21567344    0    0    4 172032 15299   101  0  8  84  7  0
...
vmstat measures many aspects of the operating system. For I/O benchmarking, the "bo" and "bi" columns are useful: they report the actual rate of I/O from memory to disk (bo, "blocks out") and from disk to memory (bi, "blocks in"). The unit is KB per second, and the results are aggregated across all devices. Looking at the results, it is clear that aside from peaks at around 200 MB/sec, the sustained performance is around 131 MB/sec, which is typical of an un-tuned system.
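To watch only the disk-throughput numbers, the bi and bo fields can be pulled out of vmstat's output with awk. In the default layout they are the 9th and 10th columns, though this may vary between vmstat versions, so adjust as needed:

$ vmstat 1 | awk 'NR > 2 {print $9 " KB/s in, " $10 " KB/s out"}'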
CPU Load
Another interesting measurement to look at is CPU utilization. vmstat only reports aggregate results and is therefore not very useful if only one core is maxed out. The command "top" can be used instead: once it is running, type "1" and the load on each core is shown:
top - 10:53:19 up 5:09,  2 users,  load average: 0.52, 0.69, 0.64
Tasks: 248 total,   1 running, 247 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  9.7%sy,  0.0%ni,  7.7%id, 82.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us, 86.9%sy,  0.0%ni,  0.0%id, 11.4%wa,  0.0%hi,  1.7%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49552856k total, 12171020k used, 37381836k free,    26420k buffers
Swap:        0k total,        0k used,        0k free, 11470292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20128 root      20   0     0    0    0 D 87.8  0.0  13:18.84 flush-8:0
21075 root      20   0 63184  604  512 S  9.9  0.0   0:15.91 dd
21128 root      20   0 12888 1244  828 R  0.3  0.0   0:00.07 top
    1 root      20   0 10364  688  568 S  0.0  0.0   0:01.67 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.02 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.04 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    7 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
    9 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   10 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
...
In this example, we can see that dd and flush are the active processes. "flush" is the kernel thread that takes dirty data from the write cache and writes it to disk (here for the ext4 file system); it is the worker thread that does the actual I/O. Looking at the core usage, we can see that core 0 and core 6 are quite busy, and in fact core 6 is 100% busy. This means that core 6 is the bottleneck (this is the core that is mapped to the RAID controller interrupt and where flush is running). In other words, this dd run is CPU bound.
Another useful tool for monitoring CPU load is mpstat, which is part of the Linux sysstat package. For example, run: mpstat -P ALL 1
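For example, per-core statistics can be captured to a file for the duration of a dd run and examined afterwards (this sketch assumes the sysstat package is installed and reuses the write test from above):

$ mpstat -P ALL 1 > mpstat.log &
$ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
$ kill %1    # stop the background mpstat when the test finishes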
Remember: it is easier to tune the system one knob at a time, run the benchmark, and observe the effects.
Testing a more realistic workflow, i.e. one with several concurrent streams, can easily be done by running concurrent instances of dd on different files. For instance, to simulate 4 write streams, run the following:
$ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432 &
$ dd if=/dev/zero of=/storage/data2/file2 bs=4k count=33554432 &
$ dd if=/dev/zero of=/storage/data3/file3 bs=4k count=33554432 &
$ dd if=/dev/zero of=/storage/data4/file4 bs=4k count=33554432 &
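The same test can be scripted so that all streams start together and the script waits for the slowest one to finish. A minimal sketch, where the number of streams and the directory layout are just the examples used above:

#!/bin/bash
# Launch 4 concurrent dd write streams, one per data directory
for i in 1 2 3 4; do
    dd if=/dev/zero of=/storage/data$i/file$i bs=4k count=33554432 &
done
wait    # block until every stream has completed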
More information on using dd for disk benchmarking can be found in this article.