
I/O Benchmarking

Methodology

The goal of this benchmarking is to simulate, as realistically as possible, the workflow that the DTN is going to support, measure the performance, and tune each subsystem, one by one, until the expected performance is reached, effectively removing all the bottlenecks. When each subsystem is correctly configured, run the application and measure the performance to see if there are any performance conflicts between the subsystems.

I/O

The elbencho storage benchmark, along with the storage_sweep contribution, provides a robust way to understand your storage subsystem performance across a wide range of file distributions. A container image is available to run a full storage sweep test and generate plots of the results for each data set. The container image and example usage may be found here.

Additionally, there are several other I/O benchmarking tools such as iometer and ioperf. While those tools are excellent and can simulate complex I/O workflows, they often require an infrastructure to run. Since many DTN workflows are quite simple (parallel sequential reads and writes), it is often easier and faster to use simple tools, such as the UNIX commands "dd" to generate I/O operations and "vmstat" to measure performance. Note that if you run two tests back-to-back, the file from the first test will be cached. To flush the cache, do this (as root):

sync; echo 3 > /proc/sys/vm/drop_caches

Simulating a single thread sequentially writing a file can be done using dd as follows:

$ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432

if=/dev/zero: "if" stands for "input file". The reason for using /dev/zero as the source of data is that this pseudo-file is read at memory-to-memory speed; in other words, it will not interfere with the measurement of the write operation on the file system.

of=/storage/data1/file1: "of" stands for "output file". The destination file is on the performance file system.

bs=4k: "bs" stands for "block size". Since normal POSIX read/write on the file system uses 4k blocks by default, dd needs to use the same block size in order to correctly model the application. Note that dd will buffer an amount of data equal to the block size argument before writing it out to disk.

count=33554432: "count" is the number of blocks that will be written. In order to measure the performance correctly, it is important to take the write cache into account: if the file that is written is smaller than the amount of memory, dd will just measure memory-to-memory performance. The rule of thumb is to always use files that are at least twice the size of memory. 33554432 blocks of 4k corresponds to a 128GB file (128*1024*1024/4).
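If the amount of memory on the DTN is not known in advance, the block count can be computed from /proc/meminfo. Below is a minimal sketch of that calculation; the 4k block size and the "twice the memory" rule of thumb come from the explanation above, and the output path is just the example path used here.

# Size the test file to at least twice the amount of RAM, using 4k blocks.
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)   # total memory in KB
COUNT=$(( MEM_KB * 2 / 4 ))                           # number of 4k blocks for a file of 2x RAM
dd if=/dev/zero of=/storage/data1/file1 bs=4k count=$COUNT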

Sequentially reading a file can be simulated by:

$ dd if=/storage/data1/file1 of=/dev/null bs=4k

This command will read the entire file sequentially, using 4k blocks, and will send the result to /dev/null, which has memory-to-memory performance.
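Putting the write test, the cache flush, and the read test together, a typical sequence looks like the following sketch (run as root, with the same example path and size as above):

# Write a 128GB test file.
dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432

# Drop the page cache so the read test is served from disk, not from memory.
sync; echo 3 > /proc/sys/vm/drop_caches

# Sequentially read the same file back.
dd if=/storage/data1/file1 of=/dev/null bs=4k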

A typical result output of a write is:

# dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
33554432+0 records in
33554432+0 records out
137438953472 bytes (137 GB) copied, 922.659 seconds, 149 MB/s

Performance can be measured using the UNIX command "vmstat". While the dd command is running, in a different shell window, run "vmstat 1". The argument 1 means that a measurement is taken every second, which makes the results easier to read. A typical output is:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 27004232  23644 21547732    0    0     0   249  196   10  0  0 100  0  0
 1  1      0 26994760  23648 21549696    0    0     4 214096 15276   86  0  9 84  7  0
 0  1      0 27000588  23656 21551316    0    0     8 138160 15377   87  0  8 85  7  0
 2  0      0 26999348  23664 21552544    0    0     8 172032 15332  106  0  8 85  7  0
 2  0      0 26996372  23668 21555704    0    0     4 131072 14747  100  0  9 84  6  0
 1  1      0 26994140  23672 21557352    0    0     4 131072 14679  127  0  9 84  7  0
 1  1      0 26992784  23676 21558492    0    0     4 131072 14491   93  0  8 84  8  0
 1  1      0 26991576  23680 21559780    0    0     4 131072 14563  102  0  8 85  7  0
 1  1      0 26990096  23684 21561504    0    0     4 131072 14562  116  0 10 83  7  0
 1  1      0 26988608  23688 21563048    0    0     4 131072 14718  101  0  9 84  7  0
 1  1      0 26987656  23692 21561104    0    0     4 131072 16235 2942  0  9 82  8  0
 0  1      0 26982848  23700 21565788    0    0     8 221184 16508  115  0  9 84  7  0
 2  0      0 26981360  23704 21567344    0    0     4 172032 15299  101  0  8 84  7  0
  ...

vmstat measures many aspects of the operating system. For I/O benchmarking, the "bo" and "bi" columns are the useful ones: they report the actual rate of I/O from memory to disk (bo, "blocks out") and from disk to memory (bi, "blocks in"). The unit is KB, and in this example the results are aggregated across all devices. Looking at the results, it is clear that aside from peaks at around 200 MB/sec, the sustained performance is around 131 MB/sec, which is typical of an untuned system.
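If only the write rate is of interest, the "bo" column can be extracted and converted to MB/sec on the fly. This is a minimal sketch that assumes the column layout shown above ("bo" is the 10th column):

# Print the "bo" column (KB) as MB/sec, skipping header lines.
vmstat 1 | awk '$10 ~ /^[0-9]+$/ { printf "write rate: %.1f MB/s\n", $10/1024 }'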

CPU Load

Another interesting measurement to look at is the CPU utilization. vmstat only reports aggregate results, and is therefore not very useful for spotting a single core that is maxed out. The command "top" can be used instead. Once top is running, type "1" and each core is shown:

top - 10:53:19 up  5:09,  2 users,  load average: 0.52, 0.69, 0.64
Tasks: 248 total,   1 running, 247 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  9.7%sy,  0.0%ni,  7.7%id, 82.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us, 86.9%sy,  0.0%ni,  0.0%id, 11.4%wa,  0.0%hi,  1.7%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49552856k total, 12171020k used, 37381836k free,    26420k buffers
Swap:        0k total,        0k used,        0k free, 11470292k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                 
20128 root      20   0     0    0    0 D 87.8  0.0  13:18.84 flush-8:0                                                                                                
21075 root      20   0 63184  604  512 S  9.9  0.0   0:15.91 dd                                                                                                       
21128 root      20   0 12888 1244  828 R  0.3  0.0   0:00.07 top                                                                                                      
    1 root      20   0 10364  688  568 S  0.0  0.0   0:01.67 init                                                                                                     
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                                 
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.02 migration/0                                                                                              
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.04 ksoftirqd/0                                                                                              
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                                                                                               
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1                                                                                              
    7 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/1                                                                                              
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/1                                                                                               
    9 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2                                                                                              
   10 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2                                                                                              
   11 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2                                                                                               
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3                                                                                              
   ... 

In this example, we can see that dd and flush are the active processes. "flush" is a kernel writeback thread that reads from the write cache and writes the data onto the disk; it is the worker thread that does the actual I/O. Looking at the core usage, we can see that core 0 and core 6 are quite busy, and in fact, core 6 is 100% busy. This means that core 6 is the bottleneck (this is the core that is mapped to the RAID controller interrupt and where flush is running). In other words, this dd is CPU bound.
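To confirm which core is servicing the storage controller interrupts, /proc/interrupts can be watched while the test runs. The sketch below uses "megaraid" purely as an example pattern; substitute the name of your RAID or HBA driver:

# Watch per-core interrupt counts for the storage controller (driver name is an example).
watch -n 1 'grep -i -e CPU -e megaraid /proc/interrupts'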

Another useful tool to monitor CPU load is mpstat, which is part of the Linux sysstat package. For example, run: mpstat -P ALL 1
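For example, a simple way to capture per-core statistics for the duration of a dd run (same example path and size as above) is:

# Record per-core CPU usage once per second while the write test runs.
mpstat -P ALL 1 > mpstat.log &
MPSTAT_PID=$!
dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
kill $MPSTAT_PID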

Remember: it is easier to tune the system one knob at a time, run the benchmark, and observe the effects.

Testing a more realistic workflow, i.e. one with several concurrent streams, can easily be done by running concurrent instances of dd on different files. For instance, to simulate 4 write streams, run the following:

    $ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432 &
    $ dd if=/dev/zero of=/storage/data2/file2 bs=4k count=33554432 &
    $ dd if=/dev/zero of=/storage/data3/file3 bs=4k count=33554432 &
    $ dd if=/dev/zero of=/storage/data4/file4 bs=4k count=33554432 &
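The same test can be scripted for an arbitrary number of streams. The sketch below assumes the target directories /storage/data1 through /storage/dataN already exist:

# Launch N concurrent sequential write streams and wait for all of them to finish.
N=4
for i in $(seq 1 $N); do
    dd if=/dev/zero of=/storage/data$i/file$i bs=4k count=33554432 &
done
wait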

More information on using dd for disk benchmarking can be found in this article.