perfSONAR Testing to Cloud Resources
(NOTE: this page is out of date, and needs to be updated!)
The use of commercial cloud resources, from providers such as Amazon and Microsoft, is an attractive alternative for research groups. The cost of purchasing and maintaining private clusters that may go unused for large portions of time is replaced with an on-demand model. Scientific data can be transmitted into cloud resources and processed, and the results can be easily exported. The key to success is the ability to transfer data to and from these resources seamlessly and efficiently; data mobility suffers if the networking and transfer infrastructure cannot handle the demand.
perfSONAR is an infrastructure used to measure network performance on an end-to-end basis, and it was designed to isolate problems that impact network use. For example, packet loss is known to slow down protocols such as TCP (widely used in data movement tools). perfSONAR can detect this packet loss, alert operations staff, and assist in debugging long paths to eliminate performance abnormalities.
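To illustrate why even light packet loss matters, the sketch below uses the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≤ C · MSS / (RTT · sqrt(p)). This model is not part of the testing described on this page; the function name, the default constant C = 1.0, and the example values (MSS, RTT, loss rate) are assumptions chosen purely for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
                          c: float = 1.0) -> float:
    """Approximate upper bound on steady-state TCP throughput, in bits/s.

    Uses the Mathis et al. model: rate <= C * MSS / (RTT * sqrt(p)).
    C is a constant near 1 that depends on the loss pattern and ACK
    strategy; 1.0 is an illustrative simplification.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return c * (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS, 50 ms RTT, 0.01% packet loss.
    rate = mathis_throughput_bps(1460, 0.050, 1e-4)
    print(f"{rate / 1e6:.2f} Mbps")  # about 23.36 Mbps
```

Under these assumed values, a single stream is limited to tens of Mbps regardless of link capacity, which is why detecting and removing loss on long paths is so valuable.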
ESnet and NCBI (at the National Institutes of Health) evaluated network performance between R&E networks and an Amazon cloud instance of the perfSONAR Toolkit software in October 2014. The evaluation team found that performance is variable, but overall meets expectations for activities such as bulk data movement. Technology barriers still exist, such as the use of Network Address Translation (NAT), the use of virtual hardware instead of a bare metal machine for measurement, and the strong dependence of performance on the size of the virtual resources purchased.
Results from Amazon
A perfSONAR dashboard was created to test performance from three key locations on ESnet to an instance running in the Amazon cloud, located in the Amazon datacenter in Ashburn, Virginia.
The three locations on ESnet were:
- Boston, MA
- Boise, ID
- Berkeley, CA
These were chosen for their differing latencies to the Amazon test instance, along with the general availability of measurement slots for additional testing. Each location houses two performance testers: one connected at 1Gbps for latency testing and one connected at 10Gbps for bandwidth testing. A snapshot of the dashboard appears below:
A general set of trends emerged in testing, related to both the tools used and the structure of the tests. BWCTL, invoking iperf3, was used for throughput testing. Each test is a single TCP stream, run every 6 hours, for a total of 20 seconds; the reported result is the average performance over those 20 seconds, which includes the time spent in TCP slow start. As latency increases, this affects the observed results, because the TCP auto-tuning algorithms require more time to stabilize and perform well. Other trends that emerged:
- The Amazon virtual host (purchased as size medium) was restricted in the speed at which it could send and receive.
- The ESnet throughput test hosts were 10Gbps, which resulted in variable performance when testing to the slower Amazon host.
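The interaction between latency and a short test window, described above, can be sketched with a deliberately simple model. It assumes a loss-free path, an initial congestion window of 10 segments that doubles once per round trip, and a sender that runs at the bottleneck rate once the window covers the bandwidth-delay product. The function name, parameters, and example values are hypothetical, and real stacks (auto-tuning, congestion avoidance, pacing) behave less ideally.

```python
def average_throughput_bps(rtt_s: float, bottleneck_bps: float,
                           duration_s: float = 20.0, mss_bytes: int = 1460,
                           init_cwnd_segments: int = 10) -> float:
    """Average rate over a fixed-length test under idealized slow start."""
    if rtt_s <= 0 or bottleneck_bps <= 0 or duration_s <= 0:
        raise ValueError("rtt_s, bottleneck_bps and duration_s must be positive")

    bdp_bits = bottleneck_bps * rtt_s             # window needed to fill the path
    cwnd_bits = init_cwnd_segments * mss_bytes * 8
    elapsed_s = 0.0
    sent_bits = 0.0

    # Slow start: send one window per round trip, then double the window.
    while cwnd_bits < bdp_bits and elapsed_s + rtt_s <= duration_s:
        sent_bits += cwnd_bits
        cwnd_bits *= 2
        elapsed_s += rtt_s

    # Remainder of the test: limited by the window or by the bottleneck.
    rate_bps = min(cwnd_bits / rtt_s, bottleneck_bps)
    sent_bits += (duration_s - elapsed_s) * rate_bps
    return sent_bits / duration_s


if __name__ == "__main__":
    # Hypothetical RTTs against an assumed 1 Gbps bottleneck.
    for rtt_ms in (10, 40, 80):
        avg = average_throughput_bps(rtt_ms / 1000, 1e9)
        print(f"RTT {rtt_ms:3d} ms: {avg / 1e6:.1f} Mbps average over 20 s")
```

Even in this best case the 20-second average falls as round-trip time grows; on real paths the gap is typically larger, since any loss during ramp-up takes longer to recover from at high latency.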
The test results shown below reflect these observations: when Amazon is sending to ESnet, the results are stable and average between 800Mbps and 1Gbps. Because each test is a single stream, dips in performance caused by congestion are possible. When ESnet sends to Amazon, the results are more variable, and lower, due to the "impedance mismatch." Note that this observation is not a reflection of a network problem: any host sending at 10Gbps to a slower receiver may see the same behavior unless there is sufficient buffering in the path to absorb the higher transfer rate.
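As a rough illustration of the buffering point, the sketch below uses a simple fluid model: a burst arrives at the sender's rate while the bottleneck drains it at the slower rate, so the queue peaks at burst × (1 − out_rate / in_rate). The function name and the 1 MB burst size are hypothetical, and the model ignores pacing, competing traffic, and how real devices share buffer memory.

```python
def peak_queue_bytes(burst_bytes: float, in_bps: float, out_bps: float) -> float:
    """Peak queue depth when a burst arrives faster than it can drain.

    Fluid model: the burst takes burst/in_bps seconds to arrive, during
    which the bottleneck forwards data at out_bps; the rest must queue.
    """
    if in_bps <= 0 or out_bps <= 0:
        raise ValueError("rates must be positive")
    if burst_bytes < 0:
        raise ValueError("burst_bytes must be non-negative")
    if out_bps >= in_bps:
        return 0.0  # no speed mismatch, so no queue builds
    return burst_bytes * (1 - out_bps / in_bps)


if __name__ == "__main__":
    # Hypothetical 1 MB burst from a 10 Gbps sender into a 1 Gbps bottleneck.
    needed = peak_queue_bytes(1_000_000, 10e9, 1e9)
    print(f"{needed / 1000:.0f} kB of buffer needed")  # about 900 kB
```

In this model roughly 90% of any line-rate burst from a 10Gbps host must be buffered ahead of a 1Gbps bottleneck; anything beyond the available buffer is dropped, and TCP responds by slowing down.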
Additional box plots, shown below, make the variability in the tests clearer: the constrained range when Amazon is the sender is due to the choice of virtual machine size (medium, in this case), which limits inbound and outbound network performance. When ESnet is sending to Amazon, the range is wider due to the impedance mismatch.
The tests show conclusively that the network infrastructure between ESnet and Amazon (fostered through a peering relationship between the carriers) is capable of high levels of performance as observed by the perfSONAR measurement tools. Researchers using these cloud resources over a direct peering can hold moderate expectations for data mobility if:
- Their end systems are tuned, and located on fast campus networks free of packet loss (e.g. as recommended in the Science DMZ architecture)
- They purchase adequately sized machines from the cloud provider. In this testing, the medium size functioned well, but larger sizes are expected to perform better.
Whether virtualized perfSONAR measurements are as trustworthy as those taken on dedicated server hardware, the typical requirement, was not determined during this testing. Cloud environments are typically entirely virtual, which makes provisioning a dedicated server for testing problematic. The results gathered suggest that, from a networking perspective, high throughput to virtual hardware is possible.