A note on using scp, sftp, and rsync
In a Unix environment, scp, sftp, and rsync are commonly used to copy
data between hosts. While these tools work fine in a local environment,
they perform poorly on a WAN. The OpenSSH versions of scp and sftp have a
built-in 1 MB buffer (previously only 64 KB in OpenSSH versions older than
4.7) that severely limits performance on a WAN. Even though
rsync is not part of the OpenSSH distribution, rsync typically uses ssh
as transport and is therefore subject to the limitations imposed by the
underlying ssh implementation. DO NOT USE THESE TOOLS if you need to
transfer large data sets across a network path with an RTT of more than
around 25 ms.
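To see why a fixed buffer hurts on long paths, it helps to estimate the ceiling it imposes. The sketch below uses a simplified model in which a sender can have at most one buffer's worth of data in flight per round trip; it ignores packet loss, protocol overhead, and encryption cost, and the function name is only illustrative.

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Rough upper bound on single-stream throughput.

    Assumes at most one window of data can be in flight per round trip.
    """
    return window_bytes * 8 / rtt_seconds / 1e6


RTT = 0.025  # 25 ms, the threshold mentioned above

for label, window in [("64 KB buffer", 64 * 1024), ("1 MB buffer", 1024 * 1024)]:
    limit = max_throughput_mbps(window, RTT)
    print(f"{label}: at most about {limit:.0f} Mbit/s at {RTT * 1000:.0f} ms RTT")
```

Under this model, a 64 KB buffer caps a 25 ms path at roughly 21 Mbit/s and a 1 MB buffer at roughly 336 Mbit/s, and the ceiling drops in proportion as the RTT grows.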
A patch to fix this problem is available from the
Pittsburgh Supercomputing Center.
This patch makes it possible to optimize single-stream performance on a WAN. However, to fully optimize
bulk data transfers over a WAN, we recommend using one of the parallel-stream tools described below. More information on sftp
tuning is given below as well.
Selecting a File Transfer Tool
When selecting a file transfer tool, one of the first things to decide is which security model you require.
The basic options are:
1. anonymous: (e.g.: FTP, HTTP) anyone can access the data
2. simple password: (e.g.: FTP, HTTP) most sites no longer allow this method since the password can be easily captured
3. password encrypted: (e.g.: bbcp, bbftp, GridFTP, FDT) control channel is encrypted, but data is unencrypted
4. everything encrypted: (e.g.: scp, sftp, rsync over ssh, GridFTP, HTTPS-based web server) both control and data channels are encrypted
In general, most open science projects seem to prefer option #3.
If you require option #4 over a WAN, the choice of tools that perform well is limited to GridFTP with X.509 keys, or
possibly HPN-patched ssh tools (e.g. HPN-patched scp/sftp or rsync over HPN-patched ssh).
The other issue is whether or not you have the requirement and/or ability to set up a server. HTTP- and FTP-based
tools require a system administrator to install a web or FTP server. Other tools, such
as bbcp and GridFTP, only require an sshd server to be installed by the administrator, and everything else
can be installed as a normal user.
In order to obtain maximum throughput over a high-speed WAN, one needs to use a file transfer
tool that includes the following features:
- Parallel data streams
- Ability to set the TCP buffer size
  Luckily this is becoming less important, as Linux, FreeBSD, OS X, and Windows Vista
  now all include "TCP buffer autotuning". Other OSes will likely follow.
  Note that it is still important to increase the maximum TCP buffer size
  even on a system that does TCP autotuning.
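As a rough illustration of why the system-wide maximum still matters, the following sketch asks the operating system for receive buffers of several sizes and prints what it reports back. It uses only the standard socket API; the helper name and the requested sizes are arbitrary choices for this example, and the behavior on an oversized request differs between operating systems (some silently cap it, others reject it).

```python
import socket
from typing import Optional


def granted_buffer(requested_bytes: int) -> Optional[int]:
    """Request a TCP receive buffer and return the size the OS reports.

    Returns None if the OS rejects the request outright.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        except OSError:
            return None
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


for requested in (64 * 1024, 4 * 1024 * 1024, 64 * 1024 * 1024):
    print(f"requested {requested:>9} bytes, OS reports {granted_buffer(requested)}")
```

On Linux the reported value is typically twice the request (the kernel reserves space for bookkeeping) until the request exceeds the configured maximum, after which larger requests no longer yield larger buffers. A tool's buffer-size option is subject to the same limit.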
There are a large number of
file transfer programs available, but unfortunately
almost none of them provide both of these features.
The following are some commonly used tools that do provide these features.
GridFTP: part of the
Globus Toolkit.
globus-url-copy -p 4 -tcp-bs 16M sshftp://data.lbl.gov/home/mydata/myfile file:///home/mydir/myfile
To install GridFTP with ssh support, see our Quick Start Guide.
There is also a
Microsoft .NET version
of the GridFTP client and server available from the University of Virginia, and
a Firefox extension that uses
GridFTP from the University of Delaware.
FDT: Java-based tool from Caltech
java -jar fdt.jar [ OPTIONS ] [[[user@][host1:]]file1 [[[user@][host2:]]file2
bbftp: from the BaBar Project
bbftp -p 4 -e 'setrecvwinsize 1024; setsendwinsize 1024; put myfile' -E '/usr/local/bin/bbftpd -s' remotehost
bbcp: from SLAC
bbcp -P 4 -v -w 2M myfile remotehost:filename
More info on using bbcp is available from Caltech.
nuttscp: from NRL
This is a simple Perl script wrapper that uses ssh and the nuttcp tool to
copy files, and can achieve very high throughput.
nuttscp -v -N 4 -l 256K -f /mydir/myfile remotehost:/data1/mydir/myfile.out
The remaining tools described on this page do not provide a way to set the TCP buffer size. However, if you use
a recent OS with TCP autotuning, or if you
transfer enough files in parallel, or break a single file into enough parallel streams, you can still obtain
very good total throughput using the tools described below. Also, most of the tools below assume the files
to be copied reside on an HTTP or FTP server.
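To make "breaking a single file into parallel streams" concrete, here is a minimal sketch of dividing a file of known size into contiguous byte ranges, each of which could then be fetched over its own connection, for instance with an HTTP Range request. This is not taken from any of the tools on this page; the function names are hypothetical, and a real downloader also has to handle servers that do not support range requests, retries, and reassembly.

```python
from typing import List, Tuple


def split_ranges(file_size: int, streams: int) -> List[Tuple[int, int]]:
    """Divide a file into up to `streams` contiguous, inclusive byte ranges."""
    if file_size <= 0 or streams <= 0:
        raise ValueError("file_size and streams must be positive")
    streams = min(streams, file_size)  # never create empty ranges
    base, extra = divmod(file_size, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # The first `extra` ranges carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


for start, end in split_ranges(1_000_000_000, 4):
    print(range_header(start, end))
```

Each range runs as an independent TCP connection, so the per-connection buffer limit applies to each one separately; this is why several streams can fill a path that a single untuned stream cannot.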
Download Managers:
Download managers provide an easy-to-use GUI for downloading multiple files in parallel, for monitoring
download progress, and for pausing/restarting downloads. Most support uploads as well, and some, such
as 'Free Download Manager', even support BitTorrent downloads.
Some recommended download managers include:
Wikipedia maintains a
comparison table of download managers.
Other MS Windows Tools:
- FileZilla: Supports FTP transfer of multiple files in parallel
Other Unix Tools:
- sftp: Secure file transfer program.
As described above, don't even consider using this program for WAN transfers unless you have installed the
HPN patch from PSC.
But even with the patch, sftp has the annoying characteristic of layering yet another flow control mechanism on top of everything else. By default, sftp limits the number of outstanding messages to 16, each 32 KB in size. Since each datagram is a distinct message, you end up with a limit of 512 KB of outstanding data. You can, however, increase both the number of outstanding messages ('-R') and the size of each message ('-B') from the command line.
Sample command for a 128MB window:
sftp -R 512 -B 262144 user@host:/path/to/file outfile
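The arithmetic behind these numbers is simply the number of outstanding requests multiplied by the size of each request. The short sketch below reproduces it for the defaults and for the sample command; the function names are made up for this illustration, and the values passed in are the ones quoted above.

```python
def sftp_window_bytes(num_requests: int, buffer_bytes: int) -> int:
    """Maximum outstanding data: requests (-R) times request size (-B)."""
    return num_requests * buffer_bytes


def requests_for_window(target_bytes: int, buffer_bytes: int) -> int:
    """Smallest -R value that reaches the target window for a given -B."""
    return -(-target_bytes // buffer_bytes)  # ceiling division


default_window = sftp_window_bytes(16, 32 * 1024)
tuned_window = sftp_window_bytes(512, 262144)

print(f"default: {default_window // 1024} KB")
print(f"-R 512 -B 262144: {tuned_window // (1024 * 1024)} MB")
print(f"-R needed for 128 MB with -B 262144: {requests_for_window(128 * 1024 * 1024, 262144)}")
```

The same helper can be used to pick '-R' for a different target window, for example one sized to the bandwidth-delay product of the path in question.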
- lftp: Supports parallel file transfer, socket tuning, HTTP transfers, and more.
Sample command:
lftp -e 'set net:socket-buffer 4000000; pget -n 4 http://site/path/file; quit'
lftp -e 'set net:socket-buffer 4000000; pget -n 4 ftp://site/path/file; quit'
- axel: simple parallel accelerator for HTTP and FTP.
Sample command:
axel -n 4 http://site/file
axel -n 4 ftp://site/file
Other useful command line tools for Unix/OSX include curl
and wget.
Commercial Tools:
- Aspera sells a UDP-based solution
that does a good job utilizing all available bandwidth on congested, high latency network paths
up to 1 Gbps.
- Data Expedition
sells a set of tools based on their "Multipurpose Transaction Protocol" (MTP/IP),
a UDP-based protocol that uses a data-pull model,
and works well on congested links, satellite links, as well as high-speed
networks up to and beyond 1 Gbps.
They also provide an SDK that allows one to integrate MTP-based data transfers
into custom applications.
Special Purpose Tools:
hsi put local_file : hpss_file
hsi get hpss_file
The number of parallel streams you use is determined by which HPSS "class of service" you are using.
To adjust the TCP buffer size, look for SendSpace/RecvSpace in your hpss.conf file. For example:
Network Options = {
    Default = {
        NetMask = 255.255.0.0
        RFC1323 = 1
        SendSpace = 4MB
        RecvSpace = 4MB
        WriteSize = 128KB
        TcpNoDelay = 1
    }
}
Note that in general pftp is faster than hsi/htar. For best performance across a WAN, use the
GridFTP interface to HPSS at sites where it is available.