Historically, wide-area bulk data transfer has been plagued by poor performance for a variety of reasons. These include improper configuration of the sending and receiving hosts, software design issues, firewalls, and other factors. In most cases, however, large data sets can be moved long distances using today's networks with minimal effort.
A common technique to speed up file transfers is to break the file into smaller pieces that are transferred in parallel. A number of tools include an option for parallel transfers. If you have a large number of files to copy, you can also get parallelism by copying several files at once (4-5 is typically a good number to try). In general, though, it is more efficient to copy a few large files than many small ones, so bundling multiple small files into a single larger file using tar or zip is also recommended.
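The two suggestions above (copying several files at once, and bundling small files) can be sketched in Python as follows. This is only an illustration under stated assumptions: a local copy with shutil stands in for the actual network transfer, and the function names bundle_files and copy_in_parallel are hypothetical, not part of any tool mentioned here.

```python
import shutil
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def bundle_files(paths, archive_path):
    """Pack many small files into one uncompressed tar archive."""
    with tarfile.open(archive_path, "w") as tar:
        for path in paths:
            # Store each file under its base name only; files that share
            # a base name would collide, so a real script should keep
            # relative paths instead.
            tar.add(path, arcname=Path(path).name)
    return Path(archive_path)


def copy_in_parallel(paths, dest_dir, workers=4):
    """Copy several files at once (4 workers, per the 4-5 suggested above).

    shutil.copy2 is a local stand-in for a real transfer command.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all copies and re-raises any copy error.
        return list(pool.map(lambda p: Path(shutil.copy2(p, dest)), paths))
```

For many small files, calling bundle_files first and then transferring the single archive is usually the better starting point; copy_in_parallel is more useful when the individual files are already large.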
Selecting a File Transfer Tool
When selecting a file transfer tool, one of the first things to decide is which security model you require. The basic options are:
1. anonymous: (e.g., FTP, HTTP) anyone can access the data
2. simple password: (e.g., FTP, HTTP) most sites no longer allow this method since the password can be easily captured
3. password encrypted: (e.g., bbcp, bbftp, Globus Online/GridFTP, FDT) control channel is encrypted, but data is unencrypted
4. everything encrypted: (e.g., scp, sftp, rsync over ssh, GridFTP, HTTPS-based web server) both control and data channels are encrypted
In general, most open science projects seem to prefer option #3 (password encrypted). If you require option #4 (everything encrypted) over a WAN, the tools that perform well are limited to GridFTP with X.509 keys, or possibly HPN-patched ssh tools (e.g., HPN-patched scp/sftp, or rsync over HPN-patched ssh).
The other issue is whether you have the need, and the ability, to set up a server. HTTP- and FTP-based tools require a system administrator to install a web or FTP server. Other tools, such as bbcp and GridFTP, only require that the administrator has installed an sshd server; everything else can be installed as a normal user. Globus Online is the only tool that does not require the user to install any software, as long as the endpoints (source and destination locations) are known within the system.
To obtain maximum throughput over a high-speed WAN, one needs to use a file transfer tool that includes support for parallel data streams. There are a large number of file transfer programs available. Unfortunately, almost none of them provide both of these features.
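To make the idea of parallel data streams concrete, the sketch below splits one file into contiguous byte ranges and copies each range in its own worker thread. It is a local illustration only: a real transfer tool would send each range over its own network connection, whereas here ordinary file reads and writes stand in for the network. The names split_ranges, copy_range, and parallel_copy are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Divide [0, size) into at most `streams` (offset, length) pieces."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    if size == 0:
        return []
    chunk = -(-size // streams)  # ceiling division
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def copy_range(src, dst, offset, length, block=1 << 20):
    """Copy one byte range; each worker opens its own file handles."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        remaining = length
        while remaining > 0:
            data = fin.read(min(block, remaining))
            if not data:
                break
            fout.write(data)
            remaining -= len(data)


def parallel_copy(src, dst, streams=4):
    """Copy src to dst using up to `streams` concurrent range copies."""
    size = os.path.getsize(src)
    # Pre-size the destination so every worker can write at its offset.
    with open(dst, "wb") as fout:
        fout.truncate(size)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_range, src, dst, off, length)
                   for off, length in split_ranges(size, streams)]
        for future in futures:
            future.result()  # re-raise any worker error
```

Because the ranges do not overlap, the workers never write to the same bytes and need no locking. The default of 4 streams is an assumption for illustration, not a tuned value.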
At this time we recommend Globus Online or GridFTP in ssh mode as having the best combination of features and support.