Guide to Bulk Data Transfer over a WAN

File Transfer Tools Summary

When selecting a file transfer tool, one of the first things to decide is which security model you require. The basic options are:

  1. anonymous: (e.g.: FTP, HTTP) anyone can access the data
  2. simple password: (e.g.: FTP, HTTP) most sites no longer allow this method since the password can be easily captured
  3. password encrypted: (e.g.: bbcp, bbftp, GridFTP) control channel is encrypted, but data is unencrypted
  4. everything encrypted: (e.g.: scp, sftp, GridFTP, HTTPS-based web server) both control and data channels are encrypted

In general, most open science projects seem to prefer option #3. If you require option #4, the only tool listed here that performs well over a WAN is GridFTP with X.509 certificates. scp and sftp perform very badly over a WAN because of internal buffer limits.

The other issue is whether you need, and are able, to set up a server. HTTP and FTP based tools require a system administrator to install a web or FTP server. Other tools such as bbcp and GridFTP only require an sshd server to be installed by the administrator; everything else can be installed as a normal user.

To obtain maximum throughput over a high-speed WAN, one needs to use a file transfer tool that includes the following features:

  • Parallel data streams
  • Ability to set the TCP buffer size
    • Luckily this is becoming less important, as Linux and Windows Vista now include "TCP buffer autotuning". Other OSes will likely follow. Note that it is still important to increase the maximum TCP buffer size even on a system that does TCP autotuning.
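
To see why the buffer size matters, it helps to compute the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full. The sketch below is illustrative only. The function names and the 1 Gbit/s, 50 ms path are assumptions chosen for the example, and the even split across streams is an approximation.

```python
def bdp_bytes(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bits_per_sec * rtt_seconds / 8


def per_stream_buffer_bytes(bandwidth_bits_per_sec: float,
                            rtt_seconds: float,
                            streams: int) -> float:
    """Approximate buffer each stream needs if the load is split evenly."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    return bdp_bytes(bandwidth_bits_per_sec, rtt_seconds) / streams


if __name__ == "__main__":
    # Hypothetical path: 1 Gbit/s with a 50 ms round-trip time.
    total = bdp_bytes(1e9, 0.050)
    each = per_stream_buffer_bytes(1e9, 0.050, 4)
    print(f"single stream needs about {total / 1e6:.2f} MB")
    print(f"each of 4 streams needs about {each / 1e6:.2f} MB")
```

On this hypothetical path a single stream needs roughly 6.25 MB of buffer, which is why both parallel streams and a large maximum buffer size help.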

There are a large number of file transfer programs available, but unfortunately almost none of them provide both of these features.

The following are some commonly used tools that do provide these features.

GridFTP: part of the Globus Toolkit.

    Sample command:

    globus-url-copy -p 4 -tcp-bs 16M sshftp://data.lbl.gov/home/mydata/myfile file:///home/mydir/myfile
        

    To install GridFTP with ssh support, see our Quick Start Guide. There is also a Microsoft .NET version of the GridFTP client and server available from the University of Virginia.

bbftp: from the BaBar Project

    Sample command:

    bbftp -p 4 -e 'setrecvwinsize 1024; setsendwinsize 1024; put myfile' -E '/usr/local/bin/bbftpd -s' remotehost
        

bbcp: from SLAC

    Sample command:

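    bbcp -P 4 -v -s 4 -w 2M myfile remotehost:filename

    Here -s sets the number of parallel streams, -w the TCP window size, and -P the interval in seconds between progress reports. More info on using bbcp is available from Caltech.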


Most of the remaining tools described on this page do not provide a way to set the TCP buffer size. However, if you transfer enough files in parallel, or break a single file into enough parallel streams, you can still obtain good total throughput with them. Also, most of the tools below assume the files to be copied reside on an HTTP or FTP server.
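
Breaking a single file into parallel streams usually means giving each stream its own byte range. The sketch below shows one way to compute such ranges and the matching HTTP Range header values. It illustrates the general idea only and is not how any tool on this page is implemented; the function names are hypothetical.

```python
from typing import List, Tuple


def split_into_ranges(file_size: int, streams: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte offsets, one pair per stream."""
    if file_size < 1 or streams < 1:
        raise ValueError("file_size and streams must be positive")
    streams = min(streams, file_size)  # never create an empty range
    base, extra = divmod(file_size, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # The first `extra` streams each carry one additional byte.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start: int, end: int) -> str:
    """Format an HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Hypothetical 1 GiB file fetched over 4 streams.
    for start, end in split_into_ranges(1024 ** 3, 4):
        print(range_header(start, end))
```

Each range can then be fetched over its own connection and written at the matching offset in the output file.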

Download Managers:

Other MS Windows Tools:

  • FileZilla: Supports FTP transfer of multiple files in parallel

Other Unix Tools:

  • lftp: Supports parallel file transfer, socket tuning, HTTP transfers, and more.
    Sample commands:

    lftp -e 'set net:socket-buffer 4000000; pget -n 4 http://site/path/file; quit'
    lftp -e 'set net:socket-buffer 4000000; pget -n 4 ftp://site/path/file; quit'
    
  • axel: simple parallel accelerator for HTTP and FTP.
    Sample commands:

    axel -n 4 http://site/file
    axel -n 4 ftp://site/file
    

    Other useful command line tools for Unix/OSX include curl and wget.

Special Purpose Tools:

  • hsi/htar, pftp: HPSS client tools.

    Sample commands:

    hsi put local_file : hpss_file
    hsi get hpss_file
    
    The number of parallel streams you use is determined by which HPSS "class of service" you are using. To adjust the TCP buffer size, look for SendSpace/RecvSpace in your hpss.conf file. For example:
    Network Options = {
       Default = {
          NetMask = 255.255.0.0
          RFC1323 = 1
          SendSpace = 4MB
          RecvSpace = 4MB
          WriteSize = 128KB
          TcpNoDelay = 1
       }
    }
    

    Note that in general pftp is faster than hsi/htar. For best performance across a WAN, use the GridFTP interface to HPSS at sites where it is available.

The following are some commonly used tools that should be avoided:

  • scp / sftp: Do not use these tools on a WAN! They are built on OpenSSH, whose fixed-size internal flow-control buffers (64 KB) severely limit performance on a WAN. A patch to fix this problem (HPN-SSH) is available from the Pittsburgh Supercomputing Center.
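
The effect of a 64 KB buffer can be estimated from the fact that a sender can have at most one window of data in flight per round trip, so throughput is bounded by window size divided by round-trip time. The sketch below works through that arithmetic. The function name and the sample round-trip times are assumptions for illustration.

```python
def max_throughput_bits_per_sec(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    window = 64 * 1024  # 64 KB internal buffer
    # Hypothetical round-trip times: a LAN, a regional path, a long WAN path.
    for rtt_ms in (1, 50, 150):
        limit = max_throughput_bits_per_sec(window, rtt_ms / 1000)
        print(f"RTT {rtt_ms:>3} ms: at most {limit / 1e6:.1f} Mbit/s")
```

With a 50 ms round trip the bound is about 10.5 Mbit/s regardless of link speed, while on a 1 ms LAN the same buffer allows over 500 Mbit/s, which is why the problem only shows up over a WAN.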

© 2007, ESnet