Data transfer
Transfer protocols#
A number of methods allow transferring data in/out of Sherlock. For most cases, we recommend using SSH-based file transfer commands, such as scp, sftp, or rsync. They will provide the best performance for data transfers to and from campus.
For large transfers, using DTNs is recommended
Most casual data transfers can be done through the login nodes, by pointing your transfer tool to login.sherlock.stanford.edu. But because of resource limits on the login nodes, larger transfers may not work as expected.
For transferring large amounts of data, Sherlock features a specific Data Transfer Node, with dedicated bandwidth, as well as a managed Globus endpoint, that can be used for scheduled, unattended data transfers.
We also provide tools on Sherlock to transfer data to various Cloud providers, such as AWS, Google Drive, Dropbox, Box, etc.
Prerequisites#
Most of the commands detailed below require a terminal and an SSH client[1] on your local machine to launch commands.
You'll need to start a terminal and type the given example commands at the prompt, omitting the initial $ character (it just indicates a command prompt, and should not be typed in).
Host keys#
Upon your very first connection to Sherlock, you will be greeted by a warning such as:
The authenticity of host 'login.sherlock.stanford.edu' can't be established.
ECDSA key fingerprint is SHA256:eB0bODKdaCWtPgv0pYozsdC5ckfcBFVOxeMwrNKdkmg.
Are you sure you want to continue connecting (yes/no)?
The same warning will be displayed if you try to connect to one of the Data Transfer Nodes (DTNs):
The authenticity of host 'dtn.sherlock.stanford.edu' can't be established.
ECDSA key fingerprint is SHA256:eB0bODKdaCWtPgv0pYozsdC5ckfcBFVOxeMwrNKdkmg.
Are you sure you want to continue connecting (yes/no)?
This warning is normal: your SSH client warns you that it is the first time it sees that new computer. To make sure you are actually connecting to the right machine, you should compare the ECDSA key fingerprint shown in the message with one of the fingerprints below:
Key type | Key fingerprint (SHA256) | Key fingerprint (legacy format) |
---|---|---|
RSA | SHA256:T1q1Tbq8k5XBD5PIxvlCfTxNMi1ORWwKNRPeZPXUfJA | f5:8f:01:46:d1:f9:66:5d:33:58:b4:82:d8:4a:34:41 |
ECDSA | SHA256:eB0bODKdaCWtPgv0pYozsdC5ckfcBFVOxeMwrNKdkmg | 70:4c:76:ea:ae:b2:0f:81:4b:9c:c6:5a:52:4c:7f:64 |
If they match, you can proceed and type ‘yes’. Your SSH program will then store that key and will verify it for every subsequent SSH connection, to make sure that the server you're connecting to is indeed Sherlock.
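If you'd rather not compare fingerprints by eye, the comparison can be scripted. The sketch below is only an illustration: check_fingerprint is a hypothetical helper name, and it assumes an OpenSSH version recent enough for ssh-keygen -lf - to read keys from standard input. It performs the same comparison against the published fingerprint as the manual check, nothing more.

```shell
# Expected ECDSA fingerprint, copied from the table above.
EXPECTED_FP="SHA256:eB0bODKdaCWtPgv0pYozsdC5ckfcBFVOxeMwrNKdkmg"

# check_fingerprint EXPECTED (hypothetical helper, not part of OpenSSH):
# reads `ssh-keygen -l` output on stdin, where the fingerprint is the second
# field of each line, and succeeds only if one line matches EXPECTED.
check_fingerprint() {
    awk -v want="$1" '$2 == want { found = 1 } END { exit !found }'
}

# Usage on your local machine (requires network access):
#   ssh-keyscan -t ecdsa login.sherlock.stanford.edu 2>/dev/null \
#       | ssh-keygen -lf - \
#       | check_fingerprint "$EXPECTED_FP" && echo "fingerprint matches"
```

If the pipeline prints nothing, treat it as a mismatch and do not answer yes at the SSH prompt.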
Host keys warning#
If you've connected to Sherlock 1.0 before, there's a good chance the Sherlock 1.0 keys were stored by your local SSH client. In that case, when connecting to Sherlock 2.0 using the sherlock.stanford.edu alias, you will be presented with the following message:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The RSA host key for sherlock.stanford.edu has changed, and the key for
the corresponding IP address 171.66.97.101 is unknown. This could
either mean that DNS SPOOFING is happening or the IP address for the
host and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle
attack)! It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:T1q1Tbq8k5XBD5PIxvlCfTxNMi1ORWwKNRPeZPXUfJA.
Please contact your system administrator.
You can just check that the SHA256 key listed in that warning message correctly matches the one listed in the table above, and if that's the case, you can safely remove the sherlock.stanford.edu entry from your ~/.ssh/known_hosts file with the following command on your local machine:
$ ssh-keygen -R sherlock.stanford.edu
and then connect again. You'll see the first-connection prompt mentioned above, and your SSH client will store the new keys for future connections.
SSH-based protocols#
User name
In all the examples below, you'll need to replace <sunetid> with your actual SUNet ID. If you happen to use the same login name on your local machine, you can omit it.
SCP (Secure Copy)#
The easiest command to use to transfer files to/from Sherlock is scp. It works like the cp command, except it can work over the network to copy files from one computer to another, using the secure SSH protocol.
The general syntax to copy a file to a remote server is:
$ scp <source_file_path> <username>@<remote_host>:<destination_path>
For instance, the following command will copy the file named foo from your local machine to your home directory on Sherlock:
$ scp foo <sunetid>@login.sherlock.stanford.edu:
Note the : character, which separates the hostname from the destination path. Here, the destination path is empty, which will instruct scp to copy the file to your home directory.
You can copy foo under a different name, or to another directory, with the following commands:
$ scp foo <sunetid>@login.sherlock.stanford.edu:bar
$ scp foo <sunetid>@login.sherlock.stanford.edu:~/subdir/baz
To copy files back from Sherlock to your local machine, you just need to reverse the order of the arguments:
$ scp <sunetid>@login.sherlock.stanford.edu:foo local_foo
And finally, scp also supports recursive copying of directories, with the -r option:
$ scp -r dir/ <sunetid>@login.sherlock.stanford.edu:dir/
This will copy the dir/ directory and all of its contents to your home directory on Sherlock.
SFTP (Secure File Transfer Protocol)#
SFTP clients are interactive file transfer programs, similar to FTP, which perform all operations over an encrypted transport.
A variety of graphical SFTP clients are available for different OSes:
When setting up your connection to Sherlock in the above programs, use the following information:
Hostname: login.sherlock.stanford.edu
Port: 22
Username: SUNet ID
Password: SUNet ID password
OpenSSH also provides a command-line SFTP client, simply named sftp.
To log in to Sherlock:
$ sftp <sunetid>@login.sherlock.stanford.edu
Connected to login.sherlock.stanford.edu.
sftp>
For more details and examples about using the command-line SFTP client, you can refer to this tutorial.
rsync#
If you have complex hierarchies of files to transfer, or if you need to synchronize a set of files and directories between your local machine and Sherlock, rsync will be the best tool for the job. It will efficiently transfer and synchronize files across systems, by checking the timestamp and size of files. This means that it won't re-transfer files that have not changed since the last transfer, and will complete faster.
For instance, to transfer the whole ~/data/ folder tree from your local machine to your home directory on Sherlock, you can use the following command:
$ rsync -a ~/data/ <sunetid>@login.sherlock.stanford.edu:data/
Note the slash (/) at the end of the directory names, which is important to instruct rsync to synchronize the whole directories.
To get more information about the transfer rate and follow its progress, you can use additional options:
$ rsync -avP ~/data/ <sunetid>@login.sherlock.stanford.edu:data/
sending incremental file list
./
file1
1,755,049 100% 2.01MB/s 0:00:00 (xfr#2, to-chk=226/240)
file2
2,543,699 100% 2.48MB/s 0:00:00 (xfr#3, to-chk=225/240)
file3
34,930,688 19% 72.62MB/s 0:00:08
[...]
For more details and examples about using rsync, you can refer to this tutorial.
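The effect of the trailing slash is easiest to see with a purely local experiment. The following sketch assumes rsync is installed on your machine; rsync_slash_demo is a hypothetical helper that only touches a throwaway temporary directory and never connects to Sherlock.

```shell
# rsync_slash_demo (hypothetical helper): copy the same source directory
# twice, with and without a trailing slash, then print the path of the
# temporary directory that holds the results.
rsync_slash_demo() {
    demo=$(mktemp -d) || return 1
    mkdir -p "$demo/data" "$demo/with_slash" "$demo/without_slash"
    echo "hello" > "$demo/data/file1"

    # Trailing slash on the source: the *contents* of data/ are copied.
    rsync -a "$demo/data/" "$demo/with_slash/" || return 1

    # No trailing slash: the data directory itself is copied.
    rsync -a "$demo/data" "$demo/without_slash/" || return 1

    echo "$demo"
}

# Usage:
#   d=$(rsync_slash_demo)
#   ls "$d/with_slash"      # file1
#   ls "$d/without_slash"   # data
```

The same rule applies to remote destinations: with the trailing slash on ~/data/, files land directly under data/ on Sherlock rather than under data/data/.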
SSHFS#
Sometimes, moving files in and out of the cluster, and maintaining two copies of each of the files you work on, both on your local machine and on Sherlock, may be painful. Fortunately, Sherlock offers the ability to mount any of its filesystems to your local machine, using a secure and encrypted connection.
With SSHFS, a FUSE-based filesystem implementation used to mount remote SSH-accessible filesystems, you can access your files on Sherlock as if they were locally stored on your own computer.
This comes particularly handy when you need to access those files from an application that is not available on Sherlock, but that you already use or can install on your local machine. Like a data processing program that you have licensed for your own computer but can't use on Sherlock, a specific text editor that only runs on macOS, or any data-intensive 3D rendering software that wouldn't work comfortably enough over a forwarded X11 connection.
SSHFS is available for Linux, macOS, and Windows.
SSHFS on macOS
SSHFS on macOS is known to try to automatically reconnect filesystem mounts after resuming from sleep or suspend, even without any valid credentials. As a result, it will generate a lot of failed connection attempts and will likely get your IP address blacklisted on login nodes.
Make sure to unmount your SSHFS drives before putting your macOS system to sleep to avoid this situation.
The following option could also be useful to avoid some permission issues:
-o defer_permissions
For instance, on a Linux machine with SSHFS installed, you could mount your Sherlock home directory via a Sherlock DTN with the following commands:
$ mkdir ~/sherlock_home
$ sshfs <sunetid>@dtn.sherlock.stanford.edu:./ ~/sherlock_home
Using DTNs for data transfer
Using the Sherlock DTNs instead of login nodes will ensure optimal performance for data transfers. Login nodes only have limited resources, which could limit data transfer rates or cause disconnections during long data transfers.
And to unmount it:
$ umount ~/sherlock_home
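On some Linux systems, a FUSE filesystem mounted by a regular user is unmounted with fusermount -u rather than umount. The helper below is a hedged sketch: unmount_sherlock is a hypothetical name, and whether plain umount works without root privileges depends on your distribution.

```shell
# unmount_sherlock MOUNTPOINT (hypothetical helper): prefer fusermount,
# which ships with FUSE on Linux, and fall back to umount elsewhere
# (for example on macOS).
unmount_sherlock() {
    if command -v fusermount >/dev/null 2>&1; then
        fusermount -u "$1"
    else
        umount "$1"
    fi
}

# Usage:
#   unmount_sherlock ~/sherlock_home
```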
On Windows, once SSHFS is installed, you can mount the $SCRATCH filesystem as a network drive through the Windows File Explorer. To do this, go to "This PC", right-click in the "Network Locations" section of the window and select "Add a Network Drive". Then, in the "Add Network Location Wizard", you would use the following network address:
\\sshfs\<sunetid>@dtn.sherlock.stanford.edu
This will mount the $SCRATCH partition as a network drive on your PC.
For more details and examples about using SSHFS on your local machine, you can refer to this tutorial.
Globus#
Globus improves on SSH-based file transfer protocols by providing the following features:
- automates large data transfers,
- handles transient errors, and can resume failed transfers,
- simplifies the implementation of high-performance transfers between computing centers.
Globus is a Software as a Service (SaaS) system that provides end-users with a browser interface to initiate data transfers between endpoints. Globus allows users to "drag and drop" files from one endpoint to another. Endpoints are terminals for data; they can be laptops or supercomputers, and anything in between. The Globus web service negotiates, monitors, and optimizes transfers through firewalls and across network address translation (NAT). Under certain circumstances, with high-performance hardware, transfer rates exceeding 1 GB/s are possible. For more information about Globus, please see the Globus documentation.
Authentication#
To use Globus, you will first need to authenticate at Globus.org. You can either sign up for a Globus account, or use your SUNet ID account for authentication to Globus (which will be required to authenticate to the Sherlock endpoint).
To use your SUNet ID, choose "Stanford University" from the drop-down menu on the login page and follow the instructions from there.
Transfer#
Endpoint name
The Globus endpoint name for Sherlock is SRCC Sherlock.
Oak endpoint
The Sherlock endpoint only provides access to Sherlock-specific file systems ($HOME, $GROUP_HOME, $SCRATCH and $GROUP_SCRATCH). Oak features its own Globus endpoint: SRCC Oak.
You can use Globus to transfer data between your local workstation (e.g., your laptop or desktop) and Sherlock. In this workflow, you configure your local workstation as a Globus endpoint by installing the Globus Connect software.
- Log in to Globus.org
- Use the Manage Endpoints interface to "add Globus Connect Personal" as an endpoint (you'll need to install Globus Connect Personal on your local machine)
- Transfer Files, using your new workstation endpoint for one side of the transfer, and the Sherlock endpoint (SRCC Sherlock) on the other side.
You can also transfer data between two remote endpoints, by choosing another endpoint you have access to instead of your local machine.
CLI and API#
Globus also provides a command-line interface (CLI) and application programming interface (API) as alternatives to its web interface.
For more information about the API, please see the Globus API documentation.
For more information about the CLI, please see the Globus CLI documentation and Globus CLI quick start. Note that the Globus CLI is available through the module system on Sherlock:
$ module load system py-globus-cli
$ globus login
# follow instructions to get set up
Once you've authorized the application, you can use the globus CLI to copy files between endpoints and collections that you have access to. Endpoints and collections are identified by their unique UUID4 identifiers, which are viewable through the Globus web app. The CLI will step you through any additional authorizations required for you to access the endpoints or collections.
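Transfers submitted with globus transfer run asynchronously: the command returns once the task is queued, not when the data has arrived. A script that depends on the copied data therefore has to wait for the task explicitly. The sketch below assumes the --jmespath and --format options and the globus task wait command behave as described in the Globus CLI documentation; submit_and_wait is a hypothetical helper name.

```shell
# submit_and_wait SRC DST (hypothetical helper): submit a recursive Globus
# transfer and block until the task finishes. SRC and DST use the
# "<endpoint-uuid>:<path>" form expected by `globus transfer`.
submit_and_wait() {
    # --jmespath/--format unix reduce the response to the bare task ID.
    task_id=$(globus transfer --recursive \
        --jmespath 'task_id' --format unix "$1" "$2") || return 1
    echo "submitted task $task_id"
    globus task wait "$task_id"
}

# Usage, after `module load system py-globus-cli` and `globus login`:
#   submit_and_wait "<source-uuid>:/path/to/src" "<dest-uuid>:/path/to/dst"
```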
For example, to asynchronously copy files between Sherlock and Oak (assuming that you have already been allocated Oak storage):
$ GLOBUS_SHERLOCK_UUID="6881ae2e-db26-11e5-9772-22000b9da45e"
$ GLOBUS_OAK_UUID="8b3a8b64-d4ab-4551-b37e-ca0092f769a7"
$ globus transfer --recursive \
"$GLOBUS_SHERLOCK_UUID:$SCRATCH/my-interesting-project" \
"$GLOBUS_OAK_UUID:$OAK/my-interesting-project-copy"
Data Transfer Nodes (DTNs)#
No shell
The DTNs don't provide any interactive shell, so connecting via SSH directly won't work. They will only accept scp, sftp, rsync or bbcp connections.
A pool of dedicated Data Transfer Nodes is available on Sherlock, to provide exclusive resources for large-scale data transfers.
The main benefit of using them is that transfer tasks can't be disrupted by other users' interactive tasks or by filesystem access and I/O-related workloads on the login nodes.
By using the Sherlock DTNs, you'll make sure that your data flows will go through a computer whose sole purpose is to move data around.
They support:
- SSH-based protocols (such as the ones described above)
- bbcp
- Globus
To transfer files via the DTNs, simply use dtn.sherlock.stanford.edu as a remote server host name. For instance:
$ scp foo <sunetid>@dtn.sherlock.stanford.edu:~/foo
$HOME on DTNs
One important difference to keep in mind when transferring files through the Sherlock DTNs is that the default destination path for files, unless specified, is the user's $SCRATCH directory, not $HOME.
That means that the following command:
$ scp foo <sunetid>@dtn.sherlock.stanford.edu:
will place the foo file in $SCRATCH/foo, and not in $HOME/foo. You can transfer files to your $HOME directory via the DTNs by specifying the full path as the destination:
$ scp foo <sunetid>@dtn.sherlock.stanford.edu:$HOME/foo
Cloud storage#
If you need to back up some of your Sherlock files to cloud-based storage services, we also provide a set of utilities that can help.
Google Drive#
Google Drive storage for Stanford users
For more information about using Google Drive at Stanford, please see the University IT Google Drive page.
We provide the rclone tool on Sherlock to interact with Google Drive. You'll just need to load the rclone module to be able to use it to move your files from/to Google Drive:
$ module load system rclone
$ rclone --help
This tutorial provides an example of transferring files between Google Drive and Oak storage.
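As a rough illustration of what such a transfer looks like, the sketch below copies a directory to Google Drive. It assumes a remote named gdrive has already been created with rclone config; that name, the helper name backup_to_drive and the paths are placeholders, not Sherlock defaults.

```shell
# backup_to_drive SRC_DIR DEST_PATH [extra rclone flags...]
# (hypothetical helper) Copy SRC_DIR to DEST_PATH on the "gdrive" remote,
# a placeholder remote name that must first be set up with `rclone config`.
backup_to_drive() {
    src=$1
    dest=$2
    shift 2
    rclone copy "$src" "gdrive:$dest" "$@"
}

# Usage, after `module load system rclone`:
#   backup_to_drive "$SCRATCH/results" backups/results --dry-run    # preview only
#   backup_to_drive "$SCRATCH/results" backups/results --progress   # actual copy
```

Running with --dry-run first shows what rclone would transfer without writing anything to the remote.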
The Globus CLI (see above) can also be used to copy files from Sherlock to Stanford's Google Drive.
AWS#
You can also access AWS storage from the Sherlock command line with the AWS Command Line Interface:
$ module load system aws-cli
$ aws help
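For example, a directory can be mirrored to an S3 bucket with aws s3 sync. This is a sketch under assumptions: the bucket name my-lab-bucket and the helper sync_to_s3 are hypothetical, and your AWS credentials are assumed to be configured already (for instance with aws configure).

```shell
# sync_to_s3 SRC_DIR BUCKET_AND_PREFIX [extra flags...]
# (hypothetical helper) Mirror SRC_DIR to s3://BUCKET_AND_PREFIX.
sync_to_s3() {
    src=$1
    dest=$2
    shift 2
    aws s3 sync "$src" "s3://$dest" "$@"
}

# Usage, after `module load system aws-cli`:
#   sync_to_s3 "$SCRATCH/results" my-lab-bucket/results --dryrun   # preview only
#   sync_to_s3 "$SCRATCH/results" my-lab-bucket/results            # actual sync
```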
Other services#
If you need to access other cloud storage services, you can use rclone: it can be used to sync files and directories to and from Google Drive, Amazon S3, Box, Dropbox, Google Cloud Storage, Amazon Drive, Microsoft OneDrive and many more.
$ module load system rclone
$ rclone -h
For more details about how to use rclone, please see the official documentation.
[1] For more details, see the SSH clients page.
[2] Fetch is a commercial program, and is available as part of the Essential Stanford Software bundle.