
Common datasets#

To help researchers save time on downloads, Stanford Research Computing hosts databases and models for commonly used software in $COMMON_DATASETS. This read-only storage space is accessible to all Sherlock users, including those who do not own Oak storage.

Optimizing performance#

For faster run times and optimal performance, you should NOT run jobs against $COMMON_DATASETS directly. Instead, copy the desired dataset to $SCRATCH or $GROUP_SCRATCH, and reference that copy in your jobs.

Syncing between $SCRATCH and $COMMON_DATASETS#

Tools such as rsync or dsync can be used to restore files that may have been deleted from $SCRATCH due to the 90-day purge policy.

The code snippet below shows an example of starting an interactive session on the service partition and using dsync to copy a common dataset called data to $SCRATCH.

```shell
$ sh_dev -c 4 -p service -t 2:00:00
salloc: Granted job allocation 16755526

$ ml system mpifileutils
$ srun dsync $COMMON_DATASETS/data $SCRATCH/data
```