Common datasets

To help researchers save time on downloads, Stanford Research Computing hosts databases and models for commonly used software in $COMMON_DATASETS. This is a read-only storage space that is accessible to all Sherlock users.

Available datasets

The following datasets are currently available in $COMMON_DATASETS:

| Dataset | Path | Description |
| --- | --- | --- |
| AlphaFold 3 | $COMMON_DATASETS/alphafold3 | Genetic sequence and structural template databases for AlphaFold 3. See our AlphaFold documentation for instructions on running it on Sherlock. |
| NCBI BLAST databases | $COMMON_DATASETS/blast | Sequence databases for use with NCBI BLAST and related tools. Copy the databases you need to $SCRATCH, then set BLASTDB=$SCRATCH/blast before running BLAST. |
| Ollama models | $COMMON_DATASETS/ollama | Pre-downloaded LLM models for use with Ollama. Automatically integrated with the ollama module (no manual setup needed). See our Ollama documentation for more details. |

To see the full and up-to-date list of available datasets, run:

$ ls $COMMON_DATASETS

Optimizing performance

For best performance, do NOT run jobs against $COMMON_DATASETS directly. Instead, copy the dataset you need to $SCRATCH or $GROUP_SCRATCH, and point your jobs at that copy.
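As an illustration of this copy-then-reference workflow, the sketch below stages one BLAST database on $SCRATCH and sets BLASTDB, as described in the table above. The function name stage_blast_db is hypothetical, and the sketch assumes that the database files sit directly under $COMMON_DATASETS/blast and share the database name as a prefix; check the actual layout with ls before relying on it.

```shell
# Hypothetical helper (not provided on Sherlock): copy one BLAST database
# to $SCRATCH and point BLAST at the copy.
# Assumption: database files live directly in $COMMON_DATASETS/blast and
# are named <database>.* or <database>*.
stage_blast_db() {
    # $1 is the database name prefix to copy
    mkdir -p "$SCRATCH/blast" || return 1
    # -p preserves modes and timestamps of the copied files
    cp -p "$COMMON_DATASETS/blast/$1"* "$SCRATCH/blast/" || return 1
    # Tell BLAST to search for databases in the scratch copy
    export BLASTDB="$SCRATCH/blast"
}
```

After calling the function with the name of a database listed by ls $COMMON_DATASETS/blast, BLAST commands started from the same shell resolve that database from $SCRATCH/blast rather than from the shared read-only space.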

Maintaining local copies

Files on $SCRATCH are subject to the 90-day purge policy, so a local copy of a dataset may lose files over time. Tools such as rsync or dsync can restore deleted files by re-synchronizing the copy with $COMMON_DATASETS.

For example, to synchronize the AlphaFold 3 databases to $SCRATCH:

$ sh_dev -c 4 -p service -t 2:00:00
salloc: Granted job allocation 16755526

$ ml system mpifileutils
$ srun dsync $COMMON_DATASETS/alphafold3 $SCRATCH/alphafold3
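For smaller datasets, the same synchronization can be sketched with rsync instead of dsync. The helper below is a minimal illustration, not a Sherlock-provided command: the name sync_dataset is hypothetical, and it assumes the dataset is a directory directly under $COMMON_DATASETS. Because rsync skips files that already exist unchanged at the destination, running it again only transfers files that are missing from or differ in the $SCRATCH copy.

```shell
# Hypothetical helper (not provided on Sherlock): bring $SCRATCH/<name>
# back in sync with $COMMON_DATASETS/<name>.
sync_dataset() {
    # $1 is the dataset directory name, e.g. alphafold3
    mkdir -p "$SCRATCH/$1" || return 1
    # -a recurses and preserves permissions, timestamps, and symlinks.
    # The trailing slashes copy the directory contents rather than
    # nesting a second <name> directory inside the destination.
    rsync -a "$COMMON_DATASETS/$1/" "$SCRATCH/$1/"
}
```

Calling sync_dataset alphafold3 from within a job allocation mirrors the dsync example above, using a single rsync process instead of parallel workers.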