Common datasets

To help researchers save time on downloads, Stanford Research Computing hosts databases and models for commonly used software in $COMMON_DATASETS. This is a read-only storage space that is accessible to all Sherlock users.

Available datasets

The following datasets are currently available in $COMMON_DATASETS:

| Dataset | Path | Description |
| --- | --- | --- |
| AlphaFold 3 | `$COMMON_DATASETS/alphafold3` | Genetic sequence and structural template databases for AlphaFold 3. See our AlphaFold documentation for instructions on running it on Sherlock. |
| NCBI BLAST databases | `$COMMON_DATASETS/blast` | Sequence databases for use with NCBI BLAST and related tools. Copy the databases you need to `$SCRATCH/blast`, then set `BLASTDB=$SCRATCH/blast` before running BLAST. |
| Ollama models | `$COMMON_DATASETS/ollama` | Pre-downloaded LLM models for use with Ollama. Automatically integrated with the `ollama` module (no manual setup needed). See our Ollama documentation for more details. |

To see the full and up-to-date list of available datasets, run:

$ ls $COMMON_DATASETS
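
The BLAST entry in the table above describes a copy-then-set-`BLASTDB` workflow. A minimal sketch of that workflow follows. The helper name `stage_blast_db` is hypothetical, and the sketch assumes that the files of one database sit directly under `$COMMON_DATASETS/blast` and share a base name (for example `<name>.phr`, `<name>.pin`); check the actual layout with `ls` before relying on it.

```shell
# Hypothetical helper: copy one BLAST database (every file sharing a base name)
# from the shared read-only area to $SCRATCH/blast, then point BLAST at the copy.
# Assumption: the database files sit directly under $COMMON_DATASETS/blast.
stage_blast_db() {
    db_name="$1"                      # base name, as listed by `ls $COMMON_DATASETS/blast`
    src="$COMMON_DATASETS/blast"
    dest="$SCRATCH/blast"

    mkdir -p "$dest" || return 1

    # Copy every file whose name starts with "<db_name>."
    found=0
    for f in "$src/$db_name".*; do
        [ -e "$f" ] || continue       # the glob matched nothing
        cp -p "$f" "$dest/" || return 1
        found=1
    done

    if [ "$found" -eq 0 ]; then
        echo "no files matching $src/$db_name.* found" >&2
        return 1
    fi

    # BLAST tools look up database names in the directory named by BLASTDB
    export BLASTDB="$dest"
}
```

After `stage_blast_db <name>` succeeds, BLAST commands started from the same shell should be able to find the database by that base name.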

Optimizing performance

For faster run times and optimal performance, you should NOT run jobs against $COMMON_DATASETS directly. Instead, copy your desired dataset to $SCRATCH or $GROUP_SCRATCH, and then reference that copy in your jobs.
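
One way to follow this advice is sketched below, under stated assumptions: the helper name `local_dataset` is hypothetical, and a plain `cp -a` is used for brevity. The helper copies a dataset to scratch only if no copy exists yet and prints the path that a job should use.

```shell
# Hypothetical helper: make sure a scratch copy of a dataset exists, then print its path.
# Usage: local_dataset <name> [destination root, defaults to $SCRATCH]
local_dataset() {
    name="$1"
    dest_root="${2:-$SCRATCH}"
    src="$COMMON_DATASETS/$name"
    dest="$dest_root/$name"

    if [ ! -d "$src" ]; then
        echo "dataset not found: $src" >&2
        return 1
    fi

    # Copy only when no local copy exists yet; an existing copy is left untouched.
    if [ ! -d "$dest" ]; then
        mkdir -p "$dest_root" && cp -a "$src" "$dest" || return 1
    fi

    # Print the path so callers can capture it
    printf '%s\n' "$dest"
}
```

A job script could then capture the path with, for instance, `DB_DIR=$(local_dataset alphafold3 "$GROUP_SCRATCH")` and pass `$DB_DIR` to the application. For large datasets, the first copy is probably best done inside a job allocation rather than on a login node.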

Maintaining local copies

Files on $SCRATCH are subject to the 90-day purge policy, so parts of a local copy may be deleted over time. Tools such as rsync or dsync can restore the deleted files by re-synchronizing the local copy with $COMMON_DATASETS.

For example, to synchronize the AlphaFold 3 databases to $SCRATCH:

$ sh_dev -c 4 -p service -t 2:00:00
salloc: Granted job allocation 16755526

$ ml system mpifileutils
$ srun dsync $COMMON_DATASETS/alphafold3 $SCRATCH/alphafold3
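
The text above also mentions rsync. A minimal sketch of the same synchronization with rsync follows; the wrapper name `refresh_dataset` is hypothetical and the sketch assumes `rsync` is available on the node where it runs. It has not been tuned for very large datasets, where a parallel tool such as dsync may be the better fit.

```shell
# Hypothetical helper: bring $SCRATCH/<name> back in line with $COMMON_DATASETS/<name>.
# With -a, rsync skips files whose size and modification time already match at the
# destination, so only missing or changed files are copied again.
refresh_dataset() {
    name="$1"
    src="$COMMON_DATASETS/$name"
    dest="$SCRATCH/$name"

    [ -d "$src" ] || { echo "dataset not found: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1

    # Trailing slashes: copy the contents of src into dest, not src itself
    rsync -a "$src/" "$dest/"
}
```

Running `refresh_dataset alphafold3` again at a later date should restore any files that have gone missing from the scratch copy.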