Anaconda
Introduction
Anaconda is a Python/R distribution that aims to simplify package management and deployment for scientific computing. Although it can have merits on individual computers, it's often counter-productive on shared HPC systems like Sherlock.
Avoid using Anaconda on Sherlock
We recommend NOT using Anaconda on Sherlock, and suggest considering other options instead, such as virtual environments or containers.
Why Anaconda should be avoided on Sherlock
Anaconda is widely used in several scientific domains, such as data science, AI/ML, and bioinformatics, and software documentation often lists it as the recommended (if not the only) way to install a given package.
It is a useful solution for simplifying the management of Python and scientific libraries on a personal computer. However, on highly-specialized HPC systems like Sherlock, management of these libraries and dependencies should be done by Stanford Research Computing staff, to ensure compatibility and optimal performance on the cluster hardware.
For instance:
- Anaconda very often installs software (compilers, scientific libraries, etc.) that already exists on Sherlock as modules, and does so in a sub-optimal fashion, by installing sub-optimal versions and configurations,
- it installs binaries which are not optimized for the processor architectures on Sherlock,
- it makes incorrect assumptions about the location of various system libraries,
- Anaconda installs software in $HOME by default, where it writes a large number of files. A single Anaconda installation can easily fill up your $HOME directory quota, and makes things difficult to manage,
- Anaconda installations can't easily be relocated,
- Anaconda modifies your $HOME/.bashrc file, which can easily cause conflicts and slow things down when you log in.
Worse, a Conda recipe can force the installation of R (even though it's already available on Sherlock). This installation won't perform nearly as well as the version we provide as a module (which uses optimized libraries), and it may not work at all: jobs launched with it may crash, wasting both computing resources and your time.
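To see how much of a $HOME quota an existing installation takes up, the file count and total size of its directory can be measured. The following is a minimal sketch: the function name count_files and the miniconda3 directory are assumptions chosen for illustration, so adjust the path to wherever the installer actually put its files.

```python
import os
from pathlib import Path


def count_files(root):
    """Return (number_of_files, total_bytes) for everything under root."""
    n_files = 0
    total_bytes = 0
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat measures a symlink itself, not the file it points to.
                total_bytes += os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable: skip it
            n_files += 1
    return n_files, total_bytes


if __name__ == "__main__":
    # Hypothetical location, used here only as an example.
    target = Path.home() / "miniconda3"
    if target.is_dir():
        files, size = count_files(target)
        print(f"{target}: {files} files, {size / 1e9:.2f} GB")
    else:
        print(f"{target} does not exist")
```

Walking a very large directory tree generates a lot of file system activity, so a check like this is best run sparingly.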
Installation issues
If you absolutely need to install anaconda/miniconda, please note that because of the large number of files that the installer will try to open, this will likely fail on a login node. So make sure to run the installation on a compute node, for instance using the sh_dev command.
What to do instead
Use a virtual environment
Instead of using Anaconda for your project, even when the installation instructions of the software you want to install rely on it, you can use a virtual environment.
A virtual environment offers all the functionality you need to use Python on Sherlock. You can convert Anaconda instructions and use a virtual environment instead, by following these steps:
- list the dependencies (also called requirements) of the application you want to use:
  - check if there is a requirements.txt file in the Git repository or in the software sources,
  - or, check the variable install_requires in the setup.py file, which lists the requirements.
- find which dependencies are Python modules and which are libraries provided by Anaconda. For example, CUDA and CuDNN are libraries that Anaconda can install, but which should not be re-installed as they are already available as modules on Sherlock,
- remove from the list of dependencies everything which is not a Python module (e.g. cudatoolkit and cudnn),
- create a virtual environment to install your dependencies.
And that's it: your software should run, without Anaconda. If you have any issues, please don't hesitate to contact us.
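The filtering and environment-creation steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not a prescribed procedure: the helper names python_only and create_env are hypothetical, and the NON_PYTHON set contains only the two examples named above, so it has to be extended by hand for each project.

```python
import re
import venv

# Assumption: conda package names that are libraries, not Python modules.
# Only cudatoolkit and cudnn come from the text; extend this set as needed.
NON_PYTHON = {"cudatoolkit", "cudnn"}


def python_only(requirements):
    """Return the requirement lines that are not known non-Python libraries."""
    kept = []
    for line in requirements:
        # Drop comments (a '#' at line start or preceded by whitespace).
        line = re.sub(r"(^|\s+)#.*$", "", line).strip()
        if not line:
            continue
        # The package name is everything before the first specifier character.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0].lower()
        if name not in NON_PYTHON:
            kept.append(line)
    return kept


def create_env(env_dir):
    """Create a virtual environment in env_dir, with pip installed in it."""
    venv.create(env_dir, with_pip=True)
```

For instance, python_only(["numpy>=1.20", "cudatoolkit=11.3"]) keeps only the numpy entry. The remaining lines can then be written to a new requirements file and installed with the pip executable found in the bin directory of the environment that create_env produced. Conda version pins written with a single "=" are not valid for pip and would need to be rewritten by hand.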
Use a container
In some situations, the complexity of a program's dependencies requires the use of a solution where you can control the entire software environment. In these situations, we recommend using a container.
Tip
Existing Docker images can easily be converted into Apptainer/Singularity images.
The only potential downside of using containers is their size and the associated storage usage. But if your research group plans on using several container images, it could be useful to collect them all in a single location (like $GROUP_HOME) to avoid duplication.
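As a rough sketch of how a Docker image could be converted and stored in a shared location, the snippet below builds an apptainer pull command whose output file sits in a group-wide directory. The helper name pull_command, the containers subdirectory and the python:3.12 image are assumptions made for illustration. The script only prints the command, so it can be inspected before being run.

```python
import os
import shlex
from pathlib import Path


def pull_command(docker_image, dest_dir):
    """Build an `apptainer pull` command that saves a Docker image as a SIF file."""
    # Derive a file name such as "python_3.12.sif" from "python:3.12".
    name = docker_image.rsplit("/", 1)[-1].replace(":", "_") + ".sif"
    return [
        "apptainer",
        "pull",
        str(Path(dest_dir) / name),
        f"docker://{docker_image}",
    ]


if __name__ == "__main__":
    # Assumption: GROUP_HOME is set in the environment; fall back to the
    # current directory otherwise. "containers" is a hypothetical subdirectory.
    dest = Path(os.environ.get("GROUP_HOME", ".")) / "containers"
    cmd = pull_command("python:3.12", dest)
    # Print the command for inspection instead of running it directly.
    print(shlex.join(cmd))
```

Keeping one copy of each image in a shared directory lets all group members run the same file instead of each pulling their own.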