# Running jobs
## Login nodes
**Login nodes are not for computing**
Login nodes are shared among many users and therefore must not be used to run computationally intensive tasks. Those should be submitted to the scheduler which will dispatch them on compute nodes.
The key principle of a shared computing environment is that resources are shared among users and access to them must be scheduled. On Sherlock, it is mandatory to schedule work by requesting resources and submitting jobs to the scheduler. Because login nodes are shared by all users, they must not be used to execute computational tasks.
Acceptable uses of login nodes include:
- lightweight file transfers,
- script and configuration file editing,
- job submission and monitoring.
**Resource limits are enforced**
To minimize disruption and ensure a comfortable working environment for users, resource limits are enforced on login nodes. Processes started there will automatically be terminated if their resource usage (including CPU time, memory and run time) exceeds those limits.
## Slurm commands
Slurm is the job scheduler used on Sherlock. It is responsible for managing the resources of the cluster and scheduling jobs on compute nodes.
There are several ways to request resources and submit jobs. The main Slurm commands to submit jobs are listed in the table below:
| Command  | Description                                                          | Behavior                                                          |
|----------|----------------------------------------------------------------------|-------------------------------------------------------------------|
| `salloc` | Request resources and allocate them to a job                         | Starts a new interactive shell on a compute node                  |
| `srun`   | Request resources and run a command on the allocated compute node(s) | Blocking command: will not return until the executed command ends |
| `sbatch` | Request resources and run a script on the allocated compute node(s)  | Asynchronous command: will return as soon as the job is submitted |
## Interactive jobs
### Dedicated nodes
Interactive jobs allow users to log in to a compute node and run commands interactively on the command line. They can be an integral part of an interactive programming and debugging workflow. The simplest way to establish an interactive session on Sherlock is to use the `sh_dev` command:
**Tip**
sh_dev is the recommended starting point for interactive work. It uses sensible defaults, runs on dedicated nodes, and typically gives you immediate access without any wait time.
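In its simplest form, that's a single command with no arguments:

```shell
$ sh_dev
```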
This will open a login shell using one core and 4 GB of memory on one node, for one hour. `sh_dev` sessions run on dedicated compute nodes, which ensures minimal wait times when you need to access a node to test scripts, debug code, or do any other kind of interactive work.
sh_dev also provides X11 forwarding via the submission host (typically the login node you're connected to) and can thus be used to run GUI applications.
### Compute nodes
If you need more resources[^1], you can pass options to `sh_dev` to request more CPU cores, more nodes, or even run in a different partition. `sh_dev -h` will provide more information:
```shell
$ sh_dev -h
sh_dev: start an interactive shell on a compute node.

Usage: sh_dev [OPTIONS]

Optional arguments:
    -c  number of CPU cores to request (OpenMP/pthreads, default: 1)
    -g  number of GPUs to request (default: none)
    -n  number of tasks to request (MPI ranks, default: 1)
    -N  number of nodes to request (default: 1)
    -m  memory amount to request (default: 4GB)
    -p  partition to run the job in (default: dev)
    -t  time limit (default: 01:00:00)
    -r  allocate resources from the named reservation (default: none)
    -J  job name (default: sh_dev)
    -q  quality of service to request for the job (default: normal)

Note: the default partition only allows for limited amount of resources.
      If you need more, your job will be rejected unless you specify an
      alternative partition with -p.
```
For instance, you can request 4 CPU cores, 8 GB of memory and 1 GPU in the gpu partition with:
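Using the options listed in the help output above, that would be:

```shell
$ sh_dev -c 4 -m 8GB -g 1 -p gpu
```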
Another way to get an interactive session on a compute node is to use salloc to execute a shell through the scheduler. For instance, to start a shell on a compute node in the normal partition, with the default resource requirements (one core for 2 hours), you can run:
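With all defaults (one core, in the default partition), that is simply:

```shell
$ salloc
```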
The main advantage of this approach is that it allows you to specify the full range of submission options, including options that `sh_dev` may not support.
You can also submit an existing job script or other executable as an interactive job:
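One way to do this is to pass the script to `salloc` (the `my_script.sh` name here is a hypothetical example):

```shell
$ salloc my_script.sh
```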
### Connecting to nodes
**Connecting to compute nodes**
Users are not allowed to connect to compute nodes unless they have a job running there.
If you SSH to a compute node without any active job allocation, you'll be greeted by the following message:
```shell
$ ssh sh02-01n01
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed
$
```
Once you have a job running on a node, you can SSH directly to it and run additional processes[^2], observe how your application behaves, debug issues, and so on.
The `salloc` command supports the same parameters as `sbatch`, and can override any default configuration. Note that any `#SBATCH` directive in your job script will not be interpreted by `salloc` when it is executed this way: you must specify all arguments directly on the command line for them to be taken into account.
## Batch jobs
A batch job is a script submitted to the scheduler (Slurm) for asynchronous execution on a compute node. It can run any program (an R, Python, or Matlab script, a compiled binary, or any sequence of shell commands). Submitting a job with sbatch will either start it immediately or place it in a pending state in the queue, depending on resource availability.
### Job queuing
The time a job spends pending is primarily influenced by two factors:
- the number of jobs ahead of yours in the queue,
- the amount and type of resources your job requests.
To minimize queue time, request only the resources necessary for your workload. Overestimating resource needs can result in longer wait times. Profiling your code interactively (for example, in an sh_dev session using tools like htop, nvtop, sacct) can help you determine appropriate resource requirements.
### Requesting resources
When submitting a job, you can request the following:
- **CPUs**: how many CPU cores the program called in the `sbatch` script needs. Unless it can utilize multiple CPUs at once, request a single CPU. Check your code's documentation or profile it interactively with `sh_dev` and `htop` if unsure.
- **GPUs**: if your code is GPU-enabled, how many GPUs does it need?
- **Memory (RAM)**: estimate how much memory your job will consume. Consider whether your program loads large datasets or uses significant memory on your local machine. For most jobs, the default memory allocation usually suffices.
- **Time**: specify how long your job will take to run to completion.
- **Partition**: choose the compute partition (e.g. `normal`, `gpu`, `owners`, `bigmem`).
You also need to tell the scheduler what your job should do: which modules to load, and how to run your application. Note that any logic you can code into a bash script can also go into an sbatch script.
### Example sbatch script
The example job below will run the Python script mycode.py for 10 minutes on the normal partition, using 1 CPU and 8 GB of memory. To aid in debugging, we name this job "test_job" and append the job ID (`%j`) to the two output files that Slurm creates when the job is executed. The output files are written to the directory from which you launched your job (you can also specify a different path): one will contain any errors, the other the non-error output. Look in these two files, ending in `.err` and `.out`, for useful debugging information and error output.
```shell
#!/bin/bash
#SBATCH --job-name=test_job        # (1)
#SBATCH --output=test_job.%j.out   # (2)
#SBATCH --error=test_job.%j.err    # (3)
#SBATCH --time=10:00               # (4)
#SBATCH --partition=normal         # (5)
#SBATCH --cpus-per-task=1          # (6)
#SBATCH --mem=8GB                  # (7)

module load python/3.6.1           # (8)
module load py-numpy/1.19.2_py36

python3 mycode.py
```
1. Job name shown in `squeue` output and email notifications.
2. Standard output file; `%j` is replaced by the job ID at runtime.
3. Standard error file; keeping it separate from stdout makes debugging easier.
4. Wall-clock time limit (10 minutes here); the job is killed if it exceeds this.
5. Partition to submit to; `normal` is the default general-purpose partition.
6. Number of CPU cores; set to `1` unless your code is explicitly multi-threaded.
7. Memory (RAM) to allocate; adjust based on your application's needs.
8. Always load software modules explicitly, with the version, for reproducibility.
**More #SBATCH options**
The example above covers the basics. For a curated list of commonly used directives with examples, see the SBATCH options page.
Here are the steps to create and submit this batch script:
1. Create and edit the sbatch script with a text editor like `vim`/`nano`, or the OnDemand file manager, then save the file. In this example, we call it `test.sbatch`.

2. Submit it to the scheduler with the `sbatch` command: `sbatch test.sbatch`

3. Monitor your job and its job ID in the queue with the `squeue` command:

    ```shell
    $ squeue --me
        JOBID  PARTITION      NAME      USER  ST  TIME  NODES  NODELIST(REASON)
      4915821     normal  test_job  <userID>  PD  0:00      1  (Resources)
    ```

    Notice that the job's state (`ST`) is "Pending" (`PD`), and that the `NODELIST(REASON)` column shows why it is pending: in this case, it is waiting for resources to become available. Once the job starts to run, its state will change to "Running" (`R`), and the `NODES` column will indicate how many nodes are being used:

    ```shell
    $ squeue --me
        JOBID  PARTITION      NAME      USER  ST  TIME  NODES  NODELIST(REASON)
      4915821     normal  test_job  <userID>   R  0:10      1  sh02-01n49
    ```

    This last output means that job 4915821 has been running (`R`) on compute node `sh02-01n49` for 10 seconds (`0:10`).
While your job is running you can connect to the node it's running on via SSH, to monitor your job in real-time. For example, if your job is running on node sh02-01n49, you can connect to it with:
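Continuing the example above:

```shell
$ ssh sh02-01n49
```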
and then use tools like htop to watch processes and resource usage.
You can also manage this job using the job ID assigned to it (4915821). For example, the job can be canceled with the `scancel` command.
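For instance:

```shell
$ scancel 4915821
```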
After your job completes, you can assess and fine-tune your resource requests (time, CPU/GPU, memory) with the `sacct` or `seff` commands.
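For example, to review elapsed time, CPU time and peak memory for the job above (the `--format` field selection is just one possibility):

```shell
$ sacct -j 4915821 --format=JobID,Elapsed,TotalCPU,MaxRSS
$ seff 4915821
```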
### Long-running jobs
Jobs in the `normal` partition are limited to 2 days by default, but can run for up to 7 days with the `long` QOS[^3]. To use it, add `--qos=long` and set `--time` accordingly:
```shell
#!/bin/bash
#SBATCH --job-name=long_job
#SBATCH --time=7-00:00:00   # (1)
#SBATCH --qos=long          # (2)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB

srun ./my_application
```
1. Time limit in `days-HH:MM:SS` format; up to 7 days with `--qos=long`.
2. Required to exceed the default 2-day limit in the `normal` partition.
**The long QOS is for non-owner users only**
If you are part of a PI group that has invested in Sherlock, you can already run jobs for up to 7 days in your group's partition, without needing the `long` QOS. The `long` QOS in `normal` is specifically intended for users who do not have access to an owner partition.
### Command-line options
`sbatch` options can also be passed directly on the command line instead of (or in addition to) `#SBATCH` directives in the script:
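For example (the option values here are illustrative):

```shell
$ sbatch --time=2:00:00 --mem=16GB my_script.sh
```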
Any `#SBATCH` directives inside `my_script.sh` are merged with the command-line options, with command-line options taking precedence in case of conflicts.
## Estimating resources
To get a better idea of the amount of resources your job will need, you can use the ruse command, available as a module:
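To make it available in your environment:

```shell
$ module load ruse
```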
ruse is a command line tool developed by Jan Moren to measure a process' resource usage. It periodically measures the resource use of a process and its sub-processes, and can help you find out how much resource to allocate to your job. It will determine the actual memory, execution time and cores that individual programs or MPI applications need to request in their job submission options.
ruse periodically samples the process and its sub-processes and keeps track of the CPU, time and maximum memory use. It also optionally records the sampled values over time. The purpose of ruse is not to profile processes in detail, but to follow jobs that run for many minutes, hours or days, with no performance impact and without changing the measured application in any way.
You'll find complete documentation and details about ruse's usage on the project webpage, but here are a few useful examples.
### Sizing a job
In its simplest form, ruse can help discover how much resources a new script or application will need. For instance, you can start a sizing session on a compute node with an overestimated amount of resources, and start your application like this:
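For instance (the resource amounts and the `./myapp` binary are placeholders):

```shell
# start an interactive allocation with deliberately generous resources
$ salloc -c 4 --mem=16GB --time=4:00:00
# run the application under ruse to measure its actual usage
$ ruse ./myapp
```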
This will generate a <myapp>-<pid>/ruse output file in the current directory, looking like this:
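Illustrative contents (the numbers here are made to match the summary described below):

```
Time:           02:55:47
Memory:         7.4 GB
Cores:          4
Total_procs:    3
Active_procs:   2
Proc(%): 99.9  99.9
```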
It shows that myapp:
- ran for almost 3 hours,
- used a little less than 8 GB of memory,
- had 4 cores available,
- spawned 3 processes, among which at most 2 were active at the same time,
- and both active processes each used 99.9% of a CPU core.
This information could be useful in tailoring the job resource requirements to its exact needs, making sure that the job won't be killed for exceeding one of its resource limits, and that the job won't have to wait too long in queue for resources that it won't use. The corresponding job request could look like this:
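For instance, based on the measurements above, directives along these lines would fit the job (the time limit adds some headroom over the ~3 hours measured):

```shell
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=8GB
```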
### Verifying a job's usage
It's also important to verify that applications, especially parallel ones, stay within the confines of the resources they've requested. For instance, a number of parallel computing libraries assume that they can use all the resources on the host: they will automatically determine the number of physical CPU cores present on the compute node, and start as many processes. This can be a significant issue if the job requested fewer CPUs, as more processes will be constrained to fewer CPU cores, resulting in node overload and degraded performance for the application.
To avoid this, you can start your application with ruse and report usage for each time step specified with -t. You can also request the reports to be displayed directly on stdout rather than stored in a file.
For instance, this will report usage every 10 seconds:
```shell
$ ruse -s -t10 --stdout ./myapp
   time    mem  processes  process usage
  (secs)   (MB) tot  actv  (sorted, %CPU)
     10   57.5   17    16   33 33 33 25 25 25 25 25 25 25 25 20 20 20 20 20
     20   57.5   17    16   33 33 33 25 25 25 25 25 25 25 25 20 20 20 20 20
     30   57.5   17    16   33 33 33 25 25 25 25 25 25 25 25 20 20 20 20 20

Time:           00:00:30
Memory:         57.5 MB
Cores:          4
Total_procs:    17
Active_procs:   16
Proc(%): 33.3  33.3  33.2  25.0  25.0  25.0  25.0  25.0  25.0  24.9  24.9  20.0  20.0  20.0  20.0  19.9
```
Here, we can see that despite being allocated 4 CPUs, the application started 17 threads, 16 of which were actively running intensive computations, with the unfortunate consequence that each process could only use a fraction of a CPU.
In that case, to ensure optimal performance and system operation, it's important to modify the application parameters to make sure that it doesn't start more computing processes than the number of requested CPU cores.
## Available resources
Whether you are submitting a batch job or an interactive job, it's important to know what resources are available to you. For this reason, we provide `sh_part`, a command-line tool to help answer questions such as:
- which partitions do I have access to?
- how many jobs are running on them?
- how many CPUs can I use?
- where should I submit my jobs?
sh_part can be executed on any login or compute node to see what partitions are available to you, and its output looks like this:
```shell
$ sh_part
     partition   ||   nodes    |        CPU cores         |       GPUs           ||   job runtime   |    mem/core     |      per-node
 name     public || idle total | idle  total queued | idle total queued || default maximum | default maximum | cores  mem(GB)  gpus
-----------------------------------------------------------------------------------------------------------------------------------------------------
 normal*  yes    ||    0   218 |  438   5844   6949 |    0     0      0 ||      2h      7d |     6GB     8GB | 20-64  128-384     0
 bigmem   yes    ||    0    11 |  537    824    255 |    0     0      0 ||      2h      1d |     6GB    64GB | 24-256 384-4096    0
 gpu      yes    ||    0    33 |  354   1068    905 |   25   136    196 ||      1h      2d |     8GB    32GB | 20-64  191-2048  4-8
 dev      yes    ||    1     4 |   64    104      0 |   62    64      0 ||      1h      2h |     6GB     8GB | 20-32  128-256  0-32
 service  yes    ||    5     6 |  129    132      0 |    0     0      0 ||      1h      2h |     1GB     8GB | 20-32  128-256     0
-----------------------------------------------------------------------------------------------------------------------------------------------------
```
The above example shows five partitions where jobs can be submitted: `normal`, `bigmem`, `gpu`, `dev`, and `service`. It also provides additional information, such as the maximum amount of time allowed in each partition, the number of other jobs already in queue, and the ranges of resources available on nodes in each partition. In particular:
- in the `partition name` column, the `*` character indicates the default partition,
- the `queued` columns show the number of CPU cores or GPUs requested by pending jobs,
- the `per-node` columns show the range of resources available on each node in the partition. For instance, the `gpu` partition has nodes with 20 to 64 CPU cores, 191 to 2048 GB of memory, and up to 8 GPUs per node.
### Public partitions
Here are the main public partitions available to everyone on Sherlock:
| Partition | Purpose | Resources | Limits |
|---|---|---|---|
| `normal` | General-purpose compute jobs | 20-64 cores/node, 6-8 GB RAM/core | Default runtime of 2 hours, max. 2 days (up to 7 days with the `long` QOS[^3]) |
| `bigmem` | High-memory compute jobs | For jobs requiring > 256 GB, up to 4 TB RAM/node | Maximum runtime of 1 day |
| `gpu` | GPU compute jobs | 20-64 cores/node, up to 2 TB RAM/node, 4 or 8 GPUs/node | 16 GPUs/user |
| `dev` | Development and testing jobs | Dedicated nodes and lightweight GPU instances (MIG) | 2h max, 4 cores + 2 GPUs/user |
| `service` | Lightweight, recurring administrative tasks | Massively over-subscribed resources | 2 jobs, 16 cores/user, 2 days runtime |
### Owner partitions
Research groups that have invested in Sherlock get access to a dedicated partition named after their PI's SUNet ID. Jobs in owner partitions can run for up to 7 days without any special QOS, and owners get immediate access to their nodes, with no wait time in queue.
#### The owners partition
All members of owner groups also have access to the shared owners partition, which spans the nodes contributed by all PI groups. This makes a much larger pool of resources available, at the cost of potential preemption: when a node's purchasing group needs their resources back, any jobs from other owner groups running on that node are preempted (i.e. killed and requeued) to make room. Jobs that are preempted are automatically requeued and will restart when resources are available again, so it is important to make sure your jobs can handle being interrupted and restarted (e.g. by checkpointing regularly).
**Tip**
The owners partition is a good choice when you need to scale out and can tolerate occasional restarts. Use your group's dedicated partition when you need guaranteed, uninterrupted access to resources.
#### High-priority QOS
Within an owner partition, all jobs share the same priority by default. If you need to push a specific job ahead of others already in queue (for instance to meet a deadline, or to distinguish foreground work from background jobs), you can use the high_p QOS:
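For instance (the script name here is hypothetical):

```shell
$ sbatch --qos=high_p my_job.sbatch
```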
Jobs submitted with --qos=high_p get a priority boost over other pending jobs in the same partition, so they will start sooner when resources become available.
**Info**
The high_p QOS is only available in owner partitions. It has no effect in public partitions like normal or gpu.
## Service jobs
For lightweight, recurring, or persistent tasks (data transfers, backups, database servers, cron-like jobs), Sherlock provides a dedicated service partition. See the Service jobs page for full details, including examples of recurring and persistent job scripts.
[^1]: The dedicated partition that `sh_dev` uses by default only allows up to 2 cores and 8 GB of memory per user at any given time. So if you need more resources for your interactive session, you may have to specify a different partition.

[^2]: Please note that your SSH session will be attached to your running job, and that resources used by that interactive shell will count towards your job's resource limits. So if you start a process using large amounts of memory via SSH while your job is running, you may hit the job's memory limits, which will trigger its termination.

[^3]: The `long` QOS can only be used in the `normal` partition, and is only accessible to users who are not part of an owners group (since owner groups can already run jobs for up to 7 days in their respective partitions).