Introduction to Slurm
- What is Slurm?
- Slurm Commands
  - sbatch
  - srun
  - squeue
  - sinfo
  - sacct
  - scancel

What is Slurm?

On an HPC facility, unlike on your own desktop or laptop, you do not directly access the hardware that makes up the system.

Instead, in most cases, you instruct a special software service, known as the scheduler, to run applications on your behalf. The scheduler analyses the type of job you want to run, taking note of the type and quantity of resources you indicate it needs (number of processors, amount of memory, etc) and finds the best place to fit your job from the available servers which make up the HPC facility.

Since the scheduler does this on behalf of all users, it is able to efficiently coordinate the running of thousands of jobs at the same time, managing the distribution of jobs across the facility to make the best use of the available resources and finding space for your jobs to fit. It also monitors all of the jobs running across the HPC facility to ensure that they do not exceed the resources which they have been allocated.

Without an overview of everything that is running or waiting to run, starting jobs yourself would likely cause problems: you might start a job on a server that didn't have enough free RAM, or where all of the processors were already allocated to existing jobs. The result would be conflicts over available resources and jobs running inefficiently.

On a small system with only a handful of physical servers this may seem more complicated than needed, but once you start using systems that have dozens, hundreds or even thousands of physical servers, it rapidly becomes impractical for one person to keep track of what jobs are running where, and what the capabilities of each server are. The scheduler does this for you.

We use Slurm, a very popular open-source scheduler. Almost every HPC facility uses a scheduler of some sort, and most are comparable in functionality. For more technical implementation details regarding Slurm, see the website operated by the maintainers of the software:

https://slurm.schedmd.com/quickstart.html

You can find out more about the various types of scheduling software at the independent High Performance Computing information wiki:

https://hpc-wiki.info/hpc/Scheduling_Basic

Read on for the most common Slurm commands you will encounter when using HPC facilities at Newcastle University.

Slurm Commands

The vast majority of those using HPC facilities at Newcastle University will only ever need a small handful of the available Slurm commands. These are:

sbatch - Submit jobs to the scheduler
srun - Run a single command via the scheduler
squeue - Job information (for pending and running only)
sinfo - Partition details
sacct - Job information (for historic jobs, including performance data)
scancel - Cancel a scheduled (or running) job

sbatch

The sbatch command takes the job file you have written and submits it to the scheduler to be queued. We strongly suggest submitting your jobs via sbatch: job files are themselves code, so you can manage them with software version control tools.
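
As an illustration, a job file is an ordinary shell script whose #SBATCH comment lines carry the resource request. The sketch below is a minimal, hypothetical job file: the job name and all resource values are assumptions chosen for illustration, and the short partition is simply the one that appears in the sinfo listing later on this page. Check the limits that apply on your facility before reusing any of them.

```shell
#!/bin/bash
# Hypothetical job file; every value below is an illustrative assumption.

# Name shown for the job in squeue and sacct
#SBATCH --job-name=hello
# Partition to run in
#SBATCH --partition=short
# One task (process), using one CPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Memory requested for the job on its node
#SBATCH --mem=1G
# Runtime limit in hours:minutes:seconds
#SBATCH --time=00:05:00

# Everything below runs on the compute node chosen by the scheduler.
message="Hello I am running on $(hostname)"
echo "$message"
```

Because the #SBATCH lines are comments as far as the shell is concerned, the same file can also be run directly with bash for a quick syntax check before it is submitted.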

For example, if your job file was named myjobfile.sh and was in the scripts sub-directory of your home directory:

$ sbatch $HOME/scripts/myjobfile.sh
Submitted batch job 123456789
$

The Job ID which is returned can be used to view status information on the job, obtain its position in the scheduler queue, view performance data, cancel it, etc. You will not normally see any other output in the terminal at this point; by default your job logs all output to the file slurm-<job_ID>.out, which in the example above would be slurm-123456789.out.

This file will be created in the directory from which you ran the sbatch command.
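
If you submit jobs from a script, it can be handy to keep the Job ID in a variable. The sketch below works on the sample line from the example above rather than on a live submission, so it can be tried anywhere; the variable names are hypothetical.

```shell
# Sample line, copied from the sbatch example above
submit_output="Submitted batch job 123456789"

# Keep only the text after the last space, which is the Job ID
job_id="${submit_output##* }"

echo "Job ID: ${job_id}"
echo "Output file: slurm-${job_id}.out"
```

On a live system, sbatch also accepts a --parsable option which prints the Job ID on its own (followed by a cluster name on multi-cluster setups), so something like job_id=$(sbatch --parsable $HOME/scripts/myjobfile.sh) avoids the string handling altogether.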

Full information on the sbatch command can be obtained while logged in to our HPC facilities with the man command:

$ man sbatch

https://slurm.schedmd.com/sbatch.html

srun

The srun command submits a single command to the scheduler as a job, runs it interactively while showing its output to the user, and then returns to the original session.

For example, running a script named myscript.sh, which displays the hostname of the machine it runs on and prints the value of Pi, in a partition named short:

$ srun --partition=short $HOME/scripts/myscript.sh
srun: job 123456799 queued and waiting for resources
srun: job 123456799 has been allocated resources
Hello I am running on sb056.cluster
Pi calculated as 3.14
$

Unlike sbatch, jobs run via srun display all output in the current terminal window by default. They also pause between being submitted and being allocated the requested resources. In the example above, the delay between the "queued and waiting for resources" and "has been allocated resources" messages is not guaranteed; it could be seconds, many minutes or even hours.

Unless you know exactly what you are doing, we recommend running jobs via sbatch instead.

Users who run MPI jobs will often start them with srun from within their sbatch job files. This is the recommended method, in preference to mpirun, as srun automatically inherits much of the setup from your #SBATCH headers.

All of the #SBATCH header settings are available as command line options to the srun command, so you can set the partition, number of CPUs, runtime etc., just as you would with sbatch. Using srun can be beneficial if you want to test something quickly and observe output as a job runs, but it is not recommended for anything long running (unlike with sbatch, you will have to keep your terminal open!).
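
To make the correspondence between #SBATCH headers and srun options concrete, the sketch below turns a few header lines into the equivalent srun command line and prints it. It does not call srun itself, so it can be tried without Slurm; the header values and the hostname command are illustrative assumptions.

```shell
# Hypothetical job-file headers, held in a variable for illustration
headers='#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --time=00:05:00'

# Strip the "#SBATCH " prefix and join the options onto one line
srun_opts=$(printf '%s\n' "$headers" | sed -n 's/^#SBATCH[[:space:]]*//p' | tr '\n' ' ')
srun_opts=${srun_opts% }

# Print the equivalent interactive command
echo "srun $srun_opts hostname"
```

The printed line, srun --partition=short --ntasks=1 --time=00:05:00 hostname, is what you would type to run the same request interactively.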

Full information on the srun command can be obtained while logged in to our HPC facilities with the man command:

$ man srun

https://slurm.schedmd.com/srun.html

squeue

By default, the squeue command shows a list of all running and pending (waiting to run) jobs that the scheduler knows about, for all users.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          17864081    bigmem  bd6_job abc12345  R   23:23:02      1 xln02
          17864079    bigmem  bd5_job abc12345  R 1-09:31:39      1 xln01
          17857465    bigmem  bd3_job abc12345  R 3-11:36:38      1 ln04
          17864082    bigmem  bd7_job abc12345  R 2-17:26:22      1 ln02
          17864072    bigmem  bd4_job abc12345  R 2-18:41:22      1 ln03
          17867113    bigmem   USTAR1 n12345    R 1-05:31:14      1 mb01
          17824926 bigmem,de vep4-Sen n45678    PD      0:00      1 (DependencyNeverSatisfied)
          17878168      defq f3dyn.sh bcd23456  PD      0:00      4 (Resources)
          17878169      defq q5R8e5r1 n67832    PD      0:00      4 (Resources)
          17880197      defq highrevo defg78901 PD      0:00      4 (Resources)
          17877696      defq T1_9_VNS n87654    PD      0:00      2 (Dependency)
          17877695      defq T1_9_VNS n87654    PD      0:00      2 (Dependency)
          17877637      defq T1_9_VNS n87654    PD      0:00      2 (Dependency)
          17877469      defq T1_7_VNS n87654    PD      0:00      2 (Dependency)
          17878216      defq nemo_72. n91267    R    3:03:47      1 sb059
          17864625      defq run6-Por nabc123   R   22:51:51     12 sb[049-051,053,055-057,060-062,075,080]
$

You can easily restrict the output to a single user:

$ squeue --user=abc12345
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          17864081    bigmem  bd6_job abc12345  R   23:23:02      1 xln02
          17864079    bigmem  bd5_job abc12345  R 1-09:31:39      1 xln01
          17857465    bigmem  bd3_job abc12345  R 3-11:36:38      1 ln04
          17864082    bigmem  bd7_job abc12345  R 2-17:26:22      1 ln02
          17864072    bigmem  bd4_job abc12345  R 2-18:41:22      1 ln03
$

The --me option is a useful shorthand for showing only your own jobs:

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          17864081    bigmem  bd6_job abc12345  R   23:23:02      1 xln02
          17864079    bigmem  bd5_job abc12345  R 1-09:31:39      1 xln01
          17857465    bigmem  bd3_job abc12345  R 3-11:36:38      1 ln04
          17864082    bigmem  bd7_job abc12345  R 2-17:26:22      1 ln02
          17864072    bigmem  bd4_job abc12345  R 2-18:41:22      1 ln03
$

Or look for just the waiting jobs:

$ squeue --states=PD
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          17824926 bigmem,de vep4-Sen  n123213 PD       0:00      1 (DependencyNeverSatisfied)
          17882234      defq    kk.sh    n1225 PD       0:00      6 (Resources)
          17878168      defq f3dyn.sh b7891234 PD       0:00      4 (Resources)
          17878169      defq q5R8e5r1    n9927 PD       0:00      4 (Resources)
          17880197      defq highrevo b7891234 PD       0:00      4 (Resources)
          17877696      defq T1_9_VNS    n5627 PD       0:00      2 (Dependency)
          17877695      defq T1_9_VNS    n5627 PD       0:00      2 (Dependency)
          17877637      defq T1_9_VNS    n5627 PD       0:00      2 (Dependency)
          17877469      defq T1_7_VNS    n5627 PD       0:00      2 (Dependency)
          17772908 defq,long Birmingh a1234567 PD       0:00      1 (DependencyNeverSatisfied)
          17772519 defq,long Finland_ a1234567 PD       0:00      1 (DependencyNeverSatisfied)
          17763133 defq,long UK1_OSst a1234567 PD       0:00      1 (DependencyNeverSatisfied)
          17777214 defq,long OSstatus a1234567 PD       0:00      1 (DependencyNeverSatisfied)
          17777213 defq,long OSstatus a1234567 PD       0:00      1 (DependencyNeverSatisfied)
$

Common state codes are:

R - Job is running
PD - Job is waiting / pending
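
If you only want a quick summary of how many jobs are in each state, the ST column can be counted with awk. The sketch below runs on four lines copied from the first squeue listing above, with the column spacing collapsed, so it can be tried without access to the scheduler; the variable names are hypothetical.

```shell
# Sample squeue lines, copied from the listing above (header removed)
sample='17864081 bigmem bd6_job abc12345 R 23:23:02 1 xln02
17878168 defq f3dyn.sh bcd23456 PD 0:00 4 (Resources)
17877696 defq T1_9_VNS n87654 PD 0:00 2 (Dependency)
17878216 defq nemo_72. n91267 R 3:03:47 1 sb059'

# Count jobs per state; the state code is the fifth column
state_counts=$(printf '%s\n' "$sample" |
  awk '{ n[$5]++ } END { for (s in n) print s, n[s] }' | sort)

echo "$state_counts"
```

For this sample it prints "PD 2" and "R 2". On a live system the same awk program could read from squeue --noheader instead of the sample, so that the header line is not counted as a state.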

Full information on the squeue command, and the possible state codes, can be obtained while logged in to our HPC facilities with the man command:

$ man squeue

https://slurm.schedmd.com/squeue.html

sinfo

The sinfo command shows you the general status of the HPC resources: the job partitions, which nodes are available in each partition, and the runtime limit that applies to each.

Each line of output from sinfo shows the status of one or more hosts against a partition:

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*          up 2-00:00:00      1  down* sb044
defq*          up 2-00:00:00     31    mix sb[008,012-015,024-025,027,031,040-041]
defq*          up 2-00:00:00     78  alloc sb[001-007,009-011,016-023,026,028-030,032-039]
short          up      10:00      1  down* sb044
short          up      10:00     37    mix ln[02-04],mb01,sb[008,012-015,024-025,027,031,040-041],xln[01-02]
short          up      10:00     78  alloc sb[001-007,009-011,016-023,026,028-030,032-039,042-043,045-047]
short          up      10:00      6   idle ln01,mb02,mn[01-04]
long           up 30-00:00:0      1  down* sb044
long           up 30-00:00:0     31    mix sb[008,012-015,024-025,027,031,040-041,048,063-064,070-071,076,081,083-087,090-096,106]
long           up 30-00:00:0     78  alloc sb[001-007,009-011,016-023,026,028-030,032-039,042-043,045-047,049-062,065-069]
interactive    up 1-00:00:00     37    mix ln[02-04],mb01,sb[008,012-015,024-025,027,031,040-041,048,063-064,070-071,076],xln[01-02]
interactive    up 1-00:00:00     78  alloc sb[001-007,009-011,016-023,026,028-030,032-039,042-043,045-047,049-062,065-069,072-075,]
interactive    up 1-00:00:00      6   idle ln01,mb02,mn[01-04]
bigmem         up 14-00:00:0      6    mix ln[02-04],mb01,xln[01-02]
bigmem         up 14-00:00:0      6   idle ln01,mb02,mn[01-04]
$

Looking at the output above, the short partition has the following:

One node is down (state DOWN: sb044)
A large number of nodes have some of their resources allocated to jobs (state MIX)
Most nodes are fully allocated to jobs (state ALLOC)
Several nodes are idle, with no jobs running (state IDLE: ln01, mb02, mn01-mn04)

This can be useful when scheduling jobs, or deciding on a particular partition/resource type to use.
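
As a sketch of how you might pick out idle nodes in a script, the example below filters three lines copied from the sinfo listing above, so it can be tried without access to the scheduler. The variable names are hypothetical.

```shell
# Sample sinfo lines, copied from the listing above (header removed)
sample='short up 10:00 1 down* sb044
short up 10:00 6 idle ln01,mb02,mn[01-04]
bigmem up 14-00:00:0 6 idle ln01,mb02,mn[01-04]'

# Print the node count and node list for idle nodes in the short partition
idle_short=$(printf '%s\n' "$sample" |
  awk '$1 == "short" && $5 == "idle" { print $4, $6 }')

echo "$idle_short"
```

On a live system, sinfo --partition=short --states=idle should give the same selection directly. Note that the default partition is shown with a trailing asterisk (defq* in the listing above), which a text match on the partition name would need to allow for.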

Full information on the sinfo command can be obtained while logged in to our HPC facilities with the man command:

$ man sinfo

https://slurm.schedmd.com/sinfo.html

sacct

The sacct command is superficially similar to squeue, but is primarily used to retrieve data on historic jobs. You can extract information from previous jobs to understand how they ran and what resources they used, possibly as a means to improve the effectiveness of future jobs.

By default, sacct will show all jobs for the current user (including finished, waiting and running), in the current time window (which is approximately equivalent to the current day). With no other parameters the output shows basic information about each job (job ID, job name, partition, account, number of allocated CPUs, state and exit code):

$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
17784183       T2_Decay       long    mygroup         88    TIMEOUT      0:0 
17784183.ba+      batch               mygroup         44  CANCELLED     0:15 
17784183.0   Foucault.+               mygroup         88  CANCELLED     0:15 
17794687_2   Artifi_Mo+       long    mygroup          8    RUNNING      0:0 
17794687_2.+      batch               mygroup          8    RUNNING      0:0 
17794688_1   Artifi_Mo+       long    mygroup          8    RUNNING      0:0 
17794688_1.+      batch               mygroup          8    RUNNING      0:0 
17794688_2   Artifi_Mo+       long    mygroup          8    RUNNING      0:0 
17794688_2.+      batch               mygroup          8    RUNNING      0:0 
17851999     RESPONSE_+       long    mygroup          1    RUNNING      0:0 
17851999.ba+      batch               mygroup          1    RUNNING      0:0 
17854203       job_3_FC       long    mygroup         32    RUNNING      0:0 
17854203.ba+      batch               mygroup         32    RUNNING      0:0 
17854203.0   interIsoF+               mygroup         32    RUNNING      0:0 
17854204       job_2_FC       long    mygroup         32    RUNNING      0:0
$

A common option is to limit the results by job status. Here's an example which reports on jobs which failed due to insufficient memory allocation in the last 7 days:

$ sacct --state=OOM --starttime now-7days --endtime now
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
17789551_142   QTABmelt       defq     mygroup         4 OUT_OF_ME+    0:125 
17789551_14+      batch                mygroup         4 OUT_OF_ME+    0:125 
17789551_144   QTABmelt       defq     mygroup         4 OUT_OF_ME+    0:125 
17789551_14+      batch                mygroup         4 OUT_OF_ME+    0:125 
17789551_153   QTABmelt       defq     mygroup         4 OUT_OF_ME+    0:125
$

You may then choose to get more detailed information on one of those jobs which failed due to OUT_OF_MEMORY by using the -j (job ID) and -l (long output) parameters:

$ sacct -j 17789551 -l
JobID     JobIDRaw    JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU  
 NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask  
 AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsage
OutAve TRESUsageOutTot 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -
------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -
------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- ---------
------ ---------------
17789551_43+ 17789992.ba+      batch               142804K          sb099              0    142804K   3494420K      sb099          0   3494420K        0        sb099              0          0   02:44:08      sb099          0   02:44:08  
      1          4   00:45:30  COMPLETED      0:0        74K             0             0             0     2500Mc              0        1.16M           sb099               0          1.16M        0.00M            sb099                0  
        0.00M            cpu=4,mem+ cpu=02:44:08,+ cpu=02:44:08,+ cpu=sb099,energy=+ cpu=0,fs/disk=0,m+ cpu=02:44:08,+ cpu=sb099,energy=+ cpu=0,fs/disk=0,m+ cpu=02:44:08,+ energy=0,fs/di+ energy=sb099,fs/di+           fs/disk=0 energy=0,
fs/di+ energy=0,fs/di+ 
17789551_435 17789993       QTABmelt       defq                                                                                                                                                                                              
                 4   02:42:35 OUT_OF_ME+    0:125                  Unknown       Unknown       Unknown     2500Mc                                                                                                                            
              billing=4+ billing=4+                                                                                                                                                                                                          
                       
17789551_43+ 17789993.ba+      batch               142804K          sb099              0    142804K  10220208K      sb099          0  10220208K        0        sb099              0          0   09:56:26      sb099          0   09:56:26  
      1          4   02:42:35 OUT_OF_ME+    0:125        12K             0             0             0     2500Mc              0        1.16M           sb099               0          1.16M        0.00M            sb099                0  
        0.00M            cpu=4,mem+ cpu=09:56:26,+ cpu=09:56:26,+ cpu=sb099,energy=+ cpu=0,fs/disk=0,m+ cpu=09:56:26,+ cpu=sb099,energy=+ cpu=0,fs/disk=0,m+ cpu=09:56:26,+ energy=0,fs/di+ energy=sb099,fs/di+           fs/disk=0 energy=0,
fs/di+ energy=0,fs/di+ 
17789551_436 17789997       QTABmelt       defq                                                                                                                                                                                              
                 4   00:01:21  COMPLETED      0:0                  Unknown       Unknown       Unknown     2500Mc                                                                                                                            
              billing=4+ billing=4+
$

Parsing sacct output

If you want to parse the above long-format output in a script, or paste it into a spreadsheet, then adding the --parsable option will place a '|' character between fields so that the output is more easily split into columns.
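
The sketch below shows one way '|'-separated output could be processed with awk, here converting a MaxRSS value to GiB. The sample lines are hand-written for illustration, modelled on the values in the long listing above (they are not captured sacct output), and the conversion assumes that the K suffix means units of 1024 bytes.

```shell
# Hand-written sample in '|'-separated form (header plus one job step)
sample='JobID|State|Elapsed|MaxRSS|ReqMem
17789551_435.batch|OUT_OF_MEMORY|02:42:35|10220208K|2500Mc'

# Skip the header, strip the trailing K from MaxRSS and convert to GiB
report=$(printf '%s\n' "$sample" | awk -F'|' 'NR > 1 {
  kb = $4
  sub(/K$/, "", kb)
  printf "%s %s %.2f GiB\n", $1, $2, kb / 1048576
}')

echo "$report"
```

On a live system, a command along the lines of sacct -j 17789551 --parsable2 --format=JobID,State,Elapsed,MaxRSS,ReqMem should produce input of this shape. The --parsable2 variant omits the trailing '|' which --parsable adds at the end of each line, and --format limits the output to the named columns.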

Full information on the sacct command can be obtained while logged in to our HPC facilities with the man command:

$ man sacct

https://slurm.schedmd.com/sacct.html

scancel

The scancel command allows you to remove pending jobs from the scheduler queue, as well as request any running job to be stopped.

First, find the Job ID of the job you want to remove from the queue or stop (replace n1234 with your normal University IT account name):

$ squeue -u n1234
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18020902      long 13_alcoh    n1234  R       4:34      1 sb027
          18015794      long 13_SN2_1    n1234  R 1-02:18:49      1 sb094
          18015793      long 13_E2_2_    n1234  R 1-02:35:24      1 sb110
          18018643      long dynamics    n1234  R   22:21:16      1 sb100
$

In this case, we choose to cancel the running (state R) job 18018643:

$ scancel 18018643
$

Note: In most cases you can only cancel jobs which you have submitted yourself.
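
scancel accepts more than one Job ID at a time. As a sketch, the example below selects jobs by name from the sample squeue listing above and prints the scancel command that would cancel them. It is a dry run only: nothing is cancelled, and the "13_" name prefix is simply taken from the sample.

```shell
# Sample squeue lines, copied from the listing above (header removed)
sample='18020902 long 13_alcoh n1234 R 4:34 1 sb027
18015794 long 13_SN2_1 n1234 R 1-02:18:49 1 sb094
18015793 long 13_E2_2_ n1234 R 1-02:35:24 1 sb110
18018643 long dynamics n1234 R 22:21:16 1 sb100'

# Collect the IDs of jobs whose name starts with "13_"
job_ids=$(printf '%s\n' "$sample" | awk '$3 ~ /^13_/ { print $1 }' | tr '\n' ' ')
job_ids=${job_ids% }

# Dry run: print the command instead of executing it
echo "scancel $job_ids"
```

scancel also has its own filters, such as --user and --state, so a command like scancel --user=n1234 --state=PENDING would cancel all of that user's pending jobs in one go. Check man scancel before relying on them.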

Full information on the scancel command can be obtained while logged in to our HPC facilities with the man command:

$ man scancel

https://slurm.schedmd.com/scancel.html
