Your First Slurm Job

Job Files / Slurm SBATCH Files

Job files are just text files. They can be created with any text editor (but not a word processor like Word). The job file tells the HPC scheduler what hardware your job requires and contains the commands that run your computational workload.

The scheduler takes your job file and looks for the most suitable place to run your commands, from across the available hardware in the HPC facility. The choices you make and the options you set in your job file are used by the scheduler to better inform its decision on when and where to run your job.

As noted, with an HPC system you do not normally interact directly with the server or servers which are used to run your code or application; the job file describes what you need, and it is up to the scheduler to run it on your behalf.

The actual code you want to run might be some custom Python or R, or it may involve starting up Matlab or Ansys to run a pre-defined solution. Whatever you intend to run, the job script is the way you do this.

In 99.9% of cases you will need a job file to tell the scheduler where to find your job, what resources you need, and how to run it. As the job file is also code, we recommend that you use software version control tools, such as Git / Github.com, to store your job files, especially if they are being shared amongst others in your HPC project group.
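As a minimal sketch of what that might look like (the repository name, job file name and commit message here are just placeholders):

$ git init my-hpc-jobs
$ cd my-hpc-jobs
$ cp ~/myjob.sh .
$ git add myjob.sh
$ git commit -m "Add Slurm job script"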

The Basics

First, log on to the HPC facility.

All of your job files start the same way, in your text editor (for example nano) you create a new file (let us call it firstjob.sh) and add this line at the very top:

#!/bin/bash

This tells the system that when the scheduler actually runs your job, it will be run by the bash command - a built-in utility on Linux systems. The bash tool allows for simple scripting and flow control and is an easy way to get started writing scripts and chaining together sequences of actions on Linux.

After you have added the #!/bin/bash header line, you then add the commands which you want your job to run. In this case we are going to print the date, report which server the job is running on, and then print the date again.

Let us run nano firstjob.sh and enter the following simple job:

#!/bin/bash
#SBATCH --account=myhpcgroup

date
echo "We are running on $HOSTNAME"
date

For the --account line, make sure that you change myhpcgroup to the name of the HPC project group you are a member of (e.g. rockhpc_abc23 or comethpc_xyz23).

You must always use a valid account name in your job files, as the system needs to know which project to allocate your resource use to. Unlike earlier HPC facilities, our current systems require that you specify your account. The --account field is no longer an optional parameter.

Save the file as firstjob.sh and exit nano.

Running The Job

We send the job to the scheduler using the sbatch command followed by the name of the job file. For the job we created above this is sbatch firstjob.sh:

$ sbatch firstjob.sh
Submitted batch job 134567899
$

Okay, so the job is now submitted, but where did it go, and how do you see what your code did?

Well, we need to understand that running a job on an HPC system is often somewhat different to running those same scripts and code on a local computer: your job does not run interactively, so nothing is printed to your screen as it runs.

Instead, all output that your job would have sent to the screen is captured and written to a text log file slurm-<job_ID>.out, where job_ID is replaced by the number which sbatch reported when you submitted the job. In this example we would find that a new file named slurm-134567899.out has been created.

The .out file may not contain anything at this point, as the scheduler still needs to find a free space across the HPC facility to run your job. There are likely other jobs running on the HPC, and we have to wait our turn.

You can monitor the status of any jobs you submit by running the squeue --me command.
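The output will look something along these lines; the exact columns depend on your site's configuration, and the partition, user and reason shown here are purely illustrative. A state (ST) of PD means the job is pending, and R means it is running:

$ squeue --me
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 134567899  standard firstjob   abc123 PD       0:00      1 (Priority)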

At some point your job will run, and once it has completed the slurm-134567899.out file will contain anything printed to the screen during that run. In the case of the simple job script we typed above, you should see something along these lines:

$ cat slurm-134567899.out
Mon 31 Mar 14:03:45 BST 2025
We are running on compute37.hpc
Mon 31 Mar 14:03:46 BST 2025
$


Resources

When we submitted our test job script, how did the HPC scheduler know how much RAM or how many CPU cores (both referred to as job resources) to allocate to it, or how long it was going to run?

It didn't.

Whilst the scheduler software is quite sophisticated and can manage thousands of jobs running across many dozens or hundreds of servers, it has no means of magically identifying how much memory or processor power your application needs to run. In our example above your job ran with the default resource allocation - we are fortunate that such a basic job can run with the default resource values.

The basic resource requirements of a job are CPU cores, memory (RAM) and run time.

The resources of our HPC facilities are grouped into logical containers called partitions. These groupings are used to gather together servers of similar capabilities and performance. Each partition type has (or may have) a different default resource allocation.
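If you want to target a specific partition rather than accepting the default, you can name it in your job file. The partition name below is only an example - the partitions available to you will depend on the facility you are using (see the Resources and Partitions section):

#SBATCH --partition=standard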

Calculating Your Resource Use

All Slurm jobs cost resources. Some of the hardware resources we make available are free, whereas some are reserved for paying projects. It is important to understand that the resources you request in your Slurm job directly impact the costs incurred by your project.

Our Methodology

When we calculate how many resources a Slurm job has used we use the following mechanism:

Number of resources * Job time in hours = Hours of Compute Resources

CPU Based Slurm Jobs

In the case of a Slurm job which only uses CPU resources, this becomes:

Total number of CPU cores * Hours = Total Hours of CPU Compute Resource

GPU Based Slurm Jobs

In the case of a Slurm job using GPU resources, the calculation is:

Total number of GPU cards * Hours = Total Hours of GPU Compute Resource

Note that, as the calculations above show, RAM / Memory is not a factor in the cost of your Slurm jobs.
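As a worked example of the methodology (the numbers are purely illustrative):

8 CPU cores * 2 hours = 16 Hours of CPU Compute Resource
2 GPU cards * 10 hours = 20 Hours of GPU Compute Resource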

Read on to understand how to change the basic resources allocated to your Slurm job; the most common resources you will need to request are CPU cores, Memory and job time:

Memory

We can adjust the amount of RAM the scheduler will allocate us using the --mem parameter in our job file:

#SBATCH --mem=

Values without a suffix are interpreted as Megabytes, so --mem=1000M and --mem=1000 are identical. You may use the optional suffixes K, M, G and T for Kilobytes, Megabytes, Gigabytes and Terabytes, respectively.
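For example, to request 8 Gigabytes of memory for your job (the amount here is just an illustration):

#SBATCH --mem=8G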

While we don't directly include the amount of memory your job requires as part of our costing methodology, it is still important for Slurm to know how much your job needs, so that it can find the right space on the right server for your job to start.

CPU Cores

We can adjust the number of CPU cores the scheduler will allocate us using the --cpus-per-task parameter in our job file:

#SBATCH --cpus-per-task=

Or, use the shorthand -c:

#SBATCH -c
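For example, a request for 4 CPU cores (an illustrative value) can be written as:

#SBATCH --cpus-per-task=4

or, equivalently:

#SBATCH -c 4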

Time

We can adjust the amount of time we want the scheduler to allocate to our job using the --time parameter in our job file:

#SBATCH --time=

The shorthand version is just -t:

#SBATCH -t

The format is hh:mm:ss, or days-hh:mm:ss.
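For example, a request for 2 hours and 30 minutes looks like this:

#SBATCH --time=02:30:00

and a request for 1 day and 12 hours uses the days- form:

#SBATCH --time=1-12:00:00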

The amount of time your job uses only starts at the point the scheduler begins to run it - it does not include the time your job may spend waiting to start.

When the resource cost of your job is calculated it is only the elapsed time the job was running which is used, not the time you requested.

If you request 4 hours, but the job only runs for 2 hours, then your resource cost is based on 2 hours, not 4.

Second Job

Now that we know what the basic resource 'building blocks' of a job script are, let us write a second job, this time adding a number of explicit resource requests:

#!/bin/bash
#SBATCH --account=myhpcgroup
#SBATCH --mem=1000M
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00

date
echo "We are running on $HOSTNAME"
date

Save this file as secondjob.sh and submit the job as before:

$ sbatch secondjob.sh
Submitted batch job 154543891
$

Wait for the job to run (check via squeue --me) and look for the output in slurm-154543891.out:

$ cat slurm-154543891.out
Mon 31 Mar 14:40:44 BST 2025
We are running on compute09.hpc
Mon 31 Mar 14:40:46 BST 2025
$

So what was the difference?

Although in this limited case there appears to be no visible difference, by explicitly requesting a specific number of CPU cores, amount of RAM and runtime, the scheduler had more information available to make an informed choice about where to run our job. It also allowed the scheduler to allocate the specific resources our job needed - before, we were just guessing that the scheduler gave us enough!
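If you are curious about what the scheduler actually allocated, one way to check (a sketch, using the job ID from the example above) is to ask Slurm for the job record while the job is still queued or running:

$ scontrol show job 154543891

The output includes fields such as NumCPUs and TimeLimit showing the allocation the job received.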

In a trivial case such as this it likely makes no significant difference, but what if we had a big job requiring hundreds of Gigabytes of RAM, or a hundred CPU cores, or there were already hundreds of other jobs running on the same server?

If you do not request the right amount of resources for your job, it may not have the memory or time it needs to finish its work, or the scheduler may terminate it for trying to use more than it was allocated.

It may seem harsh to have your job terminated, but the scheduler is working on behalf of all users of the facility - and a job which is trying to use more resources than the scheduler expected may cause many other users to be negatively impacted.

More About Resources

To make a more informed decision about how to write your job scripts to take advantage of our different partition types, please now consult our Resources and Partitions section.

