Slurm Job Checkpointing & Restarting Jobs

Job checkpointing is supported for most types of single-node jobs on Comet using DMTCP.

This allows a job to be suspended or stopped and then resume from that point in the future without the loss of data - this may be especially useful for jobs which run for many days. Enabling checkpoints allows you to resume your job if it crashes, or if the compute resources it is using become unavailable for some reason.

This documentation page shows the typical use of checkpointing for a simple, single-node job run via Slurm.

Prerequisites

Sbatch Script Changes

To support job checkpointing, you need to make a few modifications to your sbatch files.

To Start A New Sbatch Job

The sbatch file to start a new job should have the DMTCP boilerplate section added - this sets up the checkpointing service (dmtcp_coordinator) for you. After that boilerplate you load any software modules as normal, and simply prefix the command that starts your analysis/Python/R/Java/Linux binaries with dmtcp_launch, as shown in the example:

#!/bin/bash
#SBATCH -p default_free
#SBATCH -c 4
#SBATCH -t 08:00:00

###############################
# This is standard DMTCP boilerplate - cut and paste
# without modifying this section
# Get the next free port for DMTCP
get_unused_port() {
    while true; do
        # List the ports currently in use (hex in /proc/net/tcp, converted to decimal)
        LISTENING_PORTS=$(awk 'NR > 1 {print $2}' /proc/net/tcp | awk -F':' '{print $2}')
        LISTENING_PORTS=$(for PORT in ${LISTENING_PORTS}; do echo $((16#${PORT})); done | sort -g)
        read LPORT UPORT < /proc/sys/net/ipv4/ip_local_port_range
        # Pick a random port from the local port range
        MPORT=$(( LPORT + (RANDOM % (UPORT - LPORT)) ))
        # Only return the port if it is not already in use
        if ! echo "${LISTENING_PORTS}" | grep -xq "${MPORT}"; then
            echo "${MPORT}"
            break
        fi
    done
}
export DMTCP_QUIET=1
DMTCP_COORD_PORT=$(get_unused_port)
export DMTCP_COORD_PORT
# Start DMTCP and create a checkpoint every five minutes
dmtcp_coordinator -i 300 &

#######################################
# Your normal script starts here...
module load FOO
module load BAR
# Prefix your python/R/java/shell/binary with dmtcp_launch:
dmtcp_launch -j ./my_script.sh

In the example above we set a checkpoint interval of 300 seconds (5 minutes). This is likely far too short for a real-world job; consider carefully how often your job should produce a checkpoint. More frequent checkpoints mean less compute time is lost when you resume, but they increase the disk space used and interrupt your application's compute time more often.
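
For example, to checkpoint once an hour instead, change the interval (given in seconds) passed to dmtcp_coordinator in the boilerplate:

# Start DMTCP and create a checkpoint every hour
dmtcp_coordinator -i 3600 &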

To Resume A Previous Sbatch Job

The resume script is almost identical to the start script, but instead of calling your Python/R/Java/Linux binaries directly, you call the dmtcp_restart_script.sh script, which is generated during the initial run of your job:

#!/bin/bash
#SBATCH -p default_free
#SBATCH -c 4
#SBATCH -t 08:00:00

###############################
# This is standard boilerplate - cut and paste
# without modifying this section
# Get the next free port for DMTCP
get_unused_port() {
    while true; do
        # List the ports currently in use (hex in /proc/net/tcp, converted to decimal)
        LISTENING_PORTS=$(awk 'NR > 1 {print $2}' /proc/net/tcp | awk -F':' '{print $2}')
        LISTENING_PORTS=$(for PORT in ${LISTENING_PORTS}; do echo $((16#${PORT})); done | sort -g)
        read LPORT UPORT < /proc/sys/net/ipv4/ip_local_port_range
        # Pick a random port from the local port range
        MPORT=$(( LPORT + (RANDOM % (UPORT - LPORT)) ))
        # Only return the port if it is not already in use
        if ! echo "${LISTENING_PORTS}" | grep -xq "${MPORT}"; then
            echo "${MPORT}"
            break
        fi
    done
}
export DMTCP_QUIET=1
DMTCP_COORD_PORT=$(get_unused_port)
export DMTCP_COORD_PORT

# Start DMTCP and create a checkpoint every five minutes
dmtcp_coordinator -i 300 &

#######################################
# Your normal script starts here...
module load FOO
module load BAR
# In the directory where your job ran is a file named 'dmtcp_restart_script.sh'
# Run it to relaunch your original application and resume from where it exited.
./dmtcp_restart_script.sh

Example

At this point you should have the following files in your directory:

$ ls -l
-rwxr-x--- 1 n1234 cometloginaccess     164 Sep  1 10:18 my_script.sh
-rw-r----- 1 n1234 cometloginaccess     812 Sep  1 10:19 resume.sh
-rw-r----- 1 n1234 cometloginaccess    1042 Sep  1 10:51 start.sh
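
Here start.sh and resume.sh are the two sbatch files shown above, and my_script.sh stands in for your own workload. Purely for illustration, it could be a hypothetical long-running script such as:

#!/bin/bash
# Hypothetical example workload: appends a line of output
# once a minute for roughly a week.
for i in $(seq 1 10080); do
    echo "step ${i} at $(date)" >> results.txt
    sleep 60
done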

Run your start.sh sbatch file:

$ sbatch start.sh 
Submitted batch job 1025707

If you look at your slurm output log you should find some log messages from dmtcp:

$ cat slurm-1025707.out 
DMTCP listen on 61951
dmtcp_coordinator starting...
    Host: compute030.comet.hpc.ncl.ac.uk (172.31.26.30)
    Port: 61951
    Checkpoint Interval: 60
    Exit on last client: 0
Type '?' for help.


[2025-09-01T10:51:33.174, 881034, 881034, Note] at coordinatorplugin.h:205 in tick; REASON='No active clients; starting stale timeout; (theStaleTimeout = 28800);
[2025-09-01T10:51:33.274, 881034, 881034, Note] at dmtcp_coordinator.cpp:837 in initializeComputation; REASON='Resetting computation;
[2025-09-01T10:51:33.274, 881034, 881034, Note] at dmtcp_coordinator.cpp:942 in onConnect; REASON='worker connected; (hello_remote.from = 44814dcf2d3c2806-881035-bf73d4948f201); (client->progname() = ./my_script.sh);

Monitoring the same directory, you should also find checkpoint files (ckpt_*) generated at the interval you specified in your sbatch file:

$ ls -l
total 9592
-rw------- 1 n1234 cometloginaccess 4893752 Sep  1 10:52 ckpt_bash_44814dcf2d3c2806-40000-bf73d4fd34dd1.dmtcp
-rw------- 1 n1234 cometloginaccess 4893618 Sep  1 10:52 ckpt_bash_44814dcf2d3c2806-44000-bf74b7cad5f67.dmtcp
-rwxr----- 1 n1234 cometloginaccess    6812 Sep  1 10:52 dmtcp_restart_script_44814dcf2d3c2806-40000-bf73d4948f201.sh
lrwxrwxrwx 1 n1234 cometloginaccess      60 Sep  1 10:52 dmtcp_restart_script.sh -> dmtcp_restart_script_44814dcf2d3c2806-40000-bf73d4948f201.sh
-rwxr-x--- 1 n1234 cometloginaccess     164 Sep  1 10:18 my_script.sh
-rw-r----- 1 n1234 cometloginaccess     812 Sep  1 10:19 resume.sh
-rw-r----- 1 n1234 cometloginaccess    7445 Sep  1 10:53 slurm-1025707.out
-rw-r----- 1 n1234 cometloginaccess    1042 Sep  1 10:51 start.sh

Location of checkpoint files

Notice that the checkpoint files are created in the directory you started the job from. We therefore recommend starting large jobs from the Lustre (/nobackup) filesystem and not your $HOME directory.
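
For example (the project directory here is hypothetical - use your own area under /nobackup):

$ cd /nobackup/myproject
$ sbatch start.sh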

Also present is the file dmtcp_restart_script.sh - a symlink which is updated to point at the restart script for the latest checkpoint file.

You can cancel your sbatch job, or allow it to time out (if it hits the runtime limit):

$ scancel 1025707

Now you may resume the job at a time of your choosing:

$ sbatch resume.sh

The job will resume from the last checkpoint file, with the contents of memory, variables, etc. as they were before the job was cancelled or terminated.

Tip

This also allows you to continue running Slurm jobs which may have hit a runtime limit; just restart the job and it will resume from the last checkpoint with the new runtime allowance.
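
The new allowance comes from the #SBATCH -t line in resume.sh, so if the original limit was too short you can also raise it there before resubmitting, for example:

#SBATCH -t 24:00:00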

Questions, Errors & Limitations

How often should I checkpoint my job?

This is difficult to answer. However, you should consider the size of your job and the frequency at which it writes data to disk or some other output. For a long-running job which needs several days to run to completion, a checkpoint every 6-12 hours is probably more than sufficient to mitigate the loss of multiple days of processing.

However, for a job running for just a few hours, or producing data every few minutes, a more frequent checkpoint interval may be more suitable.

Checkpoint file sizes

The size of the checkpoint files produced is directly related to the in-memory size of your job: the application itself, any libraries it has loaded, and all of the memory it is actively using. If your application occupies many tens of gigabytes of memory, then each checkpoint file will be of a similar size.

Enabling checkpoints on huge jobs will impact NFS ($HOME) or Lustre (/nobackup) performance for all users of the system, including yourself.
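
You can check how much space your checkpoints are consuming with, for example:

$ du -ch ckpt_*.dmtcp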

How Can I Enable Checkpoints for MPI Jobs?

Whilst DMTCP has support for MPI processes, this is implemented by dmtcp_restart_script.sh using ssh to directly launch those processes again on the hosts they were initially running on.

This is incompatible with the use of Slurm on Comet: we do not allow direct ssh connections to compute nodes, and a restarted MPI job is not guaranteed to be allocated the same pool of hosts the second time. As a result, we do not currently support checkpointing MPI jobs.

What To Do With Checkpoint Files?

If your job runs to completion, or you no longer need to restart the job, then the checkpoint files (ckpt_*) are safe to delete. You may also delete any dmtcp_restart_script* files.
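
For example, from the directory the job ran in:

$ rm ckpt_*.dmtcp dmtcp_restart_script*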


Back to Advanced Topics index