====== Slurm Job Checkpointing & Restarting Jobs ======
Job checkpointing is supported for most types of single-node jobs on Comet using [[https://github.com/dmtcp/dmtcp|DMTCP]].
This allows a job to be suspended or stopped and then resume from that point in the future without the loss of data - this may be especially useful for jobs which run for many days. Enabling checkpoints allows you to resume your job if it crashes, or if the compute resources it is using become unavailable for some reason.
This documentation page shows the typical use of checkpointing for a //simple//, //single-node// job run via Slurm.
== Prerequisites ==
* No special user permissions are required
* No special storage permissions are required (checkpoint files are written to where you run your Slurm job from)
* A small amount of code is needed in your Slurm sbatch files
* You need one sbatch file to start your job, and a second sbatch file to resume it later
===== Sbatch Script Changes =====
To support job checkpointing, your sbatch files must have a few modifications made.
=== To Start A New Sbatch Job ===
The sbatch file to start a new job should have the DMTCP boilerplate section added - this sets up the checkpointing service (''dmtcp_coordinator'') for you. After the boilerplate you load any software modules as normal, and simply prefix the command that starts your analysis/Python/R/Java/Linux binary with ''dmtcp_launch'', as shown in the example:
#!/bin/bash
#SBATCH -p default_free
#SBATCH -c 4
#SBATCH -t 08:00:00
###############################
# This is standard DMTCP boilerplate - cut and paste
# without modifying this section
# Get the next free port for DMTCP
get_unused_port() {
    while true; do
        # Build a list of ports currently in use (converted from hex to decimal)
        LISTENING_PORTS=$(awk 'NR > 1 {print $2}' /proc/net/tcp | awk -F':' '{print $2}')
        LISTENING_PORTS=$(for PORT in ${LISTENING_PORTS}; do echo $((16#${PORT})); done | sort -g)
        # Pick a random port from the local ephemeral port range
        read LPORT UPORT < /proc/sys/net/ipv4/ip_local_port_range
        MPORT=$((LPORT + (RANDOM % (UPORT - LPORT))))
        # Only return the port if it is not already in use
        if ! echo "${LISTENING_PORTS}" | grep -qw "${MPORT}"; then
            echo "${MPORT}"
            break
        fi
    done
}
export DMTCP_QUIET=1
DMTCP_COORD_PORT=$(get_unused_port)
export DMTCP_COORD_PORT
# Start DMTCP and create a checkpoint every five minutes
dmtcp_coordinator -i 300 &
#######################################
# Your normal script starts here...
module load FOO
module load BAR
# Prefix your python/R/java/shell/binary with dmtcp_launch:
dmtcp_launch -j ./my_script.sh
In the example above we set a checkpoint interval of **300** seconds / 5 minutes. **This is likely far too short for a real-world job**; consider __carefully__ how often your job should produce a checkpoint. More frequent checkpoints mean less compute time is lost when you resume, but they //increase// the disk space used and //interrupt// the compute time of your application more often.
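For example, to checkpoint every six hours instead, only the ''-i'' argument passed to ''dmtcp_coordinator'' in the boilerplate needs to change (the value below is illustrative, not a recommendation for every workload):
# Start DMTCP and create a checkpoint every 6 hours (21600 seconds)
dmtcp_coordinator -i 21600 &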
=== To Resume A Previous Sbatch Job ===
The resume script is almost identical to the start script, but instead of calling your Python/R/Java/Linux binary directly, you call the ''dmtcp_restart_script.sh'' script, which is //generated during the initial run of your job//:
#!/bin/bash
#SBATCH -p default_free
#SBATCH -c 4
#SBATCH -t 08:00:00
###############################
# This is standard boilerplate - cut and paste
# without modifying this section
# Get the next free port for DMTCP
get_unused_port() {
    while true; do
        # Build a list of ports currently in use (converted from hex to decimal)
        LISTENING_PORTS=$(awk 'NR > 1 {print $2}' /proc/net/tcp | awk -F':' '{print $2}')
        LISTENING_PORTS=$(for PORT in ${LISTENING_PORTS}; do echo $((16#${PORT})); done | sort -g)
        # Pick a random port from the local ephemeral port range
        read LPORT UPORT < /proc/sys/net/ipv4/ip_local_port_range
        MPORT=$((LPORT + (RANDOM % (UPORT - LPORT))))
        # Only return the port if it is not already in use
        if ! echo "${LISTENING_PORTS}" | grep -qw "${MPORT}"; then
            echo "${MPORT}"
            break
        fi
    done
}
export DMTCP_QUIET=1
DMTCP_COORD_PORT=$(get_unused_port)
export DMTCP_COORD_PORT
# Start DMTCP and create a checkpoint every five minutes
dmtcp_coordinator -i 300 &
#######################################
# Your normal script starts here...
module load FOO
module load BAR
# In the directory where your job ran is a file named 'dmtcp_restart_script.sh'
# Run it to relaunch your original application and resume from where it exited.
./dmtcp_restart_script.sh
===== Example =====
At this point you should have the following files in your directory:
* ''start.sh'' - an sbatch file which will start your job
* ''resume.sh'' - an sbatch file used to resume your job
* ''my_script.sh'' - your job/script/application itself; in the example it is a bash script, but it could be your binary, a Python file, R, Java, etc. (a minimal sketch follows this list)
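As an illustration, ''my_script.sh'' could be something as simple as the following long-running loop (this script is a made-up stand-in for the example; substitute your own workload):
#!/bin/bash
# A trivial long-running workload used to demonstrate checkpointing:
# it counts upwards and appends its progress to a file once per second.
COUNT=0
while true; do
    COUNT=$((COUNT + 1))
    echo "Iteration ${COUNT} at $(date)" >> progress.log
    sleep 1
done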
$ ls -l
-rwxr-x--- 1 n1234 cometloginaccess 164 Sep 1 10:18 my_script.sh
-rw-r----- 1 n1234 cometloginaccess 812 Sep 1 10:19 resume.sh
-rw-r----- 1 n1234 cometloginaccess 1042 Sep 1 10:51 start.sh
Run your ''start.sh'' sbatch file:
$ sbatch start.sh
Submitted batch job 1025707
If you look at your Slurm output log you should find some log messages from DMTCP:
$ cat slurm-1025707.out
DMTCP listen on 61951
dmtcp_coordinator starting...
Host: compute030.comet.hpc.ncl.ac.uk (172.31.26.30)
Port: 61951
Checkpoint Interval: 60
Exit on last client: 0
Type '?' for help.
[2025-09-01T10:51:33.174, 881034, 881034, Note] at coordinatorplugin.h:205 in tick; REASON='No active clients; starting stale timeout; (theStaleTimeout = 28800);
[2025-09-01T10:51:33.274, 881034, 881034, Note] at dmtcp_coordinator.cpp:837 in initializeComputation; REASON='Resetting computation;
[2025-09-01T10:51:33.274, 881034, 881034, Note] at dmtcp_coordinator.cpp:942 in onConnect; REASON='worker connected; (hello_remote.from = 44814dcf2d3c2806-881035-bf73d4948f201); (client->progname() = ./my_script.sh);
If you monitor the same directory you should also find checkpoint files (''ckpt_*'') generated at the interval you specified in your sbatch file:
$ ls -l
total 9592
-rw------- 1 n1234 cometloginaccess 4893752 Sep 1 10:52 ckpt_bash_44814dcf2d3c2806-40000-bf73d4fd34dd1.dmtcp
-rw------- 1 n1234 cometloginaccess 4893618 Sep 1 10:52 ckpt_bash_44814dcf2d3c2806-44000-bf74b7cad5f67.dmtcp
-rwxr----- 1 n1234 cometloginaccess 6812 Sep 1 10:52 dmtcp_restart_script_44814dcf2d3c2806-40000-bf73d4948f201.sh
lrwxrwxrwx 1 n1234 cometloginaccess 60 Sep 1 10:52 dmtcp_restart_script.sh -> dmtcp_restart_script_44814dcf2d3c2806-40000-bf73d4948f201.sh
-rwxr-x--- 1 n1234 cometloginaccess 164 Sep 1 10:18 my_script.sh
-rw-r----- 1 n1234 cometloginaccess 812 Sep 1 10:19 resume.sh
-rw-r----- 1 n1234 cometloginaccess 7445 Sep 1 10:53 slurm-1025707.out
-rw-r----- 1 n1234 cometloginaccess 1042 Sep 1 10:51 start.sh
**Location of checkpoint files**
Notice that the checkpoint files are created in the directory you **started the job from**. We therefore recommend starting large jobs from the Lustre (''/nobackup'') filesystem rather than your ''$HOME'' directory.
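For example (the directory name below is hypothetical; use whichever location you have under ''/nobackup''):
# Work from Lustre so the checkpoint files are not written to $HOME
cd /nobackup/my_project/checkpoint_demo    # hypothetical path
sbatch start.sh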
Also present is the file ''dmtcp_restart_script.sh'' - a symbolic link which is updated to point at the restart script for the most recent checkpoint.
You can cancel your sbatch job, or allow it to time out (if it hits the runtime limit):
$ scancel 1025707
Now you may resume the job at a time of your choosing:
$ sbatch resume.sh
The job will resume from the last checkpoint file, with the contents of memory, variables etc as they were before the job was cancelled or terminated.
**Tip**
This also allows you to continue Slurm jobs which have hit their runtime limit: simply resubmit the resume script and the job will carry on from the last checkpoint with a fresh runtime allowance.
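A minimal sketch of that workflow, using the job ID from the example above:
# Confirm the original job has finished, e.g. with a TIMEOUT or CANCELLED state
sacct -j 1025707 --format=JobID,State,Elapsed
# Resubmit using the resume script; it restarts from the latest checkpoint
sbatch resume.sh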
===== Questions, Errors & Limitations =====
=== How often should I checkpoint my job? ===
This is difficult to answer definitively; consider the size of your job and how frequently it writes data to disk or some other output. For a long-running job which needs several days to run to completion, a checkpoint every 6-12 hours is probably more than sufficient to mitigate the loss of multiple days of processing.
However, for a job running for just a few hours, or one producing data every few minutes, a more frequent checkpoint interval may be suitable.
**Checkpoint file sizes**
The size of the checkpoint files produced is directly related to the in-memory size of your job, including any loaded libraries and all active memory in use. If your application occupies many tens of gigabytes of memory, then each checkpoint file will be of a similar size.
Enabling checkpoints on //huge// jobs will impact NFS (''$HOME'') or Lustre (''/nobackup'') performance for all users of the system, including yourself.
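You can check how much space your checkpoints are using from the job's working directory, for example:
# Show the size of each checkpoint file and a grand total
du -ch ckpt_*.dmtcp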
=== How Can I Enable Checkpoints for MPI Jobs? ===
Whilst DMTCP has support for MPI processes, this is implemented by ''dmtcp_restart_script.sh'' using ssh to directly launch those processes again on the hosts they were initially running on.
This is **incompatible** with the use of Slurm on Comet: we do not allow direct ssh connections to compute nodes, and a restarted MPI job is not guaranteed to be allocated the same pool of hosts the second time. As a result, we **do not currently support checkpointing MPI jobs**.
=== What To Do With Checkpoint Files? ===
If your job runs to completion, or you no longer need to restart the job, then the checkpoint files (''ckpt_*'') are safe to delete. You may also delete any ''dmtcp_restart_script*'' files.
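For example, from the job's working directory:
# Remove the checkpoint images and the generated restart scripts
rm -f ckpt_*.dmtcp dmtcp_restart_script*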
----
[[:advanced:index|Back to Advanced Topics index]]