How Can I See Which of My Jobs Have Failed Recently?

To list all of your jobs that have failed within the last 30 days, use the following variation of the sacct command:

$ sacct --state=F --starttime now-30days --endtime now

You can adjust the time window with the --starttime and --endtime parameters, and select a different job state with the --state option. Some common variations:

# All failed jobs in the previous week: 
$ sacct --state=F --starttime now-7days --endtime now

# Jobs that ran out of memory in the last twelve hours: 
$ sacct --state=OOM --starttime now-12hours --endtime now

# Jobs that hit a runtime timeout in the last 45 minutes: 
$ sacct --state=TO --starttime now-45minutes --endtime now
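
If you run these queries often, the pattern can be wrapped in a small shell function. The following is only a sketch: the function name recent_jobs and its argument order are hypothetical (they are not part of Slurm), and it assumes sacct is on your PATH. It relies on --state accepting a comma-separated list of states, uses --allocations to print one line per job rather than one per job step, and selects a few columns with --format.

```shell
# recent_jobs [STATES] [WINDOW]
# Hypothetical helper, not a Slurm command.
#   STATES  comma-separated state codes, e.g. F,OOM,TO  (default: F)
#   WINDOW  look-back period in sacct's relative time syntax,
#           e.g. 7days, 12hours, 45minutes              (default: 30days)
recent_jobs() {
    sacct --allocations \
          --state="${1:-F}" \
          --starttime "now-${2:-30days}" --endtime now \
          --format=JobID,JobName,State,ExitCode,Elapsed
}
```

For example, recent_jobs F,OOM,TO 7days would list jobs from the previous week that failed, ran out of memory, or timed out.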

For reference, the common Slurm job status codes include:

Short Code  Long Code      Description
BF          BOOT_FAIL      Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA          CANCELLED      Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD          COMPLETED      Job has terminated all processes on all nodes with an exit code of zero.
DL          DEADLINE       Job terminated on deadline.
F           FAILED         Job terminated with non-zero exit code or other failure condition. This means your code failed, or the application you ran exited with an error.
NF          NODE_FAIL      Job terminated due to failure of one or more allocated nodes.
OOM         OUT_OF_MEMORY  Job experienced out of memory error.
PD          PENDING        Job is awaiting resource allocation; it has not started.
PR          PREEMPTED      Job terminated due to preemption.
R           RUNNING        Job currently has an allocation.
RQ          REQUEUED       Job was requeued.
RS          RESIZING       Job is about to change size.
RV          REVOKED        Sibling was removed from cluster due to other cluster starting the job.
S           SUSPENDED      Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO          TIMEOUT        Job terminated upon reaching its time limit.
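
To see how your recent jobs are distributed across the states listed above, you can omit --state and tally the State column instead. This is a sketch under stated assumptions: count_job_states is a hypothetical helper name, and the awk step assumes that cancelled jobs may be reported as "CANCELLED by <uid>", so only the first word of each line is kept.

```shell
# count_job_states [WINDOW]
# Hypothetical helper, not a Slurm command: counts jobs per state
# over the look-back WINDOW (default: 30days).
count_job_states() {
    sacct --allocations --noheader --parsable2 \
          --format=State \
          --starttime "now-${1:-30days}" --endtime now |
        awk '{ print $1 }' |   # keep the first word: "CANCELLED by 1234" -> "CANCELLED"
        sort | uniq -c | sort -rn
}
```

Running count_job_states 7days prints one line per state seen in the previous week, most frequent first, each preceded by its job count.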
