How Can I See Which of My Jobs Have Failed Recently?

You can list all of your jobs that have failed within the last 30 days using the following variation of the sacct command:

$ sacct --state=F --starttime now-30days --endtime now

You can adjust the time window with the --starttime and --endtime parameters and the job state with the --state option. Some common variations:

# All failed jobs in the previous week: 
$ sacct --state=F --starttime now-7days --endtime now

# Jobs that ran out of memory in the last twelve hours: 
$ sacct --state=OOM --starttime now-12hours --endtime now

# Jobs that hit a runtime timeout in the last 45 minutes: 
$ sacct --state=TO --starttime now-45minutes --endtime now
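The variations above can be wrapped in a small shell function so that the states and the time window become arguments. This is only a sketch: the function name recent_jobs and its defaults are hypothetical, and the column list passed to --format is one reasonable choice, not a site recommendation. It relies on --state accepting a comma-separated list and on -X (--allocations) to show one line per job rather than one per job step.

```shell
# Hypothetical helper (not part of Slurm): list recent jobs by state.
#   $1  comma-separated state codes (default: F)
#   $2  how far back to look, in sacct's relative-time syntax (default: 7days)
recent_jobs() {
    local states="${1:-F}"
    local window="${2:-7days}"
    # -X prints one line per job allocation instead of one per job step.
    # JobName%30 widens the JobName column to 30 characters.
    sacct -X --state="$states" \
          --starttime "now-$window" --endtime now \
          --format=JobID,JobName%30,Partition,State,ExitCode,Elapsed,End
}
```

For example, `recent_jobs OOM,TO 12hours` would show jobs that ran out of memory or timed out in the last twelve hours, and `recent_jobs` with no arguments would show failed jobs from the previous week.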

For reference, the common Slurm job status codes include:

Short Code	Long Code	Description
BF	BOOT_FAIL	Job terminated due to a launch failure, typically caused by a hardware failure (e.g. the node or block could not be booted and the job cannot be requeued).
CA	CANCELLED	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD	COMPLETED	Job has terminated all processes on all nodes with an exit code of zero.
DL	DEADLINE	Job terminated on deadline.
F	FAILED	Job terminated with non-zero exit code or other failure condition. This means your code failed, or the application you ran exited with an error.
NF	NODE_FAIL	Job terminated due to failure of one or more allocated nodes.
OOM	OUT_OF_MEMORY	Job experienced an out-of-memory error.
PD	PENDING	Job is awaiting resource allocation; it has not started.
PR	PREEMPTED	Job terminated due to preemption.
R	RUNNING	Job currently has an allocation.
RQ	REQUEUED	Job was requeued.
RS	RESIZING	Job is about to change size.
RV	REVOKED	Sibling job was removed from the cluster because another cluster started the job.
S	SUSPENDED	Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO	TIMEOUT	Job terminated upon reaching its time limit.
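
To get a quick overview of how many recent jobs ended in each state, the State column can be tallied with awk. The sketch below assumes the input comes from sacct run with --parsable2 --noheader --format=JobID,State, which prints pipe-separated fields with the long state code; the function name summarize_states is hypothetical. Cancelled jobs are reported as "CANCELLED by <uid>", so the trailing " by ..." part is stripped before counting.

```shell
# Hypothetical helper (not part of Slurm): count jobs per state.
# Reads "JobID|State" lines on standard input and prints "STATE COUNT" lines.
summarize_states() {
    awk -F'|' '
        {
            state = $2
            sub(/ by .*/, "", state)   # "CANCELLED by 1000" -> "CANCELLED"
            count[state]++
        }
        END {
            for (s in count) print s, count[s]
        }
    ' | sort
}

# Intended use (requires a working Slurm installation):
#   sacct -X --noheader --parsable2 --format=JobID,State \
#         --state=F,OOM,TO --starttime now-7days --endtime now | summarize_states
```

Using --parsable2 matters here because the default column layout can truncate long state names, which would make the counts harder to read.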