You can retrieve a report on all jobs that have failed within the last 30 days using the following variation of the sacct command:
$ sacct --state=F --starttime now-30days --endtime now
You can adjust the time window using the --starttime and --endtime parameters and the job status with the --state option. Some common options:
# All failed jobs in the previous week:
$ sacct --state=F --starttime now-7days --endtime now
# Jobs that ran out of memory in the last twelve hours:
$ sacct --state=OOM --starttime now-12hours --endtime now
# Jobs that hit a runtime timeout in the last 45 minutes:
$ sacct --state=TO --starttime now-45minutes --endtime now
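The variations above differ only in the state code and the look-back window, so they can be wrapped in a small shell function. The following is a minimal sketch, not a site-provided tool: the function name `recent_jobs` and the `SACCT_CMD` override variable are hypothetical, and the column list passed to `--format` is just one reasonable choice. It assumes that `--state` accepts a comma-separated list and that `--allocations` limits output to one line per job rather than one per job step.

```shell
# Hypothetical helper (not part of Slurm): list jobs that ended in the given
# state(s) within a look-back window, one line per job allocation.
#   $1  comma-separated state codes (default: F)
#   $2  look-back window in sacct time syntax, e.g. 7days, 12hours (default: 30days)
recent_jobs() {
    states="${1:-F}"
    window="${2:-30days}"
    # SACCT_CMD is an assumption of this sketch: it lets you substitute another
    # command (for example `echo`, to preview the arguments) in place of sacct.
    ${SACCT_CMD:-sacct} --allocations \
        --state="$states" \
        --starttime "now-$window" --endtime now \
        --format=JobID,JobName,State,ExitCode,Elapsed
}

# Example calls (commented out so that sourcing this file runs nothing):
#   recent_jobs                    # failed jobs, last 30 days
#   recent_jobs F,OOM,TO 7days     # failed, out-of-memory, or timed-out jobs, last week
#   SACCT_CMD=echo recent_jobs OOM 12hours   # print the arguments instead of running sacct
```

Keeping the state list and window as positional arguments mirrors the examples above; add or remove `--format` fields to suit what you need to see.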
For reference, the common Slurm job status codes include:
Short Code | Long Code | Description |
---|---|---|
BF | BOOT_FAIL | Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job cannot be requeued). |
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
DL | DEADLINE | Job terminated on deadline. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. This means your code failed, or the application you ran exited with an error. |
NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
OOM | OUT_OF_MEMORY | Job experienced out of memory error. |
PD | PENDING | Job is awaiting resource allocation; it has not started. |
PR | PREEMPTED | Job terminated due to preemption. |
R | RUNNING | Job currently has an allocation. |
RQ | REQUEUED | Job was requeued. |
RS | RESIZING | Job is about to change size. |
RV | REVOKED | Sibling was removed from cluster due to other cluster starting the job. |
S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
TO | TIMEOUT | Job terminated upon reaching its time limit. |
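To see how your recent jobs are distributed across these states, you can tally the State column of sacct's output. The sketch below is illustrative only: `count_states` is a hypothetical helper name, and it assumes the State field holds the long code as its first word (cancelled jobs may be reported with trailing text such as the cancelling user's ID, which is why only the first word is kept).

```shell
# Hypothetical helper: read one job state per line on standard input and
# print "<count> <STATE>" lines, most frequent state first.
count_states() {
    # Keep only the first word of each line so that variants such as
    # "CANCELLED by <uid>" are counted together as CANCELLED.
    awk 'NF { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}

# Example call (commented out so that sourcing this file runs nothing).
# --noheader and --parsable2 give plain, untruncated, header-free output:
#   sacct --allocations --noheader --parsable2 --format=State \
#         --starttime now-7days --endtime now | count_states
```

A summary like this is a quick way to decide which of the `--state` queries above is worth running in detail.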