====== How Can I See Which of My Jobs Have Failed Recently? ======
You can list all of your jobs that have failed within the last 30 days with the following ''sacct'' command:
$ sacct --state=F --starttime now-30days --endtime now
Adjust the time window with the ''--starttime'' and ''--endtime'' options, and the job state to match with the ''--state'' option. Some common variations:
# All failed jobs in the previous week:
$ sacct --state=F --starttime now-7days --endtime now
# Jobs that ran out of memory in the last twelve hours:
$ sacct --state=OOM --starttime now-12hours --endtime now
# Jobs that hit a runtime timeout in the last 45 minutes:
$ sacct --state=TO --starttime now-45minutes --endtime now
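To get a quick overview instead of a full listing, the output of ''sacct'' can be piped through a small summarizing function. The following is a minimal sketch, not a site-provided tool: the function name ''count_by_state'' is hypothetical, and it assumes ''sacct'' is run with ''--parsable2 --noheader --format=JobID,State'' so that each line looks like ''JobID|State''.

```shell
# Count jobs per state.
# Reads "JobID|State" lines (sacct --parsable2 --noheader output) on stdin.
# count_by_state is a hypothetical helper name, not part of Slurm.
count_by_state() {
    # Keep only the first word of the state so that
    # "CANCELLED by 1000" is counted as CANCELLED.
    awk -F'|' 'NF >= 2 { split($2, s, " "); n[s[1]]++ }
               END { for (k in n) print k, n[k] }' | sort
}

# Usage on the cluster (requires Slurm's sacct; -X limits output to
# one line per job allocation rather than one line per job step):
# sacct -X --noheader --parsable2 --format=JobID,State \
#       --state=F,OOM,TO --starttime now-7days --endtime now | count_by_state
```

The ''--state'' option accepts a comma-separated list, so a single call can cover failed, out-of-memory, and timed-out jobs at once.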
For reference, common Slurm job state codes include:
^ Short Code ^ Long Code ^ Description ^
| BF | BOOT_FAIL | Job terminated due to launch failure, typically caused by a hardware failure (e.g. the node or block could not be booted and the job cannot be requeued). |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| DL | DEADLINE | Job terminated on deadline. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. **This means your code failed, or the application you ran exited with an error.** |
| NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
| OOM | OUT_OF_MEMORY | Job experienced an out-of-memory error. |
| PD | PENDING | Job is awaiting resource allocation; it has not started. |
| PR | PREEMPTED | Job terminated due to preemption. |
| R | RUNNING | Job currently has an allocation. |
| RQ | REQUEUED | Job was requeued. |
| RS | RESIZING | Job is about to change size. |
| RV | REVOKED | Sibling job was removed from the cluster because another cluster started the job. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
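If you use the short codes in your own scripts, it can help to translate them back to the long names for log messages. The sketch below simply encodes the table above in a shell function; the name ''state_long_name'' is hypothetical and is not a Slurm command.

```shell
# Map a short state code from the table above to its long name.
# Prints the long name, or returns 1 for a code not listed in the table.
# state_long_name is a hypothetical helper name, not part of Slurm.
state_long_name() {
    case "$1" in
        BF)  echo BOOT_FAIL ;;
        CA)  echo CANCELLED ;;
        CD)  echo COMPLETED ;;
        DL)  echo DEADLINE ;;
        F)   echo FAILED ;;
        NF)  echo NODE_FAIL ;;
        OOM) echo OUT_OF_MEMORY ;;
        PD)  echo PENDING ;;
        PR)  echo PREEMPTED ;;
        R)   echo RUNNING ;;
        RQ)  echo REQUEUED ;;
        RS)  echo RESIZING ;;
        RV)  echo REVOKED ;;
        S)   echo SUSPENDED ;;
        TO)  echo TIMEOUT ;;
        *)   return 1 ;;
    esac
}

# Example: state_long_name OOM
```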
----
[[:faq:index|Back to FAQ]]