====== How Can I See Which of My Jobs Have Failed Recently? ======

You can retrieve a report on all jobs which have failed within the last 30 days using the following variation of the ''sacct'' command:

  $ sacct --state=F --starttime now-30days --endtime now

You can adjust the time window using the ''--starttime'' and ''--endtime'' parameters and the job status with the ''--state'' option. Some common options:

  # All failed jobs in the previous week:
  $ sacct --state=F --starttime now-7days --endtime now

  # Jobs that ran out of memory in the last twelve hours:
  $ sacct --state=OOM --starttime now-12hours --endtime now

  # Jobs that hit a runtime timeout in the last 45 minutes:
  $ sacct --state=TO --starttime now-45minutes --endtime now

For reference, the common Slurm job status codes include:

^ Short Code ^ Long Code ^ Description ^
| BF | BOOT_FAIL | Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| DL | DEADLINE | Job terminated on deadline. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. **This means your code failed, or the application you ran exited with an error.** |
| NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
| OOM | OUT_OF_MEMORY | Job experienced out of memory error. |
| PD | PENDING | Job is awaiting resource allocation; it has not started. |
| PR | PREEMPTED | Job terminated due to preemption. |
| R | RUNNING | Job currently has an allocation. |
| RQ | REQUEUED | Job was requeued. |
| RS | RESIZING | Job is about to change size. |
| RV | REVOKED | Sibling was removed from cluster due to other cluster starting the job. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |

----

[[:faq:index|Back to FAQ]]
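If you want a quick tally rather than a full listing, the ''sacct'' queries above can be piped into a small helper. The sketch below is only an illustration: ''count_by_state'' is a hypothetical function name, not part of Slurm, and it assumes its input is pipe-delimited ''JobID|State'' lines such as those produced by ''sacct --parsable2 --noheader --format=JobID,State''.

```shell
# Hypothetical helper (not part of Slurm): tally jobs per state.
# Assumes pipe-delimited "JobID|State" lines on standard input.
count_by_state() {
    awk -F'|' '
        NF >= 2 && $2 != "" {
            # Keep only the first word of the state, so that
            # "CANCELLED by 1000" is counted as "CANCELLED".
            split($2, words, " ")
            counts[words[1]]++
        }
        END {
            for (state in counts) printf "%s %d\n", state, counts[state]
        }
    ' | sort
}
```

One possible way to use it, counting last week's failed, out-of-memory, and timed-out jobs (''-X'' restricts the output to whole jobs rather than individual job steps):

  $ sacct -X --noheader --parsable2 --format=JobID,State --state=F,OOM,TO --starttime now-7days --endtime now | count_by_state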