Gathering Troubleshooting Information

Whether you are troubleshooting on your own or with assistance from others, you will want to know “What, Where, When, and How” an error occurred. Copying and pasting directly from the terminal is a good way to capture the exact time, what you were doing, the hostname, and the current directory.

Where and When?

Edit the shell environment configuration file (for example, to configure bash, add the following to the ~/.bashrc file):

export PS1="[\d \t \u@\h:\w ] $ "   # shows date, time, user, host and current directory in your prompt
HISTTIMEFORMAT="%d/%m/%y %T "       # adds timestamps to your history

Alternatively, gather this information by running commands:
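
The commands themselves are not listed on this page. A minimal sketch, assuming the standard date, id, uname and pwd utilities are available (the variable names are illustrative only), might look like this:

```shell
# When: current date and time
when="$(date '+%Y-%m-%d %H:%M:%S')"
# Where: user, host and working directory (uname -n prints the hostname)
where="$(id -un)@$(uname -n):$(pwd)"
echo "[$when $where]"
```

Pasting the printed line alongside the error message gives roughly the same context as the prompt configured above.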

Context: What and How?

Always provide any scripts you were using. Where there is a directory containing many relevant scripts and data files, you can use:
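
The command for this step is missing from this page. One common option, assumed here rather than taken from the original, is to pack the directory into a single compressed archive with tar. The helper name bundle_dir and the example path are hypothetical:

```shell
# bundle_dir DIR: pack DIR into DIR.tar.gz beside it (hypothetical helper)
bundle_dir() {
    dir="${1%/}"    # strip any trailing slash
    # -C changes to the parent directory so the archive holds relative paths
    tar -czf "${dir}.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
    # list the archive contents to confirm what was captured
    tar -tzf "${dir}.tar.gz"
}

# Example: bundle_dir ~/my_project/run42
```

Check the size of the directory first; large data files may be better shared by path than by archive.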

Records of previous Slurm jobs

If some jobs work and others don't, it can be handy to compare the resources they used. They may not have used the resources you expected!

For a single job numbered 1000667:

sacct --jobs 1000667 --format=User,JobID,Jobname%50,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

To output information on all your jobs, leave out the --jobs option:

sacct --format=User,JobID,Jobname%50,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
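
To compare a working job with a failing one side by side, sacct also accepts a comma-separated list of job IDs. The following is a sketch only: the helper name compare_jobs is hypothetical, and the shortened field list is a subset of the one used above.

```shell
# compare_jobs ID1 ID2: show key resource fields for two jobs (hypothetical helper)
compare_jobs() {
    # sacct is only available where the Slurm client tools are installed
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found; run this on a Slurm login node" >&2
        return 1
    fi
    sacct --jobs "$1,$2" --format=JobID,Jobname%50,state,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
}

# Example: compare_jobs 1000667 <other job ID>
```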