A number of users of Comet have noticed that certain jobs (mainly default_free and long_free, but also seen sporadically elsewhere) have been unexpectedly stopped, paused and rescheduled, even after running for many hours or several days.
This is not expected behaviour: we do not envisage job pre-emption based on priority levels for the vast majority of Comet. The only area where pre-emption is part of the design specification is the set of nodes that make up the low-latency partition. If these nodes are idle, they may take on extra load from the default_paid job queue to prevent them from being under-utilised. Pre-emption should not be in place anywhere else.
We are working with the HPC support vendor to understand why jobs outside of the low-latency partition are being stopped and rescheduled, as this is clearly a waste of compute time for those affected jobs. Once the cause is identified and a solution designed we will update you on the timeline to get this resolved.
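While this is being investigated, a long-running job script can at least detect that it has been restarted and pick up from a checkpoint rather than starting from scratch. The following is a minimal sketch only. It assumes the scheduler is Slurm, which sets SLURM_RESTART_COUNT on requeued jobs, and that your application writes its own checkpoint file. The function names and the checkpoint path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide whether to resume from a checkpoint after a requeue.
# Assumption: Slurm sets SLURM_RESTART_COUNT when a job has been restarted.

# Succeeds (exit status 0) if this job has been restarted at least once.
was_requeued() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

# Prints "resume" if the job was requeued and the checkpoint file exists,
# otherwise prints "fresh".
choose_start_mode() {
    local checkpoint="$1"
    if was_requeued && [ -f "$checkpoint" ]; then
        echo "resume"
    else
        echo "fresh"
    fi
}

# Example use in a job script (my_app and its options are hypothetical):
#   mode="$(choose_start_mode "$HOME/run1/state.ckpt")"
#   if [ "$mode" = "resume" ]; then my_app --restart; else my_app; fi
```

Whether resuming is possible at all depends on the application. Many codes, GROMACS included, have their own checkpoint and restart options, which should be preferred where available.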
The following new software has been added by our HPC support vendor:
module load GROMACS/2026-cuda
module load GROMACS/2026-opencl
module load Miniforge
module load UDUNITS
module load GDAL
module load LAPACK
The list of all software requests can be found on the software page for Comet.
Now that Comet is coming into heavy usage, some new issues have emerged from the end of last week and into this week.
Apologies to those of you who have experienced these, and thanks to you all for your patience and for continuing to let us know about problems as they occur (email hpc.researchcomputing@newcastle.ac.uk or log a ticket at NUService).
Currently, all nodes have both /tmp and /scratch directories on the node's internal fast NVMe drive. /scratch is a very large partition intended for temporary working files. Unfortunately, many applications have been attempting to use the much smaller /tmp directory instead.
We have seen the /tmp directory on compute nodes filling up, sometimes leading to job failures with error messages about failing to create temporary files, as well as more obscure error messages.
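If you suspect a job has hit this problem, checking how full /tmp is on the node can help confirm it. The snippet below is a minimal sketch; the function name usage_percent is illustrative, and it relies only on the POSIX df -P output format.

```shell
#!/usr/bin/env bash
# Sketch: report how full the filesystem holding a directory is.

# Prints the "Capacity" column of `df -P` for the given path, without the
# trailing percent sign (for example: 87).
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example use inside a job script:
#   echo "/tmp is $(usage_percent /tmp)% full on $(hostname)"
#   echo "/scratch is $(usage_percent /scratch)% full on $(hostname)"
```

Including the hostname in the output makes it easier to tell us which node was affected when you report a problem.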
Working with our supplier, OCF, we have asked for:
TMPDIR
These changes should not affect any jobs: running jobs are allowed to complete, but new jobs are not sent to nodes in 'drain'. However, it will take some days to complete this change on all nodes.
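Until the change has reached every node, a job script can point temporary files at /scratch itself. The following is a sketch under stated assumptions: that /scratch is writable by your job, and that a per-user, per-job subdirectory is an acceptable layout. The function name setup_tmpdir and the directory naming are illustrative, not a description of what the supplier is configuring.

```shell
#!/usr/bin/env bash
# Sketch: use a per-job directory under /scratch for temporary files.
# Assumptions: /scratch is writable; the directory layout is our own choice.

# Creates <base>/<user>/job_<id> and exports TMPDIR to point at it.
# <base> defaults to /scratch. Outside a scheduler, the shell PID is used
# in place of the job ID.
setup_tmpdir() {
    local base="${1:-/scratch}"
    local dir="${base}/${USER:-$(id -un)}/job_${SLURM_JOB_ID:-$$}"
    if mkdir -p "$dir" 2>/dev/null; then
        export TMPDIR="$dir"
    else
        echo "warning: cannot create $dir, leaving TMPDIR unchanged" >&2
        return 1
    fi
}

# Example use near the top of a job script:
#   setup_tmpdir /scratch
#   trap 'rm -rf "$TMPDIR"' EXIT   # tidy up when the job script exits
```

Most applications that honour TMPDIR will then write their temporary files under /scratch rather than /tmp. Some applications have their own setting for the temporary directory, in which case that setting needs changing instead.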
Various issues have been reported with Open OnDemand VNC desktop, RStudio and MATLAB sessions. Most commonly, sessions have failed to start and have instead jumped immediately to 'completed'.
The issues have been tracked down to node compute030, which is the first node 'in line' for free sessions in Open OnDemand.
compute030 has been put into 'drain' so that once running jobs have completed it can be rebuilt.
In the meantime, other nodes are now picking up new Open OnDemand sessions and we've had no further reports of issues.
Please do email us at hpc.researchcomputing@newcastle.ac.uk if you notice problems on Comet, even things like missing libraries, which you might have dealt with by local installs on Rocket. We can't promise to fix everything centrally but we do aim to have Comet's core software operating properly.