This page is intended to act as a timeline of events for the Comet HPC project, as well as major changes in functionality or policies relating to the system.
The vendor has restored the Comet login service (e.g. ssh comet.hpc.ncl.ac.uk). Unfortunately, it appears that the SSH host keys for one of the login servers have been lost and regenerated, so that server's key fingerprints have changed.
If you attempt to log in to Comet now you will see a warning from your SSH client looking like this:
$ ssh comet.hpc.ncl.ac.uk
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:ABCDEFGHIJKLMNOPQRTU12345678.
Please contact your system administrator.
$
This is expected, since the host key fingerprints of that server have now changed. To resolve this one-time issue, run the following command on your Linux or macOS device:
$ ssh-keygen -f $HOME/.ssh/known_hosts -R comet.hpc.ncl.ac.uk
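After removing the stale entry, simply reconnect and accept the new host key when prompted. If you would rather fetch the new key ahead of time, ssh-keyscan can append it for you; note that this trusts whatever key the server presents, exactly as accepting the first-connection prompt would:

$ ssh-keyscan -t ed25519 comet.hpc.ncl.ac.uk >> $HOME/.ssh/known_hosts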
If you are using an alternative SSH client on a Windows platform (e.g. PuTTY, MobaXterm or similar), the warning message from your software should indicate the equivalent step to remove the old host key.
An issue has been identified with the Comet login nodes: the usual access methods of ssh comet.hpc.ncl.ac.uk and ssh comet.ncl.ac.uk are not working.
Please use the direct connection alternatives, ssh cometlogin02.comet.hpc.ncl.ac.uk or ssh cometlogin01.comet.hpc.ncl.ac.uk, to bypass the faulty configuration temporarily. The incident has been raised with our HPC vendor, who are treating a fix as a priority this afternoon.
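While the main alias is out of action, an entry in your SSH client configuration saves retyping the long direct hostnames. A minimal sketch for ~/.ssh/config on Linux or macOS; the alias name comet-direct and the username are placeholders to adjust:

Host comet-direct
    HostName cometlogin01.comet.hpc.ncl.ac.uk
    User your_university_login

You can then connect with just ssh comet-direct.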
Following the unintended configuration of Slurm priorities and pre-emption rules, we have requested that our HPC vendor make changes to the operation of Comet so that pre-emption behaves as originally designed (see the background below).
In addition, we are taking the opportunity to improve the distribution of the available compute node resources. The following changes will be made to partitions:
Affected high-memory nodes: hmem001-004, hmem006-008, hmem009 and hmem010.
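Once the redistribution is complete, you can check the partition-to-node mapping yourself with Slurm's sinfo query tool (the format string below is just one convenient choice):

$ sinfo -o "%P %D %N"    # partition name, node count and node list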
The design intention of Comet was to put most of our compute resource into standard compute nodes (i.e. not low-latency), as historical data from operating Rocket indicated that most of our workloads fit that classification. However, we do have some users who need large-scale parallel jobs, and that is what the low-latency_paid partition is for. Since low-latency jobs do not run most of the time, we wanted the ability to use that resource when it is not needed for its original purpose.
The job pre-emption rules allow for this, and the specification as set for Comet stated:
Spare capacity in the low-latency_paid partition can be used by default_paid jobs to prevent it sitting idle and allow for 'burst' expansion of the default_paid partition… but this capacity must be evacuated and priority given over if a job is submitted to the low-latency_paid partition which would require them. Jobs submitted to short_paid and long_paid are not subject to this configuration, neither are jobs submitted to any of the free partitions.
This does mean that if the default_paid partition is at capacity, your job may run on extra capacity provided by low-latency_paid, but it then risks being stopped and rescheduled if a real low-latency job is submitted. You should always consider adding checkpointing to your jobs so that they can resume after a service interruption.
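If your application can checkpoint on request, you can ask Slurm to warn your job shortly before pre-emption. A minimal sketch of such a job script, assuming the application checkpoints on a signal; the partition choice, timings and program name (my_simulation) are illustrative:

#!/bin/bash
#SBATCH --partition=default_paid
#SBATCH --requeue                 # ask Slurm to requeue the job rather than just kill it
#SBATCH --signal=B:USR1@120       # send USR1 to this batch script ~120s before the job is stopped

# Hypothetical handler: tell the application to write a checkpoint when warned.
checkpoint() {
    echo "Pre-emption warning received, requesting checkpoint..."
    kill -USR1 "$APP_PID"         # many codes checkpoint on a signal; check your application's docs
}
trap checkpoint USR1

./my_simulation &                 # run in the background so the trap can fire while it works
APP_PID=$!
wait "$APP_PID"                   # a trapped signal makes wait return early...
wait "$APP_PID"                   # ...so wait once more for the application itself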
The work to make the changes outlined above will be carried out on the morning of Wednesday 11th February, at 9:00am. A resource reservation window has been put in place to prevent any jobs from running or starting during this work. We expect the change itself to take only a few minutes, but it involves a restart/reload of the Slurm service on each node, and we do not want to risk causing faults in running jobs at that time.
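You can confirm the maintenance window from a login node with Slurm's scontrol query tool; note that jobs whose requested time limit would overlap the reservation will be held pending until the window has passed:

$ scontrol show reservation    # lists reservation names, start/end times and the nodes covered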
A number of users of Comet have noticed that certain jobs (mainly default_free and long_free, but also seen sporadically elsewhere) have been unexpectedly stopped, paused and rescheduled, even after running for many hours or several days.
This is not expected behaviour: job pre-emption based on priority levels is not part of the design for the vast majority of Comet. The only area where it features in the design specification is the set of nodes making up the low-latency partition; if these are idle, they may take on extra load from the default_paid job queue to prevent them from being under-utilised. Pre-emption should not be in place anywhere else.
We are working with the HPC support vendor to understand why jobs outside of the low-latency partition are being stopped and rescheduled, as this is clearly a waste of compute time for those affected jobs. Once the cause is identified and a solution designed we will update you on the timeline to get this resolved.
The following new software has been added by our HPC support vendor:
module load GROMACS/2026-cuda
module load GROMACS/2026-opencl
module load Miniforge
module load UDUNITS
module load GDAL
module load LAPACK
The list of all software requests can be found on the software page for Comet.
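As a usage sketch, the standard module commands let you check what is installed before loading (the version strings come from the list above; gmx is the GROMACS launcher):

$ module avail GROMACS            # list the centrally installed GROMACS builds
$ module load GROMACS/2026-cuda   # select the CUDA-enabled build
$ gmx --version                   # confirm which build is now on your PATH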
Now that Comet is coming into heavy use, some new issues have emerged over the end of last week and into this week.
Apologies to those of you who have experienced these, and thanks to you all for your patience and for continuing to let us know about problems as they occur (email hpc.researchcomputing@newcastle.ac.uk or log a ticket at NUService).
Currently, all nodes have both /tmp and /scratch directories on the node-internal fast NVMe drive. /scratch is a very large partition intended for temporary working files. Unfortunately, many applications have been attempting to use the much smaller /tmp directory.
We have seen the /tmp directory on compute nodes filling up, sometimes causing jobs to fail with error messages about being unable to create temporary files, as well as with more obscure errors.
Working with our supplier, OCF, we have asked for the TMPDIR environment variable to be pointed at the large /scratch area by default, so that applications which honour TMPDIR write their temporary files there instead of filling /tmp.
These changes should not affect any jobs: running jobs are allowed to complete, but new jobs are not sent to nodes in 'drain'. However, it will take some days to complete this change on all nodes.
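Until the rollout is complete, you can point temporary files at /scratch yourself from within a job script. A minimal sketch; the per-job directory layout under /scratch and the application name are assumptions to adapt:

#!/bin/bash
#SBATCH --partition=default_free   # illustrative partition choice

# Create a per-job scratch directory and send temporary files there.
export TMPDIR="/scratch/${USER}/${SLURM_JOB_ID}"
mkdir -p "$TMPDIR"

./my_application                   # hypothetical program; most software honours TMPDIR

rm -rf "$TMPDIR"                   # tidy the node-local drive when finished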
Various issues have been reported with Open OnDemand VNC desktop, RStudio and MATLAB sessions. Most commonly, sessions have failed to start, jumping immediately to 'completed'.
The issues have been tracked down to node compute030, which is the first node 'in line' for free sessions in Open OnDemand.
compute030 has been put into 'drain' so that once running jobs have completed it can be rebuilt.
In the meantime, other nodes are now picking up new Open OnDemand sessions and we've had no further reports of issues.
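If you suspect a particular node, you can check its state yourself with sinfo (the format string is one convenient choice; a node being rebuilt will show a drain-related state):

$ sinfo -n compute030 -o "%N %T"    # print the node name and its current state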
Please do email us at hpc.researchcomputing@newcastle.ac.uk if you notice problems on Comet, even things like missing libraries, which you might have dealt with by local installs on Rocket. We can't promise to fix everything centrally but we do aim to have Comet's core software operating properly.