====== HPC Service Status ======
This page is intended to act as a timeline of events for the Comet HPC project, as well as major changes in functionality or policies relating to the system.
----
===== (10th) February 2026 - Lustre write issues =====
A number of users have reported strange write issues on Lustre (''/nobackup''). These appear to manifest as text editors being unable to write files to the Lustre filesystem, while tools such as ''echo'', ''cat'' and standard shell redirection (''>'' and ''>>'') are seemingly unaffected.
Kernel messages (e.g. from ''dmesg'') on affected nodes show a number of Lustre warnings and errors that have occurred this afternoon.
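If you are logged in to an affected node and want to see whether it is reporting these, a quick check is the following (a minimal sketch; note that reading the kernel log may require privileges that ordinary users do not have on some systems):
<code bash>
# show recent Lustre-related kernel messages with human-readable timestamps
dmesg -T | grep -i lustre | tail -n 20
</code>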
An incident has been raised with our vendor to assess the situation and provide a fix.
**15:18 - Update**
Our support vendor indicates that extreme load levels on one of the Lustre storage appliances may have been the cause of this incident. The affected services have been restarted and nodes appear to be reading/writing the ''/nobackup'' filesystem normally again. We will be following up with our vendor to get a more detailed explanation of the high load scenario and to determine how to prevent this from happening again.
----
===== (10th) February 2026 - Orca now installed =====
Orca (https://www.faccts.de/orca/) is now installed on Comet and can be loaded as follows:
''module load Orca''
Please note that the module name is case-sensitive, with only the first character upper-case: ''ORCA'' or ''orca'' will not load it.
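For example (a minimal sketch; the ''module avail'' output will depend on which versions are actually installed):
<code bash>
# the module name is case-sensitive: 'Orca', not 'ORCA' or 'orca'
module load Orca

# list the Orca versions available on Comet
module avail Orca
</code>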
----
===== (4th) February 2026 - Comet login service restored =====
The vendor has restored the Comet login service (e.g. ''ssh comet.hpc.ncl.ac.uk''). Unfortunately, it appears that the SSH host keys for one of the login servers have been lost, so its fingerprints have changed.
If you attempt to log in to Comet now you will see a __warning__ from your SSH client that looks like this:
<code>
$ ssh comet.hpc.ncl.ac.uk
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:ABCDEFGHIJKLMNOPQRTU12345678.
Please contact your system administrator.
$
</code>
This is expected, since that server's original fingerprints have now changed. To resolve this **one-time issue**, run the following command on your **Linux** or **macOS** device:
<code>
$ ssh-keygen -f $HOME/.ssh/known_hosts -R comet.hpc.ncl.ac.uk
</code>
If you are using an alternative SSH client on a **Windows** platform, the error message from your software (e.g. PuTTY, MobaXterm or similar) //should// indicate the //equivalent// command to run.
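If you would like to double-check the key you are being offered before accepting it, one option is to fetch the current host key and print its fingerprint (a minimal sketch, assuming ''ssh-keyscan'' and ''ssh-keygen'' are installed; note that this page does not list the official fingerprints):
<code bash>
# fetch the current ED25519 host key and print its SHA256 fingerprint
ssh-keyscan -t ed25519 comet.hpc.ncl.ac.uk 2>/dev/null | ssh-keygen -lf -
</code>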
----
===== (4th) February 2026 - Comet login nodes =====
An issue has been identified with the Comet login nodes. The usual access method of ''ssh comet.hpc.ncl.ac.uk'' or ''ssh comet.ncl.ac.uk'' is not working.
Please use the direct connection alternatives ''ssh cometlogin02.comet.hpc.ncl.ac.uk'' or ''ssh cometlogin01.comet.hpc.ncl.ac.uk'' to //temporarily// bypass the faulty configuration. The incident has been raised with our HPC vendor, who are treating a fix as a priority this afternoon.
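If you connect frequently, a temporary alias in your SSH configuration can save some typing; a minimal sketch for **Linux**/**macOS** (the ''Host'' alias and ''User'' value are placeholders, substitute your own username):
<code>
# ~/.ssh/config - temporary entry while the usual address is unavailable
Host comet-direct
    HostName cometlogin01.comet.hpc.ncl.ac.uk
    User your_username
</code>
After adding this, ''ssh comet-direct'' will connect you to that login node.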
----
===== (4th) February 2026 - Planned Changes to Slurm Configuration =====
Following the unintended configuration of Slurm priorities and pre-emption rules, we have requested that our HPC vendor make the following changes to the operation of Comet:
* Priority levels will be removed from all partitions - //excluding// **default_paid** and **low-latency_paid**
* Job pre-emption and rescheduling will be disabled on all partitions - //excluding// **default_paid** and **low-latency_paid**
* Sharing / over-subscription of compute node resources will be disabled on all partitions
In addition, we are taking the opportunity to better distribute the available compute node resources across partitions. The following changes will be made:
* Two further compute nodes (''hmem009'' and ''hmem010'') will be added to the **short_free**, **long_free** and **interactive-std_free** partitions, giving a total of **9** compute nodes / **2304** cores across __all__ free partitions.
* Seven further compute nodes (''hmem001-004'' and ''hmem006-008'') will be added to the **short_paid**, **long_paid** and **interactive-std_paid** partitions, giving a total of **39** compute nodes / **9984** cores across all paid partitions, plus a further **4** nodes accessed from the **low-latency_paid** partition //if they are idle//, for a total combined core count of **11008**.
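Once the changes are in place, you can check the partition layout yourself with ''sinfo''; a minimal sketch (the output columns are partition name, node count and CPUs per node):
<code bash>
# list each partition with its node count and per-node CPU count
sinfo -o "%P %D %c"
</code>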
The design intention for Comet was to put most of our compute resource into standard compute nodes (i.e. //not// low-latency), as historical data from operating Rocket indicated that most of our workloads fit that classification. However, we do have some users who need large-scale parallel jobs, and that is what the **low-latency_paid** partition is for. Since low-latency jobs are not running most of the time, we want the ability to use that resource when it is not needed for its original purpose.
The job pre-emption rules allow for this, and the specification as set for Comet stated:
> Spare capacity in the low-latency_paid partition can be used by //default_paid// jobs to prevent it sitting idle and allow for 'burst' expansion of the default_paid partition... but this capacity //must// be evacuated and priority given over if a job is submitted to the //low-latency_paid// partition which would require them.
> Jobs submitted to short_paid and long_paid are not subject to this configuration, neither are jobs submitted to any of the free partitions.
This does mean that if the **default_paid** partition is at capacity, your job //may// run on extra capacity provided by **low-latency_paid**, but it is then at risk of being stopped/rescheduled //if// a real low-latency job is submitted. You should always consider adding [[:advanced:slurm_checkpoints|checkpointing]] to your jobs so they can resume after a service interruption.
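As a sketch of what that can look like in a job script (''my_app'', ''save_checkpoint'', ''checkpoint.dat'' and the resource requests are placeholders, and the advance warning only helps if the partition's grace time allows it):
<code bash>
#!/bin/bash
#SBATCH --partition=default_paid
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=24:00:00
#SBATCH --requeue              # allow Slurm to requeue this job if it is pre-empted
#SBATCH --signal=B:USR1@300    # signal the batch shell 5 minutes before the job is stopped

# write a checkpoint when Slurm warns that the job is about to be stopped
trap 'save_checkpoint; exit 0' USR1

# run the application in the background so the shell can handle the signal,
# restarting from the latest checkpoint if one exists
if [ -f checkpoint.dat ]; then
    ./my_app --restart checkpoint.dat &
else
    ./my_app &
fi
wait
</code>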
The work to make the changes outlined above will be carried out on the morning of **Wednesday 11th of February**, starting at **9:00am**. A resource reservation window has been put in place to prevent any jobs from running or starting during this work. We expect the change itself to take only a few minutes, but it involves a restart/reload of the Slurm service on each node, so we do not want to risk causing faults in running jobs at that time.
----
===== (29th) January 2026 - Ongoing issues with job rescheduling and pre-emption =====
A number of Comet users have noticed that certain jobs (mainly in **default_free** and **long_free**, but also seen sporadically elsewhere) have been unexpectedly stopped, paused or rescheduled, even after running for many hours or several days.
This is **not** expected behaviour: we do not intend job pre-emption based on priority levels across the vast majority of Comet. The only place where this is part of the design specification is the set of nodes that make up the **low-latency** partition. If those nodes are idle, they **may** take on extra load from the **default_paid** job queue to prevent them being under-utilised. Pre-emption should not be in place //anywhere else//.
We are working with the HPC support vendor to understand why jobs outside of the low-latency partition are being stopped and rescheduled, as this is clearly a waste of compute time for those affected jobs.
Once the cause is identified and a solution designed we will update you on the timeline to get this resolved.
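In the meantime, if you suspect one of your own jobs was affected, ''sacct'' can show its state history; a minimal sketch (replace the job ID with your own; a pre-empted job will typically show a state such as ''PREEMPTED'' or ''REQUEUED''):
<code bash>
# show the state, run time and nodes for a specific job
sacct -j 1234567 --format=JobID,JobName,Partition,State,Elapsed,NodeList
</code>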
----
===== (29th) January 2026 - New software added =====
The following new software has been added by our HPC support vendor:
* System packages (these do __not__ need to be loaded via ''module''): screen, tmux, emacs, ImageMagick, bc
* New modules
* **GROMACS** (NVIDIA CUDA / OpenCL), molecular dynamics package: load with ''module load GROMACS/2026-cuda'' or ''module load GROMACS/2026-opencl''
* **Miniforge**, conda tool configured to use the conda-forge software channels: load with ''module load Miniforge''
* **UDUNITS** (libudunits), units-of-measure library: load with ''module load UDUNITS''
* **GDAL**, geospatial data library: load with ''module load GDAL''
* **LAPACK**, linear algebra library (built on BLAS): load with ''module load LAPACK''
* Open requests:
* Gaussian / Gauss View
* CASTEP
* Hybre
* Stata
* VS Code
The list of all software requests can be found on the [[advanced:software_list|software page for Comet]].
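As a usage sketch for the new **Miniforge** module (the environment name and package list below are placeholders, not recommendations):
<code bash>
# load the Miniforge module, then create and activate a conda-forge environment
module load Miniforge
conda create -n myenv python=3.12 numpy
conda activate myenv   # if this fails, follow the 'conda init' instructions that conda prints
</code>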
----
===== (22nd) January 2026 - Issues with Comet this week =====
Now that Comet is coming into heavy usage, some new issues have emerged from the end of last week and into this week.
Apologies to those of you who have experienced these issues, and thank you all for your patience and for continuing to let us know about problems as they occur (see https://hpc.researchcomputing.ncl.ac.uk/, email hpc.researchcomputing@newcastle.ac.uk, or log a ticket in NUService).
==== /tmp space on nodes ====
Currently, all nodes have both ''/tmp'' and ''/scratch'' directories (on the node's internal fast NVMe drive). ''/scratch'' is a very large partition intended for temporary working files. Unfortunately, many applications have been attempting to use the much smaller ''/tmp'' directory instead.
We have seen the ''/tmp'' directory on compute nodes filling up, sometimes leading to job failures with error messages about being unable to create temporary files, as well as other, more obscure errors.
=== What's being done? ===
Working with our supplier, OCF, we have asked them to:
* Set ''TMPDIR'' to point to the ''/scratch'' partition, so that any well-behaved application/library writes to that location instead. This has been completed.
* Replace the ''/tmp'' directory with a symlink to ''/scratch''. This work must be done on each node individually: taking it gracefully out of service (drain), making the change and re-instating the node.
These changes should not affect any jobs: running jobs are allowed to complete, and new jobs are not sent to nodes in 'drain'. However, it will take some days to complete this change on all nodes.
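Until that work is complete, you can point applications at ''/scratch'' explicitly from within a job script; a minimal sketch (the directory layout under ''/scratch'' is an assumption, adjust it to whatever exists on the node):
<code bash>
# create a per-job temporary directory on the node-local NVMe drive
export TMPDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$TMPDIR"

# ... run your application here ...

# clean up the temporary directory when the job finishes
rm -rf "$TMPDIR"
</code>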
==== Issues with Open OnDemand sessions ====
Various issues have been reported with Open OnDemand VNC desktop, RStudio and MATLAB sessions. Most commonly, sessions have failed to start and have instead jumped immediately to 'completed'.
The issues have been tracked down to node ''compute030'', which is the first node 'in line' for free sessions in Open OnDemand.
=== What's being done? ===
''compute030'' has been put into 'drain' so that once running jobs have completed it can be rebuilt.
In the meantime, other nodes are now picking up new Open OnDemand sessions and we've had no further reports of issues.
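If you want to confirm the state of a particular node yourself, ''sinfo'' can report it; a minimal sketch for ''compute030'' (the output columns are node name, state and the reason it was set):
<code bash>
# show the node name, its state (e.g. drain) and the reason recorded for that state
sinfo -n compute030 -o "%n %t %E"
</code>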
Please do [[mailto:hpc.researchcomputing@newcastle.ac.uk|email us]] at hpc.researchcomputing@newcastle.ac.uk if you notice problems on Comet, even things like missing libraries, which you might have dealt with by local installs on Rocket. We can't promise to fix everything centrally but we do aim to have Comet's core software operating properly.
----
==== Previous Updates ====
* [[:status:index_2025|2025]]
----
[[:wiki:index|Back to HPC Documentation Home]]