HPC Service Updates - 2026 - February


(27th) February 2026 - New Bioapps Container Image & Container Helper Scripts

We have published a new Bioapps container image to Comet.

This container provides AMD EPYC-optimised binaries of most of the bioinformatics software modules which either (a) are installed on Rocket or Comet, or (b) have been requested for installation. This includes:

If this method of providing such an extensive set of software works well, we intend to keep the container updated and move away from the complicated web of dependencies, modules and runtimes needed to install these packages independently on Comet.

Please consult the Bioapps documentation page and let us know how things work - conflicts between module dependencies when using several of these packages in the same script should be substantially reduced with this method.

In addition, we've provided a simple container helper script for CASTEP and DNAscent - source a single shell script and you gain a new command, container.run, which runs applications inside the Apptainer container complete with all necessary filesystem mounts. We hope you agree that running:

$ source /nobackup/shared/containers/container.sh
$ container.run app_name

… is much easier than …

$ module load apptainer
$ apptainer exec --bind /scratch:/scratch --bind /nobackup:/nobackup /nobackup/shared/containers/container.sif app_name
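For reference, the core of such a helper can be sketched as a shell function. This is only an illustration of the pattern - the installed container.sh may differ, the function is named container_run here because POSIX shells do not allow dots in function names, and the paths are copied from the example above:

```shell
# Wrap apptainer exec with the standard bind mounts so callers only
# supply the application name. The module call is a no-op off-cluster.
container_run() {
    module load apptainer 2>/dev/null || true
    apptainer exec \
        --bind /scratch:/scratch \
        --bind /nobackup:/nobackup \
        /nobackup/shared/containers/container.sif "$@"
}
```

Once defined (or sourced from the real helper script), `container_run app_name` replaces the full apptainer invocation.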


(26th) February 2026 - Login node 2

The second login node (cometlogin02) has had a number of load/SSH issues over the past 24 hours. Our HPC vendor is currently rebuilding/reconfiguring the node.

If you are allocated cometlogin02 whilst this work is under way, you can get back to the working (albeit slower than usual) cometlogin01 node by simply exiting the SSH session and connecting again; connections alternate between the two nodes, so on average every second connection will land on cometlogin01.

As described on our Connecting - Onsite page, you may also choose to connect directly to the first login node.

Please bear with us while the second node is reinstated and normal performance is returned.


(25th) February 2026 - CASTEP

The software package CASTEP has now been installed on Comet. It has been built as a container image with AMD EPYC-optimised builds of LAPACK, OpenBLAS, FFTW3 and more. Both serial and parallel/MPI CASTEP binaries are included.

Please consult our CASTEP guide to get started.


(25th) February 2026 - Avogadro

The Avogadro 2 software module (module load Avogadro) now works on non-GPU Open OnDemand desktop sessions as well as GPU sessions. A missing dependency on non-GPU desktops was previously preventing it from starting.

You may still prefer a GPU desktop depending on the complexity of your data and your performance needs (Avogadro uses OpenGL to render/visualise output).


(24th) February 2026 - Apptainer Overlay Filesystems

The Apptainer guide has been updated to include an example of how to use overlay images to turn read-only container images into writable ones.

This may be useful to users of Comet who use one of the (growing) number of Apptainer images provided at /nobackup/shared/containers but would still like to add their own scripts and enhancements without rebuilding the entire base image.

This is based on the guidance available at: https://apptainer.org/docs/user/main/persistent_overlays.html
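Following that guidance, the basic workflow looks like this - the overlay filename, size (in MiB) and the file touched inside the container are illustrative, so check the linked page for the options supported by your Apptainer version:

```shell
$ module load apptainer
$ apptainer overlay create --size 1024 /nobackup/$USER/my_overlay.img
$ apptainer exec --overlay /nobackup/$USER/my_overlay.img \
      /nobackup/shared/containers/container.sif touch /opt/my_script.sh
```

Changes written while the overlay is attached persist in the overlay image, leaving the base .sif untouched.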


(24th) February 2026 - Updated pricing for paid resources

Please note that in line with University policy, rates for paid resource use have now been updated for the 2026 financial year.

Updated figures can be viewed using the interactive cost calculator tool. The new pricing will be applied to invoices raised from the end of February onwards.


(19th) February 2026 - New software modules

Our HPC vendor has now added these new modules:


(19th) February 2026 - Further Lustre Analysis

Having demonstrated that we are able to reach the limits of the network bandwidth between the Comet login node(s) and RDW, we have moved on to measuring performance between the login node and Lustre - thereby removing the external network connectivity from the equation.

Here are the results for the tests transferring the same data from the /scratch area on local NVMe drives to Lustre:

It's easy to see that we are getting the same performance characteristics as the RDW to Lustre results published yesterday. Here they are, side-by-side for comparison:

And one final set of data - Lustre to Lustre operations. This excludes the performance characteristics of all local filesystems and is purely focused on the speed of the Lustre client reading/writing to the Lustre service. No surprises, as this mirrors all other observations we have made so far:

(18th) February 2026 - Investigating RDW to Comet Speeds

A number of Comet users have mentioned slower-than-expected speeds downloading data from the main University Research Data Warehouse (RDW). On initial investigation, using rsync to transfer data from RDW to a project folder on Lustre (i.e. under /nobackup), the transfer rate was low but broadly equivalent (within tens of megabytes) to what was observed doing the same on Rocket.

However, on closer examination we have found some very surprising results.

RDW to Comet login node disk

First, here is a set of results for transferring a large file from RDW to the local NVMe scratch space (/scratch) on a login node. This will always be the fastest method of getting data on to Comet, as we are exercising only a single network transfer and using the local NVMe drives of the login servers. It demonstrates the capacity of the link between RDW and Comet and/or the rest of the campus network. RDW is mounted with the rsize=1048576,wsize=1048576 parameters to optimise for large transfers:

Some observations:

RDW to Comet NFS Homes

Second, if transferring data from RDW to your home directory on Comet (/mnt/nfs/home), we observe a very different level of performance. This involves two network filesystems - first reading from RDW and second writing to the NFSv4-mounted home directories. Both RDW and NFS Homes are mounted with the rsize=1048576,wsize=1048576 parameters to optimise for large transfers:

Observations:

RDW to Lustre

Lastly, the data for copying from RDW to Lustre, aka /nobackup. Again, RDW is mounted with rsize=1048576,wsize=1048576 for optimal large transfers.

Key points:

Testing Methods

The test file is ~5000 megabytes of random, non-compressible data, and was generated with: dd if=/dev/random of=/rdw/1/group/test_data bs=1M count=5000
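Throughput for any individual copy can be measured with dd itself, which reports an aggregate transfer rate on completion. The sketch below uses small temporary files as stand-ins for the real /rdw and /scratch paths:

```shell
# Create a small random source file (a stand-in for the real test data),
# then copy it with dd and keep the summary line dd prints on stderr,
# which includes the observed transfer rate.
SRC=$(mktemp)
DST=$(mktemp)
dd if=/dev/urandom of="$SRC" bs=1M count=8 2>/dev/null
dd if="$SRC" of="$DST" bs=1M 2>&1 | tail -n 1
```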

Additional Notes

Using strace to monitor the read/write calls of the cp command shows that, when copying from RDW to /scratch, the requested block size is 1MB. This matches the RDW NFS mount parameters:

$ strace -e read,write cp /rdw/03/rse-hpc/test_file /scratch/ 2>&1 | head -15
...
...
read(3, "", 4096)                       = 0
read(3, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 1048576) = 1048576
write(4, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 1048576) = 1048576
read(3, "y\241\372;\357\246}\36\235l\0207\23\334\204\217T\366%\343\326\211\n\361J\314{\177\300\306J\235"..., 1048576) = 1048576
write(4, "y\241\372;\357\246}\36\235l\0207\23\334\204\217T\366%\343\326\211\n\361J\314{\177\300\306J\235"..., 1048576) = 1048576
read(3, "*\17\n\330n=i\235\355\214\337\37\263h\25;\333\337\334Yq&y\30\216\3}v\220\371\6\36"..., 1048576) = 1048576
write(4, "*\17\n\330n=i\235\355\214\337\37\263h\25;\333\337\334Yq&y\30\216\3}v\220\371\6\36"..., 1048576) = 1048576

Running the same trace for a copy to Lustre shows cp reading and writing in 4MB blocks:

$ strace -e read,write cp /rdw/03/rse-hpc/test_file /nobackup/proj/comet_training/n1234/1 2>&1 | head -15
...
...
read(3, "", 4096)                       = 0
read(3, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 4194304) = 4194304
write(4, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 4194304) = 4194304
read(3, "S\210\25i\1Y\234*dy/\377\324&\332\277\7o/\31\251z\315e\\SEy\232\373d\317"..., 4194304) = 4194304
write(4, "S\210\25i\1Y\234*dy/\377\324&\332\277\7o/\31\251z\315e\\SEy\232\373d\317"..., 4194304) = 4194304
read(3, "\311\350\351\320S\226\372\321\27\266\360z.\30\201\257\317\374\356\271\365;\32\240\367(.\302z;\211\330"..., 4194304) = 4194304
write(4, "\311\350\351\320S\226\372\321\27\266\360z.\30\201\257\317\374\356\271\365;\32\240\367(.\302z;\211\330"..., 4194304) = 4194304

Points to note:

  • If you are transferring data from RDW to Lustre then use rsync if you have many files to transfer.
  • If you have single, large files to transfer from RDW to Lustre then consider the use of dd if=sourcefile of=/path/to/destination/outfile bs=128k
  • Lustre to NFS home directories is largely consistent, regardless of tools used.
  • Block sizes between RDW (which is NFS) and Lustre (which is also a network filesystem) appear to be having a larger-than-expected impact for direct, network filesystem to network filesystem copies.
  • The very low cp performance to Lustre will be raised with our HPC vendor for further investigation.
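The first two recommendations can be sketched as follows; this self-contained example uses temporary directories in place of the real RDW and /nobackup paths, so substitute your own project areas:

```shell
# Stand-ins for the RDW source and Lustre destination (hypothetical paths).
SRC=$(mktemp -d)
DST=$(mktemp -d)
mkdir -p "$SRC/run1"
dd if=/dev/urandom of="$SRC/run1/large_file" bs=64k count=16 2>/dev/null

# Many files: rsync -a copies whole trees and can resume interrupted transfers.
rsync -a "$SRC/" "$DST/"

# One large file: dd with an explicit 128k block size, as suggested above.
dd if="$SRC/run1/large_file" of="$DST/large_file.dd" bs=128k 2>/dev/null
```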

(12th) February 2026 - Resource Limits Applied

As per our news article from yesterday, we have now implemented resource caps across the two types of project:

A breakdown of the resource limits will be added to the HPC Resources & Partitions - Comet page so that you understand how these apply to your jobs.

Again, the intention is to enable fairer access to the resources of the Comet HPC facility for the widest possible range of staff and students. If you need to make a case for a different resource limit for your project, please get in touch.


(11th) February 2026 - Planned changes to resource limits

Now that the changes to the Slurm priorities and CPU over-subscription have been implemented, our next phase of development for Comet will focus on the implementation of resource limits for both free and paid partitions.

These changes are intended to allow a fairer distribution of resources across all users and projects. Currently, as on Rocket, there are very few limits on the number of simultaneous jobs or the resources they may consume. Going forwards we will introduce two distinct levels of resource caps:

As it stands today, the resource limits for all users of Comet are:

Resource                     Unfunded Projects   Funded Projects
MaxSubmitJobs                512                 512
MaxJobs                      256                 256
CPU                          Unlimited           Unlimited
Nodes                        Unlimited           Unlimited
GPU                          Unlimited           Unlimited
GPU RAM                      Unlimited           Unlimited
RAM                          Unlimited           Unlimited
Local Disk (e.g. /scratch)   Unlimited           Unlimited

In the first iteration of the new resource caps, we will be implementing the following limits:

Resource                     Unfunded Projects   Funded Projects
MaxSubmitJobs                128                 512
MaxJobs                      256                 1024
CPU                          512                 2048
Nodes                        Unlimited           Unlimited
GPU                          1                   8
GPU RAM                      Unlimited           Unlimited
RAM                          Unlimited           Unlimited
Local Disk                   Unlimited           Unlimited

These limits will be applied per project, and are intended to stop a small number of users from monopolising almost the entire resources of a given partition (either paid or free). We will notify all registered users via the HPC-Users distribution list once these resource limits are in place. In most cases the only impact on your jobs is that they may need to queue a little longer if you submit a large number at the same time; again, this gives more users and projects a better chance of accessing the resources concurrently.

Those projects contributing towards the operation of Comet are given the higher resource cap for the duration that their funding balance remains positive.


(11th) February 2026 - Slurm maintenance underway

As posted earlier, the planned maintenance to the Slurm configuration is now underway (starting at 9:15am). We expect this to be completed shortly and afterwards any jobs in the PENDING state will automatically be released.

This work should resolve the CPU over-subscription issues which have been encountered, as well as the rare case which has resulted in jobs being stopped and rescheduled due to priority levels.

10:13am - Update

The change has now been implemented and the maintenance reservation window will shortly be removed. Any pending jobs should automatically restart.

We will be monitoring resource allocation closely over the next few hours/days - if you spot any cases which you believe may stem from CPU over-subscription again, please do let us know. The same goes for any jobs in any partition other than low-latency and default_paid which get suspended or rescheduled; please let us know as a priority.


(10th) February 2026 - Lustre write issues

A number of users have reported strange write issues on Lustre (/nobackup). These appear to manifest as text editors being unable to write content to the Lustre filesystem, while tools like echo and cat and the standard shell redirections (> and >>) are seemingly unaffected.

Kernel messages (e.g. from dmesg) on affected nodes show a number of Lustre warnings and errors that have occurred this afternoon.

An incident has been raised with our vendor to assess the situation and provide a fix.

15:18 - Update

Our support vendor indicates that extreme load levels on one of the Lustre storage appliances may have been the cause of this incident. The affected services have been restarted and nodes appear to be reading/writing the /nobackup filesystem normally again. We will be following up with our vendor to get a more detailed explanation of the high load scenario and to determine how to prevent this from happening again.


(10th) February 2026 - Orca now installed

Orca (https://www.faccts.de/orca/) is now installed on Comet and can be loaded as follows:

module load Orca

Please note the capitalisation - module names are case-sensitive, so ORCA or orca will not load it.


(4th) February 2026 - Comet login service restored

The vendor has restored the Comet login service (e.g. ssh comet.hpc.ncl.ac.uk). Unfortunately, it appears that the SSH host keys for one of the login servers have been lost and replaced, so their fingerprints have changed.

If you attempt to log in to Comet now you will see a warning from your SSH client looking like this:

$ ssh comet.hpc.ncl.ac.uk
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:ABCDEFGHIJKLMNOPQRTU12345678.
Please contact your system administrator.
$

This is expected, since the host key of that server has changed. To resolve this one-time issue, run the following command on your Linux or macOS device:

$ ssh-keygen -f $HOME/.ssh/known_hosts -R comet.hpc.ncl.ac.uk

If you are using an alternative SSH client on a Windows platform, the error message from your software (e.g. PuTTY, MobaXterm or similar) should indicate the equivalent steps to take.
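If you want to confirm the stale entry has gone, ssh-keygen -F searches a known_hosts file for a given host. The following self-contained demonstration builds a throwaway known_hosts file, removes the entry, and checks it is no longer found; point -f at $HOME/.ssh/known_hosts for the real fix:

```shell
# Build a throwaway known_hosts entry for the Comet hostname.
KH=$(mktemp)
KEY=$(mktemp -u)
ssh-keygen -q -t ed25519 -N '' -f "$KEY"
printf 'comet.hpc.ncl.ac.uk %s\n' "$(cut -d' ' -f1,2 "$KEY.pub")" >> "$KH"

# Remove the stale entry, then confirm -F no longer finds it.
ssh-keygen -f "$KH" -R comet.hpc.ncl.ac.uk
ssh-keygen -f "$KH" -F comet.hpc.ncl.ac.uk || echo "stale entry removed"
```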


(4th) February 2026 - Comet login nodes

An issue has been identified with the Comet login nodes. The usual access method of ssh comet.hpc.ncl.ac.uk or ssh comet.ncl.ac.uk is not working.

Please use the direct connection alternatives ssh cometlogin02.comet.hpc.ncl.ac.uk or ssh cometlogin01.comet.hpc.ncl.ac.uk to temporarily bypass the faulty configuration. The incident has been raised with our HPC vendor, who are working on a fix as a priority this afternoon.


(4th) February 2026 - Planned Changes to Slurm Configuration

Following the unintended configuration of Slurm priorities and pre-emption rules we have requested that our HPC vendor make the following changes to the operation of Comet:

In addition, we are taking the opportunity to make a better distribution of the available compute node resources. The following changes will be made to partitions:

The design intention of Comet was to put most of our compute resource into standard compute nodes (i.e. not low-latency), as historical data from operating Rocket indicated most of our workloads fit that classification. However, some users do need to run large-scale parallel jobs, and that is what the low-latency_paid partition is for. Since low-latency jobs are not running most of the time, we wanted the ability to use that resource when it is not serving its original purpose.

The job pre-emption rules allow for this, and the specification as set for Comet stated:

Spare capacity in the low-latency_paid partition can be used by default_paid jobs to prevent it sitting idle and allow for 'burst' expansion of the default_paid partition… but this capacity must be evacuated and priority given over if a job is submitted to the low-latency_paid partition which would require them.
Jobs submitted to short_paid and long_paid are not subject to this configuration, neither are jobs submitted to any of the free partitions.

This does mean that if the default_paid partition is at capacity, your job may run on spare capacity provided by low-latency_paid, but it is then at risk of being stopped/rescheduled if a genuine low-latency job is submitted. You should always consider adding checkpointing to your jobs so they can resume after a service interruption.
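As a starting point, a requeue-friendly batch script can be sketched as below. The partition name comes from the text above, but the application, its flags and the checkpoint filename are hypothetical - whether a restart works depends entirely on your application supporting checkpoint files:

```shell
#!/bin/bash
#SBATCH --partition=default_paid
#SBATCH --requeue            # let Slurm requeue the job if it is pre-empted
#SBATCH --open-mode=append   # keep output from earlier attempts

# Hypothetical application: resume from its checkpoint file if one exists.
if [ -f checkpoint.dat ]; then
    ./my_app --restart checkpoint.dat
else
    ./my_app
fi
```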

The work to make the changes outlined above will be carried out on the morning of Wednesday 11th of February, at 9:00am. A resource reservation window has been put in place to prevent any jobs from running/starting during this piece of work. We expect the change to be implemented within a few minutes, but it involves a restart/reload of the Slurm service on each node, so we do not want to risk causing faults with running jobs at that time.


Return to HPC Service Updates & Project News