We have published a new Bioapps container image to Comet.
This container provides AMD Epyc-optimised binaries of most of the bioinformatics software modules which either (a) are installed on Rocket or Comet, or (b) have been requested. This includes:
If this method of providing such an extensive set of software works well, we intend to keep the container updated and move away from the complicated set of dependencies, modules and runtimes needed to install these as independent packages on Comet.
Please consult the Bioapps documentation page and let us know how you get on - the issues of conflicting module dependencies when using several of these packages in the same script should be substantially reduced with this approach.
In addition, we've provided a simple container helper script for CASTEP and DNAscent, allowing you to source a single shell script and then use a new command, container.run, to run applications inside the Apptainer container, complete with all necessary filesystem mounts. We hope you agree that running:
$ source /nobackup/shared/containers/container.sh
$ container.run app_name
… is much easier than …
$ module load apptainer
$ apptainer exec --bind /scratch:/scratch --bind /nobackup:/nobackup /nobackup/shared/containers/container.sif app_name
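For illustration, a helper like container.sh could be implemented as a small shell function along the following lines. This is a hypothetical sketch based only on the two commands shown above - the real script's contents are not reproduced here, and the DRY_RUN mechanism is purely illustrative:

```shell
#!/bin/sh
# Hypothetical sketch of what a helper like container.sh might define.
# CONTAINER_SIF and the DRY_RUN toggle are assumptions for illustration.
CONTAINER_SIF="/nobackup/shared/containers/container.sif"

container_run() {
    # Assemble the full apptainer invocation with the standard bind mounts.
    cmd="apptainer exec --bind /scratch:/scratch --bind /nobackup:/nobackup $CONTAINER_SIF $*"
    if [ -n "$DRY_RUN" ]; then
        # Print the command instead of executing it (no apptainer needed).
        echo "$cmd"
    else
        module load apptainer
        $cmd
    fi
}

# With DRY_RUN set, show the equivalent long-form command:
DRY_RUN=1
container_run app_name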
The second login node (cometlogin02) has had a number of load/ssh issues over the past 24 hours. Our HPC vendor is currently rebuilding/reconfiguring the node.
If you are allocated cometlogin02 whilst this work is under way, you can get back to the working (albeit slower than usual) cometlogin01 node by simply exiting the SSH session and connecting again; connections alternate between the two login nodes, so on average every second SSH connection should land on cometlogin01.
As per our Connecting - Onsite page, you may also choose to connect directly to the first login node, if you wish.
Please bear with us while the second node is reinstated and normal performance is returned.
The software package CASTEP has now been installed on Comet. This has been built as a container image with AMD Epyc architecture optimised builds of LAPACK, OpenBLAS, FFTW3 and more. Both serial and parallel/MPI CASTEP binaries are included.
Please consult our CASTEP guide to get started.
The Avogadro 2 software module (module load Avogadro) now works on non-GPU Open OnDemand desktop sessions as well as GPU ones. A missing dependency on non-GPU desktops was previously stopping it from starting.
You may still prefer a GPU desktop depending on the complexity of your data and your performance needs (Avogadro uses OpenGL to render/visualise output).
The Apptainer guide has been updated to include an example of how to use overlay images to turn read-only container images into read/write images.
This is possibly useful to users of Comet who use one of the (growing) number of Apptainer images provided at /nobackup/shared/containers, but would still like to add their own scripts/enhancements without needing to rebuild the entire base image.
This is based on the guidance available at: https://apptainer.org/docs/user/main/persistent_overlays.html
Please note that in line with University policy, rates for paid resource use have now been updated for the 2026 financial year.
Updated figures can be viewed using the interactive cost calculator tool. The new pricing will be applied to invoices raised from the end of February onwards.
Our HPC vendor has now added these new modules:
- module load LAPACK/3.12.0
- module load ChimeraX, and start with the command ChimeraX. Whilst you can run this on any desktop, rendering performance is likely to be too low on non-GPU nodes.
- module load Avogadro, and start with the command avogadro2. Currently this only runs on GPU nodes (it does not currently work via MESA / VirtualGL; this is being investigated).
- module load VSCode, and then start with the command code.
- module load Hypre.
- module load BOLT-LMM, and start with the command bolt.

Having demonstrated that we are able to reach the limits of the network bandwidth between the Comet login node(s) and RDW, we have moved on to measuring performance between the login node and Lustre - thereby removing the external network connectivity from the equation.
Here are the results for the tests transferring the same data from the /scratch area on local NVMe drives to Lustre:
It's easy to see that we are getting the same performance characteristics as the RDW to Lustre results published yesterday. Here they are, side-by-side for comparison:
- cp, cat and default dd options are the slowest - ranging from 48-94 Megabytes/second.
- Adding mbuffer to the cat or dd pipelines also shows an improvement in throughput.
- The best result comes from a dd block size of 128 Kilobytes, reaching a transfer rate of over 700 Megabytes/second.

And one final set of data - Lustre to Lustre operations. This excludes the performance characteristics of all local filesystems and is purely focused on the speed of the Lustre client reading/writing to the Lustre service. No surprises, as this mirrors all other observations we have made so far:
A number of users of Comet have mentioned slower-than-expected speeds downloading data from the main University Research Data Warehouse (RDW). On initial investigation, using rsync to transfer data from RDW to a project folder on Lustre (i.e. in /nobackup), the transfer rate was low, but broadly equivalent (within tens of Megabytes/second) to what was observed doing the same with Rocket.
However, on closer examination we have found some very surprising results.
First, here are a set of results for transferring a large file from RDW to the local NVMe scratch space (/scratch) on a login node. This will always be the fastest method of getting data on to Comet, as we are only exercising a single network transfer and using the local NVMe drives of the login servers. This demonstrates the capacity of the link between RDW and Comet and/or the rest of the Campus network. RDW is mounted with rsize=1048576,wsize=1048576 params to optimise for large transfers:
Some observations:
- Using the cp command, the speeds attained are around the 1.9 Gigabytes/second range; exactly what we would expect.
- If you use rsync, then your speeds typically drop to about 370 Megabytes/second. We would expect lower speeds from rsync than from less complex methods, but this does seem lower than expected.
- With the dd tool we see speeds steadily increasing as we use successively larger block sizes; transfer rates level off after increasing to 64 Kilobyte blocks.
- Adding a buffering tool (mbuffer) to smooth out reads and writes has no detectable benefit and remains no faster than rsync.
Second, if transferring data from RDW to your home directory on Comet (/mnt/nfs/home), then we observe a very different level of performance. This involves two network filesystems - first to read from RDW and second to write to the NFSv4 mounted home directories. Both RDW and NFS Homes are mounted with the rsize=1048576,wsize=1048576 params to optimise for large transfers:
Observations:
- Transfers are slower than to the local /scratch drives on the login nodes, as we are making an NFS read from RDW to get the data, and then making an NFS write to the Comet home directory server.
- The slowest methods are rsync (around 250 Megabytes/second) and dd with default block sizes (usually 512 bytes as standard, giving ~210 Megabytes/second).
- The fastest method is dd with a block size of 128 Kilobytes (netting a speed of 530 Megabytes/second), though gains are negligible beyond 32-64 Kilobytes.
- Using dd in conjunction with mbuffer offers no benefits.
Lastly, the data for copying from RDW to Lustre, aka /nobackup. Again RDW is mounted with rsize=1048576,wsize=1048576 for optimal large transfers.
Key points:
- Using cp and cat results in the worst possible speeds - at the lowest point this can be as bad as 40 Megabytes/second.
- Little better is dd with the default 512 Byte block size; this achieves no better than 100 Megabytes/second.
- Using rsync instead shows a consistent increase to around 290 Megabytes/second, making this a better option than cp, cat or dd with default options.
- Speeds improve again by piping cat through mbuffer. For example cat /rdw/01/group/file.dat | mbuffer > /nobackup/proj/group/file.dat. This increases speeds to around 450-500 Megabytes/second.
- Increasing the block size of dd yields substantial, incremental improvements in transfer speed, peaking at more than 850 Megabytes/second, which is the highest result we have observed when copying direct from RDW to Lustre. An example would be: dd if=/rdw/01/group/file.dat of=/nobackup/proj/group/file.dat bs=1M.
- Increasing the dd block size beyond 1M results in performance dropping sharply.
- Combining mbuffer with dd shows inconsistent performance and is not advised.
The test file is ~5000 Megabytes of random, non-compressible data, and was generated with: dd if=/dev/random of=/rdw/1/group/test_data bs=1M count=5000
The variables used in the tests were:

- $SOURCE, a path on RDW, e.g.: /rdw/1/group/test_file
- $DEST, a path on either local NVMe drives, NFS home, or Lustre: /scratch, /mnt/nfs/home/$USER or /nobackup/proj/PROJ_NAME
- $BS, either 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1m or 2m

The commands tested were:

- time cp $SOURCE $DEST
- time rsync --progress $SOURCE $DEST
- time cat $SOURCE > $DEST/test_file
- time cat $SOURCE | mbuffer -m 1g -p 10 > $DEST/test_file
- time dd if=$SOURCE of=$DEST/test_file
- time dd if=$SOURCE of=$DEST/test_file bs=$BS
- time dd if=$SOURCE bs=$BS | mbuffer -m 1g -p 10 > $DEST/test_file

The real time was used to record the duration of each copy, and transfer rates were calculated as file size in bytes / real time = bytes per second.
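The shape of the dd portion of these tests can be sketched as a simple loop. This stand-in uses small files under /tmp so it is self-contained - the real tests used RDW, NFS home and Lustre paths and a multi-gigabyte file, and a coarse seconds-resolution timer is used here purely for illustration:

```shell
#!/bin/sh
# Self-contained sketch of the dd block-size sweep; all paths/sizes are
# illustrative stand-ins for the real RDW/Lustre tests described above.
SOURCE=/tmp/bench_source
DEST=/tmp/bench_dest
mkdir -p "$DEST"

# Generate a small non-compressible stand-in file (the real one was ~5 GB).
dd if=/dev/urandom of="$SOURCE" bs=1M count=8 2>/dev/null
SIZE=$(wc -c < "$SOURCE")

for BS in 8k 64k 128k 1m; do
    START=$(date +%s)
    dd if="$SOURCE" of="$DEST/test_file" bs="$BS" 2>/dev/null
    END=$(date +%s)
    ELAPSED=$((END - START))
    [ "$ELAPSED" -eq 0 ] && ELAPSED=1   # avoid divide-by-zero on tiny files
    # file size / elapsed time = transfer rate
    echo "bs=$BS rate=$((SIZE / ELAPSED)) bytes/sec"
done
```

At realistic file sizes the per-block-size differences dominate; here the file is deliberately tiny so the loop completes instantly.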
Using strace to monitor the read/write calls of the cp command shows that when copying from RDW to /scratch the requested block size is 1MB. This matches the RDW NFS mount params:
$ strace -e read,write cp /rdw/03/rse-hpc/test_file /scratch/ 2>&1 | head -15
...
...
read(3, "", 4096) = 0
read(3, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 1048576) = 1048576
write(4, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 1048576) = 1048576
read(3, "y\241\372;\357\246}\36\235l\0207\23\334\204\217T\366%\343\326\211\n\361J\314{\177\300\306J\235"..., 1048576) = 1048576
write(4, "y\241\372;\357\246}\36\235l\0207\23\334\204\217T\366%\343\326\211\n\361J\314{\177\300\306J\235"..., 1048576) = 1048576
read(3, "*\17\n\330n=i\235\355\214\337\37\263h\25;\333\337\334Yq&y\30\216\3}v\220\371\6\36"..., 1048576) = 1048576
write(4, "*\17\n\330n=i\235\355\214\337\37\263h\25;\333\337\334Yq&y\30\216\3}v\220\371\6\36"..., 1048576) = 1048576
Doing the same to Lustre shows it trying to read and write in 4MB blocks:
$ strace -e read,write cp /rdw/03/rse-hpc/test_file /nobackup/proj/comet_training/n1234/1 2>&1 | head -15
...
...
read(3, "", 4096) = 0
read(3, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 4194304) = 4194304
write(4, ";\362C\233%\364\233Z\233\301}>\22+\t\357_l\302EL\360>\36\310\335.F+\204\244\235"..., 4194304) = 4194304
read(3, "S\210\25i\1Y\234*dy/\377\324&\332\277\7o/\31\251z\315e\\SEy\232\373d\317"..., 4194304) = 4194304
write(4, "S\210\25i\1Y\234*dy/\377\324&\332\277\7o/\31\251z\315e\\SEy\232\373d\317"..., 4194304) = 4194304
read(3, "\311\350\351\320S\226\372\321\27\266\360z.\30\201\257\317\374\356\271\365;\32\240\367(.\302z;\211\330"..., 4194304) = 4194304
write(4, "\311\350\351\320S\226\372\321\27\266\360z.\30\201\257\317\374\356\271\365;\32\240\367(.\302z;\211\330"..., 4194304) = 4194304
Points to note:
- Use rsync if you have many files to transfer.
- For single large files, use dd if=sourcefile of=/path/to/destination/outfile bs=128k
- The poor cp performance to Lustre will be raised with our HPC vendor for further investigation.

As per our news article from yesterday, we have now implemented resource caps across the two types of project:
A breakdown of the resource limits will be added to the HPC Resources & Partitions - Comet page so that you understand how these apply to your jobs.
Again, the intention is to enable fairer access to the resources of the Comet HPC facility for the widest possible range of staff and students. If you need to make a case for a different resource limit for your project, please get in touch.
Now that the changes to the Slurm priorities and CPU over-subscription have been implemented, our next phase of development for Comet will focus on the implementation of resource limits for both free and paid partitions.
These changes are intended to allow a fairer distribution of resources across all users and projects. Currently, as on Rocket, there are very few limits in place on the number of simultaneous resources or jobs. Going forwards we will introduce two distinct levels of resource caps:
- _free partitions_
- _paid partitions_

As it stands today, the resource limits for all users of Comet are:
| Resource | Unfunded Projects | Funded Projects |
|---|---|---|
| MaxSubmitJobs | 512 | 512 |
| MaxJobs | 256 | 256 |
| CPU | Unlimited | Unlimited |
| Nodes | Unlimited | Unlimited |
| GPU | Unlimited | Unlimited |
| GPU RAM | Unlimited | Unlimited |
| RAM | Unlimited | Unlimited |
| Local Disk (e.g. /scratch) | Unlimited | Unlimited |
In the first iteration of the new resource caps, we will be implementing the following limits:
| Resource | Unfunded Projects | Funded Projects |
|---|---|---|
| MaxSubmitJobs | 128 | 512 |
| MaxJobs | 256 | 1024 |
| CPU | 512 | 2048 |
| Nodes | Unlimited | Unlimited |
| GPU | 1 | 8 |
| GPU RAM | Unlimited | Unlimited |
| RAM | Unlimited | Unlimited |
| Local Disk | Unlimited | Unlimited |
These limits will be per project, and are intended to stop the situation where a small number of users are able to monopolise almost the entire resources of a given partition (either paid or free). We will notify all registered users via the HPC-Users distribution list once these resource limits are in place. In most cases the only impact on your jobs is that they may need to queue for a little longer if you are submitting a large number at the same time; again, this is to give more users and projects a better chance of accessing the resources.
Those projects contributing towards the operation of Comet are given the higher resource cap for the duration that their funding balance remains positive.
As posted earlier, the planned maintenance to the Slurm configuration is now underway (starting at 9:15am). We expect this to be completed shortly and afterwards any jobs in the PENDING state will automatically be released.
This work should resolve the CPU over-subscription issues which have been encountered, as well as the rare case which has resulted in jobs being stopped and rescheduled due to priority levels.
10:13am - Update
The change has now been implemented and the maintenance reservation window will shortly be removed. Any pending jobs should automatically restart.
We will be monitoring resource allocation closely over the next few hours/days - if you spot any cases which you believe may stem from CPU over-subscription again, please do let us know. The same goes for any jobs in any partition other than low-latency and default_paid which get suspended or rescheduled; please let us know as a priority.
A number of users have reported strange write issues on Lustre (/nobackup). These appear to manifest as text editors being unable to write content on the Lustre filesystem, while some tools such as echo, cat and standard shell redirection (> and >>) are seemingly unaffected.
Kernel messages (e.g. dmesg) on affected nodes show a number of Lustre warnings and errors that have occurred this afternoon.
An incident has been raised with our vendor to assess the situation and provide a fix.
15:18 - Update
Our support vendor indicates that extreme load levels on one of the Lustre storage appliances may have been the cause of this incident. The affected services have been restarted and nodes appear to be reading/writing the /nobackup filesystem normally again. We will be following up with our vendor to get a more detailed explanation of the high load scenario and to determine how to prevent this from happening again.
Orca (https://www.faccts.de/orca/) is now installed on Comet and can be loaded as follows:
module load Orca
Please note the specific upper-case first character of the name - ORCA or orca will not load it.
The vendor has restored the Comet login service (e.g. ssh comet.hpc.ncl.ac.uk). Unfortunately it appears that the SSH host key fingerprints for one of the login servers have been lost.
If you attempt to log in to Comet now you will see a warning from your SSH client looking like this:
$ ssh comet.hpc.ncl.ac.uk
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:ABCDEFGHIJKLMNOPQRTU12345678.
Please contact your system administrator.
$
This is expected, since the original fingerprints of that server have now changed. To resolve this one-time issue, run the following command on your Linux or macOS device:
$ ssh-keygen -f $HOME/.ssh/known_hosts -R comet.hpc.ncl.ac.uk
If you are using an alternative SSH client on a Windows platform, the error message from your software (e.g. PuTTY, Mobaxterm or similar) should indicate the equivalent command to run.
An issue has been identified with the Comet login nodes. The usual access method of ssh comet.hpc.ncl.ac.uk or ssh comet.ncl.ac.uk is not working.
Please use the direct connection alternatives: ssh cometlogin02.comet.hpc.ncl.ac.uk or ssh cometlogin01.comet.hpc.ncl.ac.uk to temporarily bypass the faulty configuration - the incident has been raised with our HPC vendor, who are working on a fix as a priority this afternoon.
Following the unintended configuration of Slurm priorities and pre-emption rules we have requested that our HPC vendor make the following changes to the operation of Comet:
In addition, we are taking the opportunity to make a better distribution of the available compute node resources. The following changes will be made to partitions:
- 2 additional 'hmem' compute nodes (hmem009 and hmem010) to be added to the short_free, long_free and interactive-std_free partitions, giving a total of 9 compute nodes / 2304 cores across all free partitions.
- 7 additional 'hmem' compute nodes (hmem001-004 and hmem006-008) to be added to the short_paid, long_paid and interactive-std_paid partitions, giving a total of 39 compute nodes / 9984 cores across all paid partitions, plus a further 4 nodes accessed from the low-latency_paid partition if they are idle, for a total combined core count of 11008.

The design intention of Comet was to put most of our compute resource into standard compute nodes (i.e. not low-latency), as the historical data from operating Rocket indicated most of our workloads fit into that classification. However we do have some users who need large scale parallel jobs, and that is what the low-latency_paid partition is for. Since we don't run low latency jobs most of the time, we wanted the ability to use that resource when it is not being used for its original purpose.
The job pre-emption rules allow for this, and the specification as set for Comet stated:
> Spare capacity in the low-latency_paid partition can be used by default_paid jobs to prevent it sitting idle and allow for 'burst' expansion of the default_paid partition… but this capacity must be evacuated and priority given over if a job is submitted to the low-latency_paid partition which would require them.
Jobs submitted to short_paid and long_paid are not subject to this configuration, neither are jobs submitted to any of the free partitions.
This does mean that if the default_paid partition is at capacity, your job may run on extra capacity provided by low-latency_paid, but it is in danger of being stopped/rescheduled if a real low-latency job is submitted. You should always consider adding checkpointing to your jobs to allow resuming from a service interruption.
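A checkpoint-and-resume job can be sketched as follows. This is a minimal illustration, not a template we publish: the partition name is taken from above, the --signal timing is an assumption, the "work" is a simple counter, and the checkpoint file path is a /tmp stand-in (a real job would checkpoint to shared storage such as /nobackup):

```shell
#!/bin/bash
#SBATCH --partition=default_paid   # partition where pre-emption can occur
#SBATCH --requeue                  # let Slurm requeue the job if pre-empted
#SBATCH --signal=B:TERM@60         # ask for SIGTERM ~60s before the job is stopped (assumed timing)

# Minimal checkpoint/resume sketch: the "work" is a counter, and the
# checkpoint file records the last completed step. Use shared storage
# (e.g. /nobackup) in a real job; /tmp is only for this illustration.
CKPT=/tmp/ckpt_demo_state

STEP=0
[ -f "$CKPT" ] && STEP=$(cat "$CKPT")   # resume from the last checkpoint

# On SIGTERM (sent when the job is about to be stopped), save progress and
# exit cleanly so the requeued job can pick up where it left off.
trap 'echo "$STEP" > "$CKPT"; exit 0' TERM

while [ "$STEP" -lt 5 ]; do
    STEP=$((STEP + 1))
    echo "$STEP" > "$CKPT"              # checkpoint after each unit of work
done
echo "finished at step $STEP"
```

The key points are the --requeue flag, trapping the termination signal, and writing state often enough that a rescheduled job loses little work.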
The work to make the changes outlined above will be carried out on the morning of Wednesday 11th of February, at 9:00am. A resource reservation window has been put in place to prevent any jobs from running/starting during this piece of work. We expect the change to be implemented within a few minutes, but it involves a restart/reload of the Slurm service on each node, so we do not want to risk causing faults with running jobs at that time.