This page is intended to act as a timeline of events for the Comet HPC project, as well as major changes in functionality or policies relating to the system.
Starting March 2026, we also publish a monthly summary newsletter to the HPC-Users email distribution list (to which all users of Comet are subscribed):
Following our previous update, we have completed final user testing to verify that all remaining issues have been resolved.
We are now able to confirm that the HPC service has been fully restored and is available for all users.
What have we changed?
As a result of the work undertaken, users should now experience improved performance, including:
What do you need to do?
You can now log in, copy and move files, and submit new compute jobs as normal. Any compute jobs which were pending at the time of Comet maintenance have been restarted automatically.
We recognise the significant disruption this outage has caused to work and research activity and appreciate your patience while we worked to restore the full service. For that reason, we have taken the following steps:
We will undertake a full review of this service disruption, and necessary measures will be taken to mitigate the risk of future disruption.
In further news posts we will outline the technical changes from this maintenance in more detail.
Following our previous update, we have continued to work with our third-party supplier to resolve the remaining issues identified during user acceptance testing.
We believe all of these technical issues have now been resolved and are currently undertaking the final stages of user testing to validate our findings. We are now aiming to restore access to the HPC service shortly.
We appreciate your continued patience while this work is completed.
A further scheduled update will be provided on Wednesday 10 June at 10:00.
We are down to the last few issues on the road to relaunching the HPC service.
As a reminder, the outstanding issues which needed to be addressed were:
We have fixed the first issue by tracing the problems to the pair of bonded network interfaces on the two Comet head nodes. Each node has two 10G network links, which were configured as a load-balanced bonded interface.
By removing the existing bonding configuration and testing the links independently, we were able to verify much more consistent write speeds to the RDW fileshare, as well as greatly improved latency for interactive/SSH terminal sessions. These bonded interfaces have now been restored using an alternative active/standby configuration, and performance remains much improved. We strongly suspect that the latency which many users encountered, and the poor performance of some file transfers, were caused when active user sessions moved between the two links.
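For readers curious how the mode of a bonded interface can be inspected on a Linux host, the following is a minimal sketch. The interface name `bond0` is an assumption for illustration only; the actual interface names and settings on the Comet head nodes are not stated on this page.

```shell
#!/usr/bin/env bash
# Sketch only: report the mode of a Linux bonded interface.
# "bond0" is an assumed interface name, not taken from the Comet configuration.

# Extract the "Bonding Mode:" value from bonding status text on stdin.
bonding_mode() {
    sed -n 's/^Bonding Mode: //p'
}

status_file="/proc/net/bonding/bond0"
if [ -r "$status_file" ]; then
    bonding_mode < "$status_file"
else
    echo "No readable bonding status at $status_file" >&2
fi
```

On a host using an active/standby bond, the kernel normally reports the mode as `fault-tolerance (active-backup)`; the same status file also lists which link is currently active.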
Secondly, the $HOME to RDW issue with > 2MB files was traced (using strace) to the Linux kernel copy_file_range() system call. This is a function call in newer Linux kernels which offers file copy offload acceleration features. This normally works on local filesystems to reduce the need for traditional read() and write() system calls, but is also supported on NFS v4.2… which is what the Research Data Warehouse uses by default.
By forcing Comet to talk to RDW via NFS v4.1, we effectively block the kernel from activating the copy_file_range() function calls for file transfers to RDW, and the errors no longer occur. There is clearly an underlying issue in the interaction of the Comet Linux kernels, NFS client software, or the Comet NFS server - but this avoids the call entirely.
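As a rough illustration of how this kind of change can be checked from a login session, the sketch below reads the NFS protocol version recorded for a mount point. It assumes the RDW share is mounted at `/rdw` (as elsewhere on this page) and that the version was pinned with a mount option such as `vers=4.1`; how the change was actually applied on Comet is not described here. The `strace` command in the final comment uses placeholder paths.

```shell
#!/usr/bin/env bash
# Sketch only: report the NFS protocol version recorded for a mount point.
# Assumes the RDW share is mounted at /rdw, as elsewhere on this page.

# Read /proc/mounts-style lines on stdin and print the vers= option
# of the NFS entry whose mount point (field 2) matches $1.
nfs_version_of() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] ~ /^vers=/) {
                sub(/^vers=/, "", opts[i])
                print opts[i]
            }
        }
    }'
}

if [ -r /proc/mounts ]; then
    nfs_version_of /rdw < /proc/mounts
fi

# To see whether a copy uses the offload call (paths are placeholders):
#   strace -f -e trace=copy_file_range cp ~/bigfile /rdw/path/to/share/
```

If the mount reports `4.1`, the client should not attempt server-side copy offload for that share, since that feature belongs to NFS v4.2.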
A small amount of work is needed to put these changes into configuration management so that they are deployed automatically on future rebuilds, but these were the last issues which needed to be addressed prior to relaunch of the HPC service.
We hope to send further positive comms very soon.
Following restoration work with our third-party supplier, we have completed user acceptance testing to assess the current state of the University HPC service.
While significant progress has been made, and most acceptance criteria are now meeting the expected standards, we have identified that a small amount of functionality is not yet at the level of quality we require. This includes the following areas of functionality, which are core to your use of the service:
As a result, we have reviewed the situation and concluded that the service is not fully operational and remains unavailable.
We are continuing to work to resolve this small number of remaining issues as quickly as possible, and to ensure the service is fully operational before it is made available to users. We appreciate your continued patience while we work to fully restore the service.
We will provide a further update on Tuesday 9th June at 1pm.
Following restoration work with our third-party supplier, the HPC service is returning to a stable state, and we are now able to conduct user acceptance testing to verify all functionality is working as expected.
Following successful completion of user acceptance testing, we anticipate that the service will then be available for users on Monday 8 June.
We recognise the significant disruption this outage has caused to work and research activity and appreciate your patience while we worked to restore the full service.
We will provide a further update at 13:00 on Monday 8 June.
Progress continues to be made restoring the Comet HPC service, and a further update will be sent at 4pm today to give a summary and timeline of next steps.
However, we have reached a milestone: access to Lustre has now been restored. We have therefore revisited the RDW to Comet transfer speeds we recorded in February, which were a big part of the reason for the maintenance on May 22nd. Updated performance data is shown below:
I'm pleased to say that our initial observations are that the “RDW to Lustre copy black hole” has now been resolved. We have recorded an incredible 14x performance improvement in the use of basic cp commands to transfer files from the RDW mountpoint (/rdw) directly to Lustre (/nobackup). To be clear:
Some further observations:
We will continue working with the vendor to restore services, but this result is positive both from a restoration of service perspective and as evidence of success against one of the main criteria for performing the maintenance in the first place.
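For anyone who wants to take a comparable measurement of their own once the service is available, here is a minimal sketch of timing a single `cp` and converting the result to MB/s. It is not the method used for the figures above. The paths in the example comment are hypothetical, and the script relies on GNU `stat` and `date` as found on Linux.

```shell
#!/usr/bin/env bash
# Sketch only: time a cp and report throughput in MB/s.
# Source and destination paths are placeholders supplied by the caller.
# Requires GNU stat and GNU date (standard on Linux).

# Convert a byte count and elapsed seconds into MB/s (1 MB = 10^6 bytes).
throughput_mb_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# Copy $1 to $2 and print the measured rate.
timed_copy() {
    local src="$1" dst="$2" start end bytes
    bytes=$(stat -c %s "$src")
    start=$(date +%s.%N)
    cp "$src" "$dst"
    end=$(date +%s.%N)
    throughput_mb_s "$bytes" "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}

# Example (hypothetical paths):
#   timed_copy /rdw/path/to/share/sample.dat /nobackup/$USER/sample.dat
```

A single run can be skewed by caching and by other users' activity, so repeating the copy with a reasonably large file and comparing several runs gives a fairer picture.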
The HPC service remains unavailable following an issue encountered during maintenance on Friday 22 May.
We recognise the significant disruption this outage is causing to work and research activity, and appreciate your patience while we work to restore the full service.
We continue to work closely with our third-party supplier to restore the service safely and in a stable state. At this stage, we are not yet able to confirm a timeframe for full-service restoration.
We will provide a scheduled update here on Friday 5th June at 4pm.
OCF engineers are in the data centre today. They will provide an update at the end of the working day today and we will provide more information as soon as we can after this.
Update at 4:45pm: Work is continuing overnight. We will receive an update in the morning.
The faulty InfiniBand switch has been replaced but there have been issues during firmware updates and configuration of the switch. This means that Comet service will not be resumed today.
The following limitations are still in effect:
Whilst you cannot run any compute jobs, you can access the login nodes, and you can access your home areas if you have data or code in those areas.
Further updates will be shared here, via Teams and email as soon as we have more concrete information.
Hello everyone,
Unfortunately, we have further bad news regarding the HPC outage.
Whilst a replacement part for the faulty InfiniBand switch module (which failed during maintenance) has been shipped, we have struggled to get the external hardware support engineers to schedule a date and time for the installation.
We have had several dates for engineer visits slip and were most recently given a date for today (Friday 29th May), which would have resulted in Comet being back up and running for this weekend and ready for Monday 1st June.
Frustratingly, this last engineer visit was cancelled at the last minute, and we have therefore requested that this issue be escalated to senior management both within our HPC provider (OCF) and at the hardware manufacturer (Lenovo).
Right now, we do not have another date for the engineer visit and hardware replacement.
This is a serious situation to be in, and we understand that many of you will be rightly concerned about your ability to proceed with your research. This is clearly a very concerning breakdown in the SLA between the University and our equipment supplier, and one which we are attempting to address as quickly as possible, using all means at our disposal.
However, for now - the following limitations are still in effect:
We will keep you updated on progress as we get those updates ourselves.
Regards
John Snowdon
Senior Research Infrastructure Engineer (HPC)
Dear Comet Users
We are sorry to tell you that Lenovo did not supply the required switch today and so Comet will have to remain out of service for at least one more day. We have been engaging with senior managers at Lenovo to expedite the replacement and this is very disappointing for all of us.
We apologise for this extended downtime. Another update will be provided tomorrow, Friday 29th May.
Kind Regards
HPC Support Team
Lenovo have now shipped the replacement Nvidia QM8700 FDR200 InfiniBand switch module to the local parts distribution centre for collection today (May 27th), with installation by their engineers to follow tomorrow (May 28th).
As this is a fairly specialised piece of equipment, it's not feasible to pick up a replacement locally, so it needed to be shipped in.
Thankfully, most of the planned maintenance on Comet was completed last Friday (22nd May), so while there will need to be thorough testing of the replacement network switch, it should be feasible for the HPC service to be brought back up at some point tomorrow. Once this happens we will be able to verify the performance improvements (networking and Lustre bandwidth) made during the earlier maintenance.
We will notify all users once the system is available for use again.
We have now had confirmation from Lenovo (the manufacturer of the Comet hardware) that the faulty equipment cannot be repaired. A replacement Nvidia FDR200 InfiniBand switch module has been requested and is on its way to the University.
Because of the specialist nature of the equipment, it's not as easy to source as something like a disk drive, power supply or memory module and hence cannot be delivered same-day.
Our HPC vendor (OCF) will liaise with Lenovo to replace the faulty switch module and install the replacement.
We are hopeful that this will be delivered and installed tomorrow, 27th May.
The Comet login nodes remain active (for the moment), but clearly no Slurm jobs are possible right now, and access to the login nodes may well be lost at any time - you should still consider Comet “down for maintenance”.
(emailed to hpc-users at 07:19am on Tue 26th May after the Bank Holiday Weekend)
Most of the work scheduled for last Friday with Comet was successful and we believe that we should start to see improved performance on network and file transfers as expected.
However, one item of work which was planned (reboot of InfiniBand network controller - to restore disconnected GPU nodes) did not complete successfully. The controller looks to have failed and will not power on fully; this means several key nodes (storage, GPU, high memory, low latency, etc) are isolated from the rest of the HPC facility.
Rather than bring up the service in a semi-broken state and with many resources missing, we agreed with our vendor to keep it in maintenance mode until a replacement network controller switch can be delivered and installed under warranty. This request was raised with Lenovo on Friday as a priority task, but as I'm sure you're already concluding, the holiday weekend will have slowed the process of delivery and arranging installation. At the moment we do not have an ETA on the replacement part, but should be able to provide this later today. We will post again as soon as we have a clearer schedule.
For now, please try to refrain from logging in to either login node - they may need to be restarted without warning at any time. Apologies for the unexpected extended downtime and thank you for your patience.
We are sorry to inform you that during the planned maintenance of Comet today, an essential InfiniBand component has failed to reboot. Much of the networking in the cluster relies on this so it seems likely that we will not be able to bring Comet back up until a hardware replacement has been delivered and installed.
Because of the Bank Holiday weekend, this means that:
Our current estimate of Comet HPC coming back online is Wednesday 27th May. Please do not attempt to log in to Comet even if the login nodes allow your connection. Essential components are not available and you might lose work.
An update will be provided on Tuesday 26th of May.
We have published the latest iteration of the Advanced Slurm Job Types guide, which includes an overview of the two most common methods of introducing parallelism in your HPC jobs: task arrays and MPI.
The guide outlines the situations you may choose to use one over the other, as well as the types of data which may be more suited for each particular method.
The guide links to the Building a Parallel Task Array Solution article (including a step-by-step worked example) for those who have decided they need Slurm task arrays, and we are in the process of editing the content of a similar article for those new to MPI.
(re-posting this important information from 8th May)
A change request to have Comet taken down for maintenance has now been approved by NUIT. This work will take place on Friday the 22nd of May and will mean Comet will be out of action for a full working day (9:00am - 5:00pm).
A maintenance reservation is being added which will mean any jobs you submit now which would overlap this date will not be accepted. Jobs which run up to this date will be scheduled and run as normal.
The work being undertaken includes:
This work will involve our support vendor performing the fixes both on-site and remotely.
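Because a job's requested time limit determines whether it would overlap the maintenance reservation, it can help to work out how much time remains before the window opens. The sketch below is one way to do that, assuming GNU `date` and treating timestamps as UTC for simplicity; the script name `myjob.sh` in the example comment is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: work out how many whole minutes remain before a maintenance
# window, so a job's --time request can be kept short enough to finish first.
# Requires GNU date (standard on Linux).

# minutes_until START [NOW]; both are date strings understood by `date -d`.
minutes_until() {
    local start_s now_s
    start_s=$(date -u -d "$1" +%s)
    now_s=$(date -u -d "${2:-now}" +%s)
    echo $(( (start_s - now_s) / 60 ))
}

# Example: the maintenance described above starts at 09:00 on 22 May.
# (Timestamps are treated as UTC here; adjust for local time as needed.)
#   sbatch --time="$(minutes_until '2026-05-22 09:00')" myjob.sh
```

`sbatch --time` accepts a plain number of minutes, so the output can be passed straight through. In practice it is sensible to request somewhat less than the full remaining time, since the job also has to wait in the queue before it starts.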
A new, more comprehensive guide has been published to walk new users through the process of designing a task array solution. This takes you through building a single application solution, organising your input data and converting your workflow to a Slurm task array.
Also covered are limitations to consider when configuring your task array parameters.
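To give a flavour of the pattern, here is a minimal sketch of a task array job script in which each task picks one line from a list of input files. It is not taken from the guide. The names `inputs.txt` and `my_app` are hypothetical, and any partition or account options required on Comet are omitted.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=array-sketch
#SBATCH --array=1-100%10      # 100 tasks, at most 10 running at once
#SBATCH --time=00:10:00

# Sketch only: each array task processes one line of a list of input files.
# "inputs.txt" and "my_app" are hypothetical names used for illustration.

# Print line number $1 of file $2.
nth_line() {
    sed -n "${1}p" "$2"
}

if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -r inputs.txt ]; then
    input_file=$(nth_line "$SLURM_ARRAY_TASK_ID" inputs.txt)
    echo "Task $SLURM_ARRAY_TASK_ID would process: $input_file"
    # my_app "$input_file"
fi
```

The `%10` suffix on `--array` is Slurm's throttle for the number of array tasks running at the same time, which is one example of a parameter worth setting deliberately.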
We have added some in-house developed Slurm tools to Comet and published a guide on how to use them. Users who are interested in extracting job performance metrics, analysing the state of their Slurm account code, or understanding why certain of their jobs are not running are advised to take a look.
The open source tool Convert3D has been added to the Bioapps container and will be available for use from the morning of 13th of May.
This release of the Bioapps container also updates the bundled R to version 4.6.x, in place of 4.5.x in earlier (<= 2026.04) versions. Version 2026.04 of the Bioapps container remains available for any users with a hard dependency on R 4.5.x.
GipsyX is now available to members of selected Comet HPC project groups. This software is license-restricted and is only available to groups who have a license to use it.
Consequently, the software is installed within a private container within specific project folders. It is not available system-wide.
Matlab 2026a is now available on Comet as a module. It is the default when using module load MATLAB, though the previous version (2024a) can still be selected if you supply the full version.
Open OnDemand has also been updated to use 2026a as the default version. Again, the previous version can be selected for new OOD sessions if you wish.
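As a small sketch of how the default and a pinned version might be selected in a job script: the `MATLAB/<version>` naming below is an assumption based on common module conventions, so please check `module avail MATLAB` on Comet for the exact names before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: choose between the default MATLAB module and a pinned version.
# The "MATLAB/<version>" naming is an assumption; run `module avail MATLAB`
# on Comet to see the exact names.

# Print the module name for an optional version argument.
matlab_module() {
    if [ -n "${1:-}" ]; then
        echo "MATLAB/$1"
    else
        echo "MATLAB"
    fi
}

# `module` is only defined where Environment Modules or Lmod is set up.
if command -v module >/dev/null 2>&1; then
    module load "$(matlab_module)"          # default version
    # module load "$(matlab_module 2024a)"  # pinned previous version
fi
```

Pinning the full version in job scripts keeps results reproducible if the default changes again in future.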