
HPC News & Changes

This page is intended to act as a timeline of events for the Comet HPC project, as well as major changes in functionality or policies relating to the system.

Newsletters

Starting March 2026, we also publish a monthly summary newsletter to the HPC-Users email distribution list (to which all users of Comet are subscribed):

  • Newsletters

(10th) June 2026 - Comet HPC service now available

Following our previous update, we have completed final user testing to verify that all remaining issues have been resolved.

We are now able to confirm that the HPC service has been fully restored and is available for all users.

What have we changed?

As a result of the work undertaken, users should now experience improved performance, including:

  • Improved file transfer performance between Research Data Warehouse and Comet - up to 800MB/s read and 350MB/s write to RDW
  • Improved file transfer performance between Comet login nodes and Lustre (/nobackup) - up from 30-90MB/s to 400-800MB/s for single thread/single stripe writes
  • Generally improved latency/reduced keyboard & display lag for users working at the command line over SSH

What do you need to do?

You can now log in, copy and move files, and submit new compute jobs as normal. Any compute jobs which were pending at the time of Comet maintenance have been restarted automatically.

We recognise the significant disruption this outage has caused to work and research activity and appreciate your patience while we worked to restore the full service. For that reason, we have taken the following steps:

  • We have tripled the amount of compute resources allocated to the free partitions to allow the backlog of jobs to be completed at a faster rate
  • We have increased the maximum amount of resources each free HPC project can use (number of concurrent jobs, job array tasks, CPU cores and GPU cards) to allow you to get through your backlog of work, and any *new* jobs, at a faster rate

We will undertake a full review of this service disruption, and necessary measures will be taken to mitigate the risk of future disruption.

In further news posts we will outline the technical changes from this maintenance in more detail.


(9th) June 2026 - Comet relaunch

Following our previous update, we have continued to work with our third-party supplier to resolve the remaining issues identified during user acceptance testing.

We believe all of these technical issues have now been resolved and are currently undertaking the final stages of user testing to validate our findings. We are now aiming to restore access to the HPC service shortly.

We appreciate your continued patience while this work is completed.

A further scheduled update will be provided on Wednesday 10 June at 10:00.


(9th) June 2026 - Last remaining relaunch issues

We are approaching the last few issues on our road to relaunch of the HPC service.

As a reminder, the outstanding issues which needed to be addressed were:

  • Write speeds to Research Data Warehouse poor/inconsistent
  • Inability to copy files >= 2MB (more precisely, 2097152 bytes) from Comet $HOME to /rdw/

We have fixed the first issue by tracking the problems to the pair of bonded network interfaces on the two Comet head nodes. Each node has two 10G network links, which were configured in a load balanced bonded interface.

By removing the existing bonding configuration and testing the links independently, we were able to verify much more consistent write speeds to the RDW fileshare, as well as greatly improved latency for interactive/SSH terminal sessions. These bonded interfaces have now been restored using an alternative active/standby configuration, and performance remains much improved. We strongly suspect that both the latency which many users encountered and the poor performance of some file transfers were caused by active user sessions moving between the two links.

Secondly, the $HOME to RDW issue with files of 2MB or larger was traced (using strace) to the Linux kernel copy_file_range() system call. This is a system call in newer Linux kernels which offers file copy offload acceleration. It normally works on local filesystems to reduce the need for traditional read() and write() system calls, but is also supported on NFS v4.2, which is what the Research Data Warehouse uses by default.

By forcing Comet to talk to RDW via NFS v4.1, we effectively block the kernel from activating the copy_file_range() function calls for file transfers to RDW, and the errors no longer occur. There is clearly an underlying issue in the interaction of the Comet Linux kernels, NFS client software, or the Comet NFS server - but this avoids the call entirely.
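For readers unfamiliar with the system call, the sketch below shows in Python how a copy routine might attempt copy_file_range() and fall back to ordinary read() and write() calls when the kernel or filesystem rejects it. It is an illustration only and is not the fix applied on Comet (that was the NFS v4.1 change described above). The function name copy_with_fallback and the 1MB chunk size are assumptions made for this example.

```python
import os


def copy_with_fallback(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, returning the number of bytes copied.

    Tries os.copy_file_range() first (Linux only, Python 3.8+) and switches
    to plain read()/write() if the call is unavailable or raises an error.
    """
    use_offload = hasattr(os, "copy_file_range")
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src_fd, dst_fd = src.fileno(), dst.fileno()
        while True:
            if use_offload:
                try:
                    n = os.copy_file_range(src_fd, dst_fd, chunk_size)
                except OSError:
                    # Offload was rejected: carry on from the current
                    # file position using traditional system calls.
                    use_offload = False
                    continue
            else:
                data = os.read(src_fd, chunk_size)
                n = len(data)
                # os.write() may write fewer bytes than requested.
                while data:
                    written = os.write(dst_fd, data)
                    data = data[written:]
            if n == 0:
                break  # end of the source file
            copied += n
    return copied
```

A tool written this way keeps working whether or not the offload path is usable, which is broadly the behaviour obtained on Comet by steering transfers away from copy_file_range() altogether.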

A small amount of work is needed to put these changes into configuration management so that they are deployed automatically on future rebuilds, but these were the last issues which needed to be addressed prior to relaunch of the HPC service.

We hope to send further positive comms very soon.


(8th) June 2026 - Comet relaunch acceptance criteria results

Following restoration work with our third-party supplier, we have completed user acceptance testing to assess the current state of the University HPC service.

While significant progress has been made, and most acceptance criteria are now meeting the expected standard, we have identified that a small amount of functionality is not yet at the level of quality we require. This includes the following areas, which are core to your use of the service:

  • Upload/write speeds to the University Research Data Warehouse - whilst we now have greatly improved read speeds from RDW (in the order of 600MB/s), this has highlighted that write speeds to the same location are outside of what we would consider acceptable.
  • Comet $HOME to RDW file copy operations are not meeting our success criteria and need resolving to meet a number of edge cases

As a result, we have reviewed the situation and concluded that the service is not fully operational and remains unavailable.

We are continuing to work to resolve this small number of remaining issues as quickly as possible, and to ensure the service is fully operational before it is made available to users. We appreciate your continued patience while we work to fully restore the service.

We will provide a further update on Tuesday 9th June at 1pm.


(5th) June 2026 - HPC service relaunch ETA

Following restoration work with our third-party supplier, the HPC service is returning to a stable state, and we are now able to conduct user acceptance testing to verify all functionality is working as expected.

Following successful completion of user acceptance testing, we anticipate that the service will then be available for users on Monday 8 June.

We recognise the significant disruption this outage has caused to work and research activity and appreciate your patience while we worked to restore the full service.

We will provide a further update at 13:00 on Monday 8 June.


(5th) June 2026 - Initial progress restoring HPC service

Progress continues to be made restoring the Comet HPC service, and a further update will be sent at 4pm today to give a summary and timeline of next steps.

However, we have reached the milestone of access to Lustre being restored. As such, we have revisited the RDW to Comet transfer speeds we recorded in February, which were a big part of the reason for the maintenance on May 22nd. The updated performance data is summarised below:

I'm pleased to say that our initial observations are that the “RDW to Lustre copy black hole” has now been resolved. We have recorded an incredible 14x performance improvement in the use of basic cp commands to transfer files from the RDW mountpoint (/rdw) directly to Lustre (/nobackup). To be clear:

  • Previous performance was as low as 49 Megabytes/sec in certain cases
  • We are now recording 680 Megabytes/sec for the same test
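To put those two figures in context, the short sketch below works through the arithmetic. The 100 GB dataset size is a hypothetical example and was not part of the tests; the two rates are the ones quoted above, and decimal units (1 GB = 1000 MB) are assumed.

```python
def speedup(old_mb_per_s, new_mb_per_s):
    """Return how many times faster the new rate is than the old one."""
    return new_mb_per_s / old_mb_per_s


def transfer_minutes(size_gb, mb_per_s):
    """Minutes needed to move size_gb (decimal gigabytes) at mb_per_s."""
    return (size_gb * 1000) / mb_per_s / 60


OLD_RATE = 49          # Megabytes/sec, slowest case recorded previously
NEW_RATE = 680         # Megabytes/sec, same test after the maintenance
EXAMPLE_SIZE_GB = 100  # hypothetical dataset size, not from the tests

print(f"Speed-up: {speedup(OLD_RATE, NEW_RATE):.1f}x")
print(f"Before: {transfer_minutes(EXAMPLE_SIZE_GB, OLD_RATE):.1f} minutes")
print(f"After:  {transfer_minutes(EXAMPLE_SIZE_GB, NEW_RATE):.1f} minutes")
```

The ratio works out at roughly 13.9, which is where the rounded 14x figure comes from. For the hypothetical 100 GB copy, that is the difference between about 34 minutes and about 2.5 minutes.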

Some further observations:

  • Area A - native read performance from RDW, NFS or Lustre (e.g. direct to RAM)
  • Area B - write performance limited transfers to local SSD or Lustre
  • Area C - write performance limited transfers to NFS
  • Excluding the write performance characteristics of any of the Comet devices (e.g. /scratch, $HOME or /nobackup) we can see that the read performance from RDW sits firmly in Area A in the data, representing 1.6 - 1.9 Gigabytes/sec.
  • Write performance from RDW to the local /scratch SSD device is within Area B, representing real-world figures of 600-900 Megabytes/sec.
  • Write performance from most sources to /nobackup is also in the 600-900 Megabytes/sec range, closely matching that of the local SSD (again, firmly within Area B) and hugely improved compared to before. Lustre is capable of much higher write speeds in parallel/stripe mode; this is just a single stripe/file.
  • Write performance from any source to $HOME is limited to ~450 Megabytes/sec (see Area C) which is in line with our expectations of the general-purpose NFS server on which it is hosted.

We will continue working with the vendor to restore services, but this is both positive from a restoration of service perspective, as well as showing success against one of the main criteria for performing the maintenance in the first place.


(4th) June 2026 - Comet outage update

The HPC service remains unavailable following an issue encountered during maintenance on Friday 22 May.

We recognise the significant disruption this outage is causing to work and research activity, and appreciate your patience while we work to restore the full service.

We continue to work closely with our third-party supplier to restore the service safely and in a stable state. At this stage, we are not yet able to confirm a timeframe for full-service restoration.

We will provide a scheduled update here on Friday 5th June at 4pm.


(3rd) June 2026 - Comet repair - engineers on site

OCF engineers are in the data centre today. They will provide an update at the end of the working day today and we will provide more information as soon as we can after this.

Update at 4:45pm: Work is continuing overnight. We will receive an update in the morning.


(2nd) June 2026 - Comet repair delayed

The faulty InfiniBand switch has been replaced but there have been issues during firmware updates and configuration of the switch. This means that Comet service will not be resumed today.

The following limitations are still in effect:

  • You cannot access the Lustre / nobackup / project filesystem on Comet
  • You cannot run any compute jobs
  • You cannot access the majority of the compute nodes

Whilst you cannot run any compute jobs, you can access the login nodes, and you can access your home areas if you have data or code in those areas.

Further updates will be shared here, via Teams and email as soon as we have more concrete information.


(29th) May 2026 - Comet repair delayed

Hello everyone,

Unfortunately, we have further bad news regarding the HPC outage.

Whilst a replacement part for the faulty Infiniband switch module (which failed during maintenance) has been shipped, we have struggled to get the external hardware support engineers to schedule a date and time to undertake the installation.

We have had several dates for engineer visits slip and were most recently given a date for today (Friday 29th May), which would have resulted in Comet being back up and running for this weekend and ready for Monday 1st June.

Frustratingly, this last engineer visit was cancelled at the last minute, and we have therefore requested that this issue be escalated to senior management both within our HPC provider (OCF) and at the hardware manufacturer (Lenovo).

Right now, we do not have another date for the engineer visit and hardware replacement.

This is a serious situation to be in, and we understand that many of you will be rightfully concerned about your ability to proceed with your research. Clearly this is a very concerning breakdown in the SLA between the University and our equipment supplier and one which we are attempting to address as quickly as possible, using all means at our disposal.

However, for now - the following limitations are still in effect:

  • You cannot access the Lustre / nobackup / project filesystem on Comet
  • You cannot run any compute jobs
  • You cannot access the majority of the compute nodes

Whilst you cannot run any compute jobs, you can access the login nodes, and you can access your home areas if you have data or code in those areas.

We will keep you updated on progress as we get those updates ourselves.

Regards

John Snowdon
Senior Research Infrastructure Engineer (HPC)


(28th) May 2026 - Comet repair delayed

Dear Comet Users

We are sorry to tell you that Lenovo did not supply the required switch today and so Comet will have to remain out of service for at least one more day. We have been engaging with senior managers at Lenovo to expedite the replacement and this is very disappointing for all of us.

We apologise for this extended downtime. Another update will be provided tomorrow, Friday 29th May.

Kind Regards

HPC Support Team


(27th) May 2026 - Comet repair schedule

Lenovo have now shipped the replacement Nvidia QM8700 HDR (200Gb/s) InfiniBand switch module to the local parts distribution centre for collection today, May 27th, with installation by their engineers tomorrow, May 28th.

As this is a fairly specialised piece of equipment, it's not feasible to pick up a replacement locally, so it needed to be shipped in.

Thankfully, most of the planned maintenance on Comet was completed last Friday (22nd May), so while there will need to be thorough testing of the replacement network switch, it should be feasible for the HPC service to be brought back up at some point tomorrow. Once this happens we will be able to verify the performance improvements (networking and Lustre bandwidth) made during the earlier maintenance.

We will notify all users once the system is available for use again.


(26th) May 2026 - Update on Comet still down for maintenance

We have now had confirmation from Lenovo (the manufacturer of the Comet hardware) that the faulty equipment cannot be repaired. A replacement Nvidia HDR (200Gb/s) InfiniBand switch module has been requested and is on its way to the University.

Because of the specialist nature of the equipment, it's not as easy to source as something like a disk drive, power supply or memory module and hence cannot be delivered same-day.

Our HPC vendor (OCF) will liaise with Lenovo to replace the faulty switch module and install the replacement.

We are hopeful that this will be delivered and installed tomorrow, 27th May.

The Comet login nodes remain active (for the moment), but clearly no Slurm jobs are possible right now, and access to the login nodes may well be lost at any time - you should still consider Comet “down for maintenance”.


(emailed to hpc-users at 07:19 on Tue 26th May, after the Bank Holiday weekend)

Most of the work scheduled for last Friday with Comet was successful and we believe that we should start to see improved performance on network and file transfers as expected.

However, one item of work which was planned (reboot of Infiniband network controller - to restore disconnected GPU nodes) did not complete successfully.

The controller looks to have failed and will not power on fully; this means several key nodes (storage, GPU, high memory, low latency, etc) are isolated from the rest of the HPC facility.

Rather than bring up the service in a semi-broken state and with many resources missing, we agreed with our vendor to keep it in maintenance mode until a replacement network controller switch can be delivered and installed under warranty.

This request was raised with Lenovo on Friday as a priority task, but as I'm sure you're already concluding, the holiday weekend will have slowed the process of delivery and arranging installation. At the moment we do not have an ETA on the replacement part, but should be able to provide this later today.

We will post again as soon as we have a clearer schedule. For now, please try to refrain from logging in to either login node - they may need to be restarted without warning at any time.

Apologies for the unexpected extended downtime and thank you for your patience.

(22nd) May 2026 - Issues encountered during Comet planned maintenance.  Expected delay to service resumption

We are sorry to inform you that during the planned maintenance of Comet today, an essential InfiniBand component has failed to reboot. Much of the networking in the cluster relies on this so it seems likely that we will not be able to bring Comet back up until a hardware replacement has been delivered and installed.

Because of the Bank Holiday weekend, this means that:

  • Our current estimate of Comet HPC coming back online is Wednesday 27th May.
  • Please do not attempt to log in to Comet even if the login nodes allow your connection. Essential components are not available and you might lose work.

An update will be provided on Tuesday 26th of May.


(21st) May 2026 - Advanced Slurm Job Types - Parallel Workflows

We have published the latest iteration of the Advanced Slurm Job Types guide which includes an overview of the two most common methods of introducing parallelism in your HPC jobs; task arrays and MPI.

The guide outlines the situations in which you may choose one over the other, as well as the types of data which may be better suited to each method.

The guide links to the Building a Parallel Task Array Solution article (which includes a step-by-step worked example) for those who find they need Slurm task arrays, and we are in the process of editing a similar article for those new to MPI.

  • For further information: Advanced Slurm Job Types

(21st) May 2026 - Upcoming Comet Maintenance (22nd May)

(re-posting this important information from 8th May)

A change request to have Comet taken down for maintenance has now been approved by NUIT. This work will take place on Friday the 22nd of May and will mean Comet will be out of action for a full working day (9:00am - 5:00pm).

A maintenance reservation is being added which will mean any jobs you submit now which would overlap this date will not be accepted. Jobs which run up to this date will be scheduled and run as normal.
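As a rough way of checking whether a job would overlap the maintenance date, the sketch below compares a job's earliest possible finish time with the start of the window. It is a simplification that assumes the job starts the moment it is submitted and runs for its full requested walltime; the Slurm scheduler makes the real decision. The function name overlaps_maintenance and the example jobs are hypothetical.

```python
from datetime import datetime, timedelta

# Maintenance window from the announcement: Friday 22nd May 2026, 9:00am.
MAINTENANCE_START = datetime(2026, 5, 22, 9, 0)


def overlaps_maintenance(earliest_start, walltime,
                         window_start=MAINTENANCE_START):
    """Return True if a job starting at earliest_start and running for its
    full requested walltime would still be running when the window opens."""
    return earliest_start + walltime > window_start


# Hypothetical jobs submitted at 10:00am on Thursday 21st May.
now = datetime(2026, 5, 21, 10, 0)
short_job = timedelta(hours=12)  # would finish at 10:00pm on the 21st
long_job = timedelta(hours=48)   # would still be running on the 22nd

print(overlaps_maintenance(now, short_job))  # False
print(overlaps_maintenance(now, long_job))   # True
```

If a job falls into the second category, requesting a shorter walltime (where the work allows it) is one way to have it finish before the window opens.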

The work being undertaken includes:

  • Fitting additional RAM (+128GB) in each of the four Lustre servers to improve high IO load performance and reduce the possibility of system failure in extreme IO load situations.
  • Shutdown, update and restart of the faulty Infiniband backplane network controller, allowing us to reintroduce the isolated GPU nodes (hgpu001 and gpu004).
  • Replacing a faulty network uplink cable optic from Comet to the Campus network which is causing intermittent network dropouts between Comet login nodes and Campus.
  • Downgrade of the Linux kernel drivers and Lustre client software on the two login nodes to match the versions used on all other nodes (our vendor believes that because they are out-of-sync, this is the source of the poor performance of copying data from external - eg RDW - to /nobackup)
  • Increase size of Linux kernel network buffer on the NFS (home & software) storage server to cope with increased user load and reduce packet drop/retransmit figures

This work will involve our support vendor performing the fixes both on-site and remotely.


(20th) May 2026 - New guide for developing task array workflows

A new, more comprehensive guide has been published to walk new users through the process of designing a task array solution. This takes you through building a single application solution, organising your input data and converting your workflow to a Slurm task array.

Also covered are limitations to consider when configuring your task array parameters.

  • Read more on the Building a Parallel Task Array Solution guide
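As a flavour of what converting a workflow to a task array involves, the sketch below shows one common pattern: each array task reads the SLURM_ARRAY_TASK_ID environment variable (set by Slurm for every task in a job array) and uses it to pick its own input file. This is a minimal illustration and is not taken from the guide itself; the function names, the .dat suffix and the 1-based numbering are assumptions.

```python
import os


def select_input(input_files, task_id):
    """Return the input file for a 1-based Slurm array task ID."""
    if not 1 <= task_id <= len(input_files):
        raise IndexError(f"task ID {task_id} has no matching input file")
    return input_files[task_id - 1]


def input_for_this_task(input_dir, suffix=".dat", environ=os.environ):
    """Pick this task's input file from input_dir using SLURM_ARRAY_TASK_ID."""
    # Sort so that every task sees the files in the same order.
    input_files = sorted(
        name for name in os.listdir(input_dir) if name.endswith(suffix)
    )
    # Slurm sets SLURM_ARRAY_TASK_ID for each task in a job array.
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    return os.path.join(input_dir, select_input(input_files, task_id))
```

With this pattern, a job submitted with an array range of 1 to N (for example via the sbatch --array option) runs N copies of the same script, each working on a different file. The bounds check guards against an array range that is larger than the number of input files.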

(18th) May 2026 - New Slurm tools available

We have added some in-house developed Slurm tools to Comet and published a guide on how to use them. Users who are interested in extracting job performance metrics, analysing the state of their Slurm account code, or understanding why certain of their jobs are not running are advised to take a look.

  • See our new Simple Slurm Tools guide page

(12th) May 2026 - Convert3D added to Bioapps container

The open source tool Convert3D has been added to the Bioapps container and will be available for use from the morning of 13th of May.

This release of the Bioapps container also updates the bundled R to version 4.6.x, in place of 4.5.x in earlier (<= 2026.04) versions. Version 2026.04 of the Bioapps container remains available for any users with a hard dependency on R 4.5.x.

  • For more information see our Bioapps container guide for Comet

(12th) May 2026 - GipsyX available

GipsyX is now available to members of selected Comet HPC project groups. This software is licence-restricted and is only available to groups who hold a licence to use it.

Consequently, the software is installed within a private container within specific project folders. It is not available system-wide.

  • For more information about the software, and how to use and update it yourself, please see our GipsyX guide for Comet

(8th) May 2026 - Upcoming Comet Maintenance (22nd May)

A change request to have Comet taken down for maintenance has now been approved by NUIT. This work will take place on Friday the 22nd of May and will mean Comet will be out of action for a full working day (9:00am - 5:00pm).

A maintenance reservation is being added which will mean any jobs you submit now which would overlap this date will not be accepted. Jobs which run up to this date will be scheduled and run as normal.

The work being undertaken includes:

  • Fitting additional RAM (+128GB) in each of the four Lustre servers to improve high IO load performance and reduce the possibility of system failure in extreme IO load situations.
  • Shutdown, update and restart of the faulty Infiniband backplane network controller, allowing us to reintroduce the isolated GPU nodes (hgpu001 and gpu004).
  • Replacing a faulty network uplink cable optic from Comet to the Campus network which is causing intermittent network dropouts between Comet login nodes and Campus.
  • Downgrade of the Linux kernel drivers and Lustre client software on the two login nodes to match the versions used on all other nodes (our vendor believes that because they are out-of-sync, this is the source of the poor performance of copying data from external - eg RDW - to /nobackup)
  • Increase size of Linux kernel network buffer on the NFS (home & software) storage server to cope with increased user load and reduce packet drop/retransmit figures

This work will involve our support vendor performing the fixes both on-site and remotely.


(6th) May 2026 - Matlab 2026

Matlab 2026a is now available on Comet as a module. It is the default when using module load MATLAB, though the previous version (2024a) can still be selected if you supply the full version.

Open OnDemand has also been updated to use 2026a as the default version. Again, the previous version can be selected for new OOD sessions if you wish.

  • See: Matlab guide for Comet

Previous Updates

  • 2026 - April
  • 2026 - March
  • 2026 - February
  • 2026 - January
  • 2025

Back to HPC Documentation Home

Developed and operated by
Research Software Engineering
Copyright © Newcastle University
Contact us @rseteam