====== HPC Service Updates - 2026 - April ======

----

===== (30th) April 2026 - Matlab 2026 =====

Our HPC vendor is in the process of installing the new Matlab 2026 release for us. This will become the default option for new Open OnDemand sessions as well as at the command line.

The previous versions of Matlab will remain available for the present time.

----

===== (30th) April 2026 - New ANSYS software and support guides =====

We are in the process of implementing ANSYS 2025 R2, which will replace the older releases currently available on Comet. This version is configured specifically to work at the command line //and// from the Open OnDemand desktop environment.

You can read the new ANSYS guide here:

  * [[advanced:software:ansys|ANSYS guide for Comet]]

Note that this new version of ANSYS is not installed as a module, since the [[https://ansyshelp.ansys.com/public/account/secured?returnurl=/Views/Secured/corp/v252/en/installation/unix_platform_libraries.html|required software list]] from ANSYS is truly enormous. There will be a small difference in how you call the various ANSYS modules - all of which is in the process of being documented in the new ANSYS guide for Comet.

----

===== (28th) April 2026 - Cometlogin01 has a SLOW connection to /nobackup =====

Cometlogin01 is now available once more, so we again have two working login nodes.

Due to ongoing Infiniband hardware issues, after the recent reboot of cometlogin01, /nobackup (the Lustre filesystem) has been mounted via a slower connection.

Both login nodes are now functioning normally and can be used for:

  * **running** slurm jobs
  * **accessing** and **transferring** files to home, project and RDW
  * **accessing** /nobackup

Because **cometlogin01 now has a much slower connection to Lustre (/nobackup), please do NOT use it for large file transfers**. If you are on cometlogin01 and need to transfer large data to or from /nobackup, please either ''ssh cometlogin02'' or log off and on again until you get a session on cometlogin02.
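The hop above can be sketched as a small shell helper. This is an illustrative sketch only: the ''pick_transfer_node'' function is our own name, and the node names are taken from this announcement.

```shell
# Hedged sketch: pick the right login node for a large /nobackup transfer.
# cometlogin01 currently has a slow Lustre link, so prefer cometlogin02.
pick_transfer_node() {
  case "$1" in
    cometlogin01*) echo "cometlogin02" ;;   # hop away from the slow node
    *)             echo "$1" ;;             # any other node is fine as-is
  esac
}

node=$(pick_transfer_node "$(hostname -s)")
echo "transfer via: $node"
# e.g. if you are on cometlogin01:  ssh cometlogin02
# then run your rsync/scp against /nobackup from there
```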
Many thanks for your patience and consideration for other users.

----

===== (28th) April 2026 - Comet maintenance and change request =====

A change request (3009741) has been submitted in order to take Comet down for maintenance. This will cover the following changes:

  * Restart of the faulty Infiniband backplane controller, and therefore **reinstating gpu004 and hgpu001**
  * Upgrade of RAM in the **Lustre storage (/nobackup)** (an extra 128GB in each node) to improve performance in high-load situations
  * Downgrade of the Lustre client software on the two login nodes, to bring them back to the same version as the compute nodes and thus restore **file copy performance** from the login nodes to /nobackup
  * Replacement of a faulty fibre optic module on one of the uplinks from Comet to the Campus network (a **//possible source// of the RDW connection reliability issues**)
  * Increase of the network buffer size on the **NFS (home + software)** node to reduce packet drops in high-load situations

Because of the number of changes, this will need to go through the NUIT change management process and so is subject to their approval first. Because the maintenance is substantial, it is expected to take Comet out of service for a full working day (e.g. 9-5).

Until the change is approved //we will not have a definite date for the outage//. Once approved, we will look at the currently running jobs and schedule the maintenance for as soon as we can, bearing in mind that jobs still running in the long partitions can run for up to 14 days. We therefore do not expect the maintenance to be any sooner than the //18th of May//.

Once we have a decision from the change management board and a date, we will post here and to the HPC-Users email distribution list.

----

===== (27th) April 2026 - One login node is down =====

One of our two login nodes, ''cometlogin01'', is down after a crash this morning at about 11am.
Please continue to log in as normal at ''comet.hpc.ncl.ac.uk'', but note that you will only be connected to cometlogin02. If possible, please avoid resource-heavy activities on the login node while capacity is reduced.

Once we have more details and an estimate for the time to resolution we will post a further update.

----

===== (21st) April 2026 - Freesurfer added to Comet =====

[[https://surfer.nmr.mgh.harvard.edu/|Freesurfer]] has now been added to Comet in the form of a new application container.

  * For more details see the [[:advanced:software:freesurfer|Freesurfer guide for Comet]]

----

===== (20th) April 2026 - Ongoing Infiniband Issues =====

If you are waiting to access GPU resources on Comet (e.g. the **gpu-s_paid**, **gpu-s_free** or **gpu-l_paid** partitions), please be aware that our HPC support vendor is currently investigating a network issue affecting all GPU nodes.

It is likely that this network issue (an Infiniband network fabric controller problem) has been the cause of the stuck jobs, dropped Lustre connections and //drain// states on the GPU nodes. Since all GPU nodes share the same network fabric backplane, it is not possible for us to drop out the two currently affected nodes (**gpu004** and **hgpu001**) and replace them with alternatives, as they are all connected to the same impacted network controller.

Once we have more details and an estimate for the time to resolution we will post a further update.

----

===== (20th) April 2026 - vLLM Installed =====

We have installed [[:advanced:software:vllm|vLLM]] on Comet within a container environment. This is an easy-to-use [[https://vllm.ai/|LLM]] inference engine.
  * Please see the [[:advanced:software:vllm|vLLM guide for Comet]]

----

===== (10th) April 2026 - Planned redistribution of compute resources =====

Following analysis of the most recent [[https://hpc.researchcomputing.ncl.ac.uk/reports/|performance report data]] we are collecting from Comet, we have taken the decision to redistribute some of the compute resources. A request has been logged with our HPC vendor to move **six** compute nodes from the **_paid** partitions to the equivalent **_free** partitions. This will change the resource distribution as follows:

  * **Paid** partitions (''short_paid'', ''default_paid'', ''long_paid'', ''interactive-std_paid''): a //reduction// of **1536** cores; this still leaves over **10000** cores of compute for paid use
  * **Free** partitions (''short_free'', ''default_free'', ''long_free'', ''interactive-std_free''): **2304** //existing// cores + **1536** //additional// cores = **3840** //total// cores

You will not need to do anything to make use of these extra resources - jobs submitted to the various **_free** partitions will automatically be distributed over the new resources as they become available. The intention is to reduce the waiting time for all free jobs - the per-project resource limits on the number of cores and simultaneous jobs are __not__ intended to be increased.

**Update:** this work has now been completed and the new node allocations are in place.

__No change__ is being made to GPU resources (''gpu-s_paid'', ''gpu-l_paid'', ''interactive-GPU_paid'', ''gpu-s_free'' or ''interactive-GPU_free'') at this time.

----

===== (9th) April 2026 - srun sessions not starting on several partitions =====

Several users have reported that normal ''srun'' sessions are not starting across various partitions.
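While this is investigated, ''sbatch'' jobs are unaffected, so a non-interactive submission is a possible workaround. The sketch below is illustrative only: the partition and account names are the examples used in this section, and the time limit and the job's commands are our own assumptions.

```shell
# Hedged sketch: a batch-job equivalent of a failing interactive srun session.
# Partition/account names come from this announcement; --time is illustrative.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu-l_paid
#SBATCH --account=my_account_name
#SBATCH --time=01:00:00

# commands you would have run in the interactive session
hostname
EOF

echo "wrote myjob.sh - submit with: sbatch myjob.sh"
```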
The typical error will show as follows:

<code>
$ srun --partition=gpu-l_paid --account=my_account_name --pty bash
srun: job 1337731 queued and waiting for resources
srun: job 1337731 has been allocated resources
srun: StepId=1337731.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
</code>

So far we have identified this happening on the following partitions:

  * default_paid
  * low-latency_paid
  * gpu-s_paid
  * gpu-l_paid

It is //not// present on:

  * gpu-s_free

Due to current resource allocation, we do not yet have any evidence for:

  * default_free
  * short_free
  * long_free

Currently ''sbatch'' jobs are unaffected. A software incident has been raised with our HPC support vendor to begin looking at this issue.

----

===== (9th) April 2026 - Software module help pages =====

The [[:advanced:software_list|software list]] has now been updated to provide an information page for **every software module which was requested to be installed on Comet**. There are a small number of requested software packages which we have found to be missing - these will be followed up with our HPC vendor for installation.

  * See the updated [[:advanced:software_list|Comet software list]]

----

===== (9th) April 2026 - New version of CASTEP (26.1.1) =====

A new version of the [[advanced:software:castep|CASTEP container]] has been released. This is now updated to the latest 26.1.1 release of the application. Full details are included in the CASTEP user guide wiki page.

  * See [[advanced:software:castep|CASTEP container user guide]]

----

[[:status:index|Back to HPC News & Changes]]