Our Research Projects

AI-Driven Discovery of Designer Biocatalytic Enzymes: Cold-Active Alkaline α-Amylases for Sustainable Detergent Applications

This is an active project currently making use of HPC facilities at Newcastle University.

Project Contacts

For further information about this project, please contact:


Project Description

Metagenomic Mining and ESMFold Structure Prediction of GH13 Sequences

This project will mine metagenomic libraries from Prozomix and Insiligence to expand the current GH13 sequence database from 5,000 to several million sequences. Searches will use the Pfam GH13 domain in HMMER3 profile-based analyses, filtering sequences by length (400–650 amino acids) and clustering at 90% identity to remove near duplicates.

The primary computational task is ESMFold protein structure prediction for the selected sequence set. We estimate 10,000 GPU-hours for the full dataset; to balance throughput and queueing efficiency, we plan to run up to 12 NVIDIA A100 GPUs in parallel, with jobs submitted in stages. This phased approach makes efficient use of HPC resources while keeping wall-clock times manageable (each stage 1–2 weeks).

Storage requirements: we anticipate needing up to 15 TB for compressed PDB coordinate files. Data will be stored in project-allocated HPC research storage and managed to minimize unused data.
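As a rough illustration of the mining and filtering step, the Python sketch below runs an HMMER3 hmmsearch with a GH13 profile HMM, keeps hits in the 400–650 amino-acid window, and clusters the survivors at 90% identity with MMseqs2. File names, paths, the PF00128 accession, and the choice of MMseqs2 (CD-HIT would serve equally) are illustrative assumptions, not the project's actual configuration.

```python
"""Sketch of the GH13 mining step: hmmsearch -> length filter -> 90% clustering.

All paths and file names are placeholders for illustration only.
"""
import subprocess

from Bio import SeqIO  # Biopython, for FASTA parsing

HMM = "GH13.hmm"               # Pfam GH13 profile HMM (e.g. PF00128); hypothetical path
METAGENOME = "metagenome.faa"  # pooled Prozomix/Insiligence sequences; hypothetical path
MIN_LEN, MAX_LEN = 400, 650    # length window from the project description

# 1. Profile search; --tblout writes a parseable per-sequence hit table and
#    --cut_ga applies Pfam's curated gathering thresholds.
subprocess.run(
    ["hmmsearch", "--tblout", "gh13_hits.tbl", "--cut_ga", HMM, METAGENOME],
    check=True,
)

# 2. Collect the IDs of sequences that matched the GH13 profile.
hit_ids = set()
with open("gh13_hits.tbl") as fh:
    for line in fh:
        if not line.startswith("#"):
            hit_ids.add(line.split()[0])  # first column is the target name

# 3. Keep only hits inside the 400-650 amino-acid window.
kept = [
    rec
    for rec in SeqIO.parse(METAGENOME, "fasta")
    if rec.id in hit_ids and MIN_LEN <= len(rec.seq) <= MAX_LEN
]
SeqIO.write(kept, "gh13_filtered.faa", "fasta")
print(f"{len(kept)} sequences retained for clustering")

# 4. Cluster at 90% identity to remove near duplicates; MMseqs2 easy-cluster
#    writes gh13_90_rep_seq.fasta with one representative per cluster.
subprocess.run(
    ["mmseqs", "easy-cluster", "gh13_filtered.faa", "gh13_90", "tmp",
     "--min-seq-id", "0.9"],
    check=True,
)
```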


Software or Compute Methods

The pipeline relies on the following software and compute methods:

HMMER3 for Pfam GH13 profile searches across the Prozomix and Insiligence metagenomic libraries, with length filtering (400–650 amino acids) and 90% identity clustering to remove near duplicates.

ESMFold for protein structure prediction of the filtered sequence set, estimated at 10,000 GPU-hours and run in stages of 1–2 weeks across up to 12 NVIDIA A100 GPUs.

Project-allocated HPC research storage of up to 15 TB for compressed PDB coordinate files.
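To make the staging concrete: at 12 A100s, the estimated 10,000 GPU-hours works out to roughly 833 hours (about 35 days) of wall-clock if run end to end, so stages of 1–2 weeks imply roughly three to five submission waves. The minimal sketch below splits the clustered representatives into per-GPU chunks; the chunk count and file names are assumptions for illustration, not project parameters.

```python
"""Sketch: split clustered representatives into per-GPU, per-stage chunks.

N_CHUNKS and all file names are illustrative assumptions.
"""
from Bio import SeqIO

N_CHUNKS = 48  # e.g. 4 stages x 12 GPUs; hypothetical split

records = list(SeqIO.parse("gh13_90_rep_seq.fasta", "fasta"))
for i in range(N_CHUNKS):
    # Round-robin assignment keeps chunk sizes (and GPU load) balanced.
    SeqIO.write(records[i::N_CHUNKS], f"chunk_{i:03d}.faa", "fasta")
```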
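Each GPU then folds its own chunk. The sketch below uses the published fair-esm ESMFold interface (esm.pretrained.esmfold_v1 and infer_pdb); writing gzipped PDBs directly reflects the compressed-storage plan. The chunk file name is the hypothetical one from the splitting sketch above.

```python
"""Sketch: fold one chunk with ESMFold and write gzipped PDB files.

Requires fair-esm installed with the esmfold extras; paths are placeholders.
"""
import gzip
from pathlib import Path

import torch
import esm  # fair-esm
from Bio import SeqIO

model = esm.pretrained.esmfold_v1().eval().cuda()  # one model instance per A100

out_dir = Path("structures")
out_dir.mkdir(exist_ok=True)

for rec in SeqIO.parse("chunk_000.faa", "fasta"):
    with torch.no_grad():
        pdb_str = model.infer_pdb(str(rec.seq))  # returns a PDB-format string
    # Compress on write to stay within the ~15 TB storage budget.
    with gzip.open(out_dir / f"{rec.id}.pdb.gz", "wt") as fh:
        fh.write(pdb_str)
```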