Newcastle University HPC Portal

Our Research Projects

Task-specific analysis of Pairformer block importance in Protenix and conditional routing for better prediction

This is a project which is currently making use of HPC facilities at Newcastle University. It is active.

Project Contacts

For further information about this project, please contact:

Dr Jichun Li (jichun.li@newcastle.ac.uk)
Dr Penelope James (katherine.james@newcastle.ac.uk)

Project Description

This project investigates the internal functional organisation of the Pairformer architecture within Protenix, a fully open-source reproduction of AlphaFold3, through systematic block-wise ablation analysis across three biological interaction task categories: protein–protein interactions, protein–ligand binding, and protein–nucleic acid interactions. The central research question asks whether individual Pairformer blocks contribute differentially to distinct biological prediction tasks, or whether block importance is uniformly distributed across all task types. To answer this, each of the 48 Pairformer blocks will be systematically ablated and the impact on prediction accuracy measured independently for each task category using standard structural metrics, including lDDT and DockQ scores, producing a comprehensive block-task importance map.
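
One way to picture the block-task importance map is as the drop in mean score when a block is ablated, relative to the unablated baseline. The sketch below is illustrative only: the function name, the dictionary layout, and the use of a simple mean difference are assumptions, not the project's actual scoring procedure.

```python
from statistics import mean


def importance_map(baseline, ablated):
    """Return {task: [importance per block]}.

    baseline: {task: [score per complex]} with no block ablated.
    ablated:  {task: {block_index: [score per complex]}}.
    Importance is the drop in mean score (baseline minus ablated),
    so a larger value means the block mattered more for that task.
    """
    result = {}
    for task, base_scores in baseline.items():
        base = mean(base_scores)
        per_block = ablated[task]
        # Blocks are reported in ascending index order.
        result[task] = [base - mean(per_block[b]) for b in sorted(per_block)]
    return result
```

The scores could be lDDT or DockQ values per benchmark complex; in the real study the map would have 48 entries per task category.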

Alongside the ablation study, internal pair representations and attention patterns will be extracted and visualised at each block depth to characterise what geometric and structural information is encoded at different stages of the Pairformer. An additional analysis will investigate the hallucination phenomenon in Protenix, where the diffusion module incorrectly generates ordered structures in intrinsically disordered protein regions, by comparing internal pair representations of correctly predicted, hallucinated, and genuinely disordered regions to identify representation signatures that precede hallucination events.

The findings will be used to design and evaluate a task-adaptive conditional routing framework that dynamically activates only the Pairformer blocks critical for the detected biological task type, offering a principled and computationally efficient alternative to the uniform block pruning demonstrated in Protenix-Mini. This work represents the first systematic interpretability study of the AF3/Protenix Pairformer architecture and has direct applications in drug discovery, antiviral research, and efficient deployment of protein structure prediction models.
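
As a rough illustration of task-adaptive routing, a routing table can be derived from per-task block importance values by keeping only the blocks whose importance reaches a cut-off. This is a minimal sketch under stated assumptions: the threshold rule, the function names, and the input layout (a dictionary mapping each task to a list of per-block importance values) are hypothetical and are not taken from Protenix or Protenix-Mini.

```python
def select_active_blocks(importances, threshold):
    """Return indices of blocks whose importance is at least `threshold`."""
    return [i for i, value in enumerate(importances) if value >= threshold]


def build_routing_table(importance_by_task, threshold):
    """Map each task type to the list of block indices to keep active."""
    return {
        task: select_active_blocks(values, threshold)
        for task, values in importance_by_task.items()
    }
```

At inference time, the detected task type would select one row of this table, and the remaining blocks would be bypassed.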

Software or Compute Methods

The project will utilise the Protenix open-source codebase (a faithful PyTorch-based reproduction of AlphaFold3) as the primary deep learning framework for all structure prediction and ablation experiments. Protenix will be deployed with pre-trained model weights (~368 million parameters) and modified to support block-wise bypassing of individual Pairformer layers for systematic ablation analysis.
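
Block-wise bypassing can be thought of as treating a chosen block as the identity while the rest of the stack runs unchanged. The sketch below shows that idea in plain Python with generic callables; `run_stack` and its `skip` argument are hypothetical names, and this is not how the Protenix codebase exposes its Pairformer blocks.

```python
def run_stack(blocks, state, skip=frozenset()):
    """Apply `blocks` in order, treating any index in `skip` as the identity.

    blocks: sequence of callables, each mapping a state to a new state.
    state:  the input to the first block (for a Pairformer this would be
            the single and pair representations).
    skip:   set of block indices to bypass.
    """
    for index, block in enumerate(blocks):
        if index in skip:
            continue  # bypassed block: state passes through unchanged
        state = block(state)
    return state
```

A single-block ablation then amounts to calling `run_stack(blocks, state, skip={k})` for each block index `k` in turn.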

The following software stack will be used throughout the project:

Deep Learning and Scientific Computing:

PyTorch (GPU-accelerated deep learning)
NumPy and SciPy (numerical computation)
scikit-learn (statistical analysis and machine learning)
UMAP and PCA (dimensionality reduction for representation analysis)
Matplotlib and Seaborn (visualisation and figure generation)

Structural Biology Tools:

PyMOL and UCSF Chimera (3D structural visualisation)
Biopython (sequence and structure file processing)
MDAnalysis (molecular structure analysis)

Workflow and Data Management:

CUDA-enabled GPU inference via PyTorch
Conda environment management
HDF5 and NumPy binary formats for efficient storage of extracted pair representations and attention weight tensors

The primary computational workload consists of repeated GPU-accelerated inference runs of the Protenix model for systematic block ablation (144 ablation runs: 48 blocks × 3 task categories), extraction and storage of intermediate pair representation tensors at each of the 48 Pairformer block depths, and attention weight matrix extraction across multiple benchmark protein complexes. These workloads are highly parallelisable and will be distributed across multiple GPUs using PyTorch DataParallel and batch job scheduling.
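
Because each ablation run is independent, the 144 (block, task) combinations can be enumerated up front and split across workers. The following is a minimal sketch of that bookkeeping; the task labels, the round-robin split, and the `n_workers` parameter are assumptions for illustration and say nothing about the scheduler configuration actually used.

```python
from itertools import product

TASKS = ("protein-protein", "protein-ligand", "protein-nucleic-acid")
N_BLOCKS = 48


def ablation_runs():
    """Enumerate every (block_index, task) pair: 48 blocks x 3 tasks."""
    return list(product(range(N_BLOCKS), TASKS))


def shard(runs, n_workers):
    """Split runs round-robin into `n_workers` lists of near-equal size."""
    return [runs[worker::n_workers] for worker in range(n_workers)]
```

Each shard could then be submitted as one batch job bound to a single GPU, for example `shard(ablation_runs(), 4)` for four devices.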

The project requires GPU compute with a minimum of 24 GB GPU memory per device to accommodate Protenix model weights and intermediate activations during inference. High-memory CPU nodes will be utilised for post-processing, dimensionality reduction, and statistical analysis of extracted representations. Significant storage capacity is required for intermediate activation tensors, with an estimated storage requirement of 500 GB to 2 TB depending on benchmark dataset size and representation extraction depth.
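
A back-of-envelope calculation shows how quickly pair representations accumulate. The sketch below assumes a pair channel width of 128, half-precision storage (2 bytes per value), and a 512-token complex; all three figures are illustrative assumptions rather than measured values from this project, and attention weights are not included.

```python
def pair_tensor_bytes(n_tokens, channels=128, bytes_per_value=2):
    """Size in bytes of one n_tokens x n_tokens x channels pair tensor."""
    return n_tokens * n_tokens * channels * bytes_per_value


def per_complex_bytes(n_tokens, n_blocks=48, channels=128, bytes_per_value=2):
    """Size in bytes when the pair tensor is saved after every block."""
    return n_blocks * pair_tensor_bytes(n_tokens, channels, bytes_per_value)
```

Under these assumptions a single 512-token complex needs 64 MiB per block and 3 GiB for all 48 blocks, so a budget of 500 GB to 2 TB would hold on the order of 150 to 600 such complexes.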

All software used is open-source and freely available. No proprietary licensed software is required for this project.