This project is active and currently uses HPC facilities at Newcastle University.
For further information about this project, please contact:
This project investigates the internal
functional organisation of the Pairformer
architecture within Protenix, a fully
open-source reproduction of AlphaFold3,
through systematic block-wise ablation
analysis across three biological
interaction task categories:
protein–protein interactions,
protein–ligand binding, and
protein–nucleic acid interactions.
The central research question asks
whether individual Pairformer blocks
contribute differentially to distinct
biological prediction tasks, or whether
block importance is uniformly distributed
across all task types. To answer this,
each of the 48 Pairformer blocks will
be systematically ablated and the impact
on prediction accuracy measured
independently for each task category
using standard structural metrics
including lDDT and DockQ scores,
producing a comprehensive block-task
importance map.
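The block-task importance map described above can be sketched as a table of score drops. This is a minimal illustration, assuming importance is defined as the baseline score minus the score obtained with one block ablated; the function names and the numbers in the usage note are hypothetical placeholders, not project results.

```python
from typing import Dict, List


def importance_map(
    baseline: Dict[str, float],
    ablated: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Return, for each task, the score drop caused by ablating each block.

    baseline[task]   -- metric (e.g. lDDT or DockQ) with all blocks active
    ablated[task][i] -- the same metric with block i bypassed
    """
    result: Dict[str, List[float]] = {}
    for task, base_score in baseline.items():
        # Assumed definition: a larger drop means a more important block.
        result[task] = [base_score - score for score in ablated[task]]
    return result


def most_important_block(drops: List[float]) -> int:
    """Index of the block whose ablation causes the largest drop."""
    return max(range(len(drops)), key=lambda i: drops[i])
```

For example, with a placeholder baseline of 0.80 and ablated scores of 0.80, 0.60 and 0.75 for three blocks, the second block (index 1) shows the largest drop and would be ranked as most important for that task.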
Alongside the ablation study, internal
pair representations and attention
patterns will be extracted and visualised
at each block depth to characterise
what geometric and structural information
is encoded at different stages of the
Pairformer. An additional analysis will
investigate the hallucination phenomenon
in Protenix — where the diffusion module
incorrectly generates ordered structures
in intrinsically disordered protein
regions — by comparing internal pair
representations of correctly predicted,
hallucinated, and genuinely disordered
regions to identify representation
signatures that precede hallucination
events.
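One simple way to compare the three region classes is to average their feature vectors and measure the similarity between the class centroids. The sketch below assumes each region has already been reduced to a fixed-length vector (for example by pooling its pair representation); the pooling step, the class labels, and the choice of cosine similarity are illustrative assumptions rather than the project's stated method.

```python
import math
from typing import Dict, List, Tuple


def centroid(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def class_similarities(
    features: Dict[str, List[List[float]]],
) -> Dict[Tuple[str, str], float]:
    """Pairwise centroid similarity between region classes.

    Keys are (label_a, label_b) pairs with label_a < label_b alphabetically.
    """
    centroids = {label: centroid(vecs) for label, vecs in features.items()}
    labels = sorted(centroids)
    return {
        (a, b): cosine_similarity(centroids[a], centroids[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }
```

Repeating this comparison at each block depth would show at which stage, if any, the hallucinated regions begin to resemble ordered regions more than disordered ones.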
The findings will be used to design
and evaluate a task-adaptive conditional
routing framework that dynamically
activates only the Pairformer blocks
critical for the detected biological
task type, offering a principled and
computationally efficient alternative
to the uniform block pruning demonstrated
in Protenix-Mini. To our knowledge, this
work is the first systematic interpretability
study of the AF3/Protenix Pairformer
architecture and has direct applications
in drug discovery, antiviral research,
and efficient deployment of protein
structure prediction models.
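As a rough illustration of task-adaptive routing, the sketch below turns per-task importance scores into a table of blocks to keep active. The thresholding rule is an assumption made for this example; the project description does not specify the routing criterion, and all names here are hypothetical.

```python
from typing import Dict, List


def select_blocks(drops: List[float], threshold: float) -> List[int]:
    """Indices of blocks whose ablation drop is at least `threshold`."""
    return [i for i, drop in enumerate(drops) if drop >= threshold]


def build_routing_table(
    importance: Dict[str, List[float]],
    threshold: float,
) -> Dict[str, List[int]]:
    """Map each task type to the list of blocks that stay active for it."""
    return {
        task: select_blocks(drops, threshold)
        for task, drops in importance.items()
    }
```

Unlike uniform pruning, which removes the same blocks for every input, a table of this kind can keep different blocks for different task types.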
The project will utilise the Protenix
open-source codebase — a PyTorch-based
faithful reproduction of AlphaFold3 —
as the primary deep learning framework
for all structure prediction and
ablation experiments. Protenix will
be deployed with pre-trained model
weights (~368 million parameters)
and modified to support block-wise
bypassing of individual Pairformer
blocks for systematic ablation analysis.
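The control flow of block-wise bypassing can be sketched without any framework. The example below assumes each block maps a (single, pair) representation tuple to an updated tuple of the same shapes, so a bypassed block acts as the identity. The class and method names are hypothetical and do not reflect Protenix's actual classes; in PyTorch the same loop would sit inside a module's forward method.

```python
from typing import Any, Callable, Iterable, List, Set, Tuple

# A block takes (single, pair) representations and returns updated ones.
Block = Callable[[Any, Any], Tuple[Any, Any]]


class AblatableStack:
    """Runs blocks in order, skipping any whose index is marked as bypassed."""

    def __init__(self, blocks: List[Block]) -> None:
        self.blocks = list(blocks)
        self.bypassed: Set[int] = set()

    def set_bypassed(self, indices: Iterable[int]) -> None:
        """Choose which block indices to skip on subsequent calls."""
        chosen = set(indices)
        invalid = [i for i in chosen if not 0 <= i < len(self.blocks)]
        if invalid:
            raise IndexError(f"block indices out of range: {sorted(invalid)}")
        self.bypassed = chosen

    def __call__(self, single: Any, pair: Any) -> Tuple[Any, Any]:
        for index, block in enumerate(self.blocks):
            if index in self.bypassed:
                continue  # identity: both representations pass through unchanged
            single, pair = block(single, pair)
        return single, pair
```

A single-block ablation run then amounts to calling `set_bypassed([i])` before inference and `set_bypassed([])` to restore the full model, leaving the pre-trained weights untouched.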
The following software stack will
be used throughout the project:
Deep Learning and Scientific Computing:
PyTorch (GPU-accelerated deep learning),
NumPy, SciPy (numerical computation),
scikit-learn (statistical analysis and machine learning),
UMAP and PCA (dimensionality reduction for representation analysis),
Matplotlib and Seaborn (visualisation and figure generation).
Structural Biology Tools:
PyMOL and UCSF Chimera (3D structural visualisation),
Biopython (sequence and structure file processing),
MDAnalysis (molecular structure analysis).
Workflow and Data Management:
CUDA-enabled GPU inference via PyTorch,
Conda environment management,
HDF5 and NumPy binary formats for efficient storage of extracted pair representations and attention weight tensors.
The primary computational workload
consists of repeated GPU-accelerated
inference runs across the Protenix
model for systematic block ablation
(144 ablation runs: 48 blocks × 3
task categories), extraction and
storage of intermediate pair
representation tensors at each of
the 48 Pairformer block depths,
and attention weight matrix extraction
across multiple benchmark protein
complexes. Because the ablation runs are
independent of one another, these workloads
are highly parallelisable and will be
distributed across multiple GPUs
using PyTorch DataParallel and
batch job scheduling.
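The 144-run grid can be enumerated so that each batch job picks its own (block, task) pair from a single integer. This is a sketch assuming the scheduler supplies a job-array index from 0 to 143; the task labels and function names are illustrative.

```python
from itertools import product
from typing import List, Sequence, Tuple

TASKS: Tuple[str, ...] = (
    "protein-protein",
    "protein-ligand",
    "protein-nucleic-acid",
)
N_BLOCKS = 48


def ablation_jobs(
    n_blocks: int = N_BLOCKS,
    tasks: Sequence[str] = TASKS,
) -> List[Tuple[int, str]]:
    """All (block_index, task) pairs, ordered task by task."""
    return [(block, task) for task, block in product(tasks, range(n_blocks))]


def job_for_index(
    index: int,
    n_blocks: int = N_BLOCKS,
    tasks: Sequence[str] = TASKS,
) -> Tuple[int, str]:
    """Map a flat job-array index to its (block_index, task) pair."""
    if not 0 <= index < n_blocks * len(tasks):
        raise IndexError(f"job index {index} is outside the ablation grid")
    task_index, block = divmod(index, n_blocks)
    return block, tasks[task_index]
```

Each array job would call `job_for_index` with its own index, run inference with that block bypassed on the benchmark set for that task, and write its scores to a separate file, so no communication between jobs is needed.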
The project requires GPU compute
with a minimum of 24 GB of GPU memory
per device to accommodate Protenix
model weights and intermediate
activations during inference.
High-memory CPU nodes will be
utilised for post-processing,
dimensionality reduction, and
statistical analysis of extracted
representations. Significant storage
capacity is required for intermediate
activation tensors, with an estimated
storage requirement of 500 GB–2 TB
depending on benchmark dataset size
and representation extraction depth.
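A back-of-envelope calculation shows how the storage estimate scales. The sketch assumes a pair representation of shape (tokens, tokens, channels) with 128 channels stored as 32-bit floats, saved once per block; the channel width and precision are assumptions for illustration and should be checked against the actual model configuration.

```python
def pair_tensor_bytes(
    n_tokens: int,
    channels: int = 128,       # assumed pair channel width
    bytes_per_value: int = 4,  # assumed float32 storage
) -> int:
    """Size in bytes of one (n_tokens, n_tokens, channels) pair tensor."""
    return n_tokens * n_tokens * channels * bytes_per_value


def per_complex_bytes(
    n_tokens: int,
    n_blocks: int = 48,
    channels: int = 128,
    bytes_per_value: int = 4,
) -> int:
    """Size in bytes when the pair tensor is saved after every block."""
    return n_blocks * pair_tensor_bytes(n_tokens, channels, bytes_per_value)
```

Under these assumptions a 512-token complex needs 128 MiB per block and 6 GiB across all 48 blocks, so on the order of 300 such complexes would fill 2 TB. Because the size grows with the square of the token count, storing at half precision or at a subset of block depths reduces the requirement substantially for larger complexes.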
All software used is open-source
and freely available. No proprietary
licensed software is required for
this project.