pgap

NCBI Prokaryotic Genome Annotation Pipeline

The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs and pseudogenes.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Annotation Pipeline was developed in 2001 and is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR et al 2021). Recent improvements include the use of curated protein profile hidden Markov models (HMMs) and curated complex domain architectures for functional annotation of proteins, as well as annotation of Enzyme Commission numbers and Gene Ontology terms. After annotation, the completeness of the annotated gene set is estimated with CheckM.

The workflow provided here also offers the option to confirm or correct the organism associated with the genome assembly prior to starting the annotation, using the Average Nucleotide Identity tool.

See: https://github.com/ncbi/pgap

Running pgap on Comet

PGAP runs under Python and Apptainer, and depends on a large amount of downloaded image data on which the pipelines are based. To avoid duplicate copies, we have already downloaded this data and set up the PGAP module to work from a central, shared image data directory.

You can run pgap as follows:

$ module load PGAP
$ pgap -h
usage: pgap.py [-h] [-g GENOME] [-s ORGANISM] [-V] [-v] [--taxcheck | --taxcheck-only] [--auto-correct-tax]
               [-l | -u] [-r | -n] [--container-name CONTAINER_NAME] [--container-path CONTAINER_PATH]
               [--ignore-all-errors] [--no-internet] [-D path] [-o path] [-q] [--prefix PREFIX]
               [--no-self-update] [-c CPUS] [-m MEMORY] [-d]

Input must be provided as:
 1. a fasta/organism pair, e.g.
    pgap.py ... -g input.fasta -s 'Escherichia coli'
or
 2. a YAML configuration file, e.g.
    pgap.py ... input.yaml

options:
  -h, --help            show this help message and exit
$
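
The help text above mentions a YAML configuration file as the alternative to the `-g`/`-s` pair. The following is a minimal sketch of what such an input pair might look like. The key names follow the upstream PGAP wiki as we understand it, and the working directory, FASTA file name and organism are placeholders, so check the upstream documentation before relying on it.

```shell
#!/bin/bash
# Sketch only: write a PGAP YAML input pair into a working directory.
# The directory and FASTA file names are placeholders (assumptions);
# the YAML keys are taken from the upstream PGAP wiki as we recall it.
workdir="${WORKDIR:-${TMPDIR:-/tmp}/pgap_yaml_demo}"
mkdir -p "$workdir"

# Top-level input file: points at the FASTA and at the metadata file.
cat > "$workdir/input.yaml" <<'EOF'
fasta:
  class: File
  location: my_genome.fasta
submol:
  class: File
  location: submol.yaml
EOF

# Metadata file: carries the organism name that -s would otherwise supply.
cat > "$workdir/submol.yaml" <<'EOF'
topology: 'circular'
organism:
  genus_species: 'Escherichia coli'
EOF

echo "Wrote $workdir/input.yaml and $workdir/submol.yaml"
# Inside a Slurm job you would then run something like:
#   pgap -o "$HOME/my_results" "$workdir/input.yaml"
```

The `location` entries are written relative to the YAML file, so the FASTA file is assumed to sit in the same directory.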

For more details, including how to run under Slurm, see the Quickstart Example below.

The Quickstart Example

Following the Quick Start example from the PGAP wiki page, we can run it on Comet as follows:

#!/bin/bash

#SBATCH --account=my_account
#SBATCH --partition=default_free
#SBATCH -c 8
#SBATCH --mem=16G

module load PGAP
pgap \
   -o $HOME/mg37_results \
   -g $PGAP_INPUT_DIR/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta \
   -s "Mycoplasmoides genitalium"

Note the following:

We call pgap (the local wrapper script) instead of pgap.py
We use $PGAP_INPUT_DIR to refer to the location of the already-downloaded image data
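
A batch job only reports a bad path once it has left the queue, so it can help to check the inputs first. The sketch below is not part of PGAP: `preflight` is a hypothetical helper, and the rule that the output directory should not exist yet is our assumption about how PGAP treats `-o`.

```shell
#!/bin/bash
# Hypothetical helper (not part of PGAP): returns 0 if the inputs look
# usable, 1 otherwise. Assumes PGAP wants a fresh output directory.
preflight() {
    fasta="$1"
    outdir="$2"
    if [ ! -r "$fasta" ]; then
        echo "Cannot read FASTA file: $fasta" >&2
        return 1
    fi
    if [ -e "$outdir" ]; then
        echo "Output path already exists: $outdir" >&2
        return 1
    fi
    return 0
}

# Possible use inside the batch script, after 'module load PGAP':
#   fasta="$PGAP_INPUT_DIR/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta"
#   out="$HOME/mg37_results"
#   preflight "$fasta" "$out" && pgap -o "$out" -g "$fasta" -s "Mycoplasmoides genitalium"
```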

Updates

Periodically there are updates to the pgap.py script, as well as the image data used by PGAP. If you need PGAP updated, please contact us and we will update the PGAP install and data for all users of Comet. Please do not download and install another version of PGAP just for yourself.

Initial setup of pgap on Comet

Important!

This section is only of relevance to RSE HPC administrators or users who wish to understand how pgap was initially set up. If you are only interested in using pgap, then stop reading here.

Download commands

PGAP has no real install process. The only requirement is that the download goes to the correct (i.e. shared) location, so that we do not end up with multiple copies of the image data.

#!/bin/bash

# Perform the initial download of pgap containers and datasets
module load apptainer
module load Python/3.12.3

# For some reason, pgap will not
# install correctly if downloaded
# directly on to Lustre. Instead,
# build it in /scratch

# See https://github.com/ncbi/pgap/issues/333
export PGAP_INPUT_DIR=/scratch/pgap
mkdir -p /scratch/pgap
# Work from the build directory so the relative pgap.py paths below resolve
cd /scratch/pgap

# Afterwards, PGAP_INPUT_DIR gets set to /nobackup/shared/data/pgap
# for real use of the software

# Download latest pgap.py
wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py -O /scratch/pgap/pgap.py
chmod +x pgap.py

# Run the install/data download
python3 pgap.py \
	--docker apptainer \
	--update \
	-n

# Remove old pgap data, if found
rm -rf /nobackup/shared/data/pgap.old

# Rename current pgap data to old
mv /nobackup/shared/data/pgap /nobackup/shared/data/pgap.old

# Move most recent download to current pgap data
mkdir /nobackup/shared/data/pgap
mkdir /nobackup/shared/data/pgap/bin
rsync --progress -a /scratch/pgap/* /nobackup/shared/data/pgap
cp /scratch/pgap/pgap.py /nobackup/shared/data/pgap/bin

rm -rf /scratch/pgap

Shell script wrapper

This is copied to /nobackup/shared/data/pgap/bin and made executable. It is a convenience wrapper that disables auto-update and ensures that the shared data location is referred to correctly.

#!/bin/bash
export PGAP_INPUT_DIR=/nobackup/shared/data/pgap
python3 $PGAP_INPUT_DIR/bin/pgap.py \
	--docker apptainer \
	-n \
	--no-self-update \
	"$@"
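
Organism names passed with `-s` usually contain a space, so how a wrapper forwards its arguments matters. The sketch below uses three hypothetical helper functions (they are not part of the wrapper) to show how many arguments arrive when `$@` is forwarded unquoted and when it is quoted.

```shell
#!/bin/bash
# Sketch: count the arguments a command receives when a wrapper forwards
# them with unquoted $@ versus quoted "$@". All names are illustrative.
count_args() { echo "$#"; }

forward_unquoted() { count_args $@; }     # arguments are re-split on spaces
forward_quoted()   { count_args "$@"; }   # each argument is kept intact

forward_unquoted -s "Mycoplasmoides genitalium"   # prints 3
forward_quoted   -s "Mycoplasmoides genitalium"   # prints 2
```

With the unquoted form the organism name reaches the command as two separate words, which is why the quoted form is the safer choice in a wrapper.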

Module file

Copied to /opt/software/manual/modules/PGAP.

--%Module
-- PGAP 2025-05-06 module file

whatis("Name: PGAP")
whatis("Version: 2025-05-06")
whatis("Description: PGAP is the NCBI Prokaryotic Genome Annotation Pipeline designed to annotate bacterial and archaeal genomes.")

help([[
PGAP 2025-05-06
---------------

Installed in:
  /nobackup/shared/data/pgap

Usage:
  Run `pgap` to start PGAP; you do not need to load Python or Apptainer yourself

Example:
  `pgap -h`

Notes:
  Genetic data and container images used by PGAP have been downloaded to /nobackup/shared/data/pgap, you
  do not need to re-download your own.

  If you want to update PGAP data, please contact the RSE HPC team (hpc.researchcomputing@ncl.ac.uk). 

  For more information: https://hpc.researchcomputing.ncl.ac.uk/dokuwiki/dokuwiki/doku.php?id=advanced:software:pgap

]])

-- Set environment variables and other modules
local root = "/nobackup/shared/data/pgap"
load("apptainer")
load("Python/3.12.3")

-- PGAP shell script
prepend_path("PATH", pathJoin(root, "bin"))

-- Convenience for wrappers, workflows, and users
setenv("PGAP_INPUT_DIR", root)
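
After loading the module, the two things it is meant to provide (the `pgap` wrapper on `PATH` and the `PGAP_INPUT_DIR` variable) can be checked from the shell. A minimal sketch, where `check_pgap_env` is a hypothetical helper and not something the module installs:

```shell
#!/bin/bash
# Hypothetical helper: report whether the PGAP module environment is in place.
# Returns 0 when both checks pass, 1 otherwise.
check_pgap_env() {
    problems=0
    if ! command -v pgap >/dev/null 2>&1; then
        echo "pgap not found on PATH (has 'module load PGAP' been run?)" >&2
        problems=1
    fi
    if [ -z "${PGAP_INPUT_DIR:-}" ] || [ ! -d "$PGAP_INPUT_DIR" ]; then
        echo "PGAP_INPUT_DIR is unset or is not a directory" >&2
        problems=1
    fi
    return "$problems"
}

# Possible use:
#   module load PGAP
#   check_pgap_env && echo "PGAP environment looks fine"
```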

Fix for Slurm cgroups error

PGAP asks Apptainer to enforce CPU and memory limits through cgroups, but inside a Slurm job those limits are already handled by Slurm. Apptainer's attempt to apply its own cgroup limits then fails with an error such as:

FATAL:   container creation failed: while applying cgroups config: while setting cgroup limits: openat2 /sys/fs/cgroup/user.slice/user-12345.slice/user@12345.service/user.slice/apptainer-2341344.scope/cpu.max: no such file or directory

See: https://github.com/ncbi/pgap/issues/352

This is traced to the following function in pgap.py:

def make_singularity_cmd(self):
        self.cmd = [self.params.docker_cmd, 'exec' ]

        cpusEnv = get_cpus(self)
        if (cpusEnv):
            self.cmd.extend(['--cpus', str(get_cpus(self))])

        if self.params.args.no_internet:
            self.cmd.extend(['--network=none'])

        if (self.params.args.memory):
            self.cmd.extend(['--memory', self.params.args.memory])

The local version of pgap.py has been patched so that it no longer passes those arguments. The limits are instead enforced on the whole job by Slurm itself:

def make_singularity_cmd(self):
        self.cmd = [self.params.docker_cmd, 'exec' ]

        cpusEnv = get_cpus(self)
        #if (cpusEnv):
        #    self.cmd.extend(['--cpus', str(get_cpus(self))])

        if self.params.args.no_internet:
            self.cmd.extend(['--network=none'])

        #if (self.params.args.memory):
        #    self.cmd.extend(['--memory', self.params.args.memory])

When updating pgap.py, ensure that the --cpus and --memory calls are disabled again, as above.
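
One way to confirm the patch after an update is to scan the function for active resource arguments. The sketch below is an assumption-based check, not an official tool: `check_pgap_patch` is a hypothetical helper, and it only looks inside `make_singularity_cmd`, matching the excerpt shown above.

```shell
#!/bin/bash
# Hypothetical check (not part of PGAP): succeeds only if the body of
# make_singularity_cmd has no active --cpus/--memory arguments.
check_pgap_patch() {
    script="$1"
    if [ ! -r "$script" ]; then
        echo "Cannot read $script" >&2
        return 2
    fi
    # awk prints the non-comment lines between "def make_singularity_cmd"
    # and the next "def"; grep then looks for the two resource flags.
    if awk '
        /def make_singularity_cmd/ { inside = 1; next }
        inside && /^[ \t]*def /    { inside = 0 }
        inside && !/^[ \t]*#/      { print }
    ' "$script" | grep -Eq "cmd\.extend\(\[['\"]--(cpus|memory)['\"]"; then
        echo "Active --cpus/--memory arguments found in $script" >&2
        return 1
    fi
    echo "OK: resource arguments are disabled in $script"
}

# Possible use, with the path from the wrapper script section:
#   check_pgap_patch /nobackup/shared/data/pgap/bin/pgap.py
```

A non-zero return means the update has reintroduced the arguments (or the file could not be read), and the patch should be reapplied before users run jobs.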
