====== pgap ======

> NCBI Prokaryotic Genome Annotation Pipeline
>
> The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).
>
> Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs and pseudogenes.
>
> NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. The first version of NCBI Prokaryotic Genome Pipeline was developed in 2001 and is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR et al 2021). Recent improvements include utilization of curated protein profile hidden Markov models (HMMs), and curated complex domain architectures for functional annotation of proteins and annotation of Enzyme Commission numbers and Gene Ontology terms. Post-annotation, the completeness of the annotated gene set is estimated with CheckM.
>
> The workflow provided here also offers the option to confirm or correct the organism associated with the genome assembly prior to starting the annotation, using the Average Nucleotide Identity tool.

   * See: https://github.com/ncbi/pgap

----

===== Running pgap on Comet =====

PGAP uses **Python** and **Apptainer** to run. It also downloads a significant amount of image data on which the pipelines are based. To avoid each user keeping a duplicate copy of this data, **we have already downloaded it and set up the PGAP module to work from a central, shared image data directory**.

You can run ''pgap'' as follows:

<code lang=bash>
$ module load PGAP
$ pgap -h
usage: pgap.py [-h] [-g GENOME] [-s ORGANISM] [-V] [-v] [--taxcheck | --taxcheck-only] [--auto-correct-tax]
               [-l | -u] [-r | -n] [--container-name CONTAINER_NAME] [--container-path CONTAINER_PATH]
               [--ignore-all-errors] [--no-internet] [-D path] [-o path] [-q] [--prefix PREFIX]
               [--no-self-update] [-c CPUS] [-m MEMORY] [-d]

Input must be provided as:
 1. a fasta/organism pair, e.g.
    pgap.py ... -g input.fasta -s 'Escherichia coli'
or
 2. a YAML configuration file, e.g.
    pgap.py ... input.yaml

options:
  -h, --help            show this help message and exit
$
</code>

For more details, including how to run under Slurm, see the //Quickstart Example// below.

==== The Quickstart Example ====

Following the [[https://github.com/ncbi/pgap/wiki/Quick-Start|Quick Start]] example from the PGAP wiki page, we can run it on Comet as follows:

<code lang=bash>
#!/bin/bash

#SBATCH --account=my_account
#SBATCH --partition=default_free
#SBATCH -c 8
#SBATCH --mem=16G

module load PGAP
pgap \
   -o $HOME/mg37_results \
   -g $PGAP_INPUT_DIR/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta \
   -s "Mycoplasmoides genitalium"
</code>

Note the following:

   * We call ''pgap'' (the wrapper script provided by the module) instead of ''pgap.py''
   * We use ''$PGAP_INPUT_DIR'' to refer to the location of the already-downloaded image data
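
Checking the inputs before calling ''pgap'' can stop a queued job from failing as soon as it starts. The function below is a minimal sketch and is not part of the PGAP module: ''pgap_preflight'' is a hypothetical helper name, and the output-directory check assumes that ''pgap.py'' will not write into a directory that already exists (worth verifying against the installed version).

<code lang=bash>
#!/bin/bash
# Hypothetical helper (not shipped with the PGAP module).
# Usage: pgap_preflight <genome.fasta> <output_dir>
pgap_preflight() {
    local fasta="$1" outdir="$2"

    # PGAP_INPUT_DIR is set by "module load PGAP"
    if [ -z "${PGAP_INPUT_DIR:-}" ]; then
        echo "PGAP_INPUT_DIR is not set: run 'module load PGAP' first" >&2
        return 1
    fi
    # The genome FASTA must exist and be readable
    if [ ! -r "$fasta" ]; then
        echo "Cannot read genome FASTA: $fasta" >&2
        return 1
    fi
    # Assumption: pgap.py will not reuse an existing output directory
    if [ -e "$outdir" ]; then
        echo "Output directory already exists: $outdir" >&2
        return 1
    fi
    return 0
}
</code>

In a batch script this could be called after ''module load PGAP'' and before ''pgap'', for example ''pgap_preflight "$fasta" "$HOME/mg37_results" || exit 1'', where ''$fasta'' stands for your own input path.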

----

===== Updates =====

Periodically there are updates to the ''pgap.py'' script, as well as to the image data used by PGAP. If you need PGAP to be updated, please [[contact:index|contact us]] and we will update the PGAP install and data for //all// users of Comet. Please //do not// download and install another version of PGAP just for yourself.

----

===== Initial setup of pgap on Comet =====

<WRAP round box important>
**Important!**

This section is only of relevance to RSE HPC administrators or users who wish to understand how pgap was initially set up. If you are only interested in //using// pgap, then stop reading here.
</WRAP>

**Download commands**

There is no real //install// process for PGAP. The only requirement is that the download goes to the correct (i.e. //shared//) location, so that we do not end up with multiple copies of the image data.

<code lang=bash>
#!/bin/bash

# Perform the initial download of pgap containers and datasets
module load apptainer
module load Python/3.12.3

# For some reason, pgap will not
# install correctly if downloaded
# directly on to Lustre. Instead,
# build it in /scratch

# See https://github.com/ncbi/pgap/issues/333
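export PGAP_INPUT_DIR=/scratch/pgap
mkdir -p "$PGAP_INPUT_DIR"

# Work inside the build directory: the chmod and python3
# calls below refer to pgap.py by relative path
cd "$PGAP_INPUT_DIR" || exit 1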

# Afterwards, PGAP_INPUT_DIR gets set to /nobackup/shared/data/pgap
# for real use of the software

# Download latest pgap.py
wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py -O /scratch/pgap/pgap.py
chmod +x pgap.py

# Run the install/data download
python3 pgap.py \
	--docker apptainer \
	--update \
	-n

# Remove old pgap data, if found
rm -rf /nobackup/shared/data/pgap.old

# Rename current pgap data to old
mv /nobackup/shared/data/pgap /nobackup/shared/data/pgap.old

# Move most recent download to current pgap data
mkdir -p /nobackup/shared/data/pgap/bin
# Trailing slash on the source copies its contents, including any dotfiles
rsync --progress -a /scratch/pgap/ /nobackup/shared/data/pgap
cp /scratch/pgap/pgap.py /nobackup/shared/data/pgap/bin

rm -rf /scratch/pgap
</code>
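
Because the script ends by deleting ''/scratch/pgap'', it may be worth confirming first that everything arrived on the shared filesystem. The following is a sketch only: ''verify_copy'' is a hypothetical helper, it compares every file byte for byte (so it may be slow on the full PGAP dataset), and it does not handle file names containing newlines.

<code lang=bash>
#!/bin/bash
# Hypothetical helper: succeed only if every file under <src>
# exists under <dst> with identical content. Extra files in
# <dst> (such as bin/pgap.py) are ignored.
# Usage: verify_copy <src> <dst>
verify_copy() {
    local src="$1" dst="$2"

    # A missing source would otherwise look like "nothing to check"
    [ -d "$src" ] || { echo "No such directory: $src" >&2; return 1; }

    # List files relative to src, then compare each one in a subshell
    (cd "$src" && find . -type f) | (
        status=0
        while IFS= read -r rel; do
            if ! cmp -s "$src/$rel" "$dst/$rel"; then
                echo "Missing or different in $dst: $rel" >&2
                status=1
            fi
        done
        exit "$status"
    )
}
</code>

One possible use is to guard the final clean-up: ''verify_copy /scratch/pgap /nobackup/shared/data/pgap && rm -rf /scratch/pgap''.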

**Shell script wrapper**

This is copied to ''/nobackup/shared/data/pgap/bin'' and made executable. It is a //convenience// feature that disables auto-update and ensures that the shared data location is referred to correctly.

<code lang=bash title=pgap>
#!/bin/bash
export PGAP_INPUT_DIR=/nobackup/shared/data/pgap
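# Quote "$@" so that arguments containing spaces
# (e.g. organism names) are passed through intact
python3 "$PGAP_INPUT_DIR/bin/pgap.py" \
	--docker apptainer \
	-n \
	--no-self-update \
	"$@"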
</code>
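
When a wrapper forwards its arguments, the form of the expansion matters for values that contain spaces, such as the organism name passed to ''-s''. The sketch below uses hypothetical helper functions (they are not part of the wrapper) to compare unquoted ''$@'' with quoted ''"$@"'':

<code lang=bash>
#!/bin/bash
# Print how many arguments were received
count_args() { echo "$#"; }

# Unquoted: each argument is split again on whitespace
forward_unquoted() { count_args $@; }

# Quoted: each argument is passed through intact
forward_quoted() { count_args "$@"; }

forward_unquoted -s "Mycoplasmoides genitalium"   # prints 3
forward_quoted   -s "Mycoplasmoides genitalium"   # prints 2
</code>

With the unquoted form, the organism name reaches the called program as two separate arguments.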

**Module file**

Copied to ''/opt/software/manual/modules/PGAP''.

<code lang=lua title=2025-05-06.lua>
--%Module
-- PGAP 2025-05-06 module file

whatis("Name: PGAP")
whatis("Version: 2025-05-06")
whatis("Description: PGAP is the NCBI Prokaryotic Genome Annotation Pipeline designed to annotate bacterial and archaeal genomes.")

help([[
PGAP 2025-05-06
---------------

Installed in:
  /nobackup/shared/data/pgap

Usage:
  Run `pgap` to start PGAP; you do not need to load Python or Apptainer yourself

Example:
  `pgap -h`

Notes:
  Genetic data and container images used by PGAP have been downloaded to /nobackup/shared/data/pgap, so
  you do not need to download your own copy.

  If you want to update PGAP data, please contact the RSE HPC team (hpc.researchcomputing@ncl.ac.uk). 

  For more information: https://hpc.researchcomputing.ncl.ac.uk/dokuwiki/dokuwiki/doku.php?id=advanced:software:pgap

]])

-- Set environment variables and other modules
local root = "/nobackup/shared/data/pgap"
load("apptainer")
load("Python/3.12.3")

-- PGAP shell script
prepend_path("PATH", pathJoin(root, "bin"))

-- Convenience for wrappers, workflows, and users
setenv("PGAP_INPUT_DIR", root)
</code>

**Fix for Slurm cgroups error**

PGAP asks Apptainer to set CPU and memory limits with the aid of //cgroups//, but inside a Slurm job these limits are already managed by Slurm. Apptainer's attempt to apply its own limits then fails with an error such as:

<code>
FATAL:   container creation failed: while applying cgroups config: while setting cgroup limits: openat2 /sys/fs/cgroup/user.slice/user-12345.slice/user@12345.service/user.slice/apptainer-2341344.scope/cpu.max: no such file or directory
</code>

   * See: https://github.com/ncbi/pgap/issues/352

This is traced to the following function in ''pgap.py'':

<code lang=python title=pgap.py.orig>
def make_singularity_cmd(self):
        self.cmd = [self.params.docker_cmd, 'exec' ]

        cpusEnv = get_cpus(self)
        if (cpusEnv):
            self.cmd.extend(['--cpus', str(get_cpus(self))])

        if self.params.args.no_internet:
            self.cmd.extend(['--network=none'])

        if (self.params.args.memory):
            self.cmd.extend(['--memory', self.params.args.memory])
</code>

The local version of ''pgap.py'' has been patched so that the ''--cpus'' and ''--memory'' arguments are no longer passed to Apptainer. The equivalent limits are instead enforced on the whole job by Slurm itself:

<code lang=python title=pgap.py>
def make_singularity_cmd(self):
        self.cmd = [self.params.docker_cmd, 'exec' ]

        cpusEnv = get_cpus(self)
        #if (cpusEnv):
        #    self.cmd.extend(['--cpus', str(get_cpus(self))])

        if self.params.args.no_internet:
            self.cmd.extend(['--network=none'])

        #if (self.params.args.memory):
        #    self.cmd.extend(['--memory', self.params.args.memory])
</code>

<WRAP round box important>
When updating ''pgap.py'', re-apply this patch: ensure that the ''--cpus'' and ''--memory'' calls are commented out, as above.
</WRAP>
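
One way to catch a missed patch after an update is to scan the function for active ''--cpus'' or ''--memory'' lines. This is a sketch under assumptions: ''check_pgap_patch'' is a hypothetical helper, and it assumes that the function is still named ''make_singularity_cmd'' and that the calls still look like those shown above. If upstream restructures the code, it may report nothing even though limits are still being passed, so treat a clean result as a hint rather than proof.

<code lang=bash>
#!/bin/bash
# Hypothetical helper: exit status 1 (and print the offending
# lines) if make_singularity_cmd still passes --cpus or --memory.
# Usage: check_pgap_patch /path/to/pgap.py
check_pgap_patch() {
    awk '
        # On each "def" line, record whether it opens make_singularity_cmd
        /^[ \t]*def[ \t]/ {
            in_fn = ($0 ~ /def[ \t]+make_singularity_cmd[ \t]*\(/)
        }
        # Inside that function, report uncommented limit arguments
        in_fn && /^[ \t]*self\.cmd\.extend\(\[.--(cpus|memory)./ {
            print FILENAME ":" FNR ": " $0
            found = 1
        }
        END { exit (found ? 1 : 0) }
    ' "$1"
}
</code>

For example, ''check_pgap_patch /nobackup/shared/data/pgap/bin/pgap.py || echo "pgap.py needs patching"'' could be run after each update.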

----

[[:advanced:software|Back to software]]