====== pgap ======

This software and the user guide are still under development; pgap is not yet available to use on Comet.

> NCBI Prokaryotic Genome Annotation Pipeline
>
> The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).
>
> Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs and pseudogenes.
>
> NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. The first version of NCBI Prokaryotic Genome Pipeline was developed in 2001 and is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR et al 2021). Recent improvements include utilization of curated protein profile hidden Markov models (HMMs), and curated complex domain architectures for functional annotation of proteins and annotation of Enzyme Commission numbers and Gene Ontology terms. Post-annotation, the completeness of the annotated gene set is estimated with CheckM.
>
> The workflow provided here also offers the option to confirm or correct the organism associated with the genome assembly prior to starting the annotation, using the Average Nucleotide Identity tool.

  * See: https://github.com/ncbi/pgap

----

===== Running pgap on Comet =====

PGAP uses **Python** and **Apptainer** to run. It also downloads a significant amount of image data on which the pipelines are based. To prevent duplicate data we have already downloaded this data and set up the PGAP module to work from a central, shared image data directory.
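
Once the PGAP module is loaded, you may want to confirm that your session really is pointing at the shared data rather than a private copy. The following is only a minimal sketch and is not part of PGAP: ''check_pgap_env'' is a hypothetical helper name, and it assumes that the module exports ''PGAP_INPUT_DIR'' and places ''pgap'' on your ''PATH'', as the module file shown later on this page does.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with PGAP): report whether the current
# shell is set up to use the shared PGAP data directory.
check_pgap_env() {
    # Assumption: PGAP_INPUT_DIR is exported by 'module load PGAP'
    if [ -z "${PGAP_INPUT_DIR:-}" ]; then
        echo "PGAP_INPUT_DIR is not set: run 'module load PGAP' first"
        return 1
    fi
    # The variable should point at a directory that actually exists
    if [ ! -d "$PGAP_INPUT_DIR" ]; then
        echo "PGAP_INPUT_DIR does not exist: $PGAP_INPUT_DIR"
        return 1
    fi
    # Assumption: the module also adds the pgap wrapper to PATH
    if ! command -v pgap >/dev/null 2>&1; then
        echo "pgap is not on PATH: run 'module load PGAP' first"
        return 1
    fi
    echo "Using shared PGAP data in: $PGAP_INPUT_DIR"
}
```

Calling ''check_pgap_env'' near the top of a job script, after ''module load PGAP'', gives an early and readable failure if the module was not loaded.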
You can run ''pgap'' as follows:

  $ module load PGAP
  $ pgap -h
  usage: pgap.py [-h] [-g GENOME] [-s ORGANISM] [-V] [-v]
                 [--taxcheck | --taxcheck-only] [--auto-correct-tax]
                 [-l | -u] [-r | -n] [--container-name CONTAINER_NAME]
                 [--container-path CONTAINER_PATH] [--ignore-all-errors]
                 [--no-internet] [-D path] [-o path] [-q] [--prefix PREFIX]
                 [--no-self-update] [-c CPUS] [-m MEMORY] [-d]
  Input must be provided as:
  1. a fasta/organism pair, e.g. pgap.py ... -g input.fasta -s 'Escherichia coli'
  or
  2. a YAML configuration file, e.g. pgap.py ... input.yaml
  options:
    -h, --help  show this help message and exit
  $

For more details, including how to run under Slurm, see the //Quickstart Example// below.

==== The Quickstart Example ====

Following the [[https://github.com/ncbi/pgap/wiki/Quick-Start|Quick Start]] example from the PGAP wiki page, we can run it on Comet with a Slurm job script as follows:

  #!/bin/bash
  #SBATCH --account=my_account
  #SBATCH --partition=default_free
  #SBATCH -c 8
  #SBATCH --mem=16G
  module load PGAP
  pgap \
    -o "$HOME/mg37_results" \
    -g "$PGAP_INPUT_DIR/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta" \
    -s "Mycoplasmoides genitalium"

Note the following:

  * We call ''pgap'' instead of ''pgap.py''
  * We use ''$PGAP_INPUT_DIR'' to refer to the location of the already-downloaded image data

----

===== Updates =====

Periodically there are updates to the ''pgap.py'' script, as well as the image data used by PGAP. If you need PGAP updating, then please [[contact:index|Contact us]], and we will update the PGAP install and data for //all// users of Comet. Please //do not// download and install another version of PGAP just for yourself.

----

===== Initial setup of pgap on Comet =====

**Important!** This section is only of relevance to RSE HPC administrators or users who wish to understand how pgap was initially set up. If you are only interested in //using// pgap, then stop reading here.

**Download commands**

There's no real //install// process for PGAP, just ensuring that the download is to the correct (i.e. //shared//) location, so that we do not end up with multiple copies of the image data.

  #!/bin/bash
  # Perform the initial download of pgap containers and datasets
  module load apptainer
  module load Python/3.12.3
  # For some reason, pgap will not
  # install correctly if downloaded
  # directly on to Lustre. Instead,
  # build it in /scratch
  # See https://github.com/ncbi/pgap/issues/333
  export PGAP_INPUT_DIR=/scratch/pgap
  mkdir -p /scratch/pgap
  # Work inside the scratch directory so that the relative
  # references to pgap.py below resolve to the downloaded copy
  cd /scratch/pgap
  # Afterwards, PGAP_INPUT_DIR gets set to /nobackup/shared/data/pgap
  # for real use of the software
  # Download latest pgap.py
  wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py -O /scratch/pgap/pgap.py
  chmod +x pgap.py
  # Run the install/data download
  python3 pgap.py \
    --docker apptainer \
    --update \
    -n
  # Remove old pgap data, if found
  rm -rf /nobackup/shared/data/pgap.old
  # Rename current pgap data to old
  mv /nobackup/shared/data/pgap /nobackup/shared/data/pgap.old
  # Move most recent download to current pgap data
  mkdir /nobackup/shared/data/pgap
  mkdir /nobackup/shared/data/pgap/bin
  rsync --progress -a /scratch/pgap/* /nobackup/shared/data/pgap
  cp /scratch/pgap/pgap.py /nobackup/shared/data/pgap/bin
  rm -rf /scratch/pgap

**Shell script wrapper**

This is copied to ''/nobackup/shared/data/pgap/bin'' and made executable. It is a //convenience// feature that disables auto-update and ensures that the shared data location is referred to correctly.

  #!/bin/bash
  export PGAP_INPUT_DIR=/nobackup/shared/data/pgap
  # "$@" is quoted so that arguments containing spaces,
  # such as -s "Mycoplasmoides genitalium", are passed through intact
  python3 "$PGAP_INPUT_DIR/bin/pgap.py" \
    --docker apptainer \
    -n \
    --no-self-update \
    "$@"

**Module file**

Copied to ''/opt/software/manual/modules/PGAP''.

  --%Module
  -- PGAP 2025-05-06 module file
  whatis("Name: PGAP")
  whatis("Version: 2025-05-06")
  whatis("Description: PGAP is the NCBI Prokaryotic Genome Annotation Pipeline designed to annotate bacterial and archaeal genomes.")
  help([[
  PGAP 2025-05-06
  ---------------
  Installed in: /nobackup/shared/data/pgap
  Usage: Run `pgap` to run PGAP, you do not need to load Python or Apptainer yourself
  Example: `pgap -h`
  Notes: Genetic data and container images used by PGAP have been downloaded to /nobackup/shared/data/pgap, you do not need to re-download your own. If you want to update PGAP data, please contact the RSE HPC team (hpc.researchcomputing@ncl.ac.uk).
  For more information: https://hpc.researchcomputing.ncl.ac.uk/dokuwiki/dokuwiki/doku.php?id=advanced:software:pgap
  ]])
  -- Set environment variables and other modules
  local root = "/nobackup/shared/data/pgap"
  load("apptainer")
  load("Python/3.12.3")
  -- PGAP shell script
  prepend_path("PATH", pathJoin(root, "bin"))
  -- Convenience for wrappers, workflows, and users
  setenv("PGAP_INPUT_DIR", root)

**Fix for Slurm cgroups error**

PGAP attempts to set resource limits in Apptainer with the aid of //cgroups//, but this is already handled by Slurm. By default this will cause an error due to how Apptainer tries to manage the limits.
This will manifest in an error such as:

  FATAL: container creation failed: while applying cgroups config: while setting cgroup limits: openat2 /sys/fs/cgroup/user.slice/user-12345.slice/user@12345.service/user.slice/apptainer-2341344.scope/cpu.max: no such file or directory

  * See: https://github.com/ncbi/pgap/issues/352

This is traced to the following function in ''pgap.py'':

  def make_singularity_cmd(self):
      self.cmd = [self.params.docker_cmd, 'exec' ]
      cpusEnv = get_cpus(self)
      if (cpusEnv):
          self.cmd.extend(['--cpus', str(get_cpus(self))])
      if self.params.args.no_internet:
          self.cmd.extend(['--network=none'])
      if (self.params.args.memory):
          self.cmd.extend(['--memory', self.params.args.memory])

The local version of ''pgap.py'' has been patched to remove those arguments, which are instead enforced on the entire job runtime by Slurm itself:

  def make_singularity_cmd(self):
      self.cmd = [self.params.docker_cmd, 'exec' ]
      cpusEnv = get_cpus(self)
      #if (cpusEnv):
      #    self.cmd.extend(['--cpus', str(get_cpus(self))])
      if self.params.args.no_internet:
          self.cmd.extend(['--network=none'])
      #if (self.params.args.memory):
      #    self.cmd.extend(['--memory', self.params.args.memory])

If updating ''pgap.py'', ensure that those calls are disabled, as above.

----

[[:advanced:software|Back to software]]
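
**Checking the cgroups patch after an update**

Because the Slurm cgroups fix has to be reapplied whenever ''pgap.py'' is replaced, a quick automated check can help. The following is a rough sketch only, not an official PGAP tool: ''check_pgap_patch'' is a hypothetical helper name, and it assumes that the function is still called ''make_singularity_cmd'' and that the arguments still appear as the literal strings ''--cpus'' and ''--memory'', as in the excerpt above. If upstream restructures the script, the check may miss the problem.

```shell
#!/bin/bash
# Hypothetical helper: report whether make_singularity_cmd in a given
# pgap.py still passes --cpus / --memory to Apptainer.
check_pgap_patch() {
    local script="$1"
    local active
    if [ ! -r "$script" ]; then
        echo "Cannot read: $script"
        return 2
    fi
    # Look only inside make_singularity_cmd (up to the next 'def'),
    # and ignore lines that have been commented out with '#'.
    active=$(awk '
        /^[ \t]*def make_singularity_cmd/ { in_func = 1; next }
        in_func && /^[ \t]*def /           { in_func = 0 }
        in_func && !/^[ \t]*#/ && /--(cpus|memory)/ { print FNR ": " $0 }
    ' "$script")
    if [ -n "$active" ]; then
        echo "UNPATCHED: resource-limit arguments still active in $script"
        echo "$active"
        return 1
    fi
    echo "OK: no active --cpus/--memory arguments in make_singularity_cmd"
}
```

For example, ''check_pgap_patch /nobackup/shared/data/pgap/bin/pgap.py'' could be run after the download commands above and before users are told that the update is complete. Restricting the search to the one function is deliberate, since other parts of the script may legitimately use the same argument names.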