This is an active project making use of HPC facilities at Newcastle University.
For further information about this project, please contact:
This project uses HPC resources for large-scale processing and analysis of text corpora for research in accounting and finance. The corpus contains tens of thousands of transcripts spanning multiple years, and the study requires sentence-level processing, yielding several million sentence records. The work involves building and repeatedly re-running an NLP pipeline (parsing/cleaning, sentence segmentation, tokenisation, and feature extraction) as preprocessing settings and model specifications are iteratively refined. HPC is needed because end-to-end reprocessing at this scale is computationally and memory intensive: running the full workflow on a personal computer is prohibitively slow and constrains other research tasks. Access to HPC will enable faster, more reliable processing and iteration, supporting reproducible text-based measures for downstream empirical analyses and research outputs.
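As an illustration of the sentence-level step, the sketch below segments transcripts into sentence records with spaCy. The directory layout, model choice, and record fields are assumptions made for the example, not the project's actual configuration.

```python
# Minimal sketch: turn a directory of plain-text transcripts into
# sentence-level records. Paths and fields are illustrative.
from pathlib import Path

import spacy

# Small English pipeline; the project may use a different model.
nlp = spacy.load("en_core_web_sm")

def sentence_records(transcript_dir):
    """Yield one record per sentence across all transcripts."""
    for path in sorted(Path(transcript_dir).glob("*.txt")):
        doc = nlp(path.read_text(encoding="utf-8"))
        for i, sent in enumerate(doc.sents):
            yield {
                "transcript": path.stem,   # hypothetical file naming
                "sentence_id": i,
                "text": sent.text.strip(),
                "tokens": [tok.text for tok in sent],
            }

if __name__ == "__main__":
    for record in sentence_records("transcripts/"):
        print(record["transcript"], record["sentence_id"], record["text"][:60])
```

At corpus scale, the per-file loop would typically be replaced with spaCy's batched `nlp.pipe(..., n_process=...)` so that segmentation parallelises across the CPU cores of a single node.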
This Python-based workflow uses Stanford CoreNLP and spaCy/NLTK for large-scale text processing and word embedding extraction. GPUs and high-memory CPU nodes are required for Transformer-based inference and for memory-intensive parsing of millions of records. HPC enables these compute-heavy tasks to be parallelised, supporting efficient and reproducible model iteration at scale.
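For the GPU-bound embedding extraction, a hedged sketch follows using the Hugging Face transformers library, which the description does not name explicitly; the model, pooling strategy, and batch size are assumptions for illustration.

```python
# Sketch of batched sentence-embedding extraction on a GPU node.
# Model name, mean pooling, and batch size are illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

@torch.no_grad()
def embed(sentences, batch_size=64):
    """Return one mean-pooled vector per input sentence."""
    chunks = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        return_tensors="pt").to(device)
        hidden = model(**enc).last_hidden_state        # (batch, tokens, dim)
        mask = enc["attention_mask"].unsqueeze(-1)     # ignore padding tokens
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens
        chunks.append(pooled.cpu())
    return torch.cat(chunks)

vectors = embed(["Revenue grew this quarter.", "Margins were flat."])
print(vectors.shape)  # (2, 768) for bert-base-uncased
```

Batching keeps GPU utilisation high while bounding memory, which matters when several million sentence records are re-embedded on each iteration of the pipeline.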