This is an example scenario, but the principles of how you would approach it (e.g. first building a pipeline to process a single input file, understanding your input data structure, and finally converting your pipeline to a task array) are applicable to all problems which are based on the analysis of many independent data files.
It's not often feasible to jump straight to a parallel task array solution, and we encourage HPC users to approach the problem in several stages:
In this example scenario we are faced with the following problem of analysing a large amount of image data:
We need to analyse files generated by an instrument. We need to transform each image to remove uncessary data or to highlight certain features. We have been provided with tens of thousands of images, all of which need to be processed before we can generate some metrics from the image.
In our example we will use the EuroSAT RGB dataset as our dataset. This is large enough to demonstrate the file quantity issue (over 27000 images), but each image is small enough to not require substantial processing time. In reality you will probably have datasets that are both large in number, as well as in complexity/size.
The processing of image data could mean anything, it may involve heuristics to identify features, the application of some image transformation algorithm or similar. But, in our case, let us assume that the workflow for analysing each input image is a relatively simple sequence of:
We know that processing thousands of these files could be performed on a local desktop, but it may take days or weeks depending on the number (and size) of files we need to analyse. We want to offload the processing to the HPC so that we don't need to keep our local machine running 24×7.
Following this example
If you want to work through this example, you can download the EuroSAT RGB image dataset and set it up with the following script:
$ wget -nc -q https://zenodo.org/records/7711810/files/EuroSAT_RGB.zip?download=1 -O EUROSAT.zip $ unzip -q -n EUROSAT.zip
The resulting directory tree should look like this:
Each of the directories should contain several thousand images, such as:
The first step is to build an image processing pipeline for a single file which performs each step of the workflow. Remember we need to:
Here we have written a very basic Python application which takes the name of an input filename (-file), and the name of an output directory (-out). It then uses the PIL module to do the image transformation we mandated, and then extracts the min, max and mean brightness of the image, finally it saves an output file containing the filename and the pixel values we extracted, to be used in later processing. Download the file below and save it as image_processor.py:
-file
-out
PIL
image_processor.py
#!/usr/bin/env python3 import argparse import sys import os from PIL import Image, ImageStat def process_an_image(filename = None, outdir = None): """ Process an image """ print(f"Opening source image: {filename}") i = Image.open(filename) # Convert to greyscale print(f"Converting to greyscale") i_greyscale = i.convert('L') # Extract pixel data print(f"Extracting pixel data") i_stat = ImageStat.Stat(i_greyscale) # Save metadata for later processing data = f"filename:{filename} min:{i_stat.extrema[0][0]} max:{i_stat.extrema[0][1]} mean:{i_stat.mean[0]}" output_filename = os.path.split(filename)[1] out_name = outdir + "/" + output_filename + ".txt" print(f"Saving pixel data as: {out_name}") f = open(out_name, "w") f.write(data) f.close # Exit sys.exit(0) if __name__ == "__main__": parser = argparse.ArgumentParser("image_processor") parser.add_argument("-file", help="Name of file to process", type=str) parser.add_argument("-out", help="Name of output directory", type=str) args = parser.parse_args() filename = None outdir = None if args.file: filename = args.file if args.out: outdir = args.out if filename and outdir: process_an_image(filename, outdir) else: print("You need to provide -file and -out arguments.") print("See: ./image_processor -h") sys.exit(1)
As this is a relatively lightweight process, on a single image file, we can test this out on a single image file from the EuroSAT dataset by running directly on one of the login nodes as follows:
# Load the Python module to ensure we are not relying on a system version of Python $ module load Python/3.12.3 # Create a directory to hold our output data $ mkdir metadata # Now run our image processor on a single file $ python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata Opening source image: EuroSAT_RGB/Residential/Residential_1.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1.jpg.txt $
Checking the output file we can see it has recorded the min/max/mean pixel brightness values as intended:
$ cat metadata/Residential_1.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1.jpg min:57 max:253 mean:82.02685546875 $
We have tested this directly at the command line, next we should check it functions correctly under Slurm.
You can now run this as a Slurm job. Here's a simple job script which would process a single image file, just as we did at the command line above. Save the contents of the file below as image_job.sh, and submit as sbatch image_job.sh:
image_job.sh
sbatch image_job.sh
#!/bin/bash #SBATCH --partition=short_free #SBATCH --account=comet_abc123 # Remember to use your own account code #SBATCH --ntasks=1 #SBATCH --mem=1g #SBATCH -t 00:02:00 module load Python/3.12.3 python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata
This should work identically to running it at the command line, though we have to wait for the job to be scheduled. In theory if we were processing very large files then we may have had to submit the Slurm job anyway, as the login node may not have had enough RAM to handle even a single large image. Let's submit the simple job file and check that it does work as intended:
$ sbatch image_job.sh Submitted batch job 1607386 $ cat slurm-1607386.out Opening source image: EuroSAT_RGB/Residential/Residential_1.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1.jpg.txt $ cat metadata/Residential_1.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1.jpg min:60 max:155 mean:92.23583984375 $
So yes, it works as well under Slurm as it did directly at the command line.
If we didn't have many images to process, we could now just add the additional images in to the script and have them processed in turn, for example:
#!/bin/bash #SBATCH --partition=short_free #SBATCH --account=comet_abc123 #SBATCH --ntasks=1 #SBATCH --mem=1g #SBATCH -t 00:02:00 module load Python/3.12.3 python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_2.jpg -out metadata python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_3.jpg -out metadata python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_4.jpg -out metadata python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_5.jpg -out metadata
However, there are more than 27000 images included in the EuroSAT data set, so it quickly becomes impractical to both list them all individually, and the time to finish increases linearly as we add each image processing command.
But, we do know our analysis pipeline works. This is the first step complete.
We know the image processing pipeline works for a single input file, but you are reading this guide because you presumably have hundreds, thousands if not tens of thousands of data files to analyse.
So how do we do that? Well, at this point there is a decision to be made, and the ease of implementation will vary depending on the question below:
Scenario 1
You have complete control over your input data files naming convention and can set them to be named or numbered sequentially.
Scenario 2
You do not have control over your input data files, or they already exist with some naming convention which needs to be maintained.
By far the easiest mechanism of moving to a task array based solution is if you have control over your input data naming convention. If you are able to arrange your input data files with sequential filenames, then it becomes almost trivial to convert to a task array.
By sequential naming, we refer to a filename which has some part of the name numbered in sequence - it does not necessarily needs to start from 1.
Example sequential numbering schemes:
file1.jpg
file2.jpg
file3.jpg
data.1.dat
data.2.dat
data.3.dat
sequence_data.5000.seq
sequence_data.5001.seq
sequence_data.5002.seq
1001.txt
1002.txt
1003.txt
If you already have files named like this, or can rename them to be like this, then you will find that you can easily convert your image_job.sh job file to a task array with almost zero effort. We'll show this below.
If you have directory trees of input files, naming schemes that are random or otherwise unable to be ordered sequentially by number, then you will need to generate an additional input file where those files can be read sequentially.
Although this sounds complicated, it is actually quite simple. If you have a single directory of files, you can use the ls command and redirection to send all of your filenames to an input file:
ls
Assuming we wanted all of the files in the EuroSAT_RGB/Residential directory we could do:
EuroSAT_RGB/Residential
$ ls EuroSAT_RGB/Residential/*.jpg | sort > residential_files.txt
This command:
.jpg
residential_files.txt
Looking at the top of the residential_files.txt file we can see the files are now listed:
$ head residential_files.txt EuroSAT_RGB/Residential/Residential_1000.jpg EuroSAT_RGB/Residential/Residential_1001.jpg EuroSAT_RGB/Residential/Residential_1002.jpg EuroSAT_RGB/Residential/Residential_1003.jpg EuroSAT_RGB/Residential/Residential_1004.jpg EuroSAT_RGB/Residential/Residential_1005.jpg EuroSAT_RGB/Residential/Residential_1006.jpg EuroSAT_RGB/Residential/Residential_1007.jpg EuroSAT_RGB/Residential/Residential_1008.jpg EuroSAT_RGB/Residential/Residential_1009.jpg $
… and it has all 3000 files:
$ wc -l residential_files.txt 3000 residential_files.txt $
We will use this residential_files.txt input file later, in our task array job.
If we cannot rename our files to be sequentially numbered, then this input file is the best way of moving to a task array based setup. If you have complicated subdirectory trees with your input files spread across multiple directories, then you may need to use something like find to list your files, rather than the simple ls command.
find
How you write a task array job file is determined by whether you have sequentially named input data, or are using the input data file method listed above.
To convert our existing image_job.sh job script to a task array, we can make the following modifications:
–array=
The numbers of the tasks Slurm will start for us get turned in to the variable ${SLURM_ARRAY_TASK_ID}. In the small example below, we are asking Slurm to launch four tasks and each task will get the number from 1 to 4 (–array=1-4).
${SLURM_ARRAY_TASK_ID}
–array=1-4
Save the file below as image_job_seq.sh and run as sbatch image_job_seq.sh:
image_job_seq.sh
sbatch image_job_seq.sh
#!/bin/bash #SBATCH --partition=default_free #SBATCH --account=comet_abc123 # Remember to use your own account code #SBATCH --mem=1g #SBATCH --nodes=1 #SBATCH --tasks=4 #SBATCH --array=1-4 #SBATCH --cpus-per-task=1 module load Python/3.12.3 python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_${SLURM_ARRAY_TASK_ID}.jpg -out metadata
We use the fact each task gets a unique number to make each task also use a differently named input image file. In this case Slurm will start the four tasks and the commands run by each one will be one of:
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_2.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_3.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_4.jpg -out metadata
You don't need to do anything else - Slurm will automatically take care of launching each image_processor.py task with a differently named input file, and since we wrote image_processor.py to work completely independently of any other file, there will be no conflicts and each task will run simultaneously with the other three.
This thereby gives us a x4 speed up over processing those four images one after each other.
To convert our existing image_job.sh job script to a task array and instead use an input file, we can make the following modifications:
The numbers of the tasks Slurm will start for us get turned in to the variable ${SLURM_ARRAY_TASK_ID}. In the small example below, we are asking Slurm to launch four and each task will get the number from 1 to 4 (–array=1-4).
Save the file below as image_job_inputfile.sh and run with sbatch image_job_inputfile.sh:
image_job_inputfile.sh
sbatch image_job_inputfile.sh
#!/bin/bash #SBATCH --partition=default_free #SBATCH --account=comet_abc123 # Remember to use your own account code #SBATCH --mem=1g #SBATCH --nodes=1 #SBATCH --array=1-4 #SBATCH --cpus-per-task=1 module load Python/3.12.3 # This next line generates a unique input data file for each task which is launched, as long as we generated the 'residential_files.txt' input as # per our earlier example. INPUT_FILE=$(awk NR==${SLURM_ARRAY_TASK_ID} residential_files.txt) # Then each task launches image_processor with a unique INPUT_FILE name python3 image_processor.py -file $INPUT_FILE -out metadata
Because we don't have sequentially numbered input files we use a little trick with the awk command to read a specific line from our input file. In this case each task reads its own line from the input file; task one reading the filename on line one, task two reading the filename on line two, and so on.
awk
This therefore gets exactly the same x4 speed up as the previous example. The only difference is the potential order the files are processed and that they don't need to be sequentially named.
This achieves the same goal as the sequentially numbered task array example (four tasks launched, each reading a different input file), and while the awk line is a little opaque for those who have not written a great deal of shell script, you can treat it as an automatic filename supplier, as long as you give it your input file (in this case, residential_files.txt we created earlier).
First, let's test the sequentially numbered task array solution:
$ sbatch image_job_seq.sh Submitted batch job 1607653 # Check for output logs $ ls slurm-1607653* slurm-1607653_1.out slurm-1607653_2.out slurm-1607653_3.out slurm-1607653_4.out # See contents of output logs $ cat slurm-1607653* Opening source image: EuroSAT_RGB/Residential/Residential_1.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1.jpg.txt Opening source image: EuroSAT_RGB/Residential/Residential_2.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_2.jpg.txt Opening source image: EuroSAT_RGB/Residential/Residential_3.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_3.jpg.txt Opening source image: EuroSAT_RGB/Residential/Residential_4.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_4.jpg.txt $
Has the data been generated correctly?
$ ls metadata/ Residential_1.jpg.txt Residential_2.jpg.txt Residential_3.jpg.txt Residential_4.jpg.txt $ cat metadata/Residential_1.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1.jpg min:60 max:155 mean:92.23583984375 $ cat metadata/Residential_2.jpg.txt filename:EuroSAT_RGB/Residential/Residential_2.jpg min:54 max:254 mean:115.12548828125 $ cat metadata/Residential_3.jpg.txt filename:EuroSAT_RGB/Residential/Residential_3.jpg min:62 max:255 mean:102.17138671875 $ cat metadata/Residential_4.jpg.txt filename:EuroSAT_RGB/Residential/Residential_4.jpg min:56 max:158 mean:90.14990234375 $
Yes, the four files which were processed in parallel were analysed successfully, using our existing, sequential image_process.py application.
image_process.py
Now, let's test the solution which uses the input data file, and does not rely on sequentially named image files.
# Submit job $ sbatch image_job_inputfile.sh Submitted batch job 1607718 # Did Slurm logs get created for each task? $ ls slurm-1607718_* slurm-1607718_1.out slurm-1607718_2.out slurm-1607718_3.out slurm-1607718_4.out # Check Slurm output logs $ cat slurm-1607718_* Opening source image: EuroSAT_RGB/Residential/Residential_1000.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1000.jpg.txt Opening source image: EuroSAT_RGB/Residential/Residential_1001.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1001.jpg.txt Opening source image: EuroSAT_RGB/Residential/Residential_1002.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1002.jpg.txt Opening source image: EuroSAT_RGB/Residential/Residential_1003.jpg Converting to greyscale Extracting pixel data Saving pixel data as: metadata/Residential_1003.jpg.txt $
So it seems to have ran okay (note that it processed the files in a different order, based on the sort method with used to generate residential_files.txt). What about the analysed data files?
sort
# Did the output data files get created? $ ls metadata/ Residential_1000.jpg.txt Residential_1001.jpg.txt Residential_1002.jpg.txt Residential_1003.jpg.txt # Check output data $ cat metadata/Residential_1000.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1000.jpg min:47 max:218 mean:94.780517578125 $ cat metadata/Residential_1001.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1001.jpg min:61 max:254 mean:85.656982421875 $ cat metadata/Residential_1002.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1002.jpg min:66 max:254 mean:114.40478515625 $ cat metadata/Residential_1003.jpg.txt filename:EuroSAT_RGB/Residential/Residential_1003.jpg min:59 max:153 mean:79.7548828125
Again, yes; the four parallel tasks ran successfully, and they each generated (presumably) correct data from their independent image files.
We can be confident that both the sequential filename method and the input data file method do produce the same output - although the ordering of the image files could differ between the two, depending on how we order/sort the input file.
But hold on, we have thousands of input files, and we have only processed four of them! Well yes; as always we should start off small and test that things work before we scale up. Read the next section to understand how to scale up your task arrays jobs.
If we look at the SBATCH headers we set in both input_job_seq. and input_job_inputfile.sh they are the same:
SBATCH
input_job_seq.
input_job_inputfile.sh
#SBATCH --partition=default_free #SBATCH --account=comet_abc123 # Remember to use your own account code #SBATCH --mem=1g #SBATCH --nodes=1 #SBATCH --array=1-4 #SBATCH --cpus-per-task=1
How we increase the number of parallel copies of image_process.py is to configure the –array values to better suit our needs. Remember that:
–array
It would seem that we could increase tasks to some huge number to run all of our input data files in parallel… but we are constrained by several factors:
tasks
When increasing the array figure the form below is used:
array
The %max_running argument is optional, and if not set, the number of running tasks is the sequence from start to end, e.g:
%max_running
# Four running tasks, numbered 1, 2, 3, 4 #SBATCH --array=1-4
# Twenty eight running tasks, numbered 100 to 127 #SBATCH --array=100-127
If you include the %max_running argument, then this applies a limit to the simultaneous running tasks, e.g.:
# Four tasks, limited to two running at a time, numbered 1, 2, 3, 4 #SBATCH --array=1-4%2
# Twenty eight tasks, limited to five running at a time, numbered 100 to 127 #SBATCH --array=100-127%5
A key point to remember is that Slurm will create a job entry for each element of your array. If you have an array which defines 28 entries, then you will get 28 new jobs created… if you have an array with 200 entries then you will create 200 new Slurm job entries.
This means that large arrays can create huge numbers of Slurm jobs very, very quickly. For this reason we do currently have some limits set on the number of jobs both users and their entire group can create (see Resource Limits - Understanding what they mean for more details on the current limits - MaxSubmitJobs and GrpSubmitJobs limits are applicable to the maximum size of the –array setting, but also take in to account everything else currently running or pending against your account code).
If you have huge numbers of data files to process, then you may need to approach this in several batches to avoid overload Slurm and the job submission system, and to stay inside the limits enforced on your project (you can use our sproject tool from Simple Slurm Tools to quickly check your project limits).
sproject
Workarounds for large numbers of files/arrays and staying under MaxSubmit and GrpSubmit limits.
You can download all of the scripts used in this example using the link below:
Back to Advanced Slurm Job Optimisation
Table of Contents
HPC Service
Main Content Sections
Documentation Tools