====== Building a Parallel Task Array Solution ======

<WRAP info round box>
This is an //example// scenario, but the principles of how you would approach it (e.g. first building a pipeline to process a single input file, understanding your input data structure, and finally converting your pipeline to a task array) are applicable to __all__ problems which are based on the analysis of many //independent// data files.
</WRAP>

===== Steps to a solution =====

It's not often feasible to jump straight to a parallel task array solution, and we encourage HPC users to approach the problem in several stages:

   * Identify the problem
   * Build a single job solution
   * Test the single job process
   * Refactor your input data
   * Build a multi job solution
   * Test the multi job solution
   * Scale up within your resource limits

----


===== Identify the problem =====

In this example scenario we are faced with the following problem of analysing a large amount of image data:

> We need to analyse files generated by an instrument. 
> We need to transform each image to remove unnecessary data or to highlight certain features.
> We have been provided with //tens of thousands// of images, all of which need to be processed before we can generate some metrics from the image.

In our example we will use the [[https://zenodo.org/records/7711810#.ZAm3k-zMKEA|EuroSAT RGB dataset]] as our dataset. This is large enough to demonstrate the file //quantity// issue (over 27000 images), but each image is //small enough// to not require substantial processing time. In reality you will probably have datasets that are both large in number, //as well as// in complexity/size.

The processing of image data could mean anything: it may involve heuristics to identify features, the application of some image transformation algorithm, or similar. //But//, in our case, let us //assume// that the workflow for analysing each input image is a relatively simple sequence of:

   * Converting the input file from RGB to greyscale
   * Extracting the min, max and mean brightness of all pixels in the image
   * Recording the brightness values and the image filename for later analysis

We know that processing thousands of these files //could// be performed on a local desktop, but it may take days or weeks depending on the number (and size) of files we need to analyse. We want to offload the processing to the HPC so that we don't need to keep our local machine running 24x7.

<WRAP tip round box>
**Following this example**

If you want to work through this example, you can download the EuroSAT RGB image dataset and set it up with the following script:

<code lang=bash>
$ wget -nc -q "https://zenodo.org/records/7711810/files/EuroSAT_RGB.zip?download=1" -O EUROSAT.zip
$ unzip -q -n EUROSAT.zip
</code>

The resulting directory tree should look like this:

   * EuroSAT_RGB
   * EuroSAT_RGB/Forest
   * EuroSAT_RGB/Highway
   * EuroSAT_RGB/Residential
   * ...

Each of the directories should contain several //thousand// images, such as:

{{:advanced:eurosat_examples.png?1000|}}
</WRAP>

----

===== A single job solution =====

The first step is to build an image processing pipeline for a //single// file which performs each step of the workflow. Remember we need to:

   * Grab the name of the input image file
   * Convert the file from a colour RGB image to greyscale
   * Calculate the minimum, maximum and mean brightness of the image
   * Save the brightness value and filename for later processing

Here we have written a very basic Python application which takes the name of an input file (''-file'') and the name of an output directory (''-out''). It uses the ''PIL'' module to convert the image to greyscale, then extracts the min, max and mean brightness of the image. Finally, it saves an output file containing the filename and the pixel values we extracted, to be used in later processing. Download the file below and save it as ''image_processor.py'':

<code lang=python title=image_processor.py>
#!/usr/bin/env python3
import argparse
import sys
import os
from PIL import Image, ImageStat

def process_an_image(filename = None, outdir = None):
	""" Process an image """
	print(f"Opening source image: {filename}")
	i = Image.open(filename)

	# Convert to greyscale
	print(f"Converting to greyscale")
	i_greyscale = i.convert('L')	

	# Extract pixel data
	print(f"Extracting pixel data")
	i_stat = ImageStat.Stat(i_greyscale)

	# Save metadata for later processing
	data = f"filename:{filename} min:{i_stat.extrema[0][0]} max:{i_stat.extrema[0][1]} mean:{i_stat.mean[0]}"
	output_filename = os.path.split(filename)[1]
	out_name = outdir + "/" + output_filename + ".txt"
	
	print(f"Saving pixel data as: {out_name}")
	# The with-block closes the file for us, so the data is flushed to disk
	with open(out_name, "w") as f:
		f.write(data)

	# Exit
	sys.exit(0)


if __name__ == "__main__":
	parser = argparse.ArgumentParser("image_processor")
	parser.add_argument("-file", help="Name of file to process", type=str)
	parser.add_argument("-out", help="Name of output directory", type=str)
	args = parser.parse_args()
	filename = None
	outdir = None

	if args.file:
		filename = args.file

	if args.out:
		outdir = args.out		

	if filename and outdir:
		process_an_image(filename, outdir)
	else:
		print("You need to provide -file and -out arguments.")
		print("See: python3 image_processor.py -h")
		sys.exit(1)
</code>

As this is a relatively lightweight process, we can test it on a single image file from the EuroSAT dataset by running it directly on one of the login nodes as follows:

<code lang=bash>
# Load the Python module to ensure we are not relying on a system version of Python
$ module load Python/3.12.3
# Create a directory to hold our output data
$ mkdir metadata
# Now run our image processor on a single file
$ python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata
Opening source image: EuroSAT_RGB/Residential/Residential_1.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1.jpg.txt
$
</code>

Checking the output file we can see it has recorded the min/max/mean pixel brightness values as intended:

<code lang=bash>
$ cat metadata/Residential_1.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_1.jpg min:60 max:155 mean:92.23583984375
$
</code>

We have tested this directly at the command line; next we should check that it functions correctly under Slurm.

----

===== Testing a single job =====

You can now run this as a Slurm job. Here's a simple job script which processes a single image file, just as we did at the command line above. Save the contents of the file below as ''image_job.sh'', and submit it with ''sbatch image_job.sh'':

<code lang=bash title=image_job.sh>
#!/bin/bash

#SBATCH --partition=short_free
#SBATCH --account=comet_abc123 # Remember to use your own account code
#SBATCH --ntasks=1
#SBATCH --mem=1g
#SBATCH -t 00:02:00

module load Python/3.12.3
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata
</code>

This should work //identically// to running it at the command line, though we have to wait for the job to be scheduled. In theory if we were processing //very// large files then we may have had to submit the Slurm job //anyway//, as the login node may not have had enough RAM to handle even a single large image. Let's submit the simple job file and check that it //does// work as intended:

<code lang=bash>
$ sbatch image_job.sh 
Submitted batch job 1607386
$ cat slurm-1607386.out 
Opening source image: EuroSAT_RGB/Residential/Residential_1.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1.jpg.txt
$ cat metadata/Residential_1.jpg.txt
filename:EuroSAT_RGB/Residential/Residential_1.jpg min:60 max:155 mean:92.23583984375
$
</code>

So **yes**, it works as well under Slurm as it did directly at the command line.

If we didn't have many images to process, we //could// now just add the additional images into the script and have them processed in turn, for example:

<code lang=bash title=image_job.sh>
#!/bin/bash

#SBATCH --partition=short_free
#SBATCH --account=comet_abc123
#SBATCH --ntasks=1
#SBATCH --mem=1g
#SBATCH -t 00:02:00

module load Python/3.12.3
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_2.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_3.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_4.jpg -out metadata
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_5.jpg -out metadata
</code>

However, there are __more than 27000__ images included in the EuroSAT data set, so it quickly becomes impractical to list them all individually. The time to finish also increases linearly with each image processing command we add, because the commands run one after another.

But, we do know our analysis pipeline //works//. This is the first step complete.

----


===== Refactoring your input data =====

We know the image processing pipeline works for a single input file, but you are reading this guide because you presumably have hundreds, thousands, or even //tens// of thousands of data files to analyse.

So how do we do that? Well, at this point there is a decision to be made, and the ease of implementation will depend on which of the scenarios below applies to you:

<WRAP group>

<WRAP box round half column>
**Scenario 1**

You have __complete control__ over your input data files naming convention and can set them to be named or numbered sequentially.
</WRAP>

<WRAP box round half column>
**Scenario 2**

You do __not__ have control over your input data files, or they already exist with some naming convention which needs to be __maintained__.
</WRAP>

</WRAP>

==== Scenario 1 - Control of file naming ====

By far the easiest mechanism of moving to a task array based solution is if you have control over your input data naming convention. If you are able to arrange your input data files with **sequential filenames**, then it becomes almost trivial to convert to a task array. 

By sequential naming, we refer to a filename which has some part of the name numbered in sequence - it does //not necessarily need to start from 1//.

Example sequential numbering schemes:

   * ''file1.jpg'', ''file2.jpg'', ''file3.jpg''
   * ''data.1.dat'', ''data.2.dat'', ''data.3.dat''
   * ''sequence_data.5000.seq'', ''sequence_data.5001.seq'', ''sequence_data.5002.seq''
   * ''1001.txt'', ''1002.txt'', ''1003.txt''
   * etc.

If you already have files named like this, //or can rename them to be like this//, then you will find that you can easily convert your ''image_job.sh'' job file to a task array with almost zero effort. We'll show this below.
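If you cannot rename the original files but still want sequential names, one option is to create sequentially numbered symbolic links which point at them. The sketch below is only an illustration: it works on a scratch directory of empty dummy files, and the names ''raw'', ''seq'' and ''fileN.jpg'' are assumptions rather than part of the EuroSAT example.

<code lang=bash>
#!/bin/bash
# Sketch: give arbitrarily named files sequential names via symbolic links,
# leaving the originals untouched. All names here are made up for the demo.

# Scratch area with three empty dummy "images"
demo=$(mktemp -d)
mkdir "$demo/raw" "$demo/seq"
touch "$demo/raw/tile_north.jpg" "$demo/raw/tile_south.jpg" "$demo/raw/tile_east.jpg"

n=0
# The shell expands the glob in sorted order, so the numbering is repeatable
for f in "$demo"/raw/*.jpg; do
    [ -e "$f" ] || continue      # the glob matched nothing
    n=$((n + 1))
    # $f is an absolute path here, so the link works from any directory
    ln -s "$f" "$demo/seq/file${n}.jpg"
done

echo "Linked $n files"
ls "$demo/seq"
</code>

A task array could then refer to ''file${SLURM_ARRAY_TASK_ID}.jpg'' inside the link directory, with an ''--array'' range running from 1 to the number of links created.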

==== Scenario 2 - Existing file naming conventions ====

If you have directory trees of input files, naming schemes that are random or otherwise unable to be ordered sequentially by number, then you will need to generate an additional **input file** where those files //can// be read sequentially.

Although this sounds complicated, it is actually quite simple. If you have a single directory of files, you can use the ''ls'' command and //redirection// to send all of your filenames to an input file.

Assuming we wanted all of the files in the ''EuroSAT_RGB/Residential'' directory we could do:

<code lang=bash>
$ ls EuroSAT_RGB/Residential/*.jpg | sort > residential_files.txt
</code>

This command:

   * Lists all of the files ending in ''.jpg'' in the ''EuroSAT_RGB/Residential'' directory
   * Sorts them by name
   * Sends the sorted list to a file named ''residential_files.txt''

Looking at the top of the ''residential_files.txt'' file we can see the files are now listed:

<code lang=bash>
$ head residential_files.txt 
EuroSAT_RGB/Residential/Residential_1000.jpg
EuroSAT_RGB/Residential/Residential_1001.jpg
EuroSAT_RGB/Residential/Residential_1002.jpg
EuroSAT_RGB/Residential/Residential_1003.jpg
EuroSAT_RGB/Residential/Residential_1004.jpg
EuroSAT_RGB/Residential/Residential_1005.jpg
EuroSAT_RGB/Residential/Residential_1006.jpg
EuroSAT_RGB/Residential/Residential_1007.jpg
EuroSAT_RGB/Residential/Residential_1008.jpg
EuroSAT_RGB/Residential/Residential_1009.jpg
$
</code>

... and it has all //3000// files:

<code lang=bash>
$ wc -l residential_files.txt 
3000 residential_files.txt
$
</code>

**We will use this ''residential_files.txt'' input file later, in our task array job**.

If we //cannot// rename our files to be sequentially numbered, then this input file is the best way of moving to a task array based setup. If you have complicated subdirectory trees with your input files spread across multiple directories, then you may need to use something like ''find'' to list your files, rather than the simple ''ls'' command.
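As a rough illustration of the ''find'' approach, the sketch below builds a single input list from files spread across several subdirectories. It runs against a scratch copy of the directory layout containing empty dummy files, and the output name ''all_files.txt'' is an assumption chosen for this example.

<code lang=bash>
#!/bin/bash
# Sketch: build one input list from a whole directory tree using find.
# The tree below is a dummy stand-in for the real EuroSAT_RGB directory.

demo=$(mktemp -d)
mkdir -p "$demo/EuroSAT_RGB/Forest" "$demo/EuroSAT_RGB/Highway"
touch "$demo/EuroSAT_RGB/Forest/Forest_1.jpg" \
      "$demo/EuroSAT_RGB/Forest/Forest_2.jpg" \
      "$demo/EuroSAT_RGB/Highway/Highway_1.jpg" \
      "$demo/EuroSAT_RGB/Highway/notes.txt"

cd "$demo" || exit 1

# -type f   : regular files only (skip directories)
# -name ... : keep only names ending in .jpg (quoted so the shell does not expand it)
find EuroSAT_RGB -type f -name '*.jpg' | sort > all_files.txt

# One filename per line, ready to be read by line number
cat all_files.txt
</code>

The resulting file has the same one-filename-per-line layout as ''residential_files.txt'', so it can be used in the same way.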

----

===== Building a task array solution ===== 

How you write a task array job file is determined by whether you have sequentially named input data, or are using the input data file method listed above.

==== Task array solution - sequential filenames ====

To convert our existing ''image_job.sh'' job script to a task array, we can make the following modifications:

   * Add the ''--array='' parameter, which tells Slurm the //quantity// and //numbers// of the tasks we want to run - this range should correspond to the sequential numbering range of your input files
   * Replace the fixed number in the input filename with the ''${SLURM_ARRAY_TASK_ID}'' variable

Slurm passes each task its own //number// in the environment variable ''${SLURM_ARRAY_TASK_ID}''. In the //small// example below, we are asking Slurm to launch **four** tasks and each task will get a number from **1 to 4** (''--array=1-4'').

Save the file below as ''image_job_seq.sh'' and run as ''sbatch image_job_seq.sh'':

<code lang=bash title=image_job_seq.sh>
#!/bin/bash

#SBATCH --partition=default_free
#SBATCH --account=comet_abc123 # Remember to use your own account code
#SBATCH --mem=1g
#SBATCH --nodes=1
#SBATCH --array=1-4
#SBATCH --cpus-per-task=1

module load Python/3.12.3
python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_${SLURM_ARRAY_TASK_ID}.jpg -out metadata
</code>

We use the fact each task gets a unique number to make each task //also// use a differently named //input image file//. In this case Slurm will start the four tasks and the commands run by each one will be one of:

<code>python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_1.jpg -out metadata</code>
<code>python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_2.jpg -out metadata</code>
<code>python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_3.jpg -out metadata</code>
<code>python3 image_processor.py -file EuroSAT_RGB/Residential/Residential_4.jpg -out metadata</code>

You don't need to do anything else - Slurm will automatically take care of launching each ''image_processor.py'' task with a differently named input file. Since we wrote ''image_processor.py'' to work completely independently of any other file, there will be no conflicts and each task can run at the same time as the other three.

This gives us up to a **x4** speed up over processing those four images one after another, provided there are enough free cores for all four tasks to run at once.

==== Task array solution - using an input data file ====

To convert our existing ''image_job.sh'' job script to a task array //and instead use an input file//, we can make the following modifications:

   * Add the ''--array='' parameter, which tells Slurm the //quantity// and //numbers// of the tasks we want to run - this range should correspond to the //line numbers// of the input file that you want to process
   * Use ''${SLURM_ARRAY_TASK_ID}'' to read the matching line from the input file, and pass that filename to ''image_processor.py''

Slurm passes each task its own //number// in the environment variable ''${SLURM_ARRAY_TASK_ID}''. In the //small// example below, we are asking Slurm to launch **four** tasks and each task will get a number from **1 to 4** (''--array=1-4'').

Save the file below as ''image_job_inputfile.sh'' and run with ''sbatch image_job_inputfile.sh'':

<code lang=bash title=image_job_inputfile.sh>
#!/bin/bash

#SBATCH --partition=default_free
#SBATCH --account=comet_abc123 # Remember to use your own account code
#SBATCH --mem=1g
#SBATCH --nodes=1
#SBATCH --array=1-4
#SBATCH --cpus-per-task=1

module load Python/3.12.3

# This next line selects a unique input filename for each task which is launched, as long as we generated
# the 'residential_files.txt' input as per our earlier example.
INPUT_FILE=$(awk "NR==${SLURM_ARRAY_TASK_ID}" residential_files.txt)

# Then each task launches image_processor with a unique INPUT_FILE name
# (quoted, so filenames containing spaces are passed as a single argument)
python3 image_processor.py -file "$INPUT_FILE" -out metadata
</code>

Because we don't have sequentially numbered input files we use a little trick with the ''awk'' command to read a specific line from our input file. In //this case// each task reads its own line from the input file; task **one** reading the filename on //line one//, task **two** reading the filename on //line two//, and so on.

This achieves the same goal as the sequentially numbered task array example (four tasks launched, each reading a different input file) and gets the same **x4** speed up. The only differences are the order in which the files may be processed, and that the files no longer need to be sequentially named.

While the ''awk'' line is a //little opaque// for those who have not written a great deal of shell script, you can treat it as an automatic filename supplier, as long as you give it your input file (in this case, the ''residential_files.txt'' file we created earlier).
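If you want to see which filename each task would pick up //before// submitting anything, you can set ''SLURM_ARRAY_TASK_ID'' by hand and run the selection line on its own. The sketch below does this against a small temporary list (a stand-in for ''residential_files.txt''), and also shows a ''sed'' one-liner which should select the same line; the ''SED_FILE'' variable is only there for comparison.

<code lang=bash>
#!/bin/bash
# Sketch: dry-run the "read line N of the input file" step outside of Slurm.

# Small stand-in for residential_files.txt
list=$(mktemp)
printf '%s\n' \
    "EuroSAT_RGB/Residential/Residential_1000.jpg" \
    "EuroSAT_RGB/Residential/Residential_1001.jpg" \
    "EuroSAT_RGB/Residential/Residential_1002.jpg" > "$list"

# In a real job Slurm sets SLURM_ARRAY_TASK_ID; here we fake it for tasks 1 to 3
for SLURM_ARRAY_TASK_ID in 1 2 3; do
    # awk prints the line whose record number (NR) equals the task number
    INPUT_FILE=$(awk "NR==${SLURM_ARRAY_TASK_ID}" "$list")
    # sed -n 'Np' prints only line N, so it should pick the same filename
    SED_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$list")
    echo "task ${SLURM_ARRAY_TASK_ID}: awk=${INPUT_FILE} sed=${SED_FILE}"
done
</code>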

----

=====  Testing a multi job solution ===== 

==== Testing the sequential filename solution ====

First, let's test the sequentially numbered task array solution:

<code lang=bash>
$ sbatch image_job_seq.sh 
Submitted batch job 1607653

# Check for output logs
$ ls slurm-1607653*
slurm-1607653_1.out  slurm-1607653_2.out  slurm-1607653_3.out  slurm-1607653_4.out

# See contents of output logs
$ cat slurm-1607653*
Opening source image: EuroSAT_RGB/Residential/Residential_1.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1.jpg.txt
Opening source image: EuroSAT_RGB/Residential/Residential_2.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_2.jpg.txt
Opening source image: EuroSAT_RGB/Residential/Residential_3.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_3.jpg.txt
Opening source image: EuroSAT_RGB/Residential/Residential_4.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_4.jpg.txt
$
</code>

Has the data been generated correctly?

<code lang=bash>
$ ls metadata/
Residential_1.jpg.txt  Residential_2.jpg.txt  Residential_3.jpg.txt  Residential_4.jpg.txt
$ cat metadata/Residential_1.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_1.jpg min:60 max:155 mean:92.23583984375
$ cat metadata/Residential_2.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_2.jpg min:54 max:254 mean:115.12548828125
$ cat metadata/Residential_3.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_3.jpg min:62 max:255 mean:102.17138671875
$ cat metadata/Residential_4.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_4.jpg min:56 max:158 mean:90.14990234375
$
</code>

**Yes**, the four files which were processed in parallel were analysed successfully, using our existing, sequential ''image_processor.py'' application.

==== Testing the input data file solution ====

Now, let's test the solution which uses the input data file, and does not rely on sequentially named image files.

<code lang=bash>
# Submit job
$ sbatch image_job_inputfile.sh 
Submitted batch job 1607718

# Did Slurm logs get created for each task?
$ ls slurm-1607718_*
slurm-1607718_1.out  slurm-1607718_2.out  slurm-1607718_3.out  slurm-1607718_4.out

# Check Slurm output logs
$ cat slurm-1607718_*
Opening source image: EuroSAT_RGB/Residential/Residential_1000.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1000.jpg.txt
Opening source image: EuroSAT_RGB/Residential/Residential_1001.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1001.jpg.txt
Opening source image: EuroSAT_RGB/Residential/Residential_1002.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1002.jpg.txt
Opening source image: EuroSAT_RGB/Residential/Residential_1003.jpg
Converting to greyscale
Extracting pixel data
Saving pixel data as: metadata/Residential_1003.jpg.txt
$
</code>

So it seems to have run okay (note that it processed the files in a different order, based on the ''sort'' method we used to generate ''residential_files.txt''). What about the analysed data files?

<code lang=bash>
# Did the output data files get created?
$ ls metadata/
Residential_1000.jpg.txt  Residential_1001.jpg.txt  Residential_1002.jpg.txt  Residential_1003.jpg.txt

# Check output data
$ cat metadata/Residential_1000.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_1000.jpg min:47 max:218 mean:94.780517578125
$ cat metadata/Residential_1001.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_1001.jpg min:61 max:254 mean:85.656982421875
$ cat metadata/Residential_1002.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_1002.jpg min:66 max:254 mean:114.40478515625
$ cat metadata/Residential_1003.jpg.txt 
filename:EuroSAT_RGB/Residential/Residential_1003.jpg min:59 max:153 mean:79.7548828125
</code>

Again, **yes**; the four parallel tasks ran successfully, and they each generated (presumably) correct data from their independent image files.

We can be confident that both the sequential filename method and the input data file method produce the //same// kind of output. The only difference is which image file each task picks up, which depends on how we order/sort the input file.
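With four tasks it is easy to check the outputs by eye, but that will not be practical for larger arrays. The sketch below is one possible way to report any listed input which has no matching output file. It runs against a scratch directory of dummy files, and relies only on the naming rule used by ''image_processor.py'' (the output is the input basename with ''.txt'' appended); the ''work'' directory and ''missing'' counter are assumptions for this example.

<code lang=bash>
#!/bin/bash
# Sketch: report inputs from the list which have no matching output file.

work=$(mktemp -d)
mkdir "$work/metadata"
printf '%s\n' \
    "EuroSAT_RGB/Residential/Residential_1000.jpg" \
    "EuroSAT_RGB/Residential/Residential_1001.jpg" \
    "EuroSAT_RGB/Residential/Residential_1002.jpg" > "$work/residential_files.txt"

# Pretend that only two of the three tasks produced output
touch "$work/metadata/Residential_1000.jpg.txt" "$work/metadata/Residential_1002.jpg.txt"

missing=0
while IFS= read -r f; do
    # image_processor.py writes <outdir>/<input basename>.txt
    out="$work/metadata/$(basename "$f").txt"
    if [ ! -e "$out" ]; then
        echo "Missing output for: $f"
        missing=$((missing + 1))
    fi
done < "$work/residential_files.txt"

echo "$missing input file(s) without output"
</code>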

But hold on, we have //thousands// of input files, and we have only processed //four// of them! Well //yes//; as always we should start off small and test that things __work__ before we scale up. Read the //next section// to understand how to scale up your task array jobs.

----

=====  Scaling up & resource limits ===== 

If we look at the ''SBATCH'' headers we set in both ''image_job_seq.sh'' and ''image_job_inputfile.sh'' they are the same:

<code lang=bash>
#SBATCH --partition=default_free
#SBATCH --account=comet_abc123 # Remember to use your own account code
#SBATCH --mem=1g
#SBATCH --nodes=1
#SBATCH --array=1-4
#SBATCH --cpus-per-task=1
</code>

To increase the number of parallel copies of ''image_processor.py'', we configure the ''--array'' values to better suit our needs. Remember that:

   * ''--array='' controls the quantity of tasks (**4**, in the case above) and the numbering of the ''${SLURM_ARRAY_TASK_ID}'' variable which each task inherits (**1 to 4** in the previous example)

It would seem that we could increase the ''--array'' range to some //huge// number to run //all// of our input data files in parallel... but we are constrained by several factors:

   * In the case of Comet __we do not have__ //27000// CPU cores! (each task requires a //minimum// of 1 CPU core)
   * Even if we did have enough cores to launch an obscene number of tasks, this risks a detrimental impact on the HPC service for all other users!
   * Each HPC project will have a [[started:comet_resources|resource limit]] on the number of cores and jobs (or tasks) it can submit or have running at any time - a paid project may therefore be able to launch a //higher// number of tasks from an array job than an unfunded project

When increasing the ''--array'' figure the form below is used:

   * ''--array=''**start**-**end**%**max_running**

The ''%max_running'' argument is optional. If it is not set, Slurm will try to run every task in the sequence from **start** to **end** at the same time (subject to free resources and your project limits), e.g.:

<code>
# Four running tasks, numbered 1, 2, 3, 4
#SBATCH --array=1-4
</code>

<code>
# Twenty eight running tasks, numbered 100 to 127
#SBATCH --array=100-127
</code>

If you //include// the ''%max_running'' argument, then this applies a limit to the simultaneous running tasks, e.g.:

<code>
# Four tasks, limited to two running at a time, numbered 1, 2, 3, 4
#SBATCH --array=1-4%2
</code>

<code>
# Twenty eight tasks, limited to five running at a time, numbered 100 to 127
#SBATCH --array=100-127%5
</code>
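Rather than counting files by hand, the ''--array'' range can be sized from the number of lines in your input file. The sketch below only //prints// the ''sbatch'' command it would run, using a temporary 28-line list as a stand-in for ''residential_files.txt''; the ''MAX_RUNNING'' value of 5 is an arbitrary assumption, not a recommended setting.

<code lang=bash>
#!/bin/bash
# Sketch: size the --array range from the number of lines in the input list.

# Stand-in for residential_files.txt, with 28 entries
list=$(mktemp)
i=1
while [ "$i" -le 28 ]; do
    echo "EuroSAT_RGB/Residential/Residential_${i}.jpg"
    i=$((i + 1))
done > "$list"

# Count the lines (tr strips any padding some versions of wc add)
N=$(wc -l < "$list" | tr -d ' ')
MAX_RUNNING=5     # assumed limit on simultaneously running tasks

# Options given on the sbatch command line take precedence over the
# matching #SBATCH line inside the job script
SUBMIT_CMD="sbatch --array=1-${N}%${MAX_RUNNING} image_job_inputfile.sh"
echo "$SUBMIT_CMD"
</code>

Printing the command first lets you check the range before submitting it for real.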

A key point to remember is that Slurm will create a job entry for **each element of your array**. If you have an array which defines **28** entries, then you will get **28** new jobs created... if you have an array with **200** entries then you will create **200** new Slurm job entries. 

This means that large arrays can create huge numbers of Slurm jobs very, very quickly. For this reason we do currently have some limits set on the number of jobs both users and their entire group can create. See [[:started:comet_resources|Resource Limits - Understanding what they mean]] for more details on the current limits: the **MaxSubmitJobs** and **GrpSubmitJobs** limits apply to the maximum size of the ''--array'' setting, but also take into account //everything else// currently running or pending against your account code.

If you have //huge// numbers of data files to process, then you may need to approach this in several batches to avoid overloading Slurm and the job submission system, and to stay inside the limits enforced on your project (you can use our ''sproject'' tool from [[:started:simple_slurm_tools|Simple Slurm Tools]] to quickly check your project limits).

==== Possible workarounds ====

<WRAP round box todo>
Workarounds for large numbers of files/arrays and staying under MaxSubmit and GrpSubmit limits.
</WRAP>
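One commonly used approach, sketched here as an assumption rather than an official recommendation, is to have each array task process a //block// of lines from the input file instead of a single line. With a ''CHUNK'' of 100, for example, 3000 files would need only 30 array tasks. The sketch below is a dry run: it uses a ten-line stand-in list, a hypothetical ''CHUNK'' of 3, a hard-coded task number, and it //prints// the ''image_processor.py'' commands rather than running them.

<code lang=bash>
#!/bin/bash
# Sketch: each array task handles a block of CHUNK lines rather than one line.

CHUNK=3                  # assumed: number of files handled by each task
SLURM_ARRAY_TASK_ID=2    # set by Slurm in a real job; fixed here for the dry run

# Ten-line stand-in for residential_files.txt
list=$(mktemp)
for i in $(seq 1000 1009); do
    echo "EuroSAT_RGB/Residential/Residential_${i}.jpg"
done > "$list"

# Task 1 gets lines 1..CHUNK, task 2 gets CHUNK+1..2*CHUNK, and so on
first=$(( (SLURM_ARRAY_TASK_ID - 1) * CHUNK + 1 ))
last=$(( SLURM_ARRAY_TASK_ID * CHUNK ))
echo "Task ${SLURM_ARRAY_TASK_ID} handles lines ${first}-${last}"

# Copy this task's block of lines to its own temporary file
block=$(mktemp)
sed -n "${first},${last}p" "$list" > "$block"

count=0
while IFS= read -r INPUT_FILE; do
    # In a real job script, drop the echo to run the command itself
    echo "python3 image_processor.py -file ${INPUT_FILE} -out metadata"
    count=$((count + 1))
done < "$block"

echo "Processed ${count} files"
</code>

The trade-off is that the files within a block are processed one after another, so each task runs for longer and its time limit would need to allow for that.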

----

===== Downloads =====

You can download all of the scripts used in this example using the link below:

   * {{ :advanced:slurm_task_array_example.zip |}}

-----

[[:advanced:slurm|Back to Advanced Slurm Job Optimisation]]