Build A Parallel MPI Solution

This is an example scenario and serves to highlight how you can use MPI to compute individual elements of a larger problem. It is a simple problem and in most real world cases you will have additional complexity to deal with, but the principles remain the same.

Many existing applications, libraries and frameworks already use MPI, so you may not need to write the code yourself. However, if you are writing something entirely new in C, C++, Fortran or Python (for example), you may need to use MPI to distribute your compute tasks over many cores, processors or nodes.

For anyone intending to write code for a complex problem using MPI from scratch, we would recommend taking a course such as the ARCHER2 - Message Passing with MPI workshop.

Steps to a solution

We will approach building an MPI-based compute solution by following the steps below:

Identify the problem
Writing a function or algorithm which solves a specific compute problem
Testing that algorithm in a single-threaded, single process
Benchmarking the single process approach
Dividing the problem into discrete units
Implementing an MPI framework on top of the existing function, without requiring any changes to the algorithm itself
Benchmarking the parallel MPI approach
Considering limitations to the MPI approach

Identify the problem

For the purposes of this MPI example, we are going to use a simple mathematics problem - finding prime numbers in a given (huge!) range.

This is an easy problem to illustrate and write code for, but it is one which demonstrates the thought processes which need to go into designing your algorithm, and the process of deconstructing a large problem into discrete sub-components.

Let us formally define the problem as:

Find all prime numbers between two positive integers supplied by the user (called start and end - also the search space) and record the total number of primes found.

Clearly this is relatively fast on modern CPU hardware for a small range, but the larger the range we want to find primes for, the longer the processing will take. Given a large enough search space, the problem becomes unsolvable in a reasonable timeframe if approached purely sequentially.

Following this example

If you want to follow this example as you read, you can jump to the Downloads link at the bottom of the page and download a .zip with all of the code and job scripts needed to follow each step in turn.

Writing a function or algorithm

First, we need to write a function which takes our start and end integers and can calculate which prime numbers are in that range. We'll limit the implementation to 32-bit integers for simplicity.

In this case we are going to be using the C programming language, but the actual language implementation is not important at this stage - you could be using Python, Fortran, R or any other programming language with MPI bindings.

We write a standalone function called primeCount() which takes the input values start and end, and returns prime_count, which is the total number of primes found. Because we're writing the algorithm as a standalone function, we will have more flexibility in how we call it later; extracting core logic and algorithms into their own functions usually leads to cleaner applications.

Either extract the file from the downloads .zip file, or save the file below as primes.c - this is our main primeCount function we will use later:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include "primes.h"

uint32_t primeCount(uint32_t start, uint32_t end){
	// Calculate all of the primes between start and end (inclusive)
	// Return a count of how many primes have been found
	// This is *not* the most optimised version, but serves
	// to illustrate the differences in sequential vs parallel processing.
	// Callers must pass start >= 2 (0 and 1 would be miscounted as prime).
	// Note: i * i and start_num++ wrap around for values close to 2^32,
	// so keep end comfortably below that limit.
	uint32_t prime_count = 0;
	uint32_t not_prime_count = 0;
	uint32_t div_count = 0;
	uint32_t start_num, i;
	printf("primeCount: Calculating primes %" PRIu32 " - %" PRIu32 "\n", start, end);
	for (start_num = start; start_num <= end; start_num++){
		// Count every divisor up to the square root of start_num
		// (deliberately no early exit on the first divisor found)
		div_count = 0;
		for (i = 2; i * i <= start_num; i++){
			if (start_num % i == 0){
				div_count++;
			}
		}
		if (div_count > 0){
			// Not prime
			not_prime_count++;
		} else {
			// Prime
			prime_count++;
		}
	}
	printf("primeCount: Found %" PRIu32 " primes\n", prime_count);
	return prime_count;
}

This is not intended to be an optimal prime number algorithm!

For the purposes of this exercise we are not using any advanced CPU features such as vectorisation hardware, nor are we looking to develop the most efficient algorithm (there are definitely improvements to be made!) - this is just to demonstrate that we can abstract out the main processing of our application (in a format that most users can follow) and use it in different ways, depending on whether we use a serial or parallel implementation later.
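
As one illustration of the improvements hinted at above, the sketch below stops testing a number as soon as its first divisor is found, instead of counting every divisor. The names isPrime and primeCountEarlyExit are hypothetical and are not part of the downloadable code, and none of the benchmark figures in this guide were produced with this variant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: returns true if n is prime.
// Stops at the first divisor found instead of counting them all.
static bool isPrime(uint32_t n){
	if (n < 2){
		return false;
	}
	// i <= n / i is equivalent to i * i <= n, but cannot overflow
	for (uint32_t i = 2; i <= n / i; i++){
		if (n % i == 0){
			return false;
		}
	}
	return true;
}

// Hypothetical variant of primeCount(): same inputs, same return value
uint32_t primeCountEarlyExit(uint32_t start, uint32_t end){
	assert(start <= end);
	uint32_t prime_count = 0;
	// Loop while n < end and test end separately, so that n++ cannot wrap around
	for (uint32_t n = start; n < end; n++){
		if (isPrime(n)){
			prime_count++;
		}
	}
	if (isPrime(end)){
		prime_count++;
	}
	return prime_count;
}
```

Because it takes and returns the same values as primeCount(), a variant like this could be swapped in without touching the calling code, which is the point of keeping the algorithm in its own function.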

Because it's written in C, and the function is not going to be used directly, but be called by other applications, we will also need a very small header file so that other applications know how to call it (what values it takes and what it returns). Other software we write will treat the actual code inside the function as a black box.

Either extract the file from the downloads .zip file, or save the file below as primes.h:

#include <stdint.h>
uint32_t primeCount(uint32_t start, uint32_t end);

Now that we've written a function to implement our algorithm, we need to move on to the simplest method of testing it.

Testing the algorithm in a single process

Let's build a simple implementation which just calls our primeCount() function with two command line parameters (start and end) for the entire search space and then prints out the result which primeCount() sends back.

Either extract the file from the downloads .zip file, or save the file below as single.c:

#include "primes.h"
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){
	uint32_t total_primes = 0;
	uint32_t start = 0;
	uint32_t end = 0;

	// Arg1 is the number to start searching from
	if (argc > 1){
		start = (uint32_t) strtoul(argv[1], NULL, 10);
	}
	// Arg2 is the number to end the search at
	if (argc > 2){
		end = (uint32_t) strtoul(argv[2], NULL, 10);
	}
	// You must supply both arguments
	if ((start < 2) || (end < 2)){
		printf("You must enter two positive numbers in the range 2 - 2^32\n");
		return 1;
	}

	printf("main: Calculating primes in the range %" PRIu32 " - %" PRIu32 "\n", start, end);
	// Make a single call to the primeCount function, for the entire start-end search space
	total_primes = primeCount(start, end);

	printf("main: Found a total of %" PRIu32 " primes\n", total_primes);
	return 0;
}

Compile primes.c and single.c using GCC:

$ module load GCC
$ gcc -c primes.c -o primes.o
$ gcc -c single.c -o single.o

The output should be primes.o and single.o.

With both files compiled, we need to create a runnable programme by linking single.o with the primes.o object file we have created:

$ gcc -o single single.o primes.o

The output should be an executable named single.

We have included the script compile.sh which will automate building the application for you. Just run this as ./compile.sh and it will load gcc and compile and link all the necessary code.

Benchmarking the single process

Now that we have a programme which runs, we can test it:

$ ./single
You must enter two positive numbers in the range 2 - 2^32
$

Okay, let's try with a simple test:

$ ./single 2 100
main: Calculating primes in the range 2 - 100
primeCount: Calculating primes 2 - 100
primeCount: Found 25 primes
main: Found a total of 25 primes
$

It works! We can measure the time by prefixing the command with the Linux time command:

$ time ./single 2 100
main: Calculating primes in the range 2 - 100
primeCount: Calculating primes 2 - 100
primeCount: Found 25 primes
main: Found a total of 25 primes

real	0m0.003s
user	0m0.000s
sys	    0m0.004s

Not very long at all… but let's try increasing the search space to see how well it performs…

Searchspace	Primes Found	Runtime (sec)
2-100	25	0.003
2-1000	168	0.004
2-10000	1229	0.008
2-100000	9592	0.036
2-1000000	78498	0.758
2-10000000	664579	23.55
2-100000000	5761455	752.08

The runtime grows much faster than the search space: over the last two rows of the table, each tenfold increase in the range takes roughly thirty times longer to process.
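
A rough estimate helps to explain the figures in the table. It assumes that the runtime is dominated by the inner loop of primeCount(), which performs about \sqrt{m} trial divisions for every candidate m because it does not stop at the first divisor found:

```latex
T(n) \;\propto\; \sum_{m=2}^{n} \sqrt{m}
     \;\approx\; \int_{0}^{n} \sqrt{x}\,dx
     \;=\; \tfrac{2}{3}\,n^{3/2}
\qquad\Longrightarrow\qquad
\frac{T(10n)}{T(n)} \;\approx\; 10^{3/2} \;\approx\; 31.6
```

The measured ratios for the largest search spaces are 23.55 / 0.758 ≈ 31.1 and 752.08 / 23.55 ≈ 31.9, so the runtime grows roughly as n^1.5. The smaller search spaces do not follow this pattern because their runtimes are dominated by process start-up.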

We're going to have to tackle the problem in a different way - further increases of the search space will quickly become infeasible to calculate in reasonable amounts of time. Depending on your use case you will probably encounter similar issues as you increase the quantity or complexity of data that you need to process.

At some point even the fastest single processor becomes too slow to process your data in a reasonable time.

You can test this under Slurm by using the included job_single.sh script. Just adjust the --account option to your own Comet account code, and submit as sbatch job_single.sh.

Dividing the problem into discrete units

We know that the function which searches for primes in a given search space works. Now we need to run those searches in parallel in order to find results in a shorter period of time.

Whilst this seems an easy step to take, it can often be one of the more complex parts of software engineering - deciding how to break up your data into discrete parts which can be processed independently. Fortunately, in this case we can simply divide the search space into equal-sized chunks - you will likely have to spend substantially more effort designing your own processing around your own data.

We can leave the primeCount function as-is, but we will need to refactor the code which calls it, so that we can run multiple searches in parallel, each on a different CPU core. We use the MPI API to enable calls to the primeCount function to be started up on individual CPU cores.

All of the processes are started by the MPI launcher, mpirun. One of them (process_id 0) acts as the parent, or controller: as well as searching its own share of the range, it receives the results from the other processes when they have finished their prime searches. The results are passed using the MPI_Send() and MPI_Recv() function calls in the new code, below.

Every process calculates only the primes in its own search space, and each process is given 1/N of the entire search space to iterate over. Either extract the file from the downloads .zip file, or save the file below as multi.c:

#include "primes.h"

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]){

	uint32_t total_primes = 0;
	uint32_t start = 0, our_start = 0;
	uint32_t end = 0, our_end = 0;
	uint32_t total_range = 0, sub_range_size = 0;
	uint32_t found_primes = 0;

	int sub_process_id, process_id, cluster_size;

	// Arg1 is the number to start searching from
	if (argc > 1){
		start = (uint32_t) strtoul(argv[1], NULL, 10);
	}
	// Arg2 is the number to end the search at
	if (argc > 2){
		end = (uint32_t) strtoul(argv[2], NULL, 10);
	}
	// You must supply both arguments
	if ((start < 2) || (end < 2)){
		printf("You must enter two positive numbers in the range 2 - 2^32\n");
		return 1;
	}

	// Set up the MPI data structures
	// mpirun starts every process, and each one gets a unique *process_id* value - in our case
	// we treat process_id 0 as the special 'controller' which, after searching its own sub-range,
	// receives the calculated results back from each of the other processes.
	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &cluster_size);
	MPI_Comm_rank(MPI_COMM_WORLD, &process_id);
	printf("main[%d]: Started process\n", process_id);

	// Calculate the sub-range that this instance is going to process
	// Each process gets to search 1/N of the entire search space.
	total_range = end - start;
	sub_range_size = total_range / (uint32_t) cluster_size;
	our_start = start + ((uint32_t) process_id * sub_range_size);

	// Step past the last number of each preceding sub-range so that ranges do not overlap
	if (process_id > 0){
		our_start = our_start + (uint32_t) process_id;
	}
	our_end = our_start + sub_range_size;
	if (our_end > end){
		our_end = end;
	}

	// Only the main process prints this
	if (process_id == 0){
		printf("main[%d]: Total range is %" PRIu32 "\n", process_id, total_range);
		printf("main[%d]: Sub-range per instance is %" PRIu32 "\n", process_id, sub_range_size);
	}

	printf("main[%d]: Calculating primes %" PRIu32 " - %" PRIu32 "\n", process_id, our_start, our_end);

	// Every process calculates its own search space (1/N of the entire search space)
	// via the call to primeCount with the custom start and end
	found_primes = primeCount(our_start, our_end);
	printf("main[%d]: Found %" PRIu32 " primes\n", process_id, found_primes);

	// Send found_primes from each of the instances (other than instance 0)
	// The MPI datatype must match the C type of the buffer (uint32_t)
	if (process_id != 0){
		MPI_Send(&found_primes, 1, MPI_UINT32_T, 0, 1, MPI_COMM_WORLD);
	}

	// Process_id 0 receives the found_primes totals from each sub process
	if (process_id == 0) {
		printf("main[%d]: Now waiting for results from instances...\n", process_id);
		total_primes += found_primes;
		// Loop over all sub processes launched and wait for each one to send their results back
		for (sub_process_id = 1; sub_process_id < cluster_size; sub_process_id++){
			found_primes = 0;
			MPI_Recv(&found_primes, 1, MPI_UINT32_T, sub_process_id, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
			printf("main[%d]: Got %" PRIu32 " from [%d]\n", process_id, found_primes, sub_process_id);
			total_primes += found_primes;
		}
	}

	// Shut down MPI - no MPI calls may be made after this point
	MPI_Finalize();
	if (process_id == 0){
		printf("main[%d]: Found a total of %" PRIu32 " primes\n", process_id, total_primes);
	}
	return 0;
}

There's quite a bit more code now, but most of it is related to the starting up and communication of the additional MPI processes.

The important aspects are that every process works out its own sub-range from its process ID, and that the controller then waits for each of the others to send its result back. Otherwise every single process can work entirely independently - this is the ideal case. In many real-world examples you may have processes which need to communicate partial results back intermittently during processing; the more communication which needs to happen during processing, the less speedup you will achieve.
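
The sub-range arithmetic in multi.c can be checked in isolation, without starting any MPI processes. The sketch below reproduces it in a hypothetical helper called subRange (not part of the downloadable code), where rank and size stand in for the process_id and cluster_size values that MPI would normally provide.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: computes the inclusive sub-range [*lo, *hi] searched
// by process `rank` out of `size` processes, using the same arithmetic as multi.c.
void subRange(uint32_t start, uint32_t end, int rank, int size,
              uint32_t *lo, uint32_t *hi){
	assert(size > 0 && rank >= 0 && rank < size);
	uint32_t sub_range_size = (end - start) / (uint32_t) size;
	// Each chunk holds sub_range_size + 1 numbers, so chunk k begins at
	// start + k * (sub_range_size + 1)
	*lo = start + (uint32_t) rank * (sub_range_size + 1);
	*hi = *lo + sub_range_size;
	// The last chunk is clamped so that it never runs past end
	if (*hi > end){
		*hi = end;
	}
}
```

For the range 2 - 100 split over four processes, this gives 2 - 26, 27 - 51, 52 - 76 and 77 - 100: every number is covered exactly once, and the last process receives a slightly smaller chunk.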

Now compile this new multi.c source, as well as the original primes.c source code. Because we are building an MPI application we use mpicc which is a wrapper around gcc provided by OpenMPI:

$ module load OpenMPI
$ mpicc -c primes.c -o primes.o
$ mpicc -c multi.c -o multi.o

The output should be primes.o and multi.o.

With both files compiled, we need to create a runnable programme by linking multi.o with the primes.o object file we have created. Again we use mpicc, just as we used gcc to link the original version:

$ mpicc -o multi multi.o primes.o

We now have multi, which is the MPI version of our application. Read the next section to benchmark this new version.

If you downloaded the .zip included at the bottom of this guide, you should already have the script compile.sh which you ran earlier in the serial example - this should have created multi for you. If not, then run ./compile.sh again.

Benchmarking the parallel MPI approach

This version of the application works much the same as single and takes the same two arguments (start and end), but since it is now an MPI application, it must be started with mpirun which coordinates the startup of any additional processes we launch. We'll need to load OpenMPI first, and then call the new multi application as an argument, as follows:

$ module load OpenMPI
$ mpirun -n 2 ./multi 2 100

Here mpirun is given the argument -n 2 to instruct it to use two processes - each with one CPU core. If we were running under Slurm we would usually use the same number of cores we had requested via SBATCH headers. If not running under Slurm, you must instruct mpirun to use a specific number of processes (one process per CPU core) with the -n option otherwise it will attempt to grab all of them - do not do this on a shared system such as the Comet login nodes!

For example, to launch four processes, and therefore use four CPU cores:

$ module load OpenMPI
$ mpirun -n 4 ./multi 2 100

Now, going back to our benchmark results we can run the search spaces again, testing different numbers of processes (and therefore CPU cores). On a specific CPU we see the following results:

Searchspace	2 Cores (sec)	3	4	5	6	7	8	9	10	11	12	13	14	15	16
2-100	0.371	0.398	0.39	0.421	0.446	0.433	0.479	0.542	0.558	0.575	0.564	0.701	0.619	0.591	0.614
2-1000	0.39	0.385	0.427	0.388	0.428	0.431	0.446	0.541	0.517	0.561	0.563	0.612	0.638	0.628	0.601
2-10000	0.383	0.41	0.416	0.421	0.437	0.415	0.467	0.55	0.535	0.546	0.57	0.599	0.596	0.601	0.637
2-100000	0.392	0.411	0.433	0.428	0.428	0.445	0.472	0.526	0.569	0.577	0.56	0.58	0.595	0.66	0.606
2-1000000	0.873	0.726	0.669	0.641	0.635	0.599	0.637	0.678	0.68	0.713	0.696	0.696	0.699	0.735	0.666
2-10000000	15.87	11.33	8.794	7.23	6.17	5.43	4.83	4.79	4.75	4.67	3.873	4.012	3.517	3.603	3.49
2-100000000	490.077	349.832	267.5	216.451	182.87	158.38	139.7	123.457	114.97	104.508	112.836	110.21	101.214	103.716	95.16

If we consider only the 2-100000000 search space variant, we observe the following increase in speed, relative to the original single-CPU, serial version of the application:

Searchspace	Serial version	2 Cores (sec)	3	4	5	6	7	8	9	10	11	12	13	14	15	16
2-100000000	752.08	490.077	349.832	267.5	216.451	182.87	158.38	139.7	123.457	114.97	104.508	112.836	110.21	101.214	103.716	95.16
Relative speed	1x	1.5x	2.1x	2.8x	3.5x	4.1x	4.7x	5.4x	6.1x	6.5x	7.2x	6.7x	6.8x	7.4x	7.3x	7.9x
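
The relative speed row is simply the serial runtime divided by the parallel runtime. The sketch below shows that calculation, together with the related parallel efficiency figure (speedup divided by the number of cores); the function names are hypothetical and are not part of the downloadable code.

```c
#include <assert.h>

// Hypothetical helpers for interpreting benchmark timings

// Speedup relative to the serial run: serial time / parallel time
double speedup(double serial_seconds, double parallel_seconds){
	assert(parallel_seconds > 0.0);
	return serial_seconds / parallel_seconds;
}

// Parallel efficiency: the fraction of the ideal N-times speedup achieved
double efficiency(double serial_seconds, double parallel_seconds, int cores){
	assert(cores > 0);
	return speedup(serial_seconds, parallel_seconds) / (double) cores;
}
```

With the figures above, speedup(752.08, 95.16) is about 7.9 and efficiency(752.08, 95.16, 16) is about 0.49, so at 16 cores each core delivers roughly half of what a perfectly scaling run would give. At 2 cores the efficiency is about 0.77.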

Actual results will vary depending on the type and architecture of the CPU being used. Generally, laptop processors will perform the worst, desktop/workstation processors in the middle, and higher-end server processors best due to architecture differences, cache sizes and memory channel configuration.

You can test this under Slurm by using the included job_mpi.sh script. Just adjust the --account option to your own Comet account code, and submit as sbatch job_mpi.sh.

Why do we not get a 16x speed up with 16 CPU cores?

In this particular case, we start to see a levelling off of the speed gains around the 10-11 CPU core mark. This is likely to be due to the architecture of the workstation processor being used in this test - which was an Intel design with a mixture of performance (P-core) and efficiency (E-core) compute units. All processor types and processor architectures are different here and there is not a single rule which can accommodate all hardware designs.

Whilst we do not have mixed architecture processors on Comet (unlike most modern laptops and desktops/workstations), you may still hit memory bandwidth limitations that cause performance increases to level off, or see CPU core frequency reductions as you load up more and more active CPU cores. By experimenting with different numbers of CPU cores on a system such as Comet, you will likely find that the level at which performance increases tail off is much, much higher, but it will still exist.

Every design and every MPI workload will need to be tested to find the optimal runtime parameters.

This is on top of the overhead of starting the MPI processes and of the communication between the controlling process and the other processes - whilst this is very small in our example (limited to just process start-up and the receipt of the results at the end), it still eats into the actual time the application spends running. It is visible in the benchmark table as the floor of roughly 0.4 - 0.6 seconds for the smallest search spaces.

Considering the limitations of the MPI approach

Some points to consider about using MPI to solve a problem in a distributed way:

Increasing the number of workers decreases the overall time to solve the problem… but
Increasing the number of workers also raises the minimum latency to solve smaller problems
There is a minimum time to solve - for small problems this is greater than the single process solution as we have to marshal all of the workers, distribute work and collect their results when finished
There are decreasing gains after a certain threshold - if we run N workers we should not expect a full Nx speedup (see Amdahl's law)
- This will vary based on your problem, the algorithm you are using and the architecture of the hardware you are running on - there is no magic number
- Large numbers of workers on the same node/processor can cause high levels of contention for CPU cache access and memory throughput
Our problem involved only limited communication between workers and the parent process (each worker reads its input values at start-up and returns a single result when complete)
More complex algorithms that require additional inter-worker communication will have greater overheads, reducing the achievable speedup even further
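
Amdahl's law, mentioned in the list above, puts an upper bound on the speedup when only part of the runtime can be parallelised. A minimal sketch of the formula follows; the function name amdahlSpeedup is hypothetical, and the parallel fraction is something you would have to estimate for your own workload.

```c
#include <assert.h>

// Amdahl's law: the upper bound on speedup when a fraction
// `parallel_fraction` (0.0 - 1.0) of the runtime can be spread over
// `workers` processes and the remainder stays serial.
double amdahlSpeedup(double parallel_fraction, int workers){
	assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
	assert(workers > 0);
	return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / (double) workers);
}
```

For example, if 95% of the runtime parallelises perfectly (an illustrative figure, not a measurement from this example), amdahlSpeedup(0.95, 16) is about 9.1, and no number of workers can push the speedup beyond 1 / (1 - 0.95) = 20.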

Additionally, if writing your implementation from scratch, there is always the overhead of writing the code to spin up your workers, arranging communication between the worker(s) and parent process, as well as the final stitching together of the result at the end. In our case the final step is perhaps overly simplistic (a single figure which is a tally of the number of primes found), but in real-world examples you will likely need to undertake further processing of the returned data, for example to rebuild an image map from all of the sub-tiles which have been processed.
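
For a single tally like the one in this example, the stitching step can also be handed to MPI itself. The sketch below assumes an MPI implementation that provides MPI_UINT32_T (MPI 2.2 or later) and uses the standard collective MPI_Reduce() in place of a loop of MPI_Send() and MPI_Recv() calls. The value assigned to found_primes is a made-up placeholder for the real search, and this is not the approach used in the downloadable multi.c.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]){
	int process_id;
	uint32_t found_primes = 0;
	uint32_t total_primes = 0;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &process_id);

	// Placeholder for the real work: pretend each process found
	// (process_id + 1) primes in its own sub-range
	found_primes = (uint32_t) process_id + 1;

	// Sum found_primes from every process into total_primes on process 0
	MPI_Reduce(&found_primes, &total_primes, 1, MPI_UINT32_T, MPI_SUM, 0, MPI_COMM_WORLD);

	if (process_id == 0){
		printf("main[%d]: Found a total of %" PRIu32 " primes\n", process_id, total_primes);
	}

	MPI_Finalize();
	return 0;
}
```

Compiled with mpicc and started with mpirun -n 4, the placeholder values 1 + 2 + 3 + 4 add up to a reported total of 10. A collective like this keeps the gathering logic to a single call, but it only helps when the results can be combined with a simple operation such as a sum.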

This was a simple example, written to illustrate how to use a common function (primeCount) in both a serial (single.c) and a parallel (multi.c) implementation. Adding MPI made our solution quite a bit more complex. However, for most real-world problems, the additional MPI setup (setting up workers, distributing job data and receiving results) is likely to represent a small fraction of your overall code compared to the rest of your algorithm/workload/pipeline. If it does not, that could be an indicator that an MPI solution may not be the best choice!

Downloads

You can download all of the scripts used in this example using the link below:

slurm_mpi_job_example.zip
