====== Build A Parallel MPI Solution ======
**This section is incomplete**
This documentation section is still being written.
This is an //example// scenario and serves to highlight how you can use MPI to compute individual elements of a larger problem. It is a simple problem and in most real world cases you will have additional complexity to deal with, but the //principles// remain the same.
Many existing applications, libraries and frameworks //already// implement MPI, so you may not need to write the code yourself, however if you are writing something entirely new in C, C++, Fortran or Python (for example), you //may// need to use MPI to distribute your compute tasks over many cores, processors or nodes.
For anyone intending to write code for a complex problem using MPI //from scratch//, we would recommend taking a course such as the [[https://www.archer2.ac.uk/training/courses/250818-mpi/|ARCHER2 - Message Passing with MPI]] workshop.
----
===== Steps to a solution =====
We will approach building an MPI based compute solution by following the steps below:
* Identify the problem
* Writing a function or algorithm which solves a specific compute problem
* Testing that algorithm in a single threaded, single process
* Benchmarking the single process approach
* An approach to dividing the problem into discrete units
* Implementing an MPI framework on top of the existing function, without requiring any changes to the algorithm itself
* Benchmarking the parallel MPI approach
* Considering limitations to the MPI approach
----
===== Identify the problem =====
For the purposes of this MPI example, we are going to use a simple mathematics problem - finding prime numbers in a given (huge!) range.
This is an easy problem to illustrate and write code for, but it is one which demonstrates the thought processes which need to go in to designing your algorithm and the process for deconstructing a large problem into discrete sub-components.
Let us formally define the problem as:
> Find all prime numbers between two positive integers supplied by the user (called ''start'' and ''end'' - also the //searchspace//) and record the total number of primes found.
Clearly this is relatively fast on modern CPU hardware for a //small// range, but the larger the range we want to find primes for, the longer the processing will take. Given a large enough search space the problem becomes unsolveable in a //reasonable// timeframe if approached purely sequentially.
**Following this example**
If you want to follow this example as you read, you can jump to the [[#downloads|Downloads]] link at the bottom of the page and download a ''.zip'' with all of the code and job scripts needed to follow each step in turn.
----
===== Writing a function or algorithm =====
First, we need to write a function which takes our ''start'' and ''end'' integers and can calculate which prime numbers are in that range. We'll limit the implementation to [[https://en.wikipedia.org/wiki/32-bit_computing|32bit integers]] for simplicity.
In this case we are going to be using the **C** programming language, but the actual language implementation is //not// important at this stage - you could be using Python, Fortran, R or any other programming language with MPI bindings.
We write a //standalone// function called ''primeCount()'' which takes the input values ''start'' and ''end'', and also returns ''prime_count'', which is the total number of primes found. Because we're writing the algorithm as a standalone function, we will have more flexibility in how we call it later; generally extracting out core logic and algorithms to their own functions usually leads to cleaner applications.
Either extract the file from the [[#downloads|downloads]] ''.zip'' file, or save the file below as ''primes.c'':
#include
#include
#include
#include
#include "primes.h"
uint32_t primeCount(uint32_t start, uint32_t end){
// Calculate all of the primes between start and end
// Return a count of how many primes have been found
// This is *not* the most optimised version, but serves
// to illustrate the differences in sequential vs parallel processing.
uint32_t prime_count = 0;
uint32_t not_prime_count = 0;
uint32_t div_count = 0;
uint32_t start_num, end_num, i;
printf("primeCount: Calculating primes %d - %d\n", start, end);
for (start_num = start; start_num <= end; start_num++){
div_count = 0;
for (i = 2; i * i <= start_num; i++){
if (start_num % i == 0){
div_count++;
}
}
if (div_count > 0){
// Not prime
not_prime_count++;
} else {
// Prime
prime_count++;
}
}
printf("primeCount: Found %d primes\n", prime_count);
return prime_count;
};
Because it's written in **C**, and the function is not going to be used directly, but be called by other applications, we will also need a very small header file so that other applications know how to call it (what values it takes and what it returns). Other software we write will treat the actual code inside the function as a [[https://en.wikipedia.org/wiki/Black_box|black box]].
Either extract the file from the [[#downloads|downloads]] ''.zip'' file, or save the file below as ''primes.h'':
#include
uint32_t primeCount(uint32_t start, uint32_t end);
Now that we've written a function to implement our algorithm, we need to move on to the simplest method of testing it.
----
===== Testing the algorithm single process =====
Let's build a simple implementation which just calls our ''primeCount()'' function with two command line parameters (''start'' and ''end'') for the //entire// searchspace and then prints out the result which ''primeCount()'' sends back.
Either extract the file from the [[#downloads|downloads]] ''.zip'' file, or save the file below as ''single.c'':
#include "primes.h"
#include
#include
int main(int argc, char *argv[]){
uint32_t total_primes = 0;
uint32_t start = 0;
uint32_t end = 0;
// Arg1 is the number to start searching from
if (argv[1]){
start = atoi(argv[1]);
}
// Arg2 is the number to end the search at
if (argv[2]){
end = atoi(argv[2]);
}
// You must supply both arguments
if ((start < 2) || (end < 2)){
printf("You must enter two positive numbers in the range 2 - 2^32\n");
return 0;
}
printf("main: Calculating primes in the range %d - %d\n", start, end);
// Make a single call to the primeCount function, for the entire start-end search space
total_primes = primeCount(start, end);
printf("main: Found a total of %d primes\n", total_primes);
return 0;
}
Compile ''primes.c'' and ''single.c'' using GCC:
$ module load GCC
$ gcc -c primes.c -o primes.o
$ gcc -c single.c -o single.o
The output should be ''primes.o'' and ''single.o''.
With both files compiled we need to create a runnable programme by linking the ''primes.o'' library we have created
$ gcc -o single single.o primes.o
The output should be an executable named ''single''.
----
===== Benchmarking the single process =====
Now that we have a programme which runs, we can test it:
$ ./single
You must enter two positive numbers in the range 2 - 2^32
$
Okay, let's try with a simple test:
$ ./single 2 100
main: Calculating primes in the range 2 - 100
primeCount: Calculating primes 2 - 100
primeCount: Found 25 primes
main: Found a total of 25 primes
$
It works! We can measure the time by prefixing the command with the Linux ''time'' command:
$ time ./single 2 100
main: Calculating primes in the range 2 - 100
primeCount: Calculating primes 2 - 100
primeCount: Found 25 primes
main: Found a total of 25 primes
real 0m0.003s
user 0m0.000s
sys 0m0.004s
Not very long at all... but let's try increasing the //searchspace// to see how well it performs...
^ Searchspace ^ Primes Found ^ Runtime (sec) ^
| 2-100 | 25 | 0.003 |
| 2-1000 | 168 | 0.004 |
| 2-10000 | 1229 | 0.008 |
| 2-100000 | 9592 | 0.036 |
| 2-1000000 | 78498 | 0.758 |
| 2-10000000 | 664579 | 23.55 |
| 2-100000000 | 5761455 | 752.08 |
This looks like we experience an exponential slowdown as we start to increase the search space:
{{:advanced:single_process_primes.png?700|}}
We're going to have tackle the problem in a different way - further increases of the searchspace will quickly become infeasible to calculate in reasonable amounts of time.
----
===== Dividing the problem into discrete units =====
----
===== Benchmarking the parallel MPI approach =====
^ Searchspace ^ 2 Cores (sec) ^ 3 ^ 4 ^ 5 ^ 6 ^ 7 ^ 8 ^ 9 ^ 10 ^ 11 ^ 12 ^ 13 ^ 14 ^ 15 ^ 16 ^
| 2-100 | 0.371 | 0.398 | 0.39 | 0.421 | 0.446 | 0.433 | 0.479 | 0.542 | 0.558 | 0.575 | 0.564 | 0.701 | 0.619 | 0.591 | 0.614 |
| 2-1000 | 0.39 | 0.385 | 0.427 | 0.388 | 0.428 | 0.431 | 0.446 | 0.541 | 0.517 | 0.561 | 0.563 | 0.612 | 0.638 | 0.628 | 0.601 |
| 2-10000 | 0.383 | 0.41 | 0.416 | 0.421 | 0.437 | 0.415 | 0.467 | 0.55 | 0.535 | 0.546 | 0.57 | 0.599 | 0.596 | 0.601 | 0.637 |
| 2-100000 | 0.392 | 0.411 | 0.433 | 0.428 | 0.428 | 0.445 | 0.472 | 0.526 | 0.569 | 0.577 | 0.56 | 0.58 | 0.595 | 0.66 | 0.606 |
| 2-1000000 | 0.873 | 0.726 | 0.669 | 0.641 | 0.635 | 0.599 | 0.637 | 0.678 | 0.68 | 0.713 | 0.696 | 0.696 | 0.699 | 0.735 | 0.666 |
| 2-10000000 | 15.87 | 11.33 | 8.794 | 7.23 | 6.17 | 5.43 | 4.83 | 4.79 | 4.75 | 4.67 | 3.873 | 4.012 | 3.517 | 3.603 | 3.49 |
| 2-100000000 | 490.077 | 349.832 | 267.5 | 216.451 | 182.87 | 158.38 | 139.7 | 123.457 | 114.97 | 104.508 | 112.836 | 110.21 | 101.214 | 103.716 | 95.16 |
{{:advanced:multi_process_prime_chart.png?1000|}}
----
===== Considering the limitations of the MPI approach =====
Some points to consider about using MPI to solve a problem in a distributed way:
* Increasing the amount of workers decreases overall time to solve the problem... //but//
* Increasing the amount of workers also //raises// the minimum latency to solve smaller problems
* There is a minimum time to solve - for small problems this is //greater// than the single process solution as we have to marshal all of the workers, distribute work and mark their results when finished
* There are decreasing gains after a certain threshold - if we run ''N'' workers we will //never// see a speedup of ''Nx'' (see [[https://en.wikipedia.org/wiki/Amdahl%27s_law|Amdahl's law]])
* This will vary based on your //problem//, the //algorithm// you are using and the //architecture// of the hardware you are running on - there is __no magic number__
* Large numbers of workers on the same node/processor can cause high levels of contention for CPU cache access and memory throughput
* Our problem involved only //limited// communication between workers and the parent process (hand the input values over and receive the result when complete)
* More complex algorithms that require additional inter-worker communication will have greater overheads, reducing our improvement to an even smaller value of ''N''
Additionally, if writing your implementation from scratch, there is always the overhead of writing the code to spin up your workers, arranged communication between the worker(s) and parent process, as well as the final //stitching together// of the result at the end. In our case the final process is perhaps overly simplistic (a single figure which is a tally of the number of primes found), but in your real world examples you may likely need to undertake further processing based on the data which was returned if working with complex data, or to rebuild your image map from all of the sub-tiles which have been processed.
However, for most real world problems, the initial MPI configuration likely represents a //small fraction// of your overall code complexity - if it does not, then it could be an indicator that an MPI solution may not be the best choice.
----
===== Downloads =====
You can download all of the scripts used in this example using the link below:
*
-----
[[:advanced:slurm|Back to Advanced Slurm Job Optimisation]]