Parallel Design Patterns
Assessed Coursework, 2023
The deadline for PART TWO of the assessed coursework is Thursday 6th of April at 4pm. You will submit this via the PDP learn pages.
About the coursework
Part two of the coursework follows on from your initial work on part one. However, these two pieces are marked independently and your grade for part two will not depend on the answers given in part one.
In this part of the coursework you will:
• Write parallel code which parallelises the nuclear engineer’s reactor core model using either geometric decomposition or the task-based parallelism pattern [75%]
• Write a report that explains your implementation and performance [25%]
Your code should be written in Fortran, C or C++ and parallelised with MPI. You must base your parallel implementation upon the serial code which has been provided (either the C or Fortran version), with the results remaining unchanged.
The serial code implements a version of the reactor core simulation model.
• Standard solution: Provide a parallel, distributed memory solution using MPI that correctly follows either geometric decomposition or the task-based parallelism pattern for the provided serial code. This is sufficient for a mark of up to around 65%.
o You will obtain higher marks in this range if you are able to demonstrate that your parallel solution, running at scale, is capable of significantly increasing the artificial limit on the number of neutrons in the simulation.
• Excellent solution: To obtain marks beyond 65% and into the distinction (70%+) level you should do at least one, or both, of:
1. Provide a framework offering geometric or task-based parallelism for a generic simulation, containing no problem-specific code for this model, so that it could be reused. The problem-specific code should be provided separately and use your framework to solve the problem for the nuclear engineer’s reactor model.
2. Provide a mixed OpenMP and MPI hybrid parallelisation, with OpenMP across cores within a NUMA region or node, and MPI distributed memory between processes.
Your code should compile and run on Cirrus and/or ARCHER2, across multiple nodes at least (distributed memory parallelism). You can use any of the compilers available on those machines. Ensure you state which compiler and target machine(s) you are using in a README file submitted with your source code.
Code [75%]
Your code should:
• Compile and run on Cirrus and/or ARCHER2
• Parallelise the nuclear engineer’s reactor core model based on the provided serial code, leveraging at least distributed memory parallelism via MPI, and potentially mixed OpenMP with MPI for an excellent solution. Note that providing a shared memory only (OpenMP-only) parallelisation without MPI is not sufficient to be considered a standard solution.
• Be clear and adopt a clean design.
• Be packaged neatly with a README file describing how the code should be built and run. It should also include a makefile for building the code and a submission script to run your executable on Cirrus and/or ARCHER2 compute node(s). You should make it clear which machine your code will run on if it does not run on both.
• Be adequately commented to a level that would allow others to work on your parallel code in the future.
Performance and scalability are important considerations in this assessment. Whilst the focus here is on the parallelisation, there are some aspects of the existing serial code which are less than optimal, and credit will be provided if you address these as part of your solution too. You are free to change any part of the serial code that you wish as long as the result is correct.
As per the serial code, periodically a general summary of the status of the simulation should be displayed along with a final summary once the simulation terminates. This information is displayed by the serial code and must also be provided by your parallelised version. Furthermore, as per the serial code, a file should be generated which contains the state of the reactor core as the simulation progresses at specified points in time.
Suggested Configurations
There is a simulation configuration provided, config_simple.txt, which implements a small 1m by 1m reactor. You will see that even this takes a long time to simulate at the given accuracy. You are free to create your own configuration files and change any settings. For instance, for performance scaling runs you might increase the size of the reactor core, increase the number of timesteps to lengthen the run, or reduce the nanoseconds between timesteps to increase accuracy.
You will see an option MAX_NEUTRONS. This is an artificial limit on the number of neutrons, set by the configuration in order to limit the complexity and hence the runtime. Ideally your parallelisation, on large core counts, will mean that this option can be significantly increased or removed.
The COLLISION_PROB_MULTIPLIER is another artificial setting which multiplies the probability of neutron fuel collision. This is required to produce fissions, firstly because we limit the number of neutrons artificially and secondly due to the simplification of some of the physics. Once you increase the number of neutrons then this multiplier can likely be reduced, but maybe not eliminated.
Report [25%]
Your report should mainly focus on the design of your implementation and the resulting performance you have obtained. You should explain how you have applied either the geometric decomposition or task-based parallelism pattern to the problem.
If you have developed a framework as part of an excellent solution, then you should document how it is designed to be used by the user and where the split between mechanism and policy lies. If you have undertaken hybrid OpenMP-MPI parallelisation, then you should explain your design and provide some performance comparison of this against MPI only to explore whether it is beneficial in this case.
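As an illustration of the kind of design detail worth explaining for geometric decomposition, the sketch below splits a one-dimensional range of items (for example, columns of channels) into contiguous blocks, one per rank. The names block_range, block_partition and block_owner are hypothetical and do not come from the provided serial code, and this is only one of several reasonable ways to decompose the domain.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [start, end) of items owned by one rank. */
typedef struct {
    size_t start;
    size_t end;
} block_range;

/*
 * Split n_items as evenly as possible over n_ranks. The first
 * (n_items % n_ranks) ranks each receive one extra item.
 */
block_range block_partition(size_t n_items, size_t n_ranks, size_t rank)
{
    assert(n_ranks > 0 && rank < n_ranks);
    size_t base = n_items / n_ranks;
    size_t extra = n_items % n_ranks;
    block_range r;
    r.start = rank * base + (rank < extra ? rank : extra);
    r.end = r.start + base + (rank < extra ? 1 : 0);
    return r;
}

/* Inverse mapping: which rank owns a given item index? */
size_t block_owner(size_t n_items, size_t n_ranks, size_t item)
{
    assert(n_ranks > 0 && item < n_items);
    size_t base = n_items / n_ranks;
    size_t extra = n_items % n_ranks;
    size_t boundary = extra * (base + 1); /* items held by the larger blocks */
    if (item < boundary)
        return item / (base + 1);
    return extra + (item - boundary) / base;
}
```

In an MPI code, the rank and the number of ranks would come from the communicator; a mapping like block_owner could then decide where a neutron should be sent when it crosses into a region held by another process.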
Credit will be given in the report for exploring the performance and scaling properties of the parallelised code, for instance via weak or strong scaling experiments on either Cirrus or ARCHER2. A discussion about aspects of the code’s design that help or hinder performance will also be rewarded, along with highlighting any fundamental limitations.
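The usual metrics for such experiments can be computed from measured wall-clock times, as in the minimal sketch below. The function names are hypothetical; the definitions used are the standard ones (strong scaling speedup T(1)/T(p) with efficiency speedup/p, and weak scaling efficiency T(1)/T(p) when the work per process is held constant).

```c
#include <assert.h>

/* Strong scaling: total problem size is fixed. Speedup = T(1) / T(p). */
double strong_speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Strong scaling parallel efficiency = speedup / p (1.0 is ideal). */
double strong_efficiency(double t_serial, double t_parallel, int n_procs)
{
    assert(n_procs > 0);
    return strong_speedup(t_serial, t_parallel) / (double)n_procs;
}

/*
 * Weak scaling: problem size grows in proportion to p, so the ideal
 * runtime stays constant. Efficiency = T(1) / T(p) (1.0 is ideal).
 */
double weak_efficiency(double t_base, double t_parallel)
{
    assert(t_base > 0.0 && t_parallel > 0.0);
    return t_base / t_parallel;
}
```

For example, a run that takes 100 s on one process and 25 s on eight processes has a speedup of 4 and a strong scaling efficiency of 0.5.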
A discussion of how you guarantee the correctness of your model will gain you some extra credit, but lengthy descriptions or explanations about the output are not required to gain a very good mark.
Serial code provided to you
The serial code is provided to you in both C and Fortran versions. You should start from this, and the code itself comprises a number of distinct features:
1. The main simulation code which contains the program entry point and much of the logic as described in the details of the model. It is probably easiest to focus on this part first, and I suspect you can mainly focus on this when applying your choice of geometric decomposition or task-based parallelism pattern.
2. Configuration parsing which reads the input configuration file and parses the options into a structure/user derived type. I don’t imagine you will need to change this much, but feel free to edit if it’s helpful in any way.
3. Simulation support functions which provide utility functionality to implement specific facets of the details of the model. You may or may not need to change this depending on the parallelisation/optimisations adopted.
The physics of the model are fairly simplistic, and this is fine for our purposes. I strongly suggest against making them more advanced, as you will not obtain marks for doing so. Whilst I will give credit for optimising existing serial code that has not been written optimally, I do not expect you to undertake advanced algorithmic changes.
Furthermore, if you find bugs in the serial code then let me know and I will fix and push out an update for everyone on Learn.
Details of the nuclear engineer’s reactor core model
The model that the nuclear engineers have written exhibits the following behaviour:
• The reactor core is represented as a cube in three dimensions and consists of numerous cuboid channels laid out in two dimensions.
o These channels run all the way down the reactor in the vertical (z) dimension and are 20 cm wide in the x and y dimensions. Channels are placed next to each other, so for instance in a reactor core of 1 m³ there will be five rows of five channels, each 1 metre deep in vertical length.
o Channels can contain nothing (they are empty), nuclear fuel assembly, control rod, a moderator, or a neutron generator. It is only possible for each individual channel to contain one of these.
o A fuel assembly is made up of fuel pellets. Each pellet is x=40mm by y=40mm by z=2mm and weighs 1 gram. Fuel pellets are stacked on top of each other all the way down the fuel assembly channel.
• The simulation progresses in timesteps, where a timestep is measured in nanoseconds (1 ns is 1e-9 seconds).
o The size of each timestep (in ns) is configurable as an input parameter.
• The simulation also contains lots of neutrons which are free to pass through the reactor core and are tracked. At every timestep, the code calculates the movement of each neutron and updates its position.
o Once a neutron travels outside the reactor core it is deactivated and disappears from the simulation.
• At every timestep, for each neutron, the code will check whether it has interacted with the contents of the reactor core.
o If a neutron enters a fuel assembly channel, then the neutron’s absorption cross section is calculated which is determined by the neutron’s energy and the type of fuel in the reactor.
▪ This cross section is then used to calculate the probability that the neutron has been absorbed by the fuel which also depends on the current number of atoms of that fuel in the pellet. If the neutron is absorbed, then the atom of fuel that it has been absorbed by gains an extra neutron (e.g., goes from U235 to U236) and the neutron disappears from the simulation.
o If the neutron enters the moderator, then the neutron’s scattering cross section and absorption probability are calculated.
▪ Based on these, if the neutron collides with the moderator’s atoms, then it is slowed down (slower neutrons are more likely to cause fission). If the neutron is absorbed by the moderator, then it disappears from the simulation.
o If the neutron enters the control rod channel, then the code calculates whether the neutron has collided with the control rod.
▪ Control rods can be lowered a certain amount into the reactor, so whether the neutron hits the control rod is based upon how far the rod has been lowered and the location of the neutron.
▪ Any neutron-control rod collision results in absorption and the neutron disappears from the simulation.
• At each timestep the state of the reactor core is updated
o All atoms of U236 and Pu240 fission, and each fission releases 200 MeV of energy
▪ There is an 85% chance that U236 splits into Barium and Krypton which releases 3 neutrons. Otherwise, it will split into Xenon and Strontium, releasing 2 neutrons.
▪ There is a 73% chance Pu240 splits into Xenon and Zirconium releasing 3 neutrons. Otherwise, it will release a neutron and mutate into Pu239.
▪ The ejected neutrons’ energy (between 0 and 20 MeV) and resulting velocity components in the x, y and z dimensions are random.
o Neutron generators contain Californium-252 and this will release 23e12 neutrons per gram per second.
▪ Every cm in height of the neutron generator is half a gram in weight.
▪ Again, the resulting neutrons’ energy (between 0 and 20 MeV) and velocity components in the x, y and z dimensions are random.
• The size and configuration of the reactor core (e.g., the type of each channel) is provided by the user via a configuration script.
o Reactor fuel can be a mixture of U235, U238 and Pu239, and the configuration script provides the percentage of these in fuel assemblies.
▪ Only U235 and Pu239 will fission; U238 does not.
▪ From these percentages the code calculates the number of atoms in each fuel pellet for each chemical, with other elements that are fission by-products (Barium, Krypton, Xenon, Strontium and Zirconium) set to zero initially.
o The moderator can be one of water, heavy water (deuterium) or graphite. Each has different neutron slowing and absorption properties. The weight of the moderator in each channel is provided in grams.
▪ For instance, water is more likely to slow neutrons but also much more likely to absorb them.
▪ We want the moderator to slow down the neutrons to increase the probability of fission, but we do not want it to absorb neutrons.
▪ The heavier the moderator then the more of it is present, this will raise the probability of neutron slowing and moderator absorption.
o The percentage of how far each control rod has been inserted into the reactor core can be provided, and if this is omitted then it is assumed to be zero (i.e. the control rod is fully out of the reactor).
• Due to limitations of the simulation (runtime and memory), the engineers have set a maximum number of neutrons that can be active at any one time.
o This artificial limit impacts the ability of the simulation to undertake fission, and so they have also introduced an artificial absorption probability multiplier, which increases the probability that an individual neutron is absorbed by the fuel by a specific multiplication factor.
o Ideally, the parallelisation of their code will mean that they can remove or significantly increase the maximum neutrons configuration limitation.
• Periodically the state of the reactor should be appended to a file.
o This should contain the simulation time, the amount of energy released via fission and, for each fuel assembly, the number of atoms of Uranium (235, 236 and 238), Plutonium (239 and 240), Barium, Krypton, Xenon (134 and 140), Strontium and Zirconium present.
o The frequency of this reactor state storage is configurable by the user.
• There should be frequent summaries of simulation progress printed to stdout
o These include the current simulation time, the number of active neutrons, the number of fissions so far and the total amount of energy released through fission.
• The simulation will terminate when a predetermined number of timesteps is reached.
o A short report is printed to stdout which reports the total number of fissions that have completed and associated energy release, along with the simulation code’s
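Two pieces of arithmetic from the description above can be sketched as follows: mapping a position to a channel index (20 cm channels), and estimating how many neutrons a generator channel emits per timestep (23e12 neutrons per gram per second, half a gram per cm of height). The constant and function names are hypothetical and are not taken from the provided serial code, which may organise these calculations differently.

```c
#include <assert.h>

#define CHANNEL_WIDTH_M 0.2                /* 20 cm channels, from the model description */
#define CF252_NEUTRONS_PER_G_PER_S 23e12   /* emission rate quoted in the model description */
#define GENERATOR_G_PER_CM 0.5             /* each cm of generator height weighs half a gram */

/*
 * Number of channels along one horizontal axis. Assumes the core width
 * is a whole number of channels; adding 0.5 guards against round-off.
 */
int channels_per_axis(double core_width_m)
{
    assert(core_width_m > 0.0);
    return (int)(core_width_m / CHANNEL_WIDTH_M + 0.5);
}

/*
 * Channel index along one axis for a position in metres. Positions that
 * sit exactly on a channel boundary are subject to floating-point round-off.
 */
int channel_index(double pos_m)
{
    assert(pos_m >= 0.0);
    return (int)(pos_m / CHANNEL_WIDTH_M);
}

/* Expected number of neutrons emitted by one generator channel in one timestep. */
double generator_neutrons_per_step(double height_cm, double dt_ns)
{
    assert(height_cm >= 0.0 && dt_ns >= 0.0);
    double grams = height_cm * GENERATOR_G_PER_CM;
    double dt_s = dt_ns * 1e-9;
    return CF252_NEUTRONS_PER_G_PER_S * grams * dt_s;
}
```

With these figures, a generator channel 1 metre (100 cm) tall would emit roughly 1.15e7 neutrons in a 10 ns timestep, which gives a sense of why the engineers cap the number of active neutrons.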
An excellent solution
To obtain marks beyond around 65% and into the distinction (70%+) level then you will need to do at least one of the following activities:
1. A framework that splits mechanism from policy, where the mechanism of your geometric or task-based decomposition is provided in a generic, reusable manner so that other people could leverage it. Your problem specific code should call into the framework and utilise the framework to undertake all the parallelism. Problem specific code should be entirely abstracted from the mechanism of parallelisation, and you should make clear in your report the design of your framework, where the split between mechanism and policy lies, and how users would use your framework.
2. Provide a mixed OpenMP and MPI implementation which uses OpenMP across cores in a NUMA region and/or node, and MPI between processes. You will need to ensure that MPI is initialised in threaded mode (MPI_Init_thread) and that the threading mode you are using is supported by your MPI library (using MPI_Query_thread). More information about OpenMP and MPI interoperability can be found at https://github.com/EPCCed/archer2-AMPP-2022-06-29/blob/master/slides/L06-MPIandOpenMP.pdf, and the Cirrus and ARCHER2 websites explain how to submit jobs with a mixture of processes and threads. Your report should explain how you have mixed OpenMP and MPI and provide some performance comparison against an MPI only approach to highlight whether this benefits the simulation code or not.

Doing only one of these well is sufficient for distinction level marks, and you will receive higher marks if you do both. However, I am more forgiving of a limited implementation if you do both: if you provide a framework and a hybrid parallelisation which on their own are fairly limited and would each score lower than a distinction, then together they could take you above the distinction (70%) mark level.
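To make the mechanism and policy split in option 1 more concrete, here is a deliberately tiny, serial sketch. The driver (mechanism) knows nothing about what an item is, and the problem-specific behaviour (policy) is supplied through a callback. All names (sim_policy, framework_run, particle, move_particle) are hypothetical, and a real framework would hide the MPI decomposition and communication inside the driver rather than a plain loop.

```c
#include <assert.h>
#include <stddef.h>

/* Policy: problem-specific behaviour supplied by the user of the framework. */
typedef struct {
    void (*update_item)(void *item, double dt, void *user_ctx);
    size_t item_size; /* size of one item in bytes */
} sim_policy;

/* Mechanism: generic driver containing no problem-specific code. */
void framework_run(const sim_policy *policy, void *items, size_t n_items,
                   size_t n_steps, double dt, void *user_ctx)
{
    assert(policy != NULL && policy->update_item != NULL);
    unsigned char *base = items;
    for (size_t step = 0; step < n_steps; step++) {
        /* A parallel framework would distribute this loop over processes
         * and exchange data between them here. */
        for (size_t i = 0; i < n_items; i++)
            policy->update_item(base + i * policy->item_size, dt, user_ctx);
    }
}

/* Example policy: a particle moving with constant velocity along x. */
typedef struct {
    double x;
    double vx;
} particle;

static void move_particle(void *item, double dt, void *user_ctx)
{
    (void)user_ctx; /* unused in this example */
    particle *p = item;
    p->x += p->vx * dt;
}
```

A user would fill in a sim_policy with move_particle and sizeof(particle), then call framework_run on their array; swapping in a different callback and item type changes the simulation without touching the driver.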
Having Difficulty?
If you are struggling to get all the aspects running in parallel, then you may wish to limit your parallelisation to a simpler subset of the functionality. For example, you could modify the serial code so that neutrons are static throughout the simulation and not dynamically created or destroyed. If this is the case, then you should state clearly in your report which simplifications you have made. A working code for a simplified model could gain as good a mark or better than a broken code attempting the full model.
A code that does not quite work might be good enough to pass as long as the ideas are correct, and the code is accompanied by a good quality report. If a non-working code is submitted, the report should explain the parts that do work, should describe the symptoms of why the program is not working and the steps taken to try and fix the problems.
If all else fails, use the report to describe how you would have parallelised the code given more time. Describe the code that you have submitted.