HPC resit good example report
May 10, 2024

1 Introduction
The purpose of this report is to describe and justify an OpenMP implementation that parallelizes a serial C code solving a variant of a 2D partial differential equation describing neuron excitation, known as the FitzHugh-Nagumo (FHN) model. The program advances its variables in time steps of size dt: at every step it updates the current time t, evaluates the FHN equations, and then updates the other variables based on that evaluation. Every m time steps it outputs normalized variables.
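As a rough illustration of this structure, the sketch below shows a driver loop of the kind described above. It is not the report's actual code: the routine names (evaluate_fhn, update_fields, output_norms) and the values of dt and m are hypothetical placeholders.

    #include <stdio.h>

    /* Hypothetical stand-ins for the real FHN routines; the actual
     * code's names and signatures differ. */
    static void evaluate_fhn(double t)   { (void)t;  /* evaluate the PDE terms */ }
    static void update_fields(double dt) { (void)dt; /* apply the update       */ }
    static void output_norms(void)       { /* print normalized variables */ }

    int main(void)
    {
        const double dt = 0.01;   /* assumed time-step size            */
        const int nsteps = 50000; /* number of steps, as in the report */
        const int m = 100;        /* assumed output interval           */
        double t = 0.0;

        for (int step = 1; step <= nsteps; step++) {
            t += dt;              /* update the current time          */
            evaluate_fhn(t);      /* evaluate the FHN equations       */
            update_fields(dt);    /* update variables from the result */
            if (step % m == 0)
                output_norms();   /* every m steps, output norms      */
        }
        return 0;
    }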
2 Parallel Approach
The serial implementation requires that threads be synchronized when evaluating the PDEs, updating variables, and calculating the norms, because these calculations need up-to-date variables on all threads. To parallelize with OpenMP, the individual calculations are therefore parallelized. Each calculation loops over an N by N matrix, a perfect opportunity for #pragma omp parallel for; additionally, collapse(2) is used to allow flexibility in allocating iterations, especially when N is not wholly divisible by the number of threads. A minimal sketch of this loop pattern is shown below.
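The following sketch illustrates the collapsed parallel loop pattern just described. It is not the report's actual code; the array names u and du are hypothetical placeholders for the FHN fields.

    #include <omp.h>

    /* Sketch of a collapsed parallel update over an N x N grid.
     * u and du are hypothetical placeholders for the FHN fields. */
    void apply_update(int N, double *u, const double *du, double dt)
    {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                /* collapse(2) merges the two loops into one iteration
                 * space of N*N, so work can be split evenly even when
                 * N is not divisible by the number of threads. */
                u[i * N + j] += dt * du[i * N + j];
            }
        }
    }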
In this implementation, every #pragma sets default(none) unless all variables are shared; this improves legibility and makes debugging easier, since it forces every variable used in the loop to be declared explicitly as either shared or private. The shared variables are those that are only read (e.g. N) or accessed by index (e.g. dv). The variables declared private are reassigned many times during the loop, and so must be private to prevent race conditions. The one function where a variable was assigned neither shared nor private is the normalization, which accumulates into a single variable across all threads; in this instance a reduction(+: ) clause was appropriate. A schedule was also defined in every #pragma statement. For the initialisation function, schedule(static) was used, because each run of the program calls this function only once and there is no branching, so the workload is even across all threads. The other functions are called many more times; the number of calls is set by the number of time steps in the algorithm (50000 for evaluating the PDE, for example). With so many calls, OpenMP's schedule(auto) can be leveraged. This schedule policy allows the compiler and runtime system to freely optimize the mapping of iterations to threads, which allows a good schedule to be chosen, potentially exploiting hardware optimisations or improving the mapping during runtime. A sketch of this reduction pattern is shown below.
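As an illustration of the data-sharing and reduction clauses described above, the sketch below computes a norm with default(none), an explicit shared list, and reduction(+:). It is a minimal example under assumed names (the array v is a placeholder), not the report's actual normalization routine.

    #include <math.h>
    #include <omp.h>

    /* Sketch: L2 norm of an N x N field with an OpenMP reduction.
     * default(none) forces every variable to be listed explicitly;
     * each thread's partial sum is combined by reduction(+:sum). */
    double field_norm(int N, const double *v)
    {
        double sum = 0.0;

        #pragma omp parallel for collapse(2) default(none) \
            shared(N, v) reduction(+:sum) schedule(auto)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                double x = v[i * N + j]; /* private: declared inside the loop */
                sum += x * x;
            }
        }
        return sqrt(sum);
    }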
[Figure 1: Scaling results for the OpenMP implementation. Speedup S(p) = t(1)/t(p) and efficiency E(p) = S(p)/p for my implementation of the parallel problem, compared to Amdahl's Law (eq. ??) with f = 0.1, plotted against the number of processes p on a logarithmic axis from 10^0 to 10^2.]

In strong scaling, the metric of "speedup" is used. Speedup is defined as t(1)/t(p), where t(p) is the runtime on p processes. This value depends heavily on the fraction f of the program that is non-parallelizable. A simple but well-known model is Amdahl's Law, which relates t(p), f, and p:

t(p) = (1 − f) · t(1)/p + f · t(1).

Amdahl's Law is fit to the data using a minimisation from the scipy library. As shown in Figure 1, the speedup follows Amdahl's Law closely for low numbers of processes; however, it diverges as the number of processes approaches the problem size N. This divergence is likely due to each process having a small workload, making the concurrency overhead non-negligible.
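Dividing t(1) by the model above gives the speedup curve plotted in Figure 1, S(p) = t(1)/t(p) = 1/(f + (1 − f)/p). As p grows this saturates at 1/f, so with f = 0.1 the model predicts a maximum achievable speedup of 10, however many processes are used.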