CUDA MatrixMult
Design and implement C and CUDA programs called MatrixMult.c,
MatrixMultNaive.cu, and MatrixMultTiled.cu.
The MatrixMult.c, MatrixMultNaive.cu, and MatrixMultTiled.cu programs will take a
command-line argument giving the row/column length of two square matrices, fill
those matrices with random single-precision numbers (i.e., floats), and multiply
them. Each program will then output 1) the execution time in seconds, measured
from the beginning to the end of the function/kernel call only, and 2) the product
of the two matrices, with each element printed to a file called “product.dat” in a
tab-delimited, row/column format. The C program must implement the serial code via
a function call. The CUDA MatrixMultNaive.cu program must implement a parallel
code via a kernel call using only global memory, and the MatrixMultTiled.cu
program must implement a parallel code via a kernel call using tiles and shared
memory.
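
For concreteness, here is a minimal sketch of the serial function and the naive
global-memory kernel. The names matMulSerial and matMulNaive, the argument order,
and the row-major storage are assumptions of this sketch, not requirements of the
assignment.

    /* Serial CPU version: the classic triple loop over row-major arrays. */
    void matMulSerial(const float *A, const float *B, float *C, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;   /* element (i, j) of the product */
            }
    }

    /* Naive GPU version: one thread per output element, all reads from
       global memory. */
    __global__ void matMulNaive(const float *A, const float *B, float *C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {   /* guard: n need not divide the block size */
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }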
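
Likewise, a rough sketch of the tiled shared-memory kernel, under the same assumed
names and row-major layout. Zero-padding the partial tiles is one way to handle
sizes such as 1,021 that are not multiples of the tile width.

    #define TILE 16   /* benchmark with both 16 and 32 as required */

    __global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
        __shared__ float tileA[TILE][TILE];
        __shared__ float tileB[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        /* Sweep across the tiles that cover the k dimension. */
        for (int t = 0; t < (n + TILE - 1) / TILE; t++) {
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            /* Stage one tile of A and one tile of B in shared memory,
               zero-padding out-of-range elements. */
            tileA[threadIdx.y][threadIdx.x] =
                (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
            tileB[threadIdx.y][threadIdx.x] =
                (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
            __syncthreads();   /* tile fully loaded before anyone reads it */

            for (int k = 0; k < TILE; k++)
                sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
            __syncthreads();   /* done reading before the next load overwrites */
        }
        if (row < n && col < n)
            C[row * n + col] = sum;
    }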
You must check 1) that the appropriate number of command-line arguments was
supplied and 2) that the argument corresponds to a positive number (you may round
floating-point inputs). Appropriate error messages must be issued, followed by a
graceful exit.
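
One possible shape for these checks is sketched below; parseSize is a hypothetical
helper name, and the exact wording of the error messages is up to you.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Parse and validate the matrix-size argument, exiting gracefully
       on any error. strtod() accepts floating-point input, which is
       then rounded, as the assignment permits. */
    int parseSize(int argc, char *argv[]) {
        if (argc != 2) {
            fprintf(stderr, "Usage: %s <matrix_size>\n", argv[0]);
            exit(EXIT_FAILURE);
        }
        char *end;
        double value = strtod(argv[1], &end);
        long n = lround(value);
        if (end == argv[1] || *end != '\0' || n <= 0) {
            fprintf(stderr, "Error: matrix size must be a positive number.\n");
            exit(EXIT_FAILURE);
        }
        return (int)n;
    }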
To analyze the relative performance of the CPU and GPU implementations of your
matrix multiplication algorithms, you will measure the time it takes to complete a
matrix multiplication function/kernel call. For the GPU implementations, you will
show how block size affects performance by running them with tile sizes of
16×16 and 32×32 (you may leave either tile size in your final code). You will then
construct graphs (CPU vs. Naive and CPU vs. Tiled) of speedup vs. matrix size
(i.e., row/column length) for each tile size to see how performance varies with
system size and tile size. It is up to you to choose appropriate matrix sizes for
your graphs, but matrix sizes of 1,021 and 2,015 will be used for testing during
grading.
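
As one way to obtain the CPU-side numbers (an assumption, not a mandated method),
the serial call can be wrapped with clock_gettime; the speedup at each matrix size
is then the serial time divided by the corresponding kernel time.

    #include <time.h>

    /* Returns the elapsed wall-clock time, in seconds, of one serial
       multiply; matMulSerial is the function sketched above. */
    double timeSerial(const float *A, const float *B, float *C, int n) {
        struct timespec start, stop;
        clock_gettime(CLOCK_MONOTONIC, &start);
        matMulSerial(A, B, C, n);
        clock_gettime(CLOCK_MONOTONIC, &stop);
        return (stop.tv_sec - start.tv_sec)
             + (stop.tv_nsec - start.tv_nsec) / 1e9;
    }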
Hint: For accurate timing information in your GPU implementations, call the kernel
twice and time only the second call, because the first call absorbs GPU
initialization overhead that is unrelated to the algorithm.
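
Following this hint, the warm-up pattern might look like the sketch below, using
CUDA events; matMulNaive is the kernel sketched earlier, and timing only the
second launch excludes the one-time initialization cost.

    /* Launch once to warm up the GPU, then time a second, identical launch. */
    double timeNaiveKernel(const float *dA, const float *dB, float *dC,
                           int n, dim3 grid, dim3 block) {
        matMulNaive<<<grid, block>>>(dA, dB, dC, n);   /* warm-up launch */
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        matMulNaive<<<grid, block>>>(dA, dB, dC, n);   /* the timed launch */
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   /* reported in milliseconds */
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / 1000.0;                       /* convert to seconds */
    }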