MPCS 51087
Problem Set 4
Machine Learning for Image Classification
Winter 2023
1 Intro: Basic Curve Fitting with Gradient Descent
Milestone 1 due Sunday, March 5 @ 6pm: Prototype using a High-Level Language. Final Submission due Friday, March 10 @ 6pm.
1.1 Linear Fit
As a warm-up, consider minimization by gradient descent in a simpler context, where the function being minimized depends on just two parameters. This will guide your intuition on applying the same algorithm in a more complicated context in the next section. Consider the problem of fitting a line to a set of points in 2D space, where the points generally follow a linear trend but there is noise such that no straight line can pass through all points. We want to come up with a line of the form:
y = mx + b (1)
where m is the line’s slope and b is the line’s y-intercept. There are many different approaches to solving this problem. One method is to cast this as an optimization problem where we need to find “optimal” values for m and b, which has many analogues in machine learning.
To begin, we will need to come up with a way of defining what an “optimal” line looks like. For the first part of this assignment, we will use the mean squared error (MSE):
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2    (2)

where there are n 2D points in our data set, y_i is the y-coordinate of point i, and ŷ_i is the y-coordinate of our best-fit line given the x-coordinate at location i. That is, we are defining error as the average of the squared distance between the best-fit line and each of our data points. Given our curve-fit model of y = mx + b, we can write a final error function E in terms of the line coefficients as:

E(m, b) = (1/n) Σ_{i=1}^{n} (y_i − (m x_i + b))^2    (3)

We therefore want to find the coefficients m and b such that the error function E(m, b) is minimized. Recall from calculus that the minimum of a function is found at a point where the function’s gradient is zero (∇E = 0). So, to use gradient descent, we will want to be able to find the gradient of our error function as well. In this case, we can find the gradient analytically. Expanding the squared term,

E(m, b) = (1/n) Σ_{i=1}^{n} (y_i − (m x_i + b))^2    (4)
        = (1/n) Σ_{i=1}^{n} (y_i^2 + b^2 + 2 b m x_i − 2 b y_i + m^2 x_i^2 − 2 m x_i y_i)    (5)

and differentiating with respect to each coefficient gives

∇E = (∂E/∂m, ∂E/∂b)    (6)

∂E/∂m = (1/n) Σ_{i=1}^{n} (2 b x_i + 2 m x_i^2 − 2 x_i y_i)    (7)

∂E/∂b = (1/n) Σ_{i=1}^{n} (2 b + 2 m x_i − 2 y_i)    (8)

Taking inventory, we now have:
• a set of points we want to generate a linear fit for
• a way of defining how “good” the fit is (an error function) given line coefficients m and b
• a way of determining the gradient of our error function given coefficients m and b
Combined, these are all the elements we need to use the method of steepest descent (MSD), often referred to as gradient descent in the context of machine learning algorithms, shown in Algorithm 1. We start with some random guess for m and b, and compute the gradient (∂E/∂m, ∂E/∂b) using our analytical expressions. We then follow the gradient for some distance (using a “learning rate” α that we can choose via experimentation) and then repeat the process until we have arrived at values for m and b where the norm of our error function gradient is (nearly) zero – which indicates we have arrived at a minimum of the error function.
Algorithm 1 Gradient Descent Curve Fitting Pseudocode For Linear Fit
1: Input: n points, α learning rate, ε convergence criteria
2: Guess m and b
3: loop
4:   ∇m = 0    ▷ Gradient of m
5:   ∇b = 0    ▷ Gradient of b
6:   for each point i do
7:     ∇m = ∇m + 2 b x_i + 2 m x_i^2 − 2 x_i y_i
8:     ∇b = ∇b + (2 b + 2 m x_i − 2 y_i)
9:   end for
10:  ∇m = ∇m / n
11:  ∇b = ∇b / n
12:  m = m − α∇m
13:  b = b − α∇b
14:  if ∥(∇m, ∇b)∥ < ε then
15:    Convergence detected; m and b are optimal
16:  end if
17: end loop
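As a concrete illustration, the sketch below implements Algorithm 1 in C for a small synthetic data set. The data values, learning rate, and iteration cap are illustrative choices for this sketch, not part of the assignment.

#include <math.h>
#include <stdio.h>

/* Minimal sketch of Algorithm 1: gradient descent for y = m*x + b.
   Data arrays, learning rate, and iteration cap are illustrative. */
int main(void)
{
    /* Small synthetic data set roughly following y = 2x + 1 */
    double x[] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {1.1, 2.9, 5.2, 6.8, 9.1, 11.0};
    int n = 6;

    double m = 0.0, b = 0.0;   /* initial guess */
    double alpha = 0.01;       /* learning rate */
    double eps = 1.0e-8;       /* convergence criterion */

    for (long iter = 0; iter < 1000000; iter++) {
        double grad_m = 0.0, grad_b = 0.0;

        /* Accumulate the analytical gradient (Equations 7 and 8) */
        for (int i = 0; i < n; i++) {
            grad_m += 2.0*b*x[i] + 2.0*m*x[i]*x[i] - 2.0*x[i]*y[i];
            grad_b += 2.0*b + 2.0*m*x[i] - 2.0*y[i];
        }
        grad_m /= n;
        grad_b /= n;

        /* Step downhill */
        m -= alpha * grad_m;
        b -= alpha * grad_b;

        /* Convergence: norm of the gradient is (nearly) zero */
        if (sqrt(grad_m*grad_m + grad_b*grad_b) < eps) {
            printf("converged after %ld iterations\n", iter);
            break;
        }
    }
    printf("m = %f, b = %f\n", m, b);
    return 0;
}

In practice you would read the points from a file and experiment with α: too large a value makes the iteration diverge, while too small a value makes convergence slow.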
2 Building and Training a Neural Network for Rasterized Digit Classification
Figure 1: An example neural network with four densely connected layers. The first layer is known as the input layer and represents the predictive variables; the final (fourth) layer is the output layer and represents the model’s prediction. Neural networks can also include an arbitrary number of intermediate (hidden) layers to increase the complexity of the model.

Each incoming edge in the graph above is characterized by a decimal weight, representing the significance of the source node’s contribution to the activation of the target node. These weights are initially set randomly,
and the learning process seeks to optimally determine them by minimizing a cost function relative to a set of training data.
In addition to weights, each node (neuron) has a value, known as its activation. In layer 1, the activations are just the input values. In subsequent layers they are computed as linear combinations of the weighted activation values of all incoming edges, with some activation function, σ(x), applied to convert the output to a continuous value in some predefined range. Additionally, a bias is added to each weighted sum. Using index notation and numbering the layers 1, . . . , L, we denote the weights, bias, and activation values as:
• w^l_{jk}: weight for the connection from the kth neuron in layer l − 1 to the jth neuron in layer l.
• b^l_j: bias of the jth neuron in layer l.
• a^l_j: activation of the jth neuron in layer l.
2.1 Forward Propagation
The first step in training, known as forward propagation, takes a set of inputs in layer 1 and uses the weights, biases, and an activation function to compute the node values in the next layer, and so forth, until the output is determined for the given set of weights.
The node values (activations) in the lth layer, given the activation values in layer l − 1, are given by:

a^l_j = σ( Σ_k w^l_{jk} a^{l−1}_k + b^l_j )

or in matrix notation

a^l = σ(w^l a^{l−1} + b^l)    (9)

where the activation function σ(x) in our implementation will be the sigmoid function,

f(x) = 1 / (1 + e^{−x})

which maps values to between 0 and 1. Equation 9 is the basic algorithm for forward propagation. The training process starts with a given set of inputs and randomly chosen weights and biases, and uses forward propagation to calculate outputs. Once the outputs are calculated, a cost function is computed for each training input set and averaged across all sets.
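As a sketch of Equation 9, the routine below computes the activations of one dense layer on the CPU in C. The array names, the row-major weight layout, and the separate buffer holding the weighted inputs (kept for later use in backpropagation) are illustrative assumptions, not requirements of the assignment.

#include <math.h>

/* Sketch of forward propagation for one dense layer (Equation 9):
   a_out = sigmoid(W * a_in + b).
   W is stored row-major with n_out rows and n_in columns. */

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

void dense_forward(const double *W, const double *b, const double *a_in,
                   double *z_out, double *a_out, int n_in, int n_out)
{
    for (int j = 0; j < n_out; j++) {
        double z = b[j];                      /* bias b^l_j */
        for (int k = 0; k < n_in; k++)
            z += W[j * n_in + k] * a_in[k];   /* sum_k w^l_jk * a^{l-1}_k */
        z_out[j] = z;                         /* save weighted input for backprop */
        a_out[j] = sigmoid(z);                /* activation a^l_j */
    }
}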
2.2 Cost Function
In this example we use a quadratic cost function of the form

C = (1/(2n)) Σ_x ||y(x) − a^L(x)||^2    (10)
where n is the total number of training examples; the sum is over individual training examples, x; y = y(x) is the corresponding desired output (correct answer); L denotes the number of layers in the network; and a^L = a^L(x) is the vector of activations output from the network when x is input. In other words, for each set of training data x, we compute the squared norm of the “errors”: the vector difference between the correct output, y(x), and the predicted output, a^L(x). The average across n training sets then gives the value of C (the factor of 1/2 is for convenience in computing the derivative). Since C = C(w, b), we seek the values of w, b that minimize C across the training data set. This is done as an iterative process, using an efficient method called backpropagation to compute derivatives of C with respect to b and w.
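The cost in Equation 10 can be evaluated directly once the forward pass has produced the output activations. The routine below is a minimal C sketch, assuming the desired outputs and network outputs for the n examples are stored as flat row-major arrays (an illustrative layout, not a requirement).

#include <stddef.h>

/* Sketch of the quadratic cost (Equation 10) averaged over n training
   examples. Each row of `y` holds a desired output vector and each row of
   `aL` the corresponding network output, both of length `dim`. */
double quadratic_cost(const double *y, const double *aL, size_t n, size_t dim)
{
    double C = 0.0;
    for (size_t x = 0; x < n; x++) {
        double norm2 = 0.0;                    /* ||y(x) - aL(x)||^2 */
        for (size_t j = 0; j < dim; j++) {
            double d = y[x*dim + j] - aL[x*dim + j];
            norm2 += d * d;
        }
        C += norm2;
    }
    return C / (2.0 * n);                      /* factor 1/(2n) from Equation 10 */
}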
2.3 Backpropagation
Finding optimal w and b is done by using gradient descent, as discussed in Section 1. We select random values of (w, b) as a starting point, and then go “downhill” until we (hopefully) reach a global minimum.
For a neural network, computing derivatives efficiently is more complicated than the procedure described in Section 1. We need to evaluate the derivatives of the cost function with respect to the weights and biases, ∂C/∂w^l_{jk} and ∂C/∂b^l_j, and use these to update w^l_{jk} and b^l_j for the next iteration.
We start by defining the error at a node as:

δ^l_j = ∂C/∂z^l_j

where z^l_j is defined as the weighted input to neuron j in layer l – that is, the input before the activation function is applied:

z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j

Using the chain rule, the error in each node can then be written as:

δ^l_j = (∂C/∂a^l_j) σ′(z^l_j)

In matrix form this can be written more concisely as:

δ^l = ∇_a C ⊙ σ′(z^l)

where ∇_a represents the derivative with respect to a^l_j. For our choice of quadratic cost function, ∇_a C = (a^L − y), and so at the output layer

δ^L = (a^L − y) ⊙ σ′(z^L)

Then we can compute the error at any layer from that of the following layer, working backwards from layer L, as

δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)

This then allows us to compute the desired derivatives:

∂C/∂b^l_j = δ^l_j

∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j
To summarize, the equations for backpropagation are:
δ^L = ∇_a C ⊙ σ′(z^L)    (11)
δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ′(z^l)    (12)
∂C/∂b^l_j = δ^l_j    (13)
∂C/∂w^l_{jk} = a^{l−1}_k δ^l_j    (14)
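Equations 11–14 translate almost directly into code. The C sketch below assumes the quadratic cost with sigmoid activations, row-major weight storage, and the illustrative function and array names shown; it computes the output-layer error, propagates it back one layer, and accumulates the gradients for that layer.

#include <math.h>

static double sigmoid_prime(double z)
{
    double s = 1.0 / (1.0 + exp(-z));
    return s * (1.0 - s);          /* sigma'(z) = sigma(z)(1 - sigma(z)) */
}

/* Equation 11: output-layer error for the quadratic cost. */
void output_delta(const double *aL, const double *y, const double *zL,
                  double *deltaL, int n_out)
{
    for (int j = 0; j < n_out; j++)
        deltaL[j] = (aL[j] - y[j]) * sigmoid_prime(zL[j]);
}

/* Equation 12: propagate the error back one layer,
   delta^l = ((w^{l+1})^T delta^{l+1}) (.) sigma'(z^l).
   W_next is w^{l+1}, row-major with n_next rows and n_cur columns. */
void backprop_delta(const double *W_next, const double *delta_next,
                    const double *z_cur, double *delta_cur,
                    int n_next, int n_cur)
{
    for (int k = 0; k < n_cur; k++) {
        double sum = 0.0;
        for (int j = 0; j < n_next; j++)
            sum += W_next[j * n_cur + k] * delta_next[j];
        delta_cur[k] = sum * sigmoid_prime(z_cur[k]);
    }
}

/* Equations 13 and 14: accumulate gradients for layer l. */
void accumulate_gradients(const double *delta_cur, const double *a_prev,
                          double *grad_W, double *grad_b,
                          int n_cur, int n_prev)
{
    for (int j = 0; j < n_cur; j++) {
        grad_b[j] += delta_cur[j];                           /* dC/db^l_j = delta^l_j */
        for (int k = 0; k < n_prev; k++)
            grad_W[j * n_prev + k] += a_prev[k] * delta_cur[j];  /* dC/dw^l_jk */
    }
}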
2.4 Training Epochs and Pseudocode
Everything is now in place to train a neural network given a set of training data. One final approximation will make the overall process more computationally tractable. We use a variant of gradient descent called stochastic gradient descent (SGD). Rather than use all the training data to compute each step in the gradient descent algorithm, we randomly select a small batch of training inputs of size m. A unique batch is chosen for each step in the algorithm until all training data is exhausted; this is referred to as a training epoch, at which point the process begins again with new randomly chosen batches. The pseudocode below represents one epoch of training and assumes an outer loop in which each batch is chosen randomly from all of the input data.
1. Input a set of training examples
2. For each training example x: set the corresponding input activation a^{x,1} and perform the following steps:
• Feedforward: for each l = 2, 3, …, L compute z^{x,l} = w^l a^{x,l−1} + b^l and a^{x,l} = σ(z^{x,l}).
• Output error: compute δ^{x,L} = ∇_a C ⊙ σ′(z^{x,L}).
• Backpropagate: for each l = L − 1, L − 2, …, 2 compute δ^{x,l} = ((w^{l+1})^T δ^{x,l+1}) ⊙ σ′(z^{x,l}).
3. Gradient descent: for each l = L, L − 1, L − 2, …, 2 update the weights w^l → w^l − (α/m) Σ_x δ^{x,l} (a^{x,l−1})^T and the biases b^l → b^l − (α/m) Σ_x δ^{x,l} (a sketch of this update step in C follows this list).
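A minimal C sketch of the update in step 3 is shown below, assuming the gradients summed over the m samples of the batch have already been accumulated into grad_W and grad_b (for example, by gradient-accumulation code like the sketch following Equations 11–14); names and the flat row-major layout are illustrative.

/* Sketch of the mini-batch SGD update for one layer. */
void sgd_update(double *W, double *b,
                const double *grad_W, const double *grad_b,
                int n_cur, int n_prev, int m, double alpha)
{
    double scale = alpha / (double)m;   /* average over the batch, times the learning rate */

    for (int j = 0; j < n_cur; j++) {
        b[j] -= scale * grad_b[j];
        for (int k = 0; k < n_prev; k++)
            W[j * n_prev + k] -= scale * grad_W[j * n_prev + k];
    }
}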
2.5 Using Machine Learning to Identify Rasterized Digits
The MNIST database of handwritten digits (https://ossci-datasets.s3.amazonaws.com/mnist/) is perhaps the most widely used publicly-available image classification dataset in modern machine learning. It consists of 60,000 images in the training set and 10,000 images in the testing set, all grayscale and 28×28 pixels.
Using C/C++/Fortran and CUDA, implement a multilayer neural network from scratch and train it using stochastic gradient descent on the training dataset of the MNIST database of handwritten digits.
• The first layer (untrainable input layer) should consist of the 784 (= 28^2) pixels from a flattened 2D image.
• Then, there are an indeterminate number of intermediate hidden layers consisting of ni units each. Generally, these are sized differently (search Google for “bottleneck layer”, e.g.), but we will make them all the same size for this assignment.
• The last layer should consist of 10 units with sigmoid activations, roughly corresponding to the (un- normalized) predicted probabilities of each digit class.
• The cost will be based on the MSE loss, as discussed earlier. For each training sample, there is a corresponding ground-truth (“one-hot encoded”) vector y, which consists of a binary indicator for the correct class that is used in the MSE function. E.g., for an input image of a handwritten 6, y = (0,0,0,0,0,0,1,0,0,0).
At a minimum, your code should take as command line arguments: 1) nl, the number of dense (fully connected) linear layers in your NN (excluding the first and last layers), 2) nh, the number of units in each of the hidden layers, 3) ne, the number of training epochs, 4) nb, the number of training samples per batch. It should output to stdout the average cost function and grind rate (samples/second) of each batch.
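For reference, a minimal sketch of reading these arguments in C is shown below; the executable name, variable names, and the lack of input validation are illustrative only.

#include <stdio.h>
#include <stdlib.h>

/* Sketch of parsing the four required command line arguments. */
int main(int argc, char **argv)
{
    if (argc != 5) {
        fprintf(stderr, "usage: %s nl nh ne nb\n", argv[0]);
        return 1;
    }
    int nl = atoi(argv[1]);   /* number of hidden dense layers */
    int nh = atoi(argv[2]);   /* units per hidden layer */
    int ne = atoi(argv[3]);   /* number of training epochs */
    int nb = atoi(argv[4]);   /* training samples per batch */

    printf("layers=%d units=%d epochs=%d batch=%d\n", nl, nh, ne, nb);
    /* ... build the network, load MNIST, and train ... */
    return 0;
}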
For the output layer, replace the sigmoid activation function with the softmax activation:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j},    for i = 1, …, K and z = (z_1, …, z_K) ∈ R^K

where K = 10 here. Instead of the MSE loss, use the categorical cross-entropy loss function, where for a single sample:

C = − Σ_{c=1}^{K} y_c log(p_c)

where p_c = σ(z)_c are the output layer values.

Note that the backpropagation formulas for this output layer and loss will be different from those derived in Section 2.3 for the MSE loss with sigmoid activations. Derive (or look up) the gradient calculations for this network architecture. See the footnote for a good discussion of why cross-entropy is preferable to MSE loss for this classification problem.1
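A minimal C sketch of the softmax activation and the categorical cross-entropy loss for a single sample is shown below; the function names, the max-subtraction trick for numerical stability, and the small epsilon guarding log(0) are implementation choices for this sketch, not requirements.

#include <math.h>

/* Numerically stable softmax over K classes. */
void softmax(const double *z, double *p, int K)
{
    /* Subtract the max for numerical stability (does not change the result). */
    double zmax = z[0];
    for (int i = 1; i < K; i++)
        if (z[i] > zmax) zmax = z[i];

    double sum = 0.0;
    for (int i = 0; i < K; i++) {
        p[i] = exp(z[i] - zmax);
        sum += p[i];
    }
    for (int i = 0; i < K; i++)
        p[i] /= sum;
}

/* Cross-entropy loss C = -sum_c y_c log(p_c) for a one-hot label vector y. */
double cross_entropy(const double *y, const double *p, int K)
{
    double C = 0.0;
    for (int c = 0; c < K; c++)
        if (y[c] > 0.0)
            C -= y[c] * log(p[c] + 1e-12);   /* small epsilon guards log(0) */
    return C;
}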
Optional: use the cuDNN library to implement a convolutional neural network (CNN) of your choosing. Briefly compare the accuracy of this CNN model on the testing set to the model consisting only of fully connected layers. Use the TensorCore-enabled functions in the cuDNN library: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#tensor-ops-conv-functions. Note, the GPU nodes on Midway 3 all have Tensor Cores (Volta, Turing, Ampere generations). The Midway 2 GPU nodes do not have Tensor Cores.
Submission and Documentation
Your repository should include:
• A single PDF writeup containing all analysis and plots
• Your source code.
• Basic documentation on how to run your code (for instance, a README.txt in the same directory as your code). If you are using a compiled language, include a Makefile.
• If you are using a Jupyter-notebook-friendly language like Python or Julia (for the first section only), you are welcome to inline your writeup along with your source code into a single (nicely formatted) notebook and turn that in. Be sure to include a PDF printout of the notebook along with your .ipynb file! If submitting in notebook format, there is no need to include a README.
1 https://bit.ly/3wLUjvb