Deep Learning Midterm Exam

Please hand in the midterm via Sakai by 1:45pm on Wednesday, February 22nd. You are welcome to spend any amount of time working on the midterm before then. The midterm is “open-course-materials”: you are allowed to consult the textbook (d2l.ai), my lecture notes, class recordings, your own course notes, and past homeworks. You are welcome to check your answers by writing and running code, but this should not be necessary for any of the problems. No collaboration with other students or use of other sources is allowed.
Problem 1 (2 points)
In class, we introduced the weight decay regularizer, which adds a loss term $\lambda \|\theta\|_2^2$, where $\theta$ is a vector comprising all of the parameters of the model and $\lambda$ is a scalar hyperparameter. An alternative is L1 regularization, which adds the loss term $\lambda \|\theta\|_1$. Let the model's original, unregularized loss be $L$, so that the regularized loss is $L + \lambda \|\theta\|_1$.
1. What is the gradient of the L1 regularization term with respect to the parameters $\theta$? For simplicity, you can assume the model has only one parameter, so that $\theta$ is a scalar.
2. What is the parameter update under normal gradient descent for $L + \lambda \|\theta\|_1$? Since you don't know $L$, you can include a $\frac{\partial L}{\partial \theta}$ term in your answer.
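As the instructions note, answers may be sanity-checked by writing and running code, although this is not required. Below is a minimal sketch of such a check for Problem 1 using PyTorch autograd; the placeholder loss, the value of the parameter, and the hyperparameter values are illustrative assumptions, not part of the problem.

```python
# Hypothetical numerical check for Problem 1 (not part of the exam itself).
# Uses PyTorch autograd to inspect the gradient of an L1 penalty and to apply
# one gradient-descent step on L + lam * |theta|. All names and values
# (L, theta, lam, lr) are placeholders.
import torch

lam = 0.1   # regularization strength (lambda)
lr = 0.01   # learning rate

theta = torch.tensor(2.0, requires_grad=True)

# Stand-in for the model's unregularized loss; any differentiable function works.
L = (theta - 1.0) ** 2
reg_loss = L + lam * theta.abs()
reg_loss.backward()

# Autograd gradient of the regularized loss at theta = 2.0.
print(theta.grad)   # expected: dL/dtheta + lam * sign(theta) = 2.0 + 0.1

# One explicit gradient-descent step.
with torch.no_grad():
    theta_new = theta - lr * theta.grad
print(theta_new)
```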
Problem 2 (6 points)
Consider the following model and loss function:
$\hat{y} = w^\top \mathrm{ReLU}(Wx)$
$L(\hat{y}, y) = (\hat{y} - y)^2$
where $x$ is the input, $y \in \mathbb{R}$ is the scalar target, $W$ and $w$ are learnable parameters, and $\hat{y} \in \mathbb{R}$ is the scalar output of the model.
1. Derive the expressions for the gradient of the loss with respect to every parameter in the model (that is, $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial W}$).
2. Re-express the model as a computational graph. The graph can have the following operations: dot product, addition (not subtraction), elementwise maximum (ReLU), elementwise multiplication, and elementwise power (e.g. squaring). Your graph should have leaf nodes corresponding to parameters and inputs.
3. Specify which nodes in your graph have an output that needs to be cached in order to use the backpropagation algorithm.
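A minimal sketch for checking the Problem 2 gradients numerically, assuming the model reads $\hat{y} = w^\top \mathrm{ReLU}(Wx)$ with squared-error loss; the shapes, random values, and hand-derived expressions in the comments are illustrative assumptions to compare against your own derivation.

```python
# Hypothetical check for Problem 2, assuming the model is y_hat = w . ReLU(W x)
# with squared-error loss (y_hat - y)^2. Autograd gradients are compared
# against hand-derived expressions. Shapes and values are placeholders.
import torch

torch.manual_seed(0)
x = torch.randn(4)                           # input
y = torch.tensor(1.5)                        # scalar target
W = torch.randn(3, 4, requires_grad=True)    # first-layer parameters
w = torch.randn(3, requires_grad=True)       # output-layer parameters

h = torch.relu(W @ x)                        # hidden activations (cached for backprop)
y_hat = w @ h                                # scalar model output
loss = (y_hat - y) ** 2
loss.backward()

# Hand-derived gradients for comparison:
#   dL/dw = 2 (y_hat - y) * h
#   dL/dW = 2 (y_hat - y) * outer(w * 1[Wx > 0], x)
g = 2 * (y_hat - y)
mask = (W @ x > 0).float()
print(torch.allclose(w.grad, g * h))
print(torch.allclose(W.grad, torch.outer(g * w * mask, x)))
```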
Problem 3 (3 points)
Consider a convolutional neural network with the following structure:
• Input image of size 32x32x3 (32 width, 32 height, 3 channels)
• Convolutional layer with a 5×5 filter, 32 channels, stride of 1×1, and padding of 2×2
• ReLU nonlinearity
• Batch normalization
• Max pooling with a 2×2 window, stride of 2×2, and no padding
• Convolutional layer with a 3×3 filter, 64 channels, stride of 1×1, and padding of 1×1
• ReLU nonlinearity
• Batch normalization
• Max pooling with a 2×2 window, stride of 2×2, and no padding
• Convolutional layer with a 1×1 filter, 100 channels, stride of 1×1, and no padding
1. What is the output shape (dimensionality/size of the activations) at the output of these layers?
2. What is the receptive field of the input for one of the activations at the output of the model? Recall that the receptive field means the region of the input that influences a given activation.
3. Assume that we remove all padding from all convolutional layers in the model. What is the output shape of these layers now?
4. Now, instead assume that we remove the batch normalization and change the nonlinearities to be sigmoid instead of ReLU. What is the output shape of these layers now?
5. Say we want to use this model for 10-class classification, so we add a single classification layer (i.e. a softmax regression model) to the output. How many parameters will this classification layer have?
6. Imagine instead that we add a global average pooling layer on the output of the layer stack before the 10-way softmax classification layer. How many parameters will the classification layer have now?
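A minimal sketch for checking the shape questions in this problem by building the layer stack in PyTorch; the batch size of one is an illustrative assumption, and the padding arguments can be edited to explore parts 3 and 4.

```python
# Hypothetical shape check for Problem 3. Builds the described layer stack
# (with padding) and prints the activation shape after the last convolution.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2),
    nn.ReLU(),
    nn.BatchNorm2d(32),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(64),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 100, kernel_size=1, stride=1, padding=0),
)

x = torch.randn(1, 3, 32, 32)   # batch of one 32x32 RGB image
print(model(x).shape)           # activation shape at the output of the stack
```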

Problem 4 (3 points)
1. In the olden days of neural networks, it was popular to use the following nonlinearity:
(nonlinearity definition not shown)
Can you foresee any issue with using gradient descent to train a neural network that uses this nonlinearity?
2. An alternative to the ReLU nonlinearity is the softplus:
$f(x) = \log(1 + \exp(x))$
What is the gradient of this nonlinearity with respect to x? Name one advantage and one
disadvantage of this nonlinearity over ReLU.
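A minimal sketch for checking the softplus gradient numerically with autograd; the comparison against the logistic sigmoid is included only as a check of your own derivation and should not substitute for it.

```python
# Hypothetical check for Problem 4.2: compare the autograd gradient of
# softplus(x) = log(1 + exp(x)) against the logistic sigmoid.
import torch

x = torch.linspace(-5.0, 5.0, 11, requires_grad=True)
y = torch.log1p(torch.exp(x))   # softplus, written out explicitly
y.sum().backward()

print(torch.allclose(x.grad, torch.sigmoid(x)))   # d/dx log(1+exp(x)) = sigmoid(x)
```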
Problem 5 (4 points)
A less-popular alternative to batch and layer normalization is weight normalization, which replaces each weight vector $w$ in the model with $\frac{g}{\|v\|} v$, where $g$ is a new scalar parameter and $v$ is a new vector of parameters with the same shape as $w$.
1. Derive expressions for $\nabla_g L$ and $\nabla_v L$ in terms of $g$, $v$, and $\nabla_w L$.
2. Show that applying batch normalization without the shift parameter $\beta$ to the preactivation $w^\top x$ is equivalent to applying weight normalization if the entries of $x$ are independently distributed with zero mean and unit variance.
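A minimal sketch for checking the part 1 gradient expressions numerically, assuming the reparameterization $w = \frac{g}{\|v\|} v$; the placeholder loss, the dimensionality, and the hand-derived formulas in the comments are illustrative assumptions to compare against your own derivation.

```python
# Hypothetical check for Problem 5.1: compare autograd gradients of a loss
# through w = (g / ||v||) * v with hand-derived expressions in terms of g, v,
# and the gradient with respect to w. The loss here is an arbitrary placeholder.
import torch

torch.manual_seed(0)
g = torch.randn((), requires_grad=True)
v = torch.randn(5, requires_grad=True)

w = (g / v.norm()) * v                    # weight normalization reparameterization
loss = (w * torch.arange(5.0)).sum()      # arbitrary differentiable loss in w
loss.backward()

# Gradient of this placeholder loss with respect to w itself.
grad_w = torch.arange(5.0)

# Hand-derived expressions (assumed, for comparison):
#   dL/dg = (grad_w . v) / ||v||
#   dL/dv = (g / ||v||) * grad_w - (g * (grad_w . v) / ||v||^3) * v
norm = v.norm()
print(torch.allclose(g.grad, grad_w @ v / norm))
print(torch.allclose(v.grad, g / norm * grad_w - g * (grad_w @ v) / norm**3 * v))
```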
Problem 6 (4 points)
Consider the Adam optimizer and assume that $g_t = g \neq 0$, i.e. the gradient is the same value $g \neq 0$ at all iterations of training.
1. Assume that $\epsilon = 0$, $0 < \beta_1 < 1$, and $0 < \beta_2 < 1$, and show that $\frac{\hat{m}_t}{\sqrt{\hat{v}_t}} = \mathrm{sign}(g)$.
2. Now, consider the case when $\epsilon$ is large, e.g. $\epsilon = 1$. How does the expression for $\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$ behave now?
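A minimal sketch for simulating Adam's moment estimates with a constant gradient; the particular values of $g$, $\beta_1$, $\beta_2$, $\epsilon$, and the iteration count are illustrative assumptions.

```python
# Hypothetical simulation for Problem 6: run Adam's moment updates with a
# constant gradient g and epsilon = 0, and observe m_hat / (sqrt(v_hat) + eps)
# settle at sign(g). Setting eps = 1.0 instead illustrates part 2.
import math

g = -0.003              # constant gradient (any nonzero value)
beta1, beta2 = 0.9, 0.999
eps = 0.0               # try eps = 1.0 for part 2

m = v = 0.0
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)       # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)       # bias-corrected second moment
    step_direction = m_hat / (math.sqrt(v_hat) + eps)

print(step_direction)   # with eps = 0 this equals sign(g) = -1.0 at every step
```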