STATS HW3 solutions
REMINDER: MAKE A COPY OF THIS NOTEBOOK, DO NOT EDIT
Homework 3: In this homework, you will train a two-layer neural network using gradient descent. However, instead of computing the gradients manually, you will use the automatic differentiation (autodiff) provided by the TensorFlow package.
import numpy as np
import tensorflow as tf
Two-layer Neural Network
A two-layer neural network contains an input layer, a hidden layer, and an output layer. The number of input nodes is determined by the dimension of our features. We are free to choose the dimension of the hidden layer. For the final output layer, the number of nodes is determined by the type of problem we are solving. For a regression problem, we have a single node whose output value is our prediction of the target. For classification, we first output a vector whose dimension equals the number of classes in the dataset, and then apply the softmax transformation to turn these real-valued predictions into class probabilities.
Mathematically, this model can be written as
$$f(x) = \sigma(x^{\intercal} W_1 +b_1) W_2 + b_2. $$
Note that if $x \in \mathbb{R}^{d \times 1}$, then $W_1 \in \mathbb{R}^{d \times H}$ and $W_2 \in \mathbb{R}^{H \times O}$, where $H$ is the hidden dimension and $O$ is the output dimension. The dimensions of $b_1$ and $b_2$ follow accordingly.
Given an input $x$, the model first transforms it using the weight matrix $W_1$ and then shifts the result by the bias term $b_1$. The function $\sigma(\cdot)$ is called the activation function and introduces non-linearity into the model. For the purpose of this homework, we will use the so-called ReLU activation function, defined as $\sigma(t) = \max\{t, 0\}$. Note that $x^{\intercal} W_1 +b_1$ is generally a vector, so the ReLU activation is applied to each element of the vector. The resulting vector $h(x) =\sigma(x^{\intercal} W_1 +b_1)$ defines the hidden layer. Finally, we apply the linear transformation $h(x) W_2 + b_2$ to the hidden layer.
This representation of the network is very convenient if you instead want to apply matrix operations to your data. Let $X$ be your data matrix whose $i^{th}$ row contains $x_i^{\intercal}$; then the output of the network on the entire dataset can be written as
$$f(X) = \sigma(X W_1 +b_1) W_2 + b_2. $$
Regression
In Homework 2, you trained a linear regression model on the California housing dataset. In this homework, you will train a two-layer neural network on this dataset using gradient descent.
# load the dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
california_housing = fetch_california_housing( return_X_y=True, as_frame=True)
X = california_housing[0]
y = california_housing[1]
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
sc=StandardScaler()
X_train=sc.fit_transform(X_train_unscaled)
X_test = sc.transform(X_test_unscaled)
# convert numpy arrays to tensors
X_train = tf.convert_to_tensor(X_train, dtype = tf.float32)
X_test = tf.convert_to_tensor(X_test, dtype = tf.float32)
y_train = tf.convert_to_tensor(y_train, dtype = tf.float32)
y_test= tf.convert_to_tensor(y_test, dtype = tf.float32)
Question 1: Fill in the input dimension and output dimension of the two layer neural network for regression on this dataset. (5 pts)
For the purpose of this homework, we will simply choose a hidden layer whose dimension is double the input dimension. Going forward, however, choosing the hidden dimension appropriately is an important part of deep learning.
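For example, on this dataset the dimensions could be set as follows (a sketch; the names input_dim, hidden_dim, and output_dim are the ones assumed in the code below):
input_dim = X_train.shape[1]    # 8 features in the California housing data
output_dim = 1                  # a single regression target
hidden_dim = 2 * input_dim      # hidden layer twice the input dimension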
Question 2: Define TensorFlow variables for the weights W1, b1, W2, and b2. Then initialize both biases b1 and b2 to be 0 vectors, and initialize W1 and W2 by picking values uniformly at random from the interval [0,1]. (5 pts)
W1 = tf.Variable(initial_value=tf.random.uniform(shape=(input_dim, hidden_dim)))
b1 = tf.Variable(initial_value=tf.zeros(shape=(hidden_dim,)))
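# W2 and b2 are defined analogously (output_dim as in the sketch above)
W2 = tf.Variable(initial_value=tf.random.uniform(shape=(hidden_dim, output_dim)))
b2 = tf.Variable(initial_value=tf.zeros(shape=(output_dim,)))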
Question 3: Complete the model function below to define a two-layer neural network. Here, inputs is a matrix of shape $n \times d$ whose $i^{th}$ row contains $x_i^{\intercal}$. (10 pts)
Hint: Use tf.nn.relu() function for relu activation.
# Replace "________" with your code
def model(inputs):
    hidden = tf.nn.relu(tf.matmul(inputs, W1) + b1)   # hidden layer with relu-activation
    # squeeze so the predictions have shape (n,), matching the shape of the targets
    return tf.squeeze(tf.matmul(hidden, W2) + b2, axis=1)
Question 4: Write a function below that computes the mean squared error given the predictions and targets. You should use TensorFlow operations such as tf.square and tf.reduce_mean so that autodiff works. (10 pts)
def mse(predictions, targets):
squared_error = tf.square(predictions-targets)
return tf.reduce_mean(squared_error)
Question 5: Complete the function below that takes in features and targets and trains your two-layer neural network using gradient descent. Please use autodiff via gradient taping (tf.GradientTape). (10 pts)
learning_rate = 0.01
def train(inputs, targets):
with tf.GradientTape() as tape:
predictions = model(inputs) #get predictions
loss = mse(predictions, targets) #compute loss
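    # one possible completion: compute gradients of the loss with respect to the
    # parameters via autodiff, then take one gradient-descent step on each parameter
    gradients = tape.gradient(loss, [W1, b1, W2, b2])
    for param, grad in zip([W1, b1, W2, b2], gradients):
        param.assign_sub(learning_rate * grad)
    return float(loss)  # return the loss as a Python float for easy printing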
Question 6: Complete the routine below that divides the randomly shuffled data into minibatches of size 500 and uses the train function above to run gradient descent on those minibatches. Within each step, you should make a complete pass through the dataset. (10 pts)
B = 500                 # minibatch size
n = X_train.shape[0]    # number of training examples
k = int(n/B)            # number of minibatches per pass through the data
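One possible completion of the minibatch routine (a sketch; the 100 passes through the data are an assumption chosen to match the printout below):
for step in range(100):
    # shuffle the data, then split it into k minibatches of size B
    idx = tf.random.shuffle(tf.range(n))
    X_shuffled = tf.gather(X_train, idx)
    y_shuffled = tf.gather(y_train, idx)
    for i in range(k):
        loss = train(X_shuffled[i*B:(i+1)*B], y_shuffled[i*B:(i+1)*B])
    if (step + 1) % 10 == 0:
        print(f"Loss at step {step}: {loss:.4f}")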
Loss at step 9: 1.4963
Loss at step 19: 1.4131
Loss at step 29: 1.3795
Loss at step 39: 1.3641
Loss at step 49: 1.3561
Loss at step 59: 1.3517
Loss at step 69: 1.3485
Loss at step 79: 1.3471
Loss at step 89: 1.3458
Loss at step 99: 1.3447
Question 7: Get predictions on your test dataset and report the MSE loss on the test data. (5 pts)
predictions_test = model(X_test)
mse(predictions_test, y_test)
Multiclass Classification
Next, you will train the two-layer neural network to do multiclass classification on the digits dataset. The digits data is similar to MNIST but has even smaller images (8×8 pixels).
from sklearn.datasets import load_digits
X,y_int = load_digits(return_X_y=True)
In HW2, you manually created a one-hot encoding of the multiclass targets. Here, we can use a function from Keras, a high-level deep learning API that works well with TensorFlow.
from tensorflow.keras.utils import to_categorical
y_one_hot = to_categorical(y_int, num_classes=10)
# training-testing split and appropriate rescaling
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y_one_hot, test_size=0.3, random_state=42)
sc=StandardScaler()
X_train=sc.fit_transform(X_train_unscaled)
X_test = sc.transform(X_test_unscaled)
#convert to tensors
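# the remaining arrays are converted the same way as in the regression section
X_train = tf.convert_to_tensor(X_train, dtype = tf.float32)
X_test = tf.convert_to_tensor(X_test, dtype = tf.float32)
y_train = tf.convert_to_tensor(y_train, dtype = tf.float32)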
y_test= tf.convert_to_tensor(y_test, dtype = tf.float32)
Question 8: Fill in the input dimension and output dimension of the two-layer neural network appropriate for this dataset. (5 pts)
input_dim = X_train.shape[1]
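# (assumed completion) the output dimension equals the number of classes,
# and the hidden dimension is again chosen to be double the input dimension
output_dim = y_train.shape[1]   # 10 classes in the digits dataset
hidden_dim = 2 * input_dim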
Question 9: Define TensorFlow variables for the weights W1, b1, W2, and b2. Then initialize both biases b1 and b2 to be 0 and initialize W1 and W2 by picking values uniformly at random from the interval [0, 0.1]. (5 pts)
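A minimal sketch of one way to define these variables, assuming the dimension names from the previous cell:
W1 = tf.Variable(initial_value=tf.random.uniform(shape=(input_dim, hidden_dim), maxval=0.1))
b1 = tf.Variable(initial_value=tf.zeros(shape=(hidden_dim,)))
W2 = tf.Variable(initial_value=tf.random.uniform(shape=(hidden_dim, output_dim), maxval=0.1))
b2 = tf.Variable(initial_value=tf.zeros(shape=(output_dim,)))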
Question 10: Complete the classifier function below to define a two-layer neural network that outputs class probabilities for each class. Here, inputs is a matrix of shape $n \times d$ whose $i^{th}$ row contains $x_i^{\intercal}$. (10 pts)
Hint: Use the tf.nn.softmax() function to compute softmax class probabilities.
def classifier(inputs):
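    # a sketch: hidden layer with relu activation, then softmax class probabilities
    hidden = tf.nn.relu(tf.matmul(inputs, W1) + b1)
    logits = tf.matmul(hidden, W2) + b2
    return tf.nn.softmax(logits)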
Question 11: Complete the function below to compute the cross-entropy loss, given the one-hot encoding of the targets and the predicted class probabilities for each example. (10 pts)
def cross_entropy(predictions, targets):
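    # a sketch: average of -sum_k targets_k * log(predictions_k) over the examples;
    # the small constant 1e-9 (an assumption) guards against log(0)
    return tf.reduce_mean(-tf.reduce_sum(targets * tf.math.log(predictions + 1e-9), axis=1))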
Question 12: Complete the function below to train the neural network classifier using gradient descent. (5 pts)
learning_rate = 0.1
def train(inputs, targets):
with tf.GradientTape() as tape:
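        # a sketch mirroring the regression version
        predictions = classifier(inputs)            # class probabilities
        loss = cross_entropy(predictions, targets)  # cross-entropy loss
    gradients = tape.gradient(loss, [W1, b1, W2, b2])
    for param, grad in zip([W1, b1, W2, b2], gradients):
        param.assign_sub(learning_rate * grad)
    return float(loss)  # Python float so the printout below formats cleanly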
Train the model using 500 GD iterations.
for step in range(500):
loss = train(X_train, y_train)
if (step +1)%10==0:
print(f"Loss at step {step}: {loss:.4f}")
Loss at step 9: 2.2013
Loss at step 19: 2.0655
Loss at step 29: 1.8907
Loss at step 39: 1.6744
Loss at step 49: 1.4333
Loss at step 59: 1.1944
Loss at step 69: 0.9812
Loss at step 79: 0.8036
Loss at step 89: 0.6634
Loss at step 99: 0.5572
Loss at step 109: 0.4775
Loss at step 119: 0.4170
Loss at step 129: 0.3699
Loss at step 139: 0.3324
Loss at step 149: 0.3017
Loss at step 159: 0.2760
Loss at step 169: 0.2544
Loss at step 179: 0.2357
Loss at step 189: 0.2194
Loss at step 199: 0.2053
Loss at step 209: 0.1928
Loss at step 219: 0.1818
Loss at step 229: 0.1720
Loss at step 239: 0.1631
Loss at step 249: 0.1552
Loss at step 259: 0.1479
Loss at step 269: 0.1412
Loss at step 279: 0.1352
Loss at step 289: 0.1296
Loss at step 299: 0.1244
Loss at step 309: 0.1196
Loss at step 319: 0.1151
Loss at step 329: 0.1110
Loss at step 339: 0.1071
Loss at step 349: 0.1035
Loss at step 359: 0.1000
Loss at step 369: 0.0968
Loss at step 379: 0.0937
Loss at step 389: 0.0908
Loss at step 399: 0.0880
Loss at step 409: 0.0854
Loss at step 419: 0.0829
Loss at step 429: 0.0805
Loss at step 439: 0.0783
Loss at step 449: 0.0761
Loss at step 459: 0.0740
Loss at step 469: 0.0721
Loss at step 479: 0.0702
Loss at step 489: 0.0684
Loss at step 499: 0.0667
Suppose that, given the vector of class probabilities, you output the label with the highest class probability as your prediction. To evaluate the model, we want to compute how often it makes a mistake. For instance, if $y$ is the true label and $\widehat{y}$ is the model's prediction, we evaluate the model on this point with the 0-1 loss
$$\mathbb{1}( \widehat{y} \neq y ).$$
Over $n$ points, we will compute the mean 0-1 loss,
$$\frac{1}{n}\sum_{i=1}^n \mathbb{1}( \widehat{y}_i \neq y_i ).$$
Question 13: Compute the mean 0-1 loss of your classifier on the test data. (5 pts)
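One way to compute this (a sketch; tf.argmax recovers the predicted label from each probability vector and the true label from each one-hot target):
predicted_labels = tf.argmax(classifier(X_test), axis=1)
true_labels = tf.argmax(y_test, axis=1)
mean_01_loss = tf.reduce_mean(tf.cast(tf.not_equal(predicted_labels, true_labels), tf.float32))
print(float(mean_01_loss))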
0.03333336114883423