CS 189 / 289A Introduction to Machine Learning
Fall 2022 Jennifer Listgarten, Jitendra Malik HW4
Due Tuesday 11/1 at 11:59pm

• Homework 4 consists of written and programming questions.

• We prefer that you typeset your answers using LATEX or other word processing software.
If you haven’t yet learned LATEX, one of the crown jewels of computer science, now is a
good time! Neatly handwritten and scanned solutions will also be accepted for the written
questions.

• In all of the questions, show your work, not just the final answer.

• You have slightly over a week to complete this homework.

Deliverables:

Submit a PDF of your homework to the Gradescope assignment entitled “HW4 Write-Up”. Please
start each question on a new page. If there are graphs, include those graphs in the correct
sections. Do not put them in an appendix. We need each solution to be self-contained on pages of
its own.

• In your write-up, please state with whom you worked on the homework. If you worked by
yourself, state you worked by yourself. This should be on its own page and should be the
first page that you submit.

• In your write-up, please copy the following statement and sign your signature underneath. If
you are using LaTeX, you must type your full name underneath instead. We want to make it
extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my
own words and that I have not looked at another student’s solutions. I have given credit to
all external sources I consulted.”

• Replicate all your code in an appendix. Begin the code for each coding question on a fresh
page. Do not put code from multiple questions on the same page. When you upload this PDF
to Gradescope, make sure that you assign the relevant pages of your code from the appendix to
the correct questions.


1 Eigenfaces (11 points)

In this question we will perform and analyze PCA on a dataset consisting of images of faces. Each
datapoint, x, consists of a flattened 62 × 47 pixel image (i.e. x ∈ R2914). The dataset can be
downloaded using the sklearn library as follows:

import sklearn.datasets

dataset = sklearn.datasets.fetch_lfw_people()

X = dataset['data']

This may take a few minutes to run. The files total roughly 250 MB, so be sure that you have
enough space on your computer to store the images. The sklearn library should only be used for
downloading the data. For the rest of the problem you may use matplotlib, numpy.*,
numpy.linalg.eig, and numpy.linalg.svd.

(a) (2 points) Plot the first 20 images to get familiar with the dataset.

Note: when plotting the images, be sure to reshape them to be a matrix of size 62 × 47. Images
can be plotted with matplotlib.pyplot.imshow. The argument cmap=matplotlib.pyplot.cm.gray
provides the best colormap to view the images.
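For reference, a minimal sketch of one way to display the images (this assumes X was loaded as
above; it is illustrative, not a required implementation):

import matplotlib.pyplot as plt

# Display the first 20 faces in a 4 x 5 grid; each row of X is a flattened 62 x 47 image.
fig, axes = plt.subplots(4, 5, figsize=(10, 10))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(X[i].reshape(62, 47), cmap=plt.cm.gray)
    ax.axis('off')
plt.show()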

(b) (1 point) Recall that in order to perform PCA, we must first center our data. Compute the
average face of the dataset, center the data, and plot the average face.
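A possible sketch of the centering step, reusing the imports above (variable names are
illustrative):

import numpy as np

mean_face = X.mean(axis=0)        # average over all rows of X, shape (2914,)
X_centered = X - mean_face        # broadcasting subtracts the mean face from every image
plt.imshow(mean_face.reshape(62, 47), cmap=plt.cm.gray)
plt.show()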

(c) (3 points) Perform PCA on the dataset. Plot the first 20 images reconstructed after being
projected onto the top 10 principal component directions (PCDs). Do the same after projecting
onto the top 100 PCDs and the top 1000 PCDs.
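One possible sketch of the projection-and-reconstruction step using numpy.linalg.svd, assuming
X_centered and mean_face from part (b) (names are illustrative):

# Economy-size SVD of the centered data; the rows of Vt are the principal component directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

def reconstruct(X_centered, Vt, k, mean_face):
    # Project onto the top-k PCDs, map back to pixel space, then add the mean face back in.
    Vk = Vt[:k].T                        # shape (2914, k)
    return X_centered @ Vk @ Vk.T + mean_face

X_rec = reconstruct(X_centered, Vt, 10, mean_face)
# Plot X_rec[:20] exactly as in part (a); repeat with k = 100 and k = 1000.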

(d) (1 point) For this dataset, we refer to the PCDs as “eigenfaces”. Plot the top 20 eigenfaces.

(e) (2 points) Recall from lecture that we can compute the percent variance explained by a certain
number of principal components by using the eigenvalues of the covariance matrix. Specifically,
letting Σ = (1/n) X̃⊤X̃, where X̃ ∈ Rn×d is the mean-centered data matrix and λ1 ≥ λ2 ≥
· · · ≥ λd ≥ 0 are the eigenvalues of Σ, λi is the variance of the ith principal component. The
percent variance explained by the first k principal components is

(λ1 + · · · + λk) / (λ1 + · · · + λd).

Note that λi = σi², where σi is the ith singular value of (1/√n) X̃.

Plot the percent variance explained as a function of k and determine how many principal
components are needed to explain 95% of the variance.
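A possible sketch of this computation, reusing S and X_centered from the earlier sketches
(names are illustrative):

n = X_centered.shape[0]
eigvals = S**2 / n                                 # lambda_i, already in decreasing order
explained = np.cumsum(eigvals) / np.sum(eigvals)   # fraction explained by the first k PCs

plt.plot(np.arange(1, len(explained) + 1), explained)
plt.xlabel('k')
plt.ylabel('fraction of variance explained')
plt.show()

k_95 = int(np.argmax(explained >= 0.95)) + 1       # smallest k explaining at least 95%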

(f) (2 points) Use the first 80% of the dataset as your training set and the remaining 20% as the test
set. We will use the training set to compute the PCDs and we will evaluate our reconstruction
loss on both the training and test set. The reconstruction loss is defined as

(1/(nd)) ∥X̃VkV⊤k − X̃∥²F,

where ∥·∥F denotes the Frobenius norm, Vk contains the k principal component directions, and n
and d are the dimensions of X̃.
For the following number of PCDs, [10, 20, 50, 100, 500, 1000, 2914], perform PCA using the
training set and compute the average reconstruction loss for both the training and test set. Plot
the error for both the training and test set as a function of the number of PCDs.
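A minimal sketch of this computation, under the assumption that the loss is the averaged squared
Frobenius norm above and that the test set is centered with the training mean (names are
illustrative):

n = X.shape[0]
split = int(0.8 * n)
X_tr, X_te = X[:split], X[split:]

mean_tr = X_tr.mean(axis=0)
Xc_tr, Xc_te = X_tr - mean_tr, X_te - mean_tr   # center both sets with the training mean

_, _, Vt = np.linalg.svd(Xc_tr, full_matrices=False)

def recon_loss(Xc, Vt, k):
    Vk = Vt[:k].T
    diff = Xc @ Vk @ Vk.T - Xc
    return (diff**2).sum() / diff.size          # average squared Frobenius norm

for k in [10, 20, 50, 100, 500, 1000, 2914]:
    print(k, recon_loss(Xc_tr, Vt, k), recon_loss(Xc_te, Vt, k))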


2 Running Time of k-Nearest Neighbor and Kernelized Nearest Neighbor
Search Methods (6 points)

The method of k-nearest neighbors is a fundamental conceptual building block of machine learn-
ing. A classic example is the k-nearest neighbor classifier, which is a non-parametric classifier
that finds the k closest examples in the training set to the test example, and then outputs the
most common label among them as its prediction. Generating predictions using this classifier re-
quires an algorithm to find the k closest examples in a possibly large and high-dimensional dataset,
which is known as the k-nearest neighbor search problem. More precisely, given a set of n points,
D = {x1 . . . , xn} ⊆ Rd and a test point q ∈ Rd, there are 2 steps to decide what class to predict:

1. Find the k training points nearest q.

2. Return the class with the most votes from the k training points.

(a) (2 points) We can find the k nearest neighbors of q using the following procedure:

(i) Insert the first k training points into a max heap, ordered by distance from q. The top
(root) of the heap will be the furthest training point seen from q so far.

(ii) For each remaining training point, if the distance to q exceeds the largest distance in the
max-heap, skip it. Otherwise, pop off the root of the max-heap and insert the new training
point into the heap.

What is the runtime to classify a newly given test point q using Euclidean distance and the
above procedure to find its k-nearest neighbors? You may express your answer in big-O notation.
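For concreteness, a sketch of the heap procedure described above, using Python's heapq (a
min-heap, so distances are negated to simulate a max-heap); this is illustrative, not a required
implementation:

import heapq
import numpy as np

def knn_labels(D, labels, q, k):
    # D is an (n, d) array of training points; return the labels of the k points nearest q.
    # A min-heap over negated distances acts as a max-heap keyed by distance to q.
    heap = [(-np.linalg.norm(x - q), i) for i, x in enumerate(D[:k])]
    heapq.heapify(heap)
    for i in range(k, len(D)):
        d = np.linalg.norm(D[i] - q)
        if d < -heap[0][0]:                 # closer than the current k-th nearest
            heapq.heapreplace(heap, (-d, i))
    return [labels[i] for _, i in heap]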

(b) (1 point) Let us now ‘lift’ the points into a higher dimensional space by adding all possible
monomials of the original features with degree at most p. For example, for d = 2 and p = 3,
we lift a point [x, y]⊤ to the point [1, x, y, x2, xy, y2, x3, x2y, xy2, y3]⊤. What dimension would
this space be? What would the new runtime for classification of a test point q be, in terms of
n, d, k, and p? You may express your answers in big-O notation.
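As an illustration of the lifting (not required for the answer), a sketch that enumerates all
monomials of total degree at most p for a point in Rd:

from itertools import combinations_with_replacement
import numpy as np

def lift(x, p):
    # All monomials of the entries of x with total degree at most p, including the constant 1.
    feats = []
    for deg in range(p + 1):
        for idx in combinations_with_replacement(range(len(x)), deg):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

# Example from the text: d = 2, p = 3 gives [1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3].
print(lift(np.array([2.0, 3.0]), 3))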

(c) (3 points) Decades of research have focused on devising a way of preprocessing the data so
that the k-nearest neighbors for each query can be found efficiently. “Efficient” means the time
complexity of finding the k-nearest neighbors is lower than that of the naïve exhaustive search
algorithm—meaning that the complexity must be sublinear in n.

Many efficient algorithms for k-nearest neighbor search rely on a divide-and-conquer strategy
known as space partitioning. The idea is to divide the feature space into cells and maintain
a data structure that keeps track of the points that lie in each. Then, to find the k-nearest
neighbors of a query, these algorithms look up the cell that contains the query and obtain the
subset of points in D that lie in the cell and adjacent cells. Adjacent cells must be included in
case the query point is in the corner of its cell. Then, exhaustive search is performed on this
subset to find the k points that are the closest to the query.


For simplicity, we’ll consider the special case of k = 1 in the following questions, but note that
this can be easily extended to the setting with arbitrary k. We first consider a simple partitioning
scheme, where we place a Cartesian grid (a rectangular grid consisting of hypercubes) over the
feature space.

Figure 1: Illustration of the space partitioning scheme we consider. The data points are shown as blue circles
and the query is shown as the red square. The cell boundaries are shown as gold lines.
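A minimal sketch of how such a grid index might be built and queried, assuming hypercube cells
of side length h and a nonempty neighborhood around the query (all names here are illustrative,
not part of the assignment):

from collections import defaultdict
import itertools
import numpy as np

def build_grid(D, h):
    # Map each cell (a tuple of integer coordinates) to the indices of the points inside it.
    grid = defaultdict(list)
    for i, x in enumerate(D):
        grid[tuple(np.floor(x / h).astype(int))].append(i)
    return grid

def query_1nn(grid, D, q, h):
    # Exhaustive search restricted to the query's cell and all adjacent cells.
    cell = tuple(np.floor(q / h).astype(int))
    candidates = []
    for offset in itertools.product([-1, 0, 1], repeat=len(cell)):
        candidates += grid.get(tuple(c + o for c, o in zip(cell, offset)), [])
    return min(candidates, key=lambda i: np.linalg.norm(D[i] - q))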

If we search over a cell and its neighboring cells, how many cells do we search if the data
points are one-dimensional? Two-dimensional? d-dimensional? If each cell contains one
data point, what is the time complexity for finding the 1-nearest neighbor in terms of d,
assuming accessing any cell takes constant time?


3 Entropy and Bagging (13 points)

This problem helps build intuition behind the core techniques that make random forests work:
entropy and bagging. We will first clarify the concepts of surprise and entropy. Recall that entropy
is one of the standard criteria used to decide how to split nodes in a decision tree until we reach
a certain level of homogeneity.

(a) (1 point) Suppose you have a bag of balls, all of which are black. How surprised are you if you
take out a black ball?

(b) (1 point) With the same bag of balls, how surprised are you if you take out a white ball?

(c) (1 point) Now we have 10 balls in the bag, each of which is black or white. Under what color
distribution(s) is the entropy of the bag minimized? And under what color distribution(s) is the
entropy maximized? Calculate the entropy in each case.
Recall: letting pB denote the fraction of balls that are black, the entropy of a random draw
from the bag is

Hb(pB) = −pB log pB − (1 − pB) log(1 − pB),

where Hb(p) denotes the entropy of a Bernoulli(p) distribution.
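For a quick numerical check of this formula, a tiny helper (illustrative only; it uses log base 2
and the convention 0 log 0 = 0):

import numpy as np

def bernoulli_entropy(p):
    # H_b(p) = -p log2(p) - (1 - p) log2(1 - p), with 0 log 0 treated as 0.
    p = np.clip(np.asarray(p, dtype=float), 0.0, 1.0)
    with np.errstate(divide='ignore', invalid='ignore'):
        h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    return np.nan_to_num(h)

print(bernoulli_entropy([0.0, 0.5, 1.0]))   # -> [0., 1., 0.]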

(d) (1 point) Draw the graph of the entropy Hb(pC) when there are only two classes C and D, with
pD = 1 − pC. Is the entropy function strictly concave, concave, strictly convex, or convex?

(e) (2 points) Suppose in a binary classification decision tree node, P(Y = 1), the proportion of
data points labeled 1, is equal to p. We split the points in the node based on whether the jth
feature is greater than v. Defining Xj,v = 1{Xj < v}, this results in two child nodes with label
proportions P(Y = 1 | Xj,v = 1) = q1 and P(Y = 1 | Xj,v = 0) = q2. The information gain is
defined as the decrease in entropy: I(Xj,v; Y) = H(Y) − H(Y | Xj,v), where H(Y) denotes the
entropy of Y,

H(Y) = −∑i∈{0,1} P(Y = i) log P(Y = i) = Hb(P(Y = 1)),

and H(Y | Xj,v) denotes the conditional entropy of Y given Xj,v,

H(Y | Xj,v) = ∑i∈{0,1} P(Xj,v = i) Hb(P(Y = 1 | Xj,v = i)).

Show that p is a weighted average of q1 and q2, i.e. p = λq1 + (1 − λ)q2 for some λ ∈ [0, 1], and
that the information gain of the split is positive if both child nodes are nonempty and q1 ≠ q2.

(f) (2 points) Ensemble Learning – the motivation of averaging. Ensemble learning is a general
technique to combat overfitting, by combining the predictions of many varied models into a single
prediction based on their average or majority vote. Consider a set of uncorrelated random
variables Y1, . . . , Yn with mean µ and variance σ². Calculate the expectation and variance of
their average. (In the context of ensemble methods, these Yi are analogous to the prediction made
by classifier i.)

(g) (3 points) Ensemble Learning – Bagging. In lecture, we covered bagging (Bootstrap
AGGregatING). Bagging is a randomized method for creating many different learners from the same
dataset. Given a training set of size n, generate B random subsamples of size n′ by sampling with
replacement. Within an individual subsample, some points may be chosen multiple times, while some
may not be chosen at all. For large n, if n′ = n, around 63% of the points are chosen, and the
remaining 37% are called out-of-bag (OOB) samples.

(i) (2 points) Show that when n ≥ 25, the probability a point is not chosen for a particular
subsample is in [0.36, 1/e], where 1/e ≈ 0.368.

(ii) (1 point) If we use bagging to train our model, what is a practical way to choose the
hyperparameter B? Typically in random forests, a few hundred to several thousand trees are
used, depending on the size and nature of the training set.

(h) (2 points) In part (f), we saw that averaging reduces variance for uncorrelated classifiers.
Real-world predictions will of course not be completely uncorrelated, but reducing correlation
will generally reduce the final variance. Reconsider a set of correlated random variables
Z1, . . . , Zn, each with mean µ and variance σ² as in part (f). Suppose Corr(Zi, Zj) = ρ for all
i ≠ j. Calculate the variance of their average.

4 Machine Learning on the Diabetes Dataset (20 points)

In this question, you will apply the following machine learning techniques to a dataset involving
diabetes patients: t-SNE, PCA, ordinary least squares (OLS), ridge regression, and k-nearest
neighbors regression. Fetch the dataset from the sklearn library using

diabetes = sklearn.datasets.load_diabetes()

The data matrix X, labels y, and a dataset description can be accessed under the data, target,
and DESCR keys. This dataset contains 442 data points, where each data point has 10 input features
and a real-valued target. Thus X ∈ R442×10 and y ∈ R442. The input features include age, sex,
BMI, blood pressure, and six blood serum measurements. The target is a quantitative measure of
diabetes one year after the previous measurements were taken.
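A minimal sketch of accessing the data once loaded (reusing the load_diabetes call above;
illustrative only):

import sklearn.datasets

diabetes = sklearn.datasets.load_diabetes()
X, y = diabetes['data'], diabetes['target']    # X has shape (442, 10), y has shape (442,)
print(diabetes['DESCR'][:500])                 # first part of the dataset description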
(a) (2 points) Perform a two-dimensional t-SNE on the input data. You may use
sklearn.manifold.TSNE for this purpose. Visualize the dimension-reduced data points in a
scatter plot, and color code each input data point according to its target value. This can be done
using the c and cmap arguments of matplotlib.pyplot.scatter. Also, include a colorbar in
your plot.

(b) (1 point) You should obtain two oblong clusters in the above t-SNE plot. If you do not, try
running it again. What feature do these clusters correspond to? Re-plot the t-SNE plot, but with
the data points colored according to that feature.

(c) (2 points) Run t-SNE on the data matrix with the feature corresponding to the two clusters
removed. What change do you see? Explain why the change occurs, and what it would mean if,
after removing the feature, the change did not occur.

(d) (1 point) Run a two-dimensional PCA to generate a scatter plot, color coding it by label like
the t-SNE plot in part (a). If you are using numpy.linalg.svd, remember to center your data!
Otherwise, if you use sklearn.decomposition.PCA, the package will do it for you.

(e) (3 points) For the next three parts, you will use the first 100 points of the data
X_train = X[:100] as your training set and the last 342 points as your test set. Compute the
test MSE and test C-index of the OLS predictor with bias:

(w∗OLS, b∗OLS) = arg minw,b ∥Xtrainw + b1 − ytrain∥².

The test MSE equals

∥Xtestw∗OLS + b∗OLS1 − ytest∥² / ntest.

In scenarios with hard-to-predict labels, the MSE can become very large and, as a result, hard
to interpret. The C-index metric gets around this issue, since it compares how data points are
ranked relative to each other, rather than their absolute label values. Letting
ŷtest = Xtestw∗OLS + b∗OLS1 denote our test predictions, the C-index equals

(number of concordant pairs) / (number of concordant pairs + number of discordant pairs).

A concordant pair is a pair 1 ≤ i ≠ j ≤ ntest such that (ytest)i > (ytest)j and ŷi > ŷj. A
discordant pair is a pair 1 ≤ i ≠ j ≤ ntest such that (ytest)i > (ytest)j and ŷi < ŷj. Given a
random pair of data points where one has a higher label than the other, the C-index is the
probability we predict a higher value for the data point with a higher label. A random predictor
would get a C-index of 0.5 and a perfect predictor would get a C-index of 1.
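A possible sketch of the two test metrics, assuming numpy is imported as np and the predictions
and labels are 1-D arrays (helper names are hypothetical):

def test_mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def c_index(y_true, y_pred):
    # Fraction of pairs with distinct labels that the predictor orders correctly.
    concordant = discordant = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                if y_pred[i] > y_pred[j]:
                    concordant += 1
                elif y_pred[i] < y_pred[j]:
                    discordant += 1
    return concordant / (concordant + discordant)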
For the following parts, you will be doing cross validation on three different methods: ridge
regression, PCA regression, and k-nearest neighbors regression. It is recommended that you write
one cross validation function that can take in each type of method and run cross validation on
it, to shorten your code. (One possible sketch of such a function appears after part (i) below.)

(f) (4 points) By dividing X_train into 10 equal parts, run 10-fold cross validation on ridge
regression,

arg minw,b ∥XCV trainw + b1 − yCV train∥² + λ∥w∥²,

using the λ parameters [10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1]. Plot the average validation MSE and
average train MSE for each λ parameter on the same plot vs. log10(λ) on the x-axis. Choose the
λ∗ with the lowest average validation MSE, find the ridge solution on the full train set with this
λ∗, and compute the MSE and C-index of the resulting solution on the test set.

(g) (3 points) Run 10-fold cross validation on PCA regression,

arg minw,b ∥X̃CV trainVkw + b1 − yCV train∥²,

varying the value of k (number of components) from the integers 1 to 10 inclusive. Vk ∈ R10×k
contains the first k right singular vectors of the mean-centered cross validation training matrix
X̃CV train. To make a prediction on a new data point xnew ∈ R10, we need to use the saved mean
µCV train = (1/nCV train) X⊤CV train1 ∈ R10 and Vk to compute ŷnew = (xnew − µCV train)⊤Vkw + b.
The sklearn.decomposition.PCA class helpfully saves the mean vector and principal directions
computed from the dataset it was applied to. Plot the average validation MSE and average train
MSE for each value of k on the same plot vs. k on the x-axis. Choose the k∗ with the lowest
average validation MSE, find the PCA regression solution on the full training set with k∗, and
compute the MSE and C-index of the resulting solution on the test set.

(h) (2 points) Repeat the previous three parts, minus the plots, now using the first ntrain = 300
data points as the training set and the latter 142 data points as the test set. How does the test
performance of ridge and PCA regression with the best cross-validated hyperparameters λ∗ and k∗
compare to the test performance of OLS, when ntrain = 100 and when ntrain = 300? Does this agree
with what you expect?

(i) (2 points) With the same train and test sets as in the previous part, run 10-fold cross
validation on k-nearest neighbors regression, varying the value of k in
[1, 2, 3, 5, 10, 20, 30, 50, 100]. No need to plot anything. You are encouraged to use
sklearn.neighbors.KNeighborsRegressor. Using the k∗ with the lowest validation MSE, train a
KNN regressor on the full train set and compute its test MSE and test C-index.
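Following the recommendation above, a generic 10-fold cross-validation sketch (all function and
variable names here are hypothetical; numpy is assumed to be imported as np, and the fitting
routine is passed in as a callable):

def cross_validate(X, y, fit_predict, n_folds=10):
    # Return (average train MSE, average validation MSE) for a method given as a callable.
    # fit_predict(X_tr, y_tr, X_va) should fit on (X_tr, y_tr) and return
    # (predictions on X_tr, predictions on X_va).
    n = len(y)
    fold_size = n // n_folds
    train_mses, val_mses = [], []
    for f in range(n_folds):
        va_idx = np.arange(f * fold_size, (f + 1) * fold_size)
        tr_idx = np.setdiff1d(np.arange(n), va_idx)
        pred_tr, pred_va = fit_predict(X[tr_idx], y[tr_idx], X[va_idx])
        train_mses.append(np.mean((pred_tr - y[tr_idx])**2))
        val_mses.append(np.mean((pred_va - y[va_idx])**2))
    return np.mean(train_mses), np.mean(val_mses)

# Example usage with the k-nearest neighbors regressor encouraged in part (i):
# from sklearn.neighbors import KNeighborsRegressor
# def knn_fit_predict(X_tr, y_tr, X_va, k=5):
#     model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
#     return model.predict(X_tr), model.predict(X_va)
# cross_validate(X_train, y_train, knn_fit_predict)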