
CS 189 / 289 Introduction to Machine Learning
Fall 2022 Jennifer Listgarten, Jitendra Malik HW1
Due 09/20/22 11:59 pm PT

• Homework 1 consists of both written and coding questions.

• We prefer that you typeset your answers using LaTeX or other word processing software.
If you haven’t yet learned LaTeX, one of the crown jewels of computer science, now is a
good time! Neatly handwritten and scanned solutions will also be accepted for the written
questions.

• In all of the questions, show your work, not just the final answer.

Deliverables:

Submit a PDF of your homework to the Gradescope assignment entitled “HW1 Write-Up”. Please
start each question on a new page. If there are graphs, include those graphs in the correct
sections. Do not put them in an appendix. We need each solution to be self-contained on pages of its own.

• In your write-up, please state with whom you worked on the homework. If you worked by
yourself, state you worked by yourself. This should be on its own page and should be the
first page that you submit.

• In your write-up, please copy the following statement and sign your signature underneath. If
you are using LaTeX, you must type your full name underneath instead. We want to make it
extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my
own words and that I have not looked at another student’s solutions. I have given credit to
all external sources I consulted.”

• Replicate all your code in an appendix. Begin code for each coding question in a fresh
page. Do not put code from multiple questions in the same page. When you upload this PDF
on Gradescope, make sure that you assign the relevant pages of your code from appendix to
correct questions.

HW1,©UCB CS 189 / 289, Fall 2022. All Rights Reserved. This may not be publicly shared without explicit permission. 1

1 (12 points) Multivariate Gaussians: A review

(a) (4 points) Consider a two dimensional, zero mean random variable Z = [Z1 Z2]⊤. In order for the random variable to be jointly Gaussian, a necessary and sufficient condition is

• Z1 and Z2 are each marginally Gaussian, and

• Z1 | Z2 = z is Gaussian, and Z2 | Z1 = z is Gaussian.

A second characterization of a jointly Gaussian zero mean RV Z ∈ R2 is that it can be written
as Z = AX, where X ∈ R2 is a collection of i.i.d. standard normal RVs and A ∈ R2×2 is a matrix.

Note that the probability density function of a non-degenerate (meaning the covariance matrix is positive definite and thus invertible) multivariate Gaussian RV with mean vector µ ∈ R^d and covariance matrix Σ is:

f(z) = 1/((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2)(z − µ)⊤ Σ^{−1} (z − µ) ).

Let X1 and X2 be i.i.d. standard normal RVs. Let U denote a binary random variable uniformly
distributed on {−1, 1}, independent of everything else. Use one of the two characterizations
given above to determine whether the following RVs are jointly Gaussian, and calculate the
covariance matrix (regardless of whether the RVs are jointly Gaussian).

• Z1 = X1 and Z2 = X2.

• Z1 = X1 and Z2 = X1 + X2.

• Z1 = X1 and Z2 = −X1.

• Z1 = X1 and Z2 = UX1.
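
As an optional numerical sanity check of the four cases above (not a substitute for the derivations), the following numpy sketch estimates each 2×2 covariance matrix by simulation; the sample size and tolerances are arbitrary choices, not part of the assignment.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X1, X2 = rng.standard_normal(n), rng.standard_normal(n)
U = rng.choice([-1.0, 1.0], size=n)

pairs = {
    "Z2 = X2":      (X1, X2),
    "Z2 = X1 + X2": (X1, X1 + X2),
    "Z2 = -X1":     (X1, -X1),
    "Z2 = U*X1":    (X1, U * X1),
}
for name, (Z1, Z2) in pairs.items():
    # empirical 2x2 covariance matrix of (Z1, Z2)
    print(name, "\n", np.cov(np.vstack([Z1, Z2])).round(3))

# In the last case Z1 + Z2 = (1 + U) * X1 has a point mass at 0,
# so (Z1, Z2) cannot be jointly Gaussian even though each marginal is N(0, 1).
print("P(Z1 + U*X1 == 0) ≈", np.mean(np.isclose(X1 + U * X1, 0.0)))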

(b) (2 points) Show that two Gaussian random variables can be uncorrelated, but not independent
(Hint: use one of the examples in part (a)). On the other hand, show that two uncorrelated,
jointly Gaussian RVs are independent.

(c) (1 point) With the setup in (a), let Z = VX, where V ∈ R2×2, and Z, X ∈ R2. What is the
covariance matrix ΣZ? Is this necessarily true if X has the identity matrix I ∈ R2×2 as its
covariance matrix, but is not a multivariate Gaussian?

(d) (2 points) Use the above setup to show that X1 + X2 and X1 − X2 are independent. Give another example pair of linear combinations that are independent (you may not simply use multiples of X1 + X2 and X1 − X2, or X1 and X2).

(e) (3 points) Given a jointly Gaussian zero mean RV Z = [Z1 Z2]⊤ ∈ R² with covariance matrix

ΣZ = [ Σ11  Σ12 ]
     [ Σ12  Σ22 ],

derive the distribution of Z1 | Z2 = z.

Hint: The following identity may be useful:

[ a  b ]^{−1}   [ 1     0 ] [ (a − b²/c)^{−1}   0   ] [ 1  −b/c ]
[ b  c ]      = [ −b/c  1 ] [ 0                 1/c ] [ 0    1  ]

2 Linear Regression, Projections and Pseudoinverses (13 points)

We are given X ∈ Rn×d where n > d and rank(X) = d. We are also given a vector y ∈ Rn. Define
the orthogonal projection of y onto range(X) as PX(y).

Background on orthogonal projections: For any finite-dimensional subspace W (here, range(X)) of
a vector space V (here, Rn), any vector v ∈ V can be decomposed as

v = w + u, w ∈ W, u ∈ W⊥,

where W⊥ is the orthogonal complement of W. Furthermore, this decomposition is unique: if
v = w′ + u′ where w′ ∈ W, u′ ∈ W⊥, then w′ = w and u′ = u. These two facts allow us to define
PW , the orthogonal projection operator onto W. Given a vector v with decomposition v = w + u,

PW(v) = w.

It can also be shown using these two facts that PW is linear. For more information on orthogonal
projections, see https://gwthomas.github.io/docs/math4ml.pdf.

(a) (2 points) Prove that PX(y) = arg min_{w ∈ range(X)} ∥y − w∥₂².

Side Note: In lecture, we sketched the geometric intuition of a projection and the connection between minimizing the least squares loss L(θ) = ∥y − Xθ∥₂² and finding a projection. Least squares seeks the vector in the columnspace of X that is the closest to y. Hence, PX(y) = Xθ∗.
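
A small numerical illustration of this connection (purely optional; the sizes below are made up): the least-squares fit Xθ∗ and the projection X(X⊤X)^{−1}X⊤ y coincide, and the residual is orthogonal to range(X).

import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3
X = rng.standard_normal((n, d))      # full column rank with probability 1
y = rng.standard_normal(n)

theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
P = X @ np.linalg.inv(X.T @ X) @ X.T                 # projection ("hat") matrix

print(np.allclose(X @ theta_star, P @ y))    # projection equals the least-squares fit
print(np.allclose(X.T @ (y - P @ y), 0.0))   # residual is orthogonal to the columns of X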

(b) (3 points) An orthogonal projection is a linear transformation. Hence, we can define PX(y) = Py for some projection matrix P. Specifically, given 1 ≤ d ≤ n, a matrix P ∈ R^{n×n} is said to be a rank-d orthogonal projection matrix if rank(P) = d, P = P⊤ and P² = P. Prove that P is a rank-d projection matrix if and only if there exists a U ∈ R^{n×d} such that P = UU⊤ and U⊤U = I.

Hint: Use the eigendecomposition of P.

(c) (1 point) Prove that if P is a rank d projection matrix, then Tr(P) = d.

(d) (2 points) The Singular Value Decomposition theorem states that we can write any matrix X ∈ R^{n×d} as

X = Σ_{i=1}^{min(n,d)} σ_i u_i v_i⊤,

where σ_i ≥ 0, and {u_i}_{i=1}^n and {v_i}_{i=1}^d are orthonormal bases for R^n and R^d respectively. Some of the singular values σ_i may equal 0, indicating that the associated left and right singular vectors u_i and v_i do not contribute to the sum, but sometimes it is still convenient to include them in the SVD so we have complete orthonormal bases for R^n and R^d to work with. Show that

(i) {v_i : σ_i > 0} is an orthonormal basis for the row space of X;



(ii) Similarly, {u_i : σ_i > 0} is an orthonormal basis for the columnspace of X.
Hint: consider X⊤.

(e) (2 points) Prove that if X ∈ Rn×d and rank(X) = d, then X(X⊤X)−1X⊤ is a rank-d orthogonal
projection matrix. What is the corresponding matrix U, in terms of an SVD of X, such that
X(X⊤X)−1X⊤ = UU⊤ and U⊤U = I?
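
An optional numpy check of the claim in (e), using the thin SVD to supply the matrix U (the dimensions below are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
n, d = 7, 3
X = rng.standard_normal((n, d))

P = X @ np.linalg.inv(X.T @ X) @ X.T               # candidate projection matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is n x d

print(np.allclose(P, U @ U.T))           # P = U U^T
print(np.allclose(U.T @ U, np.eye(d)))   # U^T U = I
print(np.isclose(np.trace(P), d))        # consistent with part (c): Tr(P) = d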

(f) (3 points) Define the Moore-Penrose pseudoinverse to be the matrix

X† = Σ_{i : σ_i > 0} (1/σ_i) v_i u_i⊤,

where σ_i, u_i, v_i come from an SVD of X as in part (d).

(i) Show that X†X is a projection matrix.

(ii) Which subspace does it project onto?

(iii) What is this subspace if rank(X) = d?

(iv) If rank(X) = d and n = d, does X^{−1} exist? If so, how is X† related to X^{−1}?
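
An optional numerical look at (f) using numpy's built-in pseudoinverse; it only illustrates the properties, it does not prove them, and the rank-deficient example below is made up.

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4)) @ np.diag([1.0, 1.0, 1.0, 0.0])   # rank(X) = 3 < d = 4
X_pinv = np.linalg.pinv(X)                                        # Moore-Penrose pseudoinverse

P = X_pinv @ X
print(np.allclose(P, P.T), np.allclose(P @ P, P))          # X†X is symmetric and idempotent
print(np.isclose(np.trace(P), np.linalg.matrix_rank(X)))   # its rank matches rank(X)

# If X is square and invertible, the pseudoinverse reduces to the ordinary inverse:
A = rng.standard_normal((4, 4))
print(np.allclose(np.linalg.pinv(A), np.linalg.inv(A)))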


3 Some MLEs (11 points)

For this question, assume you observe n (data point, label) pairs (x_i, y_i)_{i=1}^n, with x_i ∈ R^d and y_i ∈ R for all i = 1, . . . , n. We denote X as the data matrix containing all the data points and y as the label vector containing all the labels:

X = [x_1 · · · x_n]⊤ ∈ R^{n×d},   y = [y_1 · · · y_n]⊤ ∈ R^n.
(a) (4 points) Ignoring y for now, suppose we model the data points as coming from a d-dimensional Gaussian with diagonal covariance:

x_i ∼ N(µ, Σ) ∀i = 1, . . . , n,   Σ = diag(σ_1², . . . , σ_d²).

If we consider µ ∈ R^d and (σ_1², . . . , σ_d²), where each σ_i² > 0, to be unknown, the parameter space here is 2d-dimensional. When we refer to Σ as a parameter, we are referring to the d-tuple (σ_1², . . . , σ_d²), but inside a linear algebraic expression, Σ denotes the diagonal matrix diag(σ_1², . . . , σ_d²). Compute the log-likelihood ℓ(µ, Σ) = log p(X | µ, Σ), and,

(i) find the MLE of µ assuming Σ is known;

(ii) find the MLE of Σ assuming µ is known;

(iii) find the joint MLE of (µ,Σ).

Justify why your answers maximize the likelihood.
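
An optional simulation you can use to sanity-check the estimators you derive (the “true” parameters below are made up, and this check is not graded): the standard Gaussian-MLE formulas are the sample mean and the per-coordinate average squared deviation, with a 1/n factor rather than 1/(n−1).

import numpy as np

rng = np.random.default_rng(4)
n, d = 50_000, 3
mu_true = np.array([1.0, -2.0, 0.5])        # made-up ground truth for the simulation
sig2_true = np.array([0.5, 2.0, 1.0])
X = mu_true + np.sqrt(sig2_true) * rng.standard_normal((n, d))

# Closed-form MLEs: sample mean and per-coordinate mean squared deviation (1/n factor)
mu_hat = X.mean(axis=0)
sig2_hat = ((X - mu_hat) ** 2).mean(axis=0)
print(mu_hat.round(2), sig2_hat.round(2))    # should be close to the true parameters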

(b) (3 points) Now we consider X as fixed, and not a random matrix. Suppose we model the outcome y as

y_i ∼ N(x_i⊤ w, σ²) ∀i = 1, . . . , n,

where w ∈ R^d is an unknown parameter.

(i) Compute the log-likelihood ℓ(w) = log p(y | w) and show that finding the MLE of w is equivalent to linear regression:

min_{w ∈ R^d} ∥Xw − y∥₂².

In lecture it was shown that the unique solution when rank(X) = d is w∗ = (X⊤X)^{−1}X⊤y. In the case where rank(X) < d, show

(ii) if a solution exists, it is not unique;

(iii) a solution does exist, namely X†y is a solution.

Hint: the concepts in Question 2 of this homework may help here. One interesting fact to note is that X†y is the minimum norm solution. You have the tools you need to prove this, and feel free to optionally include a proof in your writeup, but this will not be graded.

(c) (2 points) Again considering X fixed, now suppose that each y_i ∈ {0, 1} is binary-valued and that we model y as

y_i ∼ Ber(s(x_i⊤ w)) ∀i = 1, . . . , n,

where s : R → R, s(z) = 1/(1 + e^{−z}) is the sigmoid function, and Ber(p) denotes the Bernoulli distribution which takes value 1 with probability p and 0 with probability 1 − p.

(i) Write down the log-likelihood ℓ(w) = log p(y | w) and show that finding the MLE of w is equivalent to minimizing the cross entropy between Ber(y_i) and Ber(s(x_i⊤ w)) for each i:

min_w Σ_{i=1}^n H(Ber(y_i), Ber(s(x_i⊤ w))).   (1)

Definition of cross entropy: given two discrete probability distributions π : Ω → [0, 1] and θ : Ω → [0, 1] on some outcome space Ω, we have Σ_{ω∈Ω} π(ω) = Σ_{ω∈Ω} θ(ω) = 1 and

H(π, θ) = −Σ_{ω∈Ω} π(ω) log θ(ω).

(ii) Show that (1) (and therefore finding the MLE) is equivalent to the following problem:

min_w Σ_{i=1}^n log(1 + exp(−z_i x_i⊤ w)),   (2)

where z_i = 1 if y_i = 1 and z_i = −1 if y_i = 0.

Note: both (1) and (2) are referred to as logistic regression.

(d) (2 points) Consider the Categorical(θ_1, . . . , θ_K) distribution (θ_k ≥ 0 ∀k and Σ_{k=1}^K θ_k = 1), which takes values in {1, . . . , K}. A random variable Z with this distribution has P(Z = k) = θ_k ∀k = 1, . . . , K. Now, ignoring the data points X, suppose that ∀i = 1, . . . , n, y_i ∈ {1, . . . , K}, and

y_i ∼ Categorical(θ_1, . . . , θ_K).

Compute the MLE of θ = (θ_1, . . . , θ_K). One method is to use Lagrange multipliers. Another method is to use the fact that the KL divergence is nonnegative:

KL(π ∥ θ) = Σ_{ω∈Ω} π(ω) log(π(ω)/θ(ω)) ≥ 0.

You are free to use either method to compute the MLE of θ.

4 Geometry of Ridge Regression (13 points)

You recently learned ridge regression and how it differs from ordinary least squares. In this question we will explore how ridge regression is related to solving a constrained least squares problem in terms of their parameters and solutions.

(a) (1 point) Given a matrix X ∈ R^{n×d} and a vector y ∈ R^n, define the optimization problem

min_w ∥y − Xw∥₂²   (3)
subject to ∥w∥₂² ≤ β.

We can utilize Lagrange multipliers to incorporate the constraint into the objective function by adding a term which acts to “penalize” the thing we are constraining. Write down the Lagrangian of this constrained problem.

Relevant background on Lagrangians: Given a constrained problem of the form

min_x f_0(x)
subject to f_i(x) ≤ c_i ∀i = 1, . . . , p,

where f_0, . . . , f_p are functions R^d → R and c_1, . . . , c_p are scalars, its Lagrangian is defined as

L(x, λ) = f_0(x) + Σ_{i=1}^p λ_i (f_i(x) − c_i).

According to duality theory, if all the functions f_0, . . . , f_p are convex and there exists a strictly feasible point (this happens to be true for our problem, assuming β > 0), then there exists
λ∗ ∈ R^p such that

x∗ = arg min_x L(x, λ∗)

is the solution to the original constrained problem.

(b) (1 point) Recall that ridge regression is given by the unconstrained optimization problem

min_w ∥y − Xw∥₂² + ν∥w∥₂².   (4)

This means that one way to interpret “ridge regression” is as the Lagrangian form of a constrained problem. Qualitatively, how would increasing β in our previous problem be reflected in the desired penalty ν of ridge regression (i.e. if our threshold β increases, what should we do to ν)?

(c) (1 point) One reason why we might want to have small weights w has to do with the sensitivity
of the predictor to its input. Let x be a d-dimensional list of features corresponding to a new
test point. Our predictor is w⊤x. What is an upper bound on how much our prediction could
change if we added noise ϵ ∈ Rd to a test point’s features x, in terms of ∥w∥2 and ∥ϵ∥2?


(d) (2 points) Derive that the solution to ridge regression (4) is given by ŵ_r = (X⊤X + νI)^{−1}X⊤y. What happens when ν → ∞? It is for this reason that sometimes regularization is referred to as “shrinkage.”
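
An optional numpy check of the closed form above (the sizes and ν values are arbitrary): solving the regularized normal equations directly reproduces (X⊤X + νI)^{−1}X⊤y, and the norm of the solution shrinks as ν grows.

import numpy as np

rng = np.random.default_rng(5)
n, d = 20, 4
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

for nu in (0.1, 1.0, 10.0, 1e6):
    A = X.T @ X + nu * np.eye(d)
    w_closed = np.linalg.inv(A) @ X.T @ y     # closed form from part (d)
    w_solved = np.linalg.solve(A, X.T @ y)    # numerically preferred equivalent
    assert np.allclose(w_closed, w_solved)
    print(f"nu = {nu:>9}: ||w_r|| = {np.linalg.norm(w_closed):.6f}")   # shrinks toward 0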

(e) (1 point) Note that in computing ŵ_r, we are trying to invert the matrix X⊤X + νI instead of the matrix X⊤X. If X⊤X has eigenvalues σ_1², . . . , σ_d², what are the eigenvalues of X⊤X + νI? Comment on why adding the regularizer term νI can improve the inversion operation numerically.

(f) (1 point) Let the number of parameters d = 3 and the number of datapoints n = 5, and let
the eigenvalues of X⊤X be given by 1000, 1 and 0.001. We must now choose between two
regularization parameters ν1 = 100 and ν2 = 0.5. Which do you think is a better choice for this
problem and why?

(g) (2 points) Another advantage of ridge regression can be seen for under-determined systems.
Say we have the data drawn from a d = 5 parameter model, but only have n = 4 training
samples of it, i.e. X ∈ R4×5. Now this is clearly an underdetermined system, since n < d. Show that ridge regression with ν > 0 results in a unique solution, whereas ordinary least
squares has an infinite number of solutions.

Hint: To make this point, it may be helpful to consider w = w0 + w∗ where w0 is in the null
space of X and w∗ is a solution.

(h) (2 points) What will the solution to ridge regression given in part (d) converge to if you take
the limit ν→ 0?

Hint: Use the SVD of X.
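
An optional empirical look at this limit on an underdetermined problem like the one in part (g); np.linalg.lstsq returns the minimum-norm least-squares solution, which is one natural reference point to compare against while you work out the answer analytically. The sizes and ν values below are arbitrary.

import numpy as np

rng = np.random.default_rng(6)
n, d = 4, 5                                    # underdetermined, as in part (g)
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm least-squares solution
for nu in (1.0, 1e-3, 1e-6, 1e-9):
    w_ridge = np.linalg.solve(X.T @ X + nu * np.eye(d), X.T @ y)
    print(f"nu = {nu:g}: ||w_ridge - w_ref|| = {np.linalg.norm(w_ridge - w_ref):.2e}")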

(i) (2 points) Tikhonov regularization is a general term for ridge regression, where the implicit constraint set takes the form of an ellipsoid instead of a ball. In other words, we solve the optimization problem

w = arg min_w ∥y − Xw∥₂² + ν∥Γw∥₂²

for some full rank matrix Γ ∈ R^{d×d}. Derive a closed form solution for w.


5 Robotic Learning of Controls from Demonstrations and Images (11 points)

Huey, a home robot, is learning to retrieve objects from a cupboard, as shown in Fig. 1. The goal
is to push obstacle objects out of the way to expose a goal object. Huey’s robot trainer, Anne,
provides demonstrations via tele-operation. When tele-operating the robot, Anne can look at the
images captured by the robot and provide controls to Huey remotely.

During a demonstration, Huey records the RGB images of the scene for each of the n timesteps,
x1, …, xn, where xi ∈ R30×30×3 and the controls for his body for each of the n timesteps, u1, . . . , un,
where ui ∈ R3. The controls correspond to making small changes in the 3D pose (i.e. translation
and rotation) of his body. Examples of the data are shown in the figure.

Under an assumption (sometimes called the Markovian assumption) that all that matters for the
current control is the current image, Huey can try to learn a linear policy π (where π ∈ R2700×3)
which linearly maps image states to controls (i.e. π⊤x = u). We will now explore how Huey can
recover this policy using linear regression. Note: please use numpy and numpy.linalg to complete this assignment.

[Figure 1 panels: A) Robot Performing Task; B) Dataset]

Figure 1: A) Huey trying to retrieve a mustard bottle. An example RGB image of the workspace taken from
his head mounted camera is shown in the orange box. The angle of the view gives Huey an eye-in-hand
perspective of the cupboard he is reaching into. B) A scatter plot of the 3D control vectors, or u labels.
Notice that each coordinate of the label lies within the range of [−1, 1] for the change in position. Example
images, or states x, are shown for some of the corresponding control points. The correspondence is indicated
by the blue lines.

(a) (2 points) To get familiar with the structure of the data, please visualize the 0th, 10th and 20th images in the training dataset. Also find out what their corresponding control vectors are.
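
A possible starting point (a sketch only), assuming x_train.p and y_train.p are pickle files whose contents can be indexed by timestep and using matplotlib for display; the exact format of the released files may differ, so adjust the loading and indexing as needed.

import pickle
import numpy as np
import matplotlib.pyplot as plt

with open("x_train.p", "rb") as f:
    x_train = pickle.load(f)
with open("y_train.p", "rb") as f:
    y_train = pickle.load(f)

for idx in (0, 10, 20):
    img = np.asarray(x_train[idx])        # expected shape: (30, 30, 3)
    control = np.asarray(y_train[idx])    # expected shape: (3,)
    plt.figure()
    plt.title(f"training image {idx}, control = {control}")
    plt.imshow(img.astype(np.uint8))
plt.show()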

(b) (1 point) Load the n training examples from x_train.p and compose the matrix X, where X ∈ R^{n×2700}. Note: you will need to flatten the images to reduce them to a single vector. The flattened image vector will be denoted by x̄ (where x̄ ∈ R^{2700×1}). Next, load the n examples from y_train.p and compose the matrix U, where U ∈ R^{n×3}. Try to perform ordinary least squares by forming the matrix (X⊤X)^{−1}X⊤ to solve

min_π ∥Xπ − U∥²_F

to learn the policy π ∈ R^{2700×3}. Report what happens as you attempt to do this and explain why.
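
A sketch of the attempt described above, under the same file-format assumption as the part (a) sketch; whatever error or numerical behaviour you observe here is exactly what the question asks you to report and explain.

import pickle
import numpy as np

with open("x_train.p", "rb") as f:
    x_train = pickle.load(f)
with open("y_train.p", "rb") as f:
    y_train = pickle.load(f)

X = np.asarray(x_train, dtype=np.float64).reshape(len(x_train), -1)   # n x 2700
U = np.asarray(y_train, dtype=np.float64)                             # n x 3

try:
    pi = np.linalg.inv(X.T @ X) @ X.T @ U    # attempt ordinary least squares
    print("OLS produced a policy of shape", pi.shape)
except np.linalg.LinAlgError as err:
    print("OLS failed:", err)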

(c) (2 points) Now try to perform ridge regression:

min_π ∥Xπ − U∥²_F + λ∥π∥²_F

on the dataset for regularization values λ = {0.1, 1.0, 10, 100, 1000}. Measure the average squared Euclidean distance for the accuracy of the policy on the training data:

(1/n) Σ_{i=1}^n ∥x̄_i⊤ π − u_i⊤∥₂².

(In the above, we are taking the ℓ2 norm of a row vector, which here we take to mean the ℓ2 norm of the column vector we get by transposing it.) Report the training error results for each value of λ.
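
A sketch of the ridge fits, assuming X ∈ R^{n×2700} and U ∈ R^{n×3} have already been formed as in the part (b) sketch; it uses the standard closed form (X⊤X + λI)^{−1}X⊤U and reports the average squared Euclidean training error defined above.

import numpy as np

def ridge_policy(X, U, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T U."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ U)

def avg_sq_error(X, U, pi):
    """Average over timesteps of the squared Euclidean error of the predicted control."""
    return np.mean(np.sum((X @ pi - U) ** 2, axis=1))

# X, U as constructed in the part (b) sketch
for lam in (0.1, 1.0, 10.0, 100.0, 1000.0):
    pi = ridge_policy(X, U, lam)
    print(f"lambda = {lam:7.1f}   training error = {avg_sq_error(X, U, pi):.6f}")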

(d) (2 points) Next, we are going to try standardizing the states. For each pixel value in each data point, x, perform the following operation:

x ← (x/255) · 2 − 1.

Since we know the maximum pixel value is 255, this rescales the data to be between [−1, 1]. Repeat the previous part and report the average squared training error for each value of λ.
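
A sketch of the standardized fits, continuing from the matrices formed in the earlier sketches; the rescaling below is my reading of “map pixel values in [0, 255] to [−1, 1]”, so double-check it against the assignment's intent.

import numpy as np

# X, U as constructed in the part (b) sketch; pixel values assumed to lie in [0, 255]
X_std = 2.0 * (X / 255.0) - 1.0

for lam in (0.1, 1.0, 10.0, 100.0, 1000.0):
    d = X_std.shape[1]
    pi = np.linalg.solve(X_std.T @ X_std + lam * np.eye(d), X_std.T @ U)
    err = np.mean(np.sum((X_std @ pi - U) ** 2, axis=1))
    print(f"lambda = {lam:7.1f}   standardized training error = {err:.6f}")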

(e) (3 points) Evaluate both policies (i.e. with and without standardization) on the new validation data x_test.p and y_test.p for the different values of λ. Report the average squared Euclidean loss and qualitatively explain how changing the values of λ affects the performance in terms of bias and variance.

(f) (1 point) To better understand how standardizing improved the loss function, we are going to evaluate the condition number κ of the optimization, which is defined as

κ = σ_max(X⊤X + λI) / σ_min(X⊤X + λI),
or the ratio of the maximum singular value to the minimum singular value of the relevant
matrix. Roughly speaking, the condition number of the optimization process measures how
stable the solution will be when some error exists in the observations. More precisely, given


a linear system Ax = b, the condition number of the matrix A is the maximum ratio of the
relative error in the solution x to the relative error of b.

For the regularization value of λ = 100, report the condition number with the standardization technique applied and without.
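
A sketch of the condition-number comparison, reusing X and X_std from the earlier sketches; it computes the singular values of X⊤X + λI directly, which is adequate for a 2700 × 2700 matrix but not the most efficient route.

import numpy as np

def condition_number(A, lam):
    """kappa of A^T A + lam I: ratio of its largest to smallest singular value."""
    s = np.linalg.svd(A.T @ A + lam * np.eye(A.shape[1]), compute_uv=False)
    return s.max() / s.min()

lam = 100.0
print("kappa without standardization:", condition_number(X, lam))
print("kappa with standardization:   ", condition_number(X_std, lam))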

