CS 189 / 289A Introduction to Machine Learning
Fall 2022 Jennifer Listgarten, Jitendra Malik HW3
Due 10/14 at 11:59pm

• Homework 3 consists only of written questions.

• We prefer that you typeset your answers using LaTeX or other word processing software.
If you haven’t yet learned LaTeX, one of the crown jewels of computer science, now is a
good time! Neatly handwritten and scanned solutions will also be accepted for the written
questions.

• In all of the questions, show your work, not just the final answer.

• This homework is due a bit earlier than usual to give you time to prepare for Midterm 1.

Deliverables:

Submit a PDF of your homework to the Gradescope assignment entitled “HW3 Write-Up”. Please
start each question on a new page. If there are graphs, include those graphs in the correct
sections. Do not put them in an appendix. We need each solution to be self-contained on pages of its own.

• In your write-up, please state with whom you worked on the homework. If you worked by
yourself, state you worked by yourself. This should be on its own page and should be the
first page that you submit.

• In your write-up, please copy the following statement and sign your signature underneath. If
you are using LaTeX, you must type your full name underneath instead. We want to make it
extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my
own words and that I have not looked at another student’s solutions. I have given credit to
all external sources I consulted.”

HW3, © UCB CS 189 / 289A, Fall 2022. All Rights Reserved. This may not be publicly shared without explicit permission.

1 Logistic posterior with exponential class conditionals (6 points)

We have seen in class that Gaussian class conditionals can lead to a posterior that is a logistic
function with an argument linear in X. Now, suppose the class conditionals are exponentially
distributed with parameters λi, i.e.

p(x | Y = i) = λi exp(−λi x) for x ≥ 0, where i ∈ {0, 1} and Y ∼ Bernoulli(π)

(a) (4 points) Show that the posterior distribution of the class label given X is also a logistic
function, again with a linear (affine) argument in X. That is,

p(Y = 1 | x) = 1 / (1 + exp(−h(x))),

where h is an affine function of x involving λ0, λ1, and π, which you should find.
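As a sanity check on part (a) (not a substitute for the derivation), the short Python sketch below computes the posterior directly from Bayes’ rule and verifies numerically that its log-odds is affine in x. The parameter values λ0 = 0.5, λ1 = 2, π = 0.3 are arbitrary illustrative choices.

# A quick numerical sanity check (not a derivation): with exponential class
# conditionals, the log-odds log p(Y=1|x)/p(Y=0|x) should be affine in x.
# The parameter values below are arbitrary illustrative choices.
import numpy as np

lam0, lam1, prior = 0.5, 2.0, 0.3   # lambda_0, lambda_1, pi (hypothetical values)
x = np.linspace(0.0, 5.0, 200)      # evenly spaced grid of non-negative x

# Bayes' rule: p(Y=1|x) is proportional to pi * p(x|Y=1)
num = prior * lam1 * np.exp(-lam1 * x)
den = num + (1 - prior) * lam0 * np.exp(-lam0 * x)
posterior = num / den

# If the posterior is logistic with an affine argument h(x), the log-odds
# equals h(x), so its second differences on an even grid should vanish.
log_odds = np.log(posterior / (1 - posterior))
print(np.allclose(np.diff(log_odds, n=2), 0.0))   # expected: True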

(b) (2 points) Assuming λ0 < λ1, what is the decision boundary and decision rule of the Bayes
classifier?

2 Gaussian Classification (6 points)

Let P(x | Ci) ∼ N(µi, σ²) for a two-class, one-dimensional classification problem with classes C1
and C2, prior probabilities P(C1) = P(C2) = 1/2, and µ2 > µ1.

(a) (3 points) Find the Bayes optimal decision boundary and the corresponding Bayes decision rule.
(b) (3 points) The Bayes error is the probability of misclassification,

Pe = P(classified as C1 | C2) P(C2) + P(classified as C2 | C1) P(C1).

Show that the Bayes error associated with this decision rule is

Pe = (1/√(2π)) ∫_a^∞ e^(−z²/2) dz,   where a = (µ2 − µ1) / (2σ).
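Below is a minimal Monte Carlo sketch for checking the boundary and error expression above; the values µ1 = 0, µ2 = 2, σ = 1 are hypothetical, and the sketch assumes the midpoint threshold you should derive in part (a).

# Illustrative Monte Carlo check (hypothetical values mu1=0, mu2=2, sigma=1):
# classify at the midpoint threshold and compare the empirical error rate
# with the Gaussian tail probability Q((mu2 - mu1) / (2 * sigma)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu1, mu2, sigma, n = 0.0, 2.0, 1.0, 1_000_000

labels = rng.integers(1, 3, size=n)              # true class, 1 or 2, equal priors
means = np.where(labels == 1, mu1, mu2)
x = rng.normal(means, sigma)                     # draw x | class

threshold = (mu1 + mu2) / 2                      # midpoint boundary for equal priors
pred = np.where(x < threshold, 1, 2)

empirical_error = np.mean(pred != labels)
tail_error = norm.sf((mu2 - mu1) / (2 * sigma))  # 1 - Phi(a)
print(empirical_error, tail_error)               # the two should be close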


3 Poisson Classification (12 points)

Recall that the probability mass function (PMF) of a Poisson random variable is

P(X = x) = (λ^x e^(−λ)) / x! ,   x ∈ {0, 1, 2, . . .}

The PMF is defined for non-negative integral values.

You are given a sample containing a radioactive isotope. You know it is one of two potential
isotopes: ω1 and ω2, but you do not know which one. A priori, it is equally likely to be either one,
i.e. P(ω1) = P(ω2) = 1/2.

Using a Geiger counter, you measure x, the number of decay events from the sample within a
fixed time interval. This follows a Poisson distribution with mean dependent on the isotope. Thus,
x|ω1 ∼ Poisson(λ1) and x|ω2 ∼ Poisson(λ2), where λ1 and λ2 are the expected number of decay
events for each isotope considered. λ1 and λ2 can be treated as known; one can compute them by
looking up the decay rate and atomic weight of each isotope and weighing the sample.

(a) (2 points) Find P(ω1|x) in terms of λ1 and λ2. Show that the posterior P(ω1|x) is a logistic
function 1 / (1 + e^(−h(x))) and write h(x) in terms of λ1, λ2, and x.

(b) (4 points) Find the Bayes optimal rule (decision boundary) for classifying the observation x as
either isotope. In the case where P(ω1) = P(ω2) = 1/2, λ1 = 10, and λ2 = 15, calculate the
decision boundary and the 2 × 2 confusion matrix C ∈ [0, 1]^(2×2):

C = [ P(ω̂(x) = ω1, ω1)   P(ω̂(x) = ω1, ω2) ]
    [ P(ω̂(x) = ω2, ω1)   P(ω̂(x) = ω2, ω2) ]

where ω̂(x) denotes the class predicted by the Bayes optimal rule. For reference, P(ω̂(x) =
ω1, ω1) denotes the joint probability that ω̂(x) = ω1 and the true class is ω1. Hint: first,
compute the diagonal elements.

Using the confusion matrix, compute the total error rate, precision, recall, and F1 score. F1
score is the harmonic mean of precision and recall, and has the formula

F1 = 2TP / (2TP + FP + FN) = (2 · Precision · Recall) / (Precision + Recall)

For the latter three metrics, treat ω1 as the positive class and ω2 as the negative class (perhaps
ω1 is a radioactive isotope you are particularly interested in).
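One way to check the hand computation in part (b) is numerically, e.g. with scipy.stats. The sketch below assumes the Bayes rule reduces to thresholding x at the point where the two posteriors cross (derive this yourself first); its output can be compared against your calculations.

# A sketch for numerically checking the hand computation in part (b).
# It assumes the Bayes rule reduces to thresholding x where the two class
# posteriors are equal; verify that threshold yourself before relying on it.
import numpy as np
from scipy.stats import poisson

lam1, lam2, prior = 10, 15, 0.5
x_star = (lam2 - lam1) / np.log(lam2 / lam1)     # posteriors cross here (~12.33)
k = np.floor(x_star)                             # predict omega_1 for x <= k

# Joint probabilities P(prediction, true class); rows = prediction, cols = truth
C = np.array([
    [prior * poisson.cdf(k, lam1), prior * poisson.cdf(k, lam2)],
    [prior * poisson.sf(k, lam1),  prior * poisson.sf(k, lam2)],
])

error_rate = C[0, 1] + C[1, 0]                   # off-diagonal mass
tp, fp, fn = C[0, 0], C[0, 1], C[1, 0]           # omega_1 treated as positive
precision, recall = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(C, error_rate, precision, recall, f1)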

(c) (4 points) Suppose instead of one, we can obtain two independent measurements x1 and x2
from the radioactive substance. How does the classification rule change? In the case where
P(ω1) = P(ω2) = 1/2, λ1 = 10, and λ2 = 15, recalculate the confusion matrix and metrics
from the previous part.
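A similar sketch can be used for part (c). It assumes (as you should verify) that the sum x1 + x2 of two independent Poisson counts is itself Poisson with mean 2λ, so the rule again thresholds a single count.

# Sketch for part (c). With two independent measurements, x1 + x2 is itself
# Poisson with mean 2*lambda (a standard fact), so an analogous threshold rule
# applies to the sum; verify that reduction before trusting these numbers.
import numpy as np
from scipy.stats import poisson

lam1, lam2, prior = 10, 15, 0.5
k2 = np.floor(2 * (lam2 - lam1) / np.log(lam2 / lam1))   # threshold on x1 + x2 (~24.66)

C2 = np.array([
    [prior * poisson.cdf(k2, 2 * lam1), prior * poisson.cdf(k2, 2 * lam2)],
    [prior * poisson.sf(k2, 2 * lam1),  prior * poisson.sf(k2, 2 * lam2)],
])
print(C2, C2[0, 1] + C2[1, 0])                           # confusion matrix, error rate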

(d) (2 points) Comment on how your answers to part (c) compare to your answers to part (b). Give
an explanation for what you observe.


4 Exploring Bias Variance between Ridge and Least Squares Regression
(18 points)

Recall the statistical model for ridge regression from lecture. We have a design matrix X, where
the rows of X ∈ Rn×d are our data points xi ∈ Rd. We assume a linear regression model

Y = Xw∗ + z

where w∗ ∈ Rd is the true parameter we are trying to estimate, z = [z1, . . . , zn]⊤ ∼ N(0, σ²In), and
Y = [y1, . . . , yn]⊤ is the random vector representing our labels.

Throughout this problem, you may assume X⊤X is invertible. Given a realization of the labels
Y = y, recall these two estimators we have studied:

wols = argmin_w ∥Xw − y∥²₂

wridge = argmin_w ∥Xw − y∥²₂ + λ∥w∥²₂
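For a concrete picture of the two estimators, here is a small synthetic-data sketch that evaluates the standard closed forms requested in part (a); the data dimensions and λ are arbitrary illustrative choices, and the formulas are stated here without derivation.

# Illustrative comparison of the closed-form OLS and ridge solutions on
# synthetic data (standard formulas; stated without derivation).
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 100, 5, 0.5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)                      # (X^T X)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # (X^T X + lam I)^{-1} X^T y
print(w_true, w_ols, w_ridge, sep="\n")                        # ridge is shrunk toward 0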

(a) (1 point) Write the solutions for wols and wridge. No need to derive them.

(b) (2 points) Let ŵ ∈ Rd denote any estimator of w∗. In the context of this problem, an estimator
ŵ = ŵ(Y) is any function which takes the data X and a realization of Y, and computes a guess of w∗.
Define the MSE (mean squared error) of the estimator ŵ as

MSE(ŵ) := E[ ∥ŵ − w∗∥²₂ ]

Above, the expectation is taken w.r.t. the randomness inherent in z. Note that this is a multivariate
generalization of the mean squared error we have seen previously.

Define µ̂ := E[ŵ]. Show that the MSE decomposes as

MSE(ŵ) = ∥µ̂ − w∗∥²₂ + Tr(Cov(ŵ)),

where the first term is the squared bias and the second term is the variance.

Note that this is a multivariate generalization of the bias-variance decomposition we have seen
previously.

Hint: Given vectors x and y, the following equality holds: ⟨x, y⟩ = Tr(xy⊤). Also, expectation
and trace commute, so E[Tr(A)] = Tr(E[A]) for any square matrix A.
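The decomposition in part (b) can also be checked empirically. The sketch below estimates both sides for the ridge estimator by simulating many independent draws of the noise z; the design, σ, and λ values are arbitrary illustrative choices.

# Monte Carlo check of MSE(w_hat) = ||E[w_hat] - w*||^2 + Tr(Cov(w_hat)),
# using the ridge estimator as the example w_hat. Illustrative values only.
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, lam, trials = 50, 4, 1.0, 2.0, 20_000
X = rng.normal(size=(n, d))                               # fixed design
w_true = rng.normal(size=d)
A = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)       # maps y to w_ridge

# Simulate many independent realizations of the noise, hence of w_ridge.
Y = X @ w_true + sigma * rng.normal(size=(trials, n))
W = Y @ A.T                                               # one estimate per row

mse = np.mean(np.sum((W - w_true) ** 2, axis=1))
bias_sq = np.sum((W.mean(axis=0) - w_true) ** 2)
var = np.trace(np.cov(W, rowvar=False))
print(mse, bias_sq + var)                                 # the two should nearly agree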

(c) (2 points) Show that

E[wols] = w∗ ,   E[wridge] = (X⊤X + λId)⁻¹ X⊤X w∗ .

That is, Bias(wols) = 0, and hence wols is an unbiased estimator of w∗, whereas wridge is a
biased estimator of w∗.


(d) (4 points) Let γ1, . . . , γd denote the d eigenvalues of the matrix X⊤X. Show that

Tr(Cov(wols)) = σ² Σ_{i=1}^d 1/γi ,   Tr(Cov(wridge)) = σ² Σ_{i=1}^d γi / (γi + λ)²

The quantity Σ_{i=1}^d γi / (γi + λ) is referred to as the degree of freedom in ridge regression.

Finally, use these formulas to conclude that Var(wridge) < Var(wols) for any λ > 0. Note that this
is the opposite of the relationship between the bias terms. In the next parts, we will explore this
trade-off further.

Hint: Recall that for a random vector v, its covariance matrix is defined as Cov(v) = E[(v − E[v])(v − E[v])⊤].
Also, recall that if A and B are constant matrices and M is a random matrix, then E[AMB] = A E[M] B.

Hint: Remember the relationship between the trace and the eigenvalues of a matrix. Also, for the
ridge variance, consider writing X⊤X in terms of its eigendecomposition UΣU⊤. Note that X⊤X + λId
then has the eigendecomposition U(Σ + λId)U⊤.

(e) (6 points) Let’s use the bias-variance trade-off to study the choice between OLS and ridge
regression.

(i) Using the bias-variance decomposition, show that

MSE(wols) = σ² Σ_{i=1}^d 1/γi .

(ii) Using the bias-variance decomposition, show that MSE(wridge) is bounded above as

MSE(wridge) ≤ Σ_{i=1}^d (σ² γi + λ² ∥w∗∥²₂) / (γi + λ)² .

Hint: Use the following Cauchy–Schwarz matrix-vector product inequality in your upper bound:
∥Ax∥²₂ ≤ ∥A∥²_F ∥x∥²₂, where ∥A∥²_F = Σ_i λi(A⊤A) and λi(A⊤A) is the ith eigenvalue of A⊤A.
(In fact, a tighter bound holds where ∥A∥²_F is replaced with the operator norm ∥A∥²₂ = λmax(A⊤A),
but the looser bound will be easier to minimize.)

(iii) Show that the upper bound is tight for λ = 0, meaning that when λ = 0, the upper bound
equals MSE(wridge).

(iv) Optimize the upper bound, showing that the lowest value is achieved at

λ = σ² / ∥w∗∥²₂ ,

which is related to the inverse of the widely used signal-to-noise ratio (SNR) statistic in signal
processing and machine learning. You may use without proof the fact that the upper bound is
convex as a function of λ, implying a stationary point is a global minimum.

(f) (2 points) In practice, both ∥w∗∥₂ and σ are unknown. Show that for any fixed ∥w∗∥₂ and σ,
there exists a λ > 0 such that MSE(wridge) < MSE(wols). How would you find such a λ in practice?

(g) (1 point) Suppose a sensor suddenly degrades, multiplying the noise z in the recorded
observations by a factor of 3. Guided by the form of the minimizer of the MSE(wridge) upper
bound, how would you adjust the regularization λ from its old value of 21 to the new setting?
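To build intuition for parts (e)–(g), the sketch below sweeps λ on a synthetic design, evaluating the exact ridge MSE from the bias and variance expressions in parts (c)–(d) together with the Frobenius-norm upper bound and its candidate minimizer σ²/∥w∗∥²₂ from part (e)(iv); treat the bound and the minimizer as claims to verify, not as given facts.

# Illustrative sweep over lambda for a synthetic design: exact ridge MSE from the
# bias/variance formulas in parts (c)-(d), compared against the upper bound and
# the candidate minimizer sigma^2 / ||w*||_2^2 discussed in part (e)(iv).
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
gamma = np.linalg.eigvalsh(X.T @ X)              # eigenvalues of X^T X

def ridge_mse(lam):
    """Exact MSE of the ridge estimator: squared bias + trace of covariance."""
    A = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ X)
    bias_sq = np.sum((A @ w_true - w_true) ** 2)
    var = sigma**2 * np.sum(gamma / (gamma + lam) ** 2)
    return bias_sq + var

def upper_bound(lam):
    """Frobenius-norm (Cauchy-Schwarz) upper bound on the ridge MSE."""
    w_norm_sq = np.sum(w_true**2)
    return np.sum((sigma**2 * gamma + lam**2 * w_norm_sq) / (gamma + lam) ** 2)

lam_star = sigma**2 / np.sum(w_true**2)          # candidate minimizer of the bound
for lam in [0.0, 0.1, lam_star, 1.0, 10.0]:
    print(f"lam={lam:.3f}  mse={ridge_mse(lam):.4f}  bound={upper_bound(lam):.4f}")
# At lam = 0 the bound matches the exact MSE (and both equal the OLS MSE).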