CS 189 Introduction to Machine Learning
Spring 2023 Jonathan Shewchuk HW4
Due: Friday, March 10 at 11:59 PM PST
• Homework 4 consists of coding assignments and math problems.
• We prefer that you typeset your answers using LaTeX or other word processing software. If you haven’t yet learned LaTeX, one of the crown jewels of computer science, now is a good time! Neatly handwritten and scanned solutions will also be accepted.
• In all of the questions, show your work, not just the final answer.
• We will not provide points back with respect to homework submission errors. This includes, but is not limited to: 1) not assigning pages to problems; 2) not including code in the write-up appendix; 3) not including code in the designated code Gradescope assignment; 4) not including Kaggle scores; 5) submitting code that only partially works; 6) submitting late regrade requests. Please carefully read and follow the HW submission guidelines/reminders for Pages 1, 2, and 10 of HW 4.
• Start early; you can submit models to Kaggle only twice a day!

Deliverables:
1. Submit your predictions for the test sets to Kaggle as early as possible. Include your Kaggle scores in your write-up. The Kaggle competition for this assignment can be found at
• WINE: https://www.kaggle.com/competitions/spring23-cs189-hw4-wine/
2. Write-up: Submit your solution in PDF format to “Homework 4 Write-Up” in Gradescope.
• On the first page of your write-up, please list students with whom you collaborated.
• Start each question on a new page. If there are graphs, include those graphs on the same pages as the question write-up. DO NOT put them in an appendix. We need each solution to be self-contained on pages of its own.
• Only PDF uploads to Gradescope will be accepted. You are encouraged to use LaTeX or Word to typeset your solution. You may also scan a neatly handwritten solution to produce the PDF.
• Replicate all your code in an appendix. Begin the code for each coding question on a fresh page. Do not put code from multiple questions on the same page. When you upload this PDF to Gradescope, make sure that you assign the relevant pages of your code from the appendix to the correct questions.
• While collaboration is encouraged, everything in your solution must be your (and only your) creation. Copying the answers or code of another student is strictly forbidden.
Furthermore, all external material (i.e., anything outside lectures and assigned readings, including figures and pictures) should be cited properly. We wish to remind you that consequences of academic misconduct are particularly severe!
3. Code: Submit your code as a .zip file to “Homework 4 Code”.
• Set a seed for all pseudo-random numbers generated in your code. This ensures your results can be replicated when readers run your code. For example, you can seed numpy with np.random.seed(189); a minimal example appears after this list.
• Include a README with your name, student ID, the values of random seed (above) you used, and instructions for running (and compiling, if appropriate) your code.
• Do NOT provide any data files. Supply instructions on how to add data to your code.
• Code requiring exorbitant memory or execution time might not be considered.
• Code submitted here must match that in the PDF Write-up. The Kaggle score will not be
accepted if the code provided a) does not compile or b) compiles but does not produce the file submitted to Kaggle.
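As a concrete illustration of the seeding requirement in the bullet above, here is a minimal sketch; the seed value 189 and the use of both Python's random module and numpy are just one reasonable choice, so adjust it to whatever your code actually relies on.

```python
import random

import numpy as np

SEED = 189  # report this value in your README


def set_seed(seed: int = SEED) -> None:
    """Seed every pseudo-random number generator this code relies on."""
    random.seed(seed)     # Python's built-in RNG (e.g., list shuffling)
    np.random.seed(seed)  # numpy's global RNG (e.g., train/validation splits)


set_seed()
```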
Notation: In this assignment we use the following conventions.
• Symbol “defined equal to” (≜) defines the quantity to its left to be the expression to its right.
• Scalars are lowercase non-bold: x, u1, αi. Matrices are uppercase alphabets: A, B1, Ci. Vec-
tors (column vectors) are in bold: x, α1, X, Yj. √
• ∥v∥ denotes the Euclidean norm (length) of vector v: ∥v∥ ≜ v · v. ∥A∥ denotes the (operator)
norm of matrix A, the magnitude of its largest singular value: ∥A∥ = max∥v∥=1 ∥Av∥.
• [n] ≜ {1, 2, 3, . . . , n}. 1 and 0 denote the vectors with all-ones and all-zeros, respectively.
1 Honor Code
Declare and sign the following statement (Mac Preview, PDF Expert, and FoxIt PDF Reader, among others, have tools to let you sign a PDF file):
“I certify that all solutions are entirely my own and that I have not looked at anyone else’s solution. I have given credit to all external sources I consulted.”
Signature:
2 Logistic Regression with Newton’s Method
Given examples x1, x2, . . . , xn ∈ Rd and associated labels y1, y2, . . . , yn ∈ {0, 1}, the cost function for unregularized logistic regression is

J(w) = − Σ_{i=1}^{n} ( yi ln si + (1 − yi) ln(1 − si) ),

where si ≜ s(xi · w), w ∈ Rd is a weight vector, and s(γ) ≜ 1/(1 + e^{−γ}) is the logistic function.

Define the n × d design matrix X (whose ith row is x⊤i ), the label n-vector y ≜ [y1 . . . yn]⊤, and s ≜ [s1 . . . sn]⊤. For an n-vector a, let ln a ≜ [ln a1 . . . ln an]⊤. The cost function can be rewritten in vector form as

J(w) = −y · ln s − (1 − y) · ln(1 − s).

Further, recall that for a real symmetric matrix A ∈ Rd×d, there exist U and Λ such that A = UΛU⊤ is the eigendecomposition of A. Here Λ is a diagonal matrix with entries {λ1, . . . , λd}. An alternative notation is Λ = diag(λi), where diag() takes as input the list of diagonal entries and constructs the corresponding diagonal matrix. This notation is widely used in libraries like numpy, and is useful for simplifying some of the expressions when written in matrix-vector form. For example, we can write s = diag(si) 1.
Hint: Recall these matrix calculus identities (symbols in bold indicate vectors):

∇x(αy) = (∇x α) y⊤ + α ∇x y;    ∇x(y · z) = (∇x y) z + (∇x z) y;
∇x f(y) = ∇x y ∇y f(y);    ∇x g(y) = ∇x y ∇y g(y);

and ∇x(C y(x)) = (∇x y(x)) C⊤, where C is a constant matrix.
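One additional standard fact, not part of the original hint but often handy next to these identities and straightforward to verify from the definition of s, is the derivative of the logistic function:

```latex
s'(\gamma) \;=\; \frac{d}{d\gamma}\,\frac{1}{1+e^{-\gamma}}
          \;=\; \frac{e^{-\gamma}}{\left(1+e^{-\gamma}\right)^{2}}
          \;=\; s(\gamma)\bigl(1-s(\gamma)\bigr).
```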
1 Derive the gradient ∇w J(w) of the cost J(w) as a matrix-vector expression. Also derive all intermediate derivatives in matrix-vector form. Do NOT specify them (including the intermediates) in terms of their individual components (e.g. wi for vector w).
2 Derive the Hessian ∇2w J(w) for the cost function J(w) as a matrix-vector expression.
3 Write the matrix-vector update law for one iteration of Newton’s method, substituting the
gradient and Hessian of J(w).
4 You are given four examples x1 = [0.2 3.1]⊤, x2 = [1.0 3.0]⊤, x3 = [−0.2 1.2]⊤, x4 = [1.0 1.1]⊤ with labels y1 = 1, y2 = 1, y3 = 0, y4 = 0. These points cannot be separated by a line passing through the origin. Hence, as described in lecture, append a 1 to each xi, i ∈ [4], and use a weight vector w ∈ R3 whose last component is the bias term (called α in lecture). Begin with the initial weight w = [−1 1 0]⊤. For the following, state only the final answer with four digits after the decimal point. You may use a calculator or write a program to solve for these (a rough sketch of such a program appears after part (d)), but do NOT submit any code for this part.
(a) State the value of s(0) (the initial value of s).
(b) State the value of w(1) (the value of w after 1 iteration).
(c) State the value of s(1) (the value of s after 1 iteration).
(d) State the value of w(2) (the value of w after 2 iterations).
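For checking the hand computations in part 4, a minimal numerical sketch is given below. It assumes the standard textbook gradient and Hessian of the unregularized logistic-regression cost, namely X⊤(s − y) and X⊤ diag(si(1 − si)) X; verify these forms against your own derivations from parts 1–3 before trusting the output, and remember that no code is to be submitted for this part.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

# Four examples with the appended fictitious dimension (bias feature = 1).
X = np.array([[0.2, 3.1, 1.0],
              [1.0, 3.0, 1.0],
              [-0.2, 1.2, 1.0],
              [1.0, 1.1, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([-1.0, 1.0, 0.0])  # initial weight from the problem statement

for it in range(2):
    s = expit(X @ w)                       # current value of s
    print(f"s^({it}) = {np.round(s, 4)}")
    grad = X.T @ (s - y)                   # assumed gradient form
    hess = X.T @ np.diag(s * (1 - s)) @ X  # assumed Hessian form
    w = w - np.linalg.solve(hess, grad)    # one Newton step
    print(f"w^({it + 1}) = {np.round(w, 4)}")
```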
3 Wine Classification with Logistic Regression
The wine dataset data.mat consists of 6,497 sample points, each having 12 features. The description of these features is provided in data.mat. The dataset includes a training set of 6,000 sample points and a test set of 497 sample points. Your classifier needs to predict whether a wine is white (class label 0) or red (class label 1).
Begin by normalizing the data with each feature’s mean and standard deviation. You should use training data statistics to normalize both training and validation/test data. Then add a fictitious dimension. Whenever required, it is recommended that you tune hyperparameter values with cross-validation.
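A minimal sketch of this preprocessing step is shown below, assuming the features arrive as numpy arrays; the small constant added to the standard deviation is only a guard against constant features.

```python
import numpy as np


def add_fictitious_dimension(X):
    """Append a column of ones (the fictitious/bias dimension)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def preprocess(X_train, X_test):
    """Normalize with training-set statistics only, then append the bias column."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-12  # avoid dividing by zero for constant features
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std     # reuse the *training* statistics here
    return add_fictitious_dimension(X_train), add_fictitious_dimension(X_test)
```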
Please set a random seed whenever needed and report it.
Use of automatic logistic regression libraries/packages is prohibited for this question. If you are coding in Python, it is better to use scipy.special.expit for evaluating logistic functions, as its implementation is numerically stable and doesn’t produce NaN values or overflow errors.
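To see why this recommendation matters, compare a naive logistic implementation with scipy.special.expit on an extreme input; this comparison is only an illustration and not required code.

```python
import numpy as np
from scipy.special import expit


def naive_sigmoid(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))  # np.exp overflows for large |gamma|


print(naive_sigmoid(np.array([-800.0])))  # emits an overflow RuntimeWarning
print(expit(np.array([-800.0])))          # stable: returns 0.0 with no warning
```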
1 Batch Gradient Descent Update. State the batch gradient descent update law for logistic regression with l2 regularization. As this is a “batch” algorithm, each iteration should use every training example. You don’t have to show your derivation. You may reuse results from your solution to question 2.1.
2 Batch Gradient Descent Code. Implement your batch gradient descent algorithm for logistic regression and include your code here (a skeleton of such a training loop is sketched below). Choose reasonable values for the regularization parameter and step size (learning rate), specify your chosen values in the write-up, and train your model from question 3.1. Shuffle and split your data into training/validation sets and mention the random seed used in the write-up. Plot the value of the cost function versus the number of iterations spent in training.
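The skeleton referred to above is a minimal sketch of one way such a training loop could be organized. The regularization strength, step size, and iteration count are placeholder values, and the gradient line assumes the l2-regularized form X⊤(s − y) + 2λw, which should match whatever you derive in question 3.1; adapt it if your convention differs (for example, λ versus 2λ).

```python
import numpy as np
from scipy.special import expit


def batch_gradient_descent(X, y, lam=0.1, lr=1e-4, n_iters=5000):
    """Batch GD for l2-regularized logistic regression; returns weights and cost history."""
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    eps = 1e-12  # guards against log(0) in the cost
    for _ in range(n_iters):
        s = expit(X @ w)
        cost = -(y @ np.log(s + eps) + (1 - y) @ np.log(1 - s + eps)) + lam * (w @ w)
        costs.append(cost)
        grad = X.T @ (s - y) + 2 * lam * w  # assumed gradient of the regularized cost
        w -= lr * grad                      # every iteration uses all training examples
    return w, costs
```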
3 Stochastic Gradient Descent (SGD) Update. State the SGD update law for logistic regression with l2 regularization. Since this is not a “batch” algorithm anymore, each iteration uses just one training example. You don’t have to show your derivation.
4 Stochastic Gradient Descent Code. Implement your stochastic gradient descent algorithm for logistic regression and include your code here (an SGD skeleton analogous to the batch one above is sketched after this question). Choose a suitable value for the step size (learning rate), specify your chosen value in the write-up, and run your SGD algorithm from question 3.3. Shuffle and split your data into training/validation sets and mention the random seed used in the write-up. Plot the value of the cost function versus the number of iterations spent in training.
Compare your plot here with that of question 3.2. Which method converges more quickly? Briefly describe what you observe.
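As referenced above, an analogous SGD skeleton might look like the following minimal sketch; the learning rate, epoch count, and seed are placeholders, and the single-example gradient line is the assumed shape of your answer to question 3.3.

```python
import numpy as np
from scipy.special import expit


def stochastic_gradient_descent(X, y, lam=0.1, lr=1e-3, n_epochs=5, seed=189):
    """SGD for l2-regularized logistic regression; one example per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    eps = 1e-12
    for _ in range(n_epochs):
        for i in rng.permutation(n):  # shuffle the visiting order each epoch
            s_i = expit(X[i] @ w)
            grad = (s_i - y[i]) * X[i] + 2 * lam * w  # gradient from a single example
            w -= lr * grad
            # Track the full-data cost so the plot is comparable to batch GD.
            s = expit(X @ w)
            costs.append(-(y @ np.log(s + eps) + (1 - y) @ np.log(1 - s + eps)) + lam * (w @ w))
    return w, costs
```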
5 Instead of using a constant step size (learning rate) in SGD, you could use a step size that slowly shrinks from iteration to iteration. Run your SGD algorithm from question 3.3 with a step size εt = δ/t where t is the iteration number and δ is a hyperparameter you select
empirically. Mention the value of δ chosen. Plot the value of the cost function versus the number of iterations spent in training. (A brief sketch of such a schedule appears after the next question.)
How does this compare to the convergence of your previous SGD code?
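For part 5, the decaying schedule can be obtained by swapping the constant learning rate in the SGD sketch above for εt = δ/t; a brief self-contained version of that change is shown here with a placeholder δ.

```python
import numpy as np
from scipy.special import expit


def sgd_decaying_lr(X, y, lam=0.1, delta=1.0, n_epochs=5, seed=189):
    """SGD where the step size at the t-th update is delta / t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 1  # global update counter, shared across epochs
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            s_i = expit(X[i] @ w)
            grad = (s_i - y[i]) * X[i] + 2 * lam * w
            w -= (delta / t) * grad  # shrinking step size eps_t = delta / t
            t += 1
    return w
```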
6 Kaggle. Train your best classifier on the entire training set and submit your prediction on the test sample points to Kaggle. As always for Kaggle competitions, you are welcome to add or remove features, tweak the algorithm, and do pretty much anything you want to improve your Kaggle leaderboard performance except that you may not replace or augment logistic regression with a wholly different learning algorithm. Your code should output the predicted labels in a CSV file.
Report your Kaggle username and your best score, and briefly describe what your best classifier does to achieve that score. (One possible way to write the submission CSV is sketched below.)
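As mentioned above, one possible way to produce the submission CSV is sketched below. The column names "Id" and "Category" are assumptions, not taken from the assignment; check the Kaggle competition page for the exact header it expects.

```python
import numpy as np
import pandas as pd


def write_submission(predictions, path="submission.csv"):
    """Save integer class labels (0 = white, 1 = red) with 1-indexed ids."""
    df = pd.DataFrame({
        "Id": np.arange(1, len(predictions) + 1),        # assumed id column
        "Category": np.asarray(predictions, dtype=int),  # assumed label column
    })
    df.to_csv(path, index=False)
```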
4 A Bayesian Interpretation of Lasso
Suppose you are aware that the labels yi∈[n] corresponding to sample points xi∈[n] ∈ Rd follow the density law
f(yi | xi, w) = (1/(√(2π) σ)) e^{−(yi − w·xi)²/(2σ²)},
where σ > 0 is a known constant and w ∈ Rd is a random parameter. Suppose further that experts
have told you that
• each component of w is independent of the others, and
• each component of w has the Laplace distribution with location 0 and scale being a known
constant b. That is, each component wi obeys the density law f(wi) = e^{−|wi|/b}/(2b).
Assume the outputs yi∈[n] are independent from each other.
Your goal is to find the choice of parameter w that is most likely given the input-output examples (xi, yi)i∈[n]. This method of estimating parameters is called maximum a posteriori (MAP); Latin for “maximum [odds] from what follows.”
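As a reminder before part 1 (this only restates Bayes’ Theorem in the proportional form the question asks you to apply; it is not an additional requirement):

```latex
f\!\left(w \,\middle|\, (x_i, y_i)_{i \in [n]}\right)
  \;=\; \frac{f\!\left((y_i)_{i \in [n]} \,\middle|\, (x_i)_{i \in [n]}, w\right) f(w)}
             {f\!\left((y_i)_{i \in [n]} \,\middle|\, (x_i)_{i \in [n]}\right)}
  \;\propto\; f(w)\,\prod_{i=1}^{n} f(y_i \mid x_i, w),
```

where the product uses the stated independence of the outputs and the denominator is the constant (in w) that the problem tells you not to compute.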
1. Derive the posterior probability density law f(w|(xi, yi)i∈[n]) for w up to a proportionality constant by applying Bayes’ Theorem and substituting for the densities f(yi|xi, w) and f(w). Don’t try to derive an exact expression for f(w|(xi, yi)i∈[n]), as the denominator is very involved and irrelevant to maximum likelihood estimation.
2. Define the log-likelihood for MAP as l(w) ≜ ln f(w|xi∈[n], yi∈[n]). Show that maximizing the MAP log-likelihood over all choices of w is the same as minimizing

Σ_{i=1}^{n} (yi − w · xi)² + λ∥w∥1,

where ∥w∥1 = Σ_{j=1}^{d} |wj| and λ is a constant. Also give a formula for λ as a function of the distribution parameters.
5 l1-regularization, l2-regularization, and Sparsity
You are given a design matrix X (whose ith row is sample point x⊤i ) and an n-vector of labels y ≜ [y1 … yn]⊤. For simplicity, assume X is whitened, so X⊤X = nI. Do not add a fictitious dimension/bias term; for input 0, the output is always 0. Let x∗i denote the ith column of X.
1. The lp-norm for w ∈ Rd is defined as ∥w∥p = (Σ_{i=1}^{d} |wi|^p)^{1/p}, where p > 0. Plot the isocontours, with w ∈ R2, for the following norms.
(a) l0.5
(b) l1
(c) l2
Use of automatic libraries/packages for computing norms is prohibited for the question.
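A possible starting point for these plots, evaluating the lp expression directly from its definition rather than calling a norm routine, is sketched below; the grid range and contour levels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt


def lp_value(W1, W2, p):
    """Evaluate (|w1|^p + |w2|^p)^(1/p) elementwise, straight from the definition."""
    return (np.abs(W1) ** p + np.abs(W2) ** p) ** (1.0 / p)


w1, w2 = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, p in zip(axes, [0.5, 1.0, 2.0]):
    ax.contour(w1, w2, lp_value(w1, w2, p), levels=[0.5, 1.0, 1.5])
    ax.set_title(f"p = {p}")
    ax.set_aspect("equal")
plt.show()
```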
2. Show that the cost function for l1-regularized least squares, J1(w) ≜ ∥Xw − y∥² + λ∥w∥1 (where λ > 0), can be rewritten as J1(w) = ∥y∥² + Σ_{i=1}^{d} f(x∗i, wi), where f(·, ·) is a suitable function whose first argument is a vector and second argument is a scalar.
3. Using your solution to part 2, derive necessary and sufficient conditions for the ith component of the optimizer w∗ of J1(·) to satisfy each of these three properties: w∗i > 0, w∗i = 0, and w∗i < 0.
4. For the optimizer w# of the l2-regularized least squares cost function J2(w) ≜ ∥Xw − y∥² + λ∥w∥² (where λ > 0), derive a necessary and sufficient condition for w#i = 0, where w#i is the ith component of w#.
5. A vector is called sparse if most of its components are 0. From your solution to part 3 and 4, which of w∗ and w# is more likely to be sparse? Why?
Submission Checklist
Please ensure you have completed the following before your final submission.
At the beginning of your writeup…
1. Have you copied and hand-signed the honor code specified in Question 1?
2. Have you listed all students (Names and ID numbers) that you collaborated with?
In your writeup for Question 3…
1. Have you included your Kaggle Score and Kaggle Username?

At the end of the writeup…
1. Have you provided a code appendix including all code you wrote in solving the homework?
Executable Code Submission
1. Have you created an archive containing all “.py” files that you wrote or modified to generate your homework solutions?
2. Have you removed all data and extraneous files from the archive?
3. Have you included a README file in your archive containing any special instructions to reproduce your results?
Submissions
1. Have you submitted your written solutions to the Gradescope assignment titled HW4 Write-Up and selected pages appropriately?
2. Have you submitted your executable code archive to the Gradescope assignment titled HW4 Code?
3. Have you submitted your test set predictions for the Wine dataset to the appropriate Kaggle challenge?
Congratulations! You have completed Homework 4.
HW4, ©UCB CS 189, Spring 2023. All Rights Reserved. This may not be publicly shared without explicit permission.