
COMS 4721/4771 HW3 (Spring 2024) Due: Thu April 04, 2024 at 11:59pm
This homework is to be done alone. No late homeworks are allowed. To receive credit, a typeset PDF copy of the homework must be uploaded to Gradescope by the due date. You must show your work to receive full credit. Discussing possible approaches to homework questions is encouraged on the course discussion board and with your peers, but you must write your own individual solutions and not share your written work/code. You must cite all resources (including online material, books, articles, generative AI bots, help taken from/given to specific individuals, etc.) you used to complete your work.
It is your responsibility to protect your work, and it is a violation of academic integrity policy to post any part of these questions or your answers to public websites such as: github, bitbucket, chegg, coursehero, etc. Violators will be reported to the dean for disciplinary action.
1 Strange consequences of high dimensionality
As discussed in class, we often represent our data in high dimensions. Thus, to understand our data better and design effective prediction algorithms, it is good to understand how things behave in high dimensions. Obviously, since we cannot visualize or imagine high-dimensional spaces, we often rely on how data behave in one, two, or three dimensions and extrapolate how they may behave in hundreds of dimensions. It turns out that our low-dimensional intuition can be very misleading about data and distributions in high-dimensional spaces. In this problem we will explore this in more detail.
Consider the Gaussian distribution with mean μ and identity covariance I_d in R^d. Recall that the density assigned to a point x ∈ R^d is
p(x) = (2π)^{−d/2} exp(−∥x − μ∥²/2).
(i) Show that when x = μ, x gets assigned the highest density.
(This, of course, makes sense: the Gaussian density peaks at its mean and thus x = μ has the highest density.)
(ii) If the mean has the highest density, it stands to reason that if we draw a large i.i.d. sample from the distribution, then a large fraction of the points should lie close to the mean. Let’s try to verify this experimentally. For simplicity, let the mean μ = 0 (the covariance is still I_d). Draw 10,000 points i.i.d. from a Gaussian N(0, I_d).
To see how far away a sampled datapoint is from the mean, we can look at the squared distance ∥x − μ∥² = ∥x∥² (that is, the squared length of the sampled datapoint when the mean is zero). Plot the histogram of the squared lengths of the samples for dimensions d = 1, 2, 3, 5, 10, 50, and 100. Plot all of these histograms on the same figure for easier comparison.
What interesting observations do you see from this plot? Do you notice anything strange about the samples drawn from the high-dimensional Gaussian distribution? Do most of the samples lie close to the mean?
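For reference, a minimal sketch of this experiment in Python (assuming numpy and matplotlib are available; the bin count and plot styling are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

n_samples = 10_000
for d in [1, 2, 3, 5, 10, 50, 100]:
    x = np.random.randn(n_samples, d)      # each row is one sample from N(0, I_d)
    sq_len = np.sum(x ** 2, axis=1)        # squared length ||x||^2 of each sample
    plt.hist(sq_len, bins=100, density=True, histtype='step', label=f'd = {d}')
plt.xlabel('squared length')
plt.ylabel('density')
plt.legend()
plt.show()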
(iii) Let’s mathematically derive where we expect these samples to lie. That is, calculate
E_{x∼N(0, I_d)}[∥x∥²].
Is the empirical plot in part (ii) in agreement with the mathematical expression you derived here?
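One possible starting point for the calculation (a sketch, not the full derivation): by linearity of expectation,
E_{x∼N(0, I_d)}[∥x∥²] = E[Σ_{i=1}^{d} x_i²] = Σ_{i=1}^{d} E[x_i²],
where each coordinate x_i ∼ N(0, 1) independently, so the problem reduces to the second moment of a one-dimensional standard Gaussian.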
(iv) This “strangeness” is not specific to the Gaussian distribution; you can observe something similar even for the simplest of distributions in high dimensions. Consider the uniform distribution over the cube [−1, 1]^d. Just like in part (ii), draw 10,000 i.i.d. samples from this d-dimensional cube with uniform density, and plot the histogram
of how far away from the origin the sample points lie (do this for d = 1, 2, 3, 5, 10, 50, and 100, again on the same plot).
Recall that the cube has side length 2, while most of the high-dimensional samples have length far greater than 2! This means that even though you are drawing uniformly from the cube, most of your samples lie in the corners (and not the interior) of the cube!
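The same scaffold as in part (ii) works here, with the Gaussian sampler swapped for the uniform distribution on the cube (a minimal sketch, again assuming numpy and matplotlib; squared distance is plotted to match the expectation computed below):

import numpy as np
import matplotlib.pyplot as plt

n_samples = 10_000
for d in [1, 2, 3, 5, 10, 50, 100]:
    x = np.random.uniform(-1.0, 1.0, size=(n_samples, d))   # uniform on [-1, 1]^d
    sq_dist = np.sum(x ** 2, axis=1)                         # squared distance from the origin
    plt.hist(sq_dist, bins=100, density=True, histtype='step', label=f'd = {d}')
plt.xlabel('squared distance from the origin')
plt.ylabel('density')
plt.legend()
plt.show()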
Again, calculate the expected (squared) length of the samples. That is, calculate
E_{x∼unif([−1,1]^d)}[∥x∥²].
Is the plot in part (iv) in agreement with the expression you derive here?
2 An alternate learning paradigm
In class you have seen that when building classifiers, one wants to minimize the expected classification error over a distribution D. That is, we want to find the classifier f that minimizes:
E_{(x,y)∼D} 1[f(x) ≠ y].   (1)
Since this quantity is not estimable in practice (since we don’t know D and only have access to finite samples drawn
from it), it is usually approximated via its empirical equivalent:
(1/|S|) Σ_{(x,y)∈S} 1[f(x) ≠ y].   (2)
This latter quantity is the training error if S is the training set, and is the testing error if S is the testing set.
However, for certain applications, obtaining a sample S with both positively and negatively labelled examples is not possible. Consider, for example, the problem of modelling user preferences based on the news feed that gets shown. Very simply, if a user interacts with a particular news item (e.g., they clicked on it and read it), this shows that they are interested in the contents of the article, thus providing a positive label. But if a user does not interact with a particular news item, it is not clear whether the user dislikes the contents of the article or simply didn’t get around to viewing it. In such a scenario, obtaining a good quality negatively labelled data sample is not possible. We thus need a slightly different learning paradigm where the training samples obtained are only either labelled as positive examples or are simply unlabeled examples. We can model this as follows:
• D is an unknown distribution over R^D × {0, 1} × {0, 1}. (x, y, s) ∼ D is a sample, where x is the input feature vector, y is the true label, and s (the “selection” variable) indicates whether x was interacted with (i.e., selected) or not. Note that only x and s are observed.
• Pr[s = 1 | x, y = 0] = 0, that is, a negatively labelled x is never selected.
• Given y, the variables s and x are conditionally independent. That is, which x gets selected (given that, say, x is positively labelled) is chosen independently.
The goal of this problem is to find an empirical estimator of (1) similar to (2) but using the unlabeled and positive data only.
(i) Prove that Pr[y = 1 | x] = Pr[s = 1 | x] / Pr[s = 1 | y = 1].
(ii) Using (i), prove that Pr[y = 1 | x, s = 0] = ((1 − Pr[s = 1 | y = 1]) / Pr[s = 1 | y = 1]) · (Pr[s = 1 | x] / (1 − Pr[s = 1 | x])).
For the rest of the problem, assume that both quantities on the RHS can be estimated from (x, s) data only. This is trivially true for Pr[s = 1 | x] (since it does not depend on y). And while estimating Pr[s = 1 | y = 1] with only (x, s) data is nontrivial, it can be done under suitable conditions.

(iii) Letting p denote the PDF of D, show that:
E_{(x,y)∼D} 1[f(x) ≠ y] = ∫ ( p(x, s = 1) 1[f(x) ≠ 1]
    + p(x, s = 0) ( Pr[y = 1 | s = 0, x] 1[f(x) ≠ 1] + Pr[y = 0 | s = 0, x] 1[f(x) ≠ 0] ) ) dx.
(iv) Using parts (ii) and (iii) suggest an empirical estimator of (1) similar to (2) but that uses only (x, s) data.
Hint: Try viewing unlabeled points as part positive and part negative. That is, replace each unlabeled point by two “partial” points: one that is positive with weight w(x) and one that is negative with weight 1 − w(x).
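To make the hint concrete, here is a minimal sketch (not the graded solution) of the weighting idea in code. It assumes you already have a classifier f mapping a feature vector to {0, 1}, an estimate s_prob(X) of Pr[s = 1 | x], and an estimate c of Pr[s = 1 | y = 1]; these names, and the use of the part (ii) formula for w(x), are illustrative assumptions.

import numpy as np

def pu_empirical_error(f, X, s, s_prob, c):
    # Empirical estimate of (1) using only (x, s) data.
    preds = np.array([f(x) for x in X])                  # predictions in {0, 1}
    ps = np.clip(s_prob(X), 1e-12, 1 - 1e-12)            # estimated Pr[s = 1 | x]
    w = (1 - c) / c * ps / (1 - ps)                      # part (ii): estimate of Pr[y = 1 | x, s = 0]
    pos = (s == 1)
    unl = (s == 0)
    pos_term = np.sum(preds[pos] != 1)                   # selected points are all positives
    unl_term = np.sum(w[unl] * (preds[unl] != 1)         # unlabeled: positive part ...
                      + (1 - w[unl]) * (preds[unl] != 0))  # ... plus negative part
    return (pos_term + unl_term) / len(X)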
3 Bayesian interpretation of ridge regression
Consider the following data generating process for a linear regression problem in R^d. Nature first selects d weight coefficients w_1, …, w_d as w_i ∼ N(0, τ²) i.i.d. Given n examples x_1, …, x_n ∈ R^d, nature generates the output variable y_i as
y_i = Σ_{j=1}^{d} w_j x_{i,j} + ε_i,
where ε_i ∼ N(0, σ²) i.i.d.
Show that finding the coefficients w_1, …, w_d that maximize P[w_1, …, w_d | (x_1, y_1), …, (x_n, y_n)] is equivalent to minimizing the ridge optimization criterion.
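One way to set up the problem (a sketch of the starting point, not the full derivation): by Bayes’ rule and the independence assumptions above, the posterior factorizes as
P[w_1, …, w_d | (x_1, y_1), …, (x_n, y_n)] ∝ ( Π_{i=1}^{n} p(y_i | x_i, w) ) · ( Π_{j=1}^{d} p(w_j) ),
where p(y_i | x_i, w) is the N(Σ_j w_j x_{i,j}, σ²) density and p(w_j) is the N(0, τ²) density. Taking logarithms (which does not change the maximizer) turns the products into sums, which is where the squared-error term and the scaled Σ_j w_j² penalty of the ridge criterion come from.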
4 1-Norm Support Vector Machine
Recall that the standard support vector machine formulation
minimize ∥w∥_2²
subject to y_i(w · x_i + w_0) ≥ 1,   i = 1, …, m,
where m is the number of points and n is the number of dimensions, is a quadratic program because the objective function is quadratic and the constraints are affine. A linear program, on the other hand, uses only an affine objective function and affine constraints, and there are efficient polynomial-time algorithms for solving them. By replacing the 2-norm in the objective function with the 1-norm (∥x∥_1 = Σ_{i=1}^{n} |x_i|), we get
minimize ∥w∥_1   (3)
subject to y_i(w · x_i + w_0) ≥ 1,   i = 1, …, m.
Note that this is still not a linear program because the objective function contains absolute values, but we will convert this problem into an equivalent linear program.
(i) Suppose we know that the output y depends only on a few input variables (i.e. the optimal w is sparse). Would the 1-norm or 2-norm SVM make more sense? Justify your answer.
(ii) The Chebyshev (ℓ∞) distance between two points x and y is defined as max_i |x_i − y_i|. Show that the 1-norm SVM maximizes the Chebyshev distance between the two separating hyperplanes w · x + w_0 = ±1, which is the smallest value of ∥x − y∥_∞ such that x is a point on the first hyperplane and y is a point on the second hyperplane.
Hint: Show that the ℓ∞ distance from the origin to the plane w · x = 2 is minimized along the vector (sign(w_1), …, sign(w_n)). Then, the distance from the origin to w · x = 2 is the value of λ satisfying w · λ(sign(w_1), …, sign(w_n)) = 2.
(iii) Consider the unconstrained minimization problem
minimize max(ax + b, cx + d)   (4)
over x ∈ R. Argue that (4) is equivalent to the following linear program with an auxiliary variable:
minimize t
subject to t ≥ ax + b,
           t ≥ cx + d.
In other words, show that both problems will always obtain the same optimal objective value.
Note that by increasing the dimension of the problem’s decision space, we have transformed the original problem into a linear program.
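As a numerical sanity check of this equivalence, here is a minimal sketch using scipy.optimize.linprog; the coefficient values a, b, c, d below are arbitrary illustrative choices, not part of the assignment.

import numpy as np
from scipy.optimize import linprog

a, b, c, d = 2.0, 1.0, -1.0, 3.0                 # hypothetical coefficients
obj = np.array([0.0, 1.0])                       # decision vector z = (x, t); minimize t
# t >= a*x + b and t >= c*x + d, rewritten as A_ub @ z <= b_ub:
A_ub = np.array([[a, -1.0],
                 [c, -1.0]])
b_ub = np.array([-b, -d])
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None), (None, None)])
print('optimal x:', res.x[0], 'optimal value t:', res.x[1])

Comparing the reported optimal value against a direct evaluation of max(ax + b, cx + d) over a grid of x values is one way to convince yourself the two problems agree.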
(iv) Observe that the 1-norm function ∥w∥_1 in (3) can be written as Σ_{i=1}^{n} max(w_i, −w_i). Using this fact and the equivalence from part (iii), rewrite (3) as a linear program with 2n + 1 variables and m + 2n constraints.
(v) When the input data are not perfectly separable, we can apply a soft-margin approach (this is an alternative to the usual slack-variables approach discussed in class):
minimize ∥w∥_1 + Σ_{i=1}^{m} [1 − y_i(w · x_i + w_0)]_+ ,   (5)
where [·]+ is the hinge loss function given by max(0, ·). Note that we’ve replaced the constraints with a penalty in the objective function. Rewrite the soft-margin 1-norm SVM problem as a linear program. How many variables and constraints are present in this formulation?
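For intuition, here is a minimal numerical sketch of formulation (5) using cvxpy on synthetic data. cvxpy performs the reformulation internally, so this is only a check of the model, not the explicit linear program the question asks you to write down; the data-generating choices below are arbitrary.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 40, 5
X = rng.normal(size=(m, n))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=m) > 0, 1.0, -1.0)   # labels in {-1, +1}

w = cp.Variable(n)
w0 = cp.Variable()
hinge = cp.pos(1 - cp.multiply(y, X @ w + w0))     # [1 - y_i(w . x_i + w_0)]_+
prob = cp.Problem(cp.Minimize(cp.norm(w, 1) + cp.sum(hinge)))
prob.solve()
print('optimal objective:', prob.value)
print('w:', w.value, 'w0:', w0.value)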
(vi) Extra credit: Duality is a rich topic in optimization theory because it provides an alternate lens for viewing optimization problems. It is typically not possible to use duality to find closed-form solutions to optimization problems as we did in class. Nevertheless, duality gives rise to other ways of solving and analyzing problems. For example, weak duality says that any feasible dual solution gives a lower bound on p∗, the optimal value of the primal. Also, adding a constraint to the primal is equivalent to adding a variable to the dual, which is really useful in delayed constraint generation, in which complex linear programs are solved by iteratively adding constraints to a smaller problem.
Show that the dual of the linear program constructed in part (v) can be expressed as
maximize ∥π∥_1
subject to −1 ≤ Σ_{i=1}^{m} y_i x_{i,j} π_i ≤ 1,   j = 1, …, n,
           Σ_{i=1}^{m} y_i π_i = 0,
           0 ≤ π_i ≤ 1,   i = 1, …, m,
where π ∈ R^m.¹ Since strong duality always holds for linear programs, this problem is equivalent to the soft-margin formulation in (5).
Hint: Think about what constraints need to be satisfied for min_x L(x, λ) to be bounded. You do not have to take any partial derivatives.
¹This formulation is taken from https://arxiv.org/pdf/1901.01585.pdf, which describes ways of combining the primal and dual formulations to efficiently solve the 1-norm SVM. It is a good exploration of many important optimization topics beyond the scope of this class.