COMS 4771 (SPRING 2021) EXAM 2 APRIL 16, 2021
Name & UNI:
Instructions
• You have a 24-hour period (from Friday April 16, 2021 00:01 AM EDT to Friday April 16, 2021 11:59 PM EDT) to take this exam. You may open the exam at any time in this 24-hour period, but as soon as you open the exam you will only have two and a half hours to complete it and submit your responses on Courseworks. In particular, you only have 2 hours to complete the exam itself. You have an extra 30 minutes to print the exam (if you want) at the beginning, and then scan and submit your responses at the end. Do NOT use the extra 30 minutes as additional time to complete the exam, as this time is critical for printing, scanning, and uploading your work.
• You may either print this exam and write your answers in the space provided or you may write your answers on separate blank pages. In the latter case, use a different page for each of the problems and clearly label which part of which problem you are answering. In either case you must scan and upload your solutions to Courseworks within 2.5 hours of the time you open this exam. Since you are given 30 minutes to resolve any “technical issues”, we will NOT accept any work once the 2.5 hour window closes. There are absolutely no exceptions to this rule.
• This is an open book exam. Hence, you may use any resource of your choice to complete the exam (books, lecture slides, lecture videos, homework assignments and solutions, etc.). However, you may not receive help from any other person (whether student or not, whether in person or virtually).
• Do not discuss this exam with anyone, whether in person or virtually, until the end of the 24-hour exam period (11:59 PM EDT on Friday). In particular, do NOT post any questions about the exam content on Piazza.
• Even after the 24-hour period and after this class ends, do not share this exam. In particular, it would be a breach of academic honesty to post this exam online or share it with a future COMS 4771 student.
• Show all your work! Right answers without supporting work will not be given full credit and wrong answers with supporting work may receive partial credit.
• You may use/cite any result from the lectures or the homeworks without proof.
• Suggestion: Do not spend too much time on any one question. In particular, I strongly suggest that you do not try to Google answers that you do not know. Every question is new (and so will not be found on the Internet) and trying to learn a concept that you do not understand would not be a wise use of your very limited time.
A signed version of the honor pledge below must appear in your submission. If you choose not to print the exam then you must copy and sign the honor pledge onto the first blank sheet of paper that you use to write your answers.
Honor pledge: I have not given or received unauthorized assistance on this examination. I will not retain or re-distribute any handwritten or electronic copies of this examination, either in part or in full.
Sign here: Date:
1. (20 points) For each statement below, state either True or False. Justify your answer by providing a short explanation/proof sketch (for true statements) or a counterexample (for false statements).
Any non-convex optimization problem is NP-hard to solve. In other words, there are no non-convex optimization problems for which we have algorithms to find the exact optimal solution in polynomial time.
SVM with slack (with a fixed hyperparameter C) always returns the linear classifier with the smallest training error.
All decision trees with a single non-leaf node (i.e. all decision trees that make a single split) are linear classifiers.
Any function class with finite VC dimension is efficiently PAC learnable.
A function class with infinite VC dimension is never efficiently PAC learnable, but it may still be (non-efficiently) PAC learnable.
The coefficients of the weight vector returned by Ridge regression will tend to be larger (either more positive or more negative) than those of the weight vector returned by OLS.
Suppose we train a multivariate Gaussian probabilistic classifier on binary labeled data and also train a Gaussian Mixture Model with k = 2 on the same training data (the labels are ignored when training the Gaussian Mixture Model). The two models will always have the same boundary.
Given a dataset $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in \mathbb{R}^D$ and $y_i \in \{0, 1\}$, suppose we use PCA (applied only on the $x_i$; the $y_i$ are ignored) to find a transformation $\Phi : \mathbb{R}^D \to \mathbb{R}^d$ with $d < D$. Let $S' = \{(\Phi(x_1), y_1), (\Phi(x_2), y_2), \ldots, (\Phi(x_n), y_n)\}$. Then the minimum training error of $D$-dimensional linear classifiers on $S$ is always less than or equal to the minimum training error of $d$-dimensional linear classifiers on $S'$.
Given a dataset $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $x_i \in \mathbb{R}^D$ and $y_i \in \{0, 1\}$, suppose we use a non-linear dimensionality reduction technique (again, the $y_i$ are ignored) to find a transformation $\Phi : \mathbb{R}^D \to \mathbb{R}^d$ with $d < D$. Let $S' = \{(\Phi(x_1), y_1), (\Phi(x_2), y_2), \ldots, (\Phi(x_n), y_n)\}$. Then the minimum training error of $D$-dimensional linear classifiers on $S$ is always less than or equal to the minimum training error of $d$-dimensional linear classifiers on $S'$.
The partition induced by Lloyd's method for k-means optimization always results in convex cells. That is, let $c_1, \ldots, c_k \in \mathbb{R}^d$ be the centers returned by the algorithm on a given dataset, and define
$$S_j := \left\{ x \in \mathbb{R}^d \;\middle|\; \operatorname*{argmin}_{i} \|x - c_i\|_2 = j \right\} \quad \text{for } j = 1, \ldots, k.$$
Then each $S_j$ is necessarily a convex set.
2. [Facilities location via clustering] You are hired as the lead data scientist in the city planning office. As your first important project, your boss tells you that they have received funding to build $k$ hospitals throughout the city. The city has identified $m > k$ different potential sites $\{s_1, \ldots, s_m\}$ to build these hospitals. The goal, obviously, is to pick $k$ sites that collectively minimize the worst-case commute distance to the closest hospital.
More formally, you are given $n$ households $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^2$, $m$ potential sites $S = \{s_1, \ldots, s_m\} \subset \mathbb{R}^2$, and a number $k$. Your goal is to select $k$ sites (let's collectively call them centers $\{c_1, \ldots, c_k\} = C \subset S$) that minimize the largest (worst-case) Euclidean distance between a household $x_i$ and its closest center $c_j$. In other words,
$$\operatorname*{argmin}_{C \subset S \text{ s.t. } |C| = k} \; \left[ \max_{x_i \in X} \; \min_{c_j \in C} \; \|x_i - c_j\|_2 \right] \tag{1}$$
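To make the objective concrete: for any fixed candidate set of centers, the bracketed quantity in Eq. (1) can be computed directly. The following is a minimal Python sketch, illustrative only (the NumPy array representation of $X$ and $C$ is an assumption, not part of the problem statement):

import numpy as np

def objective(X, C):
    # X: (n, 2) array of households; C: (k, 2) array of chosen centers.
    # Pairwise Euclidean distances between every household and every center.
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)   # shape (n, k)
    # Distance from each household to its closest center, then the worst case.
    return dists.min(axis=1).max()

Minimizing this quantity over all size-$k$ subsets $C \subset S$ is exactly the optimization in Eq. (1).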
(a) (5 points) Your boss identifies this as a clustering problem (finding $k$ hospital centers to "group" and serve $n$ households), and proposes that any reasonable k-means algorithm should be able to give a good solution to this problem. Show that even when $k = 1$ and there is no restriction on the site locations (that is, $S = \mathbb{R}^2$), the optimal 1-means solution is not necessarily the optimal solution for the objective function stated above (Eq. 1).
(b) (12 points) Having demonstrated to your boss (in part (a)) that this optimization problem is different from the classic k-means problem, you are tasked with coming up with a methodology to find the centers. Given X, S and k, give an algorithm that returns the set of k centers C that exactly minimizes the given objective (1).
(While your algorithm need not be ‘efficient’, more points will be awarded to an algorithm that runs faster while still returning a correct solution.)
(c) (3 points) What is the run time (in terms of n, m and k) of your stated algorithm?
3. [Probabilistic linear embeddings] Having recently learned about linear dimension reduction, you decide to explore linear maps for yourself.
Given a dataset $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^D$ of $n$ points, you want to study the effects of applying a $d \times D$ matrix $P$ to the dataset ($d < D$).
Since distances between pairs of datapoints are an important property, you want to study how $P$ distorts interpoint Euclidean distances.
(a) (3 points) For two arbitrary (but fixed) datapoints $x$ and $x'$ from the dataset $X$, the squared Euclidean distance between the projected datapoints, that is, $\|Px - Px'\|^2$, equals (circle the correct option):
• $\displaystyle\sum_{i=1}^{d} \sum_{j=1}^{D} \sum_{j'=1}^{D} P_{ij} P_{ij'} (x_j - x'_j)(x_{j'} - x'_{j'})$
• $\displaystyle\sum_{i=1}^{d} \sum_{j=1}^{D} \big( P_{ij}(x_j - x'_j) \big)^2$
• $\displaystyle\sum_{j=1}^{D} \Big[ \sum_{i=1}^{d} \sum_{i'=1}^{d} P_{ij}(x_j - x'_j) P_{i'j} \Big]$
• $\displaystyle\sum_{j=1}^{D} \sum_{i=1}^{d} \sum_{i'=1}^{d} \big( P_{ij} x_j - P_{i'j} x'_j \big)^2$
• $\displaystyle\sum_{i=1}^{d} \sum_{i'=1}^{d} \sum_{j=1}^{D} \sum_{j'=1}^{D} \big( P_{ij} x_j - P_{i'j'} x'_{j'} \big)$
• $\displaystyle\sum_{i=1}^{d} \sum_{j=1}^{D} \sum_{j'=1}^{D} \big( P_{ij} x_j - P_{ij'} x'_{j'} \big)$
• $\displaystyle\sum_{i=1}^{d} \sum_{j=1}^{D} \big( P_{ij} x_j - P_{ij} x'_j \big)$
(note: $P_{ij}$ denotes the entry in the $i$th row and $j$th column of the matrix $P$)
(b) You want to study what effect a random matrix $P$ has on interpoint distances. For all $1 \le i \le d$ and $1 \le j \le D$, let each entry $P_{ij}$ be drawn independently and uniformly at random¹ from the discrete set $\{-\alpha, +\alpha\}$.
i. (12 points) For two arbitrary (but fixed) datapoints $x$ and $x'$ from the dataset $X$, compute (and simplify as much as possible)
$$\mathbb{E}_P \left[ \|Px - Px'\|^2 \right]$$
ii. (5 points) For what value of $\alpha$ do we have $\mathbb{E}_P \left[ \|Px - Px'\|^2 \right] = \|x - x'\|^2$?
¹That is, $P_{ij} = -\alpha$ with probability 0.5 and $P_{ij} = +\alpha$ with probability 0.5.
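As a sanity check for this part, $\mathbb{E}_P \left[ \|Px - Px'\|^2 \right]$ can be estimated empirically by averaging over many random draws of $P$. Below is a minimal sketch (the values of $d$, $D$, $\alpha$, and the number of trials are arbitrary assumptions for illustration; it does not give away the closed form you are asked to derive):

import numpy as np

rng = np.random.default_rng(0)
d, D, alpha = 5, 20, 0.3                     # arbitrary illustrative values
x = rng.standard_normal(D)                   # two fixed datapoints
x_prime = rng.standard_normal(D)

n_trials = 100_000
total = 0.0
for _ in range(n_trials):
    # Each entry of P is drawn i.i.d. uniformly from {-alpha, +alpha}.
    P = alpha * rng.choice([-1.0, 1.0], size=(d, D))
    total += np.linalg.norm(P @ (x - x_prime)) ** 2

print("empirical E_P ||Px - Px'||^2 :", total / n_trials)
print("||x - x'||^2                 :", np.linalg.norm(x - x_prime) ** 2)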
4. [Regression and MLE (20 points)] After your great success in Exam 1, you have once again been invited to a game show. To keep things interesting, the game show host has invented a new game.
In this new game, the host first secretly picks a unit vector $w \in \mathbb{R}^d$ (you may assume it is chosen uniformly at random from the unit sphere). Then a uniformly random integer $k$ between 1 and 1000 (inclusive) is chosen and revealed. Finally, there are $k$ trials. Trial $i \in \{1, 2, \ldots, k\}$ proceeds as follows:
• You pick a unit vector $x_i \in \mathbb{R}^d$.
• The quantity $\lambda_i := (w \cdot x_i)^2$ is computed but is not revealed to you.
• Finally, $y_i$ is drawn from an exponential distribution with parameter² $\lambda_i$, and you are given $y_i$ dollars.
Note that when you pick $x_i$ you only get to see $y_i$ (you do not get to see $\lambda_i$).
As a smart ML student, you decide to model this as an ML problem. Specifically, you decide to split your $k$ trials into two groups. You use the first $c$ trials to try to learn the best $x_i$ to pick. You then use the remaining trials to try to maximize your earnings by repeatedly picking the (same) best possible input. We will focus on the first $c$ trials (the learning or training phase). For this phase you decide to pick each $x_i$ uniformly at random from the unit sphere. We denote the set of inputs and outputs thus obtained by $S := \{(x_1, y_1), (x_2, y_2), \ldots, (x_c, y_c)\}$.
(a) (4 points) Armed with your new regression knowledge, you first decide to view this as a regression problem. Show that the optimal $L_2$ regressor for this problem is $f^*(x) = (w \cdot x)^{-2}$. In other words, letting $x$ and $y$ be random variables created as above (you pick $x$ uniformly at random from the unit sphere, $y$ is then created as outlined in the bullet points above), show that $f^*(x) = (w \cdot x)^{-2}$ is the function that minimizes $\mathbb{E}_{x,y}(f(x) - y)^2$ over all $f : \mathbb{R}^d \to \mathbb{R}$.
²In other words, $p(y_i) = \lambda_i e^{-\lambda_i y_i}$. You can use any known facts about the exponential distribution without proof. Most useful is that the mean of the exponential distribution with parameter $\lambda_i$ is $1/\lambda_i$.
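For intuition, the training phase described above can be simulated directly. This is a minimal sketch of the data-generating process only (the dimension $d$ and number of training trials $c$ are arbitrary assumptions); it does not answer any of the parts below:

import numpy as np

rng = np.random.default_rng(0)
d, c = 10, 500                               # arbitrary illustrative values

def random_unit_vector(dim):
    # Normalizing a standard Gaussian gives a uniform point on the unit sphere.
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

w = random_unit_vector(d)                    # the host's secret unit vector
S = []
for _ in range(c):
    x = random_unit_vector(d)                # you pick x_i uniformly at random
    lam = (w @ x) ** 2                       # lambda_i = (w . x_i)^2, never revealed
    y = rng.exponential(scale=1.0 / lam)     # NumPy parameterizes Exp by its mean 1/lambda_i
    S.append((x, y))                         # you observe only (x_i, y_i)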
(b) (4 points) Using the result from part (a), explain why using OLS to learn a prediction function from S is not a good idea.
(c) (5 points) What simple preprocessing step could be used on $S$ to fix the issue in (b) and allow us to estimate $f^*(x)$ using the OLS algorithm? Make sure to explain why your suggested preprocessing step would help.
(d) (4 points) You also consider using MLE rather than OLS in the training phase. Let $w_{\mathrm{MLE}}$ be the MLE of $w$ given $S$. The best way to find $w_{\mathrm{MLE}}$ is by optimizing the negative log-likelihood function. Write down the negative log-likelihood optimization problem for this particular data distribution and simplify it as much as possible. Make sure to completely specify the optimization problem. In particular, specify what variable(s) you optimize over, whether it is a minimization or a maximization problem, whether the optimization is subject to any constraints, etc. You do not need to solve the optimization problem, just simplify it as much as possible.
(e) (3 points) Assume that you were able to find an estimate $\hat{w}$ of $w$ using $S$. You now want to pick a single value $x$ to use in all subsequent trials. If your only goal for those trials is to maximize profits, what unit-length $x$ should you use? Why?
5. [Learning Theory] For a set $I \subseteq \{1, 2, \ldots, n\}$ we define a classifier $h_I : \{0,1\}^n \to \{0,1\}$ as follows. For a binary vector $\vec{x} = (x_1, \ldots, x_n) \in \{0,1\}^n$,
$$h_I(\vec{x}) = \left( \sum_{i \in I} x_i \cdot \mathbb{1}[\, i \text{ is even} \,] \right) \bmod 2,$$
where
• $\mathbb{1}[\cdot]$ is the indicator function, and
• $(\cdot) \bmod 2$ is the modulo-2 operation; thus, when the expression on the left is even the overall expression returns 0, and when it is odd the overall expression returns 1.
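Equivalently, the definition of $h_I$ can be read off from this short Python sketch (illustrative only; the 1-indexed convention matches the problem statement):

def h(I, x):
    # x is a binary tuple (x_1, ..., x_n); coordinate x_i lives at x[i - 1].
    # The indicator 1[i is even] keeps only the even indices of I.
    return sum(x[i - 1] for i in I if i % 2 == 0) % 2

# For example, with n = 2: h({2}, (0, 1)) returns 1 and h({1}, (1, 0)) returns 0.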
(a) Forthispart,assumen=2,I0 ={},I1 ={1},I2 ={2}andI3 ={1,2}.
i. (3points)Twoclassifiersh↵ andh areconsideredidenticalifforall~x2{0,1}2,h↵(~x)=h (~x).
Consider four classifiers hI0 , hI1 , hI2 , and hI3 . List all the classifiers which are identical to each other. (example response: classifiers hI0 , hI1 and hI2 are identical)
ii. (2 points) Consider an arbitrary (but fixed) dataset $X \subseteq \{0, 1\}^2$. How many different ways can the classifier $h_{I_2}$ label the dataset $X$?
iii. (3 points) Define $\mathcal{H} := \{h_{I_1}, h_{I_2}, h_{I_3}\}$ as the hypothesis class of the three specified classifiers. What is the VC dimension of $\mathcal{H}$? Justify your answer for full credit.
(b) (12 points) For an arbitrary (but fixed) $n \in \mathbb{N} = \{1, 2, 3, \ldots\}$, define $\mathcal{H}_n := \{h_I : I \subseteq \{1, \ldots, n\}\}$. Provide the tightest upper and lower bounds on the VC dimension of $\mathcal{H}_n$.
[blank page 1 for scratch work]
[blank page 2 for scratch work]