COMS 4771 (Fall 2017) Final Exam December 07, 2017

Name & UNI:
Please show all your work! Answers without supporting work will not be given credit. Write your answers in the spaces provided.
There are 13 pages in total, including two blank pages at the end for scratch work.
You have 1 hour and 15 minutes to complete this exam.
You may use any result from the lectures or the homeworks without proof.

1. (15 points) State True or False. No justification needed! (+1 for correct, 0 for blank, -1 for incorrect answer)
For any set of n random variables X_1, . . . , X_n, the joint distribution P(X_1, . . . , X_n) = ∏_{i=1}^n P(X_i | X_1, . . . , X_{i−1}).
If two random variables are conditionally independent, then they are not necessarily independent; but if the two random variables are independent then they are necessarily conditionally independent as well.
In a Hidden Markov Model, the observed variable X_{t−1} (at time t − 1) is conditionally independent of the observed variable X_{t+1} (at time t + 1) given X_t.
Consider the unit Lp-ball in Rd, that is, the set {x ∈ Rd : ∥x∥p ≤ 1} =: Bp. Then B1 ⊆ B2 ⊆ B∞.
A non-linear kernel transform in sufficiently high dimensions can always achieve zero test error.
The dual variables (αi) in Support Vector Machines (SVMs) take non-zero values only if the corresponding data points xi are in fact “support” vectors that dictate the margin.
VC dimension of decision trees in R2 is infinite.
The decision boundary induced by a Logistic regression classifier on a two-class problem is always linear.

Lloyd’s algorithm for k-means clustering with k = 2 can give solutions that are arbitrarily bad in terms of the clustering cost compared to the optimal (k = 2) k-means solution.
The maximum likelihood setting of the parameters for a mixture model often yields undesirable results.
A directed graphical model on n variables that has the structure of a fully connected directed acyclic graph (DAG) admits no independencies (conditionally or unconditionally) amongst the n variables.
The notation “A ⊥ B | C” means “A and B are independent given C”. Then X ⊥ Y | W,Z and X ⊥ W | Y,Z =⇒ X ⊥ W,Y | Z.
L2 (Euclidean) distances can always be computed efficiently in Kernel space via the kernel trick.
For any hypothesis class F, it must be the case that VC(F) ≤ log2(|F|).
When compared to batch learning, active learning can significantly reduce the num- ber of labelled samples needed to learn a concept to a desired level of accuracy.

2. [Maximum Likelihood Estimation of fully observed HMMs (15 points)] Let the distribution of (X_t, Y_t)_{t∈{1,2,…}} be from a discrete-space HMM, where each Y_t (the hidden state at time t) takes values in [K] := {1, 2, . . . , K} and each X_t (the observation at time t) takes values in [D] := {1, 2, . . . , D}. Recall that the HMM parameters θ = (π, A, B) have the following semantics: P[Y_1 = i] = π_i, P[Y_{t+1} = j | Y_t = i] = A_{i,j}, and P[X_t = j | Y_t = i] = B_{i,j}.
Suppose you have as training data a labeled sequence ((x_1, y_1), (x_2, y_2), . . . , (x_T, y_T)) ∈ ([D] × [K])^T. What is the maximum likelihood estimate of the HMM parameters θ given this data? (Hint: if needed, use the convention 0/0 = 0 in deriving your estimates.)
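For concreteness, here is a minimal Python sketch of how count-based estimates of this kind can be computed from a single labeled sequence (the function name hmm_mle and the 1-indexed encoding of states and observations are assumptions made for illustration; this is a sketch, not part of the exam):

import numpy as np

def hmm_mle(xs, ys, K, D):
    """Count-based ML estimates (pi, A, B) from one fully observed sequence.
    xs[t] in {1,...,D} is the observation at time t, ys[t] in {1,...,K} the state.
    Uses the 0/0 := 0 convention: rows with no counts are left as all zeros."""
    T = len(ys)
    pi = np.zeros(K)
    A = np.zeros((K, K))
    B = np.zeros((K, D))

    pi[ys[0] - 1] = 1.0                      # only the observed start state gets mass
    for t in range(T - 1):                   # transition counts
        A[ys[t] - 1, ys[t + 1] - 1] += 1.0
    for t in range(T):                       # emission counts
        B[ys[t] - 1, xs[t] - 1] += 1.0

    rowA = A.sum(axis=1, keepdims=True)      # normalize rows to probabilities
    rowB = B.sum(axis=1, keepdims=True)
    A = np.divide(A, rowA, out=np.zeros_like(A), where=rowA > 0)
    B = np.divide(B, rowB, out=np.zeros_like(B), where=rowB > 0)
    return pi, A, B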

3. [Optimization of sparse regression] Let X ∈ Rn×d be a data matrix whose ith row is the ith data observation xi ∈ Rd. Let y ∈ Rn be the vector of real-valued labels. Let W := {w ∈ Rd : w has at most one non-zero entry}. Suppose you want to find the weight vector ŵ ∈ W that minimizes the objective function f(w) := ∥y − Xw∥² over all w ∈ W.
(a) (5 points) Formulate this as an optimization problem. (Make sure to clearly indicate the dimension and the variables.)
(b) (3 points) Is this a convex optimization problem? (why or why not)?
(c) (10 points) Describe an algorithm for computing such a vector ŵ. Make sure that your pseudocode is clear and precise, and specify exactly what is returned by your algorithm. (Hint: all but one entry of ŵ are zero.)
(d) (2 points) What is the time complexity of your algorithm (give it in terms of the parameters n and d)?
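One possible approach to parts (c) and (d), sketched in Python (the helper name best_one_sparse is hypothetical, and this brute-force idea is offered as an illustration rather than as the unique intended solution): for each coordinate j, the best w supported only on coordinate j is the one-dimensional least-squares fit of y against column X_j; trying all d columns and keeping the cheapest takes O(nd) time.

import numpy as np

def best_one_sparse(X, y):
    """Return w with at most one non-zero entry minimizing ||y - Xw||^2."""
    n, d = X.shape
    best_j, best_val, best_cost = 0, 0.0, float("inf")
    for j in range(d):
        col = X[:, j]
        denom = col @ col
        wj = (col @ y) / denom if denom > 0 else 0.0   # 0/0 := 0 convention
        cost = float(np.sum((y - wj * col) ** 2))      # objective restricted to coordinate j
        if cost < best_cost:
            best_j, best_val, best_cost = j, wj, cost
    w = np.zeros(d)
    w[best_j] = best_val                               # all other entries stay zero
    return w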

4. [Principal Components Analysis (PCA) and beyond]
Recall from lecture that given a data matrix X = [x1 · · · xn], where each observation xi ∈ Rd, the best k-dimensional linear mapping that minimizes the squared reconstruction error is given by the eigenvectors corresponding to the top k eigenvalues of the outer product matrix XX^T. This is also called the k-dimensional PCA subspace.
(a) Suppose you are informed that the function to compute the eigenvectors and eigenvalues in your favorite language is buggy (so you cannot use this function). As an alternative, you explore the language documentation and find a function that has the ability to decompose any d × n matrix X into a summation over three sets of variables: σi (a scalar), ui (a vector in Rd), and vi (a vector in Rn). That is, the data matrix X can be written as:
X = ∑_i σ_i u_i v_i^T ,
These variables have special properties:
• The scalars σi’s are all non-negative, such that: 0 ≤ σ1 ≤ . . . ≤ σd.
• The vectors u1, . . . , ud are orthonormal. That is, ∥ui∥ = 1 and ui · uj = 0 (for i ̸= j).
• The vectors v1, . . . , vd are also orthonormal. That is, ∥vi∥ = 1 and vi · vj = 0 (for i ̸= j).
i. (5 points) Compute XX^T in terms of σi, ui and vi. Simplify your answer as much as possible.

ii. (5 points) Using the definition of an eigenvector of a matrix A, “Ax = λx”, show that each ui is an eigenvector of the outer product XX^T. What are the corresponding eigenvalues?
iii. (2 points) Which k eigenvectors (from u1, . . . , ud) of XX^T should be used to form a k-dimensional PCA subspace?
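For reference, a decomposition with exactly these properties is what numpy exposes as the singular value decomposition. A minimal sketch (assuming numpy; note that numpy.linalg.svd returns the σ_i in descending rather than ascending order, with the u_i as columns of U and the v_i as rows of Vt):

import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8
X = rng.normal(size=(d, n))                           # a d x n data matrix

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(sigma) @ Vt
X_rebuilt = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(len(sigma)))
assert np.allclose(X, X_rebuilt)                      # X = sum_i sigma_i u_i v_i^T

# Sanity check related to part (i): X X^T = sum_i sigma_i^2 u_i u_i^T.
XXt = sum(sigma[i] ** 2 * np.outer(U[:, i], U[:, i]) for i in range(len(sigma)))
assert np.allclose(X @ X.T, XXt)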

(b) (3 points) PCA yields a subspace that gives the best reconstruction error, but this subspace can be arbitrarily bad for some predictive tasks such as classification. Suppose we want to do linear classification on binary labelled data in R2. To reduce representational complexity, we wish to project this data onto a 1D subspace via PCA and perform linear classification in 1D.
Depict an example binary labelled dataset in R2 such that even though there exists a 1D projection of the dataset that can perfectly linearly separate the two classes, the best linear separator on the 1D PCA projection will give poor classification accuracy. (You must justify why the PCA projection would not pick a good label-separating subspace by outlining the kinds of properties your depicted dataset must have.)
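For intuition, a minimal synthetic construction of the kind of dataset this part asks for (an assumed illustrative example; the specific numbers are arbitrary): the two classes form long, thin parallel stripes, so almost all of the variance lies along the stripes while the labels are separated only in the orthogonal direction, which a 1D PCA projection discards.

import numpy as np

rng = np.random.default_rng(0)
n = 200
# Large spread along the x-axis: the top principal component will be (close to) x.
x = rng.uniform(-10.0, 10.0, size=n)
# The labels are separated only along y, with a small gap around y = 0.
y_pos = rng.uniform(0.5, 1.0, size=n // 2)    # class +1 sits just above the x-axis
y_neg = rng.uniform(-1.0, -0.5, size=n // 2)  # class -1 sits just below it
X = np.column_stack([x, np.concatenate([y_pos, y_neg])])
labels = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
# Projecting onto the y-axis separates the classes perfectly, but projecting onto
# the 1D PCA direction (roughly the x-axis) mixes the two classes completely.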

5. [Bayes Nets]
(a) (5 points) Draw a Bayes net over the random variables {A, B, C, D} where the following conditional independence assumptions hold. Here, X⊥Y | Z means X is conditionally independent of Y given Z, and X ̸⊥ Y | Z means X and Y are not conditionally independent given Z, and ∅ stands for the empty set.
(b) (2 points) Write a simple expression for the joint distribution over the variables {A, B, C, D} that captures all the conditional (in)dependencies asserted by the Bayes net you depicted in part (a).
Pr[A, B, C, D] =
(c) (3 points) List all the conditional independencies asserted by the Bayes net you depicted in part (a), but with all the arrows reversed.
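As an illustration of how part (b) works in general (for a hypothetical chain A → B → C → D, which need not be the net asked for in part (a)): the joint factorizes as each variable conditioned on its parents, Pr[A, B, C, D] = Pr[A] Pr[B | A] Pr[C | B] Pr[D | C].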

6. [A better output Perceptron algorithm guarantee]
In class, we saw that when the training sample S is linearly separable with a maximum margin γ > 0, the Perceptron algorithm run cyclically over S is guaranteed to converge after T ≤ R²/γ² updates, where R is the radius of the sphere containing the sample points. This does not guarantee, however, that the hyperplane solution returned by the Perceptron, i.e. w_T, achieves a margin close to γ.
(a) (5 points) Show an example training dataset S in R2 that has margin γ, and an order of updates made by the Perceptron algorithm where the hyperplane solution returned has arbitrarily bad margin on S.
(b) Consider the following modification to the perceptron algorithm:
Modified Perceptron Algorithm
Input: training dataset S = (x_i, y_i)_{i=1,…,n}
Output: learned vector w
– Initialize w_0 := 0, t := 0
– while there exists an example (x, y) ∈ S such that 2y(w_t · x) ≤ γ∥w_t∥
  (the modified condition; the original Perceptron update condition was y(w_t · x) ≤ 0)
  – set w_{t+1} := w_t + y x
  – set t := t + 1
– return w_t.
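A minimal Python sketch of the pseudocode above (the function name modified_perceptron and the safety cap on iterations are assumptions made for illustration):

import numpy as np

def modified_perceptron(X, y, gamma, max_updates=100_000):
    """X: n x d array of points, y: labels in {-1, +1}, gamma: assumed margin of S."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_updates):                 # cap only so the sketch always terminates
        margins = y * (X @ w)                    # y_i (w . x_i) for every example
        violators = np.where(2 * margins <= gamma * np.linalg.norm(w))[0]
        if violators.size == 0:                  # no example satisfies the update condition:
            return w                             # every point has margin > gamma/2 after normalizing by ||w||
        i = violators[0]
        w = w + y[i] * X[i]                      # standard Perceptron update on a violator
    return w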
i. (2 points) If the Modified Perceptron Algorithm (MPA) terminates after T rounds, what margin guarantee is achieved by the hyperplane wT returned by MPA? Justify your answer.

ii. (3 points) It turns out that for a training sample S of margin γ, one can also show that at any iteration t where the Modified Perceptron Algorithm (MPA) makes a mistake, the following two properties hold:
A. w_t · w∗ ≥ (w_{t−1} · w∗) + γ
B. ∥w_t∥ ≤ 4R²/γ + 3tγ/4
From properties A and B, what mistake bound can you derive for MPA? That is, bound the maximum number of iterations T, or equivalently, bound the number of mistakes made by MPA.
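As a sketch of how such a bound can be derived (assuming the form of property B given above and taking w∗ to be the unit-norm maximum-margin separator): unrolling A from w_0 = 0 gives w_T · w∗ ≥ Tγ, and since ∥w∗∥ = 1, Cauchy–Schwarz gives w_T · w∗ ≤ ∥w_T∥. Combining with B, Tγ ≤ 4R²/γ + 3Tγ/4, so Tγ/4 ≤ 4R²/γ, i.e. T ≤ 16R²/γ².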

[blank page 1 for scratch work]
[blank page 2 for scratch work]