Data Analytics and Machine Learning Group Department of Informatics
Technical University of Munich
Place student sticker here

• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

Exam: Machine Learning (IN2064) / Endterm
Examiner: Prof. Dr. Günnemann
Date: Thursday 13th February, 2020
Time: 17:00 – 19:00
Working instructions
• This exam consists of 16 pages with a total of 10 problems.
Please make sure now that you received a complete copy of the exam.
• The total amount of achievable credits in this exam is 92 credits.
• Detaching pages from the exam is prohibited.
• Allowed resources:
– A4 sheet of handwritten notes (two sides)
– no other materials (e.g. books, cell phones, calculators) are allowed!
• Only write on the sheets given to you by supervisors. If you need more paper, ask the supervisors.
• Last two pages can be used as scratch paper.
• All sheets (including scratch paper) have to be returned at the end.
• Only use a black or a blue pen (no pencils, red or green pens)!
• Write your answers only in the provided solution boxes or the scratch paper.
• For problems that say “Justify your answer” you only get points if you provide a valid explanation.
• For problems that say “Prove” you only get points if you provide a valid mathematical proof.
• If a problem does not say “Justify your answer” or “Prove” it’s sufficient to only provide the correct answer.
• Exam duration – 120 minutes.
Problem 1: Decision Trees (12 credits)
You are given a dataset with points from three different classes and want to classify them based on a decision tree. The plot below illustrates the data points (class labels are indicated by the symbols [×, ◦, ▽]) and the decision boundaries of a decision tree.
a) Draw the corresponding decision tree. Make sure that you include the feature (x1 or x2) and threshold of the split as well as the number of samples of each class that pass the corresponding inner node or leaf node.

Root: x1 < 3?  [10, 9, 11]
• Left child: x2 < 2?  [10, 5, 0], with leaf nodes [0, 5, 0] and [10, 0, 0]
• Right child: x2 < 3?  [0, 4, 11], with leaf nodes [0, 0, 11] and [0, 4, 0]
Each inner node of the tree must include: feature and threshold, number of samples of each class (value = [....]).
Each leaf node must include: number of samples of each class (value = [....]).
Points: root, inner nodes, leaf nodes.
b) Compute the Gini index of each node of your decision tree.
Note: Your answer may contain improper fractions.

Gini index: i_G(t) = Σ_{i∈C} π_i (1 − π_i) = 1 − Σ_{i∈C} π_i²

Root node: i_G(t) = 1 − (10/30)² − (9/30)² − (11/30)² = 598/900 ≈ 0.664
Left child of root: i_G(t) = 1 − (10/15)² − (5/15)² − (0/15)² = 100/225 ≈ 0.444
Right child of root: i_G(t) = 1 − (0/15)² − (4/15)² − (11/15)² = 88/225 ≈ 0.391
All four leaf nodes ([0, 5, 0], [10, 0, 0], [0, 0, 11] and [0, 4, 0]) are pure, so i_G(t) = 0 for each of them.
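These values are easy to check with a small helper that computes the Gini index from the class counts at a node (a minimal sketch, not part of the exam solution):

import numpy as np

def gini(counts):
    # Gini index i_G = 1 - sum_i pi_i^2 for the class counts at a node
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([10, 9, 11]))   # root:        0.664...
print(gini([10, 5, 0]))    # left child:  0.444...
print(gini([0, 4, 11]))    # right child: 0.391...
print(gini([0, 5, 0]))     # pure leaf:   0.0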
c) Assume you have a dataset with two-dimensional points from two different classes C1 and C2. The points from class C1 are given by A = {(i, i²) | i ∈ {1, ..., 100}} ⊆ R², while the points from class C2 are given by B = {(i, 125/i) | i ∈ {1, ..., 100}} ⊆ R².
Construct a decision tree of minimal depth that assigns as many data points as possible to the correct class. Provide for each split the feature and corresponding thresholds. How many and which data points are misclassified?

• Split the root node based on feature 1 ≤ 5.
• Split both child nodes based on feature 2 ≤ 25.
One point, (5, 25), is in both classes and is misclassified.
Decision tree / calculation of thresholds / picture.
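The misclassification count can be verified with a short script that applies the proposed splits to both classes (a sketch; the leaf labels below are chosen by majority vote, which is an assumption of this check rather than something stated in the exam):

A = [(i, i ** 2) for i in range(1, 101)]        # class C1
B = [(i, 125.0 / i) for i in range(1, 101)]     # class C2

def predict(x1, x2):
    # root split: feature 1 <= 5; both children split on feature 2 <= 25
    if x1 <= 5:
        return "C1" if x2 <= 25 else "C2"
    return "C2" if x2 <= 25 else "C1"

errors = [p for p in A if predict(*p) != "C1"] + [p for p in B if predict(*p) != "C2"]
print(errors)   # [(5, 25.0)] -- only the shared point (5, 25) is misclassified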
Problem 2: Probabilistic Inference (4 credits)

We are interested in estimating a discrete parameter z that can take values in {1, 2, 3, 4}.
• We place a categorical prior on z, that is p(z | π) = Categorical(z | π) with π = (0.1, 0.05, 0.85, 0.0).
• We choose the following likelihood function: p(x | z) = Exponential(x | 2^z) = 2^z exp(−x · 2^z).
• We have observed one sample x = 32.
What is the posterior probability that z is equal to 4, i.e. what is p(z = 4 | x, π)? Justify your answer.

Using the Bayes formula (for correctly writing the Bayes formula):
p(z = 4 | x, π) ∝ p(x | z = 4) p(z = 4 | π) ∝ 2⁴ exp(−32 · 2⁴) · 0 = 0
Since the prior probability p(z = 4 | π) equals zero, the posterior probability p(z = 4 | x, π) equals zero as well.
minus for each mistake in the formulas, even if the final answer is correct
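For illustration, the full posterior over z can also be computed numerically (a minimal sketch assuming the Exponential likelihood with rate 2^z as in the statement):

import numpy as np

pi = np.array([0.1, 0.05, 0.85, 0.0])            # prior over z = 1, 2, 3, 4
x = 32.0
z = np.arange(1, 5)
likelihood = 2.0 ** z * np.exp(-x * 2.0 ** z)    # p(x | z) = 2^z exp(-x 2^z)
posterior = likelihood * pi
posterior /= posterior.sum()
print(posterior)    # the entry for z = 4 is exactly 0 because its prior is 0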
Problem 3: Probabilistic Inference (8 credits)
We are interested in estimating the parameter θ ∈ R of the following probabilistic model: p(x | θ) = exp(θ − x − exp(θ − x)).
We have observed a single sample x ∈ R drawn from the above model. Derive the maximum likelihood estimate (MLE) of the parameter θ. Justify your answer.

The maximum likelihood estimate of θ is defined as
θ_MLE = argmax_θ p(x | θ)
      = argmax_θ log p(x | θ)
      = argmin_θ −log p(x | θ)
      = argmin_θ (−θ + x + exp(θ − x))
Clearly, this is a convex function of θ. To minimize, compute the derivative, set it to zero and solve for θ:
∂/∂θ (−θ + x + exp(θ)/exp(x)) = −1 + exp(θ)/exp(x) =! 0  ⟺  exp(θ) = exp(x)  ⟺  θ = x
Therefore, θ_MLE = x.
for correctly computing and simplifying, for the correct final answer
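The result θ_MLE = x can also be confirmed numerically by minimizing the negative log-likelihood (a sketch; the observed value x = 0.7 is an arbitrary choice for illustration):

import numpy as np
from scipy.optimize import minimize_scalar

x = 0.7                                              # hypothetical observation
neg_log_lik = lambda theta: -(theta - x - np.exp(theta - x))
result = minimize_scalar(neg_log_lik)
print(result.x)                                      # ~0.7, i.e. theta_MLE = x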
Mark correct answers with a cross.
To undo a cross, completely fill out the answer option.
To re-mark an option, use a human-readable marking.
Problem 4: Regression (10 credits)
The following five plots show five different regression models fitted to the same dataset. Your task is to assign each of the plots to the corresponding model. (The five model plots are not reproduced here.)
For each subtask: points only if the correct option is chosen, otherwise 0.
a) Polynomial regression (degree = 5), no regularization
b) Polynomial regression (degree = 10), no regularization
c) Polynomial regression (degree = 50), L2 regularization with λ = 10³
d) Feed-forward neural network with ReLU activation functions, no regularization
e) Decision tree of depth 2
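The kinds of models being matched here can be reproduced in a few lines. Below is a hedged sketch (the dataset is made up for illustration; the exam's actual data and plots are not reproduced) showing how polynomial regression with and without L2 regularization might be fitted:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)   # hypothetical data

def fit_poly(x, y, degree, lam=0.0):
    # least-squares polynomial fit with optional L2 (ridge) penalty lam
    Phi = np.vander(x, degree + 1, increasing=True)          # [1, x, x^2, ...]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)

w_a = fit_poly(x, y, degree=5)               # a) moderately flexible fit
w_b = fit_poly(x, y, degree=10)              # b) prone to overfitting
w_c = fit_poly(x, y, degree=50, lam=1e3)     # c) heavy regularization -> very smooth fit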
Problem 5: Convex Optimization (18 credits)

Consider the set X = {x ∈ R^D : ∥x∥₂ ≤ 1 and x_i ≥ 0 for all i = 1, ..., D}.
if the projection π_X is correct for the points x ∉ X, and finally π_X(x) = x for x ∈ X (for the case x ∈ X).
From now on we consider the setting with an arbitrary 1 < D ∈ N.

c) Prove that X is convex.

X = {x ∈ R^D : ∥x∥₂ ≤ 1, x_i ≥ 0 for all i = 1, ..., D}
  = {x ∈ R^D : ∥x∥₂ ≤ 1} ∩ {x ∈ R^D : 0 ≤ x_i ≤ 1 for all i = 1, ..., D},
i.e. the intersection of the unit ball B₁(0) and the cube [0, 1]^D.
Therefore, X is an intersection of two convex sets and hence convex.
If proven by definition of convexity: for proving that an intermediate point x_λ satisfies ∥x_λ∥₂ ≤ 1, for proving that (x_λ)_i ≥ 0 for all i.

d) Fill in the space in the box below using mathematical notation with a description of the vertices of X. Note that just writing down the definition of vert(X) will not be sufficient.

vert(X) = {x ∈ R^D : ∥x∥₂ = 1 and x_i ≥ 0 for all i = 1, ..., D} ∪ {0}
e) Find the maximum of the following constrained optimization problem. Justify your answer; all properties of the objective function and X that you use should be clearly stated and derived from the previous tasks or results considered in the course.
Hint: results from c) and d) might help you.
Hint: for arbitrary c ∈ R^D the maximum of the constrained problem max_x c^T x subject to ∥x∥₂ = 1 is ∥c∥₂.

maximize   Σ_{i=1}^D x_i + e^{∥x∥₂}
subject to x ∈ X

We will utilize the following results:
• X is convex (from c).
• The function f : R^D → R with f(x) = Σ_{i=1}^D x_i + e^{∥x∥₂} is convex, since h : R^D → [0, ∞) with h(x) = ∥x∥₂ is convex and g : [0, ∞) → R with g(x) = e^x is convex and non-decreasing. Therefore, using the convexity-preserving operations, g(h(x)) = e^{∥x∥₂} and the linear function l(x) = Σ_{i=1}^D x_i are both convex, and hence f(x) = l(x) + g(h(x)) is convex as well.
• The maximum of a convex function over a convex domain is attained at one of its vertices (from the lecture).
for a correct argumentation that it is enough to consider the optimization problem on vert(X) only.

It is therefore sufficient to look for the maximum value of f over vert(X), which consists of {0} and the set of points with ∥x∥₂ = 1 and x_i ≥ 0 for all i (from d).
Case 0: x = 0, then f(x) = f(0) = 0 + e⁰ = 1.
Case 1: x ∈ B := {x ∈ R^D : ∥x∥₂ = 1 and x_i ≥ 0 for all i = 1, ..., D}. The largest value that f attains at these points is
max{f(x) : ∥x∥₂ = 1, x_i ≥ 0} = e + max{c^T x : ∥x∥₂ = 1}, where c = (1, ..., 1)^T.
Using the provided hint, the maximum of f over B is e + ∥c∥₂ = e + √D. Since √D + e > 1, the maximum of f over X is √D + e.
if all vertices were considered and the correct results were obtained using valid argumentation.
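A small numerical sanity check of this result (a sketch for an arbitrary small D, not part of the exam solution): the vertex x = c/∥c∥₂ attains √D + e, and random points of X do not exceed it.

import numpy as np

D = 4
f = lambda x: x.sum() + np.exp(np.linalg.norm(x))

x_star = np.ones(D) / np.sqrt(D)            # the maximizing vertex from Case 1
print(f(x_star), np.sqrt(D) + np.e)         # both equal sqrt(D) + e

rng = np.random.default_rng(0)
vals = []
for _ in range(10000):
    x = np.abs(rng.standard_normal(D))              # nonnegative direction
    x *= rng.uniform() / np.linalg.norm(x)          # scale into the unit ball
    vals.append(f(x))
print(max(vals) <= np.sqrt(D) + np.e + 1e-12)       # True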
Problem 6: Kernels (10 credits)

Let A ∈ R^{D×D} be a positive semi-definite matrix and consider for p ∈ N the following function
k : R^D × R^D → R,   k(x1, x2) = (x1^T A x2 + 1)^p.
a) Prove that k is a valid kernel using kernel preserving operations known from the course.
Solution 1:
• k1(x1, x2) = x1^T A x2 is a kernel since A is positive semi-definite (kernel known from the course).
• k2(x1, x2) = 1 is a kernel since k2(x1, x2) = φ(x1)^T φ(x2) for φ(x) = 1.
• k3(x1, x2) = x1^T A x2 + 1 is a kernel since it is a sum of two kernels (kernel-preserving operation known from the lecture).
• Finally, by repeatedly applying the rule that a product of two kernels is a kernel to k3 · k3 (p times), we get that k = k3^p is a kernel as well.

Solution 2:
• k1(x1, x2) = x1^T A x2 is a kernel since A is positive semi-definite. Therefore there exists a map φ_A such that k1(x1, x2) = φ_A(x1)^T φ_A(x2).
• The polynomial kernel k_p(x1, x2) = (x1^T x2 + 1)^p is a kernel (if this fact is used later in the solution).
• Using the composition rule (kernel-preserving operation known from the lecture), we know that k_p(φ_A(x1), φ_A(x2)) = (x1^T A x2 + 1)^p = k(x1, x2) is again a kernel.
b) For the special case of p = 2 and A = I (identity matrix) write down the corresponding feature map φ : R^D → R^M such that k(x1, x2) = φ(x1)^T φ(x2). What is the dimension M of the feature space in this case?

φ(x) = (1, √2·x1, √2·x2, ..., √2·x_D, x1x1, x1x2, ..., x1x_D, x2x1, ..., x_D x_D)^T   and   M = 1 + D + D².

Comment: in this case we get the quadratic kernel, and the feature map includes the constant, the linear and all quadratic terms we can produce from the initial features.
for the constant term, for the linear terms, for the quadratic terms, minus if M is wrong, minus if only the case of D = 2 is considered.
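The feature map can be checked numerically (a minimal sketch for p = 2 and A = I, with D = 3):

import numpy as np

def phi(x):
    # constant term, sqrt(2)-scaled linear terms, and all D^2 quadratic terms
    return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
print(np.isclose((x1 @ x2 + 1) ** 2, phi(x1) @ phi(x2)))   # True
print(phi(x1).shape[0])                                    # M = 1 + D + D^2 = 13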
Problem 7: Deep Learning (8 credits)
The code snippet below shows an implementation of two functions f : R^N × R^N → R and g : R → R. Given two input vectors x ∈ R^N and y ∈ R^N, we perform the following computations:

z = f(x, y)
t = g(z)

The code below uses backpropagation to compute ∂t/∂x and ∂t/∂y (similarly to how we did it in Tutorial 9: Deep Learning I). However, some code fragments are missing. Your task is to complete the missing code fragments.
Note: It’s also fine to write your answer using pseudocode (we won’t deduct points for small Python syntax errors, etc.).
def forward(self, x, y):
    self.cache = (x, y)
    ##################################################
    # MISSING CODE FRAGMENT #1
    ##################################################
    return out

def backward(self, d_out):
    # x, y are np.arrays of shape [N]
    x, y = self.cache
    N = len(x)
    # np.ones(N) returns a vector of ones of shape [N]
    d_y = (x + np.ones(N)) * d_out
    d_x = (y - np.ones(N)) * d_out
    return d_x, d_y

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(self, z):
    self.cache = z
    return sigmoid(z)

def backward(self, d_out):
    z = self.cache
    ##################################################
    # MISSING CODE FRAGMENT #2
    ##################################################
    return d_z

# Example usage
x = np.array([1., 2., 3.])
y = np.array([-2., 3., -1.])

z = f.forward(x, y)
t = g.forward(z)

d_z = g.backward(1.0)
d_x, d_y = f.backward(d_z)
a) Complete the MISSING CODE FRAGMENT #1
Two points for x^T y, one point for each of the sums.

Option 1:
    out = np.dot(x, y) + np.sum(y) - np.sum(x)

Option 2:
    N = len(x)
    ones = np.ones(N)
    out = np.dot(x, y) - np.dot(x, ones) + np.dot(y, ones)

Option 3:
    N = len(x)
    ones = np.ones(N)
    out = np.dot(x + ones, y - ones)

(Option 3 differs from the first two by the constant −N, which does not change the gradients computed in backward.)
b) Complete the MISSING CODE FRAGMENT #2

Option 1:
    d_z = sigmoid(z) * sigmoid(-z) * d_out

Option 2:
    d_z = sigmoid(z) * (1 - sigmoid(z)) * d_out

Three points for the derivative of sigmoid, one point for not forgetting d_out.
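Putting the fragments together, a self-contained version with a finite-difference check might look as follows (a sketch; the class names F and G and the gradient check are additions for illustration, not part of the exam):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

class F:
    # f(x, y) = x^T y + sum(y) - sum(x), consistent with the given backward pass
    def forward(self, x, y):
        self.cache = (x, y)
        return np.dot(x, y) + np.sum(y) - np.sum(x)          # fragment #1
    def backward(self, d_out):
        x, y = self.cache
        N = len(x)
        d_y = (x + np.ones(N)) * d_out
        d_x = (y - np.ones(N)) * d_out
        return d_x, d_y

class G:
    # g(z) = sigmoid(z)
    def forward(self, z):
        self.cache = z
        return sigmoid(z)
    def backward(self, d_out):
        z = self.cache
        return sigmoid(z) * (1 - sigmoid(z)) * d_out         # fragment #2

f, g = F(), G()
x = np.array([1., 2., 3.])
y = np.array([-2., 3., -1.])
t = g.forward(f.forward(x, y))

d_z = g.backward(1.0)
d_x, d_y = f.backward(d_z)

# finite-difference check of dt/dx_0
eps = 1e-6
x_shift = x.copy()
x_shift[0] += eps
t_shift = g.forward(f.forward(x_shift, y))
print(np.isclose(d_x[0], (t_shift - t) / eps))               # True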
Problem 8: SVD and Linear Regression (8 credits)

You want to perform linear regression on a data set with features X ∈ R^{N×D} and targets y ∈ R^N. Assume that you have already computed the SVD of the feature matrix X = UΣV^T. Additionally, assume that X has full rank.
Show how we can compute the optimal linear regression weights w⋆ in O(ND²) operations by using the result of the SVD.
Hint: Matrix operations have the following asymptotic complexity:
• Matrix multiplication AB for arbitrary A ∈ R^{P×Q} and B ∈ R^{Q×R} takes O(PQR)
• Matrix multiplication AD for an arbitrary A ∈ R^{P×Q} and a diagonal D ∈ R^{Q×Q} takes O(PQ)
• Matrix inversion C⁻¹ for an arbitrary matrix C ∈ R^{M×M} takes O(M³)
• Matrix inversion D⁻¹ for a diagonal matrix D ∈ R^{M×M} takes O(M)
w⋆ = (X^T X)⁻¹ X^T y
   = ((UΣV^T)^T (UΣV^T))⁻¹ (UΣV^T)^T y
   = (VΣU^T UΣV^T)⁻¹ VΣU^T y
   = (VΣ²V^T)⁻¹ VΣU^T y
   = (V^T)⁻¹ Σ⁻² V⁻¹ VΣU^T y
   = VΣ⁻² V^T VΣU^T y
   = VΣ⁻¹ U^T y

Computing w⋆ = VΣ⁻¹U^T y from right to left:
• Multiplication a = U^T y takes O(N · D · 1)
• Multiplication b = Σ⁻¹ a takes O(D)
• Multiplication w = V b takes O(D²)
In total, O(ND + D + D²) = O(ND) if N > D.

for w⋆ = (X^T X)⁻¹X^T y, for U^T U = I, for (VΣ²V^T)⁻¹ = (V^T)⁻¹Σ⁻²V⁻¹, for (V^T)⁻¹Σ⁻²V⁻¹ = VΣ⁻²V^T, for V^T V = I, for the final answer w⋆ = VΣ⁻¹U^T y, minus for any mistake, for correct big-O analysis.

There is a mistake in the task, saying "O(ND²)" instead of "O(ND)". The task was graded taking this into account.
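The O(ND) computation can be written directly in NumPy (a sketch with made-up data; np.linalg.lstsq is used only to cross-check the result):

import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 5                                       # hypothetical sizes, N > D
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD: U (N x D), s (D,), Vt (D x D)

a = U.T @ y          # O(ND)
b = a / s            # O(D), multiplication by the diagonal Sigma^{-1}
w = Vt.T @ b         # O(D^2)

print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))   # True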
Problem 9: K-Means (10 credits)

Let γ_i ∈ R^D for i = 1, ..., K be a set of K points more than 4 apart, i.e. ∥γ_i − γ_j∥₂ > 4 for all i ≠ j. Consider K non-empty datasets X_i, each contained within a unit ball around γ_i, i.e. ∥x − γ_i∥₂ ≤ 1 for all x ∈ X_i. Let X = ⋃_{i=1}^{K} X_i be the combined dataset.
Now consider a centroid initialization procedure similar to k-means++, though it deterministically chooses the data point farthest away from all previous centroids. That means it initializes the cluster centers μ_i as

μ_i = a random sample from X,                               if i = 1
μ_i = argmax_{x ∈ X} min_{j ∈ {1,...,i−1}} ∥x − μ_j∥₂,      if i ∈ {2, ..., K}
a) Explain why this deterministic k-means++ initialization of K clusters assigns each μ_i to a different ball, i.e. for i ≠ j such that μ_i ∈ X_{i′} and μ_j ∈ X_{j′} it holds that i′ ≠ j′.
The first centroid is placed into a random ball. Now assume that a subsequent centroid μj would be placed into the same ball as some previous μi . That means that ∥μj − μi ∥2 ≤ 2 because each Xi ′ is contained within a ball of radius 1. However, because we are placing K centroids into K balls and one ball has been chosen twice, there is at least one ball Xi ′ that does not have a centroid in it. Xi ′ is non-empty, so it has an element x ∈ Xi ′ which is more than 2 away from any previously chosen centroid because the ball centers are more than 4 apart and the data points can deviate at most 1 from their closest ball center. But that contradicts the construction of μj as the data point that is the furthest away from any previously chosen centroid, in particular μi . Therefore, every centroid is placed in a different ball.
if the student shows an understanding that the statement is true because the balls are far apart in some way
if the student works out the crux of the argument clearly, i.e. ≤ 2 vs. > 2 or in words “at most 2” vs. “more than 2”
-for mistakes, erroneous statements or arguments that do not make explicit use of ≤ 2 vs. > 2 and thus would work in the setting ∥γi − γj ∥2 > 3.9 just the same
b) Assuming a), explain why k-means clustering of X with K clusters and our deterministic k-means++ initialization recovers the underlying structure of the data, i.e. all data points x ∈ X_i will be assigned to the same centroid, for all i.
Without loss of generality, let the centroids be assigned such that μi ∈ Xi for all i. Then each data point x ∈ Xi will be assigned to μi because ∥μi − x∥2 ≤ 2 and ∥μj − x∥2 > 2 for all i and j ̸= i, see a). Because updating centroid μi is a convex combination of data points x ∈ Xi , μi stays within the bounding ball of Xi for any i. So the assignments stay the same and k-means terminates in the next iteration.
for arguing convincingly that each x is assigned to the centroid in its ball because again ≤ 2 vs. > 2
for stating that and why the algorithm converges/terminates
-for mistakes, erroneous statements or arguments that do not make explicit use of ≤ 2 vs. > 2 and thus would work in the setting ∥γi − γj ∥2 > 3.9 just the same
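For completeness, the deterministic initialization itself is straightforward to implement (a sketch; the three ball centers below are arbitrary examples more than 4 apart, not data from the exam):

import numpy as np

def deterministic_kmeanspp_init(X, K, rng):
    # first centroid: random data point; afterwards: the data point that is
    # farthest from all previously chosen centroids
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :], axis=2),
            axis=1,
        )
        centroids.append(X[np.argmax(dists)])
    return np.asarray(centroids)

rng = np.random.default_rng(0)
gammas = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])          # centers > 4 apart
X = np.vstack([g + rng.uniform(-0.5, 0.5, size=(20, 2)) for g in gammas])
print(deterministic_kmeanspp_init(X, K=3, rng=rng))                # one centroid per ball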
Problem 10: Differential Privacy & Fairness (4 credits)

a) What is the robustness to post-processing property of Differential Privacy?

You can apply any function to the output of an ε-DP mechanism and the new output remains ε-DP, as long as you do not access the data again.
Only one point if "as long as you do not access the data again" is missing.
b) Suppose that we require that the same percentage of applicants from two different groups must receive a loan. What fairness criterion are we implementing? What is the biggest downside/con of this criterion?
Demographic Parity, also called Independence and Statistical Parity. Any of the following downsides are acceptable:
• Rules out the perfect predictor when base rates are different across groups.
• Laziness: we can trivially satisfy the criterion by giving loans to qualified people from one group and random people from the other.
• Too strong, if one group is much more likely to repay compared to the other group.
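As a tiny illustration, demographic parity only compares the rates of positive decisions across groups (a sketch with made-up decisions and group labels):

import numpy as np

y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])   # hypothetical loan decisions
group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # hypothetical group membership

rate_0 = y_hat[group == 0].mean()
rate_1 = y_hat[group == 1].mean()
print(rate_0, rate_1, np.isclose(rate_0, rate_1))  # demographic parity holds if the rates match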
Additional space for solutions – clearly mark the (sub)problem your answers are related to and strike out invalid solutions.