Data Analytics and Machine Learning Group
Department of Informatics
Technical University of Munich
Place student sticker here
Graded Exercise: Machine Learning (IN2064 / Endterm)
Examiner: Prof. Dr. Günnemann
Date: Tuesday 16th February, 2021
Time: 11:00 – 13:00

• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and next to the signature field in the attendance check list.
Working instructions
• This graded exercise consists of 36 pages with a total of 10 problems. Please make sure now that you received a complete copy of the graded exercise.
• The total amount of achievable credits in this graded exercise is 48.
• This document is copyrighted and it is illegal for you to distribute it or upload it to any third-party websites.
• Do not submit the problem descriptions (this document) to TUMexam.
Left room from to / Early submission at
Problem 1: Probabilistic Inference (Version A) (4 credits)

Imagine a disease spreading across a small village. To make accurate forecasts for the necessary hospital beds, you want to estimate the severity of the disease with the following model. Let s be a measure for the severity of the disease. We assume a priori that s follows a standard normal distribution,

s ∼ N(0, 1), i.e. p(s) ∝ exp(−s²/2).

The severity probabilistically influences the required days of hospital care t for a patient according to

t | s ∼ Exp(λ) with density p(t | s) = λ exp(−λt), where λ = s².

After some time, you were able to collect N = 5 data points. The respective patients left the hospital after 3, 7, 3, 2 and 4 days.

Derive an a-posteriori most likely value of the severity s considering these observations. Justify your answer.
Note: You can assume that s ≠ 0, that is, you can safely divide by s if necessary.
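The problem asks for a pen-and-paper derivation; the following Python sketch is only a numerical cross-check of whatever value you derive, assuming NumPy and SciPy are available. It maximizes the log-posterior log p(s) + Σᵢ log p(tᵢ | s) of the model above for the Version A observations.

```python
# A minimal numerical sketch (not the requested analytic derivation): it
# maximizes the log-posterior of the model above for the Version A data.
import numpy as np
from scipy.optimize import minimize_scalar

t = np.array([3.0, 7.0, 3.0, 2.0, 4.0])  # observed days of hospital care (Version A)

def neg_log_posterior(s):
    lam = s ** 2                       # rate of the exponential likelihood
    log_prior = -0.5 * s ** 2          # standard normal prior, up to an additive constant
    log_lik = np.sum(np.log(lam) - lam * t)
    return -(log_prior + log_lik)

# Search over s > 0 only; the posterior is symmetric in s, so -s is equally likely.
res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 10.0), method="bounded")
print(res.x)  # numerical candidate for the MAP estimate of s
```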
Problem 1: Probabilistic Inference (Version B) (4 credits)
Imagine a disease spreading across a small village. To make accurate forecasts for the necessary hospital beds, you want to estimate the severity of the disease with the following model. Let s be a measure for the severity of a disease. We assume a priori that s follows a standard normal distribution.
s ∼ N(0, 1), i.e. p(s) ∝ exp(−s²/2).

The severity probabilistically influences the required days of hospital care t for a patient according to

t | s ∼ Exp(λ) with density p(t | s) = λ exp(−λt), where λ = s².
After some time, you were able to collect N = 3 data points. The respective patients left the hospital after 10, 11 and 5 days.
Derive an a-posteriori most likely value of the severity s considering these observations. Justify your answer.
Note: You can assume that s ≠ 0, that is, you can safely divide by s if necessary.
Problem 1: Probabilistic Inference (Version C) (4 credits)

Imagine a disease spreading across a small village. To make accurate forecasts for the necessary hospital beds, you want to estimate the severity of the disease with the following model. Let s be a measure for the severity of the disease. We assume a priori that s follows a standard normal distribution,

s ∼ N(0, 1), i.e. p(s) ∝ exp(−s²/2).

The severity probabilistically influences the required days of hospital care t for a patient according to

t | s ∼ Exp(λ) with density p(t | s) = λ exp(−λt), where λ = s².

After some time, you were able to collect N = 4 data points. The respective patients left the hospital after 8, 12, 9 and 6 days.

Derive an a-posteriori most likely value of the severity s considering these observations. Justify your answer.
Note: You can assume that s ≠ 0, that is, you can safely divide by s if necessary.
Problem 2: k-nearest neighbors (Version A) (4 credits)
Consider a k-nearest neighbor classifier using Euclidean distance on a binary classification task. The prediction for an instance is the majority class of its k nearest neighbors. The training set is shown in Figure 4.1.
Figure 4.1: A red + denotes instances from class 1, and a blue – denotes instances from class 2.
a) Specify one value of k that minimizes the leave-one-out cross-validation (LOOCV) error. Please consider only odd values of k (e.g. 1, 3, 5, 7, … ) to avoid ties. What is the resulting error, i.e. the number of misclassified data points?
b) Imagine that the training dataset contains one additional point with coordinates (1, 2) and label +. Would this change the decision boundary for a 1-nearest neighbor classifier? Why or why not?
c) If your answer above was no, what is the shortest distance that you need to move (1, 2) such that it changes the decision boundary? If your answer above was yes, what is the shortest distance that you need to move (1, 2) so that it does not change the decision boundary?
Write down the new coordinates after moving and specify the shortest distance.
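If you want to verify your answers on a computer, a minimal sketch of the LOOCV computation is given below, assuming scikit-learn is available. The arrays X and y are placeholders; the actual coordinates and labels have to be read off Figure 4.1, which is not reproduced here.

```python
# LOOCV error of a k-NN classifier with Euclidean distance, assuming scikit-learn.
# X and y are PLACEHOLDERS -- replace them with the points and labels from Figure 4.1.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]])  # placeholder coordinates
y = np.array([1, 1, 2, 2])                                      # placeholder class labels

for k in (1, 3):  # odd values of k only, to avoid ties
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut())
    print(k, int(round((1 - acc).sum())))  # number of misclassified points under LOOCV
```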
Problem 2: k-nearest neighbors (Version B) (4 credits)
Consider a k-nearest neighbor classifier using Euclidean distance on a binary classification task. The prediction for an instance is the majority class of its k nearest neighbors. The training set is shown in Figure 5.1.
Figure 5.1: A red + denotes instances from class 1, and a blue – denotes instances from class 2.
a) Specify one value of k that minimizes the leave-one-out cross-validation (LOOCV) error. Please consider only odd values of k (e.g. 1, 3, 5, 7, … ) to avoid ties. What is the resulting error, i.e. the number of misclassified data points?
b) Imagine that the training dataset contains one additional point with coordinates (1, 2) and label +. Would this change the decision boundary for a 1-nearest neighbor classifier? Why or why not?
c) If your answer above was no, what is the shortest distance that you need to move (1, 2) such that it changes the decision boundary? If your answer above was yes, what is the shortest distance that you need to move (1, 2) so that it does not change the decision boundary?
Write down the new coordinates after moving and specify the shortest distance.
Problem 2: k-nearest neighbors (Version C) (4 credits)
Consider a k-nearest neighbor classifier using Euclidean distance on a binary classification task. The prediction for an instance is the majority class of its k nearest neighbors. The training set is shown in Figure 6.1.
Figure 6.1: A red + denotes instances from class 1, and a blue – denotes instances from class 2.
a) Specify one value of k that minimizes the leave-one-out cross-validation (LOOCV) error. Please consider only odd values of k (e.g. 1, 3, 5, 7, … ) to avoid ties. What is the resulting error, i.e. the number of misclassified data points?
b) Imagine that the training dataset contains one additional point with coordinates (1, 2) and label -. Would this change the decision boundary for a 1-nearest neighbor classifier? Why or why not?
c) If your answer above was no, what is the shortest distance that you need to move (1, 2) such that it changes the decision boundary? If your answer above was yes, what is the shortest distance that you need to move (1, 2) so that it does not change the decision boundary?
Write down the new coordinates after moving and specify the shortest distance.
Problem 2: k-nearest neighbors (Version D) (4 credits)
Consider a k-nearest neighbor classifier using Euclidean distance on a binary classification task. The prediction for an instance is the majority class of its k nearest neighbors. The training set is shown in Figure 7.1.
Figure 7.1: A red + denotes instances from class 1, and a blue – denotes instances from class 2.
a) Specify one value of k that minimizes the leave-one-out cross-validation (LOOCV) error. Please consider only odd values of k (e.g. 1, 3, 5, 7, … ) to avoid ties. What is the resulting error, i.e. the number of misclassified data points?
b) Imagine that the training dataset contains one additional point with coordinates (1, 2) and label -. Would this change the decision boundary for a 1-nearest neighbor classifier? Why or why not?
c) If your answer above was no, what is the shortest distance that you need to move (1, 2) such that it changes the decision boundary? If your answer above was yes, what is the shortest distance that you need to move (1, 2) so that it does not change the decision boundary?
Write down the new coordinates after moving and specify the shortest distance.
Problem 3: Linear Regression (Version A) (6 credits)
Consider a dataset D = {(x^(i), y^(i))}_{i=1}^N, with x^(i) ∈ R^D, y^(i) ∈ R and centered features, so that (1/N) Σ_{i=1}^N x^(i) = 0. During preprocessing, we absorb the bias as a constant feature to produce the transformed dataset D̃ = {(x̃^(i), ỹ^(i))}_{i=1}^N, where we map each (x^(i), y^(i)) to

x̃^(i) = [x^(i); 1] ∈ R^(D+1) and ỹ^(i) = y^(i).

We want to perform ridge regression with regularization strength λ to find the optimal weight vector w̃.
Reminder: In ridge regression we optimize the loss function

L(w) = (1/2) Σ_{i=1}^N (w^T x̃^(i) − ỹ^(i))² + (λ/2) ∥w∥_2².

a) Derive the closed-form solution for the last element w̃_{D+1} of the weight vector, corresponding to the absorbed bias, obtained from ridge regression on D̃.

b) Propose an alternative preprocessing step, i.e. define an alternative transformed dataset D′ = {(x′^(i), y′^(i))}_{i=1}^N, such that the optimal ridge regression vector w′ ∈ R^D on D′ is equivalent to w̃ obtained on D̃. Justify that ridge regression on your preprocessed data D′ finds an optimal w′ ∈ R^D that is equal to w̃ in the first D elements, i.e. w′_i = w̃_i for i ∈ {1, …, D}. You do not need to derive the closed-form solution for w′.
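The derivation for part a) is again expected on paper, but the closed form you obtain for the bias weight can be sanity-checked numerically. The sketch below (assuming NumPy) fits ridge regression with the loss stated above on random centered data and compares the last weight against one candidate expression, Σᵢ ỹ^(i) / (N + λ); treat that expression as an assumption to be confirmed by your own derivation, not as a given solution.

```python
# Numerical sanity check for part a), assuming NumPy and the loss
#   L(w) = 1/2 * sum_i (w^T x_tilde_i - y_i)^2 + lambda/2 * ||w||_2^2.
# The comparison value sum(y) / (N + lam) is a candidate answer, not given in the problem.
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 3, 2.0

X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                      # centered features, as assumed in the problem
y = rng.normal(size=N)

X_tilde = np.hstack([X, np.ones((N, 1))])   # absorb the bias as a constant feature
w = np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(D + 1), X_tilde.T @ y)

print(w[-1], y.sum() / (N + lam))           # the two numbers should agree if the candidate is right
```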
Problem 3: Linear Regression (Version B) (6 credits)

Consider a dataset D = {(x^(i), y^(i))}_{i=1}^N, with x^(i) ∈ R^D, y^(i) ∈ R and centered features, so that (1/N) Σ_{i=1}^N x^(i) = 0. During preprocessing, we absorb the bias as a constant feature to produce the transformed dataset D̃ = {(x̃^(i), ỹ^(i))}_{i=1}^N, where we map each (x^(i), y^(i)) to

x̃^(i) = [x^(i); 1] ∈ R^(D+1) and ỹ^(i) = y^(i).

We want to perform ridge regression with regularization strength λ to find the optimal weight vector w̃.
Reminder: In ridge regression we optimize the loss function

L(w) = (1/2) Σ_{i=1}^N (w^T x̃^(i) − ỹ^(i))² + (λ/2) ∥w∥_2².

a) Derive the closed-form solution for the last element w̃_{D+1} of the weight vector, corresponding to the absorbed bias, obtained from ridge regression on D̃.

b) Propose an alternative preprocessing step, i.e. define an alternative transformed dataset D′ = {(x′^(i), y′^(i))}_{i=1}^N, such that the optimal ridge regression vector w′ ∈ R^D on D′ is equivalent to w̃ obtained on D̃. Justify that ridge regression on your preprocessed data D′ finds an optimal w′ ∈ R^D that is equal to w̃ in the first D elements, i.e. w′_i = w̃_i for i ∈ {1, …, D}. You do not need to derive the closed-form solution for w′.
Problem 4: Naive Bayes (Version A) (6 credits)
We have collected the dataset shown in the table below with the continuous feature x1, and the discrete feature x2 that can take either of the values yes or no. Each data point is labeled as one of three classes 1, 2 or 3 denoted by y.
Table 10.1: Naive Bayes Data (each column is one data point)

        (1)    (2)    (3)    (4)    (5)    (6)    (7)
x1      0.0    2.0   -2.0   -1.0    3.0    4.0    6.0
x2      yes    no     no     yes    no     yes    yes
a) Set up a naive Bayes classifier (choose likelihoods and their parameterization) for the data in Table 10.1 using Normal distributions with fixed variance of 1 and Bernoulli distributions. Compute the maximum likelihood estimate of all parameters θ required for naive Bayes.
b) You observe a new data point x^(b) = (1, yes). Compute the unnormalized posterior over classes p(y^(b) | x^(b), θ) for x^(b). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

c) Next, you get another data point x^(d) = (n/a, n/a), where all values are missing. Compute the unnormalized posterior over classes p(y^(d) | x^(d), θ) for x^(d). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

d) Finally, you see a data point x^(c) = (n/a, no), i.e. the features are only partially known. Compute the unnormalized posterior over classes p(y^(c) | x^(c), θ) for x^(c). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?
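The computations in b) to d) can again be checked numerically. The sketch below (assuming NumPy) implements the naive Bayes model described in a): Gaussian likelihoods with fixed variance 1 for x1, Bernoulli likelihoods for x2, and MLE class priors; missing features are handled by simply dropping their factors. The class labels y in the code are placeholders, since they have to be taken from Table 10.1.

```python
# Naive Bayes sketch, assuming NumPy: Gaussian likelihoods with sigma^2 = 1 for x1,
# Bernoulli likelihoods for x2 (yes = 1, no = 0), MLE class priors.
# The labels y are PLACEHOLDERS -- use the class assignments from Table 10.1.
import numpy as np

x1 = np.array([0.0, 2.0, -2.0, -1.0, 3.0, 4.0, 6.0])
x2 = np.array([1, 0, 0, 1, 0, 1, 1])         # yes = 1, no = 0
y = np.array([1, 1, 2, 2, 3, 3, 3])          # placeholder class labels

classes = np.unique(y)
prior = {c: np.mean(y == c) for c in classes}        # MLE class priors
mu = {c: x1[y == c].mean() for c in classes}         # Gaussian means (variance fixed to 1)
theta = {c: x2[y == c].mean() for c in classes}      # Bernoulli parameters P(x2 = yes | y = c)

def unnormalized_posterior(x1_new=None, x2_new=None):
    """p(y = c) * p(x1 | y = c) * p(x2 | y = c), dropping the factor of any missing feature."""
    post = {}
    for c in classes:
        p = prior[c]
        if x1_new is not None:
            p *= np.exp(-0.5 * (x1_new - mu[c]) ** 2)   # Gaussian density up to a constant factor
        if x2_new is not None:
            p *= theta[c] if x2_new == 1 else 1.0 - theta[c]
        post[c] = p
    return post

print(unnormalized_posterior(1.0, 1))    # part b): x = (1, yes)
print(unnormalized_posterior())          # part c): all features missing
print(unnormalized_posterior(None, 0))   # part d): only x2 = no is observed
```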
Problem 4: Naive Bayes (Version B) (6 credits)
We have collected the dataset shown in the table below with the continuous feature x1, and the discrete feature x2 that can take either of the values yes or no. Each data point is labeled as one of three classes 1, 2 or 3 denoted by y.
Table 11.1: Naive Bayes Data (each column is one data point)

        (1)    (2)    (3)    (4)    (5)    (6)    (7)
x1      0.0    2.0   -2.0   -1.0    3.0    4.0    6.0
x2      yes    no     no     yes    no     yes    yes
a) Set up a naive Bayes classifier (choose likelihoods and their parameterization) for the data in Table 11.1 using Normal distributions with fixed variance of 1 and Bernoulli distributions. Compute the maximum likelihood estimate of all parameters θ required for naive Bayes.
b) You observe a new data point x^(b) = (2, yes). Compute the unnormalized posterior over classes p(y^(b) | x^(b), θ) for x^(b). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

c) Next, you get another data point x^(d) = (n/a, n/a), where all values are missing. Compute the unnormalized posterior over classes p(y^(d) | x^(d), θ) for x^(d). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

d) Finally, you see a data point x^(c) = (n/a, no), i.e. the features are only partially known. Compute the unnormalized posterior over classes p(y^(c) | x^(c), θ) for x^(c). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?
Problem 4: Naive Bayes (Version C) (6 credits)
We have collected the dataset shown in the table below with the continuous feature x1, and the discrete feature x2 that can take either of the values yes or no. Each data point is labeled as one of three classes 1, 2 or 3 denoted by y.
Table 12.1: Naive Bayes Data (each column is one data point)

        (1)    (2)    (3)    (4)    (5)    (6)    (7)
x1     -3.0   -1.0    2.0    2.0    3.0    3.0    6.0
x2      yes    no     no     no     no     yes    yes
a) Set up a naive Bayes classifier (choose likelihoods and their parameterization) for the data in Table 12.1 using Normal distributions with fixed variance of 1 and Bernoulli distributions. Compute the maximum likelihood estimate of all parameters θ required for naive Bayes.
b) You observe a new data point x^(b) = (1, yes). Compute the unnormalized posterior over classes p(y^(b) | x^(b), θ) for x^(b). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

c) Next, you get another data point x^(d) = (n/a, n/a), where all values are missing. Compute the unnormalized posterior over classes p(y^(d) | x^(d), θ) for x^(d). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

d) Finally, you see a data point x^(c) = (n/a, no), i.e. the features are only partially known. Compute the unnormalized posterior over classes p(y^(c) | x^(c), θ) for x^(c). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?
Problem 4: Naive Bayes (Version D) (6 credits)
We have collected the dataset shown in the table below with the continuous feature x1, and the discrete feature x2 that can take either of the values yes or no. Each data point is labeled as one of three classes 1, 2 or 3 denoted by y.
Table 13.1: Naive Bayes Data (each column is one data point)

        (1)    (2)    (3)    (4)    (5)    (6)    (7)
x1     -3.0   -1.0    2.0    2.0    3.0    3.0    6.0
x2      yes    no     no     no     no     yes    yes
a) Set up a naive Bayes classifier (choose likelihoods and their parameterization) for the data in Table 13.1 using Normal distributions with fixed variance of 1 and Bernoulli distributions. Compute the maximum likelihood estimate of all parameters θ required for naive Bayes.
b) You observe a new data point x^(b) = yes. Compute the unnormalized posterior over classes p(y^(b) | x^(b), θ) for x^(b). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

c) Next, you get another data point x^(d) = (n/a, n/a), where all values are missing. Compute the unnormalized posterior over classes p(y^(d) | x^(d), θ) for x^(d). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?

d) Finally, you see a data point x^(c) = (n/a, no), i.e. the features are only partially known. Compute the unnormalized posterior over classes p(y^(c) | x^(c), θ) for x^(c). Simplify as far as possible without evaluating exponential functions, square roots, logarithms, etc. Briefly justify how you arrive at your solution. What is the most likely class for this data point?