COMP9417 Final Exam 22T2

NAME OF CANDIDATE: …………………………………………….. STUDENT ID: …………………………………………….. SIGNATURE: ……………………………………………..
THE UNIVERSITY OF NEW SOUTH WALES Term 2, 2022
COMP9417 Machine Learning and Data Mining – Final Examination
1. TIME ALLOWED — 24 HOURS
2. THIS EXAMINATION PAPER HAS 12 PAGES
3. TOTAL NUMBER OF QUESTIONS — 4
4. ANSWER ALL 4 QUESTIONS
5. TOTAL MARKS AVAILABLE — 100
6. OPEN BOOK EXAM – LECTURE NOTES, TUTORIALS, AND ONLINE RESOURCES ARE PERMITTED. PLEASE REFER TO EXAM INSTRUCTIONS ON MOODLE COURSE PAGE FOR SUBMISSION AND OTHER GUIDANCE.
7. DISCUSSION WITH OTHER STUDENTS IS STRICTLY PROHIBITED. EXAM SUBMISSIONS WILL BE CHECKED FOR PLAGIARISM. CHEATING WILL RESULT IN A FAILING GRADE FOR THE COURSE AND POTENTIAL FURTHER DISCIPLINARY ACTION.

Question 1
Please submit Question1.pdf on Moodle using the Final Exam – Question 1 object. You must submit a single PDF. You may submit multiple .py files (placed in a single zip file) if you wish. Do not put your pdf in the zip file. The parts are worth (2 + 4 + 4 + 3) + (3 + 3 + 6) = 25.
(a) (Bias, Variance & MSE) Let $X_1, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Define the estimator $T = \sum_{i=1}^{n} a_i X_i$ for some constants $a_1, \ldots, a_n$.
(i) What condition must the $a_i$'s satisfy to ensure that $T$ is an unbiased estimator of $\mu$? Unbiased means that $\mathrm{E}[T] = \mu$.
What to submit: your working out, either typed or handwritten.
(ii) Under the condition identified in the previous part, which choice of the ai’s will minimize the MSE of T? Does this choice of ai’s surprise you? Provide some brief discussion. Hint: this is a constrained minimization problem.
What to submit: your working out, either typed or handwritten, and some commentary.
(iii) Suppose that instead, we let $a_i = \frac{1+b}{n}$ for $i = 1, \ldots, n$, where $b$ is some constant. Find the value of $b$ (in terms of $\mu$ and $\sigma^2$) which minimizes the MSE of $T$. How does the answer here compare to the estimator found in the previous part? How do the two compare as the sample size $n$ increases? Are there any obvious issues with using the result of this question in practice?
What to submit: your working out, either typed or handwritten, and some commentary.
(iv) Suppose now that you are told that $\sigma^2 = \mu^2$. For the choice of $b$ identified in the previous part, find the MSE of the estimator in this setting. How does the MSE compare to the MSE of the sample average $\bar{X}$? (Recall that $\mathrm{MSE}(\bar{X}) = \mathrm{var}(\bar{X}) = \sigma^2/n = \mu^2/n$.) Further, explain whether or not you can use this choice of $b$ in practice.
What to submit: your working out, either typed or handwritten, and some commentary.
(b) (kNN Regression) Consider the usual data generating process $y = f(x) + \varepsilon$, where $f$ is some unknown function, and $\varepsilon$ is a noise variable with mean zero and variance $\sigma^2$. Recall that in kNN regression, we look at the $k$ nearest neighbours (in our dataset) of an input point $x_0$; we then consider their corresponding response values and average them to get a prediction for $x_0$. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ we can write down the kNN prediction as
$$\hat{m}(x_0) = \frac{1}{k} \sum_{i \in N_k(x_0)} y_i,$$
where $N_k(x_0)$ is the set of indices of the $k$ nearest neighbours of $x_0$. Without loss of generality, label the $k$ nearest neighbours of $x_0$ as $z_1, \ldots, z_k$ and their corresponding response values by $t_1, \ldots, t_k$.
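For concreteness, here is a minimal sketch of the kNN prediction rule above (illustrative only; the names knn_predict, X_train and y_train are not part of the question, and Euclidean distance is assumed):

import numpy as np

def knn_predict(X_train, y_train, x0, k):
    # distances from the query point x0 to every training point (Euclidean distance assumed)
    dists = np.linalg.norm(X_train - x0, axis=1)
    # indices of the k nearest neighbours, i.e. the set N_k(x0)
    nearest = np.argsort(dists)[:k]
    # average their responses to obtain m_hat(x0)
    return y_train[nearest].mean()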
(i) Show that
$$[\mathrm{bias}(\hat{m}(x_0))]^2 = \left( f(x_0) - \frac{1}{k} \sum_{i=1}^{k} f(z_i) \right)^2.$$
Throughout, you should treat x0 as a fixed point (not a random variable). You may use any results from tutorials, labs or lectures without proof1.
What to submit: your working out, either typed or handwritten.
(ii) Derive an expression for the variance $\mathrm{var}(\hat{m}(x_0))$.
What to submit: your working out, either typed or handwritten.
1Please reference any results that you use, e.g. by stating that a particular result follows from Tutorial A, question B, part C.

(iii) Using the results so far, write down an expression for the MSE of $\hat{m}(x_0)$. Describe what happens to the bias of the kNN estimator at $x_0$ when $k$ is very small (1NN), and what happens when $k$ is very large ($k \to \infty$). Similarly, what happens to the variance? What does this tell you about the relationship between bias and variance and choice of $k$?
What to submit: your working out, either typed or handwritten, and some commentary.
Question 2
Please submit Question2.pdf on Moodle using the Final Exam – Question 2 object. You must submit a single PDF. You may submit multiple .py files (placed in a single zip file) if you wish. Do not put your pdf in the zip file. The parts are worth (1+3+3+6) + (4+2+6) = 25.
(a) (Kernel SVM) Consider the following 2-dimensional data-set, where y denotes the class of each point.
index   x1   x2    y
  1      1    0   -1
  2      0    1   -1
  3      0   -1   -1
  4     -1    0   +1
  5      0    2   +1
  6      0   -2   +1
  7     -2    0   +1
Throughout this question, you may use any desired packages to answer the questions.
(i) Use the transformation $x = (x_1, x_2) \mapsto (\phi_1(x), \phi_2(x))$ where $\phi_1(x) = 2x_2^2 - 4x_1 + 1$ and $\phi_2(x) = x_1^2 - 2x_2 - 3$. What is the equation of the best separating hyper-plane in the new feature space? Provide a plot with the data set and hyperplane clearly shown.
What to submit: a single plot, the equation of the separating hyperplane, a screen shot of your code, a copy of your code in your .py file for this question.
(ii) Fit a hard margin linear SVM to the transformed data-set in the previous part2. What are the estimated values of $(\alpha_1, \ldots, \alpha_7)$? Based on this, which points are the support vectors? What error does your computed SVM achieve?
What to submit: the indices of your identified support vectors, the train error of your SVM, the com- puted α’s (rounded to 3 d.p.), a screen shot of your code, a copy of your code in your .py file for this question.
(iii) Consider now the kernel $k(x, z) = (2 + x^\top z)^2$. Run a hard-margin kernel SVM on the original (untransformed) data given in the table at the start of the question. What are the estimated values of $(\alpha_1, \ldots, \alpha_7)$? Based on this, which points are the support vectors? What error does your computed SVM achieve?
What to submit: the indices of your identified support vectors, the train error of your SVM, the com- puted α’s (rounded to 3 d.p.), a screen shot of your code, a copy of your code in your .py file for this question.
(iv) Provide a detailed argument explaining your results in parts (i), (ii) and (iii). Your argument should explain the similarities and differences in the answers found. In particular, is your answer in (iii) worse than in (ii)? Why? To get full marks, be as detailed as possible, and use mathematical arguments or extra plots if necessary.
What to submit: some commentary and/or plots. If you use any code here, provide a screen shot of your code, and a copy of your code in your .py file for this question.
2If you are using the SVC class in sklearn, to get a hard-margin SVM, you need to set the hyperparameter C to be very large.
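As the footnote above suggests, one way to approximate a hard-margin SVM in sklearn is to set C to a very large value. The sketch below is illustrative only: Phi and y here are placeholder arrays standing in for whatever (transformed) features and labels you construct, and are not the data from the question.

import numpy as np
from sklearn.svm import SVC

# Phi: n x d array of (transformed) features, y: labels in {-1, +1} (placeholder data)
Phi = np.array([[1.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y   = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e10)   # very large C approximates a hard margin
clf.fit(Phi, y)

print(clf.support_)                  # indices of the support vectors
print(np.abs(clf.dual_coef_))        # dual_coef_ stores y_i * alpha_i, so |.| gives the alpha_i's
print(1 - clf.score(Phi, y))         # training error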

(b) (Test Set Linear Regression) Recall that the rMSE (root-MSE) of a hypothesis3 $h$ on a test set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a $p$-dimensional vector and $y_i$ is a real number, is defined as
$$\mathrm{rMSE}(h) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - h(x_i))^2}.$$
For all parts, assume that the xi’s are known to you, but the yi’s are not. Suppose however that you are permitted to query rMSE(h). What this means is that you can query, for any hypothesis h, the rMSE of h on the test set. Suppose further that you know that rMSE(z) = c0, where z is the hypothesis that returns zero for any input, i.e. z(xi) = 0 for each i = 1, . . . , n, and c0 is some arbitrary positive number.
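To make the querying set-up concrete, the following sketch shows one way such an oracle could be simulated; the names rmse_query, X_test and y_test are illustrative and are not given to you in the question.

import numpy as np

def rmse_query(h, X_test, y_test):
    # returns rMSE(h) on the test set; in the question you may call such an
    # oracle for any hypothesis h, but you never observe y_test directly
    preds = np.array([h(x) for x in X_test])
    return np.sqrt(np.mean((y_test - preds) ** 2))

z = lambda x: 0.0          # the zero hypothesis z
# rmse_query(z, X_test, y_test) would return the known constant c0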
(i) Assume you have a set of $T$ hypotheses $H = \{h_1, h_2, \ldots, h_T\}$. You are told that for each $i$, there is a hypothesis $h$ in $H$ such that $h(x_i) = y_i$. In other words, for any point in the test set, there is at least one hypothesis in H that predicts that point correctly4. Suppose that you are permitted to blend predictions of different hypotheses in H5. Design a naive, brute-force algorithm that constructs a hypothesis $g$ from the elements of $H$ such that $\mathrm{rMSE}(g) = 0$. How many queries of rMSE do you need to make? How does your algorithm scale (in the worst case) with the test size $n$? Describe your algorithm in detail.
What to submit: a description of your algorithm and the number of queries required, either typed or handwritten.
(ii) We now consider a better approach than brute-force. For a given hypothesis $h$, define
$$h(X) = (h(x_1), h(x_2), \ldots, h(x_n))^\top, \qquad y = (y_1, y_2, \ldots, y_n)^\top.$$
We can always compute $h(X)$, but we do not know $y$. What is the smallest number of queries required to compute $y^\top h(X)$? Describe your approach in detail.
What to submit: a description of your approach and the number of queries required, either typed or handwritten.
(iii) Given a set of $K$ hypotheses $H = \{h_1, \ldots, h_K\}$, use your result in the previous part to solve
$$\min_{\alpha_1, \ldots, \alpha_K} \mathrm{rMSE}\left( \sum_{k=1}^{K} \alpha_k h_k \right),$$
and obtain the optimal weights $\alpha_1, \ldots, \alpha_K$. Describe your approach in detail, and be sure to detail how many queries are needed and the exact values of the $\alpha$'s, in terms of $X$, $y$ and the elements of $H$.
What to submit: a description of your algorithm and the number of queries required, either typed or handwritten.
3The term hypothesis just means a function or model that takes as input $x$ and returns as output a prediction $h(x) = \hat{y}$.
4Note that this does not imply that there is some hypothesis in H that predicts all points correctly.
5For example, you could construct a blended hypothesis g that returns the predictions of h2 on test points 1 to 5, and the predictions of h5 on points 6 to n.

Question 3
Please submit Question3.pdf on Moodle using the Final Exam – Question 3 object. You must submit a single PDF. You may submit multiple .py files if you wish. The parts are worth 4 + 3 + 8 + 1 + 3 + 2 + 3 + 1 = 25.
In the 9417 group project, many of you applied gradient boosting, which is sometimes regarded as the best out-of-the-box learning algorithm. In this question, we will derive and implement the gradient boosting algorithm from scratch, and apply it to a toy data set. The gradient boosting algorithm can be described as follows: let $F$ be a set of base learning algorithms6, and let $\ell(y, \hat{y})$ be a loss function, where $y$ is the target, and $\hat{y}$ is the predicted value. Note that this set-up is general enough to include both regression and classification algorithms. Suppose we have data $(x_i, y_i)$ for $i = 1, \ldots, n$, which we collect into a single data set $D_0$. We then set the number of desired base learners to $T$, and proceed as follows:
(I) Initialize $f_0(x) = 0$ (i.e. $f_0$ is the zero function).
(II) For $t = 1, 2, \ldots, T$:
(GB1) Compute
$$r_{t,i} = -\left. \frac{\partial}{\partial f(x_i)} \sum_{j=1}^{n} \ell(y_j, f(x_j)) \right|_{f(x_j) = f_{t-1}(x_j),\ j = 1, \ldots, n}$$
for $i = 1, \ldots, n$. We refer to $r_{t,i}$ as the $i$-th pseudo-residual at iteration $t$.
(GB2) Construct a new pseudo data set, $D_t$, consisting of pairs $(x_i, r_{t,i})$ for $i = 1, \ldots, n$.
(GB3) Fit a model to $D_t$ using our base class $F$. That is, we solve
$$h_t = \operatorname*{arg\,min}_{f \in F} \sum_{i=1}^{n} \ell(r_{t,i}, f(x_i)).$$
(GB4) Choose a step-size. This can be done by either of the following methods:
(SS1) Pick a fixed step-size $\alpha_t = \alpha$.
(SS2) Pick a step-size adaptively according to
$$\alpha_t = \operatorname*{arg\,min}_{\alpha} \sum_{i=1}^{n} \ell(y_i, f_{t-1}(x_i) + \alpha h_t(x_i)).$$
(GB5) Take the step
$$f_t(x) = f_{t-1}(x) + \alpha_t h_t(x).$$
(III) Return $f_T$.
We can view this algorithm as either a generalization of AdaBoost covered in lectures/labs, or as performing (functional) gradient descent on the base class $F$. Note that in (GB1), the notation means that after taking the derivative with respect to $f(x_i)$, we replace all occurrences of $f(x_j)$ in the resulting expression with the prediction of the current model, $f_{t-1}(x_j)$, for all $j$. For example,
$$\left. \frac{\partial}{\partial x} \log(x+1) \right|_{x=23} = \left. \frac{1}{x+1} \right|_{x=23} = \frac{1}{24}.$$
6For example, you could take F to be the set of all depth-1 decision trees, often referred to as decision stumps. Alternatively, F could be the set of all depth-2 trees, or the set of all 1-layer neural networks, etc.
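Before the questions below, here is a minimal sketch of the loop (I)–(III), assuming the fixed step-size option (SS1) and decision stumps as the base class (as used later in part (c)). The names fit_gradient_boosting and pseudo_residual are illustrative; the exact form of the pseudo-residual for each loss is precisely what parts (a), (e) and (g) ask you to derive.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, T, alpha, pseudo_residual):
    # generic loop (I)-(III) with a fixed step-size alpha (option SS1);
    # pseudo_residual(y, f_pred) should return the r_{t,i} of step (GB1) for the chosen loss
    learners = []
    f_pred = np.zeros(len(y))                              # (I) f_0 is the zero function
    for t in range(T):                                     # (II)
        r = pseudo_residual(y, f_pred)                     # (GB1)
        h = DecisionTreeRegressor(max_depth=1).fit(X, r)   # (GB2) + (GB3), stump base learner
        f_pred = f_pred + alpha * h.predict(X)             # (GB4)/(SS1) + (GB5)
        learners.append(h)
    # (III) the returned model is f_T(x) = alpha * sum_t h_t(x)
    return lambda X_new: alpha * sum(h.predict(X_new) for h in learners)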

(a) Consider the regression setting where we allow the $y$-values in our data set to be real numbers. Suppose that we use the squared error loss $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$. For round $t$ of the algorithm, show that $r_{t,i} = y_i - f_{t-1}(x_i)$. Then, write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it).
What to submit: your working out, either typed or handwritten.
(b) Using the same setting as in the previous part, show that the adaptive approach (SS2) leads to a step-size:
$$\alpha_t = \frac{\sum_{i=1}^{n} h_t(x_i)\,(y_i - f_{t-1}(x_i))}{\sum_{i=1}^{n} (h_t(x_i))^2}.$$
What to submit: your working out, either typed or handwritten.
(c) We will now implement gradient boosting on a toy data set from scratch, and we will use the class of decision stumps (depth 1 decision trees) as our base class $F$, and squared error loss as in the previous parts. In your implementation, you may make use of sklearn.tree.DecisionTreeRegressor, but all other code must be your own7. The following code generates the data and demonstrates plotting the predictions of a fitted decision tree (more details in q3.py):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# f and f_sampler come from q3.py

np.random.seed(123)
X, y = f_sampler(f, 160, sigma=0.2)
X = X.reshape(-1, 1)

fig = plt.figure(figsize=(7, 7))
dt = DecisionTreeRegressor(max_depth=2).fit(X, y)  # example model
xx = np.linspace(0, 1, 1000)
plt.plot(xx, f(xx), alpha=0.5, color='red', label='truth')
plt.scatter(X, y, marker='x', color='blue', label='observed')
plt.plot(xx, dt.predict(xx.reshape(-1, 1)), color='green', label='dt')  # plotting example model
plt.legend()
plt.show()

The figure generated is shown below. [Figure: scatter of the observed data, the true function f, and the example decision tree fit.]

7You can use NumPy and matplotlib, but do not use an existing implementation of gradient boosting.

Your task is to generate a 5 x 2 figure of subplots showing the predictions of your fitted gradient boosted model. There are 10 subplots in total; the first should show the model with 5 base learners, the second subplot should show it with 10 base learners, etc. The last subplot should be the gradient boosted model with 50 base learners. Each subplot should include the scatter of data, as well as a plot of the true model (basically, the same as the plot provided above but with your implemented gradient boosted model in place of dt). Comment on your results: what happens as the number of base learners is increased? You should do this two times (two 5 x 2 plots), once with the adaptive step-size, and the other with the step-size taken to be $\alpha = 0.1$ fixed throughout. There is no need to split into train and test data here. Comment on the differences between your fixed and adaptive step-size implementations. How does your model perform on different parts of the data?
What to submit: two 5 x 2 plots, one for adaptive and one for fixed step size, some commentary, and a screen shot of your code and a copy of your code in your .py file.
(d) Repeat the analysis in the previous question but with depth 2 decision trees as base learners instead. Provide the same plots. What do you notice for the adaptive case? What about the non-adaptive case?
What to submit: two 5 x 2 plots, one for adaptive and one for fixed step size, some commentary, and a copy of your code in your .py file.
(e) Now, consider the classification setting where $y$ is taken to be an element of $\{-1, 1\}$. We consider the following classification loss: $\ell(y, \hat{y}) = \log(1 + e^{-y\hat{y}})$. For round $t$ of the algorithm, what is $r_{t,i}$? Write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it).
What to submit: your working out, either typed or handwritten.
(f) Using the same setting as in the previous part, write down an expression for αt using the adaptive approach in (SS2). Can you solve for αt in closed form? Explain.
What to submit: your working out, either typed or handwritten, and some commentary.

(g) Consider a different classification loss: $\ell(y, \hat{y}) = e^{-y\hat{y}}$. For round $t$ of the algorithm, what is $r_{t,i}$? Write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it). Further, can you find an expression for $\alpha_t$ in (SS2) for this setting? Explain.
What to submit: your working out, either typed or handwritten.
(h) In practice, if you cannot solve for $\alpha_t$ exactly, explain how you might implement gradient boosting. Assume that using a constant step-size is not a valid alternative. Be as specific as possible in your answer. What, if any, are the additional computational costs of your approach relative to using a constant step-size?
What to submit: some commentary.
Question 4
Please submit Question4.pdf on Moodle using the Final Exam – Question 4 object. You must submit a single PDF. You may submit multiple .py files if you wish. The parts are worth (2+3) + (3+3) + (2+3+3) + (1+1+4) = 5 + 6 + 8 + 6 = 25.
Throughout this question, try to keep your answers as brief as possible. For discussion questions you need to provide 2-3 sentences of explanation at most. For calculation questions, please circle your final answers and show all working.
(a) (Perceptron Power)
(i) Consider the following Boolean functions f1 , f2 , . . . , f5 which each take two binary inputs x1 and x2, and return a binary output. Their outputs can be represented in the following table (each column represents the output of fi on the corresponding inputs x1 and x2.)
x1   x2   f1   f2   f3   f4   f5
 0    0    0    0    0    0    1
 0    1    0    1    1    0    0
 1    0    0    0    1    1    0
 1    1    0    1    0    0    1
Which of the functions can be learned exactly by a perceptron? Explain why.
What to submit: some commentary.
(ii) Consider the following two layer network which implements the XOR function.
The numbers on the edges represent weights, and the numbers inside the nodes represent thresholds. The nodes return 1 if the input is larger than the threshold value. For example, the top node in the middle layer takes as input $w_1 x_1 + w_2 x_2 = x_1 - x_2$, and if this is larger than the threshold $b_1 = 0.5$, outputs a value of 1, and 0 otherwise. You should check that this network implements the XOR function by testing it for all possible boolean inputs $(x_1, x_2)$, and comparing it to the XOR truth table. It turns out that a more efficient representation of the XOR function can be constructed, which is of the following form:

Provide weights and thresholds w1 , . . . , w5 and b1 , b2 that implement the XOR function using this new architecture. Demonstrate that the network works by computing its output on all possible inputs. Here, your weights w1 , . . . , w5 must be whole numbers, e.g. ±1, ±2, ±3, . . . , and your thresholds b1 , b2 should be chosen from 0.5, 1, 1.5, 2, 2.5, . . . . Try to find the smallest thresholds/weights that work.
What to submit: the weights and thresholds for the new network, along with a demonstration that the network does what it is meant to.
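If you would like to check a candidate network programmatically, a sketch along the following lines can be used. The weights below are assumptions chosen to reproduce the original two-layer network described above (only the top hidden unit was specified explicitly; the bottom hidden unit and the output unit are assumed here); for this part you would substitute your own $w_1, \ldots, w_5$ and $b_1, b_2$ for the new architecture.

import itertools

def step(v, threshold):
    # threshold unit: returns 1 if the input exceeds the threshold, else 0
    return 1 if v > threshold else 0

def two_layer_xor(x1, x2):
    # assumed weights for the original two-layer network described in the text:
    # top hidden unit computes x1 - x2 (threshold 0.5); the bottom unit is
    # assumed to compute x2 - x1 (threshold 0.5); the output unit is assumed
    # to fire when h1 + h2 exceeds 0.5
    h1 = step(x1 - x2, 0.5)
    h2 = step(x2 - x1, 0.5)
    return step(h1 + h2, 0.5)

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, two_layer_xor(x1, x2))   # should match the XOR truth table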
(b) (K-means) Note that the sub-questions in this question are independent of each other.
(i) Consider the following two-dimensional points: {(5, 3), (4, 4), (6, 3), (5, 4), (2, 3)}. You run K-means (using Euclidean distance) on this dataset with K = 2 and initial cluster centers C0 = {(5, 2), (4, 5)}, where the subscript 0 denotes that these are the cluster centers at time 0. Compute C1 and C2, the cluster centers after 1 and 2 iterations of the K-means algorithm. Be sure to detail your working.
What to submit: your cluster centers and any working, either typed or handwritten.
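For reference, a minimal sketch of one K-means pass (assignment step, then mean update) is given below. The points and centers here are hypothetical and deliberately different from those in the question; they only illustrate the mechanics you should carry out by hand.

import numpy as np

# hypothetical data and initial centers -- NOT the points from the question
points  = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
centers = np.array([[1.0, 1.0], [5.0, 7.0]])

for _ in range(2):  # two iterations, analogous to computing C1 and C2
    # assignment step: each point goes to its nearest center (Euclidean distance)
    dists  = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each center moves to the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(len(centers))])
    print(centers)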
(ii) Your friend, who studied accounting, has just been promoted to the role of Data Scientist at their consulting company. They excitedly tell you that they're working on a clustering problem. You ask for more details and they tell you they have an unlabelled dataset with p = 10000 features and that they ran K-means clustering using Euclidean distance. They identified 52 clusters and managed to define labellings for these clusters based on their expert domain knowledge. What do you think about the usage of K-means here? Do you have any criticisms?
What to submit: some commentary.
(c) (Learning Theory) Note that the sub-questions in this question are independent of each other.
(i) Consider a hypothesis space H consisting of 1000 hypotheses. At least how many samples must you provide to the learning algorithm in order to ensure that, with probability at least 0.98, the learner will output a hypothesis in H with true accuracy at least 0.95?
What to submit: your answer, and any working either typed or handwritten.
(ii) Consider a hypothesis class F with VC dimension equal to 4. Consider the following statements:
(I) Given a dataset of 4 samples, there is at least one classifier in F that can achieve an error of zero on these points.
(II) Classifiers in F will always achieve non-zero error on any dataset of size 5.
(III) Since the VC dimension of F is finite, the size of F must be finite.
Which of the statements are true? Which are false? Explain why or provide a counter-example.
What to submit: your answer, and any working either typed or handwritten.
(iii) Consider the hypothesis class $V$ which contains all hypotheses of the form
$$v_{a,b}(x) = \begin{cases} 1 & \text{if } x \in [a, b] \\ 0 & \text{otherwise.} \end{cases}$$

So the classifier assigns a positive label to points inside the interval [a, b], and zero to anything outside the interval. The class V contains all such classifiers (i.e. one classifier for each interval [a, b] for every pair of real numbers a and b). What is the VC dimension of V ? Explain.
What to submit: your answer, and any working either typed or handwritten.
(d) (Tree Construction) Consider the data set below, with features W, X and Z, and target Y.
index   Y   W   X   Z
  1     0   A   0   0
  2     0   B   1   1
  3     0   B   1   1
  4     1   B   2   0
  5     1   A   2   0
  6     0   A   0   1
  7     1   A   2   1
  8     1   B   1   0
  9     1   B   1   0
 10     0   A   0   1

Throughout this question, use log2 in your calculations.
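For reference, the definitions assumed throughout this part are the standard ones from lectures, stated here only to fix notation:
$$H(Y) = -\sum_{c} p(Y = c)\,\log_2 p(Y = c), \qquad \mathrm{IG}(Y, X) = H(Y) - \sum_{v} \frac{n_v}{n}\, H(Y \mid X = v),$$
where $n_v$ is the number of examples with $X = v$.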
(i) What is the entropy of Y ?
What to submit: your answer, and any working either typed or handwritten.
(ii) Suppose that feature X is chosen for the root of a decision tree. What is the information gain associated with this attribute?
What to submit: your answer, and any working either typed or handwritten.
(iii) Draw the full decision tree (without pruning) and with X as the feature chosen for the root node. Label your tree clearly and outline the prediction made at each leaf node. What can you say about feature W ? Show all working.
What to submit: your answer, and any working either typed or handwritten.