COMP9417 Final Exam Q3

Question 3
Please submit Question3.pdf on Moodle using the Final Exam – Question 3 object. You must submit a single PDF. You may submit multiple .py files if you wish. The parts are worth 4 + 3 + 8 + 1 + 3 + 2 + 3 + 1 = 25.
In the 9417 group project, many of you applied gradient boosting, which is sometimes regarded as the best out-of-the-box learning algorithm. In this question, we will derive and implement the gradient boosting algorithm from scratch, and apply it to a toy data set. The gradient boosting algorithm can be described as follows: let $\mathcal{F}$ be a set of base learning algorithms$^6$, and let $\ell(y, \hat{y})$ be a loss function, where $y$ is the target and $\hat{y}$ is the predicted value. Note that this set-up is general enough to include both regression and classification algorithms. Suppose we have data $(x_i, y_i)$ for $i = 1, \dots, n$, which we collect into a single data set $D_0$. We then set the number of desired base learners to $T$, and proceed as follows:
(I) Initialize $f_0(x) = 0$ (i.e. $f_0$ is the zero function).
(II) For $t = 1, 2, \dots, T$:
(GB1) Compute:
$$ r_{t,i} = -\left[ \frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_j) = f_{t-1}(x_j),\; j = 1, \dots, n} $$
for $i = 1, \dots, n$. We refer to $r_{t,i}$ as the $i$-th pseudo-residual at iteration $t$.
(GB2) Construct a new pseudo data set, $D_t$, consisting of pairs: $(x_i, r_{t,i})$ for $i = 1, \dots, n$.
(GB3) Fit a model to $D_t$ using our base class $\mathcal{F}$. That is, we solve
$$ h_t = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(r_{t,i}, f(x_i)). $$
(GB4) Choose a step-size. This can be done by either of the following methods:
(SS1) Pick a fixed step-size $\alpha_t = \alpha$, or
(SS2) Pick a step-size adaptively according to
$$ \alpha_t = \arg\min_{\alpha} \sum_{i=1}^{n} \ell(y_i, f_{t-1}(x_i) + \alpha h_t(x_i)). $$
(GB5) Take the step
$$ f_t(x) = f_{t-1}(x) + \alpha_t h_t(x). $$
(III) Return $f_T$.
We can view this algorithm as either a generalization of AdaBoost covered in lectures/labs, or as performing (functional) gradient descent on the base class $\mathcal{F}$. Note that in (GB1), the notation means that after taking the derivative with respect to $f(x_i)$, we replace all occurrences of $f(x_j)$ in the resulting expression with the prediction of the current model, $f_{t-1}(x_j)$, for all $j$. For example,
$$ \left.\frac{\partial}{\partial x}\log(x+1)\right|_{x=23} = \left.\frac{1}{x+1}\right|_{x=23} = \frac{1}{24}. $$
$^6$For example, you could take $\mathcal{F}$ to be the set of all depth-1 decision trees, often referred to as decision stumps. Alternatively, $\mathcal{F}$ could be the set of all depth-2 trees, or the set of all 1-layer neural networks, etc.
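To make steps (I)–(III) concrete before the questions below, here is a minimal, purely illustrative sketch of the loop for squared-error loss with a fixed step size (SS1), using decision stumps as the base class. It is a sketch only, not a model solution: the names gradient_boost_sketch and f_T are placeholders we have introduced, and your submission for part (c) must be your own implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_sketch(X, y, T=10, alpha=0.1):
    # (I) f_0 is the zero function, so its predictions on the training points are all 0
    pred = np.zeros(len(y), dtype=float)
    learners = []
    for t in range(T):  # (II)
        r = y - pred  # (GB1) pseudo-residuals for squared-error loss (cf. part (a))
        h = DecisionTreeRegressor(max_depth=1).fit(X, r)  # (GB2)+(GB3) fit a stump to (x_i, r_{t,i})
        pred = pred + alpha * h.predict(X)  # (GB4, SS1) + (GB5) with fixed alpha_t = alpha
        learners.append(h)
    # (III) return f_T(x) = sum_t alpha_t * h_t(x)
    def f_T(X_new):
        return sum(alpha * h.predict(X_new) for h in learners)
    return f_T

For example, f_hat = gradient_boost_sketch(X, y, T=50) returns a function that can be evaluated on new inputs via f_hat(X_new).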

(a) Consider the regression setting where we allow the $y$-values in our data set to be real numbers. Suppose that we use squared error loss $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$. For round $t$ of the algorithm, show that $r_{t,i} = y_i - f_{t-1}(x_i)$. Then, write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it).
What to submit: your working out, either typed or handwritten.
(b) Using the same setting as in the previous part, show that the adaptive approach (SS2) leads to a step-size:
$$ \alpha_t = \frac{\sum_{i=1}^{n} h_t(x_i)\,(y_i - f_{t-1}(x_i))}{\sum_{i=1}^{n} (h_t(x_i))^2}. $$
What to submit: your working out, either typed or handwritten.
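A quick way to sanity-check a formula of this kind is to compare it numerically against a brute-force minimisation of the (SS2) objective. The sketch below uses synthetic arrays r (standing in for $y_i - f_{t-1}(x_i)$) and h (standing in for $h_t(x_i)$); both arrays and the grid-search approach are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(size=50)   # stand-in for y_i - f_{t-1}(x_i)
h = rng.normal(size=50)   # stand-in for h_t(x_i)

alpha_closed = np.sum(h * r) / np.sum(h ** 2)  # the closed form stated above

# brute force: evaluate 0.5 * sum_i (r_i - alpha * h_i)^2 on a fine grid of alpha values
alphas = np.linspace(-5, 5, 10001)
objective = 0.5 * np.sum((r[None, :] - alphas[:, None] * h[None, :]) ** 2, axis=1)
alpha_grid = alphas[np.argmin(objective)]

print(alpha_closed, alpha_grid)  # the two values should agree up to the grid resolution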
(c) We will now implement gradient boosting on a toy data set from scratch. We will use the class of decision stumps (depth-1 decision trees) as our base class ($\mathcal{F}$), and squared error loss as in the previous parts. In your implementation, you may make use of sklearn.tree.DecisionTreeRegressor, but all other code must be your own$^7$. The following code generates the data and demonstrates plotting the predictions of a fitted decision tree (more details in q3.py):
# f and f_sampler are assumed to be defined as in the provided q3.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

np.random.seed(123)
X, y = f_sampler(f, 160, sigma=0.2)
X = X.reshape(-1, 1)
fig = plt.figure(figsize=(7, 7))
dt = DecisionTreeRegressor(max_depth=2).fit(X, y)  # example model
xx = np.linspace(0, 1, 1000)
plt.plot(xx, f(xx), alpha=0.5, color='red', label='truth')
plt.scatter(X, y, marker='x', color='blue', label='observed')
plt.plot(xx, dt.predict(xx.reshape(-1, 1)), color='green', label='dt')  # plotting example model
plt.legend()
plt.show()

The figure generated is: [plot of the true function ('truth'), the observed data ('observed'), and the example decision tree's predictions ('dt')].

$^7$You can use NumPy and matplotlib, but do not use an existing implementation of gradient boosting.
Your task is to generate a 5 x 2 figure of subplots showing the predictions of your fitted gradient boosted model. There are 10 subplots in total; the first should show the model with 5 base learners, the second subplot should show it with 10 base learners, etc. The last subplot should be the gradient boosted model with 50 base learners. Each subplot should include the scatter of data, as well as a plot of the true model (basically, the same as the plot provided above but with your implemented gradient boosted model in place of dt). Comment on your results: what happens as the number of base learners is increased? You should do this two times (two 5 x 2 plots), once with the adaptive step size, and the other with the step size taken to be $\alpha = 0.1$ fixed throughout. There is no need to split into train and test data here. Comment on the differences between your fixed and adaptive step-size implementations. How does your model perform on different parts of the data?
What to submit: two 5 x 2 plots, one for adaptive and one for fixed step size, some commentary, and a screen shot of your code and a copy of your code in your .py file.
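For the 5 x 2 figure in part (c), matplotlib's plt.subplots interface is one convenient way to lay out the ten panels. The skeleton below is only a sketch: predict_boosted is a hypothetical stand-in for your own gradient boosted model's predictions with a given number of base learners, and f, X, y are assumed to come from the provided q3.py setup.

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(5, 2, figsize=(10, 20))
xx = np.linspace(0, 1, 1000)

for ax, n_learners in zip(axes.flat, range(5, 55, 5)):  # 5, 10, ..., 50 base learners
    ax.plot(xx, f(xx), alpha=0.5, color='red', label='truth')
    ax.scatter(X, y, marker='x', color='blue', label='observed')
    # predict_boosted(...) is a placeholder for your own implementation's predictions
    ax.plot(xx, predict_boosted(xx.reshape(-1, 1), n_learners), color='green',
            label=f'{n_learners} learners')
    ax.legend()

plt.tight_layout()
plt.show()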
(d) Repeat the analysis in the previous question but with depth-2 decision trees as base learners instead. Provide the same plots. What do you notice for the adaptive case? What about the non-adaptive case?
What to submit: two 5 x 2 plots, one for adaptive and one for fixed step size, some commentary, and a copy of your code in your .py file.
(e) Now, consider the classification setting where $y$ is taken to be an element of $\{-1, 1\}$. We consider the following classification loss: $\ell(y, \hat{y}) = \log(1 + e^{-y\hat{y}})$. For round $t$ of the algorithm, what is $r_{t,i}$? Write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it).
What to submit: your working out, either typed or handwritten.
(f) Using the same setting as in the previous part, write down an expression for $\alpha_t$ using the adaptive approach in (SS2). Can you solve for $\alpha_t$ in closed form? Explain.
What to submit: your working out, either typed or handwritten, and some commentary.
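If the minimisation in (SS2) cannot be solved in closed form, it can still be carried out approximately by a one-dimensional numerical search at each round. The sketch below illustrates this idea for the logistic loss of part (e) using a simple grid search; the function name adaptive_step_logistic and its argument names are our own, and the arrays y (labels in {-1, 1}), pred_prev (the values $f_{t-1}(x_i)$) and h_vals (the values $h_t(x_i)$) are assumed to be given.

import numpy as np

def adaptive_step_logistic(y, pred_prev, h_vals, alphas=np.linspace(0.0, 5.0, 501)):
    # approximate argmin_alpha sum_i log(1 + exp(-y_i * (f_{t-1}(x_i) + alpha * h_t(x_i))))
    # by evaluating the objective on a grid of candidate step sizes
    margins = y[None, :] * (pred_prev[None, :] + alphas[:, None] * h_vals[None, :])
    objective = np.sum(np.log1p(np.exp(-margins)), axis=1)
    return alphas[np.argmin(objective)]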

(g) Consider a different classification loss: $\ell(y, \hat{y}) = e^{-y\hat{y}}$. For round $t$ of the algorithm, what is $r_{t,i}$? Write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it). Further, can you find an expression for $\alpha_t$ in (SS2) for this setting? Explain.
What to submit: your working out, either typed or handwritten.
(h) In practice, if you cannot solve for $\alpha_t$ exactly, explain how you might implement gradient boosting. Assume that using a constant step-size is not a valid alternative. Be as specific as possible in your answer. What, if any, are the additional computational costs of your approach relative to using a constant step size?
What to submit: some commentary.