COMP9417 exam

COMP9417: A collection of sample exam exercises August 9, 2022
Note to student: Some of these questions are longer/more difficult than the questions that will be on the actual final exam. With that being said, you should aim to work through and understand all questions here as part of your revision. Note that this is not a sample final exam; it is a collection of possible exercises, and the actual final exam will be shorter.
Question 1
Note: this question was incorporated into the Ensemble learning lab and the code there is much improved. Refer to that lab for solutions. This question requires you to implement the Adaptive Boosting algorithm from lectures. Use the following code to generate a toy binary classification dataset:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs
np.random.seed(2)
n_points = 15
X, y = make_blobs(n_points, 2, centers=[(0,0), (-1,1)])
y[y==0] = -1 # use -1 for negative class instead of 0
plt.scatter(*X[y==1].T, marker="+", s=100, color="red")
plt.scatter(*X[y==-1].T, marker="o", s=100, color="blue")
plt.show()
Your data should look like:

(a) By now, you will be familiar with the scikit-learn DecisionTreeClassifier class. Fit decision trees of increasing maximum depth, for depths ranging from 1 to 9. Plot the decision boundaries of each of your models in a 3 × 3 grid. You may find the following helper function useful:
def plotter(classifier, X, y, title, ax=None):
    # plot decision boundary for given classifier
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    if ax is not None:
        ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        ax.scatter(X[:, 0], X[:, 1], c=y)
        ax.set_title(title)
    else:
        plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        plt.scatter(X[:, 0], X[:, 1], c=y)
        plt.title(title)
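The full solution is in the Ensemble learning lab; a minimal sketch of part (a) might look like the following (assuming the X, y and plotter defined above):

fig, axs = plt.subplots(3, 3, figsize=(10, 10))
for depth, ax in zip(range(1, 10), axs.flat):
    # fit a tree of the given maximum depth and draw its decision boundary
    dt = DecisionTreeClassifier(max_depth=depth)
    dt.fit(X, y)
    plotter(dt, X, y, f"max_depth = {depth}", ax=ax)
plt.tight_layout()
plt.show()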
(b) We now restrict attention to trees of depth 1. These are the most basic decision trees and are commonly referred to as decision stumps. Consider the adaptive boosting algorithm presented in the ensemble methods lecture notes (slide 50/70). In adaptive boosting, we build a model composed of T weak learners from a set of weak learners. At step t, we pick the model from the set of weak learners that minimises the weighted error

ε_t = Σ_{i=1}^n w_{t-1,i} I{y_i ≠ ŷ_i}

where w_{t-1,i} is the weight at the previous step for observation i, and I{y_i ≠ ŷ_i} is equal to 1 if y_i ≠ ŷ_i and zero otherwise. We do this for a total of T steps, which gives us a boosted model composed of T base classifiers:
M(x) = Σ_{t=1}^T α_t M_t(x)
where α_t is the weight assigned to the t-th model. Classification is then carried out by assigning a point to the positive class if M(x) > 0, or to the negative class if M(x) < 0. Here we will take the class of weak learners to be the class of decision stumps. You may make use of the sample_weight argument in the fit() method to assign weights to the individual data points. Write code to build a boosted classifier for T = 15. Demonstrate the performance of your model on the generated dataset by printing out a list of your predictions versus the true class labels. (Note: you may be concerned that the decision tree implementation in scikit-learn does not actually minimise ε_t even when weights are assigned, but we will ignore this detail for the current question.)

(c) In this question, we will extend our implementation in (b) to be able to use the plotter function in (a). To do this, we need to implement a boosting model class that has a predict method. Once you do this, repeat (b) for T = 2, …, 17. Plot the decision boundaries of your 16 models in a 4 × 4 grid. The following template may be useful:

class boosted_model:
    def __init__(self, T):
        self.alphas = # your code here
        # your code here

    def predict(self, x):
        # your code here

(d) Discuss the differences between bagging and boosting.
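Full solutions are again in the Ensemble learning lab; below is a rough sketch of what parts (b) and (c) are asking for. It assumes the standard AdaBoost model weight α_t = (1/2) ln((1 − ε_t)/ε_t) from the lecture slides, and the boosted_model name comes from the template above; this is an illustrative sketch, not the lab's polished implementation:

class boosted_model:
    def __init__(self, T):
        self.T = T
        self.alphas = []   # model weights alpha_t
        self.models = []   # fitted stumps M_t

    def fit(self, X, y):
        # y must be in {-1, +1}; w holds the observation weights w_{t-1,i}
        n = X.shape[0]
        w = np.ones(n) / n
        for t in range(self.T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            eps = np.sum(w * (pred != y)) / np.sum(w)  # weighted error
            alpha = 0.5 * np.log((1 - eps) / eps)      # model weight alpha_t
            w = w * np.exp(-alpha * y * pred)          # reweight observations
            w = w / np.sum(w)
            self.models.append(stump)
            self.alphas.append(alpha)

    def predict(self, x):
        # sign of the weighted vote of the T base classifiers
        out = sum(a * m.predict(x) for a, m in zip(self.alphas, self.models))
        return np.sign(out)

For part (b), something like model = boosted_model(T=15); model.fit(X, y); print(model.predict(X), y) then demonstrates the fit, and the predict method makes the class usable with the plotter function for part (c).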
Question 2

Note: this question is actually a bit longer and (slightly) more difficult than an exam question, but it is good practice. In this question, you will be required to manually implement gradient descent in Python to learn the parameters of a linear regression model. You will make use of the real_estate.csv dataset, which you can download directly from the course Moodle page. The dataset contains 414 real estate records, each of which contains the following features:

• transactiondate: date of transaction
• age: age of property
• nearestMRT: distance of property to the nearest MRT station
• nConvenience: number of convenience stores in nearby locations
• latitude
• longitude

The target variable is the property price. The goal is to learn to predict property prices as a function of a subset of the above features.

(a) A number of pre-processing steps are required before we attempt to fit any models to the data. First remove any rows of the data that contain a missing ('NA') value. List the indices of the removed data points. Then, delete all features from the dataset apart from: age, nearestMRT and nConvenience.

The indices of rows with missing values are: [19, 41, 109, 144, 230, 301].

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("real_estate.csv")
df = df.dropna()
X = df[["age", "nearestMRT", "nConvenience"]].values

(b) An important pre-processing step is feature normalisation. Feature normalisation involves rescaling the features such that they all have similar scales. This is also important for algorithms like gradient descent to ensure the convergence of the algorithm. One common technique is called min-max normalisation, in which each feature is scaled to the range [0, 1]. To do this, for each feature we must find the minimum and maximum values in the available sample, and then use the following formula to make the transformation:

x_new = (x − min(x)) / (max(x) − min(x)).

After applying this normalisation, the minimum value of your feature will be 0, and the maximum will be 1. For each of the features, provide the mean value over your dataset.

The mean values for the three features are: [0.40607933, 0.16264268, 0.4120098].

# preprocessing
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
y = df["price"].values

(c) Now that the data is pre-processed, we will create train and test sets to build and then evaluate our models on. Use the first half of observations (in the same order as in the original csv file, excluding those you removed in the previous question) to create the training set, and the remaining half for the test set. Print out the first and last rows of both your training and test sets.

• first row Xtrain: [0.73059361, 0.00951267, 1.]
• last row Xtrain: [0.87899543, 0.09926012, 0.3]
• first row Xtest: [0.26255708, 0.20677973, 0.1]
• last row Xtest: [0.14840183, 0.0103754, 0.9]
• first row ytrain: 37.9
• last row ytrain: 34.2
• first row ytest: 26.2
• last row ytest: 63.9

# train/test split
X = np.concatenate((np.ones([len(X), 1]), X), axis=1)  # prepend a column of ones (bias term)
split_point = X.shape[0] // 2
Xtrain = X[:split_point]
Xtest = X[split_point:]
ytrain = y[:split_point]
ytest = y[split_point:]

# printing
print("first row Xtrain: ", Xtrain[0])
print("last row Xtrain: ", Xtrain[-1])
print("first row Xtest: ", Xtest[0])
print("last row Xtest: ", Xtest[-1])
print("first row ytrain: ", ytrain[0])
print("last row ytrain: ", ytrain[-1])
print("first row ytest: ", ytest[0])
print("last row ytest: ", ytest[-1])

(d) Consider the loss function

L_c(x, y) = √( (1/c²)(x − y)² + 1 ) − 1,

where c ∈ R is a hyper-parameter. Consider the (simple) linear model

ŷ^(i) = w_0 + w_1 x_1^(i) + w_2 x_2^(i) + w_3 x_3^(i),   i = 1, …, n.

We can write this more succinctly by letting w = (w_0, w_1, w_2, w_3)^T and X^(i) = (1, x_1^(i), x_2^(i), x_3^(i))^T, so that ŷ^(i) = w^T X^(i). The mean loss achieved by our model (w) on a given dataset of n observations is then

L_c(y, ŷ) = (1/n) Σ_{i=1}^n L_c(y^(i), ŷ^(i)) = (1/n) Σ_{i=1}^n [ √( (1/c²)(y^(i) − ⟨w, X^(i)⟩)² + 1 ) − 1 ].

Compute the following derivatives:

∂L_c(y^(i), ŷ^(i)) / ∂w_k,   k = 0, 1, 2, 3.

You must show your working for full marks.
After applying the chain rule twice, we get a result similar to the L2 loss case. Writing L_c = √(z + 1) − 1 with z = (1/c²)(y^(i) − ⟨w, X^(i)⟩)², we have ∂L_c/∂w_k = (1 / (2√(z + 1))) · ∂z/∂w_k, and ∂z/∂w_k = (2/c²)(⟨w, X^(i)⟩ − y^(i)) x_k^(i), so that

∂L_c(y^(i), ŷ^(i)) / ∂w_k = x_k^(i) (⟨w, X^(i)⟩ − y^(i)) / ( c² √( (1/c²)(⟨w, X^(i)⟩ − y^(i))² + 1 ) ),   k = 0, 1, 2, 3,

where x_0^(i) = 1 (the bias component of X^(i)).
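This derivative is easy to sanity-check numerically with finite differences; the values of c, w, X^(i) and y^(i) below are made up purely for illustration:

import numpy as np

c = 2.0
w = np.array([1.0, 0.5, -0.3, 0.2])
Xi = np.array([1.0, 0.4, 0.1, 0.7])   # X^(i), with leading 1 for the bias
yi = 1.3

loss = lambda v: np.sqrt((1/c**2) * (yi - v @ Xi)**2 + 1) - 1
grad = Xi * (w @ Xi - yi) / (c**2 * np.sqrt((1/c**2) * (w @ Xi - yi)**2 + 1))

# central finite differences in each coordinate
eps = 1e-6
numeric = np.array([(loss(w + eps * np.eye(4)[k]) - loss(w - eps * np.eye(4)[k])) / (2 * eps)
                    for k in range(4)])
print(np.allclose(grad, numeric))  # True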
(e) For some value of c > 0, we have

∂L_c(y^(i), ŷ^(i)) / ∂w_k = x_k^(i) (⟨w^(t), X^(i)⟩ − y^(i)) / ( 2 √( (⟨w^(t), X^(i)⟩ − y^(i))² + 4 ) ),   k = 0, 1, 2, 3.
Using this result, write down the gradient descent updates for w_0, w_1, w_2, w_3 (using pseudocode), assuming a step size of η. Note that in gradient descent we consider the loss over the entire dataset, not just at a single observation. Further, provide pseudocode for stochastic gradient descent updates. Here, w^(t) denotes the weight vector at the t-th iteration of gradient descent.
We have the following updates for GD, for t = 1, …, T:

w_k^(t+1) = w_k^(t) − (η/n) Σ_{i=1}^n x_k^(i) (⟨w^(t), X^(i)⟩ − y^(i)) / ( 2 √( (⟨w^(t), X^(i)⟩ − y^(i))² + 4 ) )

Similarly for SGD, we have the updates

w_k^(t+1) = w_k^(t) − η x_k^(i) (⟨w^(t), X^(i)⟩ − y^(i)) / ( 2 √( (⟨w^(t), X^(i)⟩ − y^(i))² + 4 ) )
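Translating these updates into runnable pseudocode, with eta as the step size and X containing the bias column as its first feature (the function names here are ours, not from the course):

import numpy as np

def gd_updates(X, y, w0, eta, T):
    # full-batch GD: each update averages the gradient over all n observations
    w, n = w0.copy(), len(y)
    for t in range(T):
        r = X @ w - y                                  # residuals <w^(t), X^(i)> - y^(i)
        grad = X.T @ (r / (2 * np.sqrt(r**2 + 4))) / n
        w = w - eta * grad
    return w

def sgd_updates(X, y, w0, eta, n_epochs):
    # SGD: one update per observation, cycling through the data in order
    w = w0.copy()
    for _ in range(n_epochs):
        for i in range(len(y)):
            r = X[i] @ w - y[i]
            w = w - eta * X[i] * r / (2 * np.sqrt(r**2 + 4))
    return w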
(f) In this section, you will implement gradient descent from scratch on the pre-processed dataset, using the gradients computed in part (d) and the pseudocode in part (e). Initialise your weight vector to w^(0) = [1, 1, 1, 1]^T. Consider step sizes
η ∈ {10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01}
(a total of 9 step sizes). For each step-size, generate a plot of the loss achieved at each iteration of gradient descent. You should use 400 iterations in total (which will require you to loop over your training data in order). Generate a 3 × 3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file). The following code may help with plotting:
fig, axs = plt.subplots(3, 3, figsize=(10, 10))
nIter = 400
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
for i, ax in enumerate(axs.flat):
    # losses is a list of 9 elements; each element is an array of length
    # nIter storing the loss at each iteration for that particular step size
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")  # plot titles
plt.tight_layout()  # plot formatting
plt.show()
The plot should look like:

If the student did not figure out that the provided gradient corresponds to the case c = 2, then they were told they could use either c = 1 or c = 2. This affects their loss values, even though they are using the correct gradients, and in that case their plots would look like:

Example code is as follows:
def calc_loss(y, y_pred, c=2):
    return np.mean(np.sqrt((1/c**2) * (y - y_pred)**2 + 1) - 1)

def calc_grad(X_train, y_train, w, c=2):
    Xw = X_train @ w
    const = (Xw - y_train) / (c**2 * np.sqrt((1/c**2) * (Xw - y_train)**2 + 1))
    grad = np.mean(X_train * np.repeat(const, 4).reshape(-1, 4), axis=0)
    return grad

cur_c = 2  # the provided gradient in (e) corresponds to c = 2 (see discussion above)
nmb_iter = 400
w = np.zeros(shape=(nmb_iter+1, 4))
w[0] = np.array([1, 1, 1, 1])
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
losses = []
for alpha in alphas:
    train_loss = np.zeros(shape=(nmb_iter+1))
    train_loss[0] = calc_loss(ytrain, Xtrain @ w[0], c=cur_c)
    for i in range(1, nmb_iter+1):
        w[i] = w[i-1] - alpha * calc_grad(Xtrain, ytrain, w[i-1])
        train_loss[i] = calc_loss(ytrain, Xtrain @ w[i], c=cur_c)
    losses.append(train_loss)

fig, axs = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(axs.flat):
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")
plt.tight_layout()
plt.savefig("GDgrid.png", dpi=500)
plt.show()
(g) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.

η = 10 seems like the best choice.

(h) For this part, take η = 0.3, re-run GD under the same parameter initialisations and number of iterations. On a single plot, plot the progression of each of the four weights over the iterations. Print out the final weight vector. Finally, run your model on the train and test set, and print the achieved losses.
w = [26.11838076, 7.55495429, 0.22091267, 15.88216107]

Rerunning this model gives the following results:
• train loss: 4.089850739201336 / 8.887715476132039 (if c = 1)
• test loss: 3.8275009292188362 / 8.351707785988028 (if c = 1)

The plot of the weights looks like:
w = np.zeros(shape=(nmb_iter, 4))
w[0] = np.array([1, 1, 1, 1])
alpha = 0.3
for i in range(1, nmb_iter):
    w[i] = w[i-1] - alpha * calc_grad(Xtrain, ytrain, w[i-1])
plt.plot(w[:, 0], label="w0")
plt.plot(w[:, 1], label="w1")
plt.plot(w[:, 2], label="w2")
plt.plot(w[:, 3], label="w3")
plt.legend()
plt.savefig("GDweights.png", dpi=400)
plt.show()
wstar = w[nmb_iter-1]
print("train loss: ", calc_loss(ytrain, Xtrain @ wstar, c=cur_c))
print("test loss: ", calc_loss(ytest, Xtest @ wstar, c=cur_c))
(i) We will now re-run the analysis in the previous part, but using stochastic gradient descent (SGD) instead. In SGD, we update the weights after seeing a single observation. Use the same settings as in the gradient descent implementation. For each step-size, generate a plot of the loss achieved at each iteration of stochastic gradient descent, which is to be run for 6 Epochs. An epoch is one pass over the entire training dataset, so in total there should be 6 × ntrain updates, where ntrain is the size of your training data. Be sure to run these updates in the same ordering that the data is stored. Generate a 3 × 3 grid of plots showing performance for each step-size. Add a screen shot of the python code written for this question in your report (as well as pasting the code in to your solutions.py file).
The plot should look like:

If students use c = 1, then the plot looks like:

Example code is as follows:
def calc_grad_i(w, Xi, yi, c=2):
    # compute gradient for a single observation
    xw = Xi @ w
    num = xw - yi
    den = (c**2) * np.sqrt(1 + (1/c**2) * (xw - yi)**2)
    return (num/den) * Xi

epochs = 6
nmb_iter = Xtrain.shape[0] * epochs
w = np.zeros(shape=(nmb_iter, 4))
w[0] = np.array([1, 1, 1, 1])
alphas = [10, 5, 2, 1, 0.5, 0.25, 0.1, 0.05, 0.01]
losses = []
for alpha in alphas:
    train_loss = np.zeros(shape=(nmb_iter))
    train_loss[0] = calc_loss(ytrain, Xtrain @ w[0], c=cur_c)
    for i in range(1, nmb_iter):
        idx = i % Xtrain.shape[0]
        w[i] = w[i-1] - alpha * calc_grad_i(w[i-1], Xtrain[idx], ytrain[idx], c=cur_c)
        train_loss[i] = calc_loss(ytrain, Xtrain @ w[i], c=cur_c)
    losses.append(train_loss)

fig, axs = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(axs.flat):
    ax.plot(losses[i])
    ax.set_title(f"step size: {alphas[i]}")
plt.tight_layout()
plt.savefig("SGDgrid.png", dpi=500)
plt.show()
(j) From your results in the previous part, choose an appropriate step size (and state your choice), and explain why you made this choice.

η = 0.5 seems like a reasonable choice, but η = 0.25 is also acceptable.

(k) Take η = 0.4 and re-run SGD under the same parameter initialisations and number of epochs. On a single plot, plot the progression of each of the four weights over the iterations. Print your final weight vector. Finally, run your model on the train and test set, and record the achieved losses.
w = [30.13933686, −5.58250367, −12.85536393, 24.85002152]

Rerunning this model gives the following results:
• train loss: 2.7805130067939547 / 6.181134362027168 (if c = 1)
• test loss: 2.8446116656442526 / 6.329857966449784 (if c = 1)

The plot of the weights looks like:
w = np.zeros(shape=(nmb_iter, 4))
w[0] = np.array([1, 1, 1, 1])

alpha = 0.4
loss_vec = np.zeros(shape=(nmb_iter))
loss_vec[0] = calc_loss(ytrain, Xtrain @ w[0], c=cur_c)
for i in range(1, nmb_iter):
    idx = i % Xtrain.shape[0]
    w[i] = w[i-1] - alpha * calc_grad_i(w[i-1], Xtrain[idx], ytrain[idx], c=cur_c)
    loss_vec[i] = calc_loss(ytrain, Xtrain @ w[i], c=cur_c)
plt.plot(w[:, 0], label="w0")
plt.plot(w[:, 1], label="w1")
plt.plot(w[:, 2], label="w2")
plt.plot(w[:, 3], label="w3")
plt.legend()
plt.savefig("SGDweights.png", dpi=500)
plt.show()
wstar = w[-1]
print("w: ", wstar)
print("train loss: ", calc_loss(ytrain, Xtrain @ wstar, c=cur_c))
print("test loss: ", calc_loss(ytest, Xtest @ wstar, c=cur_c))
(l) In a few lines, comment on your results in the previous parts (GD and SGD). Explain the importance of the step-size in both GD and SGD. Explain why one might have a preference for GD or SGD. Explain why the GD paths look much smoother than the SGD paths.
This is an open question so any reasonable short answer should be given full marks. Students should mention things along the lines of:
• In general, gradient descent takes steps towards the minimum by moving in the direction of greatest descent. For large step sizes, we may overshoot the minimum, which may result in oscillation or divergence of the algorithm, whereas too small a step size will lead to very slow convergence. We will in general find that GD uses a larger step size relative to SGD, since the direction (gradient) is more reliable in GD than in SGD.
• SGD is an approximation of GD, since in GD we use the entire training data before making a single update, whereas in SGD we make updates after seeing a single training point. The reason we might prefer SGD is when it is too computationally expensive to compute the gradient over the entire dataset, and SGD tends to converge much faster than GD for this reason. We see that in this case, SGD is able to achieve a lower mean loss, though we cannot directly compare the two algorithms here as the number of iterations differs; it would be interesting if a student provided a timing comparison for the two cases, but this is not necessary.
• The SGD paths are much more turbulent because we are estimating the gradient by looking at a single training point, and so the gradient updates are much noisier than when computing the gradient over the entire training set as in standard GD.
• Any other discussion points, or mention of mini-batch GD, or other variants.

Question 3
In this question, we will consider the scikit-learn implementations of the following classifiers:
• Decision Trees
• Random Forest
• AdaBoost
• Logistic Regression
• Multilayer Perceptron (Neural Network)
• Support Vector Machine
You are required to compare the performance of the above models on a classification task. The following code loads in these classifiers and defines a function to simulate a toy dataset:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import time
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
def create_dataset(n=1250, nf=2, nr=0, ni=2, random_state=125):
    """
    generate a new dataset with
    n: total number of samples
    nf: number of features
    nr: number of redundant features (these are linear combinations of
        informative features)
    ni: number of informative features (ni + nr = nf must hold)
    random_state: set for reproducibility
    """
    X, y = make_classification(n_samples=n,
                               n_features=nf,
                               n_redundant=nr,
                               n_informative=ni,
                               random_state=random_state,
                               n_clusters_per_class=2)
    rng = np.random.RandomState(2)
    X += 3 * rng.uniform(size=X.shape)
    X = StandardScaler().fit_transform(X)
    return X, y
(a) Generate a dataset using the default parameters of create_dataset. Then, randomly split the dataset into training set Xtrain and test set Xtest (use the sklearn train_test_split function with random_state=45), with 80% of examples for training and 20% for testing. Fit each of the models to the training data and then plot the decision boundaries of each of the classifiers (using default parameter settings) on the test set. If you prefer, you may use the following plotter function, which plots the decision boundary and works for any sklearn model.
def plotter(classifier, X, X_test, y_test, title, ax=None):
    # plot decision boundary for given classifier
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    if ax is not None:
        ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
        ax.set_title(title)
    else:
        plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
        plt.title(title)
Note that you can use the same general approach for plotting a grid as used in Homework 1, and the plotter function supports an ‘ax’ argument. What to submit: a single 3 × 2 plot, a print screen of the python code used, a copy of your python code in solutions.py.
The students should generate a plot that looks like:

The code used to generate this plot is:
X, y = create_dataset()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=45)
mods = [DecisionTreeClassifier(),
        RandomForestClassifier(),
        AdaBoostClassifier(),
        LogisticRegression(),
        MLPClassifier(),
        SVC()]
titles = ["Decision Tree", "RF", "AdaBoost",
          "Logistic Regression", "Neural Network", "SVM"]
fig, axs = plt.subplots(3, 2, figsize=(10, 10))
for i, ax in enumerate(axs.flat):
    mod = mods[i]
    mod.fit(X_train, y_train)
    plotter(mod, X, X_test, y_test, titles[i], ax)
plt.savefig("figures/boundaries.png")
plt.show()
(b) Next, we will study how the performance of each of the classifiers varies as you increase the size of the training set. Fix your training and test sets from part (a). Then, starting from a random subset (no need to set a seed) of your training set of size 50 (chosen with replacement), train your classification model, and compute the accuracy on the test set. Repeat this process for training set sizes of [50, 100, 200, 300, . . . , 1000]. Repeat the experiment a total of 10 times for each classifier. Then, for each classifier, plot its average accuracy (achieved on the test set) for each training set size. Compare the accuracy across different algorithms in a single figure, and in 5 lines or less, discuss your observations: for the models covered in the course so far, use what you know about the bias-variance decomposition to inform your discussion. Which model would you prefer for this task?
Please use the following color scheme for your plots: [Decision Tree, Random Forest, AdaBoost, Logistic Regression, Neural Network, SVM], and please include a legend in your plot. What to submit: a discussion of your observation, a single plot, a print screen of the python code used, a copy of your python code in solutions.py.
The figure generated is

Discussion points can include:
• trees doing poorly and exhibiting high variance, RF seems to be overfitting
• as expected MLPs and SVMs do quite well without much tuning
• logistic regression is preferred here as it is the simplest model and achieves consistently good results.

Full marks should use the decision surfaces in (a) to argue about overfitting of RFs and AdaBoost relative to the other models. Code is presented at the end of part (c).
(c) Using the time module, record the training time for each of the classifiers at each of the training set sizes. Plot the log of the average training time over the 10 trials of each classifier as a function of the training size (use log base e). You may add code for this section to your code in part (b). What do you observe? In 5 lines or less, discuss your observations. Use the same color scheme as in (b). What to submit: a discussion of your observation, a single plot, a print screen of the python code used, a copy of your python code in solutions.py.
The figure generated is

or, if plotted on a log scale:
Some basic comments about NNs taking more time as you increase the training size. Some discussion of the default MLP number of hidden layers would be good here but not necessary. Code used for both (b) and (c) is provided here:

def train_mod(mod, X_train, y_train, X_test, y_test, train_size):
    idxs = np.random.choice(X_train.shape[0], size=train_size)
    Xp = X_train[idxs]
    yp = y_train[idxs]
    mod.fit(Xp, yp)
    return np.mean(mod.predict(X_test) == y_test)

mod_mean_accuracies = np.zeros(shape=(6, 12))
mod_mean_time = np.zeros(shape=(6, 12))
train_sizes = [25, 50] + [100 + 100*i for i in range(10)]
for mod_idx, mod in enumerate(mods):
    mean_accuracies = np.zeros(shape=(10, 12))
    mean_times = np.zeros(shape=(10, 12))
    for i in range(10):
        for j, tr in enumerate(train_sizes):
            tstart = time.time()
            mean_accuracies[i, j] = train_mod(mod, X_train, y_train,
                                              X_test, y_test, tr)
            tend = time.time()
            mean_times[i, j] = tend - tstart
    mod_mean_accuracies[mod_idx, :] = mean_accuracies.mean(axis=0)
    mod_mean_time[mod_idx, :] = mean_times.mean(axis=0)
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
for i in range(6):
    plt.plot(train_sizes, mod_mean_accuracies[i], label=titles[i])
plt.xlabel("Training Size")
plt.ylabel("Average Accuracy")
plt.legend()
plt.savefig("figures/Accuracy.png")
plt.show()

for i in range(6):
    plt.plot(train_sizes, mod_mean_time[i], label=titles[i])
plt.xlabel("Training Size")
plt.ylabel("Average Time")
plt.legend()
plt.savefig("figures/Timing.png")
plt.show()
Question 4
Refer to the following dataset containing a sample S of ten examples. Each example is described using two Boolean attributes A and B. Each is labelled (classified) by the target Boolean function.
A  B  Class
1  0  +
0  1  -
1  1  -
1  0  +
1  1  -
1  1  -
0  0  +
1  1  +
0  0  +
0  0  -
(a) What is the entropy of these examples with respect to the given classification?
(b) What is the Information gain of attribute A on sample S above?
The classes are split equally (5 in each class), so the entropy is

H(S) = −p_+ log_2 p_+ − p_− log_2 p_− = −(1/2) log_2(1/2) − (1/2) log_2(1/2) = 1.
If we split on A, we get 6 observations falling into A = 1, and they are equally split across the classes. The remaining 4 observations fall into A = 0, and they are also equally split. So both regions have entropy 1. Therefore, the gain of A is
G(A) = H(S) − (6/10) × 1 − (4/10) × 1 = 1 − 1 = 0.
In other words, A is useless.
(c) What is the information gain of attribute B on sample S above?
In this case, we see 5 observations each in the two regions B = 1 and B = 0. However, in B = 1 we see 1 positive and 4 negatives, so the entropy is −(1/5) log_2(1/5) − (4/5) log_2(4/5) = 0.7219. In the region B = 0, we see 4 positives and 1 negative, so again the entropy is 0.7219. The gain of B is therefore

G(B) = H(S) − (5/10) × 0.7219 − (5/10) × 0.7219 = 1 − 0.7219 = 0.2781.
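These hand calculations are easy to verify in a few lines of Python (a quick check written for this document, using the table above):

import numpy as np

# rows of (A, B, class) from the table, with +1/-1 class labels
data = np.array([(1, 0, 1), (0, 1, -1), (1, 1, -1), (1, 0, 1), (1, 1, -1),
                 (1, 1, -1), (0, 0, 1), (1, 1, 1), (0, 0, 1), (0, 0, -1)])

def entropy(labels):
    # empirical entropy (base 2) of a vector of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(col):
    # information gain of splitting on column col (0 for A, 1 for B)
    y = data[:, 2]
    g = entropy(y)
    for v in (0, 1):
        mask = data[:, col] == v
        g -= mask.mean() * entropy(y[mask])
    return g

print(entropy(data[:, 2]))  # 1.0
print(gain(0), gain(1))     # 0.0  0.2781...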

(d) What would be chosen as the 'best' attribute by a decision tree learner using the information gain splitting criterion? Why?
(e) What are ensembles? Discuss one example in which decision trees are used in an ensemble.
We choose B as it has higher information gain.
One example is bagging trees (bagged decision trees); see the ensemble methods lecture notes for more discussion.
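As a minimal illustration (a sketch using scikit-learn's BaggingClassifier; the number of trees is arbitrary):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bag 100 decision trees, each fit on a bootstrap resample of the training data;
# predictions are combined by majority vote
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)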

Question 5
Refer to the following data
x1  x2   y
-2  -1  -1
 2  -1   1
 1   1   1
-1  -1  -1
 3   2   1
(a) Apply the Perceptron Learning Algorithm with starting values w0 = 5, w1 = 1 and w2 = 1, and a learning rate η = 0.4. Be sure to cycle through the training data in the same order that they are presented in the table.
Running the Perceptron training algorithm results in the following iterations, where highlighted entries signify that the weight vector changed on that particular iteration. The correct weights are therefore w0 = 3.8, w1 = 2.6, w2 = 2.2. These weights first appear on iteration 9.

iteration   w
1           [4.6, 1.8, 1.4]^T
2           [4.6, 1.8, 1.4]^T
3           [4.6, 1.8, 1.4]^T
4           [4.2, 2.2, 1.8]^T
5           [4.2, 2.2, 1.8]^T
6           [4.2, 2.2, 1.8]^T
7           [4.2, 2.2, 1.8]^T
8           [4.2, 2.2, 1.8]^T
9           [3.8, 2.6, 2.2]^T
10          [3.8, 2.6, 2.2]^T
…           (no further updates: [3.8, 2.6, 2.2]^T classifies every point correctly)
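The iterations above can be reproduced with a short script (a sketch written to check the table; labels are in {−1, +1} and a point is treated as misclassified when y⟨w, x⟩ ≤ 0):

import numpy as np

# data from the table, with a leading 1 in each row for the bias weight w0
X = np.array([[1, -2, -1], [1, 2, -1], [1, 1, 1], [1, -1, -1], [1, 3, 2]])
y = np.array([-1, 1, 1, -1, 1])

w = np.array([5.0, 1.0, 1.0])  # starting values (w0, w1, w2)
eta = 0.4

for epoch in range(10):          # cycle through the data in the given order
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:   # misclassified: apply the update w <- w + eta*y*x
            w = w + eta * yi * xi
            converged = False
    if converged:
        break

print(w)  # [3.8 2.6 2.2]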