
CIS 4190/5190 Homework 4 – Fall 2023¶
Before starting, you must click on the "Copy To Drive" option in the top bar, or go to File -> Save a Copy to Drive. This is the master notebook, so you will not be able to save your changes without copying it! Once you have made a copy, make sure you are working in that version of the notebook so that your work is saved.

# Restart the runtime after running this cell every time you open the notebook
#!pip install pandas==1.1.5

!pip install dill

Collecting dill
Downloading dill-0.3.7-py3-none-any.whl (115 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.3/115.3 kB 2.3 MB/s eta 0:00:00
Installing collected packages: dill
Successfully installed dill-0.3.7

import random
import numpy as np
import pandas as pd
import sys
import matplotlib.pyplot as plt
from numpy.linalg import *
from scipy.spatial import distance
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
np.random.seed(42) # don’t change this line

import base64
import datetime

PennGrader Setup¶
First, you'll need to set up the PennGrader, an autograder we are going to use throughout the semester. The PennGrader will automatically grade your answers and provide you with instant feedback. Unless otherwise stated, you can resubmit up to a reasonable number of attempts (e.g., 100 attempts per day). We will only record your latest score in our backend database.

After finishing each homework assignment, you must submit your IPython notebook to Gradescope before the homework deadline. Gradescope will then retrieve and display your scores from our backend database.

!pip3 install penngrader --upgrade

# do not need to consider any grader.grade

# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN YOUR POINTS TO IN OUR BACKEND
STUDENT_ID = 00000000

# As described in class, we are collecting a dataset on student
# programming assignment submissions to aid research on designing
# ChatGPT-based programming tutoring tools. You can optionally agree
# to include your data in this dataset, which can be invaluable for
# research on AI for programming education. All data we collect will
# be fully anonymized and de-identified before being stored and
# publicly distributed. The results of this study have the potential
# to help students in future iterations of the class, as well as
# students learning to program more broadly, and we anticipate that
# the risks to participation are minimal. Your decision to participate
# is voluntary and will have no impact on your grade in this class,
# access to course resources, etc. In addition, even if you accept,
# you will be able to opt out anytime up until the end of the year
# (December 31, 2023). There will be no compensation for participation.
# You can reach out to me if you have any
# questions or concerns. Please set the following variable to True
# if you consent to data collection and False if you opt out:

data_collection_consent = False

Run the following cell to initialize the autograder. This autograder will let you submit your code directly from this notebook and immediately get a score.

NOTE: Remember that we store your submissions and check them against other students' submissions... so, not that you would, but no cheating.

# Serialization code needed by the autograder
import inspect, sys
from IPython.core.magics.code import extract_symbols

def new_getfile(object, _old_getfile=inspect.getfile):
    if not inspect.isclass(object):
        return _old_getfile(object)

    # Lookup by parent module (as in current inspect)
    if hasattr(object, '__module__'):
        object_ = sys.modules.get(object.__module__)
        if hasattr(object_, '__file__'):
            return object_.__file__

    # If parent module is __main__, lookup by methods (NEW)
    for name, member in inspect.getmembers(object):
        if inspect.isfunction(member) and object.__qualname__ + '.' + member.__name__ == member.__qualname__:
            return inspect.getfile(member)
    raise TypeError('Source for {!r} not found'.format(object))
inspect.getfile = new_getfile

def grader_serialize(obj):
    cell_code = "".join(inspect.linecache.getlines(new_getfile(obj)))
    class_code = extract_symbols(cell_code, obj.__name__)[0][0]
    return class_code

Next, we will download the dataset from Google Drive to your local runtime. After successful download, you may verify that all datasets are present in your colab instance.

observations.csv
test_student.csv

Acknowledgement¶
Dataset obtained from kaggle.com Hourly Weather Surface – Brazil (Southeast region)

import os
import gdown
if not os.path.exists("observations.csv"):
    !gdown --id 1RvNTrL147Cx90ABv4IfXcexaRyHB-U-e
if not os.path.exists("test_student.csv"):
    !gdown --id 1Z0I6iylDgTk2OKuKDaQVR9I1aJvgRsn_

/usr/local/lib/python3.10/dist-packages/gdown/cli.py:121: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
warnings.warn(
Downloading…
From: https://drive.google.com/uc?id=1RvNTrL147Cx90ABv4IfXcexaRyHB-U-e
To: /content/observations.csv
100% 13.5M/13.5M [00:00<00:00, 60.1MB/s]
/usr/local/lib/python3.10/dist-packages/gdown/cli.py:121: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
warnings.warn(
Downloading...
From: https://drive.google.com/uc?id=1Z0I6iylDgTk2OKuKDaQVR9I1aJvgRsn_
To: /content/test_student.csv
100% 277k/277k [00:00<00:00, 3.74MB/s]

NOTE 1. If you are running into a `builtins` error, it is likely because you are using a method call of the form numpy.ndarray.mean(), like a.mean(). Unfortunately, this does not play nicely with PennGrader. Please use the function call numpy.mean(a) instead.¶

def prepare_final_cleaned_df(df):
    df = df.drop(["Unnamed: 0"], axis=1)
    df["mdct"] = pd.to_datetime(df["mdct"])
    df.loc[df["gust"].isna(), "gust"] = 0
    df.loc[df["gbrd"].isna(), "gbrd"] = 0
    df.loc[df["wdsp"].isna(), "wdsp"] = 0
    df.loc[df["dewp"].isna(), "dewp"] = 0
    df.loc[df["dmin"].isna(), "dmin"] = 0
    df.loc[df["dmax"].isna(), "dmax"] = 0
    df = df[df["temp"] != 0]
    df = df.drop(columns=["prcp"])

    left_df = df.copy()
    right_df = df.copy()
    right_df["mdct"] = right_df["mdct"].apply(lambda x: x + datetime.timedelta(hours=1))

    columns = ["stp", "smax", "smin", "gbrd", "dewp", "tmax", "dmax", "tmin",
               "dmin", "hmdy", "hmax", "hmin", "wdsp", "wdct", "gust", "temp"]

    merged_df = pd.merge(left_df, right_df, "left", on=["wsid", "mdct"], indicator=True)
    merged_df = merged_df[merged_df["_merge"] == "both"]

    columns_x = [x + "_x" for x in columns]
    columns_y = [x + "_y" for x in columns]
    merged_df[columns] = merged_df[columns_x].values - merged_df[columns_y].values
    merged_df = merged_df.drop(columns=columns_x)
    merged_df = merged_df.drop(columns=columns_y)
    merged_df = merged_df.drop(columns=["_merge", "mdct", "wsid"])

    final_cleaned_df = merged_df.copy()
    final_cleaned_df.loc[final_cleaned_df["temp"] >= 0, "temp"] = 1
    final_cleaned_df.loc[final_cleaned_df["temp"] < 0, "temp"] = 0
    return final_cleaned_df

1. [25 pts, autograded] AdaBoost¶

1.1. [3 pts] Logistic regression with sample weights¶
As you will have learned from the lectures, in each iteration AdaBoost fits a weak learner (here, a logistic regression model) to a dataset with a weight $w_i$ attached to each sample $(x_i, y_i)$. The loss function then becomes:

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} w_{i} \left[ y_i\log(h_{\theta}(x_i)) + (1 - y_i)\log(1 - h_{\theta}(x_i)) \right]$

where $h_\theta(x)$ is the logistic regression hypothesis function. The gradient of this weighted loss function with respect to the coefficient $\theta_j$ is given by:

$\frac{\partial \mathcal{L}(\theta)}{\partial \theta_j} = \sum_{i=1}^{N} w_{i}(h_{\theta}(x_i) - y_i)x_{ij}$

Using this information, complete the compute_gradient method in the LogisticRegression class to account for sample weights.

class LogisticRegression:
    """
    Logistic Regression (aka logit, MaxEnt) classifier.

    Parameters
    ----------
    alpha: float, default=0.1
        Learning rate
    tol : float, default=0.01
        Tolerance for stopping criteria
    max_iter : int, default=1000
        Maximum number of iterations of gradient descent

    Attributes
    ----------
    theta_ : numpy.ndarray of shape (D + 1,)
        The value of the coefficients after gradient descent has converged
        or the number of iterations hit the maximum limit
    converged_: boolean
        Boolean value indicating whether gradient descent converged or not
    """

    def __init__(self, alpha=0.1, tol=0.01, max_iter=1000):
        self.alpha = alpha
        self.tol = tol
        self.max_iter = max_iter
        self.theta_ = None
        self.converged_ = False

    def compute_gradient(self, theta, X, y, sample_weight):
        """
        Compute the gradient of the cost function.

        Parameters
        ----------
        theta: numpy.ndarray of shape (D + 1,)
            The coefficients
        X: numpy.ndarray of shape (N, D + 1)
            The features matrix
        y: numpy.ndarray of shape (N,)
            The target variable array
        sample_weight: numpy.ndarray of shape (N,)
            The sample weight array

        Returns
        -------
        gradient: numpy.ndarray of shape (D + 1,)
            The gradient values
        """
        sigmoid = lambda x: 1 / (1 + np.exp(-x))
        y_hat = sigmoid(X.dot(theta))

        # STUDENT TODO: Compute the gradient

        # STUDENT TODO END

    def fit(self, X, y, sample_weight):
        """
        Compute the coefficients using gradient descent and store them as theta_.

        Parameters
        ----------
        X: numpy.ndarray of shape (N, D)
            The features matrix
        y: numpy.ndarray of shape (N,)
            The target variable array
        sample_weight: numpy.ndarray of shape (N,)
            The sample weight array
        """
        N, D = X.shape

        # Adding a column of ones at the beginning for the bias term
        ones_col = np.ones((N, 1))
        X = np.hstack((ones_col, X))

        # Initializing the weights
        theta_old = np.zeros((D + 1,))
        theta_new = theta_old.copy()

        for i in range(self.max_iter):
            theta_new = theta_old - self.alpha * self.compute_gradient(theta_old, X, y, sample_weight)
            if np.linalg.norm(theta_new - theta_old) / (np.linalg.norm(theta_old) + self.tol) <= self.tol:
                self.converged_ = True
            theta_old = theta_new.copy()

        self.theta_ = theta_new

    def predict_proba(self, X):
        """
        Predict the probabilities that the data points in X belong to class 1.

        Parameters
        ----------
        X: numpy.ndarray of shape (N, D)
            The features matrix

        Returns
        -------
        y_hat: numpy.ndarray of shape (N,)
            The predicted probabilities that the data points in X belong to class 1
        """
        N = X.shape[0]

        # Adding a column of ones at the beginning for the bias term
        ones_col = np.ones((N, 1))
        X = np.hstack((ones_col, X))

        sigmoid = lambda x: 1 / (1 + np.exp(-x))
        y_hat = sigmoid(X.dot(self.theta_))
        return y_hat

    def predict(self, X):
        """
        Predict the classes of the data points in X.

        Parameters
        ----------
        X: numpy.ndarray of shape (N, D)
            The features matrix

        Returns
        -------
        y_pred: numpy.ndarray of shape (N,)
            The predicted class of the data points in X
        """
        y_hat = self.predict_proba(X)
        y_pred = y_hat.copy()
        y_pred[y_pred >= 0.5] = 1
        y_pred[y_pred < 0.5] = 0
        return y_pred

Test case for the compute_gradient method¶

def test_compute_gradient(StudentLogisticRegression):
    student_lr_clf = StudentLogisticRegression()

    np.random.seed(0)
    theta_tc = np.random.randn(2)
    X_tc = np.random.randn(100, 2)
    y_tc = np.random.randint(0, 2, 100)
    sample_weight_tc = np.random.uniform(0, 1, 100)

    student_ans = student_lr_clf.compute_gradient(theta_tc, X_tc, y_tc, sample_weight_tc)
    required_ans = np.array([12.903225675830651, -1.0829605960182223])
    assert np.linalg.norm(student_ans - required_ans) < 1e-2 * required_ans.size

test_compute_gradient(LogisticRegression)

# PennGrader Grading Cell
if data_collection_consent:
    requests.post(URL, json={"student": STUDENT_ID, "question": "H4Q1a", "solution": grader_serialize(LogisticRegression)})
grader.grade(test_case_id = 'test_compute_gradient_autograder', answer = grader_serialize(LogisticRegression))

1.2. [20 pts] AdaBoostClassifier Implementation¶
In this section, you will be implementing the AdaBoost classifier with the logistic regression weak learner from above.

1.2.1. [12 pts] Follow the hints in the fit method in the AdaBoostClassifier class to implement the following algorithm.¶
Use the following AdaBoost pseudocode as a reference (an illustrative NumPy sketch of the weight update follows the pseudocode below). You are given training data $X, y = \{(x_{i}, y_{i})\}_{i=1}^N$ and a number of iterations $T$.

ALGORITHM:

1. Initialize $N$ uniform weights, i.e., $w_{1} = [1/N, 1/N, \dots, 1/N]$

2. For $t = 1, 2, \dots, T$:

2.1. Train model $h_t$ on $X$ and $y$ with instance weights $w_{t}$

2.2. Compute the weighted training error rate of $h_{t}$: $\epsilon_{t} = \sum_{i: y_i \ne h_t(x_i)} w_{t,i}$

2.3. If $\epsilon_{t} > 0.5$, flip $h_t$'s predictions

2.4. Set $\beta_{t} = \frac{1}{2}\text{ln}\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

2.5. Update all instance weights: $w_{t + 1,i} = w_{t,i}\times\text{exp}(-\beta_{t}y_{i}h_{t}(x_{i}))$ $\forall i = 1, 2, \dots, N$

2.6. Normalize $w_{t+1}$ such that the elements sum to 1
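
The arithmetic in steps 2.2 and 2.4-2.6 is compact in NumPy. The snippet below is only an illustrative sketch of one boosting iteration, not the graded fit method; it assumes h is an already-fitted weak learner and that X, the label vector y_ (with 0s mapped to -1), and the current weight vector w are set up as in the scaffolding further down.

import numpy as np

y_pred = h.predict(X)                          # weak-learner predictions in {0, 1}
y_pred_ = np.where(y_pred == 0, -1, 1)         # map to {-1, +1}

epsilon = np.sum(w[y_pred_ != y_])             # step 2.2: weighted error rate
beta = 0.5 * np.log((1 - epsilon) / epsilon)   # step 2.4

w = w * np.exp(-beta * y_ * y_pred_)           # step 2.5: up-weight the mistakes
w = w / np.sum(w)                              # step 2.6: normalize so the weights sum to 1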

1.2.2. [8 pts] Follow the hints in the predict method in the AdaBoostClassifier class for obtaining the predictions of the trained AdaBoost classifier.¶
$H(x) = \text{sign}\left(\sum_{t=1}^{T}\beta_{t}h_{t}(x)\right)$
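
In code form, this weighted vote looks roughly like the sketch below (illustrative only; beta_arr_ and h_arr_ are attributes of the class that follows, and a sum of exactly zero is mapped to class 1 here):

import numpy as np

sum_beta_times_h = np.zeros(X.shape[0])
for beta, h in zip(beta_arr_, h_arr_):
    h_t = np.where(h.predict(X) == 0, -1, 1)   # map {0, 1} predictions to {-1, +1}
    sum_beta_times_h += beta * h_t

y_pred = (sum_beta_times_h >= 0).astype(int)   # sign(.), mapped back to {0, 1}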

class AdaBoostClassifier:
    """
    AdaBoost classifier based on logistic regression

    Parameters
    ----------
    T: int, default=100
        The number of logistic regression models in the boosting model

    Attributes
    ----------
    beta_arr_ : list of length T
        The list of beta values in the boosting model

    h_arr_: list of length T
        The list of logistic regression models in the boosting model
    """

    def __init__(self, T=100):

        self.T = T

        self.beta_arr_ = []
        self.h_arr_ = []

    def fit(self, X, y):
        """
        Train the logistic regression models (h) and compute their coefficients (beta),
        and store them in h_arr_ and beta_arr_ respectively.

        Parameters
        ----------
        X: numpy.ndarray of shape (N, D)
            The features matrix
        y: numpy.ndarray of shape (N,)
            The target variable array
        """
        N = X.shape[0]

        # STUDENT TODO: Initialize w with appropriate values
        # STUDENT TODO END

        y_ = y.copy()
        # STUDENT TODO: Update y_ such that the 0's in y_ are replaced by -1
        # STUDENT TODO END

        for t in range(self.T):
            h = LogisticRegression()

            # STUDENT TODO:
            # Fit h to X and y using w as the sample weights

            # Obtain the predictions from h and compute epsilon

            # If epsilon > 0.5:
            # 1. flip the predictions, i.e., replace 1's with 0's and 0's with 1's
            # 2. invert the model (h), i.e., make it predict 1 for what it predicted 0 earlier
            #    and vice-versa (clue: think about modifying h.theta_)

            # STUDENT TODO END

            self.h_arr_.append(h)

            if epsilon == 0:
                # The weak learner is already perfect on the weighted data,
                # so record an "infinite" beta and stop boosting early.
                beta = np.inf
                self.beta_arr_.append(beta)
                break

            # STUDENT TODO: Compute beta
            # STUDENT TODO END

            self.beta_arr_.append(beta)
            y_pred_ = y_pred.copy()

            # STUDENT TODO:
            # Update y_pred_ such that the 0's in y_pred_ are replaced by -1

            # Update w and normalize it such that the values in w sum to 1

            # STUDENT TODO END

    def predict(self, X):
        """
        Predict the classes of the data points in X.

        Parameters
        ----------
        X: numpy.ndarray of shape (N, D)
            The features matrix

        Returns
        -------
        y_pred: numpy.ndarray of shape (N,)
            The predicted class of the data points in X
        """
        N = X.shape[0]

        # Initialize the summation of beta times h for each x_i
        sum_beta_times_h = np.zeros((N,))

        for t in range(len(self.h_arr_)):

            # STUDENT TODO:
            # Obtain the predictions of the t-th model in self.h_arr_

            # Replace the 0's in the array with -1

            # Update sum_beta_times_h

        # Create an array `y_pred` for the final predictions
        # Fill 0's and 1's in `y_pred` depending on the sum_beta_times_h value in the corresponding location

        return y_pred
        # STUDENT TODO END

Test case for the fit method¶

# Dimensions used by the test cases below (inferred from the expected answers):
# T boosting rounds, N data points, D features
T, N, D = 4, 100, 2

def test_adaboost_fit(StudentAdaBoostClassifier):

    student_ab_clf = StudentAdaBoostClassifier(T=T)
    np.random.seed(0)
    X_tc = np.random.randn(N, D)
    y_tc = np.random.randint(0, 2, N)
    student_ab_clf.fit(X_tc, y_tc)

    beta_arr_student_ans = student_ab_clf.beta_arr_
    beta_arr_required_ans = np.array([0.08017132503758954, 0.046732864002838985,
                                      0.022808008179707476, 0.07012335626140642])
    assert np.linalg.norm(beta_arr_student_ans - beta_arr_required_ans) < 1e-2 * beta_arr_required_ans.size

    h_arr_student_ans = np.zeros([T, D + 1])
    for indx, h in enumerate(student_ab_clf.h_arr_):
        h_arr_student_ans[indx] = h.theta_
    h_arr_required_ans = np.array([[-0.01514967, -0.01713051, 0.21344566],
                                   [-0.01738886, -0.00656722, 0.12035635],
                                   [-0.0132557, -0.00428943, 0.06616284],
                                   [-0.01037174, -0.00334141, 0.03943088]])
    assert np.linalg.norm(h_arr_student_ans - h_arr_required_ans) < 1e-2 * h_arr_required_ans.size

test_adaboost_fit(AdaBoostClassifier)

# PennGrader Grading Cell
if data_collection_consent:
    requests.post(URL, json={"student": STUDENT_ID, "question": "H4Q1b", "solution": grader_serialize(LogisticRegression) + "\n" + grader_serialize(AdaBoostClassifier)})
grader.grade(test_case_id = 'test_adaboost_fit_autograder', answer = (grader_serialize(LogisticRegression), grader_serialize(AdaBoostClassifier)))

Test case for the predict method¶

def test_adaboost_predict(StudentAdaBoostClassifier):

    student_ab_clf = StudentAdaBoostClassifier(T=T)
    np.random.seed(0)
    X_tc = np.random.randn(N, D)
    y_tc = np.random.randint(0, 2, N)
    student_ab_clf.fit(X_tc, y_tc)

    student_ans = student_ab_clf.predict(X_tc)
    required_ans = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
                    1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
                    0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
                    1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
    assert np.mean(student_ans == required_ans) >= 0.98

test_adaboost_predict(AdaBoostClassifier)

# PennGrader Grading Cell
if data_collection_consent:
    requests.post(URL, json={"student": STUDENT_ID, "question": "H4Q1c", "solution": grader_serialize(LogisticRegression) + "\n" + grader_serialize(AdaBoostClassifier)})
grader.grade(test_case_id = 'test_adaboost_predict_autograder', answer = (grader_serialize(LogisticRegression), grader_serialize(AdaBoostClassifier)))

1.3. [2 pts] AdaBoost on the dataset¶
Follow the hints in the adaboost_on_dataset function in the cell below to run the AdaBoostClassifier on the dataset.

# Load Dataset
df = pd.read_csv("observations.csv")
final_cleaned_df = prepare_final_cleaned_df(df)

test_df = pd.read_csv("test_student.csv").drop(columns=["Unnamed: 0"])

def adaboost_on_dataset():
    """
    Trains the AdaBoostClassifier on a real-world dataset.

    Returns
    -------
    y_test_pred: numpy.ndarray
        The predicted classes of the datapoints in test_df
    """
    # STUDENT TODO START: Initialize X_train and y_train with appropriate values
    # (clue: use .iloc followed by .values of the DataFrame class)

    # STUDENT TODO END

    # STUDENT TODO START: Initialize X_test
    # STUDENT TODO END

    scaler = StandardScaler()
    # STUDENT TODO START: Scale the features of X_train and X_test using scaler

    # STUDENT TODO END

    clf = AdaBoostClassifier(T=10)
    # STUDENT TODO START: Now fit clf to the entire training data, i.e., X_train and y_train after feature scaling
    # STUDENT TODO END

    # STUDENT TODO START: Predict the classes of the datapoints in X_test and return the result
    # STUDENT TODO END
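
A common pitfall in the scaling step is fitting the scaler on the test data as well. A minimal sketch of the usual pattern is below; the variable names X_train and X_test are only illustrative (matching the TODO hints above), and StandardScaler is already imported at the top of the notebook.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn the mean/std from the training features only
X_test = scaler.transform(X_test)         # apply the same statistics to the test features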

# PennGrader Grading Cell
y_test_pred = adaboost_on_dataset()
grader.grade(test_case_id = 'test_adaboost_on_dataset_autograder', answer = y_test_pred)

2. [8 pts, 419-optional, autograded] XGBoost¶

TODOs for this section:¶
You'll use the xgboost library to build a classifier for the above problem. XGBoost is a popular library for gradient boosting, and you can find its documentation here.
You need to get at least 0.75 accuracy on the test set to receive full credit from the autograder.

train_df = final_cleaned_df.copy()
test_df = pd.read_csv("test_student.csv").drop(columns=["Unnamed: 0"])

import xgboost as xgb

# STUDENT TODO STARTS:
# 1. Fit an xgboost classifier to the training data (train_df, target variable is temp)
# 2. Obtain the predictions of the trained model on test_df in the variable y_test_pred
# 3. Tune the hyperparameters such that you pass the autograder threshold accuracy of 0.75

# STUDENT TODO ENDS
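
For orientation, one plausible shape of a solution is sketched below. This is not the graded answer: the hyperparameter values are placeholders that may or may not clear the 0.75 threshold, and the sketch assumes test_df contains only feature columns (drop temp first if it is present).

import xgboost as xgb

# Split the cleaned training data into features and the binary target "temp"
X_train = train_df.drop(columns=["temp"]).values
y_train = train_df["temp"].values
X_test = test_df.values

# Placeholder hyperparameters; tune these against the 0.75 accuracy threshold
xgb_clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
xgb_clf.fit(X_train, y_train)
y_test_pred = xgb_clf.predict(X_test)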

# PennGrader Grading Cell
grader.grade(test_case_id = 'test_xgboost', answer = y_test_pred)

3. [25 pts, 15 autograded, 10 manually graded] K-means Clustering¶

We will implement the k-means clustering algorithm using the Breast Cancer dataset. As with all unsupervised learning problems, our goal is to discover and describe some hidden structure in unlabeled data. The k-means algorithm, in particular, attempts to determine how to separate the data into k distinct groups over a set of features given that we know (are provided) the value of k.

Knowing that there are k distinct 'classes', however, does not tell us anything about the content or properties of each class. If we could find samples that are representative of each of these k groups, then we could label the rest of the data based on how similar it is to each of the prototypical samples. We will refer to these representatives as the centroids (cluster centers) corresponding to each cluster.

3.1. Import the dataset¶

from sklearn.datasets import load_breast_cancer
cancer_dataset = load_breast_cancer()

## TODO your code here ##
# First, load the dataset X from cancer_dataset.
# X - (m, n) -> an m x n matrix, where m is the number of training points (= 569)
#     and n is the number of features (= 30)

## TODO end ##
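
If it helps, load_breast_cancer returns a scikit-learn Bunch object whose data attribute holds the feature matrix (the labels live in target, which we do not use for clustering). A one-line sketch of the load:

X = cancer_dataset.data  # numpy array of shape (569, 30)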

3.2. [12 pts] K-means clustering implementation¶

We will first implement a class for k-means clustering.

These are the main functions:

__init__: The initialiser/constructor (This is implemented for you)
fit: Entry-point function that takes in the dataset (X) as well as optional centroid initialisations, and returns:
the cluster labels for each row (data point) in the dataset,
the list of centroids corresponding to each cluster, and
the number of iterations taken to converge.

Inside the fit() function, you will need to implement the actual k-means functionality.

The K-means process you should follow is listed below:

Initialize each of the k centroids to a random datapoint if initialisation is not provided.
Update each datapoint’s cluster to that whose centroid is closest
Calculate the new centroid of each cluster
Repeat the previous two steps until no centroid value changes. Make sure you break out of the loop once max_iter iterations are reached, regardless of whether you have converged.

To help streamline this process, three helper functions have been given to you in the KMeans class (a short illustrative sketch of how they fit together appears after this list):

compute_distance(): use for step 2 above
find_closest_cluster(): use for step 2 above
compute_centroids(): use for step 3 above
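
Sketch of how these helpers typically compose inside fit (illustrative only, not the graded implementation). It assumes self.centroids has already been initialised, e.g. to k random datapoints, and it leaves the initialisation and bookkeeping details to you:

# Illustrative sketch of the main loop inside fit(), using the helpers above.
# Assumes self.centroids has already been initialised to k datapoints from X.
for iters_taken in range(1, self.max_iter + 1):
    dist = self.compute_distance(X, self.centroids)            # step 2
    self.clusters = self.find_closest_cluster(dist)            # step 2
    new_centroids = self.compute_centroids(X, self.clusters)   # step 3
    if np.array_equal(new_centroids, self.centroids):          # step 4: no centroid changed, so converged
        break
    self.centroids = new_centroids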

class KMeans:
    '''Implementing Kmeans clustering'''

    def __init__(self, n_clusters, max_iter=1000):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def compute_centroids(self, X, clusters):
        '''
        Computes new centroid positions given the clusters

        X - m by n matrix, where m is the number of training points
        clusters - m dimensional vector, where m is the number of training points.
            At an index i, it contains the cluster id that the i-th datapoint
            in X belongs to.

        centroids - k by n matrix, where k is the number of clusters.
        '''
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        # TODO your code here

        ## TODO end ##
        return centroids

    def compute_distance(self, X, centroids):
        '''
        Computes the distance of each datapoint in X from the centroids of all the clusters

        X - m by n matrix, where m is the number of training points
        centroids - k by n matrix, where k is the number of clusters

        dist - m by k matrix, giving, for each datapoint in X, the distances from all the k cluster centroids.
        '''
        dist = np.zeros((X.shape[0], self.n_clusters))
        # TODO your code here

        ## TODO end ##
        return dist

    def find_closest_cluster(self, dist):
        '''
        Finds the cluster id that each datapoint in X belongs to

        dist - m by k matrix, giving, for each datapoint in X, the distances from all the k cluster centroids.

        clusters - m dimensional vector, where m is the number of training points.
            At an index i, it contains the cluster id that the i-th datapoint
            in X belongs to.
        '''
        clusters = np.zeros(dist.shape[0])
        # TODO your code here
        ## TODO end ##
        return clusters

    def fit(self, X, init_centroids=None):
        '''
        Fit KMeans clustering to the given dataset X.

        X - m by n matrix, where m is the number of training points
        init_centroids (optional) - k by n matrix, where k is the number of clusters

        clusters - m dimensional vector, where m is the number of training points.
            At an index i, it contains the cluster id that the i-th datapoint
            in X belongs to.
        centroids - k by n matrix, where k is the number of clusters.
            These are the k cluster centroids, for cluster ids 0 to k-1.
        iters_taken - total iterations taken to converge. Should not be more than max_iter.
        '''
        # Fix random seed. Do not change this!
        np.random.RandomState(111)

        ## TODO your code here ##
        # Initialise centroids to random points in the dataset if not provided (i.e. None)

        # Iterate until kmeans converges or max_iter is reached. In each iteration:
        # - Update each datapoint's cluster to that whose *centroid* is closest
        # - Calculate the new *centroid* of each cluster
        # - Repeat the previous two steps until no centroid value changes.

        ## TODO end ##
        return self.clusters, self.centroids, iters_taken

# Test case: the centroids should be around (1.5, 1.5) and (4.5, 4.5)
points = []
result = []
random.seed(0)
for _ in range(500):
    x = random.random()*3
    y = random.random()*3
    points.append((x, y))
    result.append(0)
for _ in range(500):
    x = random.random()*3 + 3
    y = random.random()*3 + 3
    points.append((x, y))
    result.append(1)
clf = KMeans(2)
points = np.asarray(points)

# test for sanity check
def test_compute_centroids():
    clf = KMeans(2)
    centroid_p = clf.compute_centroids(np.array(points), np.array(result))
    centroid_r = [[1.5185255, 1.45970038],
                  [4.51568108, 4.54138552]]
    assert(np.linalg.norm(centroid_p - np.array(centroid_r)) <= 1e-2)

test_compute_centroids()

# PennGrader Grading Cell
if data_collection_consent:
    requests.post(URL, json={"student": STUDENT_ID, "question": "H4Q3a", "solution": grader_serialize(KMeans)})
grader.grade(test_case_id = 'test_compute_centroids', answer = grader_serialize(KMeans))