EECE5644 Introduction to Machine Learning and Pattern Recognition Assignment 2


Question 1 (50%)
The probability density function (pdf) for a 2-dimensional real-valued random vector X is as follows: p(x) = p(L = 0)p(x|L = 0)+ p(L = 1)p(x|L = 1). Here L is the true class label that indicates which class-conditioned pdf generates the data.
Assume the class priors are p(L = 0) = 0.65 and p(L = 1) = 0.35. The class-conditional pdfs are p(x|L = 0) = a1 N(x|μ01, Σ01) + a2 N(x|μ02, Σ02) and p(x|L = 1) = N(x|μ1, Σ1), where N(x|μ, Σ) is a multivariate Gaussian probability density function with mean vector μ and covariance matrix Σ. The parameters of the class-conditional Gaussian pdfs are a1 = a2 = 1/2, and:
$$
\mu_{01} = \begin{bmatrix} 3 \\ 0 \end{bmatrix}, \quad
\Sigma_{01} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\mu_{02} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}, \quad
\Sigma_{02} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}, \quad
\mu_{1} = \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \quad
\Sigma_{1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$
For numerical results requested below, generate the following independent datasets, each consisting of iid samples from the specified data distribution, and in each dataset make sure to include the true class label for each sample.
• D^20_train consists of 20 samples and their labels, for training;
• D^200_train consists of 200 samples and their labels, for training;
• D^2000_train consists of 2000 samples and their labels, for training;
• D^10K_valid consists of 10000 samples and their labels, for validation.
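For concreteness, here is a minimal data-generation sketch in Python/numpy. The priors, mixture weights, and Gaussian parameters come from the problem statement above; the function and variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class priors and Gaussian parameters from the problem statement
priors = np.array([0.65, 0.35])
mu01, S01 = np.array([3, 0]), np.array([[2, 0], [0, 1]])
mu02, S02 = np.array([0, 3]), np.array([[1, 0], [0, 2]])
mu1,  S1  = np.array([2, 2]), np.array([[1, 0], [0, 1]])

def generate_dataset(n):
    """Draw n iid samples (x, label) from the specified mixture."""
    labels = (rng.random(n) >= priors[0]).astype(int)  # label 1 with probability 0.35
    x = np.zeros((n, 2))
    for i, l in enumerate(labels):
        if l == 0:
            # Class 0 is itself an equal-weight two-component Gaussian mixture
            if rng.random() < 0.5:
                x[i] = rng.multivariate_normal(mu01, S01)
            else:
                x[i] = rng.multivariate_normal(mu02, S02)
        else:
            x[i] = rng.multivariate_normal(mu1, S1)
    return x, labels

X_train20,  y_train20  = generate_dataset(20)
X_train200, y_train200 = generate_dataset(200)
X_train2k,  y_train2k  = generate_dataset(2000)
X_valid,    y_valid    = generate_dataset(10000)
```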
Part 1 (10%): Determine the theoretically optimal classifier that achieves minimum probability of error using knowledge of the true pdf. Specify the classifier mathematically and implement it; then apply it to all samples in D^10K_valid. From the decision results and true labels for this validation set, estimate and plot the ROC curve of this min-Pr(error) classifier. Report the theoretically optimal threshold and the corresponding probability-of-error estimate, and indicate its location on the ROC curve with a special marker. Also report the empirical threshold and the associated minimum-probability-of-error estimate for this classifier, based on counts of decision-truth label pairs on D^10K_valid.
Optional (Bonus 2.5%): As a supplementary visualization, generate a plot of the decision boundary of this classification rule overlaid on the validation dataset. This establishes an aspirational performance level on the dataset for the following approximations.
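A possible implementation outline for the likelihood-ratio test and the empirical ROC sweep is sketched below, assuming the arrays produced by the data-generation sketch above; scipy.stats supplies the Gaussian densities, and the helper names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio(x):
    """p(x|L=1) / p(x|L=0) evaluated at each row of x."""
    p_x_given_0 = 0.5 * multivariate_normal.pdf(x, mu01, S01) \
                + 0.5 * multivariate_normal.pdf(x, mu02, S02)
    p_x_given_1 = multivariate_normal.pdf(x, mu1, S1)
    return p_x_given_1 / p_x_given_0

# Theoretically optimal threshold for 0-1 loss: the ratio of priors
gamma_theory = priors[0] / priors[1]

scores = likelihood_ratio(X_valid)

# Sweep thresholds over the sorted scores to trace the empirical ROC
thresholds = np.sort(scores)
tpr, fpr, perr = [], [], []
for g in thresholds:
    decisions = scores > g
    tpr.append(np.mean(decisions[y_valid == 1]))
    fpr.append(np.mean(decisions[y_valid == 0]))
    # P(error) = P(D=1|L=0)P(L=0) + P(D=0|L=1)P(L=1)
    perr.append(fpr[-1] * priors[0] + (1 - tpr[-1]) * priors[1])

# Empirical minimum-P(error) threshold, to be compared with gamma_theory
gamma_empirical = thresholds[int(np.argmin(perr))]
```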
Part 2 (40%): (a) Using the maximum-likelihood parameter estimation technique, train three separate logistic-linear approximations of the class-label posterior functions given a sample. For each approximation, use one of the three training datasets D^20_train, D^200_train, D^2000_train. When optimizing the parameters, pose the optimization problem as minimization of the negative log-likelihood (NLL) of the training dataset, and use your favorite numerical optimization approach, such as gradient descent or the optimize.minimize function in Python's scipy library. Determine how to use these class-label-posterior approximations to classify a sample so as to approximate the minimum-Pr(error) classification rule; apply these three approximations of the class-label posterior function to the samples in D^10K_valid, and estimate the probability of error that each of these three classification rules attains (using counts of decisions on the validation set).
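One way to set up the NLL minimization with scipy.optimize.minimize is sketched below for the logistic-linear case. The training and validation arrays are assumed to come from the earlier generation sketch, and all function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def phi_linear(X):
    """Augmented input [1, x1, x2] for each sample."""
    return np.column_stack([np.ones(len(X)), X])

def nll(w, Phi, y):
    """Negative log-likelihood of Bernoulli labels under the logistic model."""
    z = Phi @ w
    # NLL = sum over samples of log(1 + exp(z)) - y*z, computed stably via logaddexp
    return np.sum(np.logaddexp(0, z) - y * z)

def train_logistic(X, y, phi=phi_linear):
    Phi = phi(X)
    res = minimize(nll, x0=np.zeros(Phi.shape[1]), args=(Phi, y), method='BFGS')
    return res.x

def classify(w, X, phi=phi_linear):
    """Decide L = 1 when the estimated posterior p(L=1|x) exceeds 1/2, i.e. w'phi(x) > 0."""
    return (phi(X) @ w > 0).astype(int)

w20 = train_logistic(X_train20, y_train20)
p_error_20 = np.mean(classify(w20, X_valid) != y_valid)
```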

Optional (Bonus 2.5%): As a supplementary visualization, generate plots of the decision boundaries of these trained classifiers superimposed on their respective training datasets and the validation dataset.
(b) Repeat the process described in Part (2a) using a logistic-quadratic-function approximation of the class-label posterior functions given a sample. How do the classifiers trained in this part compare with one another, considering the differences in the number of training samples and in the function form? How do they compare with the theoretically optimal classifier from Part 1? Briefly discuss your results and insights.
Note: With x representing the input sample vector and w denoting the model weights (or parameter vector), the logistic-linear function refers to g(w⊺φ(x)) = 1/(1 + e^(−w⊺φ(x))), where φ(x) = [1, x⊺]⊺ is the augmented input vector, also denoted x̃ in the lecture notes; the logistic-quadratic function refers to g(w⊺φ(x)) = 1/(1 + e^(−w⊺φ(x))), where φ(x) = [1, x1, x2, x1², x1x2, x2²]⊺.
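Under the same assumptions as the Part (2a) sketch, the quadratic case only changes the feature map; a possible sketch:

```python
import numpy as np

def phi_quadratic(X):
    """[1, x1, x2, x1^2, x1*x2, x2^2] for each sample."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# Reuses the (illustrative) train_logistic and classify helpers from the earlier sketch
w_quad = train_logistic(X_train200, y_train200, phi=phi_quadratic)
p_error_quad = np.mean(classify(w_quad, X_valid, phi=phi_quadratic) != y_valid)
```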
Question 2 (50%)
Assume that a real scalar y ∈ R and a two-dimensional real vector x ∈ R^2 are related to each other according to y = c(x, w) + ε, where c(·, w) is a cubic polynomial in x with coefficients w, and ε is a Gaussian random scalar with zero mean and variance σ² (ε ∼ N(0, σ²)).
Given a dataset D = {(x^(1), y^(1)), …, (x^(N), y^(N))} with N samples of (x, y) pairs that are independent and identically distributed according to the model, derive two estimators for w using the maximum-likelihood (ML) and maximum-a-posteriori (MAP) parameter estimation approaches. For the MAP estimator, assume that w has a zero-mean Gaussian prior with covariance matrix γI.
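As a hedged starting point (a sketch of the objective functions only, not the full derivation requested above), writing the cubic polynomial as c(x, w) = w⊺φ(x) for a vector φ(x) of monomials of x up to degree 3 gives objectives of the form:

$$
\hat{w}_{\mathrm{ML}} = \arg\max_{w} \sum_{n=1}^{N} \log \mathcal{N}\big(y^{(n)} \mid w^{\top}\phi(x^{(n)}),\, \sigma^{2}\big)
= \arg\min_{w} \sum_{n=1}^{N} \big(y^{(n)} - w^{\top}\phi(x^{(n)})\big)^{2},
$$
$$
\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \Big[\log p(D \mid w) + \log \mathcal{N}(w \mid 0,\, \gamma I)\Big]
= \arg\min_{w} \sum_{n=1}^{N} \big(y^{(n)} - w^{\top}\phi(x^{(n)})\big)^{2} + \frac{\sigma^{2}}{\gamma}\,\lVert w \rVert_{2}^{2}.
$$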
Having derived the estimator expressions, implement them in code and apply them to the dataset generated by the attached Python/Matlab script. Using the training dataset (Ntrain = 100), obtain the ML estimator and the MAP estimator for a variety of γ values ranging from 10^(-4) to 10^4 (spanning a log scale). Evaluate each trained model by calculating the mean squared error (MSE) between the y values of the validation samples (Nvalid = 1000) and the model estimates of these values using c(·, w_train). How does your MAP-trained model perform on the validation set as γ is varied? How is the MAP estimate related to the ML estimate? Describe your experiments, and visualize and quantify your analyses with data from these experiments (e.g., plot the MSE on the validation dataset as a function of the hyperparameter γ).
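A minimal implementation sketch in Python/numpy follows. It assumes the attached script provides arrays x_train, y_train, x_valid, y_valid and the noise variance sigma2; these names, the choice of all monomials up to degree 3 as features, and the helper functions are assumptions made for illustration.

```python
import numpy as np

def cubic_features(X):
    """All monomials of (x1, x2) up to degree 3, including the constant term (assumed form of c(x, w))."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2,
                            x1**2, x1 * x2, x2**2,
                            x1**3, x1**2 * x2, x1 * x2**2, x2**3])

def fit_ml(Phi, y):
    # Least squares: w_ML = argmin ||y - Phi w||^2
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

def fit_map(Phi, y, gamma, sigma2):
    # Ridge-like solution: (Phi' Phi + (sigma^2 / gamma) I) w = Phi' y
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + (sigma2 / gamma) * np.eye(d), Phi.T @ y)

Phi_train = cubic_features(x_train)
Phi_valid = cubic_features(x_valid)

w_ml = fit_ml(Phi_train, y_train)
mse_ml = np.mean((y_valid - Phi_valid @ w_ml) ** 2)

# MSE on the validation set as a function of gamma, spanning a log scale
gammas = np.logspace(-4, 4, 50)
mse_map = [np.mean((y_valid - Phi_valid @ fit_map(Phi_train, y_train, g, sigma2)) ** 2)
           for g in gammas]
```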
Note: Point split will be 25% for ML and 25% for MAP estimator results.