
The Australian National University
School of Computing
Semester 1, 2023. Assignment 1. Weight: 18%.
COMP4670/8600: Statistical Machine Learning
Release Date. 27th February 2023.
Due Date. 27th March 2023 at 1200 AEDT.
Maximum credit. 100 Marks for COMP4670 and 120 Marks for COMP8600.
For submission, we are using Ed. Check under the submission tab and make sure to follow the instructions on what to submit. Although submission is via Ed, we recommend that you work on the assignment locally on your computer, as we will not be providing display.ipynb on Ed.
If you want to submit in LaTeX, we have a general template which you can use: https://quicklink.anu.edu.au/a4me.
In addition to the files which you need to submit, we have provided a display.ipynb Jupyter Notebook which you can use to debug the functions that you implement. A correct implementation should run the notebook from start to finish without any errors. DO NOT change any files in framework/.
Do not submit:
• The data/. folder;
• The framework/. folder;
• The Jupyter Notebook display.ipynb.
Grading is different for COMP4670 and COMP8600 students. Grades will be calculated as a percentage out of 100 or 120, respectively. Questions which are required for COMP8600 students are marked as such in their titles; these are optional for COMP4670 students. For COMP4670 students, the final grade will be the maximum of the percentage over COMP4670 questions only and the percentage over all questions.
Installation can be done using the provided requirements.txt. For a fresh install, create a new Python environment (e.g., via Conda) and then run pip install -r requirements.txt in said environment:
$ conda create -n sml
$ conda activate sml
$ pip install -r requirements.txt
Collaboration. You are free to discuss the material in the assignment with your classmates. However, every proof and all code submitted must be written by yourself without assistance. You should be able to explain your answers if requested.
Regrade requests will be processed via a Microsoft Form. This will be released closer to the due date. Other notes:
• When writing proofs, use equation numbers when referring to equations in the assignment, e.g., “Through Equation (3.1) we can show …”;
• You are not required to complete the assignment linearly; you may solve whichever questions you can, regardless of their order of appearance;
• If you have trouble proving a statement, try reducing the dimensionality of the problem first;
• Code is only graded on correctness of functions implemented, not the output of display.ipynb.

Section 1: Bayesian Thinking (5 Total Points)
A historic grouping of machine learning methods comes in the form of non-Bayesian vs. Bayesian methods. You will notice that the first half of the course (and much of Bishop’s textbook) follows a pattern of introducing a non-Bayesian model / algorithm and subsequently extending it to a Bayesian counterpart, e.g., Logistic Regression to Bayesian Logistic Regression; or Kernel Regression to Gaussian Processes. Although much of the course focuses on mathematical detail, for this first question we will consider some of the “whys” of using Bayesian methods in the first place.
Figure 1: From http://yann.lecun.com/ex/fun/index.html#allyourbayes.
Let’s first refresh on the application of Bayes’ Rule to models and data in ML. Given a training dataset D and a model θ, the posterior distribution is defined as

Pr(θ | D) = [Pr(D | θ) · Pr(θ)] / Pr(D) ∝ Pr(D | θ) · Pr(θ),  (1.1)

where Pr(θ | D) is the “posterior”, Pr(D | θ) the “likelihood”, and Pr(θ) the “prior”.
On the right-hand side, Pr(D) is dropped and interpreted as a normalization constant for the posterior. Bayes’ Rule incorporates some amount of prior knowledge into a model, and we can consider maximizing the posterior rather than the likelihood when fitting parameters. With this interpretation in mind, we should think about when a non-Bayesian method and a Bayesian method are “equivalent.”
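As a concrete sketch of Eq. (1.1), consider estimating the bias θ of a coin from some hypothetical flip data (the numbers are illustrative only); a Beta prior is used since it is conjugate to the Bernoulli likelihood, making both estimates closed form:

# Hypothetical coin-flip data: 7 heads in 10 flips; theta = Pr(heads).
heads, tails = 7, 3

# With a Beta(2, 2) prior, the posterior is Beta(2 + heads, 2 + tails).
a, b = 2.0 + heads, 2.0 + tails

theta_mle = heads / (heads + tails)  # maximizes Pr(D | theta) alone
theta_map = (a - 1) / (a + b - 2)    # maximizes Pr(D | theta) * Pr(theta)
print(theta_mle, theta_map)          # 0.7 vs. ~0.667: the prior pulls the
                                     # MAP estimate towards 1/2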
Question 1.1: Am I Bayes? (1 Points)
Consider using two methods—maximum likelihood (MLE, non-Bayesian) and maximum a posteriori (MAP, Bayesian)—to estimate the parameters of some probability distribution from the same dataset and parameter space. When would the resulting estimates be equal?
Finding the MAP estimate consists in solving the optimization problem

max_{θ∈Θ} Pr(θ | D),  (1.2)

where Θ is the set of all possible values for θ. Sometimes, finding or approximating the solution to Problem (1.2) is straightforward. Unfortunately, most of the time it is not. For the rest of this question, we will make the following assumption.
Assumption: For the problem max_{θ∈Θ} Pr(θ | D), we assume there exists no method for finding or numerically approximating (with high enough precision) its optimal solution. However, we have access to N > 0 i.i.d. samples {θ_i ∼ Pr(· | D)}_{i=1}^N and their corresponding probabilities Pr(θ_i | D) for all i = 1, …, N.
You might notice from the Bayesian Regression lecture that, by the properties of Gaussian distributions, we can find the MAP estimator for θ if sampling from the posterior Pr(θ | D) is possible. That is, assuming that (i) the posterior is Gaussian (∗) and (ii) we have N i.i.d. samples from Pr(θ | D), we have

argmax_θ Pr(θ | D) =^{(∗)} E_{θ|D}[θ] ≈ (1/N) Σ_{i=1}^N θ_i.  (1.3)

The sample mean on the right-hand side provides a convenient method of approximating the “best” parameter θ.
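For instance, the following sketch (with a hypothetical Gaussian posterior standing in for Pr(θ | D)) illustrates Eq. (1.3):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior Pr(theta | D) = N(1.5, 0.5^2). Since a Gaussian's
# mean equals its mode, the sample mean of draws approximates the MAP.
samples = rng.normal(loc=1.5, scale=0.5, size=10_000)
print(samples.mean())  # ~1.5, the argmax of this posterior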
Now consider the case where Pr(θ | D) is not Gaussian. A tantalizing question emerges: when is it suitable to approximate the MAP estimator using a sample mean?
Question 1.2: Approximating the MAP (2 Points)
Consider a posterior distribution Pr(θ | D) supported on Θ and N i.i.d. samples {θ_i ∈ Θ}_{i=1}^N from it. Define the sample mean as (1/N) Σ_{i=1}^N θ_i.
1. For what posterior distributions would it be appropriate to use (1/N) Σ_{i=1}^N θ_i to approximate argmax_{θ∈Θ} Pr(θ | D)?
2. Give an example of a posterior distribution for which (1/N) Σ_{i=1}^N θ_i would be a very poor approximation of argmax_{θ∈Θ} Pr(θ | D).
Even in the general case where (∗) does not hold, with access to both the samples θ_i and probabilities Pr(θ_i | D), we can approximate the optimal solution simply by finding the sample maximum: argmax_{i=1,…,N} Pr(θ_i | D).
… But should we?
One topic in machine learning which has recently garnered attention is conformal prediction.1 The rise in the method’s popularity is thanks to its ability to characterize the uncertainty of predictions through a confidence interval. Of course, a Bayesian model should naturally allow us to quantify the uncertainty of a “best” model as we are working with distributions.
Question 1.3: Being Uncertain (2 Points)
For the sample mean (1/N) Σ_{i=1}^N θ_i and the sample maximum argmax_{i=1,…,N} Pr(θ_i | D), discuss how we might (or might not be able to) quantify the uncertainty of the parameter estimate θ given a fixed set of N samples. In other words, is it possible to establish “error bars” for these estimates?
Remark. On a more philosophical note, given a posterior Pr(θ | D), we may want to think about what constitutes a “best” model θ⋆. Do we want one that maximizes the posterior, one equal to the average model, or maybe something else like the median? In different scenarios, this definition of “best” can be radically different.
1See, for instance: https://neurips.cc/virtual/2022/invited-talk/55872
Section 2: Exponential Families (45/65 Total Points)
For this question, we will explore a unified family of distributions: the exponential family (not to be confused with the exponential distribution). The exponential family provides a generalization of many commonly used distributions (Gaussian distributions, Beta distributions, Binomial distributions, etc.) and has many interesting properties which are used throughout ML.
First, let us define the exponential family distributions we will be considering in this question.
Definition 1 (Exponential Family2). Given a function u : R^n → R^m, we denote an n-dimensional exponential family distribution as Exp(u, η), where η ∈ P ⊂ R^m designates the m-dimensional parameters of the distribution within an exponential family3. The corresponding densities of the distributions are given by

q(z | η) = exp(η^⊤ u(z) − ψ(η)),  (2.1)

where

ψ(η) = log ∫ exp(η^⊤ u(z)) dz.  (2.2)

The function u is called the sufficient statistics of the exponential family, and the function ψ is called the log-partition function of the exponential family. Sometimes, it will be convenient to define Z(η) = ∫ exp(η^⊤ u(z)) dz = exp(ψ(η)) as the normalizer of the distribution.
Remark. We note that the definite integral in Eq. (2.2) is over the domain of z. In this assignment, the domain of integration will simply be R^n, which amounts to integrating each coordinate over (−∞, ∞).
Remark. Notice that Definition 1 represents a broad set of special cases of the definition given in the Bishop textbook, Eq. (2.194), with h(x) = 1 and g(η) = exp(−ψ(η)). For notational convenience in the tasks specified below, we use this definition.
At this point, it is useful to verify that the exponential family indeed generalizes distributions we are familiar with.
Question 2.1: Gaussian R.V. as an Exponential Family (10 Points)
Let u(z) = (z, z²) and η = (μ/σ², −1/(2σ²)) define an (n = 1)-dimensional exponential family distribution Exp(u, η) over R.
Verify that Exp(u, η) is equivalent to the 1-dimensional Gaussian distribution with mean μ and variance σ².
What is the distribution’s normalizer Z(η)?
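A quick numerical sanity check of this equivalence (it does not substitute for the algebraic verification the question asks for): μ and σ below are arbitrary, and the normalizer is computed by numerical integration rather than in closed form:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 0.7, 1.3  # arbitrary mean and standard deviation
eta = np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])
u = lambda z: np.array([z, z**2])

# Z(eta) by numerical integration; Eq. (2.2) gives psi(eta) = log Z(eta).
Z, _ = quad(lambda z: np.exp(eta @ u(z)), -np.inf, np.inf)

z0 = 0.4  # arbitrary test point
print(np.exp(eta @ u(z0)) / Z)            # q(z0 | eta)
print(norm.pdf(z0, loc=mu, scale=sigma))  # matches the Gaussian density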
One interesting aspect of exponential families is their connection to the loss functions one typically finds in ML. The simplest connection comes from ‘equating’ a loss-function minimization procedure to the MLE of an exponential family, i.e.,

argmin “a loss function l” = argmax log q(z | η).  (2.3)

Of course, to do this, one needs to find the correct matching between losses and exponential families. To do so, let us introduce some data.
Assumption: Let us assume that we have data D = {(x_i, y_i)}_{i=1}^N with features x_i ∈ R^n and corresponding labels / targets y_i ∈ R.
2This is actually a simplified version of an exponential family. A more general version can be defined, e.g., with a non-unit base measure.
3The exponential family for a fixed u is the set of distributions given by M_u = {Exp(u, η) | η ∈ P}. Two exponential family distributions share the same exponential family if and only if their u’s are the same.

Question 2.2: Exponential Families and Losses (5 Points)
Let f(x; w) = w^⊤x with weights w ∈ R^n. Let us define a conditional random variable using a 1-dimensional exponential family, such that

y | x ∼ Exp(u, η),  (2.4)

where u(y) = (y, y²) and η = η(x) = (f(x; w), −1/2).
That is, as per Definition 1, we are defining the following conditional p.d.f.:

p(y | x; w) = q(y | η(x)) = exp(η(x)^⊤ u(y) − ψ(η(x))).  (2.5)

Prove that

argmax_{w∈R^n} Σ_{i=1}^N log p(y_i | x_i; w) = argmin_{w∈R^n} Σ_{i=1}^N (y_i − f(x_i; w))².  (2.6)
Remark. One might notice that in Question 2.2, the specific form of the function f and its corresponding weights w do not affect the equivalence of the maximization / minimization. Thus, the same observation can be made identically when taking f to be a neural network whilst optimizing over its weights W_1, . . ..
In the general case, optimizing either side of Eq. (2.6) analytically (‘by hand’) can be difficult, e.g., when f is a neural network. As such, it is useful to consider the gradients of the log-likelihood function.
Question 2.3: Parameterizing an Exponential Family (15 Points)
Let f(x; θ) be some mapping from features x parameterized by θ. Let us define a conditional random variable using an exponential family, such that

y | x ∼ Exp(u, η),  (2.7)

where η = f(x; θ) for any function u : R^n → R^m (we hide the dependence on x in η to reduce notation). We also define p(y | x; θ) similarly to Eq. (2.5).
1. Show that

∂ψ(η)/∂η |_{η=f(x;θ)} = E_{p(y|x)}[u(y)].  (2.8)

2. Show that

∂ log p(y | x; θ)/∂θ = (u(y) − E_{p(y′|x)}[u(y′)])^⊤ ∂f(x; θ)/∂θ.  (2.9)
Eq. (2.9) allows us to use gradient ascent (cf. gradient descent) to iteratively update the parameters of our model. Of course, we can practically use Eq. (2.9) via auto-differentiation packages in Python. However, one will need some tricks to make the computation efficient.

Question 2.4: Implementing Gradient Ascent (15 Points)
In ‘exponential_family.py’ complete the following functions:
• grad_psi_wrt_eta (Eq. (2.8));
• log_likelihood_grad (Eq. (2.9));
and thus train a neural network in display.ipynb for the MNIST dataset.
You will need to use from torch.autograd.functional import jacobian, vjp.
Hint: You may want to take a look at framework/exp_fam_model.py (or the copied comment in exponential_family.py).
You will not be graded on the optimality of the neural network’s performance, only on the correctness of the implemented functions.
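The exact function signatures are fixed by exponential_family.py and framework/exp_fam_model.py, so treat the following only as a sketch of how Eqs. (2.8) and (2.9) map onto jacobian and vjp; the arguments f, psi, and u_y are stand-ins for whatever the framework passes in:

import torch
from torch.autograd.functional import jacobian, vjp

def grad_psi_wrt_eta(psi, eta):
    # Eq. (2.8): d psi / d eta evaluated at eta = f(x; theta); by the
    # identity of Question 2.3.1, this equals E_{p(y|x)}[u(y)].
    return jacobian(psi, eta)

def log_likelihood_grad(f, theta, x, u_y, psi):
    # Eq. (2.9): (u(y) - E[u(y')])^T (d f(x; theta) / d theta). The
    # vector-Jacobian product avoids materializing the network's full
    # Jacobian, which is the efficiency trick alluded to above.
    eta_of_theta = lambda t: f(x, t)
    eta = eta_of_theta(theta)
    v = u_y - grad_psi_wrt_eta(psi, eta)
    _, grad = vjp(eta_of_theta, theta, v)
    return grad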
In the next few questions, we will be considering basic fairness properties of the exponential family distributions. There are many different criteria for fairness in ML, and rightfully so as fairness needs to take into account the surrounding social context. We will be considering a simple form of data fairness, where we will measure the representation of different sensitive groups (e.g., gender, age, race, etc.).
In the sequel, we will consider n-dimensional distributions, where the first (n − 1) dimensions consist of non-sensitive features x ∈ X (those which do not affect fairness) and the final dimension consists of a sensitive feature s ∈ S.
Assumption: We will assume in the following questions that S is a finite domain.

Definition 2. Given a p.d.f. p(x, s), its representation rate is given by

RR(p) = min_{s,s′∈S} RR(p, s, s′) ∈ [0, 1],

where RR(p, s, s′) = p(s)/p(s′) (with p(s) as the marginal distribution of p(x, s)).
Representation rate simply measures how well the balance of sensitive features is represented by a distribution.
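For instance, with a hypothetical binary sensitive feature whose marginal is p(s) = (0.7, 0.3):

import numpy as np

p_s = np.array([0.7, 0.3])  # hypothetical marginal p(s)

# For a discrete marginal, the minimum of p(s) / p(s') over pairs is
# attained by the least and most represented groups.
rr = p_s.min() / p_s.max()
print(rr)  # ~0.4286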
Question 2.5 [COMP8600]: Verifying Representation Rate (5 Points)
We have maximal fairness when RR(p) = 1 and minimal fairness when RR(p) = 0.
Argue why this definition of fairness makes sense in terms of the marginal probabilities of demographic information p(s).
Thus to study the representation rate of (separable) exponential families, we should consider what the sensitive marginal distributions look like.
Question 2.6 [COMP8600]: Sensitive Marginal of Exponential Families (10 Points)
Suppose that we have a separable n-dimensional exponential family distribution such that the natural parameters η ∈ R^m and sufficient statistics u : X × S → R^m are separable as:

η = [η_0, η_1], with η_0 ∈ R^{m_0} and η_1 ∈ R^{m_1};
u = [u_0, u_1], with u_0 : X × S → R^{m_0} and u_1 : X × S → R^{m_1};

such that its p.d.f. q(x, s) is given by

q(x, s) = exp(η^⊤ u(x, s)) / Z(η).

Prove that

q(s) = q_0(s) · (Z_0(η_0) / Z(η)) · E_{q_0(x|s)}[exp(η_1^⊤ u_1(x, s))],  (2.12)

where q_0 and Z_0(η_0) are defined via Exp(u_0, η_0).

Now we can use this result to prove a bound on the fairness of a (separable) exponential family.

Question 2.7 [COMP8600]: Representation Rate of Exponential Families (5 Points)
Given the same setup as Question 2.6, prove that

RR(q) ≥ RR(q_0) · [min_{s∈S} E_{q_0(x|s)}[exp(η_1^⊤ u_1(x, s))]] / [max_{s∈S} E_{q_0(x|s)}[exp(η_1^⊤ u_1(x, s))]].  (2.13)

Section 3: Classification with Rejection (50 Total Points)
In the majority of the course, we will be learning about how we teach an algorithm to make a certain prediction. However, should we always make an algorithm make a prediction? The Classification with Rejection (CwR) framework allows an algorithm to ‘abstain’ from making a prediction.
But before that, we will first review some basics of loss function theory. In classification, our goal is in essence simply to assign the correct labels to a set of input features. This ‘correctness’ can be quantified by the 0-1-loss function: given an input x ∈ X and a label y ∈ Y = {1, . . . , K}, the correctness of a hypothesis f : X → Y can be evaluated by:
l01(f(x), y) = 1 if f(x) ≠ y, and 0 otherwise.  (3.1)

One can verify that this loss function simply assigns a penalty of 1 to each incorrect prediction.
As per the lectures, we will be considering the risk minimization of these losses. In particular, given a distribution of data points p(x, y), we will denote

R01[f] = E[l01(f(x), y)].  (3.2)

Here we note that we use the subscript to designate which loss we are calculating the risk for.
To modify the typical notion of classification to add in the option of rejection, we consider a modified version of the 0-1-loss function. First, we modify the class of functions we are interested in to those that can predict an additional rejection class, denoted ⊥. That is, we consider functions f : X → Y ∪ {⊥}. This leads us to the 0-1-c-loss function, a modified 0-1-loss function which accounts for rejection:

l01c(f(x), y) = c if f(x) = ⊥, and l01(f(x), y) otherwise,  (3.3)

where c ∈ [0, 1/2). We define the corresponding risk with “01c” subscripts, i.e., R01c[f].
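A minimal sketch of these two losses, encoding ⊥ by a hypothetical REJECT sentinel:

REJECT = "reject"  # stand-in encoding for the rejection class (⊥)

def l01(pred, y):
    # Eq. (3.1): penalty of 1 if and only if the prediction is wrong.
    return 1.0 if pred != y else 0.0

def l01c(pred, y, c=0.3):
    # Eq. (3.3): rejecting costs a flat c in [0, 1/2), whatever y is.
    return c if pred == REJECT else l01(pred, y)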
We will refer to this approach of learning a classifier with rejection as the general approach, i.e., learning functions of the type f : X → Y ∪ {⊥}.
We note that although the classifier f now has the option to reject, the data on which we evaluate the loss function is still the same.
Question 3.1: Verify Preferable Rejection (5 Points)
Verify that setting c ∈ [0, 1/2) in l01c ensures that choosing rejection is preferable whenever f has less than a 1/2 chance of being correct.
Hint: Fix an input x and consider two classifiers, one which rejects and one which does not. How do their expected losses compare?
Assumption: In the sequel, we will assume c ∈ [0, 1/2).
One question we can ask is: what is the optimal classifier in the CwR setting, given perfect information?

Question 3.2: Optimal General Rejection Classifier (10 Points)
Given the true distribution of data p(x, y), define the class posterior function η_y(x) := p(y | x) for y ∈ Y.
Prove that

f⋆ = argmin_f R01c[f],  (3.4)

where

f⋆(x) = ⊥ if max_{y∈Y} η_y(x) ≤ 1 − c, and argmax_{y∈Y} η_y(x) otherwise.  (3.5)

Hint: Consider the best response to a fixed x. When is ⊥ the best response? Otherwise, what is the best response?
More practically, obtaining the optimal CwR classifier is difficult. As a result, a number of different methods of altering the learning task have been considered. One of these is called the classifier-rejector approach. The intuition of this approach is simple: we learn two separate functions, where one determines rejection while the other is a regular classifier,
f_cr(x; h, r) = ⊥ if r(x) ≤ 0, and h(x) otherwise,  (3.7)

where r : X → R and h : X → Y. The learned function r(x) acts as a proxy for the ‘confidence’ of h(x)’s prediction. One can verify that, defining

l01cr(h(x), r(x), y) = c if r(x) ≤ 0, and l01(h(x), y) otherwise,  (3.8)

we have

l01c(f_cr(x; h, r), y) = l01cr(h(x), r(x), y).  (3.9)

Question 3.3: Optimal Classifier-Rejector Approach (5 Points)
For the corresponding risk R01cr[h⋆, r⋆] := E[l01cr(h⋆(x), r⋆(x), y)], where

h⋆(x) = argmax_{y∈Y} η_y(x);  (3.10)
r⋆(x) = max_{y∈Y} η_y(x) − (1 − c),  (3.11)

prove that

R01cr[h⋆, r⋆] = R01c[f⋆],  (3.12)

thus proving that h⋆ and r⋆ are optimal.
When would Eq. (3.12) hold for arbitrary h, r, f?
From Questions 3.2 and 3.3, we have shown that there are two different ways of minimizing l01c: a general method, and one which requires two learned classifiers. There are many pros and cons when picking between these two. We will explore the sample complexity of calculating the risks. First, let us define the empirical risk for each of the losses:
R̂01c[f] = (1/N) Σ_{i=1}^N l01c(f(x_i), y_i);
R̂01cr[h, r] = (1/N) Σ_{i=1}^N l01cr(h(x_i), r(x_i), y_i).
Remark. Notice that the notation is slightly different (the hat, R̂) when signifying the empirical risk.
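Reusing l01 and l01c from the earlier sketch, the empirical risks are plain sample averages; h, r, and f below are assumed to be callables:

import numpy as np

def empirical_risk_01c(f, xs, ys, c=0.3):
    # \hat{R}_{01c}[f]: average 0-1-c loss over the N samples.
    return np.mean([l01c(f(x), y, c) for x, y in zip(xs, ys)])

def empirical_risk_01cr(h, r, xs, ys, c=0.3):
    # \hat{R}_{01cr}[h, r]: reject (cost c) when r(x) <= 0;
    # otherwise score h's prediction with the 0-1 loss.
    return np.mean([c if r(x) <= 0 else l01(h(x), y) for x, y in zip(xs, ys)])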
Question 3.4: Sample Complexity of Classifier-Rejection Approach (10 Points)
Suppose that H and R are two finite hypothesis classes which are used in the classifier-rejector approach.
Prove that, given N samples, for any δ > 0:

Pr( ∀h ∈ H, r ∈ R : |R01cr[h, r] − R̂01cr[h, r]| ≤ √((ln|H| + ln|R| + ln(1/δ)) / N) ) ≥ 1 − δ.  (3.13)
Question 3.5: Comparing Sample Complexity (10 Points)
Let H, R be finite hypothesis classes corresponding to the classifier-rejector approach with a binary rejector, i.e., classes of functions r : X → {−1, +1} and h : X → Y. Further, let F be a finite hypothesis class for learning the general approach, i.e., a class of functions f : X → Y ∪ {⊥}.
Suppose that the hypothesis classes H, R, and F have size O(|X|^2), O(|X|^{|Y|}), and O(|X|^{|Y|+1}), respectively (assuming finite X).
How does the general approach’s sample complexity (i.e., the bound equivalent to Eq. (3.13)) compare to that of the classifier-rejector approach?
How does this change with the number of classes, K, being predicted over? How does it change as the input space X grows?
We will end on a practical note. The optimal predictor found in Question 3.3 allows us to utilize a plug-in estimator to make a CwR model. In particular, we can simply replace η_y(x) with a model which approximates this probability, i.e., through what we have looked at in Section 2. That is, we will use the classifier you trained in Question 2.4: we are letting η_y(x) = q(y | x), with q defined in Question 2.3.
Question 3.6: Implementing Classifier with Rejection (10 Points)
In classification_with_rejection.py complete the following functions:
• h_classifier_pred (Eq. (3.10));
• r_rejector_pred (Eq. (3.11));
• cwr_pred (Eq. (3.7)).
One can also check display.ipynb to see if the rejected examples make sense.
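Again, the true signatures are fixed by classification_with_rejection.py; the following is only a sketch of the plug-in rules, assuming probs is an (N, K) tensor of model probabilities q(y | x) and reject_label is a hypothetical encoding of ⊥:

import torch

def h_classifier_pred(probs):
    # Eq. (3.10): argmax_y of the plug-in posterior q(y | x).
    return probs.argmax(dim=-1)

def r_rejector_pred(probs, c=0.3):
    # Eq. (3.11): max_y q(y | x) - (1 - c); non-positive means reject.
    return probs.max(dim=-1).values - (1.0 - c)

def cwr_pred(probs, c=0.3, reject_label=-1):
    # Eq. (3.7): emit the rejection label where r(x) <= 0, else h(x).
    h = h_classifier_pred(probs)
    r = r_rejector_pred(probs, c)
    return torch.where(r <= 0, torch.full_like(h, reject_label), h)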