CS/ECE/STAT-861: Theoretical Foundations of Machine Learning University of Wisconsin–Madison, Fall 2023
Homework 0. Due 9/15/2023, 11.00 am
Instructions:
1. Homework is due at 11 am on the due date. Please hand over your homework at the beginning of class. Please see the course website for the policy on late submission.
2. I recommend that you typeset your homework using LaTeX. You will receive 5 percent extra credit if you do so. If you are submitting a handwritten homework, please make sure it is cleanly written up and legible. I will not invest undue effort to understand bad handwriting.
3. You must hand in a hard copy of the homework. The only exception is if you are out of town, in which case you must let me know ahead of time and email me a copy of the homework by 11 am on the due date. If this is the case, your homework must be typeset using LaTeX. Please do not email written and scanned copies.
4. One of the objectives of Homework 0 is to test if you are familiar with the necessary background topics to take this class. While I do not expect you to know the solution right away, you should be able to solve most of the questions with reasonable effort after looking up any necessary references. If you find this homework exceedingly difficult (especially problems 1 and 2), please talk to me.
5. Collaboration: You are allowed to collaborate on problem 3 of this homework in groups of size up to 3. If you do so, please indicate your collaborators at the top of your solution. You are not allowed to collaborate on problems 1 and 2.
1 Estimating the mean of a normal distribution
You are given n independent samples X_1, . . . , X_n sampled from a normal distribution N(μ, σ²) with unknown mean μ, but known variance σ². You wish to estimate the mean μ. One straightforward option is to estimate μ using the sample mean μ̂_SM, defined below:

μ̂_SM = (1/n) Σ_{i=1}^n X_i.

1. [2 pts] To quantify the performance of this estimator, we define the risk R, which is simply the expected squared error of an estimator,

R(μ̂_SM, μ) = E[(μ̂_SM − μ)²].

Here, the expectation E is with respect to the randomness in the data. Show that the risk is σ²/n for the sample mean estimator.
2. [4 pts] (Concentration) While the risk measures how well an estimator does in expectation, sometimes we also wish to know that μ̂_SM is within some margin of error ε of the true mean μ with high probability. Prove the following result for any ε > 0:

P(|μ̂_SM − μ| > ε) ≤ 2 exp(−nε² / (2σ²)),

where the probability P is with respect to the randomness in the data. You may use the following facts about normal random variables:
• If X_1, . . . , X_n are normal, then so is Σ_{i=1}^n X_i. (You will need to compute the mean and variance.)
• If X is normal, then so is aX for any a ∈ R. (You will need to compute the mean and variance.)
• If Z ∼ N(0, 1) is a standard normal random variable, then P(|Z| > ε) ≤ 2e^{−ε²/2}.
3. [2 pts] (Sample complexity) Suppose you are given some ε > 0 and δ ∈ (0, 1). You wish to collect enough samples so that your estimator is within an ε margin of error with probability at least 1 − δ. Show that if n ≥ (2σ²/ε²) log(2/δ), we will have the following guarantee: P(|μ̂_SM − μ| > ε) ≤ δ.
N.B: In the first part of the class, we will study a procedure called empirical risk minimization for learning, which essentially chooses a model that performs well on the observed data. After learning, we need to demonstrate that our learned model will do well on future unobserved data. Hence, we need to find conditions under which a model’s performance on the observed dataset will translate to its future generalization performance. Concentration will be an important tool in such proofs.
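If you would like a quick empirical sanity check of the quantities in parts 1 and 2 (purely illustrative, not part of the required solution), the short Python sketch below estimates the risk and the deviation probability by Monte Carlo and compares them with σ²/n and the bound 2 exp(−nε²/(2σ²)). The particular values of μ, σ, n, and ε are arbitrary choices.

import numpy as np

# Arbitrary illustration values (any mu, sigma > 0, n, eps > 0 would do).
mu, sigma, n, eps = 0.3, 2.0, 50, 0.5
trials = 200_000

rng = np.random.default_rng(0)
# Each row is one dataset of n i.i.d. N(mu, sigma^2) samples.
X = rng.normal(mu, sigma, size=(trials, n))
mu_hat = X.mean(axis=1)                        # sample-mean estimator per dataset

risk_mc = np.mean((mu_hat - mu) ** 2)          # Monte Carlo estimate of E[(mu_hat - mu)^2]
print(f"MC risk                     : {risk_mc:.5f}")
print(f"sigma^2 / n                 : {sigma**2 / n:.5f}")

dev_prob = np.mean(np.abs(mu_hat - mu) > eps)  # P(|mu_hat - mu| > eps)
bound = 2 * np.exp(-n * eps**2 / (2 * sigma**2))
print(f"MC P(|error| > eps)         : {dev_prob:.5f}")
print(f"2 exp(-n eps^2 / (2 sigma^2)): {bound:.5f}")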
2 Which estimator is better?
The sample mean is just one of several possible estimators for μ. Student A proposes the following alternative estimator μ̂_α with some parameter α ∈ (0, 1),

μ̂_α = (α/n) Σ_{i=1}^n X_i.

In this question, we will explore if and when μ̂_α could be a better estimator than μ̂_SM.
1. [2 pts] (Bias–variance decomposition) First, show that the following holds for any estimator μ̂,

R(μ̂, μ) = E[(μ̂ − μ)²] = (E[μ̂] − μ)² + E[(μ̂ − E[μ̂])²],

where the first term is the (squared) bias and the second term is the variance.
2. [2 pts] Using the result from part 1, compute the risk of the estimator μ̂_α. Note that, unlike the sample mean, the risk of μ̂_α depends on the true mean μ.
3. [2 pts] Show that there exists at least one value for μ such that μ̂_α is a strictly better estimator than μ̂_SM. That is, there exists μ ∈ R such that, for all α ∈ (0, 1), we have R(μ̂_α, μ) < R(μ̂_SM, μ).
4. [4 pts] (Maximum risk) Despite the result from part 3, Student B is not satisfied with Student A’s proposition, as an estimator should perform well for all values of μ, and not just for one value of μ. In particular, they argue that the worst-case risk over all μ should be small. They propose the following criterion, the maximum risk R, as a way to measure how well an estimator performs:

R(μ̂) = sup_{μ ∈ R} R(μ̂, μ) = sup_{μ ∈ R} E[(μ̂ − μ)²].
(a) Compute R(μ̂_SM) and R(μ̂_α).
(b) Based on the above answers, which estimator would you choose?
5. [5 pts] (Maximum risk over a bounded domain) Suppose we had prior knowledge that μ ∈ [0, 1]. While Student A agrees with Student B’s criterion, they argue that we should modify the definition of the maximum risk to incorporate this prior knowledge. They propose the following definition:

R′(μ̂) = sup_{μ ∈ [0,1]} R(μ̂, μ) = sup_{μ ∈ [0,1]} E[(μ̂ − μ)²].
(a) Compute R′(μ̂_SM) and R′(μ̂_α), the maximum risk over [0, 1] for the two estimators discussed above.
(b) Is there any particular value of α (possibly dependent on n and σ) for which R′(μ̂_α) < R′(μ̂_SM)?
(c) Based on the above answer, which estimator would you choose? Intuitively, explain the discrepancy between the conclusions in part 4 and part 5.
N.B: In the second part of the class, we will explore the concept of minimax optimality. Here, we will design algorithms that have small maximum risk over a class of distributions, and establish lower bounds to prove that no other algorithm can have a significantly smaller maximum risk. We will start with some simple estimation problems, and extend the ideas to regression, classification, density estimation, online learning, and bandits.
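If you would like to probe parts 4 and 5 numerically before (or after) working them out, the Python sketch below estimates the risk of μ̂_SM and μ̂_α by Monte Carlo over a grid of μ ∈ [0, 1] and reports the empirical maximum, i.e. a crude approximation of R′ for each estimator. The values of σ, n, and α are arbitrary assumptions for illustration, and this numerical approximation is not a substitute for the analytical derivation.

import numpy as np

# Assumed illustration values (not specified in the problem): sigma, n, alpha.
sigma, n, alpha = 1.0, 10, 0.8
trials = 100_000
rng = np.random.default_rng(1)

mus = np.linspace(0.0, 1.0, 21)           # grid over the bounded domain [0, 1]
risk_sm, risk_alpha = [], []
for mu in mus:
    X = rng.normal(mu, sigma, size=(trials, n))
    mu_sm = X.mean(axis=1)                # sample-mean estimator
    mu_al = alpha * mu_sm                 # Student A's estimator (alpha/n) * sum of X_i
    risk_sm.append(np.mean((mu_sm - mu) ** 2))
    risk_alpha.append(np.mean((mu_al - mu) ** 2))

# Empirical maximum risk over mu in [0, 1] (the criterion R' of part 5).
print(f"max over mu of R(mu_SM, mu)    ~ {max(risk_sm):.4f}")
print(f"max over mu of R(mu_alpha, mu) ~ {max(risk_alpha):.4f}")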
3 Understanding exploration–exploitation trade-offs
This question is more difficult than the two previous questions. You are allowed to collaborate on this question with up to two classmates.
Consider the following game which proceeds over T rounds. You have access to two normal distributions ν_1 = N(μ_1, σ²) and ν_2 = N(μ_2, σ²), where σ² is known but μ_1, μ_2 ∈ [0, 1] are not. On each round t, you get to choose one distribution I_t ∈ {1, 2}, where I_t = i corresponds to choosing ν_i = N(μ_i, σ²). Once you draw the sample, you earn a monetary reward that is equal to the value of the sample. If the value of the sample is negative, you should pay that amount instead. Your total cumulative reward over T rounds is Σ_{t=1}^T X_t, where X_t ∼ ν_{I_t}. We will measure how well we perform via our average regret, defined below:

R_T = max{μ_1, μ_2} − (1/T) Σ_{t=1}^T X_t.
We wish to design an algorithm whose average regret vanishes¹ with T in expectation, i.e. E[R_T] → 0 as T → ∞.

¹ Intuitively, if we knew a priori which mean was larger, we would always pull the arm with the highest mean and have E[R_T] = 0, as (1/T) E[Σ_t X_t] = max{μ_1, μ_2}. If E[R_T] → 0, this means we are able to learn which of the two distributions has a larger mean and converge towards the correct answer as we collect more samples.

Algorithm: A student proposes the following simple algorithm. First sample each of the distributions N times (where N < T/2). Then, for the remaining T − 2N rounds, sample the distribution with the highest observed sample mean using the N samples. That is, I_t is chosen as follows:

I_t = 1 if t ≤ N,
      2 if N + 1 ≤ t ≤ 2N,
      1 if t > 2N and μ̂_1 ≥ μ̂_2,
      2 if t > 2N and μ̂_1 < μ̂_2,

where μ̂_1 = (1/N) Σ_{t=1}^N X_t and μ̂_2 = (1/N) Σ_{t=N+1}^{2N} X_t.

For what follows, let ∆ = |μ_1 − μ_2| denote the gap between the two means.
1. [5 pts] (Regret decomposition) Establish the following identity for the expected average regret:

E[R_T] = N∆/T + ((T − 2N)∆/T) · Φ(−∆ √(N/(2σ²))).

Here, Φ(x) = P_{Z∼N(0,1)}(Z < x) is the CDF of the standard normal distribution.
2. [2 pts] Using the result from part 1 and the fact that μ_1, μ_2 ∈ [0, 1], show the following upper bound on the expected average regret:

E[R_T] ≤ N/T + ∆ exp(−N∆²/(4σ²)).
You may use the following property about standard normals, which is a one-sided version of the inequality given in problem 1. If Z ∼ N(0, 1), then P(Z < −ε) ≤ e^{−ε²/2}.
3. [3 pts] Use the result in part 2 to show the following upper bound.
E[R_T] ≤ N/T + Cσ/√N, where C = √2 e^{−1/2}.

Hint: Consider the function f(x) = log(x) − αx², where α > 0. What is the maximizer of f?
4. [2 pts] (An optimal choice of N.) Specify a choice for N, depending only on σ and T, so that the upper bound in part 3 is minimized. Are you able to achieve E[R_T] → 0 as T → ∞? If so, at what rate does it go to zero?
5. [2 pts] (Exploration–exploitation trade-off.) Let N⋆ denote the optimal choice in part 4. In words, explain what would happen had we chosen N ≪ N⋆ or N ≫ N⋆.
N.B: In the third part of the class, we will study several models for adaptive decision-making. The model discussed in this question is an example of a stochastic bandit, which is one paradigm for decision-making. In bandit settings, we often have to trade-off between exploration (learning about the environment) and exploitation (leveraging what we have learned to maximize rewards). The above algorithm is a simple, albeit sub-optimal, procedure where we have an explicit exploration phase (first 2N rounds) and exploitation phase (last T − 2N rounds). In class, we will look at better algorithms to manage this trade-off which have faster rates of convergence.
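For those who want to experiment, here is a minimal Python simulation sketch of the algorithm described above (often called explore-then-commit), which estimates E[R_T] by averaging the average regret over repeated runs. The values of μ_1, μ_2, σ, T, and N below are arbitrary assumptions; varying N by hand (much smaller or much larger) is a quick way to get a feel for the trade-off in part 5.

import numpy as np

def explore_then_commit(mu1, mu2, sigma, T, N, rng):
    """One run of the algorithm: sample each arm N times, then commit to the
    arm with the higher sample mean for the remaining T - 2N rounds.
    Returns the average regret R_T for this run."""
    x1 = rng.normal(mu1, sigma, N)              # rounds 1..N pull arm 1
    x2 = rng.normal(mu2, sigma, N)              # rounds N+1..2N pull arm 2
    committed_mean = mu1 if x1.mean() >= x2.mean() else mu2
    rest = rng.normal(committed_mean, sigma, T - 2 * N)  # exploitation phase
    total_reward = x1.sum() + x2.sum() + rest.sum()
    return max(mu1, mu2) - total_reward / T

# Assumed illustration values; try N much smaller or larger to see the trade-off.
mu1, mu2, sigma, T, N = 0.6, 0.4, 1.0, 10_000, 200
rng = np.random.default_rng(2)
regrets = [explore_then_commit(mu1, mu2, sigma, T, N, rng) for _ in range(2_000)]
print(f"estimated E[R_T] ~ {np.mean(regrets):.4f}")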