
CS861: Theoretical Foundations of Machine Learning Lecture 8 – 09/22/2023 University of Wisconsin–Madison, Fall 2023
Lecture 08: Introduction to Minimax Optimality
Lecturer: Kirthevasan Kandasamy Scribed by: Xindi Lin, Zhihao Zhao
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we introduce minimax optimality and some related concepts. We first study minimax optimality for point estimation, and then extend the ideas beyond point estimation.
1 Minimax Optimality
Recall that, in most cases, we cannot find an estimator that is uniformly better than all other estimators, i.e. there is no θ̂ such that, for every other estimator θ̂′,
\[ R(P, \hat\theta) \le R(P, \hat\theta'), \qquad \forall P \in \mathcal{P}. \]
Therefore, we wish to find an estimator which minimizes the maximum risk,
\[ \sup_{P \in \mathcal{P}} R(P, \hat\theta). \]
Definition 1. The minimax risk R* of a point estimation problem is defined as follows,
\[ R^* = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} R(P, \hat\theta) = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_{S \sim P}\Big[ l\big(\theta(P), \hat\theta(S)\big) \Big]. \]
Note that R* depends on P, the loss function l, and the size of S. An estimator θ̂* which achieves the minimax risk, i.e. sup_{P∈P} R(P, θ̂*) = R*, is said to be minimax-optimal.
How do we compute the minimax risk? Classically, this was done via a concept called the "least favorable prior", which involves finding a Bayes' estimator with constant (frequentist) risk. In this class, we will follow the recipe below for computing the minimax risk, which we will also apply to general estimation problems.
1. Design a "good estimator" θ̂ and upper bound its maximum risk, say by U_n, so that
\[ R^* \le \sup_{P \in \mathcal{P}} R(P, \hat\theta) \le U_n. \]
2. Design a "good prior" Λ on P and lower bound the Bayes' risk, say by L_n. Recall that for any estimator θ̂,
\[ \sup_{P \in \mathcal{P}} R(P, \hat\theta) \;\ge\; \mathbb{E}_{P \sim \Lambda}\big[ R(P, \hat\theta) \big] \;\ge\; \mathbb{E}_{P \sim \Lambda}\big[ R(P, \hat\theta_\Lambda) \big], \]
where θ̂_Λ is the Bayes' estimator. The first inequality holds since "sup ≥ average", and the second holds since the Bayes' estimator minimizes the Bayes' risk. By taking the infimum over all estimators, we obtain R* ≥ L_n.
3. Sometimes, we may need to restrict to a sub-class P′ ⊂ P and construct our prior Λ on P′. We then use
\[ \sup_{P \in \mathcal{P}} R(P, \hat\theta) \;\ge\; \sup_{P \in \mathcal{P}'} R(P, \hat\theta). \]
4. If U_n = L_n, then U_n is the minimax risk and θ̂ is minimax-optimal.
5. It is not always possible to achieve exact equality. However, if U_n > L_n but U_n ∈ O(L_n), then U_n is the minimax rate and θ̂ is rate-optimal.
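The chain of inequalities in step 2 can be checked numerically. Below is a minimal sketch (not part of the original notes; the shrinkage constant c = 0.9, the μ-grid, and the prior variance τ² are arbitrary illustrative choices) for the normal mean problem: it computes the exact frequentist risk of the shrinkage estimator c·X̄, and compares its maximum risk over a grid of means with its average risk under a N(0, τ²) prior and with the Bayes' risk (1/τ² + n/σ²)⁻¹ of the Bayes' estimator.

    import numpy as np

    # Normal mean estimation: X_1, ..., X_n ~ N(mu, sigma^2); estimator c * (sample mean).
    # Its exact frequentist risk is bias^2 + variance = ((c - 1) * mu)^2 + c^2 * sigma^2 / n.
    n, sigma2, tau2, c = 20, 1.0, 4.0, 0.9

    def risk(mu):
        return ((c - 1.0) * mu) ** 2 + c ** 2 * sigma2 / n

    mus = np.linspace(-5.0, 5.0, 1001)                 # finite grid standing in for the sup over mu
    max_risk = risk(mus).max()

    rng = np.random.default_rng(0)
    avg_risk = risk(rng.normal(0.0, np.sqrt(tau2), 100_000)).mean()   # average risk under N(0, tau^2)

    bayes_risk = 1.0 / (1.0 / tau2 + n / sigma2)       # Bayes' risk of the Bayes' estimator
    print(f"max {max_risk:.4f} >= average {avg_risk:.4f} >= Bayes {bayes_risk:.4f}")

The printed numbers illustrate "sup ≥ average ≥ Bayes' risk of the Bayes' estimator"; only the last quantity yields a lower bound on R*, since it does not depend on the particular estimator being evaluated.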
Example 1. Let S = {X_1, .., X_n} be drawn i.i.d. from N(μ, σ²), where σ² is known, and let P = {N(μ, σ²); μ ∈ R}. We will show that the sample mean θ̂_SM(S) = (1/n) ∑_{i=1}^n X_i is minimax-optimal.
First, we find the upper bound:
\[ \sup_{P \in \mathcal{P}} R(P, \hat\theta_{\mathrm{SM}}) = \sup_{\mu \in \mathbb{R}} \mathbb{E}_{S \sim N(\mu,\sigma^2)}\Big[ \big(\mu - \hat\theta_{\mathrm{SM}}(S)\big)^2 \Big] = \sup_{\mu \in \mathbb{R}} \frac{\sigma^2}{n} = \frac{\sigma^2}{n} \;\Longrightarrow\; R^* \le \frac{\sigma^2}{n}. \]
Then we find the lower bound via the Bayes' risk. Consider the prior Λ = N(0, τ²). From our example in the last lecture,
\[ R^* \;\ge\; L_n = \frac{\tau^2 \cdot \sigma^2/n}{\tau^2 + \sigma^2/n} = \left( \frac{1}{\tau^2} + \frac{1}{\sigma^2/n} \right)^{-1} \quad \text{holds for every } \tau^2 > 0, \]
\[ \Longrightarrow\; R^* \;\ge\; \sup_{\tau^2 > 0} \left( \frac{1}{\tau^2} + \frac{1}{\sigma^2/n} \right)^{-1} = \frac{\sigma^2}{n}. \]
Combining the two bounds, we conclude that θ̂_SM is minimax-optimal and σ²/n is the minimax risk.
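As a quick numerical sanity check of Example 1 (a sketch, not from the notes; the values of n, σ and the three choices of μ are arbitrary), a Monte Carlo estimate of the sample mean's risk is close to σ²/n for every μ, so its maximum risk already matches the lower bound.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma, trials = 50, 2.0, 100_000

    for mu in [-3.0, 0.0, 5.0]:
        X = rng.normal(mu, sigma, size=(trials, n))   # `trials` datasets of size n from N(mu, sigma^2)
        mse = np.mean((X.mean(axis=1) - mu) ** 2)     # Monte Carlo estimate of E[(theta_hat_SM - mu)^2]
        print(f"mu = {mu:+.1f}: estimated risk {mse:.5f}  (sigma^2/n = {sigma**2 / n:.5f})")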
Example 2. Let S = {X_1, .., X_n} be drawn i.i.d. from Bernoulli(θ), where θ = E_{X∼P}[X], and let P = {Bernoulli(θ); θ ∈ [0, 1]}. Let us consider the sample mean θ̂_SM(S) = (1/n) ∑_{i=1}^n X_i.
First, the upper bound is found as follows. The risk is the variance of the sample mean,
\[ R(P, \hat\theta_{\mathrm{SM}}) = \mathbb{E}_{S \sim P}\Big[ \big(\theta - \hat\theta_{\mathrm{SM}}(S)\big)^2 \Big] = \frac{\theta(1-\theta)}{n}, \qquad \sup_{P \in \mathcal{P}} R(P, \hat\theta_{\mathrm{SM}}) = \sup_{\theta \in [0,1]} \frac{\theta(1-\theta)}{n} = \frac{1}{4n} =: U_n. \]
To find the lower bound, we use Λ = Beta(a, b) as the prior. The Bayes' estimator is the posterior mean θ̂_Λ(S) = (∑_i X_i + a)/(n + a + b), whose mean squared error at a fixed θ is [nθ(1−θ) + (a − (a+b)θ)²]/(n+a+b)². Averaging over θ ∼ Beta(a, b) gives the following Bayes' risk,
\[ L_n = \mathbb{E}_{\theta \sim \Lambda}\big[ R(\theta, \hat\theta_\Lambda) \big] = \frac{1}{(n+a+b)^2} \Big[ a^2 + \big(n - 2a(a+b)\big)\, \mathbb{E}_\theta[\theta] - \big(n - (a+b)^2\big)\, \mathbb{E}_\theta[\theta^2] \Big]. \]
By choosing a = b = √n/2, we have (a+b)² = n and 2a(a+b) = n, so both θ-dependent terms vanish (the risk is constant in θ) and we get
\[ L_n = \frac{a^2}{(n+a+b)^2} = \frac{n/4}{(n+\sqrt{n})^2} = \frac{1}{4(\sqrt{n}+1)^2} = \frac{1}{4n + 8\sqrt{n} + 4}. \]
We have U_n > L_n, but U_n ∈ O(L_n), so θ̂_SM is rate-optimal and 1/n is the minimax rate.
As a side note, it can be shown that
\[ \hat\theta^*(S) = \frac{\sqrt{n}}{1+\sqrt{n}} \left( \frac{1}{n} \sum_{i=1}^n X_i \right) + \frac{1}{1+\sqrt{n}} \cdot \frac{1}{2} \]
is minimax-optimal and 1/(4(√n+1)²) is the minimax risk.
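To see the gap between the rate-optimal sample mean and the exactly minimax estimator in Example 2, the following sketch (illustrative, not from the notes; n = 25 is an arbitrary choice) evaluates both risk functions in closed form: θ(1−θ)/n for the sample mean, and bias² plus variance for θ̂*, whose risk is constant and equal to 1/(4(√n+1)²).

    import numpy as np

    n = 25
    theta = np.linspace(0.0, 1.0, 501)

    # Risk of the sample mean: Var(Binomial(n, theta)) / n^2 = theta * (1 - theta) / n.
    risk_sm = theta * (1.0 - theta) / n

    # Risk of theta_hat* = (sum_i X_i + a) / (n + 2a) with a = sqrt(n)/2: bias^2 + variance.
    a = np.sqrt(n) / 2.0
    bias = (n * theta + a) / (n + 2.0 * a) - theta
    risk_star = bias ** 2 + n * theta * (1.0 - theta) / (n + 2.0 * a) ** 2

    print("sample mean : max risk", risk_sm.max(), "= 1/(4n) =", 1.0 / (4 * n))
    print("theta_hat*  : max risk", risk_star.max(), "= 1/(4(sqrt(n)+1)^2) =", 1.0 / (4 * (np.sqrt(n) + 1) ** 2))
    print("theta_hat* has constant risk:", bool(np.allclose(risk_star, risk_star[0])))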
Example 3. Let S = {X_1, .., X_n} be drawn i.i.d. from P ∈ P, where P = {all distributions with variance bounded by B²}, and let θ(P) = E_{X∼P}[X]. We will show that the sample mean θ̂_SM(S) = (1/n) ∑_{i=1}^n X_i is minimax-optimal.
Proof For the upper bound, the risk is again the variance of the sample mean,
\[ \sup_{P \in \mathcal{P}} R(P, \hat\theta_{\mathrm{SM}}) = \sup_{P \in \mathcal{P}} \mathbb{E}_{S \sim P}\Big[ \big(\hat\theta_{\mathrm{SM}}(S) - \theta(P)\big)^2 \Big] = \sup_{P \in \mathcal{P}} \frac{\mathrm{Var}_{X \sim P}(X)}{n} \le \frac{B^2}{n} \;\Longrightarrow\; R^* \le \frac{B^2}{n}. \]
The lower bound can be found by choosing the sub-class P′ = {N(μ, B²); μ ∈ R}. By Example 1,
\[ R^* = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} R(P, \hat\theta) \;\ge\; \inf_{\hat\theta} \sup_{P \in \mathcal{P}'} R(P, \hat\theta) = \frac{B^2}{n}. \]
Combining the two bounds, we conclude that B²/n is the minimax risk and θ̂_SM is minimax-optimal.
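As a numerical aside on Example 3 (a sketch, not part of the notes; the uniform and shifted exponential below are arbitrary members of P with variance at most B² = 4), the sample mean's risk is Var(X)/n for any such P, and therefore never exceeds B²/n.

    import numpy as np

    rng = np.random.default_rng(2)
    n, B, trials = 40, 2.0, 100_000

    def sample_mean_risk(draw, mean):
        """Monte Carlo estimate of E[(sample mean - mean)^2] for a dataset sampler `draw`."""
        X = draw((trials, n))
        return float(np.mean((X.mean(axis=1) - mean) ** 2))

    # Uniform on [-3, 3] has variance 3 <= B^2; Exponential(1) - 1 has mean 0 and variance 1 <= B^2.
    r_unif = sample_mean_risk(lambda size: rng.uniform(-3.0, 3.0, size), mean=0.0)
    r_expo = sample_mean_risk(lambda size: rng.exponential(1.0, size) - 1.0, mean=0.0)
    print(f"uniform: {r_unif:.5f}   shifted exponential: {r_expo:.5f}   bound B^2/n = {B**2 / n:.5f}")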
We will now study minimax optimality for general estimation problems. The following lessons from point estimation will be useful going forward.
• Often, the easiest way to lower bound the maximum risk is via the average risk, using the fact that the maximum is larger than the average.
• We should be careful in how we choose the subset P′ and the prior Λ, so that the lower bound is tight.
• We still need to design a good estimator to establish the upper bound.
2 Beyond Point Estimation
We will now extend the ideas beyond point estimation. Our estimation problem will have the following components:
1. A family of distributions P
2. A dataset S of n i.i.d. points drawn from P ∈ P
3. A function (parameter) θ : P → Θ. We wish to estimate θ(P) from S.
4. An estimator θ̂ = θ̂(S) ∈ Θ
5. A loss function l = Φ ◦ ρ, which satisfies the following conditions:
• ρ : Θ × Θ → R+ satisfies the following properties for all θ1, θ2, θ3 ∈ Θ:
(i) ρ(θ1, θ1) = 0,
(ii) ρ(θ1, θ2) = ρ(θ2, θ1),
(iii) ρ(θ1, θ2) ≤ ρ(θ1, θ3) + ρ(θ3, θ2).
• Φ : R+ → R+ is non-decreasing.
When we estimate θ(P) with θ̂, the loss is l(θ(P), θ̂) = Φ(ρ(θ(P), θ̂)).
6. The risk of an estimator θ̂ is
\[ R(P, \hat\theta) = \mathbb{E}_{S \sim P}\Big[ \Phi \circ \rho\big( \theta(P), \hat\theta(S) \big) \Big] = \mathbb{E}_{S}\Big[ \Phi \circ \rho\big( \theta, \hat\theta \big) \Big]. \]
Note that, as before, we have overloaded notation so that θ denotes both the parameter θ ∈ Θ and the function θ : P → Θ. Similarly, θ̂ denotes both the estimate θ̂ ∈ Θ and the estimator, which maps the data to Θ.
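The decomposition l = Φ ◦ ρ suggests a generic way to estimate the risk of any estimator by simulation. The helper below is a hypothetical sketch (the function name and its arguments are not from the notes); it accepts an arbitrary ρ, Φ, estimator, and data sampler, and is shown here with the squared loss from point estimation.

    import numpy as np

    def monte_carlo_risk(sample, theta_true, estimator, rho, phi, trials=50_000, seed=0):
        """Estimate R(P, theta_hat) = E_S[ Phi(rho(theta(P), theta_hat(S))) ] by simulation."""
        rng = np.random.default_rng(seed)
        losses = [phi(rho(theta_true, estimator(sample(rng)))) for _ in range(trials)]
        return float(np.mean(losses))

    # Point estimation with squared loss: rho(t1, t2) = |t1 - t2| and Phi(t) = t^2.
    n, mu, sigma = 30, 1.0, 1.0
    est_risk = monte_carlo_risk(
        sample=lambda rng: rng.normal(mu, sigma, n),   # S = (X_1, ..., X_n), X_i ~ N(mu, sigma^2)
        theta_true=mu,
        estimator=lambda S: S.mean(),                  # sample mean
        rho=lambda t1, t2: abs(t1 - t2),
        phi=lambda t: t ** 2,
    )
    print(f"estimated risk {est_risk:.5f}  (sigma^2/n = {sigma**2 / n:.5f})")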

Figure 1: Illustration of θ(P)(·). Black dots are the observed data and the blue curve is the regression function θ(P)(·). For every x, the point of intersection, marked in red, is the regression value E[Y | X = x].
Definition 2. We can now define the minimax risk R∗ as follows,
\[ R^* = \inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_{S \sim P}\Big[ \Phi \circ \rho\big( \theta(P), \hat\theta(S) \big) \Big]. \]
Example 4. (Normal mean estimation) Let S = {X_1, .., X_n} be drawn i.i.d. from N(μ, σ²), where σ² is known. Here, P = {N(μ, σ²); μ ∈ R}. We wish to estimate θ(P) = E_{X∼P}[X], so Θ = R. If we use the squared loss l(θ1, θ2) = (θ1 − θ2)², then ρ(θ1, θ2) = |θ1 − θ2| and Φ(t) = t².
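As a one-line check of the decomposition in Example 4 (illustrative only): composing Φ(t) = t² with ρ(θ1, θ2) = |θ1 − θ2| recovers the squared loss.

    rho = lambda t1, t2: abs(t1 - t2)   # metric on Theta = R
    phi = lambda t: t ** 2              # non-decreasing on R+
    t1, t2 = 1.3, -0.4
    assert phi(rho(t1, t2)) == (t1 - t2) ** 2   # Phi(rho(t1, t2)) equals the squared loss
    print(phi(rho(t1, t2)))                     # ~2.89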
Example 5. (Regression) Here, P is the set of all distributions with support on X × R, where X is the input space. The parameter space Θ = {h : X → R} is the class of functions mapping X to R. We wish to estimate the regression function, which is given by
\[ \theta(P)(\cdot) = \mathbb{E}[Y \mid X = \cdot], \qquad \text{i.e.,}\quad \theta(P)(x) = \int y \, \mathrm{d}P(y \mid X = x) \quad \text{(the regression function; see Figure 1).} \]
We have illustrated this in Figure 1. If we use the L2 loss, l(θ1, θ2) = ∫ (θ1(x) − θ2(x))² dx, then we have
\[ \rho(\theta_1, \theta_2) \triangleq \sqrt{ \int \big(\theta_1(x) - \theta_2(x)\big)^2 \, \mathrm{d}x } = \| \theta_1 - \theta_2 \|_2, \qquad \text{and} \qquad \Phi(t) = t^2. \]
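For the regression loss in Example 5, ρ is defined via an integral and must typically be approximated numerically. A minimal sketch (the two functions, the input space [0, 1], and the grid size are arbitrary illustrative choices) computes ρ(θ1, θ2) = ‖θ1 − θ2‖₂ by a Riemann sum and then applies Φ(t) = t².

    import numpy as np

    # Two candidate regression functions on the input space X = [0, 1].
    theta1 = lambda x: np.sin(2.0 * np.pi * x)
    theta2 = lambda x: np.sin(2.0 * np.pi * x) + 0.1 * x

    x = np.linspace(0.0, 1.0, 2001)                        # uniform grid over X
    rho = np.sqrt(np.mean((theta1(x) - theta2(x)) ** 2))   # Riemann-sum approximation of ||theta1 - theta2||_2
    loss = rho ** 2                                        # Phi(t) = t^2, so the loss is the squared L2 distance
    print(f"rho = {rho:.6f}, L2 loss = {loss:.6f}")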