CS861: Theoretical Foundations of Machine Learning Lecture 7 – 09/20/2023 University of Wisconsin–Madison, Fall 2023
Lecture 07: Lower Bounds for Point Estimation
Lecturer: Kirthevasan Kandasamy Scribed by: Joseph Salzer, Tony Chang Wang
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the instructor.
In this lecture, we describe the general framework for point estimation and introduce the concept of average (Bayesian) risk optimality.
1 Statistical Lower Bounds
We should always ask ourselves: is our learning algorithm / estimator optimal? For example, our previous lectures showed that the error of ERM decreases at a rate of roughly Õ(1/√n). It is natural to ask whether this rate of convergence can be improved any further. Such questions can be answered by studying statistical lower bounds, as they allow us to characterize the difficulty of a learning problem.
We will start with point estimation, explained below, and then extend the ideas for regression, density estimation, and classification in subsequent lectures.
2 Point Estimation
We are interested in estimating a single parameter of a distribution (e.g., the mean of the distribution) using data drawn from that distribution.
A point estimation problem consists of the following components:
1. A family of distributions: P.
2. A data set S: drawn i.i.d. from some distribution P ∈ P.
3. A parameter of interest: θ = θ(P) ∈ R, where P is the distribution of our data S.
4. An estimator for the parameter based on the drawn data set S: θˆ = θˆ(S) ∈ R.
5. A loss function l : R × R → R, which measures how well θˆ estimates the parameter θ.
6. Risk: R(P, θˆ), the expected loss when the data S is drawn from the distribution P (a small numerical sketch of this definition is given right after this list):
\[
R(P, \hat\theta) \;=\; \mathbb{E}_{S\sim P}\big[\, l\big(\theta(P), \hat\theta(S)\big) \,\big] \;=\; \int l\big(\theta(P), \hat\theta(s)\big)\, \mathrm{d}P(s).
\]
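To make the risk definition concrete, here is a minimal Monte Carlo sketch (an illustration, not from the lecture) that approximates R(P, θˆ) by averaging the loss over repeated draws of the data set; the function names and the choice of a normal model are assumptions made for this example.

\begin{verbatim}
import numpy as np

def monte_carlo_risk(sample_data, estimator, true_theta, loss, n_trials=10_000, seed=None):
    """Approximate R(P, theta_hat) = E_{S~P}[ l(theta(P), theta_hat(S)) ] by simulation."""
    rng = np.random.default_rng(seed)
    losses = [loss(true_theta, estimator(sample_data(rng))) for _ in range(n_trials)]
    return float(np.mean(losses))

# Illustrative setup (assumed): estimate the mean of N(theta, 1) from n = 50 i.i.d. points.
theta, n = 2.0, 50
risk = monte_carlo_risk(
    sample_data=lambda rng: rng.normal(theta, 1.0, size=n),  # S ~ P, i.i.d.
    estimator=np.mean,                                       # theta_hat(S) = sample mean
    true_theta=theta,
    loss=lambda t1, t2: (t1 - t2) ** 2,                      # squared error loss
)
print(risk)  # close to sigma^2 / n = 0.02
\end{verbatim}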
Example 1 (Normal Mean Estimation).
1. Distribution family: P = {N(μ, σ²) : μ ∈ R}, where σ² is known.
2. Data: S = {x1, x2, …, xn}, drawn i.i.d. from some P ∈ P.
3. Parameter: θ = θ(P) = E_{X∼P}[X], for P ∈ P.
4. Loss function: l(θ1, θ2) = (θ1 − θ2)², for any two parameters θ1, θ2.
5. Risk: R(P, θˆ) = E_{S∼P}[l(θ(P), θˆ(S))], for P ∈ P and an estimator θˆ.
6. In HW0, we studied the two estimators
\[
\hat\theta_1(S) = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat\theta_2(S) = \frac{\alpha}{n}\sum_{i=1}^n x_i, \quad \alpha \in [0,1),
\]
whose risks are
\[
R(\theta, \hat\theta_1) = \mathbb{E}_{S\sim P}\big[(\theta - \hat\theta_1(S))^2\big] = \frac{\sigma^2}{n},
\qquad
R(\theta, \hat\theta_2) = \mathbb{E}_{S\sim P}\big[(\theta - \hat\theta_2(S))^2\big] = \theta^2(1-\alpha)^2 + \frac{\alpha^2\sigma^2}{n}.
\]
(In both cases, when we take the expectation, θ is fixed and θˆ is the random variable, since θˆ = θˆ(S); the expectation is with respect to the data S.)
Notice that R(θˆ1, θ) > R(θˆ2, θ) for some values of θ, while R(θˆ2, θ) > R(θˆ1, θ) for others. This makes it difficult to find an optimal estimator θˆ that estimates well for an arbitrary parameter θ; the short numerical comparison below illustrates this.
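As an illustration (not part of the original notes), the two closed-form risks above can be compared at a few parameter values; the particular numbers chosen for n, σ, α, and θ below are arbitrary assumptions.

\begin{verbatim}
import numpy as np

# Risks from Example 1: R(theta, theta_hat_1) = sigma^2 / n and
# R(theta, theta_hat_2) = theta^2 (1 - alpha)^2 + alpha^2 sigma^2 / n.
n, sigma, alpha = 100, 1.0, 0.9

def risk_1(theta):
    return sigma**2 / n                                   # does not depend on theta

def risk_2(theta):
    return theta**2 * (1 - alpha)**2 + alpha**2 * sigma**2 / n

for theta in [0.0, 0.2, 1.0]:
    print(f"theta={theta:4.1f}  R1={risk_1(theta):.4f}  R2={risk_2(theta):.4f}")
# Near theta = 0 the shrunk estimator theta_hat_2 has smaller risk; for larger
# |theta| the sample mean theta_hat_1 wins, so neither estimator dominates.
\end{verbatim}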
Extending this example, we see that the constant estimator θˆ(S) = μ0, for some fixed μ0 ∈ R, achieves 0 risk when θ = μ0 but performs poorly elsewhere. This illustrates that we cannot find a uniformly good estimator θˆ⋆ which minimizes R(θˆ, θ) for all θ.
Hence, it is customary to resort to other notions of optimality. The following two notions are common in the literature:
Minimax Optimality. Find an estimator θˆ that minimizes the maximum risk over the class of distributions P:
\[
\min_{\hat\theta} \; \sup_{P \in \mathcal{P}} \; R\big(\theta(P), \hat\theta\big).
\]
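To illustrate the minimax criterion with Example 1 (an assumed toy computation, not from the notes): over a bounded range of means the worst-case risk of θˆ2 grows with the range, whereas the worst-case risk of θˆ1 stays at σ²/n, so θˆ1 is preferable in the minimax sense.

\begin{verbatim}
import numpy as np

n, sigma, alpha = 100, 1.0, 0.9                 # assumed constants

thetas = np.linspace(-5, 5, 1001)               # a bounded surrogate for the parameter space
risk_1 = np.full_like(thetas, sigma**2 / n)     # constant risk sigma^2 / n
risk_2 = thetas**2 * (1 - alpha)**2 + alpha**2 * sigma**2 / n

# sup over this range: theta_hat_1 stays at 0.01, while theta_hat_2's worst
# case (0.2581 here) keeps growing as the range of theta widens.
print(risk_1.max(), risk_2.max())
\end{verbatim}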
Average Risk Optimality. Find an estimator θˆ which minimizes the average risk over some distribution on P. We will study this next.
3 Average Risk Optimality
Before we define what average risk optimality is, we have to introduce a probability measure Λ over the family of distributions P. This measure Λ ought to be seen as an assignment of various weights to the parameter of interest θ := θ(P) ∈ R, before the data is observed. In Bayesian terminology, Λ is our prior distribution and θ is a random variable.
Definition 1 (Average Risk). The average risk of an estimator θˆ under a prior Λ is given by
\[
R_\Lambda(\hat\theta) \;=\; \int R(P, \hat\theta)\, \mathrm{d}\Lambda(P) \;=\; \mathbb{E}_{P\sim\Lambda}\big[R(P, \hat\theta)\big].
\]
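For intuition (an illustrative sketch, not from the notes), the average risk can be approximated by nesting two Monte Carlo loops: first sample θ (equivalently, P) from the prior Λ, then sample data sets from P. The normal model, normal prior, and all constants below are assumptions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 20, 1.0          # data model: X_i | theta ~ N(theta, sigma^2), i.i.d.
mu, tau = 0.0, 2.0          # prior: theta ~ Lambda = N(mu, tau^2)

def average_risk(estimator, n_priors=2000, n_datasets=200):
    """R_Lambda(theta_hat) ~ average over theta ~ Lambda of the risk R(P_theta, theta_hat)."""
    total = 0.0
    for _ in range(n_priors):
        theta = rng.normal(mu, tau)                       # draw P from the prior
        data = rng.normal(theta, sigma, (n_datasets, n))  # repeated data sets S ~ P
        total += np.mean((estimator(data) - theta) ** 2)  # squared error loss
    return total / n_priors

print(average_risk(lambda data: data.mean(axis=1)))  # sample mean: approx sigma^2/n = 0.05
\end{verbatim}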
Now, we are equipped to define the Bayes estimator and Bayes risk:
Definition 2 (Bayes Estimator). An estimator θˆΛ which minimizes the average risk is called the Bayes
estimator (provided it exists).
Definition 3 (Bayes Risk). The average risk of the Bayes estimator is called the Bayes risk. Namely,
\[
R_\Lambda(\hat\theta_\Lambda) \;=\; \min_{\hat\theta}\, R_\Lambda(\hat\theta).
\]
It may be difficult to find the Bayes estimator from the definition of average risk alone. Instead we can apply the tower property for conditional expectation and rewrite the average risk as
\[
R_\Lambda(\hat\theta)
\;=\; \mathbb{E}_{P\sim\Lambda}\Big[\, \mathbb{E}_{S\sim P}\big[\, l\big(\theta(P), \hat\theta(S)\big) \,\big|\, P \,\big] \Big]
\;=\; \mathbb{E}\Big[\, \mathbb{E}\big[\, l\big(\theta(P), \hat\theta(S)\big) \,\big|\, S \,\big] \Big].
\]
Thus, in order to find the Bayes estimator, it suffices to find the value of θˆ(S) that minimizes the conditional risk E[l(θ(P), θˆ(S)) | S], with the expectation taken over the posterior distribution. As an aside, we have dropped the distributions that S and P are drawn from in the second equality above; this is an abuse of notation. The inner expectation is with respect to the posterior distribution of the parameter given the data. The outer expectation is with respect to the marginal distribution of S:
\[
\mathbb{P}(S \in A) \;=\; \int_{\mathcal{P}} P(S \in A)\, \mathrm{d}\Lambda(P).
\]
Lemma 1. Under the squared error loss l(θ1, θ2) = (θ1 − θ2)², the Bayes estimator is the posterior mean:
\[
\hat\theta_\Lambda(S) \;=\; \mathbb{E}\big[\theta(P) \,\big|\, S\big].
\]
Proof Let θˆ(S) := E[θ(P)|S] be the posterior mean. Consider any other estimator θˆ′ with conditional
average risk:
\begin{align*}
\mathbb{E}\big[(\hat\theta' - \theta)^2 \,\big|\, S\big]
&= \mathbb{E}\big[(\hat\theta' - \hat\theta + \hat\theta - \theta)^2 \,\big|\, S\big] \\
&= \mathbb{E}\big[(\hat\theta' - \hat\theta)^2 \,\big|\, S\big] + \mathbb{E}\big[(\hat\theta - \theta)^2 \,\big|\, S\big]
   + \underbrace{2\,\mathbb{E}\big[(\hat\theta' - \hat\theta)(\hat\theta - \theta) \,\big|\, S\big]}_{=\,2(\hat\theta' - \hat\theta)\,\mathbb{E}[\hat\theta - \theta \,\mid\, S] \,=\, 0 \text{ since } \mathbb{E}[\theta \mid S] = \hat\theta} \\
&\ge \mathbb{E}\big[(\hat\theta - \theta)^2 \,\big|\, S\big].
\end{align*}
Clearly θˆ minimizes the conditional average risk (thereby minimizing the average risk) and is thus the Bayes
estimator.
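As a quick numerical sanity check of Lemma 1 (illustrative, not from the lecture), for a fixed posterior the posterior mean minimizes the posterior expected squared error over candidate values of θˆ(S); the discrete toy posterior below is an assumption.

\begin{verbatim}
import numpy as np

# A toy discrete posterior over theta given some fixed data S (values assumed).
theta_vals = np.array([0.1, 0.4, 0.7, 1.0])
post_probs = np.array([0.1, 0.3, 0.4, 0.2])   # sums to 1

def posterior_expected_sq_loss(estimate):
    """E[(estimate - theta)^2 | S] under the toy posterior."""
    return np.sum(post_probs * (estimate - theta_vals) ** 2)

candidates = np.linspace(0.0, 1.2, 1201)
best = candidates[np.argmin([posterior_expected_sq_loss(c) for c in candidates])]
posterior_mean = np.sum(post_probs * theta_vals)
print(best, posterior_mean)   # both approximately 0.61, as Lemma 1 predicts
\end{verbatim}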
We will now provide two examples on how to explicitly determine both the Bayes estimator and Bayes risk.
Example 2. Let our data be given by S = {X_i}_{i=1}^n. Suppose X_i | θ ∼ N(θ, σ²) i.i.d. and θ ∼ Λ = N(μ, τ²), with σ², μ, and τ² known. Due to normal-normal conjugacy, we have θ | S ∼ N(μ̃, τ̃²), where
\[
\tilde\mu := \frac{\sigma^2/n}{\tau^2 + \sigma^2/n}\,\mu \;+\; \frac{\tau^2}{\tau^2 + \sigma^2/n}\left(\frac{1}{n}\sum_{i=1}^n X_i\right),
\qquad
\tilde\tau^2 := \left(\frac{1}{\tau^2} + \frac{1}{\sigma^2/n}\right)^{-1}.
\]
Since μ̃ is the posterior mean, it is the Bayes estimator. Then the Bayes risk is given by
\[
R_\Lambda(\tilde\mu)
\;=\; \mathbb{E}_{S}\Big[\, \underbrace{\mathbb{E}_{\theta}\big[(\theta - \tilde\mu)^2 \,\big|\, S\big]}_{\text{posterior variance}} \,\Big]
\;=\; \mathbb{E}\big[\tilde\tau^2\big].
\]
As an aside, recall that the outer expectation is with respect to the marginal (unconditional) distribution of S. In particular, X_i ∼ N(μ, σ² + τ²). But since the posterior variance does not depend on the data (unlike in our next example), the Bayes risk is just the posterior variance τ̃².
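The closed-form answer can be checked numerically. The following sketch (an illustration with assumed constants, not part of the notes) simulates the Bayes risk of the posterior mean under the joint distribution of (θ, S) and compares it to τ̃².

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, sigma, mu, tau = 10, 1.0, 0.0, 2.0             # assumed constants

# Closed-form posterior variance (= Bayes risk) from Example 2.
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)    # tilde tau^2, about 0.0976

# Monte Carlo over the joint distribution: theta ~ Lambda = N(mu, tau^2), then S | theta.
n_sims = 200_000
thetas = rng.normal(mu, tau, n_sims)
data = rng.normal(thetas[:, None], sigma, (n_sims, n))

# Bayes estimator: posterior mean, a convex combination of mu and the sample mean.
w = tau**2 / (tau**2 + sigma**2 / n)
estimates = (1 - w) * mu + w * data.mean(axis=1)

print(np.mean((estimates - thetas) ** 2), post_var)   # both close to 0.0976
\end{verbatim}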
Example 3. Again, let our data be given by S = {X_i}_{i=1}^n. Now suppose X_i | θ ∼ Bernoulli(θ) i.i.d. and θ ∼ Λ = Beta(a, b) with a, b known. We know, via Bernoulli-Beta conjugacy, that the posterior distribution is given by
\[
\theta \mid S \;\sim\; \mathrm{Beta}\big(n\bar X + a,\; b + n - n\bar X\big), \qquad \text{where } \bar X := \frac{1}{n}\sum_{i=1}^n X_i.
\]
Thus, the Bayes estimator can be found by taking the mean of the posterior Beta distribution. Namely,
\[
\hat\theta_\Lambda \;=\; \mathbb{E}[\theta \mid S] \;=\; \frac{n\bar X + a}{n + a + b}.
\]
To find the Bayes risk, we could try to follow the same process as in the example above. However, unlike in that example, our posterior variance depends on our data through nX̄; indeed, the posterior variance is given by
\[
\frac{(n\bar X + a)(b + n - n\bar X)}{(a + b + n)^2 (a + b + n + 1)}.
\]
As such, our outer expectation would be with respect to the marginal distribution of nX̄ (unconditional on θ), which is difficult to calculate. Instead, we can find the Bayes risk by conditioning on θ first, with θ ∼ Beta(a, b) and nX̄ | θ ∼ Binomial(n, θ):
\[
R_\Lambda(\hat\theta_\Lambda)
\;=\; \mathbb{E}_{\theta \sim \mathrm{Beta}(a,b)}\, \mathbb{E}_{n\bar X \mid \theta \,\sim\, \mathrm{Binomial}(n,\theta)}\left[\left(\frac{n\bar X + a}{n + a + b} - \theta\right)^2\right]
\;=\; \mathbb{E}_{\theta \sim \mathrm{Beta}(a,b)}\left[\frac{\theta^2\big((a+b)^2 - n\big) + \theta\big(n - 2a(a+b)\big) + a^2}{(n + a + b)^2}\right],
\]
which can be further reduced using the first and second moments of the Beta distribution.
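To double-check this reduction (an illustrative sketch with assumed values of n, a, and b, not from the notes), the final expectation can be evaluated with the Beta moments E[θ] = a/(a+b) and E[θ²] = a(a+1)/((a+b)(a+b+1)) and compared against a direct simulation.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, a, b = 25, 2.0, 3.0                        # assumed values

# Closed form: E_theta[ (theta^2 ((a+b)^2 - n) + theta (n - 2a(a+b)) + a^2) / (n+a+b)^2 ].
m1 = a / (a + b)                              # E[theta]
m2 = a * (a + 1) / ((a + b) * (a + b + 1))    # E[theta^2]
bayes_risk = (m2 * ((a + b)**2 - n) + m1 * (n - 2 * a * (a + b)) + a**2) / (n + a + b)**2

# Monte Carlo over theta ~ Beta(a, b) and nX_bar | theta ~ Binomial(n, theta).
n_sims = 500_000
thetas = rng.beta(a, b, n_sims)
successes = rng.binomial(n, thetas)           # n * X_bar
estimates = (successes + a) / (n + a + b)     # Bayes estimator
print(np.mean((estimates - thetas) ** 2), bayes_risk)   # both approximately 0.0067
\end{verbatim}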