Jon's Dev Blog

Stats 2 - Parameter Estimation

February 14, 2021

Distances Between Distributions

  1. Total Variation Distance: The total variation distance between distributions PP and QQ, with density functions pp and qq, is defined to be

    TV(P,Q)=supAEPp(A)Pq(A).TV(P,Q) = \sup_{A\subset E}\lvert\mathbb{P}_p(A) - \mathbb{P}_q(A)\vert.

    This is equivalent to half the L1L^1 distance between pp and qq:

    TV(P,Q)=12Ep(x)q(x)dx.TV(P,Q) = \frac{1}{2}\int_E\lvert p(x) - q(x)\rvert dx.
    1. This is a genuine metric.
    2. Unfortunately, it is hard to estimate.
  2. Kullback-Leibler Divergence: The Kullback-Leibler (known as relative entropy in Information Theory) divergence between PP and QQ is defined to be

    KL(PQ)=Ep(x)log(p(x)q(x))dx,KL(P\Vert Q) = \int_Ep(x)\log\Big(\frac{p(x)}{q(x)}\Big)dx,

    where we assign the value \infty if the support of pp is not contained in the support of qq (if it is, then anywhere q=0q=0, we will also have p=0p=0 and thus the points at which the integrand is not defined will all be removable discontinuities).

    While positive semi-definite, KL-divergence is not a true metric, since it is anti-symmetric. It also fails to satify a trangle inequality. It is, however, an expectation. Hence, it can be replaced with a sample mean and estimated.

    1. Professor Rigollet calls the act of replacing an expectation with a sample mean (i.e., the application of LLN) "the statistical hammer."
    2. The implication here is that it's our simplest (and often only) tool.


  1. Let Xn=Poi(1/n)X_n = \text{Poi}(1/n) and let δ0\delta_0 be a point mass centered at 0. Then TV(Xn,δ0)0TV(X_n,\delta_0) \to 0.

  2. Let P=Bin(n,p)P = \text{Bin}(n,p), Q=Bin(n,q)Q = \text{Bin}(n,q), where p,q(0,1)p,q\in(0,1), and write their densities with one function

    f(p,k)=(nk)pk(1p)nk,f(p, k) = {n \choose k}p^k(1-p)^{n-k},

    and similarly for f(n,q)f(n,q). Then it is actually a pretty straightforward calculation to show that

    KL(PQ)=nplog(pq)+(nnp)log(1p1q).KL(P\Vert Q) = np \cdot \log\Big(\frac{p}{q}\Big) + (n-np)\cdot\log\Big(\frac{1-p}{1-q}\Big).
  3. Let P=N(a,1)P = N(a,1) and let Q=N(b,1)Q = N(b,1). Then (also pretty straightforward to calculate):

    KL(PQ)=12(ab)2.KL(P\Vert Q) = \frac{1}{2}(a-b)^2.

Maximum Likelihood Estimation


  1. Let X1,X2,,XnX_1,X_2,\ldots,X_n be an iid sample from a distribution with density f(x;θ)f(x; \theta). The Likelihood of the sample is

    L(X1,X2,,Xn;θ)=i=1nf(Xi;θ).L(X_1,X_2,\ldots,X_n; \theta) = \prod_{i=1}^nf(X_i; \theta).
  2. The log-likelihood function, denoted (θ)\ell(\theta) is

    (θ)=log(L(x1,x2,,xn;θ)).\ell(\theta) = \log(L(x_1, x_2, \ldots, x_n; \theta)).

    Note we write \ell as a random function of θ\theta.

  3. The Fisher Information is defined to be

    I(θ)=E[(θ)((θ))T]E[(θ)]E[(θ)]T=E[H(θ)],I(\theta) = E\big[\nabla\ell(\theta)(\nabla\ell(\theta))^T\big] - E\big[\nabla\ell(\theta)\big]E\big[\nabla\ell(\theta)\big]^T = -E\big[\mathbf{H}\ell(\theta)\big],

    where in this case the likelihood is of a one-element sample, and the bold H denotes the Hessian operator. In one dimension, this reduces to

    I(θ)=E((θ)).I(\theta) = -E(\ell''(\theta)).

    Equivalently, we also have

    I(θ)=Var((θ)).I(\theta) = Var(\ell'(\theta)).

    This latter definition is usually harder to work with, but has a more direct connection to maximum likelihood estimators.

Throughout, we will be discussing ways to estimate the value of a "true" parameter θ\theta^\ast of a distribution Pθ\mathbb{P}_{\theta^\ast}, given a model (E,{Pθ:θΘ})(E, \{\mathbb{P}_\theta:\theta\in\Theta\}). A noble goal might be to build an estimator TV^(Pθ,Pθ)\widehat{TV}(P_\theta, P_{\theta^\ast}) and compute the argmin using this estimator. However, TVTV distance is hard to estimate in general, so we use KLKL-divergence instead. Since this function is an expectation, it can be replaced by a sample mean (using LLN), and is therefore easy to estimate.

For the rest of this section, suppose we are estimating a distribution P=Pθ\mathbb{P} = \mathbb{P}_{\theta^\ast} with a parametric family of distributions {Pθ:θΘ}\{\mathbb{P}_\theta : \theta\in\Theta\}. We will proceed to do this by estimating the minimizer (argmin) of KL(Pθ,P)KL(\mathbb{P}_\theta, \mathbb{P}), which is θ\theta^\ast, by the positive semidefiniteness (or nonnegative definiteness?) of KLKL.

The strategy for doing so will involve first estimating KL divergence and finding the minimizer of that estimator KL^\widehat{KL}. That the argmin of KL^\widehat{KL} converges to the argmin of KLKL follows from "nice analytic properties" of these functions. I'm guessing that KLKL is at least C1C^1 and the convergence is relatively strong.

Estimating KLKL Divergence

Recall that KL(Pθ,P)KL(\mathbb{P}_\theta, \mathbb{P}) is an expectation: if fθf_\theta and ff are the densities of Pθ\mathbb{P}_\theta and P\mathbb{P}, respectively, then

KL(P,Pθ)=EP(log(f(x)fθ(x)))=EP(log(f(x)))EP(log(fθ(x))).KL(\mathbb{P}, \mathbb{P}_\theta) = E_{\mathbb{P}}\bigg(\log\bigg(\frac{f(x)}{f_\theta(x)}\bigg)\bigg) = E_{\mathbb{P}}(\log(f(x))) - E_{\mathbb{P}}(\log(f_\theta(x))).

As a function θKL(fθ,f)\theta\mapsto KL(f_\theta, f), this has the form

KL(P,Pθ)="constant"EP(log(fθ(x))).KL(\mathbb{P}, \mathbb{P}_\theta) = \text{"constant"} - E_{\mathbb{P}}(\log(f_\theta(x))).

Thus, by LLN, we have

KL^(P,Pθ)="constant"1ni=1nlog(fθ(Xi)).\widehat{KL}(\mathbb{P}, \mathbb{P}_\theta) = \text{"constant"} - \frac{1}{n}\sum_{i=1}^n\log(f_\theta(X_i)).

Finding the Minimum of KL^\widehat{KL}

Starting with the above equation, we have

minθΘKL^(P,Pθ)minθΘ1ni=1nlog(fθ(Xi))maxθΘ1ni=1nlog(fθ(Xi))maxθΘlog(i=1npθ(Xi))maxθΘi=1npθ(Xi).\begin{aligned} \min_{\theta\in\Theta}\widehat{KL}(\mathbb{P}, \mathbb{P}_\theta) &\Leftrightarrow \min_{\theta\in\Theta}- \frac{1}{n}\sum_{i=1}^n\log(f_\theta(X_i)) \\ &\Leftrightarrow \max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n\log(f_\theta(X_i)) \\ &\Leftrightarrow \max_{\theta\in\Theta}\log\bigg(\prod_{i=1}^n p_\theta(X_i)\bigg) \\ &\Leftrightarrow \max_{\theta\in\Theta}\prod_{i=1}^n p_\theta(X_i). \end{aligned}

Therefore, the minimizer of KL^\widehat{KL} is the maximum likelihood estimator θ^\hat{\theta} of θ\theta^\ast. Furthermore (avoiding a bunch of details necessary for this implication), we have

θ^(p)θ.\hat{\theta} \overset{(p)}{\to} \theta^\ast.

The Asymptotic Variance of MLE

The MLE is not only consistent, but also satisfies a central limit theorem:

n(θ^θ)(d)N(0,V(θ)),\sqrt{n}(\hat{\theta} - \theta^\ast) \overset{(d)}{\to} N\big(0, V(\theta^\ast)\big),

where V(θ)V(\theta^\ast) represents the asymptotic variance of θ^\hat{\theta}. But what is this asymptotic variance?!? Turns out that under some mild conditions, the asymptotic variance of θ^\hat{\theta} is known.

Theorem Assume the following.

  1. θ\theta^\ast is identifiable.
  2. θ\theta^\ast is an interior point of Θ\Theta.
  3. The Fisher information matrix I(θ)I(\theta) is invertible in a neighborhood of θ\theta^\ast.
  4. All the functions involved are "nice".
  5. The support of Pθ\mathbb{P}_\theta does not depend on θ\theta.


V(θ^)=I(θ)1.V(\hat{\theta}) = I(\theta^\ast)^{-1}.

Proof Write i(θ)=logfθ(Xi)\ell_i(\theta) = \log f_\theta(X_i). We start with a couple of observations:

  1. Since θ^\hat{\theta} is the unique maximizer of log(Ln(X1,X2,,Xn;θ))\log(L_n(X_1,X_2, \ldots,X_n; \theta)),
ddθθ=θ^i=1ni=i=1ni(θ^)=0.\frac{d}{d\theta}\bigg\lvert_{\theta=\hat{\theta}}\sum_{i=1}^n\ell_i = \sum_{i=1}^n\ell_i(\hat{\theta}) = 0.
  1. Since θ\theta^\ast is the unique minimizer of KL(Pθ,P)KL(\mathbb{P}_\theta, \mathbb{P})

and this differs from E((θ))E(\ell(\theta)) by a constant, we have

E((θ))=ddθθ=θE((θ))=0.E(\ell'(\theta^\ast)) = \frac{d}{d\theta}\bigg\lvert_{\theta=\theta^\ast}E(\ell(\theta)) = 0.

Now, we start with a Taylor expansion at θ\theta^\ast:

0=i=1ni(θ^)=i=1n[i(θ)+(θ^θ)i(θ)+].0 = \sum_{i=1}^n\ell_i'(\hat{\theta}) = \sum_{i=1}^n\big[ \ell_i'(\theta^\ast) + (\hat{\theta} - \theta^\ast)\ell_i''(\theta^\ast) + \cdots \big].

Therefore scaling and applying observation 1, we have

0=1ni=1n[(i(θ)E[i(θ)])+(θ^θ)i(θ)]1ni=1n(i(θ)E[i(θ)])+1ni=1n(θ^θ)i(θ)=1ni=1n(i(θ)E[i(θ)])+n(θ^θ)1ni=1ni(θ).\begin{aligned} 0 &= \frac{1}{\sqrt{n}}\sum_{i=1}^n\big[ \big(\ell_i'(\theta^\ast) - E[\ell_i'(\theta^\ast)]\big) + (\hat{\theta} - \theta^\ast)\ell_i''(\theta^\ast) \cdots \big] \\ &\approx \frac{1}{\sqrt{n}}\sum_{i=1}^n\big(\ell_i'(\theta^\ast) - E[\ell_i'(\theta^\ast)]\big) + \frac{1}{\sqrt{n}}\sum_{i=1}^n(\hat{\theta} - \theta^\ast)\ell_i''(\theta^\ast) \\ &= \frac{1}{\sqrt{n}}\sum_{i=1}^n\big(\ell_i'(\theta^\ast) - E[\ell_i'(\theta^\ast)]\big) + \sqrt{n}(\hat{\theta} - \theta^\ast)\cdot\frac{1}{n}\sum_{i=1}^n\ell_i''(\theta^\ast). \end{aligned}

By CLT, the term on the left converges to N(0,I(θ))N(0, I(\theta^\ast)). By LLN, the term n1ii(θ)n^{-1}\sum_i\ell''_i(\theta^\ast) converges to E((θ))=I(θ)E(\ell''(\theta^\ast)) = -I(\theta^\ast). Therefore, rearranging, we have

I(θ)n(θ^θ)(d)N(0,I(θ)),I(\theta^\ast)\cdot\sqrt{n}(\hat{\theta} - \theta^\ast) \overset{(d)}{\to} N(0,I(\theta^\ast)),


n(θ^θ)(d)N(0,I(θ)1).\sqrt{n}(\hat{\theta} - \theta^\ast) \overset{(d)}{\to} N(0,I(\theta^\ast)^{-1}).

Remark: This proof only works in one dimension. In higher dimensions, there is a lack of commutativity that results in a more complicated expression in the end.

Remark: Recall the original definition of Fisher information as the Hessian of log-likelihood. This adds geometric intuition to the result: If the log-likelihood is more tightly curved at θ\theta^\ast, then MLE will vary less around the maximum and vice versa. The word "information" is also more than superficial with this in mind; i.e., more "information" iff less variance, which translates to tighter confidence intervals around MLE.

Method of Moments

  • Requires model to be well-specified (unlike MLE, which will always find the nearest distribution Pθ\mathbb{P}_\theta to P\mathbb{P}).

  • Computationally simpler though.

  • The idea is we estimate the moments of P\mathbb{P} with the empirical moments

    m^k=1ni=1nXik\widehat{m}_k = \frac{1}{n}\sum_{i=1}^nX_i^k
  • By LLN, these converge to the moments of P\mathbb{P} (provided the model is well specified).

Here is how it works. Suppose ΘRd\Theta \subset\mathbb{R}^d and write

M(θ)=(m1(θ),,md(θ)).M(\theta) = (m_1(\theta),\ldots,m_d(\theta)).

Assume MM is one-to-one, so that we can write

θ=M1(m1(θ),,md(θ)).\theta = M^{-1}(m_1(\theta), \ldots, m_d(\theta)).

Then the moments estimator is

θ^nMM=M1(m^1,,m^d).\widehat{\theta}_n^{MM} = M^{-1}\big(\widehat{m}_1, \ldots, \widehat{m}_d\big).

We can generalize this to other functions g1(x),,gd(x)g_1(x), \ldots, g_d(x) which specify θ\theta, i.e.,

θ=M1(m1(θ),,md(θ)),\theta = M^{-1}(m_1(\theta), \ldots, m_d(\theta)),

where for each kk,

mk(θ)=Eθ[gk(X)].m_k(\theta) = \mathbb{E}_\theta[g_k(X)].

Then the generalized method of moments estimator is

θ^nGMM=M1(m^1,,m^d),\widehat{\theta}_n^{GMM} = M^{-1}\big(\widehat{m}_1, \ldots, \widehat{m}_d\big),

where for each kk,

m^k=1ni=1ngk(Xi).\widehat{m}_k = \frac{1}{n}\sum_{i=1}^ng_k(X_i).

Example: To see a simple example of why we might want to generalize beyond simply estimating moments directly, consider the normal distribution N(μ,σ2)N(\mu,\sigma^2). The GMM estimator has g1(x)=xg_1(x) = x and g2(x)=x2xg_2(x) = x^2 - x.

Asymptotic Normality of GMM estimators


  • Assume MM is one-to-one and M1M^{-1} is continuously differentiable in a neighborhood of θ\theta^\ast.
  • Let Σ(θ)\Sigma(\theta) be the covariance matrix of the vector (g1(X1),,gd(X2)(g_1(X_1), \ldots, g_d(X_2) (assume this exists).


n(θ^nGMMθ)(d)N(0,Γ(θ))  w.r.t. Pθ,\sqrt{n}\big(\widehat{\theta}_n^{GMM} - \theta^\ast\big) \overset{(d)}{\to} N\big(0, \Gamma(\theta^\ast)\big) \; \text{w.r.t. }\mathbb{P_{\theta^\ast}},


Γ(θ)=[M1θ(M(θ))]TΣ(θ)[M1θ(M(θ))].\Gamma(\theta) = \bigg[\frac{\partial M^{-1}}{\partial \theta} \big(M(\theta)\big)\bigg]^T\Sigma(\theta) \bigg[\frac{\partial M^{-1}}{\partial \theta}\big(M(\theta)\big)\bigg].

MLE versus GMM

  • In general, the MLE is more accurate. MLE still gives good results if model is misspecified
  • Computational issues: Sometimes, the MLE is intractable but MM is easier (polynomial equations)


Suppose we are agnostic of any statistical model, and/or the quantity we are most interested in estimating is not simply the parameter of a distribution. In this case, we can still estimate the quantity by optimizing a suitable objective (e.g., minimizing a cost function). This is called M-Estimation (the M stands for maximum or minimum), and is the framework for "traditional" (not statistically motivated) machine learning. The framework is as follows.

  1. Let X1,X2,,XnX_1, X_2, \ldots, X_n be an iid sample from an unspecified probability distribution P\mathbb{P}.

  2. Let μ\mu^\ast be some parameter associated to the sample, e.g., some summary statistic.

  3. Find a function ρ:E×MR\rho:E\times \mathcal{M} \to \mathbb{R}, where M\mathcal{M} is the set of all possible values for μ\mu, such that the function

    Q(μ)=E(ρ(X1,μ))Q(\mu) = \mathbb{E}(\rho(X_1, \mu))

    achieves its minimum (or maximum) at μ\mu^\ast.

  4. Replace the estimation with an average and proceed as with MLE.


  1. Let E=M=RdE = \mathcal{M} = \mathbb{R}^d and let μ=E(X)\mu^\ast = \mathbb{E}(X). An M-estimator is ρ(x,μ)=xμ22\rho(x,\mu) = \lVert x - \mu \rVert_2^2.

  2. Let E=M=RdE = \mathcal{M} = \mathbb{R}^d and let μ\mu^\ast be a median of P\mathbb{P}. An M-estimator is ρ(x,μ)=xμ11\rho(x,\mu) = \lVert x - \mu \rVert_1^1.

  3. Let E=M=RE = \mathcal{M} = \mathbb{R} and let μ\mu^\ast be the α\alpha-quantile of P\mathbb{P}. Then an M-estimator is ρ(x,μ)=Cα(xμ)\rho(x, \mu) = C_\alpha(x-\mu), where

    Cα(x)={(1α)x:x<0αx:x0,C_\alpha(x) = \left\{\begin{matrix} -(1-\alpha)x & : & x < 0\\ \alpha x & : & x \geq 0 \end{matrix}\right.,

    This function is called a check function. >

Asymptotic Normality of M-estimators

In the case of MLE, we have asymptotic normality and known asymptotic variance (inverse fisher information). To what extent do these properties generalize to M estimators? It turns out they generalize quite well. We will have asymptotic normality for M-estimators, and the asymptotic variance will have an expression only marginally less concise than that of the MLE (this is probably subject to some smoothness conditions on ρ\rho). First, we make the following definitions. In one dimension, let

J(μ)=2μμTQ(μ)=E[2μμTρ(X1,μ)]J(\mu) = \frac{\partial^2}{\partial\mu\partial\mu^T}Q(\mu) = \mathbb{E}\bigg[\frac{\partial^2}{\partial\mu\partial\mu^T}\rho(X_1,\mu)\bigg]

and let

K(μ)=Cov[μρ(X1,μ)].K(\mu) = \text{Cov}\bigg[\frac{\partial}{\partial\mu}\rho(X_1,\mu)\bigg].

In higher dimensions,

J(μ)=E[Hρ]J(\mu) = \mathbb{E}[\mathbf{H}\rho]

is the expected curvature of loss and

K(μ)=Cov[μρ(X1,μ)]K(\mu) = \text{Cov}[\nabla_\mu\rho(X_1,\mu)]

is the covariance matrix of the loss gradient (as a function of μ\mu only).

Remark: In the case of MLE, J(θ)=K(θ)=I(θ)J(\theta) = K(\theta) = I(\theta).

Theorem: With notation as above, assume the following.

  1. μ\mu^\ast is the unique minimizer of QQ;
  2. J(μ)J(\mu) is invertible in a neighborhood of μ\mu^\ast;
  3. A "few more technical conditions." (e.g., twice-differentiability of ρ\rho, inverse of JJ is continuous, etc.).

Then μ^n\widehat{\mu}_n satisfies

μ^n(p)μ\widehat{\mu}_n \overset{(p)}{\to} \mu^\ast


n(μ^nμ)(d)N(0,J(μ)1K(μ)J(μ)1).\sqrt{n}(\widehat{\mu}_n - \mu^\ast) \overset{(d)}{\to} N\big(0, J(\mu^\ast)^{-1} K(\mu^\ast) J(\mu^\ast)^{-1}\big).

The proof of this theorem is very similar to the MLE case in one dimension.

Profile picture

Written by Jon Lamar: Machine learning engineer, former aspiring mathematician, cyclist, family person.