Distances Between Distributions

Total Variation Distance: The total variation distance between distributions $P$ and $Q$, with density functions $p$ and $q$, is defined to be
$TV(P,Q) = \sup_{A\subset E}\lvert\mathbb{P}_p(A) - \mathbb{P}_q(A)\rvert.$This is equivalent to half the $L^1$ distance between $p$ and $q$:
$TV(P,Q) = \frac{1}{2}\int_E\lvert p(x) - q(x)\rvert\,dx.$ This is a genuine metric.
 Unfortunately, it is hard to estimate.
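For discrete distributions, though, the half-$L^1$ formula is directly computable. A quick sketch (the helper names `tv_discrete` and `poisson_pmf` are mine, not from the lecture):

```python
import math

def tv_discrete(p, q):
    """TV distance between discrete distributions given as dicts of
    outcome -> probability: TV(P, Q) = (1/2) * sum_x |p(x) - q(x)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def poisson_pmf(lam, kmax=50):
    # truncated Poisson pmf; the tail mass beyond kmax is negligible here
    return {k: math.exp(-lam) * lam ** k / math.factorial(k) for k in range(kmax)}

# Bernoulli(0.3) vs Bernoulli(0.5): the sup is attained at A = {1}, giving 0.2
tv_bern = tv_discrete({0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5})

# Poi(1/n) vs a point mass at 0: TV = 1 - exp(-1/n), small for large n
tv_poi = tv_discrete(poisson_pmf(0.01), {0: 1.0})
```

The difficulty mentioned above is about *estimating* TV from samples; computing it from known densities is easy.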

Kullback-Leibler Divergence: The Kullback-Leibler divergence (known as relative entropy in information theory) between $P$ and $Q$ is defined to be
$KL(P\Vert Q) = \int_Ep(x)\log\Big(\frac{p(x)}{q(x)}\Big)dx,$where we assign the value $\infty$ if the support of $p$ is not contained in the support of $q$ (if it is, then anywhere $q=0$, we will also have $p=0$ and thus the points at which the integrand is not defined will all be removable discontinuities).
While nonnegative, KL divergence is not a true metric, since it is not symmetric. It also fails to satisfy a triangle inequality. It is, however, an expectation. Hence, it can be replaced with a sample mean and estimated.
 Professor Rigollet calls the act of replacing an expectation with a sample mean (i.e., the application of LLN) "the statistical hammer."
 The implication here is that it's our simplest (and often only) tool.
Examples

Let $X_n \sim \text{Poi}(1/n)$ and let $\delta_0$ be a point mass centered at 0. Then $TV(X_n,\delta_0) = 1 - e^{-1/n} \to 0$.

Let $P = \text{Bin}(n,p)$, $Q = \text{Bin}(n,q)$, where $p,q\in(0,1)$, and write their densities with one function
$f(p, k) = {n \choose k}p^k(1-p)^{n-k},$and similarly for $f(q, k)$. Then it is actually a pretty straightforward calculation to show that
$KL(P\Vert Q) = np \cdot \log\Big(\frac{p}{q}\Big) + (n-np)\cdot\log\Big(\frac{1-p}{1-q}\Big).$ 
Let $P = N(a,1)$ and let $Q = N(b,1)$. Then (also pretty straightforward to calculate):
$KL(P\Vert Q) = \frac{1}{2}(a-b)^2.$
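To see the statistical hammer in action, here is a sketch that estimates this Gaussian KL divergence by a sample mean and compares it to the closed form $(a-b)^2/2$. The function name `kl_gauss_mc` and the sample size are my choices:

```python
import math
import random

random.seed(0)

def kl_gauss_mc(a, b, n=200_000):
    """Monte Carlo estimate of KL(N(a,1) || N(b,1)): replace the expectation
    over X ~ N(a,1) with a sample mean (the 'statistical hammer')."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(a, 1.0)
        # log p(x) - log q(x) = ((x - b)^2 - (x - a)^2) / 2 for unit-variance Gaussians
        total += 0.5 * ((x - b) ** 2 - (x - a) ** 2)
    return total / n

est = kl_gauss_mc(0.0, 1.0)  # closed form: (a - b)^2 / 2 = 0.5
```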
Maximum Likelihood Estimation
Definitions

Let $X_1,X_2,\ldots,X_n$ be an iid sample from a distribution with density $f(x; \theta)$. The Likelihood of the sample is
$L(X_1,X_2,\ldots,X_n; \theta) = \prod_{i=1}^nf(X_i; \theta).$ 
The loglikelihood function, denoted $\ell(\theta)$ is
$\ell(\theta) = \log(L(X_1, X_2, \ldots, X_n; \theta)).$Note we write $\ell$ as a random function of $\theta$.

The Fisher Information is defined to be
$I(\theta) = E\big[\nabla\ell(\theta)(\nabla\ell(\theta))^T\big] - E\big[\nabla\ell(\theta)\big]E\big[\nabla\ell(\theta)\big]^T = -E\big[\mathbf{H}\ell(\theta)\big],$where in this case the likelihood is of a one-element sample, and the bold H denotes the Hessian operator. In one dimension, this reduces to
$I(\theta) = -E(\ell''(\theta)).$Equivalently, we also have
$I(\theta) = Var(\ell'(\theta)).$This latter definition is usually harder to work with, but has a more direct connection to maximum likelihood estimators.
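As a sanity check on the two one-dimensional forms, here is a sketch computing both for $\text{Ber}(p)$, where the expectations over $X\in\{0,1\}$ can be written out exactly (the helper name is mine):

```python
def fisher_bernoulli(p):
    """Fisher information of Bernoulli(p) computed from both definitions,
    for l(p) = X log p + (1 - X) log(1 - p), averaging over X in {0, 1}."""
    # Var(l'(p)) with l'(p) = X/p - (1 - X)/(1 - p); the score has mean 0 at the true p
    var_score = p * (1 / p) ** 2 + (1 - p) * (-1 / (1 - p)) ** 2
    # -E[l''(p)] with l''(p) = -X/p^2 - (1 - X)/(1 - p)^2
    neg_exp_curv = p / p ** 2 + (1 - p) / (1 - p) ** 2
    return var_score, neg_exp_curv

v, c = fisher_bernoulli(0.3)  # both equal 1/(p(1-p))
```

Both expressions collapse to $1/(p(1-p))$, as the general theory predicts.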
Throughout, we will be discussing ways to estimate the value of a "true" parameter $\theta^\ast$ of a distribution $\mathbb{P}_{\theta^\ast}$, given a model $(E, \{\mathbb{P}_\theta:\theta\in\Theta\})$. A noble goal might be to build an estimator $\widehat{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta^\ast})$ and compute the argmin using this estimator. However, $TV$ distance is hard to estimate in general, so we use $KL$ divergence instead. Since KL divergence is an expectation, it can be replaced by a sample mean (using LLN), and is therefore easy to estimate.
For the rest of this section, suppose we are estimating a distribution $\mathbb{P} = \mathbb{P}_{\theta^\ast}$ with a parametric family of distributions $\{\mathbb{P}_\theta : \theta\in\Theta\}$. We will proceed to do this by estimating the minimizer (argmin) over $\theta$ of $KL(\mathbb{P}\Vert\mathbb{P}_\theta)$, which is $\theta^\ast$: KL divergence is nonnegative and vanishes exactly when its two arguments agree, so identifiability makes $\theta^\ast$ the unique minimizer.
The strategy for doing so will involve first estimating KL divergence and finding the minimizer of that estimator $\widehat{KL}$. That the argmin of $\widehat{KL}$ converges to the argmin of $KL$ follows from "nice analytic properties" of these functions. I'm guessing that $KL$ is at least $C^1$ and the convergence is relatively strong.
Estimating $KL$ Divergence
Recall that $KL(\mathbb{P}\Vert\mathbb{P}_\theta)$ is an expectation: if $f_\theta$ and $f$ are the densities of $\mathbb{P}_\theta$ and $\mathbb{P}$, respectively, then
$KL(\mathbb{P}\Vert\mathbb{P}_\theta) = \mathbb{E}_{X\sim\mathbb{P}}\Big[\log\Big(\frac{f(X)}{f_\theta(X)}\Big)\Big].$As a function $\theta\mapsto KL(\mathbb{P}\Vert\mathbb{P}_\theta)$, this has the form
$KL(\mathbb{P}\Vert\mathbb{P}_\theta) = \mathbb{E}\big[\log f(X)\big] - \mathbb{E}\big[\log f_\theta(X)\big] = \text{const} - \mathbb{E}\big[\log f_\theta(X)\big].$Thus, by LLN, we have
$\widehat{KL}(\theta) = \text{const} - \frac{1}{n}\sum_{i=1}^n\log f_\theta(X_i).$
Finding the Minimum of $\widehat{KL}$
Starting with the above equation, we have
$\underset{\theta}{\operatorname{argmin}}\;\widehat{KL}(\theta) = \underset{\theta}{\operatorname{argmax}}\;\frac{1}{n}\sum_{i=1}^n\log f_\theta(X_i) = \underset{\theta}{\operatorname{argmax}}\;\log\prod_{i=1}^n f_\theta(X_i) = \underset{\theta}{\operatorname{argmax}}\;L(X_1,\ldots,X_n;\theta).$Therefore, the minimizer of $\widehat{KL}$ is the maximum likelihood estimator $\hat{\theta}$ of $\theta^\ast$. Furthermore (avoiding a bunch of details necessary for this implication), we have
$\hat{\theta}\xrightarrow{\mathbb{P}}\theta^\ast.$
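This chain of argmins can be checked numerically. The sketch below (my setup: an $\text{Exp}(\lambda)$ model, with a grid search standing in for the analytic argmax) minimizes $\widehat{KL}$ up to its additive constant and recovers the closed-form MLE $1/\bar{X}_n$:

```python
import math
import random

random.seed(1)
n = 2000
data = [random.expovariate(2.0) for _ in range(n)]  # true rate theta* = 2.0

def kl_hat(lam):
    # KL-hat up to its constant: -(1/n) * sum_i log f_lam(X_i);
    # for Exp(lam), log f_lam(x) = log(lam) - lam * x
    return -sum(math.log(lam) - lam * x for x in data) / n

# minimize KL-hat over a grid; the closed-form MLE is 1 / (sample mean)
grid = [k * 0.01 for k in range(1, 501)]
lam_hat = min(grid, key=kl_hat)
mle = n / sum(data)
```

The grid minimizer agrees with the analytic MLE up to the grid spacing, and both are close to the true rate.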
The Asymptotic Variance of MLE
The MLE is not only consistent, but also satisfies a central limit theorem:
$\sqrt{n}(\hat{\theta}-\theta^\ast)\xrightarrow{(d)} N\big(0, V(\theta^\ast)\big),$where $V(\theta^\ast)$ represents the asymptotic variance of $\hat{\theta}$. But what is this asymptotic variance? It turns out that under some mild conditions, the asymptotic variance of $\hat{\theta}$ is known.
Theorem Assume the following.
 $\theta^\ast$ is identifiable.
 $\theta^\ast$ is an interior point of $\Theta$.
 The Fisher information matrix $I(\theta)$ is invertible in a neighborhood of $\theta^\ast$.
 All the functions involved are "nice".
 The support of $\mathbb{P}_\theta$ does not depend on $\theta$.
Then
$\sqrt{n}(\hat{\theta}-\theta^\ast)\xrightarrow{(d)} N\big(0, I(\theta^\ast)^{-1}\big).$
Proof Write $\ell_i(\theta) = \log f_\theta(X_i)$. We start with a couple of observations:
 Since $\hat{\theta}$ is the unique maximizer of $\log(L_n(X_1,X_2,\ldots,X_n;\theta)) = \sum_{i=1}^n\ell_i(\theta)$, we have
$\sum_{i=1}^n\ell'_i(\hat{\theta}) = 0.$
 Since $\theta^\ast$ is the unique minimizer of $KL(\mathbb{P}\Vert\mathbb{P}_\theta)$, and this differs from $-E(\ell_1(\theta))$ by a constant, we have
$E\big(\ell'_1(\theta^\ast)\big) = 0.$
Now, we start with a Taylor expansion at $\theta^\ast$:
$\sum_{i=1}^n\ell'_i(\hat{\theta}) \approx \sum_{i=1}^n\ell'_i(\theta^\ast) + (\hat{\theta}-\theta^\ast)\sum_{i=1}^n\ell''_i(\theta^\ast).$Therefore, scaling and applying observation 1, we have
$0 \approx \frac{1}{\sqrt{n}}\sum_{i=1}^n\ell'_i(\theta^\ast) + \sqrt{n}(\hat{\theta}-\theta^\ast)\cdot\frac{1}{n}\sum_{i=1}^n\ell''_i(\theta^\ast).$By CLT (and observation 2), the term $\frac{1}{\sqrt{n}}\sum_i\ell'_i(\theta^\ast)$ converges to $N(0, I(\theta^\ast))$, since $Var(\ell'_1(\theta^\ast)) = I(\theta^\ast)$. By LLN, the term $n^{-1}\sum_i\ell''_i(\theta^\ast)$ converges to $E(\ell''_1(\theta^\ast)) = -I(\theta^\ast)$. Therefore, rearranging, we have
$\sqrt{n}(\hat{\theta}-\theta^\ast) \approx \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^n\ell'_i(\theta^\ast)}{-\frac{1}{n}\sum_{i=1}^n\ell''_i(\theta^\ast)} \xrightarrow{(d)} N\Big(0, \frac{I(\theta^\ast)}{I(\theta^\ast)^2}\Big);$therefore,
$\sqrt{n}(\hat{\theta}-\theta^\ast)\xrightarrow{(d)} N\big(0, I(\theta^\ast)^{-1}\big).$
Remark: This proof only works in one dimension. In higher dimensions, there is a lack of commutativity that results in a more complicated expression in the end.
Remark: Recall that the Fisher information is the negative expected Hessian of the log-likelihood. This adds geometric intuition to the result: if the log-likelihood is more tightly curved at $\theta^\ast$, then the MLE will vary less around the maximum, and vice versa. The word "information" is also more than superficial with this in mind; i.e., more "information" means less variance, which translates to tighter confidence intervals around the MLE.
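A small simulation illustrating the theorem (my choice of a Bernoulli model, where $\hat{\theta}$ is the sample mean and $I(p) = 1/(p(1-p))$):

```python
import random
import statistics

random.seed(2)
p, n, reps = 0.3, 400, 2000

zs = []
for _ in range(reps):
    # Bernoulli(p) sample; the MLE of p is the sample mean
    p_hat = sum(random.random() < p for _ in range(n)) / n
    zs.append(n ** 0.5 * (p_hat - p))

# Fisher information of Bernoulli(p) is 1/(p(1-p)), so the variance of
# sqrt(n)(p_hat - p) should be close to 1/I(p) = p(1-p) = 0.21
var_hat = statistics.pvariance(zs)
```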
Method of Moments

Requires the model to be well-specified (unlike MLE, which will always find the distribution $\mathbb{P}_\theta$ nearest to $\mathbb{P}$).

Computationally simpler though.

The idea is we estimate the moments of $\mathbb{P}$ with the empirical moments
$\widehat{m}_k = \frac{1}{n}\sum_{i=1}^nX_i^k$ 
By LLN, these converge to the moments of $\mathbb{P}$ (provided the model is well specified).
Here is how it works. Suppose $\Theta \subset\mathbb{R}^d$ and write
$M(\theta) = \big(m_1(\theta), \ldots, m_d(\theta)\big),\quad m_k(\theta) = \mathbb{E}_\theta\big(X^k\big).$Assume $M$ is one-to-one, so that we can write
$\theta = M^{-1}\big(m_1(\theta), \ldots, m_d(\theta)\big).$Then the moments estimator is
$\widehat{\theta}_n^{MM} = M^{-1}\big(\widehat{m}_1, \ldots, \widehat{m}_d\big)$(provided it exists).
We can generalize this to other functions $g_1(x), \ldots, g_d(x)$ which specify $\theta$, i.e.,
$M(\theta) = \big(m_1(\theta), \ldots, m_d(\theta)\big),$where for each $k$,
$m_k(\theta) = \mathbb{E}_\theta\big(g_k(X)\big).$Then the generalized method of moments estimator is
$\widehat{\theta}_n^{GMM} = M^{-1}\big(\widehat{m}_1, \ldots, \widehat{m}_d\big),$where for each $k$,
$\widehat{m}_k = \frac{1}{n}\sum_{i=1}^n g_k(X_i).$
Example: To see a simple example of why we might want to generalize beyond simply estimating moments directly, consider the normal distribution $N(\mu,\sigma^2)$. The GMM estimator has $g_1(x) = x$ and $g_2(x) = x^2 - x$.
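For the normal model, the pipeline "compute empirical moments, then invert $M$" looks like this. A sketch using the raw moments $g_1(x) = x$, $g_2(x) = x^2$, which is one convenient choice of moment functions:

```python
import random

random.seed(3)
xs = [random.gauss(1.5, 2.0) for _ in range(20000)]  # true mu = 1.5, sigma^2 = 4.0

# empirical moments m-hat_k = (1/n) sum_i X_i^k
m1 = sum(xs) / len(xs)
m2 = sum(x * x for x in xs) / len(xs)

# invert M(mu, sigma^2) = (mu, mu^2 + sigma^2)
mu_hat = m1
sigma2_hat = m2 - m1 ** 2
```

Here inverting $M$ is a small algebra problem; in general it requires solving a system of (often polynomial) equations.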
Asymptotic Normality of GMM estimators
Theorem:
 Assume $M$ is one-to-one and $M^{-1}$ is continuously differentiable in a neighborhood of $\theta^\ast$.
 Let $\Sigma(\theta)$ be the covariance matrix of the vector $\big(g_1(X_1), \ldots, g_d(X_1)\big)$ (assume this exists).
Then
$\sqrt{n}\big(\widehat{\theta}_n^{GMM}-\theta^\ast\big)\xrightarrow{(d)} N\big(0, \Gamma(\theta^\ast)\big),$where
$\Gamma(\theta) = \big[\nabla M^{-1}\big(M(\theta)\big)\big]^T\,\Sigma(\theta)\,\big[\nabla M^{-1}\big(M(\theta)\big)\big]$(by the delta method).
MLE versus GMM
 In general, the MLE is more accurate; it also still gives reasonable results if the model is misspecified.
 Computational issues: sometimes the MLE is intractable, while MM is easier (it reduces to solving polynomial equations).
M-Estimation
Suppose we are agnostic about any statistical model, and/or the quantity we are most interested in estimating is not simply the parameter of a distribution. In this case, we can still estimate the quantity by optimizing a suitable objective (e.g., minimizing a cost function). This is called M-estimation (the M stands for maximum or minimum), and it is the framework for "traditional" (not statistically motivated) machine learning. The framework is as follows.

Let $X_1, X_2, \ldots, X_n$ be an iid sample from an unspecified probability distribution $\mathbb{P}$.

Let $\mu^\ast$ be some parameter associated with the distribution $\mathbb{P}$, e.g., some summary statistic such as its mean or median.

Find a function $\rho:E\times \mathcal{M} \to \mathbb{R}$, where $\mathcal{M}$ is the set of all possible values for $\mu$, such that the function
$Q(\mu) = \mathbb{E}(\rho(X_1, \mu))$achieves its minimum (or maximum) at $\mu^\ast$.

Replace the expectation with a sample average and proceed as with MLE.
Examples:

Let $E = \mathcal{M} = \mathbb{R}^d$ and let $\mu^\ast = \mathbb{E}(X)$. An M-estimator is given by $\rho(x,\mu) = \lVert x - \mu \rVert_2^2$.

Let $E = \mathcal{M} = \mathbb{R}^d$ and let $\mu^\ast$ be a median of $\mathbb{P}$. An M-estimator is given by $\rho(x,\mu) = \lVert x - \mu \rVert_1$.

Let $E = \mathcal{M} = \mathbb{R}$ and let $\mu^\ast$ be the $\alpha$-quantile of $\mathbb{P}$. Then an M-estimator is given by $\rho(x, \mu) = C_\alpha(x-\mu)$, where
$C_\alpha(x) = \left\{\begin{matrix} -(1-\alpha)x & : & x < 0\\ \alpha x & : & x \geq 0 \end{matrix}\right.$This function is called a check function.
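A sketch of quantile estimation by minimizing the empirical check loss. Since the objective is piecewise linear and convex, it suffices to minimize over the sample points themselves (the names `check_loss` and the sample size are my choices):

```python
import random

random.seed(4)
alpha = 0.75
xs = sorted(random.expovariate(1.0) for _ in range(1001))

def check_loss(mu):
    # empirical average of the check function C_alpha applied to x - mu
    total = 0.0
    for x in xs:
        d = x - mu
        total += alpha * d if d >= 0 else -(1 - alpha) * d
    return total / len(xs)

# minimize over the data points; the minimizer of a piecewise linear
# convex function is attained at one of them
mu_hat = min(xs, key=check_loss)
empirical_q = xs[int(alpha * (len(xs) - 1))]  # order-statistic quantile
```

The minimizer coincides with the usual empirical $\alpha$-quantile, as the subgradient condition $F_n(\mu) = \alpha$ predicts.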
Asymptotic Normality of M-estimators
In the case of MLE, we have asymptotic normality and a known asymptotic variance (inverse Fisher information). To what extent do these properties generalize to M-estimators? It turns out they generalize quite well. We will have asymptotic normality for M-estimators, and the asymptotic variance will have an expression only marginally less concise than that of the MLE (this is probably subject to some smoothness conditions on $\rho$). First, we make the following definitions. In one dimension, let
$J(\mu) = \frac{\partial^2 Q}{\partial\mu^2}(\mu) = \mathbb{E}\Big(\frac{\partial^2\rho}{\partial\mu^2}(X_1,\mu)\Big),$and let
$K(\mu) = Var\Big(\frac{\partial\rho}{\partial\mu}(X_1,\mu)\Big).$In higher dimensions,
$J(\mu) = \mathbb{E}\big(\mathbf{H}_\mu\rho(X_1,\mu)\big)$is the expected curvature of the loss and
$K(\mu) = Cov\big(\nabla_\mu\rho(X_1,\mu)\big)$is the covariance matrix of the loss gradient (as a function of $\mu$ only).
Remark: In the case of MLE, $J(\theta) = K(\theta) = I(\theta)$.
Theorem: With notation as above, assume the following.
 $\mu^\ast$ is the unique minimizer of $Q$;
 $J(\mu)$ is invertible in a neighborhood of $\mu^\ast$;
 A "few more technical conditions." (e.g., twicedifferentiability of $\rho$, inverse of $J$ is continuous, etc.).
Then $\widehat{\mu}_n$ satisfies
$\widehat{\mu}_n\xrightarrow{\mathbb{P}}\mu^\ast$and
$\sqrt{n}(\widehat{\mu}_n-\mu^\ast)\xrightarrow{(d)} N\big(0,\; J(\mu^\ast)^{-1}K(\mu^\ast)J(\mu^\ast)^{-1}\big).$
The proof of this theorem is very similar to the MLE case in one dimension.
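The sandwich expression can be sanity-checked in the simplest case: for $\rho(x,\mu) = (x-\mu)^2$ we get $J = 2$ and $K = 4\,Var(X)$, so $J^{-1}KJ^{-1} = Var(X)$. A small simulation of my own devising, with $\mathbb{P} = \text{Unif}(-1,1)$:

```python
import random
import statistics

random.seed(5)
reps, n = 2000, 500
mu_star = 0.0  # mean of Uniform(-1, 1)

zs = []
for _ in range(reps):
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    mu_hat = sum(xs) / n  # minimizer of sum_i (x_i - mu)^2
    zs.append(n ** 0.5 * (mu_hat - mu_star))

# sandwich: J = 2, K = Var(-2(X - mu*)) = 4 Var(X),
# so J^-1 K J^-1 = Var(X) = 1/3 for Uniform(-1, 1)
var_hat = statistics.pvariance(zs)
```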