Likelihood, and Maximum Likelihood, in Statistics

Last update: 01 Jul 2026 23:12

First version: 24 June 2026

We observe a random variable \( X \). (It may be a big, hairy, high-dimensional beast, with lots of components, but we'll treat it as one object for now.) We also have a probability model with an adjustable parameter \( \theta \). (This may also be an enormous infinite-dimensional object.) For each \( \theta \) we get a distribution for \( X \), say \( p(x;\theta) \equiv \mathrm{Prob}_{\theta}(X=x) \). That is, the probability model tells us, for each parameter value, the probability of any particular outcome. Ordinarily, we tend to look at how \( p(x;\theta) \) changes with \( x \), for some fixed \( \theta \).

What statisticians have come to call the likelihood function is

\[
L(\theta) \equiv p(X; \theta)
\]

This is the probability of the data as a function of the parameter. That is, it tells us the probability of observing what we did observe, as we consider varying the parameter.
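To make the change of viewpoint concrete, here is a minimal sketch, assuming a toy model that is not in the text above: ten i.i.d. Bernoulli trials with success probability \( \theta \). The data list and the `likelihood` helper are made up for illustration. The data stay fixed; only \( \theta \) moves.

```python
def likelihood(theta, data):
    """Probability of the observed 0/1 sequence under an i.i.d. Bernoulli(theta) model."""
    successes = sum(data)
    failures = len(data) - successes
    return theta ** successes * (1.0 - theta) ** failures

# Hypothetical observed data: 7 successes in 10 trials.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Same data, different parameter values: this traces out L(theta).
for theta in (0.3, 0.5, 0.7, 0.9):
    print(f"theta = {theta:.1f}   L(theta) = {likelihood(theta, data):.6f}")
```

Of the four values tried, \( \theta = 0.7 \) gives the observed sequence the highest probability.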

A natural and compelling approach to parameter estimation is then the method of maximum likelihood: guess that the true parameter value is the one which makes the observed data as probable as possible. This is, as I said, natural and compelling, and it works (is consistent/probably-approximately-correct) under a broad range of circumstances, but unfortunately it doesn't always work.
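A minimal sketch of the method, assuming (purely for illustration) i.i.d. Poisson counts with unknown rate, and a brute-force grid search standing in for a real optimizer. The counts are invented. For this particular model the maximizer is known analytically to be the sample mean, which gives something to check the search against.

```python
import math

def log_likelihood(rate, counts):
    """Log-probability of i.i.d. Poisson(rate) counts; the log is used for numerical stability."""
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)

# Hypothetical observed counts.
counts = [2, 4, 3, 0, 5, 3, 1, 6]

# Crude maximization over a grid of candidate rates 0.001, 0.002, ..., 10.000.
grid = [i / 1000 for i in range(1, 10001)]
mle = max(grid, key=lambda rate: log_likelihood(rate, counts))

sample_mean = sum(counts) / len(counts)
print("grid-search MLE:", mle, "  sample mean:", sample_mean)
```

Maximizing the log-likelihood rather than the likelihood changes nothing about where the maximum sits, since the logarithm is increasing.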

To see a little bit about why it typically works, but doesn't always, notice that \( L(\theta) \) is a random function, i.e., a stochastic process. (It is a process "indexed", as we say in the trade, by the parameter space, which may be weird, but still a process.) The method of maximum likelihood looks for the maximum of this random function, and hopes that it converges on the true parameter value. But convergence of stochastic processes is a somewhat delicate business. In many situations, the (suitably normalized log-)likelihood function does converge to a sensible, deterministic limiting function which is uniquely maximized at the true parameter value. (When this happy state of affairs applies, the limiting function has nice information-theoretic interpretations.) But there are, alas, times when the convergence just does not work.
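Here is a small simulation of the happy case, as a sketch under stated assumptions: i.i.d. Bernoulli data, a made-up true parameter of 0.3, and "convergence of the likelihood" read as convergence of the log-likelihood divided by the sample size \( n \). For that setup the deterministic limit is the expected log-probability of a single draw, \( \theta_0 \log\theta + (1-\theta_0)\log(1-\theta) \). All names below are hypothetical.

```python
import math
import random

TRUE_THETA = 0.3                          # assumed "true" parameter for the simulation
GRID = [i / 100 for i in range(5, 96)]    # candidate thetas 0.05, 0.06, ..., 0.95

def normalized_log_lik(theta, successes, n):
    """(1/n) log L_n(theta) for n Bernoulli trials with the given number of successes."""
    frac = successes / n
    return frac * math.log(theta) + (1 - frac) * math.log(1 - theta)

def limit(theta):
    """Deterministic limit: expected log-probability of one draw under TRUE_THETA."""
    return TRUE_THETA * math.log(theta) + (1 - TRUE_THETA) * math.log(1 - theta)

def simulate(n, rng):
    """Return (largest gap to the limit over GRID, maximizer of the random function on GRID)."""
    successes = sum(rng.random() < TRUE_THETA for _ in range(n))
    gap = max(abs(normalized_log_lik(t, successes, n) - limit(t)) for t in GRID)
    argmax = max(GRID, key=lambda t: normalized_log_lik(t, successes, n))
    return gap, argmax

rng = random.Random(0)
for n in (10, 1_000, 100_000):
    gap, argmax = simulate(n, rng)
    print(f"n = {n:>6}   sup-gap = {gap:.4f}   argmax = {argmax:.2f}")
```

With large \( n \) the random function should sit close to the limit uniformly over the grid, and its maximizer should land near 0.3. Restricting the grid to \( [0.05, 0.95] \) keeps the logarithms bounded; it is a convenience of the sketch, not part of the theory.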

[TODO: likelihood ratio tests]

Now, I should at this point admit that the way I've defined likelihood above only works when \( X \) is discrete. If \( X \) is continuous, then one needs to work with probability densities rather than mass functions, which I think makes the rhetoric a bit less persuasive. It also opens the way, to those who've learned measure-theoretic probability, to a more general definition.

(For each \( \theta \), say \( P_{\theta} \) is a probability measure on the sample space \( \mathcal{X} \), and these are all absolutely continuous with respect to some reference measure \( M \) (not necessarily a probability measure). Then we define \( L(\theta) = \frac{dP_{\theta}}{d M}(X) \), using the Radon-Nikodym derivative. This makes the exact value of the likelihood depend on the choice of reference measure \( M \), but notice that for any other reference measure \( N \) which dominates \( M \), we'd have \( \frac{dP_{\theta}}{d N}(X) = \frac{d P_{\theta}}{dM}(X) \frac{dM}{dN}(X) \), and the last factor does not involve \( \theta \). So changing the reference measure doesn't change relative likelihoods, the location of the maximum likelihood estimate, etc.)
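A quick numerical check of that invariance, as a sketch with invented ingredients: an exponential model with rate \( \theta \), Lebesgue measure playing the role of \( M \), and a second measure \( N \) chosen so that \( \frac{dM}{dN}(x) = 1 + x^2 \). Any strictly positive function of \( x \) alone would serve equally well; this one is arbitrary.

```python
import math

def density_wrt_M(x, theta):
    """dP_theta/dM at x: exponential density with rate theta, M = Lebesgue measure."""
    return theta * math.exp(-theta * x)

def dM_dN(x):
    """Hypothetical change-of-measure factor; it depends on x only, never on theta."""
    return 1.0 + x ** 2

def density_wrt_N(x, theta):
    """dP_theta/dN at x, via the chain rule for Radon-Nikodym derivatives."""
    return density_wrt_M(x, theta) * dM_dN(x)

x_obs = 2.5                                  # hypothetical single observation
grid = [i / 100 for i in range(1, 501)]      # candidate rates 0.01, ..., 5.00

# Location of the maximum under each reference measure.
mle_M = max(grid, key=lambda t: density_wrt_M(x_obs, t))
mle_N = max(grid, key=lambda t: density_wrt_N(x_obs, t))

# Relative likelihood of two parameter values under each reference measure.
ratio_M = density_wrt_M(x_obs, 0.4) / density_wrt_M(x_obs, 1.0)
ratio_N = density_wrt_N(x_obs, 0.4) / density_wrt_N(x_obs, 1.0)

print("MLE under M:", mle_M, "  MLE under N:", mle_N)
print("likelihood ratio under M:", ratio_M, "  under N:", ratio_N)
```

The two likelihood functions differ by the constant factor \( 1 + x_{\mathrm{obs}}^2 \), so their maximizers and their ratios agree, up to floating-point rounding in the ratios.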

I should also admit that the idea that one can simply calculate the probability of a given outcome from a probability model is often rather optimistic. This has opened up a range of pseudo-, quasi-, synthetic, and other likelihoods, which try to retain some of the formal structure, while ditching the full probability calculations. One of my reasons for breaking out this notebook is the hope that it will encourage me to wrap my head around these not-quite-likelihoods. (I think I could define the difference between a pseudo- and a quasi-likelihood if I had to, but it's embarrassing for someone in my position not to be sure.)

See also:
Empirical Likelihood
Large Deviations and Information Theory in the Foundations of Statistics

Recommended, big picture:
Stephen M. Stigler, "The Epic Story of Maximum Likelihood", Statistical Science 22 (2007): 598--620, arxiv:0804.2996

Recommended, close-ups:
Ronald W. Butler, "Predictive Likelihood Inference with Applications", Journal of the Royal Statistical Society B 48 (1986): 1--38 ["in the predictive setting, all parameters are nuisance parameters". JSTOR]
Bradley Efron, "Maximum Likelihood and Decision Theory", The Annals of Statistics 10 (1982): 340--356
Charles J. Geyer, "Le Cam Made Simple: Asymptotics of Maximum Likelihood without the LLN or CLT or Sample Size Going to Infinity", arxiv:1206.4762 [There are two separable points here. One is that much of the usual asymptotic theory of maximum likelihood follows from the quadratic form of the likelihood alone; whenever and however that is reached, those consequences follow....
