The Feynman experience · Part 3

Single-parameter frequentist inference

evergreen · Jan 8, 2019

Introduction

Suppose we have independent and identically distributed random variables $Y_i \sim \text{Normal}(\mu, \sigma^2)$ , with observed values $y_i$ . Given a sample of size $n$ , $\{y_1, y_2, \dots, y_n\}$ , we want to infer the mean $\mu$ . There are many ways to approach this; here we implement three estimators:

the sample mean estimator,
the least squares estimator,
the maximum likelihood estimator.

Let’s generate a sample to work with. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:

import numpy as np

np.random.seed(51)

real_mu, real_sigma, n = 5, 1, 1000

# One sample of size 1,000 from Normal(5, 1)
sample_1 = real_sigma * np.random.randn(n) + real_mu

Histogram of a sample of 1,000 draws from a Normal(5, 1) distribution, with the population density curve overlaid. — One sample of 1,000 draws from Normal(5, 1), with the population density overlaid.

Estimators

What is an estimator? Following Casella and Berger, a point estimator is any function $W(X_1, \dots, X_n)$ of a sample.¹ Let’s evaluate a few such functions.

Sample mean

The sample mean is $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ . Its expected value is

\begin{aligned} E(\bar{Y}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n} E\!\left(\sum_{i=1}^{n} Y_i\right) \\ &= \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu \\ &= \frac{1}{n}\, n\mu = \mu \end{aligned}

and, following the same steps for the variance,

\begin{aligned} V(\bar{Y}) &= V\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} V\!\left(\sum_{i=1}^{n} Y_i\right) \\ &= \frac{1}{n^2}\sum_{i=1}^{n} V(Y_i) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 \\ &= \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \end{aligned}

So we expect the sample mean to land near the real $\mu = 5$ :

np.round(np.mean(sample_1), 5)
# → 4.99545

Distribution

From the derivations above, the sample mean should follow $\bar{Y} \sim \text{Normal}\!\left(\mu, \frac{\sigma^2}{n}\right)$ — here $\text{Normal}(5, 0.001)$ . We can check this empirically by drawing a hundred samples and taking the mean of each:

# 100 independent samples of size 1,000
sample_100 = real_sigma * np.random.randn(100, n) + real_mu
mean_100 = np.mean(sample_100, axis=1)

Histogram of 100 sample means clustering tightly around 5, matching the expected Normal(5, 0.001) sampling distribution. — The hundred sample means cluster around 5, tracking the expected Normal(5, 0.001) sampling distribution.

Least squares

Let’s view the same distribution differently — as the simplest possible regression,² $Y_i = \beta_0 + \epsilon$ with $\epsilon \sim \text{Normal}(0, \sigma^2)$ , where $\beta_0$ is a constant to identify. Here $\beta_0$ and $\mu$ are the same thing, and the inference becomes an optimization problem:

\hat{\beta}_0 = \operatorname*{arg\,min}_{\beta_0} \sum_{i=1}^{n}(y_i - \beta_0)^2

Differentiating with respect to $\beta_0$ and setting the result to zero,

\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}\beta_0} \sum_{i=1}^{n}(y_i - \hat{\beta}_0)^2 &= 0 \\ -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0) &= 0 \\ 2n\hat{\beta}_0 &= 2\sum_{i=1}^{n} y_i \\ \hat{\beta}_0 &= \frac{1}{n}\sum_{i=1}^{n} y_i \end{aligned}

The least squares solution is exactly the sample mean, $\hat{\beta}_0 = \bar{y}$ . We can confirm it numerically with SciPy’s optimizer:

from scipy.optimize import minimize

def squares(beta, sample):
    return np.sum((sample - beta) ** 2)

beta0_hat = minimize(squares, 0, args=sample_1).x[0]
round(beta0_hat, 5)
# → 4.99545

Because the estimator is identical to the sample mean, repeating it across the hundred samples yields exactly the same sampling distribution as above.

Maximum likelihood

For the third estimator we first need the likelihood. Assuming the variables are i.i.d.,

\begin{aligned} F(\beta_0, \sigma) &= P(Y_1 = y_1, \dots, Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i - \beta_0}{\sigma}\right)^2} \\ &= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} e^{-\frac{1}{2}\frac{\sum_{i=1}^{n}(y_i - \beta_0)^2}{\sigma^2}} \\ \log F(\beta_0, \sigma) &= -n\log\!\left(\sqrt{2\pi\sigma^2}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2 \end{aligned}

Maximizing the log-likelihood over $\beta_0$ means setting its derivative to zero; only the second term depends on $\beta_0$ :

\begin{aligned} \frac{\partial}{\partial \beta_0}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\right] &= 0 \\ \sum_{i=1}^{n}(y_i - \beta_0) &= 0 \\ \hat{\beta}_0 &= \frac{1}{n}\sum_{i=1}^{n} y_i \end{aligned}

Once again the answer is the sample mean. Numerically, we minimize the negative log-likelihood:

def neg_log_likelihood(beta, sample):
    # Negative sign so we can reuse minimize
    return n * np.log(np.sqrt(2 * np.pi * real_sigma ** 2)) + \
        np.sum((sample - beta) ** 2) / (2 * real_sigma ** 2)

beta0_hat = minimize(neg_log_likelihood, 0, args=sample_1).x[0]
round(beta0_hat, 5)
# → 4.99545

All three estimators agree to five decimals — which is the point: different principles, same estimator.

Inference

Reference range

If the population follows $\text{Normal}(\mu, \sigma^2)$ , the sample mean follows $\text{Normal}\!\left(\mu, \frac{\sigma^2}{n}\right)$ . Knowing $\mu$ and $\sigma$ , we could state a reference range for the plausible values of the sample mean:

RR(\bar{Y}, \alpha) : \mu \pm Z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}

Confidence interval

In practice we don’t know $\mu$ . Instead we build an interval around the sample mean $\bar{Y}$ that probably contains it — a confidence interval:

CI(\mu, \alpha) : \bar{Y} \pm Z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}

Using our hundred samples and $Z_{\alpha/2} = 2$ (about two standard deviations, so roughly a 95% interval), we expect close to 95 of them to contain the real $\mu = 5$ :

# A ~95% interval (Z ≈ 2) around each of the 100 sample means
ci_100 = [(m - 2 / np.sqrt(n), m + 2 / np.sqrt(n)) for m in mean_100]

# How many contain the true mu = 5?
contains_mu = [lo < 5 < hi for lo, hi in ci_100]
np.sum(contains_mu)
# → 97

97 out of 100 — comfortably in line with the expected 95%.

Casella, G. & Berger, R. L. Statistical Inference (2nd ed.). (Duxbury, 2002). ↩
Costa, M. A. Tópicos em Ciência dos Dados: Introdução aos Modelos Paramétricos. (Bonecker, 2019). ↩