mineti.dev

The Feynman experience · Part 3

Single-parameter frequentist inference

evergreen · Jan 8, 2019

Introduction

Suppose we have independent and identically distributed random variables YiNormal(μ,σ2)Y_i \sim \text{Normal}(\mu, \sigma^2), with observed values yiy_i. Given a sample of size nn, {y1,y2,,yn}\{y_1, y_2, \dots, y_n\}, we want to infer the mean μ\mu. There are many ways to approach this; here we implement three estimators:

  • the sample mean estimator,
  • the least squares estimator,
  • the maximum likelihood estimator.

Let’s generate a sample to work with. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:

import numpy as np

np.random.seed(51)

real_mu, real_sigma, n = 5, 1, 1000

# One sample of size 1,000 from Normal(5, 1)
sample_1 = real_sigma * np.random.randn(n) + real_mu
Histogram of a sample of 1,000 draws from a Normal(5, 1) distribution, with the population density curve overlaid.
One sample of 1,000 draws from Normal(5, 1), with the population density overlaid.

Estimators

What is an estimator? Following Casella and Berger, a point estimator is any function W(X1,,Xn)W(X_1, \dots, X_n) of a sample.1 Let’s evaluate a few such functions.

Sample mean

The sample mean is yˉ=1ni=1nyi\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i. Its expected value is

E(Yˉ)=E ⁣(1ni=1nYi)=1nE ⁣(i=1nYi)=1ni=1nE(Yi)=1ni=1nμ=1nnμ=μ\begin{aligned} E(\bar{Y}) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n} E\!\left(\sum_{i=1}^{n} Y_i\right) \\ &= \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu \\ &= \frac{1}{n}\, n\mu = \mu \end{aligned}

and, following the same steps for the variance,

V(Yˉ)=V ⁣(1ni=1nYi)=1n2V ⁣(i=1nYi)=1n2i=1nV(Yi)=1n2i=1nσ2=1n2nσ2=σ2n\begin{aligned} V(\bar{Y}) &= V\!\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} V\!\left(\sum_{i=1}^{n} Y_i\right) \\ &= \frac{1}{n^2}\sum_{i=1}^{n} V(Y_i) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 \\ &= \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \end{aligned}

So we expect the sample mean to land near the real μ=5\mu = 5:

np.round(np.mean(sample_1), 5)
# → 4.99545

Distribution

From the derivations above, the sample mean should follow YˉNormal ⁣(μ,σ2n)\bar{Y} \sim \text{Normal}\!\left(\mu, \frac{\sigma^2}{n}\right) — here Normal(5,0.001)\text{Normal}(5, 0.001). We can check this empirically by drawing a hundred samples and taking the mean of each:

# 100 independent samples of size 1,000
sample_100 = real_sigma * np.random.randn(100, n) + real_mu
mean_100 = np.mean(sample_100, axis=1)
Histogram of 100 sample means clustering tightly around 5, matching the expected Normal(5, 0.001) sampling distribution.
The hundred sample means cluster around 5, tracking the expected Normal(5, 0.001) sampling distribution.

Least squares

Let’s view the same distribution differently — as the simplest possible regression,2 Yi=β0+ϵY_i = \beta_0 + \epsilon with ϵNormal(0,σ2)\epsilon \sim \text{Normal}(0, \sigma^2), where β0\beta_0 is a constant to identify. Here β0\beta_0 and μ\mu are the same thing, and the inference becomes an optimization problem:

β^0=arg minβ0i=1n(yiβ0)2\hat{\beta}_0 = \operatorname*{arg\,min}_{\beta_0} \sum_{i=1}^{n}(y_i - \beta_0)^2

Differentiating with respect to β0\beta_0 and setting the result to zero,

ddβ0i=1n(yiβ^0)2=02i=1n(yiβ^0)=02nβ^0=2i=1nyiβ^0=1ni=1nyi\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}\beta_0} \sum_{i=1}^{n}(y_i - \hat{\beta}_0)^2 &= 0 \\ -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0) &= 0 \\ 2n\hat{\beta}_0 &= 2\sum_{i=1}^{n} y_i \\ \hat{\beta}_0 &= \frac{1}{n}\sum_{i=1}^{n} y_i \end{aligned}

The least squares solution is exactly the sample mean, β^0=yˉ\hat{\beta}_0 = \bar{y}. We can confirm it numerically with SciPy’s optimizer:

from scipy.optimize import minimize

def squares(beta, sample):
    return np.sum((sample - beta) ** 2)

beta0_hat = minimize(squares, 0, args=sample_1).x[0]
round(beta0_hat, 5)
# → 4.99545

Because the estimator is identical to the sample mean, repeating it across the hundred samples yields exactly the same sampling distribution as above.

Maximum likelihood

For the third estimator we first need the likelihood. Assuming the variables are i.i.d.,

F(β0,σ)=P(Y1=y1,,Yn=yn)=i=1nP(Yi=yi)=i=1n12πσe12(yiβ0σ)2=(12πσ)ne12i=1n(yiβ0)2σ2logF(β0,σ)=nlog ⁣(2πσ2)12σ2i=1n(yiβ0)2\begin{aligned} F(\beta_0, \sigma) &= P(Y_1 = y_1, \dots, Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i - \beta_0}{\sigma}\right)^2} \\ &= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} e^{-\frac{1}{2}\frac{\sum_{i=1}^{n}(y_i - \beta_0)^2}{\sigma^2}} \\ \log F(\beta_0, \sigma) &= -n\log\!\left(\sqrt{2\pi\sigma^2}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2 \end{aligned}

Maximizing the log-likelihood over β0\beta_0 means setting its derivative to zero; only the second term depends on β0\beta_0:

β0[12σ2i=1n(yiβ0)2]=0i=1n(yiβ0)=0β^0=1ni=1nyi\begin{aligned} \frac{\partial}{\partial \beta_0}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0)^2\right] &= 0 \\ \sum_{i=1}^{n}(y_i - \beta_0) &= 0 \\ \hat{\beta}_0 &= \frac{1}{n}\sum_{i=1}^{n} y_i \end{aligned}

Once again the answer is the sample mean. Numerically, we minimize the negative log-likelihood:

def neg_log_likelihood(beta, sample):
    # Negative sign so we can reuse minimize
    return n * np.log(np.sqrt(2 * np.pi * real_sigma ** 2)) + \
        np.sum((sample - beta) ** 2) / (2 * real_sigma ** 2)

beta0_hat = minimize(neg_log_likelihood, 0, args=sample_1).x[0]
round(beta0_hat, 5)
# → 4.99545

All three estimators agree to five decimals — which is the point: different principles, same estimator.

Inference

Reference range

If the population follows Normal(μ,σ2)\text{Normal}(\mu, \sigma^2), the sample mean follows Normal ⁣(μ,σ2n)\text{Normal}\!\left(\mu, \frac{\sigma^2}{n}\right). Knowing μ\mu and σ\sigma, we could state a reference range for the plausible values of the sample mean:

RR(Yˉ,α):μ±Zα/2σnRR(\bar{Y}, \alpha) : \mu \pm Z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}

Confidence interval

In practice we don’t know μ\mu. Instead we build an interval around the sample mean Yˉ\bar{Y} that probably contains it — a confidence interval:

CI(μ,α):Yˉ±Zα/2σnCI(\mu, \alpha) : \bar{Y} \pm Z_{\alpha/2}\, \frac{\sigma}{\sqrt{n}}

Using our hundred samples and Zα/2=2Z_{\alpha/2} = 2 (about two standard deviations, so roughly a 95% interval), we expect close to 95 of them to contain the real μ=5\mu = 5:

# A ~95% interval (Z ≈ 2) around each of the 100 sample means
ci_100 = [(m - 2 / np.sqrt(n), m + 2 / np.sqrt(n)) for m in mean_100]

# How many contain the true mu = 5?
contains_mu = [lo < 5 < hi for lo, hi in ci_100]
np.sum(contains_mu)
# → 97

97 out of 100 — comfortably in line with the expected 95%.

References

  1. Casella, G. & Berger, R. L. Statistical Inference (2nd ed.). (Duxbury, 2002).

  2. Costa, M. A. Tópicos em Ciência dos Dados: Introdução aos Modelos Paramétricos. (Bonecker, 2019).

← All articles