The Feynman experience · Part 5
Multiple linear regression
Introduction
This is the multi-feature sequel to simple linear regression. For a visual intuition, Luis Serrano’s explanation is a good start;1 the Homemade Machine Learning2 and Machine Learning Basics3 repositories are also worth a look.
Let’s generate some correlated three-dimensional data to model. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:
import numpy as np
np.random.seed(51)
real_mean = [1, 3, 5]
real_cov = np.array([[4, 1, 2], [1, 9, -4], [2, -4, 21]])
n = 100
data = np.random.multivariate_normal(real_mean, real_cov, n)
The model
With a vector of features , the model of given is
which is far more compact in matrix notation. Collecting the targets into a vector and the features into a design matrix with a leading column of ones,
the model becomes
For a sample of size with features we have , a matrix with , the parameter vector , and the error vector , with and .
Parameter estimation
The residual sum of squares in matrix form is
and the least-squares estimator minimizes it:
Differentiating with respect to and setting the result to zero gives a closed form:
Translating that directly into NumPy — features in (with a column of ones), the third variable as the target :
X = np.c_[np.ones(n), data[:, :2]]
Y = data[:, 2]
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
np.round(beta_hat, 3)
# → array([ 6.278, 0.743, -0.607])
We can reach the same coefficients by minimizing the cost directly with SciPy:
from scipy.optimize import minimize
def rss(beta, X, Y):
r = Y - X @ beta
return r @ r
beta_optim = minimize(rss, [0, 0, 0], args=(X, Y)).x
np.round(beta_optim, 3)
# → array([ 6.278, 0.743, -0.607])
Properties of the estimators
Since is a function of the random sample, it is a vector of random variables. Its expected value is
so the estimator is unbiased. For the covariance, using ,
In practice we don’t know , so we estimate it with . The diagonal of the covariance matrix holds the variances of each coefficient:
cov_hat = real_cov[2, 2] * np.linalg.inv(X.T @ X)
np.round(np.diag(cov_hat), 4)
# → array([0.5102, 0.0633, 0.0281])
To check these, we can recover the true coefficients implied by the population mean and covariance,4
real_betas = np.linalg.solve(real_cov[:2, :2], real_cov[:2, 2])
real_beta_0 = real_mean[2] - np.array(real_mean[:2]) @ real_betas
np.round([real_beta_0, *real_betas], 5)
# → array([ 5.91429, 0.62857, -0.51429])
and then estimate across a hundred fresh samples to see how the coefficients are distributed:
np.random.seed(51)
data_100 = np.random.multivariate_normal(real_mean, real_cov, (100, n))
beta_100 = np.array([
np.linalg.inv(Xi.T @ Xi) @ Xi.T @ d[:, 2]
for d in data_100
for Xi in [np.c_[np.ones(n), d[:, :2]]]
])
The coefficients read just as in the single-feature case: a one-unit change in feature shifts the mean target by , and when every feature is 0 the mean target is .
Regularization
When features are many or correlated, it helps to constrain the coefficients. Three standard techniques are LASSO, ridge, and elastic net; the Elements of Statistical Learning covers them in depth.5 The picture below shows the idea: we shrink the least-squares solution toward the origin by confining to a region — a diamond for LASSO (), a circle for ridge () — and the solution lands where an RSS contour first touches it.
LASSO
LASSO — least absolute shrinkage and selection operator — adds an penalty to the objective:
The parameter controls the shrinkage; at we recover ordinary least squares. As a cost function:
def rss_lasso(beta, X, Y, lbd):
r = Y - X @ beta
return r @ r + lbd * np.sum(np.abs(beta))
beta_lasso = minimize(rss_lasso, [0, 0, 0], args=(X, Y, 5)).x
np.round(beta_lasso, 3)
# → array([ 6.217, 0.743, -0.594])
Ridge
Ridge swaps the penalty for an one:
def rss_ridge(beta, X, Y, lbd):
r = Y - X @ beta
return r @ r + lbd * np.sum(beta ** 2)
beta_ridge = minimize(rss_ridge, [0, 0, 0], args=(X, Y, 5)).x
np.round(beta_ridge, 3)
# → array([ 5.602, 0.824, -0.505])
Elastic net
Both are special cases of a penalty . Choosing compromises between the two,3 but loses LASSO’s knack for driving coefficients exactly to zero. The elastic net keeps that feature selection while still shrinking correlated coefficients like ridge, by blending the two penalties:
with : at it is LASSO, at it is ridge.
def rss_elastic(beta, X, Y, lbd, alpha):
r = Y - X @ beta
return r @ r + lbd * np.sum(alpha * beta ** 2 + (1 - alpha) * np.abs(beta))
beta_elastic = minimize(rss_elastic, [0, 0, 0], args=(X, Y, 5, 0.5)).x
np.round(beta_elastic, 3)
# → array([ 5.891, 0.786, -0.547])
Conclusion
From a single feature to many, the same least-squares idea carries over — only now it is a tidy matrix expression — and regularization gives us a dial between fitting the data and keeping the coefficients in check. That rounds out the regression thread, and this series: a handful of foundational ideas in probability and statistics, each built from scratch with nothing but NumPy and SciPy.
References
-
Serrano, L. A Friendly Introduction to Linear Regression. https://www.youtube.com/watch?v=wYPUhge9w5c ↩
-
Trekhleb, O. Homemade Machine Learning. https://github.com/trekhleb/homemade-machine-learning ↩
-
Peters, A.-L. Machine Learning Basics. https://github.com/zotroneneis/machine_learning_basics ↩ ↩2
-
Using the covariance matrix to find coefficients for multiple regression. Cross Validated. https://stats.stackexchange.com/questions/107597 ↩
-
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (2nd ed.). (Springer, 2009). ↩