The Feynman experience · Part 2
Markov chains
Introduction
A Markov chain is an important example of a stochastic process. A sequence of random variables is called a stochastic process with discrete time parameter; the first, , is the initial state, and for the variable is the state of the process at time .1 These sequences differ from the ones met elsewhere in statistics — in the proof of the law of large numbers, say — because the random variables are not necessarily independent and identically distributed. In fact, the interesting question in a stochastic process is precisely how the variables depend on one another over time. Generally, we want the distribution of the next state given everything that has already happened:
The Markov property
A Markov chain models a memoryless stochastic process: the probability of the next state depends only on the present one. This is the Markov property, and it simplifies the equation above to
Past and future are conditionally independent given the present.2 For a finite Markov chain with possible states, the conditional distributions for are called the transition distributions of the chain.1 If they do not change over time, the chain is homogeneous, or said to have stationary transition distributions.
The transition matrix
Consider a finite Markov chain with stationary transition distributions . The transition matrix is the matrix with elements :1
Let’s use one to define a Markov chain with five states. As throughout this series, we work from scratch in plain Python — here, with nothing but NumPy:
import numpy as np
# Each row i holds the probabilities of moving from state i to every state
transition_matrix = np.array([[0.4, 0.6, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.9, 0.0, 0.1],
[0.4, 0.0, 0.1, 0.5, 0.0],
[0.2, 0.7, 0.0, 0.0, 0.1],
[0.5, 0.0, 0.0, 0.5, 0.0]])
We can picture the same chain as a directed graph — one node per state, an arrow for every possible transition, each labelled with its probability:
Because each row holds the probabilities of leaving a given state, every row must sum to one and all entries must be nonnegative. A square matrix with these properties is a stochastic matrix.
# Each row sums to one
np.sum(transition_matrix, axis=1)
# → array([1., 1., 1., 1., 1.])
The initial distribution
We can describe where the chain starts with a row vector of length , where each element is the probability of being in state . This is the initial distribution of the chain:
Given the initial distribution and the transition matrix, the distribution after one step is
which generalises to steps as
Let’s start from state 1 — that is, — and compute the distribution after one step:
state = np.array([1, 0, 0, 0, 0])
state @ transition_matrix
# → array([0.4, 0.6, 0. , 0. , 0. ])
This matches the first row of the transition matrix: from state 1 there is a probability of 0.6 of moving to state 2 and 0.4 of staying put. Now suppose we want the distribution after ten steps:
state @ np.linalg.matrix_power(transition_matrix, 10)
# → array([0.27745538, 0.28189331, 0.23522178, 0.16504972, 0.04037981])
These are the probabilities of being in each state after ten steps, starting from . What if we start somewhere else?
state_2 = np.array([0.7, 0, 0.1, 0, 0.2])
state_2 @ np.linalg.matrix_power(transition_matrix, 10)
# → array([0.27496154, 0.26573116, 0.24972629, 0.16938226, 0.04019874])
The results are remarkably similar — which brings us to the last concept, the stationary distribution.
The stationary distribution
Under some general conditions I won’t go into here, a probability vector that satisfies is called a stationary distribution of the Markov chain.1 That equation should ring a bell from linear algebra: a vector with that property is an eigenvector of the matrix. In other words, the stationary distribution has a closed form — it is the left eigenvector of with eigenvalue 1:
# Take the left eigenvector
v = np.linalg.eig(transition_matrix.T)[1][:, 0]
# Normalise (l1) so the entries sum to one
v.real / np.linalg.norm(v.real, ord=1)
# → array([0.26588235, 0.26823529, 0.26823529, 0.15529412, 0.04235294])
Starting from any initial distribution, the chain converges to this vector as the number of steps . That is why our two ten-step experiments landed in nearly the same place.
Conclusion
We now have some basic tools for exploring Markov chains. A natural next step is Markov chain Monte Carlo (MCMC), which is built on top of them and shows up everywhere in Bayesian inference. The details can be daunting; for a fun application of Markov chains to try your skills on, here’s a nice one: simulating Chutes and Ladders.3
References
-
DeGroot, M. H. & Schervish, M. J. Probability and Statistics. (Pearson Education, 2012). ↩ ↩2 ↩3 ↩4
-
Blitzstein, J. Lectures 31–32: Markov Chains, Statistics 110. (Harvard University, 2013). https://www.youtube.com/watch?v=8AJPs3gvNlY ↩
-
VanderPlas, J. Simulating Chutes & Ladders in Python. Pythonic Perambulations (2017). https://jakevdp.github.io/blog/2017/12/18/simulating-chutes-and-ladders/ ↩