mineti.dev

The Feynman experience · Part 2

Markov chains

evergreen · Dec 14, 2018

Introduction

A Markov chain is an important example of a stochastic process. A sequence of random variables X1,X2,X_1, X_2, \dots is called a stochastic process with discrete time parameter; the first, X1X_1, is the initial state, and for n=2,3,n = 2, 3, \dots the variable XnX_n is the state of the process at time nn.1 These sequences differ from the ones met elsewhere in statistics — in the proof of the law of large numbers, say — because the random variables are not necessarily independent and identically distributed. In fact, the interesting question in a stochastic process is precisely how the variables depend on one another over time. Generally, we want the distribution of the next state Xn+1X_{n+1} given everything that has already happened:

Pr(Xn+1=jXn=i,Xn1=in1,Xn2=in2,)Pr(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, X_{n-2} = i_{n-2}, \dots)

The Markov property

A Markov chain models a memoryless stochastic process: the probability of the next state depends only on the present one. This is the Markov property, and it simplifies the equation above to

Pr(Xn+1=jXn=i)Pr(X_{n+1} = j \mid X_n = i)

Past and future are conditionally independent given the present.2 For a finite Markov chain with kk possible states, the conditional distributions Pr(Xn+1=jXn=i)Pr(X_{n+1} = j \mid X_n = i) for i,j=1,,ki, j = 1, \dots, k are called the transition distributions of the chain.1 If they do not change over time, the chain is homogeneous, or said to have stationary transition distributions.

The transition matrix

Consider a finite Markov chain with stationary transition distributions pij=Pr(Xn+1=jXn=i)p_{ij} = Pr(X_{n+1} = j \mid X_n = i). The transition matrix is the k×kk \times k matrix P\mathbf{P} with elements pijp_{ij}:1

P=(p11p12p1kp21p22p2kpk1pk2pkk)\mathbf{P} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1k} \\ p_{21} & p_{22} & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & p_{k2} & \cdots & p_{kk} \end{pmatrix}

Let’s use one to define a Markov chain with five states. As throughout this series, we work from scratch in plain Python — here, with nothing but NumPy:

import numpy as np

# Each row i holds the probabilities of moving from state i to every state
transition_matrix = np.array([[0.4, 0.6, 0.0, 0.0, 0.0],
                              [0.0, 0.0, 0.9, 0.0, 0.1],
                              [0.4, 0.0, 0.1, 0.5, 0.0],
                              [0.2, 0.7, 0.0, 0.0, 0.1],
                              [0.5, 0.0, 0.0, 0.5, 0.0]])

We can picture the same chain as a directed graph — one node per state, an arrow for every possible transition, each labelled with its probability:

A five-state Markov chain drawn as a directed graph. Each numbered node is a state and each arrow a possible transition labelled with its probability, including self-loops on states 1 and 3. Every state's outgoing probabilities sum to one.
The five-state chain defined by the transition matrix. Each state has its own colour; the arrows leaving it — and their probability labels — share that colour, and those outgoing probabilities sum to one.

Because each row holds the probabilities of leaving a given state, every row must sum to one and all entries must be nonnegative. A square matrix with these properties is a stochastic matrix.

# Each row sums to one
np.sum(transition_matrix, axis=1)
# → array([1., 1., 1., 1., 1.])

The initial distribution

We can describe where the chain starts with a row vector v\mathbf{v} of length kk, where each element viv_i is the probability of being in state ii. This is the initial distribution of the chain:

v=[v1,v2,,vk]\mathbf{v} = [\,v_1, v_2, \dots, v_k\,]

Given the initial distribution and the transition matrix, the distribution after one step is

vn+1=vnP\mathbf{v}_{n+1} = \mathbf{v}_n \mathbf{P}

which generalises to ss steps as

vn+s=vnPs\mathbf{v}_{n+s} = \mathbf{v}_n \mathbf{P}^{s}

Let’s start from state 1 — that is, v=[1,0,0,0,0]\mathbf{v} = [1, 0, 0, 0, 0] — and compute the distribution after one step:

state = np.array([1, 0, 0, 0, 0])

state @ transition_matrix
# → array([0.4, 0.6, 0. , 0. , 0. ])

This matches the first row of the transition matrix: from state 1 there is a probability of 0.6 of moving to state 2 and 0.4 of staying put. Now suppose we want the distribution after ten steps:

state @ np.linalg.matrix_power(transition_matrix, 10)
# → array([0.27745538, 0.28189331, 0.23522178, 0.16504972, 0.04037981])

These are the probabilities of being in each state after ten steps, starting from v=[1,0,0,0,0]\mathbf{v} = [1, 0, 0, 0, 0]. What if we start somewhere else?

state_2 = np.array([0.7, 0, 0.1, 0, 0.2])

state_2 @ np.linalg.matrix_power(transition_matrix, 10)
# → array([0.27496154, 0.26573116, 0.24972629, 0.16938226, 0.04019874])

The results are remarkably similar — which brings us to the last concept, the stationary distribution.

The stationary distribution

Under some general conditions I won’t go into here, a probability vector v\mathbf{v} that satisfies vP=v\mathbf{v}\mathbf{P} = \mathbf{v} is called a stationary distribution of the Markov chain.1 That equation should ring a bell from linear algebra: a vector with that property is an eigenvector of the matrix. In other words, the stationary distribution has a closed form — it is the left eigenvector of P\mathbf{P} with eigenvalue 1:

# Take the left eigenvector
v = np.linalg.eig(transition_matrix.T)[1][:, 0]

# Normalise (l1) so the entries sum to one
v.real / np.linalg.norm(v.real, ord=1)
# → array([0.26588235, 0.26823529, 0.26823529, 0.15529412, 0.04235294])

Starting from any initial distribution, the chain converges to this vector as the number of steps ss \to \infty. That is why our two ten-step experiments landed in nearly the same place.

Conclusion

We now have some basic tools for exploring Markov chains. A natural next step is Markov chain Monte Carlo (MCMC), which is built on top of them and shows up everywhere in Bayesian inference. The details can be daunting; for a fun application of Markov chains to try your skills on, here’s a nice one: simulating Chutes and Ladders.3

References

  1. DeGroot, M. H. & Schervish, M. J. Probability and Statistics. (Pearson Education, 2012). 2 3 4

  2. Blitzstein, J. Lectures 31–32: Markov Chains, Statistics 110. (Harvard University, 2013). https://www.youtube.com/watch?v=8AJPs3gvNlY

  3. VanderPlas, J. Simulating Chutes & Ladders in Python. Pythonic Perambulations (2017). https://jakevdp.github.io/blog/2017/12/18/simulating-chutes-and-ladders/

← All articles