\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

key notions

  • Markov Chain Monte Carlo methods
    • Metropolis Hastings
    • Gibbs
  • convergence / representativeness
    • trace plots
    • \(\hat{R}\)
  • efficiency
    • autocorrelation
    • effective sample size

recap

Bayes rule for data analysis:

\[\underbrace{P(\theta \mid D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \times \underbrace{P(D \mid \theta)}_{\text{likelihood}}\]

normalizing constant:

\[ \int P(\theta') \times P(D \mid \theta') \, \text{d}\theta' = P(D) \]

easy to solve only if:

  • \(\theta\) is a single discrete variable with a reasonably sized domain
  • \(P(\theta)\) is a conjugate prior for the likelihood function \(P(D \mid \theta)\) (see the sketch below)
  • we are very lucky
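
As an illustration of the conjugate case: a Beta prior combined with a Binomial likelihood yields a Beta posterior in closed form. A minimal sketch in R; the prior parameters and data are made up:

# Beta prior is conjugate to the Binomial likelihood:
# Beta(a, b) prior + k successes in N flips -> Beta(a + k, b + N - k) posterior
a = 1; b = 1       # flat Beta(1,1) prior (hypothetical choice)
k = 7; N = 24      # made-up data
curve(dbeta(x, a + k, b + N - k), from = 0, to = 1,
      xlab = "theta", ylab = "posterior density")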

approximation by sampling

example: normal distribution

library(tidyverse)   # provides ggplot2 & tibble

# compare a sample-based density estimate (black) to the true density (red)
xVec = seq(-4, 4, length.out = 5000)     # vector of x-coordinates
myPlot = ggplot(tibble(x = rnorm(5000, mean = 0, sd = 1)), aes(x = x)) +
  geom_density() +
  geom_line(data = tibble(x = xVec, y = dnorm(xVec, mean = 0, sd = 1)),
            aes(x = x, y = y), color = "firebrick") +
  xlab("x") + ylab("(estimated) density")
myPlot

notation

if \(P(\cdot \mid \vec{\alpha}) \in \Delta(X)\) is a distribution over \(X\) (possibly high-dimensional) with fixed parameters \(\vec{\alpha}\), then write \(x \sim P(\vec{\alpha})\)

  • think \(x\) is a sample from \(P(\vec{\alpha})\)

 

examples

  • \(x \sim \mathcal{N}(\mu,\sigma)\)
    • read: "\(x\) is normally distributed with mean \(\mu\) and SD \(\sigma\)"
  • \(\theta \sim \text{Beta}(a,b)\)
    • read: "\(\theta\) comes from a beta distribution with shape \(a\) and \(b\)"

why simulate?

problem

  • assume that \(x \sim P\) and that \(P\) is "unwieldy"
  • we'd like to know some property of \(P\) (e.g., function \(f(P)\))
    • e.g., mean/median/sd, 95% HDI, cumulative probability \(\int_{0.8}^\infty P(x) \ \text{d}x\), \(\dots\)


solution: Monte Carlo simulation

  • draw samples \(S = x_1, \dots, x_n \sim P\)
  • calculate \(f(S)\) (the analog of \(f(P)\), but computed on samples)
  • for many \(f\), \(f(S)\) will approximate \(f(P)\) if \(S\) is "representative" of \(P\) (see the sketch below)
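
A minimal sketch in R, with the standard normal standing in for an "unwieldy" \(P\):

# Monte Carlo approximation: estimate properties of P from samples
samples = rnorm(1e5, mean = 0, sd = 1)   # stand-in for x ~ P
mean(samples)                        # approximates the mean of P
sd(samples)                          # approximates the SD of P
mean(samples > 0.8)                  # approximates the tail probability P(x > 0.8)
quantile(samples, c(0.025, 0.975))   # central 95% interval (not an HDI, but close for symmetric P)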

Markov chains

Markov chain

 

intuition

a sequence of elements, \(x_1, \dots, x_n\) such that every \(x_{i+1}\) depends only on its predecessor \(x_i\) (think: probabilistic FSA)

probabilistic automaton

(figure: probabilistic automaton)

Markov property

\[ P(x_{n+1} \mid x_1, \dots, x_n) = P(x_{n+1} \mid x_n) \]
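
A minimal sketch of such a chain in R, using a hand-picked (made-up) transition matrix over three states:

# simulate a 3-state Markov chain from a (made-up) transition matrix
trans = matrix(c(0.5, 0.4, 0.1,
                 0.2, 0.6, 0.2,
                 0.1, 0.4, 0.5),
               nrow = 3, byrow = TRUE)   # each row sums to 1
n = 10000
x = integer(n)
x[1] = 1                                 # arbitrary starting state
for (i in 2:n) {
  x[i] = sample(1:3, size = 1, prob = trans[x[i - 1], ])  # next state depends only on current
}
table(x) / n   # long-run visit frequencies approximate the stationary distribution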

Markov Chain Monte Carlo methods

get sequence of samples \(x_1, \dots, x_n\) s.t.

  1. sequence has the Markov property (\(x_{i+1}\) depends only on \(x_i\)), and
  2. the stationary distribution of the chain is \(P\).


consequences of Markov property

  • non-independence -> autocorrelation
    😞
  • easy proof that stationary distribution is \(P\)
    😃
  • we can work with non-normalized probabilities
    😃

Metropolis Hastings

island hopping

islands

  • set of islands \(X = \{x^s, x^b, x^p\}\)
  • goal: hop around & visit every island \(x \in X\) proportional to its population \(P(x)\)
    • think: "samples" \(x \sim P\)
  • problem: island hopper can remember at most 2 islands' population
    • think: we don't know the normalizing constant

Metropolis Hastings

  • let \(f(x) = \alpha P(x)\) (e.g., unnormalized posterior)
  • start at random \(x_0\), define probability \(P_\text{trans}(x_i \rightarrow x_{i+1})\) of going from \(x_{i}\) to \(x_{i+1}\)
    • proposal \(P_\text{prpsl}(x_{i+1} \mid x_i)\): prob. of considering jump to \(x_{i+1}\) from \(x_{i}\)
    • acceptance \(P_\text{accpt}(x_{i+1} \mid x_i)\): prob of accepting jump to proposal \(x_{i+1}\) \[P_\text{accpt}(x_{i+1} \mid x_i) = \text{min} \left (1, \frac{f(x_{i+1})}{f(x_{i})} \frac{P_\text{prpsl}(x_{i} \mid x_{i+1})}{P_\text{prpsl}(x_{i+1} \mid x_i)} \right)\]
    • transition \(P_\text{trans}(x_i \rightarrow x_{i+1}) = P_\text{prpsl}(x_{i+1} \mid x_i) \ P_\text{accpt}(x_{i+1} \mid x_i)\)
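
A minimal sketch of this scheme in R, with a symmetric (normal) proposal so that the proposal ratio cancels; the target \(f\) and all tuning values are made up:

# Metropolis sampler with a symmetric normal proposal (proposal ratio cancels)
f = function(x) exp(-x^2 / 2) * (2 + sin(5 * x))   # some unnormalized target density
n = 50000
samples = numeric(n)
x = 0                                       # arbitrary starting point
for (i in 1:n) {
  proposal = rnorm(1, mean = x, sd = 1)     # P_prpsl: symmetric around current x
  if (runif(1) < f(proposal) / f(x)) {      # accept with prob. min(1, f(x')/f(x))
    x = proposal
  }
  samples[i] = x    # on rejection, the current value is repeated -> autocorrelation
}
hist(samples, breaks = 100, freq = FALSE)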

7-island hopping

KruschkeFig7.2

7-island hopping, cont.

KruschkeFig7.3

properties of MH

  • motto: always accept upward proposals; accept downward proposals with probability \(\frac{f(x_{i+1})}{f(x_{i})}\) (assuming a symmetric proposal distribution)
  • the ratio \(\frac{f(x_{i+1})}{f(x_{i})}\) means that we can neglect normalizing constants
  • \(P_\text{trans}(x_i \rightarrow x_{i+1})\) defines a transition matrix -> Markov chain analysis!
  • for suitable proposal distributions:
    • a stationary distribution exists (the left eigenvector of the transition matrix with eigenvalue 1; see the sketch below)
    • every initial condition converges to the stationary distribution
    • the stationary distribution is \(P\)
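
The eigenvector claim can be checked numerically; a minimal sketch, reusing the made-up three-state transition matrix from the Markov chain example above:

# the stationary distribution is the left eigenvector of the transition
# matrix with eigenvalue 1 (= right eigenvector of the transpose)
trans = matrix(c(0.5, 0.4, 0.1,
                 0.2, 0.6, 0.2,
                 0.1, 0.4, 0.5),
               nrow = 3, byrow = TRUE)
ev = eigen(t(trans))
stationary = Re(ev$vectors[, 1]) / sum(Re(ev$vectors[, 1]))  # normalize to sum to 1
stationary   # matches the long-run visit frequencies from the simulation above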

influence of proposal distribution

KruschkeFig7.4

Gibbs sampling

Gibbs sampling: introduction

  • MH is a very general and versatile algorithm
  • however, in specific situations other algorithms may be better suited
    • e.g., when \(x \sim P\) is high-dimensional (think: many parameters)
  • in such cases, the Gibbs sampler may be helpful
  • basic idea: split the multidimensional problem into a series of simpler problems of lower dimensionality

Gibbs sampling: main idea

  • assume \(X=(X_1,X_2,X_3)\) and \(P(x)=P(x_1,x_2,x_3)\)
  • start with \(x^0=(x_1^0,x_2^0,x_3^0)\)
  • at iteration \(i\) of the Gibbs sampler, we have \(x^{i-1}\) and need to generate \(x^{i}\)
    • \(x_1^{i} \sim P(x_1 \mid x_2^{i-1},x_3^{i-1})\)
    • \(x_2^{i} \sim P(x_2 \mid x_1^{i},x_3^{i-1})\)
    • \(x_3^{i} \sim P(x_3 \mid x_1^{i},x_2^{i})\)
  • for large \(n\), \(x^n\) will be (approximately) a sample from \(P(x)\)
  • \(x^n_1\) will be a sample from the marginal distribution \(P(x_1)\):

\[ P(x_1) = \int P(\tuple{x_1, x_2, x_3}) \ \text{d} \tuple{x_2, x_3}\]
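
A minimal sketch of a Gibbs sampler in R; the scheme above is stated for three dimensions, but for brevity this sketch uses a bivariate standard normal with correlation \(\rho\), where both full conditionals are known normal distributions (all numbers are made up):

# Gibbs sampling for a bivariate standard normal with correlation rho;
# full conditionals: x1 | x2 ~ N(rho * x2, sqrt(1 - rho^2)), and vice versa
rho = 0.8
n = 10000
x1 = x2 = numeric(n)
x1[1] = x2[1] = 0                  # arbitrary starting point
for (i in 2:n) {
  x1[i] = rnorm(1, mean = rho * x2[i - 1], sd = sqrt(1 - rho^2))  # sample x1 | x2
  x2[i] = rnorm(1, mean = rho * x1[i],     sd = sqrt(1 - rho^2))  # sample x2 | x1
}
cor(x1, x2)   # should be close to rho; x1 alone approximates the marginal P(x1)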

example: 2 coin flips

KruschkeFig7.5

example: Gibbs jumps

KruschkeFig7.7

example: Gibbs for 2 coin flips

KruschkeFig7.8

example: MH for 2 coin flips

KruschkeFig7.6

summary

  • Gibbs sampling can be more efficient than MH
  • Gibbs needs samples from conditional posterior distribution
    • MH is more generally applicable
  • performance of MH depends on the proposal distribution

clever software helps determine when/how to do Gibbs and how to tune MH's proposal function (next class)

assessing sample chains

problem statements

 

convergence/representativeness

  • we have samples from MCMC, ideally several chains
  • in the limit, samples must be representative of \(P\)
  • how do we know that our meagre finite samples are representative?

 

efficiency

  • ideally, we'd like the shortest chains that are still representative
  • how do we measure that we have "enough" samples?

packages to compare MCMC chains

require('coda')
require('ggmcmc')


coda and ggmcmc basically do the same job; they mainly differ in aesthetics

ggmcmc lives in the tidyverse

example MCMC data

out = MH(f, iterations = 60000, chains = 2, burnIn = 10000) # self-made MH 
head(out[[1]])      # returns MCMC-list from package 'coda'
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1 
## End = 7 
## Thinning interval = 1 
##          variable
## iteration       mu     sigma
##         1 0.149087 0.9350134
##         2 0.149087 0.9350134
##         3 0.149087 0.9350134
##         4 0.149087 0.9350134
##         5 0.149087 0.9350134
##         6 0.149087 0.9350134
##         7 0.149087 0.9350134

example MCMC data

out2 = ggs(out)  # cast the same information into a format that 'ggmcmc' uses
out2             # this is now a tibble
## # A tibble: 200,000 × 4
##    Iteration Chain Parameter    value
##        <int> <int>    <fctr>    <dbl>
## 1          1     1        mu 0.149087
## 2          2     1        mu 0.149087
## 3          3     1        mu 0.149087
## 4          4     1        mu 0.149087
## 5          5     1        mu 0.149087
## 6          6     1        mu 0.149087
## 7          7     1        mu 0.149087
## 8          8     1        mu 0.149087
## 9          9     1        mu 0.149087
## 10        10     1        mu 0.149087
## # ... with 199,990 more rows

trace plots in 'coda'

plot(out)

trace plots in 'ggmcmc'

ggs_traceplot(out2)

visual inspection of convergence

 

trace plots from multiple chains should look like:

a bunch of hairy caterpillars madly in love with each other

 

caterpillar

examining sample chains: beginning

KruschkeFig7.10

examining sample chains: rest

KruschkeFig7.11

burn-in & thinning

 

burn-in

remove initial chunk of sample sequence to discard effects of (random) starting position

 

thinning

only keep every \(i\)-th sample to reduce autocorrelation
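
With coda, both operations can be applied to the MCMC output from above via window(); a sketch (the iteration numbers are hypothetical):

# drop the first 10000 iterations as burn-in, then keep every 5th sample
out_clean = window(out, start = 10001, thin = 5)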

R hat

\(\hat{R}\) statistic

  • a.k.a.:
    • Gelman-Rubin statistic
    • shrink factor
    • potential scale reduction factor
  • idea:
    • compare variance within a chain to variance between chains
  • in practice:
    • use software to compute it
    • aim for \(\hat{R} \le 1.1\) for all continuous variables
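
For intuition, the core of the computation can be sketched by hand; this simplified helper (hypothetical, ignoring refinements such as split chains that real implementations add) compares within-chain and between-chain variance:

# simplified R-hat for one parameter: samples as a matrix with
# one column per chain (n iterations x m chains)
rhat_simple = function(chains) {
  n = nrow(chains)
  W = mean(apply(chains, 2, var))       # mean within-chain variance
  B = n * var(colMeans(chains))         # between-chain variance
  var_plus = (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)                    # close to 1 if the chains agree
}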

R hat in 'coda'

gelman.diag(out)
## Potential scale reduction factors:
## 
##       Point est. Upper C.I.
## mu             1       1.01
## sigma          1       1.00
## 
## Multivariate psrf
## 
## 1

R hat in 'ggmcmc'

ggs_Rhat(out2)

autocorrelation

KruschkeFig7.12

autocorrelation in 'coda'

autocorr.plot(out)

autocorrelation in 'ggmcmc'

ggs_autocorrelation(out2)

effective sample size

  • intuition:
    • how many effectively independent samples remain once autocorrelation is stripped off
  • definition:

\[\text{ESS} = \frac{N}{ 1 + 2 \sum_{k=1}^{\infty} ACF(k)}\]
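
The definition can be approximated by hand with R's acf(); a sketch of a hypothetical helper that truncates the infinite sum at the first non-positive autocorrelation (one common convention):

# hand-rolled effective sample size for a single chain
ess_simple = function(x, max.lag = 200) {
  rho = acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]   # ACF(1), ACF(2), ...
  cutoff = which(rho <= 0)[1]            # truncate at first non-positive ACF
  if (!is.na(cutoff)) rho = rho[seq_len(cutoff - 1)]
  length(x) / (1 + 2 * sum(rho))
}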

effective sample size in 'coda'

effectiveSize(out)
##       mu    sigma 
## 2395.912 2138.160

fini

outlook

 

Tuesday

  • introduction to JAGS

 

Friday

  • 1st practice session

to prevent boredom

 

obligatory

  • prepare Kruschke chapter 8

 

optional

  • install JAGS & peak into its documentation
    • careful: most recent JAGS version is 4.2.0, but most recent documentation is for 4.0.0
  • browse Gelman et al (2014) Part III for more on Bayesian computation
    • more later in this course as well