theory
posterior inference & credible values
Bayes factors & model comparison
comparison of Bayesian and "classical" NHST
practice
basics of MCMC sampling
tools for BDA ( JAGS, Stan, WebPPL, Jasp, rstanarm)
Bayesian cognitive modeling example
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \times \underbrace{P(D \, | \, \theta)}_{likelihood}\]
normalizing constant:
\[ \int P(\theta') \times P(D \mid \theta') \, \text{d}\theta' = P(D) \]
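For a low-dimensional \(\theta\), this constant can simply be approximated on a grid. Below is a minimal R sketch for the beta-binomial example used later in these slides (k = 14 successes in N = 20 trials, flat Beta(1,1) prior); the grid size and plotting commands are illustrative choices, not part of the original example.

```r
# grid approximation of posterior ~ prior * likelihood (sketch)
theta_grid   = seq(0, 1, length.out = 1000)
prior        = dbeta(theta_grid, 1, 1)                    # P(theta)
likelihood   = dbinom(14, size = 20, prob = theta_grid)   # P(D | theta)
unnormalized = prior * likelihood                         # numerator of Bayes rule
posterior    = unnormalized / sum(unnormalized)           # approximate division by P(D)
# compare against the analytic Beta(15,7) posterior
plot(theta_grid, posterior / diff(theta_grid)[1], type = "l")
curve(dbeta(x, 15, 7), add = TRUE, lty = 2)
```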
how do we calculate this for unwieldy models with a high-dimensional \(\theta\)?
an efficient way of getting samples \(x \sim P\) when \(P\) can only be evaluated up to a normalizing constant
\[P(X_{n+1} = x_{n+1} \mid X_0 = x_0, \dots, X_n = x_n) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)\]
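To make this concrete, here is a minimal (deliberately inefficient) Metropolis sampler in R for the same beta-binomial posterior; the proposal width, chain length, starting value, and seed are all arbitrary tuning choices for illustration.

```r
# minimal Metropolis sketch: sample theta from P(theta | k = 14, N = 20)
# using only the unnormalized posterior prior * likelihood
set.seed(1)                                   # arbitrary seed for reproducibility
target = function(theta) {                    # unnormalized posterior density
  if (theta <= 0 || theta >= 1) return(0)     # support of the flat prior
  dbeta(theta, 1, 1) * dbinom(14, size = 20, prob = theta)
}
n_samples = 10000
chain = numeric(n_samples)
chain[1] = 0.5                                # arbitrary starting value
for (i in 2:n_samples) {
  proposal = chain[i - 1] + rnorm(1, 0, 0.1)  # symmetric random-walk proposal
  accept_prob = min(1, target(proposal) / target(chain[i - 1]))
  chain[i] = ifelse(runif(1) < accept_prob, proposal, chain[i - 1])
}
hist(chain, freq = FALSE, breaks = 50)
curve(dbeta(x, 15, 7), add = TRUE)            # analytic posterior for comparison
```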
[figures: examples of "good" vs. "bad" chains]
[figures: "potential" and "kinetic" energy in Hamiltonian Monte Carlo]
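As a rough illustration of the physics metaphor, here is a bare-bones Hamiltonian Monte Carlo step in R for a one-dimensional standard normal target; the target, step size, and number of leapfrog steps are toy choices for illustration only.

```r
# toy HMC sketch for a 1-d standard normal target
U      = function(q) q^2 / 2     # potential energy = negative log density
grad_U = function(q) q           # its gradient
hmc_step = function(q, eps = 0.1, L = 20) {
  p = rnorm(1)                   # resample momentum (kinetic energy)
  q_new = q; p_new = p
  for (l in 1:L) {               # leapfrog integration
    p_new = p_new - eps / 2 * grad_U(q_new)
    q_new = q_new + eps * p_new
    p_new = p_new - eps / 2 * grad_U(q_new)
  }
  # accept/reject based on the change in total (potential + kinetic) energy
  H_old = U(q) + p^2 / 2
  H_new = U(q_new) + p_new^2 / 2
  if (runif(1) < exp(H_old - H_new)) q_new else q
}
chain = numeric(5000)
for (i in 2:5000) chain[i] = hmc_step(chain[i - 1])
```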
\(\hat{R}\)-statistics
\[\text{ESS} = \frac{N}{ 1 + 2 \sum_{k=1}^{\infty} ACF(k)}\]
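In practice these diagnostics need not be computed by hand; a sketch using the coda package, assuming `codaSamples` is an mcmc.list with at least two chains, as produced in the JAGS example below:

```r
# convergence / efficiency diagnostics with coda (sketch)
library(coda)
gelman.diag(codaSamples)      # R-hat: should be close to 1 for all parameters
effectiveSize(codaSamples)    # effective sample size, as in the formula above
autocorr.plot(codaSamples)    # the ACF terms that enter the ESS formula
```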
|  | JAGS | Stan | WebPPL |
|---|---|---|---|
| flavor | declarative | imperative | functional |
| feels like | BUGS + R | BUGS + C | JavaScript |
| methods | MH, Gibbs, … | HMC, variational, … | MH, HMC, rejection, … |
| interfaces | R, Python, Matlab | R, Python, Matlab, … | R, (…?) |
| goodies |  | rstanarm, loo | browser-based |
| developers | Martyn Plummer | Stan Dev. Team (Gelman, Carpenter, …) | N. Goodman & A. Stuhlmüller et al. |
| websites | JAGS | Stan | WebPPL |
require('rjags')
modelString = "
model{
  theta ~ dbeta(1,1)
  k ~ dbinom(theta, N)
}"
# prepare data for JAGS
dataList = list(k = 14, N = 20)
# set up and run model
jagsModel = jags.model(file = textConnection(modelString),
                       data = dataList,
                       n.chains = 2)
update(jagsModel, n.iter = 5000)
codaSamples = coda.samples(jagsModel,
                           variable.names = c("theta"),
                           n.iter = 5000)
require('ggmcmc')
require('dplyr')
ms = ggs(codaSamples)
ms %>% group_by(Parameter) %>%
  summarise(mean = mean(value),
            HDIlow = HPDinterval(as.mcmc(value))[1],
            HDIhigh = HPDinterval(as.mcmc(value))[2])
## # A tibble: 1 × 4
##   Parameter      mean    HDIlow   HDIhigh
##      <fctr>     <dbl>     <dbl>     <dbl>
## 1     theta 0.6814352 0.4967396 0.8756536
tracePlot = ggs_traceplot(ms)
densPlot = ggs_density(ms) +
  stat_function(fun = function(x) dbeta(x, 15, 7), color = "black")

require('gridExtra')
grid.arrange(tracePlot, densPlot, ncol = 2)
Your turn: what's this model good for?
`obs` is a vector of 100 observations (real numbers):

model{
  mu ~ dunif(-2,2)
  var ~ dunif(0, 100)
  tau = 1/var
  for (i in 1:100){
    obs[i] ~ dnorm(mu,tau)
  }
}
NB: JAGS' dnorm takes the precision \(\tau = 1/\sigma^2\), not the standard deviation \(\sigma\)
model{
  mu ~ dunif(-2,2)
  var ~ dunif(0, 100)
  tau = 1/var
  for (i in 1:100){
    obs[i] ~ dnorm(mu,tau)
  }
}
\[\mu \sim \text{Unif}(-2,2)\] \[\sigma^2 \sim \text{Unif}(0,100)\] \[obs_i \sim \text{Norm}(\mu, \sigma)\]
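One way to check the answer: simulate 100 observations with a known mean and variance and see whether the model recovers them. A sketch using rjags, reusing the model above; the "true" parameter values and chain settings are arbitrary choices.

```r
# sketch: simulate data and fit the mean/variance model above with rjags
library(rjags)
modelString = "
model{
  mu ~ dunif(-2,2)
  var ~ dunif(0, 100)
  tau = 1/var
  for (i in 1:100){
    obs[i] ~ dnorm(mu,tau)
  }
}"
obs = rnorm(100, mean = 0.5, sd = 1.5)       # 'true' mu = 0.5, sigma = 1.5
jagsModel = jags.model(file = textConnection(modelString),
                       data = list(obs = obs), n.chains = 2)
update(jagsModel, n.iter = 5000)
codaSamples = coda.samples(jagsModel,
                           variable.names = c("mu", "var"),
                           n.iter = 5000)
summary(codaSamples)   # posterior means should be near 0.5 and 1.5^2 = 2.25
```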
actively developed by Andrew Gelman, Bob Carpenter and others
uses HMC and NUTS, but can do variational Bayes and MAP too
written in C++
cross-platform
interfaces: command line, Python, Julia, R, Stata, Matlab
require('rjags')
ms = "
model{
  theta ~ dbeta(1,1)
  k ~ dbinom(theta, N)
}"
# prepare data for JAGS
dataList = list(k = 14, N = 20)
# set up and run model
jagsModel = jags.model(
  file = textConnection(ms),
  data = dataList,
  n.chains = 2)
update(jagsModel, n.iter = 5000)
codaSamples = coda.samples(
  jagsModel,
  variable.names = c("theta"),
  n.iter = 5000)
require('rstan')
ms = "
data {
  int<lower=0> N ;
  int<lower=0> k ;
}
parameters {
  real<lower=0,upper=1> theta ;
}
model {
  theta ~ beta(1,1) ;
  k ~ binomial(N, theta) ;
}"
# prepare data for Stan
dataList = list(k = 14, N = 20)
# set up and run model
mDso = stan_model(model_code = ms)
stanFit = sampling(object = mDso,
                   data = dataList,
                   chains = 2,
                   iter = 10000,
                   warmup = 5000)
plot(stanFit)
codaSamples = As.mcmc.list(stanFit)
ms = ggs(stanFit)
ggs_density(ms)
15 people answered 40 exam questions. Here are their scores:
k <- c(21,17,21,18,22,31,31,34,34,35,35,36,39,36,35)
p <- length(k)  # number of people
n <- 40         # number of questions
[from Lee & Wagenmakers (2015), Chapter 6.1]
model{
  for (i in 1:p){
    z[i] ~ dbern(0.5)
  }
  psi = 0.5
  phi ~ dunif(0.5,1)
  for (i in 1:p){
    theta[i] = ifelse(z[i]==0, psi, phi)
    k[i] ~ dbin(theta[i], n)
  }
}
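A sketch of how the samples summarized below might be obtained with rjags, following the same pattern as before; it assumes the model code above is stored in `modelString` and that `k`, `p`, `n` are defined as above (chain settings are arbitrary).

```r
# sketch: run the latent-mixture model with rjags, monitoring phi and z
dataList = list(k = k, p = p, n = n)
jagsModel = jags.model(file = textConnection(modelString),
                       data = dataList, n.chains = 2)
update(jagsModel, n.iter = 5000)
codaSamples = coda.samples(jagsModel,
                           variable.names = c("phi", "z"),
                           n.iter = 5000)
```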
k <- c(21,17,21,18,22,31,31,34,34,35,35,36,39,36,35)
show(summary(codaSamples)[[1]][,1:2])
##            Mean         SD
## phi    0.863443 0.01719438
## z[1]   0.000000 0.00000000
## z[2]   0.000000 0.00000000
## z[3]   0.000000 0.00000000
## z[4]   0.000000 0.00000000
## z[5]   0.000000 0.00000000
## z[6]   0.992200 0.08797689
## z[7]   0.994500 0.07396146
## z[8]   1.000000 0.00000000
## z[9]   0.999900 0.01000000
## z[10]  1.000000 0.00000000
## z[11]  1.000000 0.00000000
## z[12]  1.000000 0.00000000
## z[13]  1.000000 0.00000000
## z[14]  1.000000 0.00000000
## z[15]  1.000000 0.00000000
data {
  int<lower=1> p;
  int<lower=0> k[p];
  int<lower=1> n;
}
transformed data {
  real psi;
  psi <- .5;
}
parameters {
  real<lower=.5,upper=1> phi;
}
transformed parameters {
  vector[2] log_alpha[p];
  for (i in 1:p) {
    log_alpha[i,1] <- log(.5) + binomial_log(k[i], n, phi);
    log_alpha[i,2] <- log(.5) + binomial_log(k[i], n, psi);
  }
}
model {
  for (i in 1:p)
    increment_log_prob(log_sum_exp(log_alpha[i]));
}
generated quantities {
  int<lower=0,upper=1> z[p];
  for (i in 1:p) {
    z[i] <- bernoulli_rng(softmax(log_alpha[i])[1]);
  }
}
Stan cannot sample discrete parameters directly, so the latent indicators \(z_i\) are marginalized out in the model block above. The complete-data log joint is

\[\log P(k_i, n, z_i, \phi) = \log \text{Bern}(z_i \mid 0.5) + \log \text{Binom}(k_i, n, \text{ifelse}(z_i=1, \phi, 0.5))\]
\[ \begin{align*} P(k_i, n, \phi) & = \sum_{z_i=0}^1 \text{Bern}(z_i \mid 0.5) \ \text{Binom}(k_i, n, \text{ifelse}(z_i=1, \phi, 0.5)) \\ & = \text{Bern}(0 \mid 0.5) \ \text{Binom}(k_i, n, 0.5) + \text{Bern}(1 \mid 0.5) \ \text{Binom}(k_i, n, \phi) \\ & = \underbrace{0.5 \ \text{Binom}(k_i, n, 0.5)}_{\alpha_0} + \underbrace{0.5 \ \text{Binom}(k_i, n, \phi)}_{\alpha_1} \\ \log P(k_i, n, \phi) & = \log \sum_{j=0}^1 \exp \log \alpha_j \end{align*}\]
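Correspondingly, a sketch of how the marginalized Stan model might be compiled and run with rstan, mirroring the earlier Stan example; it assumes the Stan code above is stored in `ms` and that `k`, `p`, `n` are defined as before (chain settings are arbitrary).

```r
# sketch: compile and sample from the marginalized latent-mixture model
library(rstan)
dataList = list(k = k, p = p, n = n)
mDso = stan_model(model_code = ms)    # ms holds the Stan code above
stanFit = sampling(object = mDso, data = dataList,
                   chains = 2, iter = 10000, warmup = 5000)
print(stanFit, pars = c("phi", "z"))  # posterior summaries for phi and z
```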
library(rwebppl)
datainput = list(n = 24, k = 7)
modelString = '
var successes = dataFromR["k"][0];
var trials = dataFromR["n"][0];
var model2compute = function(){
  var p = uniform(0,1);
  observe(Binomial({n: trials, p: p}), successes);
  return {bias: p};
}'
wp.rs = webppl(modelString,
               model_var = "model2compute",
               data = datainput,
               data_var = "dataFromR",
               inference_opts = list(method = "MCMC",
                                     samples = 15000,
                                     burn = 5000,
                                     verbose = TRUE),
               output_format = "ggmcmc",
               chains = 2,
               cores = 2)
DEMO
- liberate psychology from expensive, proprietary software (SPSS)
- provide Bayesian statistical methods in an accessible way
- provide a universal platform for publishing statistical analyses with rich UIs
DEMO
require('datasets')
head(cars)
m1 = glm(speed ~ dist, data = cars)
require('rstanarm')
m2 = stan_glm(speed ~ dist, data = cars)
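To compare the two fits, one option is to place the classical point estimates next to the Bayesian posterior summaries; a small sketch using rstanarm's coef and posterior_interval methods:

```r
# sketch: compare classical and Bayesian estimates for the same regression
coef(m1)                               # ML point estimates from glm
coef(m2)                               # posterior medians from stan_glm
posterior_interval(m2, prob = 0.95)    # 95% posterior credible intervals
```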
[check this vignette, and this great tutorial paper for more]