\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

recap

(Bayesian) model comparison

which of two models is more likely, given the data?

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

summary

|                                                        | standard                                  | Bayes                                                  |
|--------------------------------------------------------|-------------------------------------------|--------------------------------------------------------|
| today's focus                                          | Akaike's information criterion            | Bayes factor                                           |
| what counts as model to be compared?                   | likelihood function \(P(D \mid \theta)\)  | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\)  |
| from which point of view do we compare models?         | ex post: after seeing the data            | ex ante: before seeing the data                        |
| how to penalize model complexity?                      | count number of free parameters           | implicitly weigh how effective a parameter is          |
| are we guaranteed to select the true model in the end? | no                                        | yes                                                    |
| hard to compute?                                       | relatively easy                           | relatively hard                                        |

roadmap for today

  • Bayes factor computation
    • exact computation
    • Savage-Dickey method (recap)
    • grid approximation
    • Monte Carlo simulation
      • naive
      • importance sampling
    • transdimensional MCMC

Savage-Dickey method

example: model comparison for coin flip data

  • observed: \(k = 7\) out of \(N = 24\) coin flips were successes
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

   

exact calculation

  • data: \(k = 7\) successes in \(N = 24\) trials

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 \binom{N}{k} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{N}}{\binom{N}{k} \int_0^1 \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{N}}{BetaFunction(k+1, N-k+1)} \approx 0.516 \end{align*} \]

exact calculation (cont.)

unraveling beta function

\[ \begin{align*} BetaFunction(k+1, N-k+1) & = \frac{\Gamma(k+1) \Gamma(N-k+1)}{\Gamma(N+2)} \\ & = \frac{(k)! (N-k)!}{(N+1)!} \\ & = \frac{(k)! (N-k)!}{(N)!} \frac{1}{N+1} \\ & = \frac{1}{\binom{N}{k}} \frac{1}{N+1} \\ \end{align*} \]

so:

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{0.5^{N}}{BetaFunction(k+1, N-k+1)} \\ & = \binom{N}{k} 0.5^{N} (N + 1) \approx 0.516 \end{align*} \]
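a quick numerical check of this closed form in R (using \(k = 7\) and \(N = 24\) from above):

k = 7; N = 24
0.5^N / beta(k + 1, N - k + 1)   # BetaFunction form of the Bayes factor
choose(N, k) * 0.5^N * (N + 1)   # final closed form; both give ~ 0.516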

properly nested models

  • suppose that there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]

proof

  • \(M_0\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi = \phi_0\)
  • \(M_1\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi\) free to vary
  • crucial assumption: \(\lim_{\phi \rightarrow \phi_0} P(\psi \mid \phi, M_1) = P(\psi \mid M_0)\)
  • rewrite marginal likelihood under \(M_0\):

\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi\\ & = P(D \mid \phi = \phi_0, M_1) \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \ \ \ \ \ \ \text{(by Bayes rule)} \end{align*} \]

dividing both sides by \(P(D \mid M_1)\) yields the Savage-Dickey ratio above
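for the coin-flip example above this means: the Bayes factor is the posterior density of \(\theta = 0.5\) under \(M_1\) (a Beta(\(k+1, N-k+1\)), given the flat prior) divided by the prior density at \(\theta = 0.5\); a minimal check in R:

k = 7; N = 24
dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)   # ~ 0.516, matching the exact BF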

memory models

forgetting data

  • subjects were asked to remember items (Murdoch 1961)
  • recall rates y for 100 trials after time t in seconds
y = c(.94, .77, .40, .26, .24, .16)  # observed recall rates
t = c(  1,   3,   6,   9,  12,  18)  # retention interval in seconds
obs = y*100                          # number of successful recalls out of 100 trials

example from Myung (2007), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = c \ t^{-d}\]

\[\text{where } c,d>0 \]
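to get a feel for the two functional forms, here are their predicted recall probabilities at the observed retention intervals t, for hand-picked (purely illustrative, not estimated) parameter values:

round(rbind(exponential = 1.0 * exp(-0.15 * t),   # a = 1.0, b = 0.15 (illustrative)
            power       = 0.95 * t^(-0.5)), 2)    # c = 0.95, d = 0.5 (illustrative)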

model specification (priors & likelihood function)

exponential

priorExp = function(a, b){
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExp = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}

power

priorPow = function(c, d){
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}
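as a quick sanity check, these functions can be evaluated at hand-picked parameter values (purely illustrative, not estimates):

priorExp(0.9, 0.1)   # joint prior density at (a, b) = (0.9, 0.1)
lhExp(0.9, 0.1)      # likelihood of the observed counts at these values
lhPow(0.9, 0.3)      # same for the power model at (c, d) = (0.9, 0.3)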

grid approximation

grid approximation

  • consider a grid of discrete values for \(\theta\)
  • approximate the evidence by summing over these grid values
  • works well for low-dimensional \(\theta\)

example

lhExp = Vectorize(lhExp)
lhPow = Vectorize(lhPow)

stepsize = 0.01
evidence = expand.grid(x = seq(0.005, 1.495, by = stepsize),
                       y = seq(0.005, 1.495, by = stepsize)) %>% 
  mutate(lhExp = lhExp(x,y), priExp = 1 / length(x),  # uniform priors!
         lhPow = lhPow(x,y), priPow = 1 / length(x))

paste0("BF in favor of exponential model: ", 
            with(evidence, sum(priExp*lhExp)/ sum(priPow*lhPow)) %>% round(2))
## [1] "BF in favor of exponential model: 1221.39"

overwhelming evidence in favor of the exponential model

naive Monte Carlo simulation

recap: why simulate?

generally:

\[\int f(\theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} f(\theta_i)\]

in particular:

\[P(D) = \int P(D \mid \theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} P(D \mid \theta_i)\]

example: naive sampling

nSamples = 200000
a = runif(nSamples, 0, 1.5)
b = runif(nSamples, 0, 1.5)
lhExpVec = lhExp(a,b)
lhPowVec = lhPow(a,b)
paste0("BF in favor of exponential model: ", 
            with(evidence, sum(lhExpVec)/ sum(lhPowVec)) %>% round(2))
## [1] "BF in favor of exponential model: 1182.32"

time course of estimate

BFVec = sapply(seq(10000,nSamples, by = 200), 
     function(i) sum(lhExpVec[1:i]) / sum(lhPowVec[1:i]))
ggplot(data.frame(i = seq(10000,nSamples, by = 200), BF = BFVec), aes(x=i, y=BF)) +
  geom_line() + geom_hline(yintercept = 1221, color = "firebrick") + xlab("number of samples")

importance sampling

importance sampling

take an arbitrary function \(h(\theta)\) such that \(\int h(\theta) \text{d}\theta = 1\)

\[ \begin{align*} \frac{1}{P(D)} & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)}\\ & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \int h(\theta) \text{d}\theta \\ & = \int \frac{h(\theta) P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \text{d}\theta \\ & \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta \mid D)} \frac{h(\theta_i)}{P(D \mid \theta_i) P(\theta_i)} \end{align*} \]

choose an \(h(\theta)\) that resembles the posterior: the closer \(h\) is to the posterior, the lower the variance of the estimate
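a schematic sketch of this estimator for one model (all names below are placeholders: postA and postB are vectors of posterior samples, hDens is some density over the parameters that resembles the posterior, lhFun and priorFun are the model's likelihood and prior functions from above):

# importance-sampling estimate of 1 / P(D | M) for one model
inverseEvidence = function(postA, postB, hDens, lhFun, priorFun) {
  mean(hDens(postA, postB) / (lhFun(postA, postB) * priorFun(postA, postB)))
}
# e.g. for the exponential model: 1 / inverseEvidence(aPost, bPost, hDens, lhExp, priorExp)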


posterior over parameters for exponential model

modelString = "
model{
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  for (i in 1: 6){
    pT[i] = a*exp(-t[i]*b)
    p[i] = min(max( pT[i], 0.00001), 0.99999)
    obs[i] ~ dbinom(p[i], 100)
  }
}"
# prepare data for JAGS (jags.model & coda.samples are from 'rjags'; ggs is from 'ggmcmc')
dataList = list(obs = obs, t = t)
# set up and run model
jagsModel = jags.model(file = textConnection(modelString), 
                       data = dataList,
                       n.chains = 2)
update(jagsModel, n.iter = 25000)
codaSamples = coda.samples(jagsModel, 
                           variable.names = c("a", "b"),
                           n.iter = 5000)
ms = ggs(codaSamples)

parametric approximation for posteriors

getGammaApprox = function(samples){
  # return shape and rate of a gamma distribution whose mode and standard
  # deviation equal the sample mean and sd of the posterior samples
  s = sd(samples)
  m = mean(samples)
  ra = ( m + sqrt( m^2 + 4*s^2 ) ) / ( 2 * s^2 )  # rate
  sh = 1 + m * ra                                 # shape
  return(c(shape = sh, rate = ra))
}
wideSamples = dcast(ms, Iteration + Chain ~ Parameter)
paramA = getGammaApprox(filter(ms, Parameter == "a")$value)
paramB = getGammaApprox(filter(ms, Parameter == "b")$value)
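a rough check of the approximation: the mean of the fitted gamma (shape/rate) should be close to the sample mean of the posterior draws for a

aSamples = filter(ms, Parameter == "a")$value
c(sampleMean = mean(aSamples),
  gammaMean  = unname(paramA["shape"] / paramA["rate"]))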

parameter estimation for power model

summary of estimated parameters

##   parameter     shape      rate
## 1         a 1121.0532 1056.1074
## 2         b  202.4942 1549.5401
## 3         c 2331.6908 2462.5925
## 4         d  232.5322  465.8463

implement these posterior approximations

homework!

transdimensional MCMC

idea

make model index a parameter


JAGS model

model{
  m ~ dbern(0.5)
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  c ~ dunif(0,1.5)
  d ~ dunif(0,1.5)
  for (i in 1: 6){
    pT[i] = ifelse(m == 0, a*exp(-t[i]*b), c*t[i]^(-d))
    p[i] = min(max( pT[i], 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)
  }
}
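a sketch of how one might run this model and read the Bayes factor off the posterior of the model index (assuming the model code above is stored in a string modelStringTD; since \(P(m=0) = P(m=1) = 0.5\) a priori, the posterior odds of \(m\) equal the Bayes factor):

jagsModelTD = jags.model(file = textConnection(modelStringTD),
                         data = dataList, n.chains = 2)
update(jagsModelTD, n.iter = 25000)
codaTD = coda.samples(jagsModelTD, variable.names = c("m"), n.iter = 50000)
mSamples = as.matrix(codaTD)[, "m"]
sum(mSamples == 0) / sum(mSamples == 1)   # BF in favor of the exponential model (m = 0)

in practice the chain rarely visits the weaker model, so this raw estimate is unstable; this is what the trickery below addresses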

posteriors

trickery

tweaking prior model odds & using pseudo-priors

model{
  mT ~ dbern(0.9995)
  m = mT + 1
  a[1] ~ dunif(0,1.5)
  b[1] ~ dunif(0,1.5)
  c[2] ~ dunif(0,1.5)
  d[2] ~ dunif(0,1.5)
  a[2] ~ dgamma(1124.84336, 1058.83299)
  b[2] ~ dgamma(199.82804, 1527.78959)
  c[1] ~ dgamma(61.88311, 29.28403)
  d[1] ~ dgamma(115.44879, 126.33878)
for (i in 1: 6){
    pExp[i] = a[m]*exp(-t[i]*b[m])
    pPow[i] = c[m]*t[i]^(-d[m]) 
    pT[i] = ifelse(m == 1, pExp[i], pPow[i])
    p[i] = min(max( pT[i], 0.00001), 0.99999)
    obs[i] ~ dbinom(p[i], 100)
  }  
}
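a sketch of how the Bayes factor is recovered from this version, undoing the tweaked prior odds (mSamplesPP is a placeholder for the posterior samples of m from the model above):

posteriorOdds = sum(mSamplesPP == 1) / sum(mSamplesPP == 2)  # exponential (m = 1) vs. power (m = 2)
priorOdds     = 0.0005 / 0.9995                              # P(m = 1) / P(m = 2) from dbern(0.9995)
posteriorOdds / priorOdds                                    # BF in favor of the exponential model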

posteriors

## [1] "BF in favor of Exponential Model (using transdimensional MCMC): 948.77"

fini

summary

|                                                        | standard                                  | Bayes                                                  |
|--------------------------------------------------------|-------------------------------------------|--------------------------------------------------------|
| today's focus                                          | Akaike's information criterion            | Bayes factor                                           |
| what counts as model to be compared?                   | likelihood function \(P(D \mid \theta)\)  | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\)  |
| from which point of view do we compare models?         | ex post: after seeing the data            | ex ante: before seeing the data                        |
| how to penalize model complexity?                      | count number of free parameters           | implicitly weigh how effective a parameter is          |
| are we guaranteed to select the true model in the end? | no                                        | yes                                                    |
| hard to compute?                                       | relatively easy                           | relatively hard                                        |

outlook

 

Friday

  • philosophy of Bayes Factors, some more tricks and applications

 

Tuesday

  • no class!