\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

key notions

  • (recap) 3 pillars of Bayesian data analysis:
    • estimation
    • comparison
    • prediction
  • maximum likelihood estimation
  • Akaike's information criterion
    • free parameters
    • model complexity
  • Bayes factors
    • Savage-Dickey method (next time!)

3 pillars of BDA

estimation

 

given model and data, which parameter values should we believe in?

\[\underbrace{P(\theta \mid D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \ \underbrace{P(D \mid \theta)}_{\text{likelihood}}\]
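As a toy illustration (not from the slides; the data of 7 successes in 24 trials are made up), the posterior over a binomial rate \(\theta\) can be computed on a discrete grid:

theta = seq(0.005, 0.995, by = 0.01)        # grid of candidate parameter values
prior = rep(1/length(theta), length(theta)) # flat prior over the grid
lh    = dbinom(7, size = 24, prob = theta)  # likelihood of the hypothetical data
posterior = prior * lh / sum(prior * lh)    # Bayes rule, normalized over the grid
show(theta[which.max(posterior)])           # grid value with highest posterior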

model comparison

which of two models is more likely, given the data?

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

prediction

 

which future observations do we expect (after seeing some data)?

 

prior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]

 

posterior predictive

\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]

 

   

requires sampling distribution (more on this later)

special case: prior/posterior predictive \(p\)-value (model criticism)
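A minimal illustration (not from the slides): for a binomial likelihood with a Beta(1,1) prior, both predictive distributions can be approximated by sampling \(\theta\); the counts (7 successes in 24 trials) are made up:

n.samples = 10000
k.obs = 7 ; n.obs = 24                             # previously observed data D'
# prior predictive: sample theta from the prior, then future data from the likelihood
theta.prior = rbeta(n.samples, 1, 1)
prior.pred  = rbinom(n.samples, size = n.obs, prob = theta.prior)
# posterior predictive: sample theta from the posterior given D', then future data
theta.post  = rbeta(n.samples, 1 + k.obs, 1 + n.obs - k.obs)
post.pred   = rbinom(n.samples, size = n.obs, prob = theta.post)
# approximate P(D = 7) under the prior and the posterior predictive
show(c(prior = mean(prior.pred == 7), posterior = mean(post.pred == 7)))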

preliminaries

summary

| today's focus | Akaike's information criterion | Bayes factor (standard Bayes) |
|---|---|---|
| what counts as the model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |

maximum likelihood

forgetting data

  • subjects were each asked to remember items (Murdock 1961)
  • recall rates y for 100 subjects after time t in seconds
y = c(.94, .77, .40, .26, .24, .16)   # observed recall rates
t = c(  1,   3,   6,   9,  12,  18)   # retention intervals in seconds
obs = y*100                           # number of subjects (out of 100) who recalled

example from Myung (2003), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = ct^{-d}\]

\[\text{where } c,d>0 \]

negative log-likelihood

exponential

nLL.exp <- function(w1,w2) {
  # rule out parameter values outside the allowed range
  if (w1 < 0 | w2 < 0 | 
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*exp(-w2*t)         # predicted recall probabilities
  p[p <= 0.0] = 1.0e-5      # clamp predictions to the open interval (0,1)
  p[p >= 1.0] = 1-1.0e-5
  # negative log-likelihood of the observed counts (binomial, size 100)
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.exp(1,1))
## [1] 1092.446

power

nLL.pow <- function(w1,w2) {
  if (w1 < 0 | w2 < 0 |
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*t^(-w2)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.pow(1,1))
## [1] 142.1096

MLE

require('stats4')
bestExpo = mle(nLL.exp, start = list(w1=1,w2=1))
bestPow  = mle(nLL.pow, start = list(w1=1,w2=1))
MLEstimates = data.frame(model = rep(c("expo", "power"), each = 2),
                         parameter = c("a", "b", "c", "d"),
                         value = c(coef(bestExpo), coef(bestPow)))
knitr::kable(MLEstimates)
| model | parameter | value |
|-------|-----------|-----------|
| expo  | a         | 1.0701173 |
| expo  | b         | 0.1308300 |
| power | c         | 0.9531186 |
| power | d         | 0.4979311 |

best fits visualization
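The plot itself is not reproduced here; a minimal base-R sketch (using the data and mle fits defined above) could look like this:

plot(t, y, pch = 16, xlim = c(0, 20), ylim = c(0, 1),
     xlab = "retention interval t (seconds)", ylab = "recall rate")
curve(coef(bestExpo)[["w1"]] * exp(-coef(bestExpo)[["w2"]] * x),
      from = 1, to = 20, add = TRUE, lty = 1)   # best-fitting exponential curve
curve(coef(bestPow)[["w1"]] * x^(-coef(bestPow)[["w2"]]),
      from = 1, to = 20, add = TRUE, lty = 2)   # best-fitting power curve
legend("topright", legend = c("exponential", "power"), lty = 1:2)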

model comparison

which model is better?

# model predictions at the maximum likelihood estimates
predExp = coef(bestExpo)[["w1"]] * exp(-coef(bestExpo)[["w2"]] * t)
predPow = coef(bestPow)[["w1"]] * t^(-coef(bestPow)[["w2"]])
modelStats = data.frame(model = c("expo", "power"),
                        logLike = round(c(logLik(bestExpo), logLik(bestPow)),3),
                        pData = exp(c(logLik(bestExpo), logLik(bestPow))),
                        r = round(c(cor(predExp, y), cor(predPow,y)),3),
                        LSE = round(c( sum((predExp-y)^2), sum((predPow-y)^2)),3))
modelStats
##   model logLike        pData     r   LSE
## 1  expo -18.666 7.820887e-09 0.982 0.019
## 2 power -26.726 2.471738e-12 0.948 0.057

Akaike information criterion

Akaike information criterion

motivation

  • the higher \(P(D \mid \hat{\theta})\), the better the model
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\) is the maximum likelihood estimate
  • the more free parameters a model has, the worse it is
    • principle of parsimony (Ockham's razor)
  • information-theoretic notion:
    • the amount of information lost when we assume the data were generated with \(\hat{\theta}\)

definition

Let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]

The smaller the AIC, the better the model.
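As a cross-check (a minimal sketch using the mle fits from above; both forgetting models have \(k = 2\) free parameters), the AIC values on the next slide can be computed directly from this definition:

k = 2                                         # free parameters in each forgetting model
show(2*k - 2*as.numeric(logLik(bestExpo)))    # AIC of the exponential model
show(2*k - 2*as.numeric(logLik(bestPow)))     # AIC of the power model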

model comparison by AIC

##   model logLike        pData     r   LSE      AIC
## 1  expo -18.666 7.820887e-09 0.982 0.019 41.33294
## 2 power -26.726 2.471738e-12 0.948 0.057 57.45220
show(AIC(bestExpo, bestPow))
##          df      AIC
## bestExpo  2 41.33294
## bestPow   2 57.45220

concluding remarks

  • given more and more data, repeated model selection by AIC is not guaranteed to end up with the true model
  • the "model" that AIC compares is just the likelihood function; there is no prior
  • penalizing by the raw number of parameters, as AIC does, ignores how effective each parameter actually is
    • think: individual-level parameters harnessed by a hierarchical population-level prior
  • there are other information criteria that overcome some of these problems:
    • Bayesian information criterion
    • deviance information criterion

Bayes factors

Bayes factors

  • take two models:
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we'd want to know the absolute probability of \(M_i\) given the data
    • but then we'd need to know the set of all models (for normalization)
  • alternatively, we take odds of models given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.
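A made-up numerical illustration: with equal prior odds and a Bayes factor of 10 in favor of \(M_1\),

\[ \frac{P(M_1 \mid D)}{P(M_2 \mid D)} = 10 \cdot 1 = 10, \qquad \text{so} \quad P(M_1 \mid D) = \frac{10}{11} \approx 0.91 \]

(assuming \(M_1\) and \(M_2\) are the only models under consideration).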

marginal likelihood

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

marginal likelihood of data under model \(M_i\)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

  • we marginalize out parameters \(\theta_i\)
  • this is a function of the prior and the likelihood

how to interpret Bayes factors

| BF(M1 > M2) | interpretation |
|-------------|----------------|
| 1 | irrelevant data |
| 1 - 3 | hardly worth ink or breath |
| 3 - 6 | anecdotal |
| 6 - 10 | now we're talking: substantial |
| 10 - 30 | strong |
| 30 - 100 | very strong |
| 100 + | decisive (bye, bye \(M_2\)!) |

how to calculate Bayes factors

  1. get each model's marginal likelihood
    • grid approximation (today)
    • brute force / clever math (next time)
    • by Monte Carlo sampling (next time)
  2. get Bayes factor directly
    • Savage-Dickey method (today)
    • transdimensional MCMC (next time)

grid approximation

grid approximation

  • consider a discrete grid of values for \(\theta\)
  • approximate the marginal likelihood (evidence) by a sum over these grid values
  • works well for low-dimensional \(\theta\)

example

priorExpo = function(a, b){
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExpo = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100)) # no log!
}
priorPow = function(c, d){
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100)) # no log!
}

example

the flat priors are identical for both models, so they cancel out in the Bayes factor; summing the likelihood over the grid is then sufficient

grid = seq(0.005, 1.495, by = 0.01)
margLikeExp = sum(sapply(grid, function(a) 
      sum(sapply (grid, function(b) lhExpo(a,b)))) )
margLikePow = sum(sapply(grid, function(a) 
      sum(sapply (grid, function(b) lhPow(a,b)))) )
show(as.numeric(margLikeExp / margLikePow))
## [1] 1221.392

overwhelming evidence in favor of the exponential model

Savage-Dickey method

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]

proof

  • \(M_0\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi = \phi_0\)
  • \(M_1\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi\) free to vary
  • crucial assumption: \(\lim_{\phi \rightarrow \phi_0} P(\psi \mid \phi, M_1) = P(\psi \mid M_0)\)
  • rewrite marginal likelihood under \(M_0\):

\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi \ \ \ \ \ \ \text{(nesting + crucial assumption)} \\ & = P(D \mid \phi = \phi_0, M_1) \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \ \ \ \ \ \ \text{(by Bayes rule)} \end{align*} \]

Dividing both sides by \(P(D \mid M_1)\) gives the Savage-Dickey ratio above.
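A minimal numerical sketch (not from the slides; beta-binomial assumptions with made-up data): \(M_1\) puts a Beta(1,1) prior on a binomial rate \(\theta\), and \(M_0\) fixes \(\theta = 0.5\); by conjugacy, the Savage-Dickey ratio is just a ratio of beta densities at 0.5:

k = 7 ; n = 24                                    # hypothetical data: k successes in n trials
# posterior density at theta = 0.5 divided by prior density at theta = 0.5
BF.null = dbeta(0.5, 1 + k, 1 + n - k) / dbeta(0.5, 1, 1)
show(BF.null)                                     # BF(M0 > M1)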

fini

summary

| today's focus | Akaike's information criterion | Bayes factor (standard Bayes) |
|---|---|---|
| what counts as the model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |

outlook

 

Friday

  • boot camp on model comparison & Savage-Dickey

 

Tuesday

  • more on Bayes factor computation & model comparison
    • Monte Carlo simulation to approximate marginal likelihood
    • transdimensional MCMC

to prevent boredom

 

obligatory

  • read Lee & Wagenmakers chapter 7

 

optional

  • skim examples from Lee & Wagenmakers chapters 8 & 9