\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

key notions

  • (recap) 3 pillars of Bayesian data analysis:
    • estimation
    • comparison
    • prediction
  • maximum likelihood estimation
  • Akaike's information criterion
    • free parameters
    • model complexity
  • Bayes factors
    • Savage-Dickey method (next time!)

3 pillars of BDA

estimation

 

given model and data, which parameter values should we believe in?

\[\underbrace{P(\theta \mid D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \ \underbrace{P(D \mid \theta)}_{\text{likelihood}}\]
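As a toy illustration (not from the slides; the data of 7 successes in 24 trials are made up), the posterior over a binomial rate \(\theta\) can be computed on a discrete grid:

theta = seq(0.005, 0.995, by = 0.01)        # grid of candidate parameter values
prior = rep(1/length(theta), length(theta)) # flat prior over the grid
lh    = dbinom(7, size = 24, prob = theta)  # likelihood of the hypothetical data
posterior = prior * lh / sum(prior * lh)    # Bayes rule, normalized over the grid
show(theta[which.max(posterior)])           # grid value with highest posterior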

model comparison

which of two models is more likely, given the data?

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

prediction

 

which future observations do we expect (after seeing some data)?

 

prior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]

 

posterior predictive

\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]

 

   

requires sampling distribution (more on this later)

special case: prior/posterior predictive \(p\)-value (model criticism)
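A minimal illustration (not from the slides): for a binomial likelihood with a Beta(1,1) prior, both predictive distributions can be approximated by sampling \(\theta\); the counts (7 successes in 24 trials) are made up:

n.samples = 10000
k.obs = 7 ; n.obs = 24                             # previously observed data D'
# prior predictive: sample theta from the prior, then future data from the likelihood
theta.prior = rbeta(n.samples, 1, 1)
prior.pred  = rbinom(n.samples, size = n.obs, prob = theta.prior)
# posterior predictive: sample theta from the posterior given D', then future data
theta.post  = rbeta(n.samples, 1 + k.obs, 1 + n.obs - k.obs)
post.pred   = rbinom(n.samples, size = n.obs, prob = theta.post)
# approximate P(D = 7) under the prior and the posterior predictive
show(c(prior = mean(prior.pred == 7), posterior = mean(post.pred == 7)))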

preliminaries

summary

| today's focus | Akaike's information criterion | Bayes factor (standard Bayes) |
|---|---|---|
| what counts as the model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |

maximum likelihood

forgetting data

  • subjects were each asked to remember items (Murdock 1961)
  • recall rates y for 100 subjects after time t in seconds
y = c(.94, .77, .40, .26, .24, .16)   # observed recall rates
t = c(  1,   3,   6,   9,  12,  18)   # retention intervals in seconds
obs = y*100                           # number of subjects (out of 100) who recalled

example from Myung (2003), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = ct^{-d}\]

\[\text{where } c,d>0 \]

negative log-likelihood

exponential

nLL.exp <- function(w1,w2) {
  # rule out parameter values outside the allowed range
  if (w1 < 0 | w2 < 0 | 
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*exp(-w2*t)         # predicted recall probabilities
  p[p <= 0.0] = 1.0e-5      # clamp predictions to the open interval (0,1)
  p[p >= 1.0] = 1-1.0e-5
  # negative log-likelihood of the observed counts (binomial, size 100)
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.exp(1,1))
## [1] 1092.446

power

nLL.pow <- function(w1,w2) {
  if (w1 < 0 | w2 < 0 |
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*t^(-w2)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.pow(1,1))
## [1] 142.1096

MLE

require('stats4')
bestExpo = mle(nLL.exp, start = list(w1=1,w2=1))
bestPow  = mle(nLL.pow, start = list(w1=1,w2=1))
MLEstimates = data.frame(model = rep(c("expo", "power"), each = 2),
                         parameter = c("a", "b", "c", "d"),
                         value = c(coef(bestExpo), coef(bestPow)))
knitr::kable(MLEstimates)
| model | parameter | value |
|-------|-----------|-----------|
| expo  | a         | 1.0701173 |
| expo  | b         | 0.1308300 |
| power | c         | 0.9531186 |
| power | d         | 0.4979311 |

best fits visualization
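The plot itself is not reproduced here; a minimal base-R sketch (using the data and mle fits defined above) could look like this:

plot(t, y, pch = 16, xlim = c(0, 20), ylim = c(0, 1),
     xlab = "retention interval t (seconds)", ylab = "recall rate")
curve(coef(bestExpo)[["w1"]] * exp(-coef(bestExpo)[["w2"]] * x),
      from = 1, to = 20, add = TRUE, lty = 1)   # best-fitting exponential curve
curve(coef(bestPow)[["w1"]] * x^(-coef(bestPow)[["w2"]]),
      from = 1, to = 20, add = TRUE, lty = 2)   # best-fitting power curve
legend("topright", legend = c("exponential", "power"), lty = 1:2)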

model comparison

which model is better?

# model predictions at the maximum likelihood estimates
predExp = coef(bestExpo)[["w1"]] * exp(-coef(bestExpo)[["w2"]] * t)
predPow = coef(bestPow)[["w1"]] * t^(-coef(bestPow)[["w2"]])
modelStats = data.frame(model = c("expo", "power"),
                        logLike = round(c(logLik(bestExpo), logLik(bestPow)),3),
                        pData = exp(c(logLik(bestExpo), logLik(bestPow))),
                        r = round(c(cor(predExp, y), cor(predPow,y)),3),
                        LSE = round(c( sum((predExp-y)^2), sum((predPow-y)^2)),3))
modelStats
##   model logLike        pData     r   LSE
## 1  expo -18.666 7.820887e-09 0.982 0.019
## 2 power -26.726 2.471738e-12 0.948 0.057

Akaike information criterion

Akaike information criterion

motivation

  • the higher \(P(D \mid \hat{\theta})\), the better the model
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\) is the maximum likelihood estimate
  • the more free parameters a model has, the worse it is
    • principle of parsimony (Ockham's razor)
  • information-theoretic notion:
    • the amount of information lost when we assume the data were generated with \(\hat{\theta}\)

definition

Let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]

The smaller the AIC, the better the model.
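As a cross-check (a minimal sketch using the mle fits from above; both forgetting models have \(k = 2\) free parameters), the AIC values on the next slide can be computed directly from this definition:

k = 2                                         # free parameters in each forgetting model
show(2*k - 2*as.numeric(logLik(bestExpo)))    # AIC of the exponential model
show(2*k - 2*as.numeric(logLik(bestPow)))     # AIC of the power model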

model comparison by AIC

##   model logLike        pData     r   LSE      AIC
## 1  expo -18.666 7.820887e-09 0.982 0.019 41.33294
## 2 power -26.726 2.471738e-12 0.948 0.057 57.45220
show(AIC(bestExpo, bestPow))
##          df      AIC
## bestExpo  2 41.33294
## bestPow   2 57.45220

concluding remarks

  • given more and more data, repeated model selection by AIC is not guaranteed to end up with the true model
  • the "model" that AIC compares is just the likelihood function; there is no prior
  • penalizing by the raw number of parameters, as AIC does, ignores how effective each parameter actually is
    • think: individual-level parameters harnessed by a hierarchical population-level prior
  • there are other information criteria that overcome some of these problems:
    • Bayesian information criterion
    • deviance information criterion

Bayes factors

Bayes factors

  • take two models:
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we'd want to know the absolute probability of \(M_i\) given the data
    • but then we'd need to know the set of all models (for normalization)
  • alternatively, we take odds of models given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.
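A made-up numerical illustration: with equal prior odds and a Bayes factor of 10 in favor of \(M_1\),

\[ \frac{P(M_1 \mid D)}{P(M_2 \mid D)} = 10 \cdot 1 = 10, \qquad \text{so} \quad P(M_1 \mid D) = \frac{10}{11} \approx 0.91 \]

(assuming \(M_1\) and \(M_2\) are the only models under consideration).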

marginal likelihood

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

marginal likelihood of data under model \(M_i\)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

  • we marginalize out parameters \(\theta_i\)
  • this is a function of the prior and the likelihood

how to interpret Bayes factors

| BF(M1 > M2) | interpretation |
|-------------|----------------|
| 1 | irrelevant data |
| 1 - 3 | hardly worth ink or breath |
| 3 - 6 | anecdotal |
| 6 - 10 | now we're talking: substantial |
| 10 - 30 | strong |
| 30 - 100 | very strong |
| 100 + | decisive (bye, bye \(M_2\)!) |

how to calculate Bayes factors

  1. get each model's marginal likelihood
    • grid approximation (today)
    • brute force / clever math (next time)
    • by Monte Carlo sampling (next time)
  2. get Bayes factor directly
    • Savage-Dickey method (today)
    • transdimensional MCMC (next time)

grid approximation

grid approximation

  • consider a discrete grid of values for \(\theta\)
  • approximate the marginal likelihood (evidence) by a sum over these grid values
  • works well for low-dimensional \(\theta\)

example

priorExpo = function(a, b){
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExpo = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100)) # no log!
}
priorPow = function(c, d){
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100)) # no log!
}

example

the flat priors are identical for both models, so they cancel out in the Bayes factor; summing the likelihood over the grid is then sufficient

grid = seq(0.005, 1.495, by = 0.01)
margLikeExp = sum(sapply(grid, function(a) 
      sum(sapply (grid, function(b) lhExpo(a,b)))) )
margLikePow = sum(sapply(grid, function(a) 
      sum(sapply (grid, function(b) lhPow(a,b)))) )
show(as.numeric(margLikeExp / margLikePow))
## [1] 1221.392

overwhelming evidence in favor of the exponential model

Savage-Dickey method

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]

proof

  • \(M_0\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi = \phi_0\)
  • \(M_1\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi\) free to vary
  • crucial assumption: \(\lim_{\phi \rightarrow \phi_0} P(\psi \mid \phi, M_1) = P(\psi \mid M_0)\)
  • rewrite marginal likelihood under \(M_0\):

\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi \ \ \ \ \ \ \text{(nesting + crucial assumption)} \\ & = P(D \mid \phi = \phi_0, M_1) \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \ \ \ \ \ \ \text{(by Bayes rule)} \end{align*} \]

Dividing both sides by \(P(D \mid M_1)\) gives the Savage-Dickey ratio above.
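A minimal numerical sketch (not from the slides; beta-binomial assumptions with made-up data): \(M_1\) puts a Beta(1,1) prior on a binomial rate \(\theta\), and \(M_0\) fixes \(\theta = 0.5\); by conjugacy, the Savage-Dickey ratio is just a ratio of beta densities at 0.5:

k = 7 ; n = 24                                    # hypothetical data: k successes in n trials
# posterior density at theta = 0.5 divided by prior density at theta = 0.5
BF.null = dbeta(0.5, 1 + k, 1 + n - k) / dbeta(0.5, 1, 1)
show(BF.null)                                     # BF(M0 > M1)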

fini

summary

| today's focus | Akaike's information criterion | Bayes factor (standard Bayes) |
|---|---|---|
| what counts as the model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |

outlook

 

Friday

  • boot camp on model comparison & Savage-Dickey

 

Tuesday

  • more on Bayes factor computation & model comparison
    • Monte Carlo simulation to approximate marginal likelihood
    • transdimensional MCMC

to prevent boredom

 

obligatory

  • read Lee & Wagenmakers chapter 7

 

optional

  • skim examples from Lee & Wagenmakers chapters 8 & 9