Model Comparison

Michael Franke


roadmap

  • (recap) 3 pillars of Bayesian data analysis
    • estimation
    • comparison
    • prediction
  • Akaike’s information criterion
    • maximum likelihood estimation
    • free parameters
    • (discounting) model complexity
  • Bayes factors
    • grid approximation
    • naive Monte Carlo

recap: pillars of BDA

parameter estimation

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
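As a minimal illustration (with hypothetical coin-flip data, not part of the case study below), the posterior over a discretized parameter grid is just the normalized product of prior and likelihood:

# minimal sketch: grid-based parameter estimation for a coin's bias
theta      = seq(0.005, 0.995, by = 0.01)         # discretized parameter values
prior      = rep(1/length(theta), length(theta))  # uniform prior over the grid
likelihood = dbinom(7, size = 10, prob = theta)   # e.g., 7 heads in 10 flips
posterior  = prior * likelihood / sum(prior * likelihood)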

model comparison

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]


prior & posterior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \] \[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]

outlook

standard approach

  • Akaike information criterion
  • likelihood function \(P(D\mid \theta)\)
  • maximum likelihood estimate \[\hat{\theta} \in \arg\max_\theta P(D\mid \theta)\]
  • ex post: model uses data
  • counts number of free parameters
  • relatively easy to compute

Bayes

  • Bayes factors
  • likelihood function \(P(D\mid \theta)\) & prior \(P(\theta)\)
  • marginalized (prior) likelihood \[\int P(D \mid \theta) \ P(\theta) \text{d}\theta\]
  • ex ante: model doesn’t use data
  • implicitly weighs in effectiveness of parameters
  • relatively hard to compute

case study forgetting data

  • subjects were each asked to remember items (Murdock 1961)
  • recall rates y for 100 subjects after retention time t in seconds
y = c(.94, .77, .40, .26, .24, .16)  # observed recall rates
t = c(  1,   3,   6,   9,  12,  18)  # retention interval in seconds
obs = y*100                          # number of correct recalls out of 100 subjects

example from Myung (2003), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = ct^{-d}\]

\[\text{where } c,d>0 \]

negative log-likelihood

exponential

nLL.exp <- function(w1,w2) {
  # rule out parameter values outside the search box
  if (w1 < 0 | w2 < 0 | 
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*exp(-w2*t)       # predicted recall probabilities
  p[p <= 0.0] = 1.0e-5    # clamp into (0,1) for dbinom
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.exp(1,1))
## [1] 1092.446

power

nLL.pow <- function(w1,w2) {
  # same structure as nLL.exp, with the power-law prediction
  if (w1 < 0 | w2 < 0 |
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*t^(-w2)          # predicted recall probabilities
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.pow(1,1))
## [1] 142.1096

MLE

library(stats4)  # provides mle()
bestExpo = mle(nLL.exp, start = list(w1=1,w2=1))
bestPow  = mle(nLL.pow, start = list(w1=1,w2=1))
MLEstimates = data.frame(model = rep(c("expo", "power"), each = 2),
                         parameter = c("a", "b", "c", "d"),
                         value = c(coef(bestExpo), coef(bestPow)))
knitr::kable(MLEstimates)
| model | parameter |     value |
|:------|:----------|----------:|
| expo  | a         | 1.0701173 |
| expo  | b         | 0.1308300 |
| power | c         | 0.9531186 |
| power | d         | 0.4979311 |

best fits visualization

model comparison

which model is better?
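The code below uses curve functions expo and power and the fitted coefficients a, b, c, d, which are not shown on these slides; a minimal sketch of what they presumably look like:

# presumed helpers (hypothetical; not shown in the original slides)
expo  = function(t, a, b) a * exp(-b * t)  # exponential forgetting curve
power = function(t, c, d) c * t^(-d)       # power forgetting curve
# extract the MLE coefficients fitted above
a = coef(bestExpo)[["w1"]]; b = coef(bestExpo)[["w2"]]
c = coef(bestPow)[["w1"]];  d = coef(bestPow)[["w2"]]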

predExp = expo(t,a,b)
predPow = power(t,c,d)
modelStats = data.frame(model = c("expo", "power"),
                        logLike = round(c(logLik(bestExpo), logLik(bestPow)),3),
                        pData = exp(c(logLik(bestExpo), logLik(bestPow))),
                        r = round(c(cor(predExp, y), cor(predPow,y)),3),
                        LSE = round(c( sum((predExp-y)^2), sum((predPow-y)^2)),3))
modelStats
##   model logLike        pData     r   LSE
## 1  expo -18.666 7.820887e-09 0.982 0.019
## 2 power -26.726 2.471738e-12 0.948 0.057

Akaike information criterion

motivation

  • the higher \(P(D \mid \hat{\theta})\), the better the model
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\) is the maximum likelihood estimate
  • the more free parameters a model has, the worse it is
    • principle of parsimony (Ockham’s razor)
  • information-theoretic notion:
    • relative amount of information lost when assuming the data was generated with \(\hat{\theta}\)

definition

Let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]

The smaller the AIC, the better the model.
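To connect the definition to the fitted models: both have \(k = 2\) free parameters, so AIC can be computed by hand from the maximized log-likelihoods:

# AIC by hand: 2k - 2 ln P(D | theta-hat), with k = 2 for both models
2*2 - 2*as.numeric(logLik(bestExpo))  # 41.33294
2*2 - 2*as.numeric(logLik(bestPow))   # 57.45220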

model comparison by AIC

##   model logLike        pData     r   LSE      AIC
## 1  expo -18.666 7.820887e-09 0.982 0.019 41.33294
## 2 power -26.726 2.471738e-12 0.948 0.057 57.45220
show(AIC(bestExpo, bestPow))
##          df      AIC
## bestExpo  2 41.33294
## bestPow   2 57.45220

remarks on AIC

  • given more and more data, repeated model selection by AIC is not guaranteed to converge on the true model
  • the “model” that AIC evaluates is just the likelihood function; there is no prior
  • counting the number of free parameters, as AIC does, ignores how effective each parameter actually is
    • think: individual-level parameters harnessed by a hierarchical population-level prior
  • other information criteria overcome some of these problems:
    • Bayesian information criterion (BIC)
    • deviance information criterion (DIC)
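For instance, the BIC replaces AIC’s constant penalty per parameter with one that grows with the number of observations \(n\):

\[\text{BIC}(M, D) = k \ln n - 2 \ln P(D \mid \hat{\theta})\]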

Bayes factors

  • take two models:
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we’d want to know the absolute probability of \(M_i\) given the data
    • but then we’d need to know the set of all models (for normalization)
  • alternatively, we take the odds of the models against each other given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.

marginal likelihood

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

marginal likelihood of data under model \(M_i\)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

  • we marginalize out parameters \(\theta_i\)
  • this is a function of the prior and the likelihood

how to interpret Bayes factors

| BF(M1 > M2) | interpretation |
|:------------|:---------------|
| 1           | irrelevant data |
| 1 - 3       | hardly worth ink or breath |
| 3 - 6       | anecdotal |
| 6 - 10      | now we’re talking: substantial |
| 10 - 30     | strong |
| 30 - 100    | very strong |
| 100 +       | decisive (bye, bye \(M_2\)!) |

how to calculate Bayes factors

  1. get each model’s marginal likelihood
    • grid approximation (today)
    • by Monte Carlo sampling (today)
    • by brute force / clever math (next time)
    • by bridge sampling (next time)
  2. get the Bayes factor directly
    • Savage-Dickey method (next time)
    • transdimensional MCMC (not here)

grid approximation

  • consider a discrete grid of values for \(\theta\)
  • approximate the evidence by a weighted sum over the grid (see the formula below)
  • works well for low-dimensional \(\theta\)
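Concretely, for grid points \(\theta_1, \dots, \theta_n\), the marginal likelihood becomes a finite sum; with a uniform prior over an equally spaced grid, as in the code below, each grid point simply gets weight \(1/n\):

\[P(D) \approx \sum_{i=1}^{n} P(\theta_i) \ P(D \mid \theta_i), \quad \text{where } P(\theta_i) = \frac{1}{n} \text{ for a uniform prior}\]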

model specification (priors & likelihood function)

exponential

priorExp = function(a, b){
  # uniform prior over the box [0, 1.5] x [0, 1.5]
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExp = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5    # clamp into (0,1) for dbinom
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))  # (non-log) likelihood of all six counts
}

power

priorPow = function(c, d){
  # uniform prior over the box [0, 1.5] x [0, 1.5]
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}

example

library(dplyr)  # for %>% and mutate

lhExp = Vectorize(lhExp)
lhPow = Vectorize(lhPow)

stepsize = 0.01
evidence = expand.grid(x = seq(0.005, 1.495, by = stepsize),
                       y = seq(0.005, 1.495, by = stepsize)) %>% 
  mutate(lhExp = lhExp(x,y), priExp = 1 / length(x),  # uniform priors: weight 1/n per grid point
         lhPow = lhPow(x,y), priPow = 1 / length(x))

paste0("BF in favor of exponential model: ", 
            with(evidence, sum(priExp*lhExp)/ sum(priPow*lhPow)) %>% round(2))
## [1] "BF in favor of exponential model: 1221.39"

overwhelming evidence in favor of the exponential model

naive Monte Carlo simulation

generally:

\[\int f(\theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} f(\theta_i)\]

in particular:

\[P(D) = \int P(D \mid \theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} P(D \mid \theta_i)\]

naive MC approximation of marginal likelihoods

nSamples = 1000000
a = runif(nSamples, 0, 1.5)  # sample parameters from the uniform priors
b = runif(nSamples, 0, 1.5)
lhExpVec = lhExp(a,b)        # likelihood at each sampled parameter pair
lhPowVec = lhPow(a,b)
paste0("BF in favor of exponential model: ", 
            sum(lhExpVec) / sum(lhPowVec))  # the 1/n factors cancel in the ratio
## [1] "BF in favor of exponential model: 1242.49990813332"

time course of estimate

library(ggplot2)  # for the convergence plot

BFVec = sapply(seq(10000,nSamples, by = 500), 
     function(i) sum(lhExpVec[1:i]) / sum(lhPowVec[1:i]))  # running BF estimate
ggplot(data.frame(i = seq(10000,nSamples, by = 500), BF = BFVec), aes(x=i, y=BF)) +
  geom_line() + geom_hline(yintercept = 1221, color = "firebrick") + xlab("number of samples")
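Naive Monte Carlo estimates like this one can converge slowly when the prior places little mass where the likelihood is concentrated; this is one motivation for the smarter estimators (e.g., bridge sampling) discussed next time.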

summary

standard approach

  • Akaike information criterion
  • likelihood function \(P(D\mid \theta)\)
  • maximum likelihood estimate \[\hat{\theta} \in \arg\max_\theta P(D\mid \theta)\]
  • ex post: model uses data
  • counts number of free parameters
  • relatively easy to compute

Bayes

  • Bayes factors
  • likelihood function \(P(D\mid \theta)\) & prior \(P(\theta)\)
  • marginalized (prior) likelihood \[\int P(D \mid \theta) \ P(\theta) \text{d}\theta\]
  • ex ante: model doesn’t use data
  • implicitly weighs in effectiveness of parameters
  • relatively hard to compute

outlook


Thursday

  • more on model comparison
  • Savage-Dickey
  • bridge sampling


Tuesday

  • Bayes prediction & hypothesis testing

to prevent boredom


obligatory

  • read Kruschke Chapter 10


optional