Model Comparison

Michael Franke


roadmap

  • (recap) 3 pillars of Bayesian data analysis
    • estimation
    • comparison
    • prediction
  • Akaike’s information criterion
    • maximum likelihood estimation
    • free parameters
    • (discounting) model complexity
  • Bayes factors
    • grid approximation
    • naive Monte Carlo

recap: pillars of BDA

parameter estimation

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
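As a minimal illustration (with hypothetical coin-flip data, not part of the case study below), the posterior over a discretized parameter grid is just the normalized product of prior and likelihood:

# minimal sketch: grid-based parameter estimation for a coin's bias
theta      = seq(0.005, 0.995, by = 0.01)         # discretized parameter values
prior      = rep(1/length(theta), length(theta))  # uniform prior over the grid
likelihood = dbinom(7, size = 10, prob = theta)   # e.g., 7 heads in 10 flips
posterior  = prior * likelihood / sum(prior * likelihood)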

model comparison

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]


prior & posterior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \] \[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]

outlook

standard approach

  • Akaike information criterion
  • likelihood function \(P(D\mid \theta)\)
  • maximum likelihood estimate \[\hat{\theta} \in \arg\max_\theta P(D\mid \theta)\]
  • ex post: model uses data
  • counts number of free parameters
  • relatively easy to compute

Bayes

  • Bayes factors
  • likelihood function \(P(D\mid \theta)\) & prior \(P(\theta)\)
  • marginalized (prior) likelihood \[\int P(D \mid \theta) \ P(\theta) \text{d}\theta\]
  • ex ante: model doesn’t use data
  • implicitly weighs in effectiveness of parameters
  • relatively hard to compute

case study forgetting data

  • subjects were each asked to remember items (Murdock 1961)
  • recall rates y for 100 subjects after retention time t in seconds
y = c(.94, .77, .40, .26, .24, .16)  # observed recall rates
t = c(  1,   3,   6,   9,  12,  18)  # retention interval in seconds
obs = y*100                          # number of correct recalls out of 100 subjects

example from Myung (2003), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = ct^{-d}\]

\[\text{where } c,d>0 \]

negative log-likelihood

exponential

nLL.exp <- function(w1,w2) {
  # rule out parameter values outside the search box
  if (w1 < 0 | w2 < 0 | 
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*exp(-w2*t)       # predicted recall probabilities
  p[p <= 0.0] = 1.0e-5    # clamp into (0,1) for dbinom
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.exp(1,1))
## [1] 1092.446

power

nLL.pow <- function(w1,w2) {
  # same structure as nLL.exp, with the power-law prediction
  if (w1 < 0 | w2 < 0 |
      w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1*t^(-w2)          # predicted recall probabilities
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, 
               size = 100, log = T))
}
show(nLL.pow(1,1))
## [1] 142.1096

MLE

library(stats4)  # provides mle()
bestExpo = mle(nLL.exp, start = list(w1=1,w2=1))
bestPow  = mle(nLL.pow, start = list(w1=1,w2=1))
MLEstimates = data.frame(model = rep(c("expo", "power"), each = 2),
                         parameter = c("a", "b", "c", "d"),
                         value = c(coef(bestExpo), coef(bestPow)))
knitr::kable(MLEstimates)
| model | parameter |     value |
|:------|:----------|----------:|
| expo  | a         | 1.0701173 |
| expo  | b         | 0.1308300 |
| power | c         | 0.9531186 |
| power | d         | 0.4979311 |

best fits visualization

model comparison

which model is better?
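The code below uses curve functions expo and power and the fitted coefficients a, b, c, d, which are not shown on these slides; a minimal sketch of what they presumably look like:

# presumed helpers (hypothetical; not shown in the original slides)
expo  = function(t, a, b) a * exp(-b * t)  # exponential forgetting curve
power = function(t, c, d) c * t^(-d)       # power forgetting curve
# extract the MLE coefficients fitted above
a = coef(bestExpo)[["w1"]]; b = coef(bestExpo)[["w2"]]
c = coef(bestPow)[["w1"]];  d = coef(bestPow)[["w2"]]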

predExp = expo(t,a,b)
predPow = power(t,c,d)
modelStats = data.frame(model = c("expo", "power"),
                        logLike = round(c(logLik(bestExpo), logLik(bestPow)),3),
                        pData = exp(c(logLik(bestExpo), logLik(bestPow))),
                        r = round(c(cor(predExp, y), cor(predPow,y)),3),
                        LSE = round(c( sum((predExp-y)^2), sum((predPow-y)^2)),3))
modelStats
##   model logLike        pData     r   LSE
## 1  expo -18.666 7.820887e-09 0.982 0.019
## 2 power -26.726 2.471738e-12 0.948 0.057

Akaike information criterion

motivation

  • the higher \(P(D \mid \hat{\theta})\), the better the model
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\) is the maximum likelihood estimate
  • the more free parameters a model has, the worse it is
    • principle of parsimony (Ockham’s razor)
  • information-theoretic notion:
    • relative amount of information lost when assuming the data was generated with \(\hat{\theta}\)

definition

Let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]

The smaller the AIC, the better the model.
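To connect the definition to the fitted models: both have \(k = 2\) free parameters, so AIC can be computed by hand from the maximized log-likelihoods:

# AIC by hand: 2k - 2 ln P(D | theta-hat), with k = 2 for both models
2*2 - 2*as.numeric(logLik(bestExpo))  # 41.33294
2*2 - 2*as.numeric(logLik(bestPow))   # 57.45220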

model comparison by AIC

##   model logLike        pData     r   LSE      AIC
## 1  expo -18.666 7.820887e-09 0.982 0.019 41.33294
## 2 power -26.726 2.471738e-12 0.948 0.057 57.45220
show(AIC(bestExpo, bestPow))
##          df      AIC
## bestExpo  2 41.33294
## bestPow   2 57.45220

remarks on AIC

  • given more and more data, repeated model selection by AIC is not guaranteed to converge on the true model
  • the “model” that AIC evaluates is just the likelihood function; there is no prior
  • counting the number of free parameters, as AIC does, ignores how effective each parameter actually is
    • think: individual-level parameters harnessed by a hierarchical population-level prior
  • other information criteria overcome some of these problems:
    • Bayesian information criterion (BIC)
    • deviance information criterion (DIC)
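For instance, the BIC replaces AIC’s constant penalty per parameter with one that grows with the number of observations \(n\):

\[\text{BIC}(M, D) = k \ln n - 2 \ln P(D \mid \hat{\theta})\]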

Bayes factors

  • take two models:
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we’d want to know the absolute probability of \(M_i\) given the data
    • but then we’d need to know the set of all models (for normalization)
  • alternatively, we take the odds of the models against each other given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.

marginal likelihood

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

marginal likelihood of data under model \(M_i\)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

  • we marginalize out parameters \(\theta_i\)
  • this is a function of the prior and the likelihood

how to interpret Bayes factors

| BF(M1 > M2) | interpretation |
|:------------|:---------------|
| 1           | irrelevant data |
| 1 - 3       | hardly worth ink or breath |
| 3 - 6       | anecdotal |
| 6 - 10      | now we’re talking: substantial |
| 10 - 30     | strong |
| 30 - 100    | very strong |
| 100 +       | decisive (bye, bye \(M_2\)!) |

how to calculate Bayes factors

  1. get each model’s marginal likelihood
    • grid approximation (today)
    • by Monte Carlo sampling (today)
    • by brute force / clever math (next time)
    • by bridge sampling (next time)
  2. get the Bayes factor directly
    • Savage-Dickey method (next time)
    • transdimensional MCMC (not here)

grid approximation

  • consider a discrete grid of values for \(\theta\)
  • approximate the evidence by a weighted sum over the grid (see the formula below)
  • works well for low-dimensional \(\theta\)
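Concretely, for grid points \(\theta_1, \dots, \theta_n\), the marginal likelihood becomes a finite sum; with a uniform prior over an equally spaced grid, as in the code below, each grid point simply gets weight \(1/n\):

\[P(D) \approx \sum_{i=1}^{n} P(\theta_i) \ P(D \mid \theta_i), \quad \text{where } P(\theta_i) = \frac{1}{n} \text{ for a uniform prior}\]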

model specification (priors & likelihood function)

exponential

priorExp = function(a, b){
  # uniform prior over the box [0, 1.5] x [0, 1.5]
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExp = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5    # clamp into (0,1) for dbinom
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))  # (non-log) likelihood of all six counts
}

power

priorPow = function(c, d){
  # uniform prior over the box [0, 1.5] x [0, 1.5]
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}

example

library(dplyr)  # for %>% and mutate

lhExp = Vectorize(lhExp)
lhPow = Vectorize(lhPow)

stepsize = 0.01
evidence = expand.grid(x = seq(0.005, 1.495, by = stepsize),
                       y = seq(0.005, 1.495, by = stepsize)) %>% 
  mutate(lhExp = lhExp(x,y), priExp = 1 / length(x),  # uniform priors: weight 1/n per grid point
         lhPow = lhPow(x,y), priPow = 1 / length(x))

paste0("BF in favor of exponential model: ", 
            with(evidence, sum(priExp*lhExp)/ sum(priPow*lhPow)) %>% round(2))
## [1] "BF in favor of exponential model: 1221.39"

overwhelming evidence in favor of the exponential model

naive Monte Carlo simulation

generally:

\[\int f(\theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} f(\theta_i)\]

in particular:

\[P(D) = \int P(D \mid \theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} P(D \mid \theta_i)\]

naive MC approximation of marginal likelihoods

nSamples = 1000000
a = runif(nSamples, 0, 1.5)  # sample parameters from the uniform priors
b = runif(nSamples, 0, 1.5)
lhExpVec = lhExp(a,b)        # likelihood at each sampled parameter pair
lhPowVec = lhPow(a,b)
paste0("BF in favor of exponential model: ", 
            sum(lhExpVec) / sum(lhPowVec))  # the 1/n factors cancel in the ratio
## [1] "BF in favor of exponential model: 1242.49990813332"

time course of estimate

library(ggplot2)  # for the convergence plot

BFVec = sapply(seq(10000,nSamples, by = 500), 
     function(i) sum(lhExpVec[1:i]) / sum(lhPowVec[1:i]))  # running BF estimate
ggplot(data.frame(i = seq(10000,nSamples, by = 500), BF = BFVec), aes(x=i, y=BF)) +
  geom_line() + geom_hline(yintercept = 1221, color = "firebrick") + xlab("number of samples")
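Naive Monte Carlo estimates like this one can converge slowly when the prior places little mass where the likelihood is concentrated; this is one motivation for the smarter estimators (e.g., bridge sampling) discussed next time.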

summary

standard approach

  • Akaike information criterion
  • likelihood function \(P(D\mid \theta)\)
  • maximum likelihood estimate \[\hat{\theta} \in \arg\max_\theta P(D\mid \theta)\]
  • ex post: model uses data
  • counts number of free parameters
  • relatively easy to compute

Bayes

  • Bayes factors
  • likelihood function \(P(D\mid \theta)\) & prior \(P(\theta)\)
  • marginalized (prior) likelihood \[\int P(D \mid \theta) \ P(\theta) \text{d}\theta\]
  • ex ante: model doesn’t use data
  • implicitly weighs in effectiveness of parameters
  • relatively hard to compute

outlook


Thursday

  • more on model comparison
  • Savage-Dickey
  • bridge sampling


Tuesday

  • Bayes prediction & hypothesis testing

to prevent boredom


obligatory

  • read Kruschke Chapter 10


optional