- (recap) 3 pillars of Bayesian data analysis:
- estimation
- comparison
- prediction
- maximum likelihood estimation
- Akaike's information criterion
- free parameters
- model complexity
- Bayes factors
- Savage-Dickey method (next time!)
given model and data, which parameter values should we believe in?
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
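As a toy illustration (not part of the lecture's memory example), the posterior can be approximated on a grid: multiply prior and likelihood pointwise, then normalize. Here for a binomial rate \(\theta\) with a flat prior and 7 successes out of 10 trials:

```r
# grid approximation of P(theta | D) for a binomial rate theta
# (toy example: flat prior, 7 successes in 10 trials)
theta      = seq(0.005, 0.995, by = 0.01)        # parameter grid
prior      = rep(1, length(theta))               # flat prior
likelihood = dbinom(7, size = 10, prob = theta)  # P(D | theta)
posterior  = prior * likelihood
posterior  = posterior / sum(posterior)          # normalize
theta[which.max(posterior)]                      # posterior mode, near 0.7
```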
which of two models is more likely, given the data?
\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
which future observations do we expect (after seeing some data)?
prior predictive
\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]
posterior predictive
\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]
requires sampling distribution (more on this later)
special case: prior/posterior predictive \(p\)-value (model criticism)
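A sampling-based sketch of the posterior predictive (a hedged toy example, not the lecture's memory data): with a conjugate Beta(1,1) prior and 7 successes in 10 trials, the posterior is Beta(8,4), and predictive samples arise from pushing posterior draws through the sampling distribution:

```r
# approximate P(D | D') by sampling: draw theta from the posterior,
# then draw new data from the sampling distribution at that theta
set.seed(123)
nSamples     = 10000
thetaSamples = rbeta(nSamples, 8, 4)             # draws from the posterior
predSamples  = rbinom(nSamples, size = 10, prob = thetaSamples)
table(predSamples) / nSamples                    # approximate P(D | D')
```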
| | standard | Bayes |
|---|---|---|
| today's focus | Akaike's information criterion | Bayes factor |
| what counts as model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |
recall rates \(y\) for 100 subjects after time \(t\) in seconds

```r
y   = c(.94, .77, .40, .26, .24, .16)  # observed recall rates
t   = c(  1,   3,   6,   9,  12,  18)  # retention interval in seconds
obs = y * 100                          # number of successful recalls
```
example from Myung (2007), JoMP, tutorial on MLE
recall probabilities for different times \(t\)
exponential model
\[P(t \ ; \ a, b) = a \exp (-bt)\]
\[\text{where } a,b>0 \]
power model
\[P(t \ ; \ c, d) = ct^{-d}\]
\[\text{where } c,d>0 \]
exponential
```r
nLL.exp <- function(w1, w2) {
  # negative log-likelihood of the exponential model
  if (w1 < 0 | w2 < 0 | w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1 * exp(-w2 * t)
  p[p <= 0.0] = 1.0e-5        # clamp probabilities into (0, 1)
  p[p >= 1.0] = 1 - 1.0e-5
  - sum(dbinom(x = obs, prob = p, size = 100, log = TRUE))
}
show(nLL.exp(1, 1))
```
## [1] 1092.446
power
```r
nLL.pow <- function(w1, w2) {
  # negative log-likelihood of the power model
  if (w1 < 0 | w2 < 0 | w1 > 20 | w2 > 20) {
    return(NA)
  }
  p = w1 * t^(-w2)
  p[p <= 0.0] = 1.0e-5        # clamp probabilities into (0, 1)
  p[p >= 1.0] = 1 - 1.0e-5
  - sum(dbinom(x = obs, prob = p, size = 100, log = TRUE))
}
show(nLL.pow(1, 1))
```
## [1] 142.1096
```r
require('stats4')
bestExpo = mle(nLL.exp, start = list(w1 = 1, w2 = 1))
bestPow  = mle(nLL.pow, start = list(w1 = 1, w2 = 1))
MLEstimates = data.frame(model     = rep(c("expo", "power"), each = 2),
                         parameter = c("a", "b", "c", "d"),
                         value     = c(coef(bestExpo), coef(bestPow)))
knitr::kable(MLEstimates)
```
| model | parameter | value |
|---|---|---|
| expo | a | 1.0701173 |
| expo | b | 0.1308300 |
| power | c | 0.9531186 |
| power | d | 0.4979311 |
which model is better?
```r
# model predictions at the MLEs
# (a = w1, b = w2 of the exponential fit; c = w1, d = w2 of the power fit)
mleExp  = coef(bestExpo)
mlePow  = coef(bestPow)
predExp = mleExp[1] * exp(-mleExp[2] * t)
predPow = mlePow[1] * t^(-mlePow[2])
modelStats = data.frame(model   = c("expo", "power"),
                        logLike = round(c(logLik(bestExpo), logLik(bestPow)), 3),
                        pData   = exp(c(logLik(bestExpo), logLik(bestPow))),
                        r       = round(c(cor(predExp, y), cor(predPow, y)), 3),
                        LSE     = round(c(sum((predExp - y)^2),
                                          sum((predPow - y)^2)), 3))
modelStats
```
```
##   model logLike        pData     r   LSE
## 1  expo -18.666 7.820887e-09 0.982 0.019
## 2 power -26.726 2.471738e-12 0.948 0.057
```
motivation
definition
Let \(M\) be a model with \(k\) free parameters, let \(D\) be some data, and let \(\hat{\theta}\) be the maximum-likelihood estimate of \(\theta\):
\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]
The smaller the AIC, the better the model.
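Plugging in the maximized log-likelihoods from the MLE fits above (both models have \(k = 2\) free parameters) checks the formula by hand:

```r
# AIC by hand, using the maximized log-likelihoods from the MLE fits above
# (both models have k = 2 free parameters)
k = 2
aicExpo = 2 * k - 2 * (-18.666)   # exponential model
aicPow  = 2 * k - 2 * (-26.726)   # power model
c(aicExpo, aicPow)
```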
```
##   model logLike        pData     r   LSE      AIC
## 1  expo -18.666 7.820887e-09 0.982 0.019 41.33294
## 2 power -26.726 2.471738e-12 0.948 0.057 57.45220
```
show(AIC(bestExpo, bestPow))
```
##          df      AIC
## bestExpo  2 41.33294
## bestPow   2 57.45220
```
\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
The Bayes factor is the factor by which our prior odds are changed by the data.
Bayes factor in favor of model \(M_1\)
\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]
marginal likelihood of data under model \(M_i\)
\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]
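This integral can also be approximated by brute force. A minimal Monte Carlo sketch (a toy binomial example, not the lecture's grid method for the memory models): sample \(\theta\) from the prior and average the likelihood:

```r
# Monte Carlo estimate of a marginal likelihood: average the likelihood
# over samples from the prior (toy example: flat prior on theta,
# 7 successes in 10 trials; the exact value is 1/11)
set.seed(2025)
thetaSamples = runif(100000)                      # theta ~ prior
mean(dbinom(7, size = 10, prob = thetaSamples))   # approximates P(D)
```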
| BF(M1 > M2) | interpretation |
|---|---|
| 1 | irrelevant data |
| 1 - 3 | hardly worth ink or breath |
| 3 - 6 | anecdotal |
| 6 - 10 | now we're talking: substantial |
| 10 - 30 | strong |
| 30 - 100 | very strong |
| 100 + | decisive (bye, bye \(M_2\)!) |
```r
priorExpo = function(a, b) {
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExpo = function(a, b) {
  p = a * exp(-b * t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1 - 1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))  # no log!
}
```
```r
priorPow = function(c, d) {
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d) {
  p = c * t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1 - 1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))  # no log!
}
```
flat priors cancel out
```r
grid = seq(0.005, 1.495, by = 0.01)
margLikeExp = sum(sapply(grid, function(a)
  sum(sapply(grid, function(b) lhExpo(a, b)))))
margLikePow = sum(sapply(grid, function(a)
  sum(sapply(grid, function(b) lhPow(a, b)))))
show(as.numeric(margLikeExp / margLikePow))
```
## [1] 1221.392
overwhelming evidence in favor of the exponential model
let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)
\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]
\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi\\ & = P(D \mid \phi = \phi_0, M_1) \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \ \ \ \ \ \ \text{(by Bayes rule)} \end{align*} \]
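As a small preview of the Savage-Dickey method (a hypothetical toy example, not the memory models above): for binomial data with 7 successes in 10 trials, let \(M_0\) fix \(\theta = 0.5\) and let \(M_1\) put a flat Beta(1,1) prior on \(\theta\). The Bayes factor is then the ratio of posterior to prior density at \(\theta = 0.5\) under \(M_1\):

```r
# Savage-Dickey: BF(M0 > M1) = posterior density / prior density at the
# fixed value (toy example: theta = 0.5, flat Beta(1,1) prior,
# 7 successes in 10 trials; the posterior under M1 is Beta(8,4))
posteriorAtNull = dbeta(0.5, 1 + 7, 1 + 3)
priorAtNull     = dbeta(0.5, 1, 1)
posteriorAtNull / priorAtNull        # equals 1320/1024, about 1.29
```

This matches the direct computation of \(P(D \mid M_0) / P(D \mid M_1)\) via the marginal likelihoods, without ever integrating over \(M_1\)'s parameter space.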
| | standard | Bayes |
|---|---|---|
| today's focus | Akaike's information criterion | Bayes factor |
| what counts as model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |