which of two models is more likely, given the data?
\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
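A quick numerical illustration of this identity (all numbers hypothetical):

```r
# hypothetical numbers: a Bayes factor of 6 in favor of M1,
# combined with prior odds of 1/2 (M2 twice as plausible a priori)
bayesFactor   = 6
priorOdds     = 0.5
posteriorOdds = bayesFactor * priorOdds  # the data tip the balance towards M1
```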
| | standard | Bayes |
| --- | --- | --- |
| today's focus | Akaike's information criterion | Bayes factor |
| what counts as model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |
example: we observe \(k\) successes in \(N\) coin flips; \(M_0\) fixes \(\theta = 0.5\), while \(M_1\) assigns \(\theta\) a flat \(\text{Beta}(1,1)\) prior

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 \binom{N}{k} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{N}}{\binom{N}{k} \int_0^1 \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{N}}{\mathrm{B}(k+1, N-k+1)} \approx 0.516 \end{align*} \]

where \(\mathrm{B}\) is the Beta function
unraveling the Beta function:

\[ \begin{align*} \mathrm{B}(k+1, N-k+1) & = \frac{\Gamma(k+1) \Gamma(N-k+1)}{\Gamma(N+2)} \\ & = \frac{k! \, (N-k)!}{(N+1)!} \\ & = \frac{k! \, (N-k)!}{N!} \frac{1}{N+1} \\ & = \frac{1}{\binom{N}{k}} \frac{1}{N+1} \\ \end{align*} \]
so:
\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{0.5^{N}}{\mathrm{B}(k+1, N-k+1)} \\ & = \binom{N}{k} \, 0.5^{N} \, (N + 1) \approx 0.516 \end{align*} \]
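The closed form is easy to check numerically. The value 0.516 comes out if we assume \(k = 7\) successes in \(N = 24\) flips (these particular data are an assumption here, chosen because they reproduce the number above):

```r
# assumed data: k = 7 successes in N = 24 trials
k = 7; N = 24
# marginal likelihood of M0 (theta fixed at 0.5)
pD_M0 = dbinom(k, N, 0.5)
# marginal likelihood of M1 (flat Beta(1,1) prior on theta):
# choose(N,k) * B(k+1, N-k+1), which simplifies to 1 / (N+1)
pD_M1 = choose(N, k) * beta(k + 1, N - k + 1)
BF01 = pD_M0 / pD_M1
round(BF01, 3)
```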
let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)
\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]
write \(\phi\) for the parameters that \(M_0\) fixes at \(\phi_0\), and \(\psi\) for the remaining free parameters; then

\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi\\ & = P(D \mid \phi = \phi_0, M_1) \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \ \ \ \ \ \ \text{(by Bayes rule)} \end{align*} \]
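For the binomial example this Savage-Dickey identity is easy to verify numerically: under \(M_1\) with a flat prior, the posterior of \(\theta\) is \(\text{Beta}(k+1, N-k+1)\), so the ratio of posterior to prior density at \(\theta = 0.5\) should reproduce the Bayes factor (again assuming \(k = 7\), \(N = 24\)):

```r
k = 7; N = 24
posteriorDensityAtNull = dbeta(0.5, k + 1, N - k + 1)  # P(theta = 0.5 | D, M1)
priorDensityAtNull     = dbeta(0.5, 1, 1)              # P(theta = 0.5 | M1) = 1
BF01_SD = posteriorDensityAtNull / priorDensityAtNull
round(BF01_SD, 3)   # same value as the direct computation
```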
```r
y   = c(.94, .77, .40, .26, .24, .16)  # recall rates y for 100 trials after time t
t   = c(  1,   3,   6,   9,  12,  18)  # in seconds
obs = y * 100
```
example from Myung (2003), Journal of Mathematical Psychology, tutorial on maximum likelihood estimation
recall probabilities for different times \(t\)
exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = c \ t^{-d}\]

\[\text{where } c,d>0 \]
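To get a feel for the two functional forms, here are predicted recall probabilities at the observed retention intervals, for purely illustrative parameter values:

```r
# predicted recall probabilities under both functional forms at the
# observed retention intervals; parameter values chosen only for illustration
t = c(1, 3, 6, 9, 12, 18)
predExp = 1.0 * exp(-0.15 * t)   # exponential: a * exp(-b*t) with a = 1.0, b = 0.15
predPow = 0.9 * t^(-0.5)         # power:       c * t^(-d)   with c = 0.9, d = 0.5
round(rbind(exponential = predExp, power = predPow), 2)
```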
exponential
```r
priorExp = function(a, b){
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExp = function(a, b){
  p = a * exp(-b * t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1 - 1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}
```
power
```r
priorPow = function(c, d){
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c * t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1 - 1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}
```
```r
lhExp = Vectorize(lhExp)
lhPow = Vectorize(lhPow)
stepsize = 0.01
evidence = expand.grid(x = seq(0.005, 1.495, by = stepsize),
                       y = seq(0.005, 1.495, by = stepsize)) %>%
  mutate(lhExp = lhExp(x, y),
         priExp = 1 / length(x),  # uniform priors!
         lhPow = lhPow(x, y),
         priPow = 1 / length(x))
paste0("BF in favor of exponential model: ",
       with(evidence, sum(priExp * lhExp) / sum(priPow * lhPow)) %>% round(2))
```
## [1] "BF in favor of exponential model: 1221.39"
overwhelming evidence in favor of the exponential model
generally:
\[\int f(\theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} f(\theta_i)\]
in particular:
\[P(D) = \int P(D \mid \theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} P(D \mid \theta_i)\]
```r
nSamples = 200000
a = runif(nSamples, 0, 1.5)
b = runif(nSamples, 0, 1.5)
lhExpVec = lhExp(a, b)
lhPowVec = lhPow(a, b)
paste0("BF in favor of exponential model: ",
       (sum(lhExpVec) / sum(lhPowVec)) %>% round(2))
```
## [1] "BF in favor of exponential model: 1182.32"
```r
BFVec = sapply(seq(10000, nSamples, by = 200),
               function(i) sum(lhExpVec[1:i]) / sum(lhPowVec[1:i]))
ggplot(data.frame(i = seq(10000, nSamples, by = 200), BF = BFVec),
       aes(x = i, y = BF)) +
  geom_line() +
  geom_hline(yintercept = 1221, color = "firebrick") +
  xlab("number of samples")
```
take an arbitrary function \(h(\theta)\) such that \(\int h(\theta) \text{d}\theta = 1\)
\[ \begin{align*} \frac{1}{P(D)} & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)}\\ & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \int h(\theta) \text{d}\theta \\ & = \int \frac{h(\theta) P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \text{d}\theta \\ & \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta \mid D)} \frac{h(\theta_i)}{P(D \mid \theta_i) P(\theta_i)} \end{align*} \]
choose an \(h(\theta)\) that resembles the posterior: then the ratio \(h(\theta) / (P(D \mid \theta) \, P(\theta))\) is nearly constant across samples, and the estimator has low variance
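Before applying this to the forgetting models, here is a self-contained sanity check on the binomial example (assuming \(k = 7\), \(N = 24\)), where \(P(D) = \frac{1}{N+1}\) is known exactly; choosing \(h\) as the exact posterior makes the estimator essentially error-free:

```r
set.seed(1234)
k = 7; N = 24; nSamples = 10000
# samples from the posterior under a flat prior: Beta(k+1, N-k+1)
thetas = rbeta(nSamples, k + 1, N - k + 1)
h   = dbeta(thetas, k + 1, N - k + 1)  # h(theta): here the exact posterior density
lh  = dbinom(k, N, thetas)             # P(D | theta)
pri = dbeta(thetas, 1, 1)              # flat prior P(theta)
inversePD = mean(h / (lh * pri))       # estimates 1 / P(D)
pD = 1 / inversePD                     # should be very close to 1 / (N + 1)
```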
```r
modelString = "
model{
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  for (i in 1:6){
    pT[i] = a*exp(-t[i]*b)
    p[i] = min(max(pT[i], 0.00001), 0.99999)
    obs[i] ~ dbinom(p[i], 100)
  }
}"
# prepare data for JAGS
dataList = list(obs = obs, t = t)
# set up and run model
jagsModel = jags.model(file = textConnection(modelString),
                       data = dataList,
                       n.chains = 2)
update(jagsModel, n.iter = 25000)
codaSamples = coda.samples(jagsModel,
                           variable.names = c("a", "b"),
                           n.iter = 5000)
ms = ggs(codaSamples)
```
```r
# approximate a sample by a gamma distribution (returns shape & rate)
getGammaApprox = function(samples){
  s = sd(samples)
  m = mean(samples)
  ra = (m + sqrt(m^2 + 4 * s^2)) / (2 * s^2)
  sh = 1 + m * ra
  return(c(shape = sh, rate = ra))
}
wideSamples = dcast(ms, Iteration + Chain ~ Parameter)
paramA = getGammaApprox(filter(ms, Parameter == "a")$value)
paramB = getGammaApprox(filter(ms, Parameter == "b")$value)
```
## parameter shape rate ## 1 a 1121.0532 1056.1074 ## 2 b 202.4942 1549.5401 ## 3 c 2331.6908 2462.5925 ## 4 d 232.5322 465.8463
homework!
make model index a parameter
```
model{
  m ~ dbern(0.5)
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  c ~ dunif(0,1.5)
  d ~ dunif(0,1.5)
  for (i in 1:6){
    pT[i] = ifelse(m == 0, a*exp(-t[i]*b), c*t[i]^(-d))
    p[i] = min(max(pT[i], 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)
  }
}
```
tweaking prior model odds & using pseudo-priors
```
model{
  mT ~ dbern(0.9995)
  m = mT + 1
  a[1] ~ dunif(0,1.5)
  b[1] ~ dunif(0,1.5)
  c[2] ~ dunif(0,1.5)
  d[2] ~ dunif(0,1.5)
  a[2] ~ dgamma(1124.84336, 1058.83299)
  b[2] ~ dgamma(199.82804, 1527.78959)
  c[1] ~ dgamma(61.88311, 29.28403)
  d[1] ~ dgamma(115.44879, 126.33878)
  for (i in 1:6){
    pExp[i] = a[m]*exp(-t[i]*b[m])
    pPow[i] = c[m]*t[i]^(-d[m])
    pT[i] = ifelse(m == 1, pExp[i], pPow[i])
    p[i] = min(max(pT[i], 0.00001), 0.99999)
    obs[i] ~ dbinom(p[i], 100)
  }
}
```
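The BF follows from the sampled model indices by dividing the posterior odds by the (heavily skewed) prior odds. A minimal sketch with a synthetic index vector (the counts here are hypothetical, chosen only for illustration):

```r
# sketch: recovering the BF from transdimensional samples of the model
# index m (1 = exponential, 2 = power); mSamples is a synthetic stand-in
# for the values that would be extracted from codaSamples
mSamples  = c(rep(1, 322), rep(2, 678))  # hypothetical sampled model indices
postOdds  = sum(mSamples == 1) / sum(mSamples == 2)
priorOdds = 0.0005 / 0.9995              # from mT ~ dbern(0.9995), m = mT + 1
BF = postOdds / priorOdds                # posterior odds corrected by prior odds
```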
## [1] "BF in favor of Exponential Model (using transdimensional MCMC): 948.77"
| | standard | Bayes |
| --- | --- | --- |
| today's focus | Akaike's information criterion | Bayes factor |
| what counts as model to be compared? | likelihood function \(P(D \mid \theta)\) | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\) |
| from which point of view do we compare models? | ex post: after seeing the data | ex ante: before seeing the data |
| how to penalize model complexity? | count number of free parameters | implicitly weigh how effective a parameter is |
| are we guaranteed to select the true model in the end? | no | yes |
| hard to compute? | relatively easy | relatively hard |