\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

recap

(Bayesian) model comparison

which of two models is more likely, given the data?

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

summary

|                                                        | standard                                  | Bayes                                                  |
|--------------------------------------------------------|-------------------------------------------|--------------------------------------------------------|
| today's focus                                          | Akaike's information criterion            | Bayes factor                                           |
| what counts as model to be compared?                   | likelihood function \(P(D \mid \theta)\)  | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\)  |
| from which point of view do we compare models?         | ex post: after seeing the data            | ex ante: before seeing the data                        |
| how to penalize model complexity?                      | count number of free parameters           | implicitly weigh how effective a parameter is          |
| are we guaranteed to select the true model in the end? | no                                        | yes                                                    |
| hard to compute?                                       | relatively easy                           | relatively hard                                        |

roadmap for today

  • Bayes factor computation
    • exact computation
    • Savage-Dickey method (recap)
    • grid approximation
    • Monte Carlo simulation
      • naive
      • importance sampling
    • transdimensional MCMC

Savage-Dickey method

example: model comparison for coin flip data

  • observed: \(k = 7\) out of \(N = 24\) coin flips were successes
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

   

exact calculation

  • data: \(k = 7\) successes in \(N = 24\) trials

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 \binom{N}{k} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{N}}{\binom{N}{k} \int_0^1 \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{N}}{BetaFunction(k+1, N-k+1)} \approx 0.516 \end{align*} \]

exact calculation (cont.)

unraveling beta function

\[ \begin{align*} BetaFunction(k+1, N-k+1) & = \frac{\Gamma(k+1) \Gamma(N-k+1)}{\Gamma(N+2)} \\ & = \frac{(k)! (N-k)!}{(N+1)!} \\ & = \frac{(k)! (N-k)!}{(N)!} \frac{1}{N+1} \\ & = \frac{1}{\binom{N}{k}} \frac{1}{N+1} \\ \end{align*} \]

so:

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{0.5^{N}}{BetaFunction(k+1, N-k+1)} \\ & = \binom{N}{k} 0.5^{N} (N + 1) \approx 0.516 \end{align*} \]
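a quick numerical check of this closed form in R (using \(k = 7\) and \(N = 24\) from above):

k = 7; N = 24
0.5^N / beta(k + 1, N - k + 1)   # BetaFunction form of the Bayes factor
choose(N, k) * 0.5^N * (N + 1)   # final closed form; both give ~ 0.516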

properly nested models

  • suppose that there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]

proof

  • \(M_0\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi = \phi_0\)
  • \(M_1\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi\) free to vary
  • crucial assumption: \(\lim_{\phi \rightarrow \phi_0} P(\psi \mid \phi, M_1) = P(\psi \mid M_0)\)
  • rewrite marginal likelihood under \(M_0\):

\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi\\ & = P(D \mid \phi = \phi_0, M_1) \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \ \ \ \ \ \ \text{(by Bayes rule)} \end{align*} \]

dividing both sides by \(P(D \mid M_1)\) yields the Savage-Dickey ratio above
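for the coin-flip example above this means: the Bayes factor is the posterior density of \(\theta = 0.5\) under \(M_1\) (a Beta(\(k+1, N-k+1\)), given the flat prior) divided by the prior density at \(\theta = 0.5\); a minimal check in R:

k = 7; N = 24
dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)   # ~ 0.516, matching the exact BF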

memory models

forgetting data

  • subjects were asked to remember items (Murdoch 1961)
  • recall rates y for 100 trials after time t in seconds
y = c(.94, .77, .40, .26, .24, .16)  # observed recall rates
t = c(  1,   3,   6,   9,  12,  18)  # retention interval in seconds
obs = y*100                          # number of successful recalls out of 100 trials

example from Myung (2007), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential model

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power model

\[P(t \ ; \ c, d) = c \ t^{-d}\]

\[\text{where } c,d>0 \]
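to get a feel for the two functional forms, here are their predicted recall probabilities at the observed retention intervals t, for hand-picked (purely illustrative, not estimated) parameter values:

round(rbind(exponential = 1.0 * exp(-0.15 * t),   # a = 1.0, b = 0.15 (illustrative)
            power       = 0.95 * t^(-0.5)), 2)    # c = 0.95, d = 0.5 (illustrative)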

model specification (priors & likelihood function)

exponential

priorExp = function(a, b){
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExp = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}

power

priorPow = function(c, d){
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}
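as a quick sanity check, these functions can be evaluated at hand-picked parameter values (purely illustrative, not estimates):

priorExp(0.9, 0.1)   # joint prior density at (a, b) = (0.9, 0.1)
lhExp(0.9, 0.1)      # likelihood of the observed counts at these values
lhPow(0.9, 0.3)      # same for the power model at (c, d) = (0.9, 0.3)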

grid approximation

grid approximation

  • consider a grid of discrete values for \(\theta\)
  • approximate the evidence by summing over these grid values
  • works well for low-dimensional \(\theta\)

example

lhExp = Vectorize(lhExp)
lhPow = Vectorize(lhPow)

stepsize = 0.01
evidence = expand.grid(x = seq(0.005, 1.495, by = stepsize),
                       y = seq(0.005, 1.495, by = stepsize)) %>% 
  mutate(lhExp = lhExp(x,y), priExp = 1 / length(x),  # uniform priors!
         lhPow = lhPow(x,y), priPow = 1 / length(x))

paste0("BF in favor of exponential model: ", 
            with(evidence, sum(priExp*lhExp)/ sum(priPow*lhPow)) %>% round(2))
## [1] "BF in favor of exponential model: 1221.39"

overwhelming evidence in favor of the exponential model

naive Monte Carlo simulation

recap: why simulate?

generally:

\[\int f(\theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} f(\theta_i)\]

in particular:

\[P(D) = \int P(D \mid \theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} P(D \mid \theta_i)\]

example: naive sampling

nSamples = 200000
a = runif(nSamples, 0, 1.5)
b = runif(nSamples, 0, 1.5)
lhExpVec = lhExp(a,b)
lhPowVec = lhPow(a,b)
paste0("BF in favor of exponential model: ", 
            with(evidence, sum(lhExpVec)/ sum(lhPowVec)) %>% round(2))
## [1] "BF in favor of exponential model: 1182.32"

time course of estimate

BFVec = sapply(seq(10000,nSamples, by = 200), 
     function(i) sum(lhExpVec[1:i]) / sum(lhPowVec[1:i]))
ggplot(data.frame(i = seq(10000,nSamples, by = 200), BF = BFVec), aes(x=i, y=BF)) +
  geom_line() + geom_hline(yintercept = 1221, color = "firebrick") + xlab("number of samples")

importance sampling

importance sampling

take an arbitrary function \(h(\theta)\) such that \(\int h(\theta) \text{d}\theta = 1\)

\[ \begin{align*} \frac{1}{P(D)} & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)}\\ & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \int h(\theta) \text{d}\theta \\ & = \int \frac{h(\theta) P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \text{d}\theta \\ & \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta \mid D)} \frac{h(\theta_i)}{P(D \mid \theta_i) P(\theta_i)} \end{align*} \]

choose an \(h(\theta)\) that resembles the posterior: the closer \(h\) is to the posterior, the lower the variance of the estimate
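a schematic sketch of this estimator for one model (all names below are placeholders: postA and postB are vectors of posterior samples, hDens is some density over the parameters that resembles the posterior, lhFun and priorFun are the model's likelihood and prior functions from above):

# importance-sampling estimate of 1 / P(D | M) for one model
inverseEvidence = function(postA, postB, hDens, lhFun, priorFun) {
  mean(hDens(postA, postB) / (lhFun(postA, postB) * priorFun(postA, postB)))
}
# e.g. for the exponential model: 1 / inverseEvidence(aPost, bPost, hDens, lhExp, priorExp)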


posterior over parameters for exponential model

modelString = "
model{
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  for (i in 1: 6){
    pT[i] = a*exp(-t[i]*b)
    p[i] = min(max( pT[i], 0.00001), 0.99999)
    obs[i] ~ dbinom(p[i], 100)
  }
}"
# prepare data for JAGS (jags.model & coda.samples are from 'rjags'; ggs is from 'ggmcmc')
dataList = list(obs = obs, t = t)
# set up and run model
jagsModel = jags.model(file = textConnection(modelString), 
                       data = dataList,
                       n.chains = 2)
update(jagsModel, n.iter = 25000)
codaSamples = coda.samples(jagsModel, 
                           variable.names = c("a", "b"),
                           n.iter = 5000)
ms = ggs(codaSamples)

parametric approximation for posteriors

getGammaApprox = function(samples){
  # return shape and rate of a gamma distribution whose mode and standard
  # deviation equal the sample mean and sd of the posterior samples
  s = sd(samples)
  m = mean(samples)
  ra = ( m + sqrt( m^2 + 4*s^2 ) ) / ( 2 * s^2 )  # rate
  sh = 1 + m * ra                                 # shape
  return(c(shape = sh, rate = ra))
}
wideSamples = dcast(ms, Iteration + Chain ~ Parameter)
paramA = getGammaApprox(filter(ms, Parameter == "a")$value)
paramB = getGammaApprox(filter(ms, Parameter == "b")$value)
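a rough check of the approximation: the mean of the fitted gamma (shape/rate) should be close to the sample mean of the posterior draws for a

aSamples = filter(ms, Parameter == "a")$value
c(sampleMean = mean(aSamples),
  gammaMean  = unname(paramA["shape"] / paramA["rate"]))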

parameter estimation for power model

summary of estimated parameters

##   parameter     shape      rate
## 1         a 1121.0532 1056.1074
## 2         b  202.4942 1549.5401
## 3         c 2331.6908 2462.5925
## 4         d  232.5322  465.8463

implement these posterior approximations

homework!

transdimensional MCMC

idea

make model index a parameter


JAGS model

model{
  m ~ dbern(0.5)
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  c ~ dunif(0,1.5)
  d ~ dunif(0,1.5)
  for (i in 1: 6){
    pT[i] = ifelse(m == 0, a*exp(-t[i]*b), c*t[i]^(-d))
    p[i] = min(max( pT[i], 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)
  }
}
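a sketch of how one might run this model and read the Bayes factor off the posterior of the model index (assuming the model code above is stored in a string modelStringTD; since \(P(m=0) = P(m=1) = 0.5\) a priori, the posterior odds of \(m\) equal the Bayes factor):

jagsModelTD = jags.model(file = textConnection(modelStringTD),
                         data = dataList, n.chains = 2)
update(jagsModelTD, n.iter = 25000)
codaTD = coda.samples(jagsModelTD, variable.names = c("m"), n.iter = 50000)
mSamples = as.matrix(codaTD)[, "m"]
sum(mSamples == 0) / sum(mSamples == 1)   # BF in favor of the exponential model (m = 0)

in practice the chain rarely visits the weaker model, so this raw estimate is unstable; this is what the trickery below addresses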

posteriors

trickery

tweaking prior model odds & using pseudo-priors

model{
  mT ~ dbern(0.9995)
  m = mT + 1
  a[1] ~ dunif(0,1.5)
  b[1] ~ dunif(0,1.5)
  c[2] ~ dunif(0,1.5)
  d[2] ~ dunif(0,1.5)
  a[2] ~ dgamma(1124.84336, 1058.83299)
  b[2] ~ dgamma(199.82804, 1527.78959)
  c[1] ~ dgamma(61.88311, 29.28403)
  d[1] ~ dgamma(115.44879, 126.33878)
for (i in 1: 6){
    pExp[i] = a[m]*exp(-t[i]*b[m])
    pPow[i] = c[m]*t[i]^(-d[m]) 
    pT[i] = ifelse(m == 1, pExp[i], pPow[i])
    p[i] = min(max( pT[i], 0.00001), 0.99999)
    obs[i] ~ dbinom(p[i], 100)
  }  
}
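a sketch of how the Bayes factor is recovered from this version, undoing the tweaked prior odds (mSamplesPP is a placeholder for the posterior samples of m from the model above):

posteriorOdds = sum(mSamplesPP == 1) / sum(mSamplesPP == 2)  # exponential (m = 1) vs. power (m = 2)
priorOdds     = 0.0005 / 0.9995                              # P(m = 1) / P(m = 2) from dbern(0.9995)
posteriorOdds / priorOdds                                    # BF in favor of the exponential model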

posteriors

## [1] "BF in favor of Exponential Model (using transdimensional MCMC): 948.77"

fini

summary

|                                                        | standard                                  | Bayes                                                  |
|--------------------------------------------------------|-------------------------------------------|--------------------------------------------------------|
| today's focus                                          | Akaike's information criterion            | Bayes factor                                           |
| what counts as model to be compared?                   | likelihood function \(P(D \mid \theta)\)  | likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\)  |
| from which point of view do we compare models?         | ex post: after seeing the data            | ex ante: before seeing the data                        |
| how to penalize model complexity?                      | count number of free parameters           | implicitly weigh how effective a parameter is          |
| are we guaranteed to select the true model in the end? | no                                        | yes                                                    |
| hard to compute?                                       | relatively easy                           | relatively hard                                        |

outlook

 

Friday

  • philosophy of Bayes Factors, some more tricks and applications

 

Tuesday

  • no class!