Model Comparison: Part 2

Michael Franke

\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

     

recap

Bayes factors (recap)

  • take two models:
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we’d want to know the absolute probability of \(M_i\) given the data
    • but then we’d need to know the set of all models (for normalization)
  • alternatively, we take odds of models given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.
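
For illustration (hypothetical numbers): if the prior odds are \(1\) (both models equally plausible a priori) and the data yield a Bayes factor of \(\text{BF}(M_1 > M_2) = 5\), the posterior odds are \(5 \cdot 1 = 5\), i.e., after seeing the data \(M_1\) is five times more probable than \(M_2\).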

marginal likelihood (recap)

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

marginal likelihood of data under model \(M_i\)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

  • we marginalize out parameters \(\theta_i\)
  • this is a function of the prior and the likelihood

comparison against MLE-based methods (recap)

standard approach

  • Akaike information criterion
  • likelihood function \(P(D\mid \theta)\)
  • maximum likelihood estimate \[\hat{\theta} \in \arg\max_\theta P(D\mid \theta)\]
  • ex post: model uses data
  • counts number of free parameters
  • relatively easy to compute

Bayes

  • Bayes factors
  • likelihood function \(P(D\mid \theta)\) & prior \(P(\theta)\)
  • marginalized (prior) likelihood \[\int P(D \mid \theta) \ P(\theta) \text{d}\theta\]
  • ex ante: model doesn’t use data
  • implicitly weighs in effectiveness of parameters
  • relatively hard to compute
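
A minimal sketch of both quantities side by side, using the coin-flip data introduced below (\(k = 7\) of \(N = 24\)); the AIC is \(2p - 2\log \hat{L}\) with \(p = 1\) free parameter, and the marginal likelihood averages the likelihood over a flat \(\text{Beta}(1,1)\) prior:

```r
# MLE-based AIC vs. Bayesian marginal likelihood for simple binomial data
k <- 7; N <- 24

# standard approach: maximize the likelihood, then penalize by the number of parameters
theta_hat <- k / N                                       # MLE of theta
aic <- 2 * 1 - 2 * dbinom(k, N, theta_hat, log = TRUE)   # one free parameter

# Bayesian approach: average the likelihood over the Beta(1,1) prior (ex ante, no data peeking)
marg_lik <- integrate(function(th) dbinom(k, N, th) * dbeta(th, 1, 1), 0, 1)$value

c(AIC = aic, marginal_likelihood = marg_lik)             # marg_lik equals 1/(N+1) = 0.04
```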

how to calculate Bayes factors

  1. get each model’s marginal likelihood
    • grid approximation
      (last time)
    • by Monte Carlo sampling
      (last time)
    • clever math
      (today)
    • bridge sampling
      (today)
  2. get Bayes factor directly
    • Savage-Dickey method
      (today)
    • transdimensional MCMC
      (not here)

     

Savage-Dickey method

example: model comparison for coin flip data

  • observed: \(k = 7\) out of \(N = 24\) coin flips were successes
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

   

exact calculation

  • data: \(k = 7\) successes in \(N = 24\) trials

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 \binom{N}{k} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{\binom{N}{k} 0.5^{N}}{\binom{N}{k} \int_0^1 \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{N}}{\text{B}(k+1, N-k+1)} \approx 0.516 \end{align*} \]
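
The same arithmetic can be checked with base R (a small sketch using the numbers from the example):

```r
# exact Bayes factor for the coin-flip example: k = 7 successes in N = 24 flips
k <- 7; N <- 24
ml_M0 <- dbinom(k, N, 0.5)                       # marginal likelihood of the null model
ml_M1 <- choose(N, k) * beta(k + 1, N - k + 1)   # analytic integral; equals 1 / (N + 1)
ml_M0 / ml_M1                                    # BF(M0 > M1) is approximately 0.516
```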

properly nested models

  • suppose that there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]

proof

  • \(M_0\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi = \phi_0\)
  • \(M_1\) has parameters \(\theta = \tuple{\phi, \psi}\) with \(\phi\) free to vary
  • crucial assumption: \(\lim_{\phi \rightarrow \phi_0} P(\psi \mid \phi, M_1) = P(\psi \mid M_0)\)
  • rewrite marginal likelihood under \(M_0\):

\[ \begin{align*} P(D \mid M_0) & = \int P(D \mid \psi, M_0) P(\psi \mid M_0) \ \text{d}\psi \\ & = \int P(D \mid \psi, \phi = \phi_0, M_1) P(\psi \mid \phi = \phi_0, M_1) \ \text{d}\psi\\ & = P(D \mid \phi = \phi_0, M_1) \ \ \ \ \ \ \text{(by Bayes rule)} \\ & = \frac{P(\phi = \phi_0 \mid D, M_1) P(D \mid M_1)}{P(\phi = \phi_0 \mid M_1)} \end{align*} \]
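
For the coin-flip example this can be verified directly, since the posterior under \(M_1\) is the conjugate \(\text{Beta}(k+1, N-k+1)\) distribution (a small sketch in base R):

```r
# Savage-Dickey: BF(M0 > M1) is the posterior/prior density ratio at theta = 0.5 under M1
k <- 7; N <- 24
posterior_at_null <- dbeta(0.5, k + 1, N - k + 1)  # conjugate posterior Beta(8, 18)
prior_at_null     <- dbeta(0.5, 1, 1)              # flat Beta(1, 1) prior; density is 1
posterior_at_null / prior_at_null                  # about 0.516, matching the exact calculation
```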

     

Savage-Dickey method applied to the Generalized Context Model

Generalized Context Model ::: Stimuli

[figure: GCM stimuli]

see Lee & Wagenmakers (2015) Chapter 17

Generalized Context Model ::: Model

[figure: GCM model]

Generalized Context Model ::: Stan code

GCM ::: Posterior inference & BF

## [1] "Approximate BF: 4.747"
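
The GCM code itself is not reproduced here. As a hedged sketch of the general recipe (not the code behind the number above), a Savage-Dickey Bayes factor can be approximated from MCMC output by estimating the posterior density at the null value; variable names below are hypothetical, and the sanity check reuses the coin-flip example where the answer is known:

```r
# approximate a Savage-Dickey BF from posterior samples via kernel density estimation
sd_bf <- function(samples, null_value, prior_density_at_null) {
  dens <- density(samples)                               # KDE of the posterior samples
  post_at_null <- approx(dens$x, dens$y, xout = null_value)$y
  post_at_null / prior_density_at_null
}

# sanity check with samples from the known Beta(8, 18) posterior of the coin-flip example
set.seed(2024)
sd_bf(rbeta(1e5, 8, 18), 0.5, dbeta(0.5, 1, 1))          # close to the exact 0.516
```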

     

sampling-based approximations of marginal likelihoods

reformulating marginal likelihood

naive Monte Carlo

\[ P(D) = \int P_{\text{prior}}(\theta) \ P(D \mid \theta) \ \text{d} \theta = \mathbb{E}_{P_{\text{prior}}(\theta)} \left [ P(D \mid \theta) \right ]\]

importance sampling

\[ P(D) = \mathbb{E}_{g_{IS}(\theta)} \left [ \frac{P_{\text{prior}}(\theta) \ P(D \mid \theta)}{g_{IS}(\theta)} \right ]\]

generalized harmonic mean

\[ P(D) = \left [ \mathbb{E}_{P_{\text{posterior}}(\theta \mid D)} \left [ \frac{g_{HM}(\theta)}{P_{\text{prior}}(\theta) \ P(D \mid \theta)} \right ] \right ]^{-1} \]

bridge

\[ P(D) = \frac{\mathbb{E}_{g_{\text{proposal}}(\theta)} \left [ P(D \mid \theta) \ P_{\text{prior}}(\theta) \ h_{\text{bridge}}(\theta) \right ] } {\mathbb{E}_{P_{\text{posterior}}(\theta \mid D)} \left [ h_{\text{bridge}}(\theta) \ g_{\text{proposal}}(\theta) \right ]}\]
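
A minimal sketch of the first two estimators for the coin-flip example, where the true marginal likelihood is \(1/(N+1) = 0.04\); the importance distribution \(\text{Beta}(8,18)\) is an ad-hoc choice for illustration:

```r
# naive Monte Carlo vs. importance sampling estimates of P(D) for the coin-flip data
k <- 7; N <- 24
set.seed(1)
n_samples <- 1e5

# naive Monte Carlo: average the likelihood over draws from the prior
theta_prior <- rbeta(n_samples, 1, 1)
naive_mc <- mean(dbinom(k, N, theta_prior))

# importance sampling: draw from g_IS, weight by prior * likelihood / g_IS
theta_is <- rbeta(n_samples, 8, 18)            # proposal chosen to resemble the posterior
is_est <- mean(dbeta(theta_is, 1, 1) * dbinom(k, N, theta_is) / dbeta(theta_is, 8, 18))

c(naive = naive_mc, importance = is_est)       # both close to 1/25 = 0.04
```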

generalized harmonic mean sampler

since \(g_{HM}(\theta)\) is a probability distribution, we have \(\int g_{HM}(\theta) \text{d}\theta = 1\); therefore:

\[ \begin{align*} \frac{1}{P(D)} & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)}\\ & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \int g_{HM}(\theta) \text{d}\theta \\ & = \int \frac{g_{HM}(\theta) P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \text{d}\theta \\ & \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta \mid D)} \frac{g_{HM}(\theta_i)}{P(D \mid \theta_i) P(\theta_i)} \end{align*} \]

choose a \(g_{HM}(\theta)\) that resembles the posterior
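
A minimal sketch for the coin-flip example, taking \(g_{HM}\) to be \(\text{Beta}(8,18)\), which resembles (here: equals) the posterior; the true value is again \(0.04\):

```r
# generalized harmonic mean estimate of P(D), using samples from the (known) posterior
k <- 7; N <- 24
set.seed(1)
theta_post <- rbeta(1e5, k + 1, N - k + 1)     # posterior samples (exact for this example)
g_hm <- function(th) dbeta(th, 8, 18)          # g_HM chosen to resemble the posterior
w <- g_hm(theta_post) / (dbinom(k, N, theta_post) * dbeta(theta_post, 1, 1))
1 / mean(w)                                    # about 1/25 = 0.04
```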

for more info see the bridge sampling tutorial

     

bridge sampling

bridge sampling

## [1] "Approximate BF in favor of complex model: 4.394"
## [1] "Approximate error percentage: 0.00718%"
## [2] "Approximate error percentage: 0.0064%"
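
Numbers like these are typically obtained with the bridgesampling R package; a hedged sketch, assuming two fitted Stan models fit_M0 and fit_M1 (hypothetical object names, not the code actually run here):

```r
# bridge sampling via the `bridgesampling` package
library(bridgesampling)
ml_M0 <- bridge_sampler(fit_M0, silent = TRUE)   # (log) marginal likelihood of M0
ml_M1 <- bridge_sampler(fit_M1, silent = TRUE)   # (log) marginal likelihood of M1
bf(ml_M1, ml_M0)                                 # Bayes factor in favor of the complex model
error_measures(ml_M1)$percentage                 # approximate percentage error of the estimate
```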

outlook & preparatory reading

 

Thursday

  • predictions & hypothesis testing

  • read Kruschke Chapter 12