06: Model comparison

Bayesian regression: theory & practice

Author

Michael Franke

Preamble

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

Toggle code
# install packages from CRAN (unless installed)
pckgs_needed <- c(
  "tidyverse",
  "brms",
  "rstan",
  "rstanarm",
  "remotes",
  "tidybayes",
  "bridgesampling",
  "shinystan",
  "mgcv"
)
pckgs_installed <- installed.packages()[,"Package"]
pckgs_2_install <- pckgs_needed[!(pckgs_needed %in% pckgs_installed)]
if(length(pckgs_2_install)) {
  install.packages(pckgs_2_install)
} 

# install additional packages from GitHub (unless installed)
if (! "aida" %in% pckgs_installed) {
  remotes::install_github("michael-franke/aida-package")
}
if (! "faintr" %in% pckgs_installed) {
  remotes::install_github("michael-franke/faintr")
}
if (! "cspplot" %in% pckgs_installed) {
  remotes::install_github("CogSciPrag/cspplot")
}

# load the required packages
x <- lapply(pckgs_needed, library, character.only = TRUE)
library(aida)
library(faintr)
library(cspplot)

# these options help Stan run faster
options(mc.cores = parallel::detectCores())

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors() |> pull(hex)
# names(project_colors) <- cspplot::list_colors() |> pull(name)

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
   scale_fill_manual(..., values = project_colors)
}
Toggle code
dolphin <- aida::data_MT
rerun_models = FALSE

Comparing models with LOO-CV and Bayes factors

Suppose that the ground truth generating our data is a robust regression model (a robust regression uses a Student's t distribution as its likelihood function):

Toggle code
set.seed(1970)

# number of observations
N <- 100
# 100 samples from a standard normal
x <- rnorm(N, 0, 1)

intercept <- 2
slope <- 4

# robust regression with a Student's t error distribution
# with 1 degree of freedom
y <- rt(N, df = 1, ncp = slope * x + intercept)

data_robust <- tibble(x = x, y = y)

A plot of the data shows that we have quite a few “outliers”:

Toggle code
ggplot(data_robust, aes(x = x, y = y)) + 
  geom_smooth(color = project_colors[1], method = "lm") +
  geom_point(color = project_colors[2], size = 2, alpha = 0.8)

We are going to compare two models for this data, a normal regression model and a robust regression model.

Normal and robust regression models

A normal regression model uses a normal error function.

Toggle code
fit_n <- brm(
  formula = y ~ x,
  data = data_robust,
  # student-t prior for the slope coefficient
  prior = prior("student_t(1,0,30)", class = "b")
)

We will want to compare this normal regression model with a robust regression model, which uses a Student’s t distribution instead as the error function around the linear predictor:

Toggle code
fit_r <- brm(
  formula = y ~ x,
  data = data_robust,
  # student prior for slope coefficient
  prior = prior("student_t(1,0,30)", class = "b"),
  family = student()
)

Let’s look at the posterior inferences of both models about the true (known) parameters of the regression line:

Toggle code
prep_summary <- function(fit, model) {
  tidybayes::summarise_draws(fit) |> 
    mutate(model = model) |> 
    select(model, variable, q5, mean, q95) |> 
    filter(grepl(variable, pattern = '^b'))  
}

rbind(prep_summary(fit_n, "normal"), prep_summary(fit_r, "robust"))
# A tibble: 4 x 5
  model  variable       q5  mean   q95
  <chr>  <chr>       <num> <num> <num>
1 normal b_Intercept  2.31  7.54 13.0 
2 normal b_x          7.22 13.4  19.4 
3 robust b_Intercept  1.81  2.49  3.24
4 robust b_x          4.98  6.09  7.23

Remember that the true intercept is 2 and the true slope is 4. Clearly the robust regression model has recovered the ground-truth parameters much better.
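As an optional visual check (a small sketch, not part of the comparison workflow below), we could also inspect posterior predictive checks for both fitted models using brms' pp_check(); the robust model's predictive draws should track the heavy-tailed data more closely.

Toggle code
# visual posterior predictive checks (density overlays) for both models;
# ndraws controls how many posterior predictive draws are overlaid
pp_check(fit_n, ndraws = 50) + ggtitle("normal regression")
pp_check(fit_r, ndraws = 50) + ggtitle("robust regression")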

Leave-one-out cross validation

We can use the loo package to compare these two models based on their posterior predictive fit. Here’s how:

Toggle code
loo_comp <- loo_compare(list(normal = loo(fit_n), robust = loo(fit_r)))
loo_comp
       elpd_diff se_diff
robust    0.0       0.0 
normal -131.7      26.0 

We see that the robust regression model is better by roughly 132 points of expected log predictive density (ELPD). The table shown above is ordered with the “best” model on top. The column elpd_diff lists the difference in ELPD between each model and the “best” one. In our case, the estimated ELPD difference has a standard error of about 26. Computing a \(p\)-value for this using Lambert’s \(z\)-score method, we find that this difference is “significant” (for which we will use other terms like “noteworthy” or “substantial” in the following):

Toggle code
1 - pnorm(-loo_comp[2,1], loo_comp[2,2])
[1] 0

We conclude from this that the robust regression model is much better at predicting the data (from a posterior point of view).

Bayes factor model comparison (with bridge sampling)

We use bridge sampling, as implemented in the formidable bridgesampling package, to estimate the (log) marginal likelihood of each model. Bridge sampling also requires samples from the prior, and to work reliably it needs many more samples than we would normally draw for posterior inference. We can update() the existing fitted models, so that we do not have to copy-paste all specifications (formula, data, prior, …) each time. For bridge_sampler() to work, it is important that we save all parameters (including prior samples), which is what save_pars = save_pars(all = TRUE) does.

Toggle code
if (rerun_models) {
  # refit normal model
  fit_n_4Bridge <- update(
    fit_n,
    iter = 5e5,
    save_pars = save_pars(all = TRUE)
  )
  # refit robust model
  fit_r_4Bridge <- update(
    fit_r,
    iter = 5e5,
    save_pars = save_pars(all = TRUE)
  )
  normal_bridge <- bridge_sampler(fit_n_4Bridge, silent = TRUE)
  write_rds(normal_bridge, "06-normal_bridge.rds")
  robust_bridge <- bridge_sampler(fit_r_4Bridge, silent = TRUE)
  write_rds(robust_bridge, "06-robust_bridge.rds")
} else {
  normal_bridge <- read_rds("06-normal_bridge.rds")  
  robust_bridge <- read_rds("06-robust_bridge.rds")
}

bf_bridge <- bridgesampling::bf(robust_bridge, normal_bridge)

We can then use the bf (Bayes factor) method from the bridgesampling package to get the Bayes factor (here: in favor of the robust regression model):

Toggle code
bf_bridge
Estimated Bayes factor in favor of robust_bridge over normal_bridge: 41136407426471809154394622543225576928422395904.00000

As you can see, this is a very clear result. If we had equal levels of credence in both models before seeing the data, then after seeing the data our degree of belief in the robust regression model should be … well, virtually infinitely higher than our degree of belief in the normal model.
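Equivalently, we can recover this number by hand from the estimated log marginal likelihoods stored in the two bridge-sampling objects (a minimal sketch, assuming the logml() accessor from the bridgesampling package); for Bayes factors this extreme it is often more convenient to report the difference on the log scale.

Toggle code
# log marginal likelihood estimates from bridge sampling
logml_robust <- bridgesampling::logml(robust_bridge)
logml_normal <- bridgesampling::logml(normal_bridge)

# log Bayes factor (robust over normal) and its exponentiated value
logml_robust - logml_normal
exp(logml_robust - logml_normal)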

Comparing LOO-CV and Bayes factors

LOO-CV and Bayes factors gave similar results in the previous example. The results are qualitatively the same: the (true) robust regression model is preferred over the (false) normal regression model. Both methods also give quantitative results, but here only the Bayes factor result has a clear intuitive interpretation.
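One way to make the Bayes factor result more tangible is to turn it into posterior model probabilities under the assumption of equal prior model probabilities; here is a small sketch using bridgesampling's post_prob() on the bridge objects from above.

Toggle code
# posterior model probabilities, assuming equal prior probabilities for both models
bridgesampling::post_prob(robust_bridge, normal_bridge)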

In this section we will explore the main conceptual difference between LOO-CV and Bayes factors, which is:

  • LOO-CV compares models from a data-informed, ex post point of view based on a (repeatedly computed) posterior predictive distribution
  • Bayes factor model comparison takes a data-blind, ex ante point of view based on the prior predictive distribution

What does that mean in practice? – To see the crucial difference, imagine that you have tons of data, so much that it completely trumps your prior. LOO-CV can use this data to emancipate itself from any wrong or overly uninformative prior structure. Bayes factor comparison cannot. If a Bayesian model is a likelihood function AND a prior, Bayes factors give the genuine Bayesian comparison, taking the prior into account. That is what you want when your priors are really part of your theoretical commitment. If you are after prediction based on weak priors AND a ton of data to train on, you should not use Bayes factors.
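To make the ex ante vs. ex post contrast concrete, here is a hedged sketch (not used in the remainder of this chapter) of how one could inspect both predictive distributions for the normal regression model in brms, using a prior-only refit; fit_n_prior, prior_pred and posterior_pred are illustrative names.

Toggle code
# refit the normal model, sampling from the prior only (ex ante perspective)
fit_n_prior <- update(fit_n, sample_prior = "only")

# predictive draws for the observed x-values
prior_pred     <- posterior_predict(fit_n_prior)  # based on the prior alone
posterior_pred <- posterior_predict(fit_n)        # informed by the data

# the prior predictive distribution is typically much more spread out
quantile(prior_pred, probs = c(0.05, 0.95))
quantile(posterior_pred, probs = c(0.05, 0.95))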

To see the influence of priors on model comparison, we are going to look at a very simple data set generated from a standard normal distribution.

Toggle code
# number of observations
N <- 100
# data from a standard normal
y <- rnorm(N)
# list of data for Stan
data_normal <- tibble(y = y)
Exercise 1a

Use brms to implement two models for inferring a Gaussian distribution.

  • The first one has narrow priors for its parameters (mu and sigma), namely a Student’s \(t\) distribution with \(\nu = 1\), \(\mu = 0\) and \(\sigma = 10\).
  • The second one has wide priors for its parameters (mu and sigma), namely a Student’s \(t\) distribution with \(\nu = 1\), \(\mu = 0\) and \(\sigma = 1000\).
Toggle code
fit_narrow <- brm(
  formula = y ~ 1,
  data = data_normal,
  prior = c(prior("student_t(1,0,10)", class = "Intercept"),
            prior("student_t(1,0,10)", class = "sigma"))
)
fit_wide <- brm(
  formula = y ~ 1,
  data = data_normal,
  prior = c(prior("student_t(1,0,1000)", class = "Intercept"),
            prior("student_t(1,0,1000)", class = "sigma"))
)
Exercise 1b

Compare the models with LOO-CV, using the loo package, and interpret the outcome.

Toggle code
loo_compare(
  list(
    wide   = loo(fit_wide),
    narrow = loo(fit_narrow)
  )
)
       elpd_diff se_diff
narrow  0.0       0.0   
wide   -0.1       0.0   

The models are practically indistinguishable (equally good/bad) based on LOO-CV.

Exercise 1c

Use the bridgesampling package to find an (approximate) Bayes factor for this model comparison.

Toggle code
if (rerun_models) {
  fit_narrow_4Bridge <- update(
    object = fit_narrow,
    iter = 5e5,
    save_pars = save_pars(all = TRUE)
  )
  
  fit_wide_4Bridge <- update(
    object = fit_wide,
    iter = 5e5,
    save_pars = save_pars(all = TRUE)
  )
  
  narrow_bridge <- bridge_sampler(fit_narrow_4Bridge, silent = TRUE)
  write_rds(narrow_bridge, "06-narrow_bridge.rds")
  
  wide_bridge   <- bridge_sampler(fit_wide_4Bridge, silent = TRUE)
  write_rds(wide_bridge, "06-wide_bridge.rds")
  
} else {
  narrow_bridge <- read_rds("06-narrow_bridge.rds")  
  wide_bridge <- read_rds("06-wide_bridge.rds")
}

bridgesampling::bf(narrow_bridge, wide_bridge)
Estimated Bayes factor in favor of narrow_bridge over wide_bridge: 9868.34657

The Bayes factor in favor of the narrow model is about 10,000. That is massive evidence that, from a prior point of view, the narrow model is much better.

Exercise 1d

If all went well, you should have seen a difference between the LOO-based and the BF-based model comparison. Explain what’s going on in your own words.

Since BF-based comparison looks at the models from the prior point of view, the model with wide priors is less precise: it puts prior weight on a lot of “bad” parameter values and so achieves a much weaker prior predictive fit.

The LOO-based estimates are virtually identical because both models have rather flexible, not too strong priors, so the data produces roughly the same posterior in both models.
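To back this up, we can put the posterior summaries of the two models side by side (a quick sketch; the q5 and q95 columns come from summarise_draws(), as used earlier in this chapter):

Toggle code
# compare posterior summaries for the mean (Intercept) and sigma across models
rbind(
  tidybayes::summarise_draws(fit_narrow) |> mutate(model = "narrow"),
  tidybayes::summarise_draws(fit_wide)   |> mutate(model = "wide")
) |> 
  filter(variable %in% c("b_Intercept", "sigma")) |> 
  select(model, variable, q5, mean, q95)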

Comparing (hierarchical) regression models

We are going to consider an example from the mouse-tracking data. We use categorical variables group and condition to predict MAD measures. We are going to compare different models, including models which only differ with respect to random effects.

Let’s have a look at the data first to remind ourselves:

Toggle code
# keep only correct trials
dolphin <- dolphin %>% 
  filter(correct == 1) 

# plotting the data
ggplot(data = dolphin, 
       aes(x = MAD, 
           color = condition, fill = condition)) + 
  geom_density(alpha = 0.3, linewidth = 0.4, trim = F) +
  facet_grid(~group) +
  xlab("MAD")

Exercise 2a

Set up four regression models and run them via brms:

  1. Store in variable model1_NOinteraction_FE a regression with MAD as dependent variable, and as explanatory variables group and condition (but NOT the interaction between these two).
  2. Store in variable model2_interaction_FE a regression with MAD as dependent variable, and as explanatory variables group, condition and the interaction between these two.
  3. Store in variable model3_interaction_RandSlopes a model like model2_interaction_FE but additionally including random effects, namely random intercepts for the factor subject_id.
  4. Store in model4_interaction_MaxRE a model like model2_interaction_FE but with the maximal random effects structure licensed by the design of the experiment.
Toggle code
model1_NOinteraction_FE = brm(
  MAD ~ condition + group, 
  data = dolphin
) 

model2_interaction_FE = brm(
  MAD ~ condition * group, 
  data = dolphin
)

model3_interaction_RandSlopes = brm(
  MAD ~ condition * group + (1 | subject_id), 
  data = dolphin
) 

model4_interaction_MaxRE = brm(
  MAD ~ condition * group + (1 + group | exemplar) + (1 | subject_id), 
  data = dolphin
) 
Exercise 2b

This exercise and the next are meant to have you think more deeply about the relation (or unrelatedness) of posterior inference and model comparison. Remember that, conceptually, these are two really different things.

To begin with, look at the summary of posterior estimates for model model2_interaction_FE. Based on these results, what would you expect: is the inclusion of the interaction term relevant for LOO-based model comparison? In other words, do you think that model2_interaction_FE is better than, equal to, or worse than model1_NOinteraction_FE under LOO-based model comparison? Explain your answer.

Toggle code
model2_interaction_FE
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: MAD ~ condition * group 
   Data: dolphin (Number of observations: 1915) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                     278.44     15.71   248.29   309.11 1.00     2173
conditionTypical             -135.99     18.89  -172.86   -99.45 1.00     2006
grouptouch                   -203.60     21.78  -246.56  -160.87 1.00     1928
conditionTypical:grouptouch   111.81     25.84    59.77   161.80 1.01     1656
                            Tail_ESS
Intercept                       2597
conditionTypical                2498
grouptouch                      2326
conditionTypical:grouptouch     2192

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   267.17      4.38   258.91   276.02 1.00     3371     2789

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The coefficient for the interaction term is credibly different from zero, in fact quite large. We would therefore expect that the data “needs” the interaction term; a model without it is likely to fare worse.

Exercise 2c

Now compare the models directly using loo_compare. Compute the \(p\)-value (following Lambert) and draw a conclusion about which, if any, of the two models is notably favored by LOO model comparison.

Toggle code
loo_comp <- loo_compare(
  loo(model1_NOinteraction_FE),
  loo(model2_interaction_FE)
)
loo_comp
                        elpd_diff se_diff
model2_interaction_FE    0.0       0.0   
model1_NOinteraction_FE -7.6       4.7   

The model model1_NOinteraction_FE is worse by about 7.6 points of expected log predictive density, with a standard error of about 4.7. This translates into a “significant” difference, leading to the conclusion that the model with the interaction term is really better:

Toggle code
1 - pnorm(-loo_comp[2,1], loo_comp[2,2])
[1] 0.002287463
Exercise 2d

Now, let’s also compare models that differ only in their random effects structure. We start by looking at the posterior summaries for model4_interaction_MaxRE. Just by looking at the estimated coefficients for the random effects (standard deviations), would you conclude that these variables are important (e.g., that the data provides support for these parameters to be non-negligible)?

Toggle code
model4_interaction_MaxRE
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: MAD ~ condition * group + (1 + group | exemplar) + (1 | subject_id) 
   Data: dolphin (Number of observations: 1915) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~exemplar (Number of levels: 19) 
                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)                36.21     14.54     8.84    67.15 1.00      848
sd(grouptouch)               25.81     16.21     1.29    61.12 1.00     1134
cor(Intercept,grouptouch)    -0.67      0.42    -0.99     0.66 1.00     1904
                          Tail_ESS
sd(Intercept)                  828
sd(grouptouch)                1516
cor(Intercept,grouptouch)     1834

~subject_id (Number of levels: 108) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    77.09      8.68    60.75    94.60 1.00     1944     3137

Population-Level Effects: 
                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                     279.78     24.37   233.14   329.54 1.00     1902
conditionTypical             -137.54     26.21  -188.59   -84.53 1.00     1936
grouptouch                   -201.87     28.84  -259.86  -145.65 1.00     1840
conditionTypical:grouptouch   110.71     29.68    52.59   170.22 1.00     2269
                            Tail_ESS
Intercept                       2391
conditionTypical                2186
grouptouch                      2296
conditionTypical:grouptouch     2421

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma   254.89      4.30   246.72   263.44 1.00     4880     2815

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

All the parameters from the RE structure, except for the correlation term, are credibly bigger than zero, suggesting that the data lends credence to these factors playing a role in the data-generating process.
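For reference, here is a small sketch of how one could extract just these group-level (random-effect) summaries from the fitted model; the regular expression relies on brms' convention of prefixing these parameters with sd_ and cor_.

Toggle code
# summarise only the group-level standard deviations and correlations
tidybayes::summarise_draws(model4_interaction_MaxRE) |> 
  filter(grepl("^sd_|^cor_", variable)) |> 
  select(variable, q5, mean, q95)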

Exercise 2e

Compare the models model3_interaction_RandSlopes and model4_interaction_MaxRE with LOO-CV. Compute Lambert’s \(p\)-value and draw conclusions about which, if any, of these models is to be preferred by LOO-CV. Also, comment on the results from 2b through 2e in comparison: are the results the same, comparable, or different, and why?

Toggle code
loo_comp <- loo_compare(
  loo(model4_interaction_MaxRE),
  loo(model3_interaction_RandSlopes)
)

loo_comp
                              elpd_diff se_diff
model4_interaction_MaxRE       0.0       0.0   
model3_interaction_RandSlopes -1.7       3.8   
Toggle code
1 - pnorm(-loo_comp[2,1], loo_comp[2,2])
[1] 0.980104

LOO-CV finds no noteworthy difference between these models. In tendency, the larger model fares slightly better, but the difference is negligible. This contrasts with the previous case in 2b/c, which might have suggested that if a parameter \(\theta\) is credibly different from 0, then we also prefer to have it in the model for predictive accuracy. That is not what happens here (2d/e): the additional random-effect parameters are credibly different from zero, yet they do not buy a noteworthy gain in predictive fit. An explanation for this could be that not all parameters are equal: the fixed-effect terms have more impact on the variability of predictions than the random-effect terms.

Exercise 2f

Compare all four models using LOO-CV with method loo_compare and interpret the outcome. Which model is, or which models are the best?

Toggle code
loo_compare(
  loo(model1_NOinteraction_FE),
  loo(model2_interaction_FE),
  loo(model3_interaction_RandSlopes),
  loo(model4_interaction_MaxRE)
)
                              elpd_diff se_diff
model4_interaction_MaxRE        0.0       0.0  
model3_interaction_RandSlopes  -1.7       3.8  
model2_interaction_FE         -47.8      13.6  
model1_NOinteraction_FE       -55.3      14.7  
Toggle code
# Model 4 is best, followed by model 3. 
# As models 3 and 4 are practically indistinguishable by LOO-CV (from part 2e), 
# we only need to check whether model 4 is better than model 2, which is in third place:

loo_comp <- loo_compare(
  loo(model4_interaction_MaxRE),
  loo(model2_interaction_FE)
)

# We find that this difference is substantial:
1 - pnorm(-loo_comp[2,1], loo_comp[2,2])
[1] 0