Bayesian regression: theory & practice

04a: MCMC diagnostics (demonstrations)

Author

Michael Franke

Load relevant packages and “set the scene.”

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

# install packages from CRAN (unless installed)
pckgs_needed <- c(
  "tidyverse",
  "brms",
  "remotes",
  "tidybayes",
  "shinystan"
)
pckgs_installed <- installed.packages()[,"Package"]
pckgs_2_install <- pckgs_needed[!(pckgs_needed %in% pckgs_installed)]
if(length(pckgs_2_install)) {
  install.packages(pckgs_2_install)
} 

# install additional packages from GitHub (unless installed)
if (! "aida" %in% pckgs_installed) {
  remotes::install_github("michael-franke/aida-package")
}
if (! "faintr" %in% pckgs_installed) {
  remotes::install_github("michael-franke/faintr")
}
if (! "cspplot" %in% pckgs_installed) {
  remotes::install_github("CogSciPrag/cspplot")
}

# load the required packages
x <- lapply(pckgs_needed, library, character.only = TRUE)
library(aida)
library(faintr)
library(cspplot)

# these options help Stan run faster
options(mc.cores = parallel::detectCores())

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors()[c(1,3,4,5,2,6:14),"hex", drop = TRUE]

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
   scale_fill_manual(..., values = project_colors)
}
dolphin <- aida::data_MT
my_scale <- function(x) c(scale(x))

This tutorial provides demonstrations of how to check the quality of MCMC samples obtained from brms model fits.

A good model

To have something to go on, here are two model fits: one of them is good, the other is … total crap. The first model fits a smooth line to the average world temperature over the years. (We need to set the seed here to have reproducible results.)

fit_good <- brm(
  formula = avg_temp ~ s(year), 
  data = aida::data_WorldTemp,
  seed = 1969
) 

The good model is rather well behaved. Here is a generic plot of its marginal posterior densities and traceplots:

plot(fit_good)

Traceplots look like madly-in-love caterpillars doing their thing.
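If you want to inspect the trace of a single parameter, you can hand the posterior draws to bayesplot directly. A minimal sketch (as.array() on a brmsfit returns an iterations × chains × parameters array; b_Intercept is the internal name of the intercept):

bayesplot::mcmc_trace(as.array(fit_good), pars = "b_Intercept")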

We can also check \(\hat{R}\) and effective sample sizes in the summary of the model:

summary(fit_good)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: avg_temp ~ s(year) 
   Data: aida::data_WorldTemp (Number of observations: 269) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(syear_1)     3.56      1.08     1.96     6.17 1.00      907     1689

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     8.31      0.02     8.27     8.35 1.00     3732     2652
syear_1      14.55      2.25    10.11    19.08 1.00     2182     2371

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.33      0.01     0.30     0.36 1.00     3823     2962

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
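If you would rather extract these diagnostics programmatically (say, for automated checks), rhat() and neff_ratio() are re-exported by brms. A sketch:

# named vector of R-hat values for all parameters
rhats <- rhat(fit_good)
rhats[rhats > 1.01]   # flag anything above the conventional 1.01 cutoff

# ratio of effective sample size to total number of draws
neff_ratio(fit_good) |> sort() |> head()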

Interestingly, there is one warning message about one divergent transition. The warning recommends checking the pairs() plot, so here goes:

pairs(fit_good)

This is actually not too bad. (Wait until you see a terrible case below!)
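If you want a count of divergent transitions instead of a warning message, nuts_params() (also re-exported by brms) returns the sampler's diagnostics as a data frame. A sketch:

np <- nuts_params(fit_good)
# total number of post-warmup divergent transitions (one, in this case)
sum(np[np$Parameter == "divergent__", "Value"])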

We can try to fix this problem with a single divergent transition by doing what the warning message recommends, namely increasing the adapt_delta parameter in the control structure:

fit_good_adapt <- brm(
  formula = avg_temp ~ s(year), 
  data = aida::data_WorldTemp,
  seed = 1969,
  control = list(adapt_delta = 0.9)
) 

summary(fit_good_adapt)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: avg_temp ~ s(year) 
   Data: aida::data_WorldTemp (Number of observations: 269) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(syear_1)     3.59      1.09     1.97     6.30 1.00     1003     1613

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     8.31      0.02     8.27     8.35 1.00     3632     2619
syear_1      14.59      2.40     9.96    19.34 1.00     2425     2510

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.33      0.01     0.30     0.36 1.00     3588     2868

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

That looks better, but what did we just do? — When the sampler “warms up”, it tunes itself to the problem at hand, in particular its step size. The adapt_delta parameter is the target average acceptance probability for this adaptation: the higher it is set, the smaller the step size the sampler settles on, which makes divergent transitions less likely. So with a small problem like this, just making the adaptation more ambitious may have solved the problem. It has, however, also made the sampling slower and less efficient.
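We can actually watch this mechanism at work by inspecting the step sizes the sampler settled on. A sketch (the underlying stanfit object lives in the $fit slot of a brmsfit; stepsize__ should shrink as adapt_delta grows):

sp <- rstan::get_sampler_params(fit_good_adapt$fit, inc_warmup = FALSE)
# average post-warmup step size per chain
sapply(sp, function(chain) mean(chain[, "stepsize__"]))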

A powerful interactive tool for exploring a fitted model (diagnostics and more) is shinystan:

shinystan::launch_shinystan(fit_good_adapt)

A terrible model

The main (maybe only) reason for serious problems with NUTS sampling is this: sampling issues arise for bad models. So, let’s come up with a really stupid model.

Here’s a model that is like the previous one but adds a second predictor: an ordinary (non-smoothed) regression coefficient for a variable that is almost identical to the original year information. You may already intuit that this cannot possibly be a good idea; the model is notionally deficient. So, we expect nightmares during sampling:

fit_bad <- brm(
  formula = avg_temp ~ s(year) + year_perturbed, 
  data = aida::data_WorldTemp |> mutate(year_perturbed = rnorm(1,year,0.001)),
  seed = 1969
) 

summary(fit_bad)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: avg_temp ~ s(year) + year_perturbed 
   Data: mutate(aida::data_WorldTemp, year_perturbed = rnor (Number of observations: 269) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(syear_1)     3.62      1.09     2.01     6.24 1.00     1239     1594

Population-Level Effects: 
                       Estimate        Est.Error          l-95% CI
Intercept      2225135220368.58 2790632812490.53 -1139137522767.06
year_perturbed   -1271506170.07    1594647735.12    -4693478867.37
syear_1                   14.53             2.32             10.16
                       u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      8213585887067.27 1.76        6       35
year_perturbed     650935896.17 1.76        6       35
syear_1                   19.27 1.00     2285     2953

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.33      0.01     0.30     0.36 1.00     4150     2656

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Indeed, that looks pretty bad. We managed to score badly on all major counts (a programmatic tally follows the list):

  • large \(\hat{R}\)
  • extremely poor effective sample size
  • ridiculously far-ranging posterior estimates for the main model components
  • tons of divergent transitions
  • maximum treedepth reached more often than a hipster touches their phone in a week
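Here is how to tally the last two pathologies programmatically, assuming the default max_treedepth of 10:

np_bad <- nuts_params(fit_bad)
sum(np_bad[np_bad$Parameter == "divergent__", "Value"])        # divergent transitions
sum(np_bad[np_bad$Parameter == "treedepth__", "Value"] >= 10)  # max treedepth hits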

Some of these caterpillars look like they are in a vicious rose war:

plot(fit_bad)

We also see that the intercept and the slope for year_perturbed are the main troublemakers (in terms of traceplots).

Interestingly, a simple posterior predictive check doesn’t look half-bad:

pp_check(fit_bad)

But now, have a look at the pairs() plot:

pairs(fit_bad)

Aha, there we see a clear problem! The joint posterior for the intercept and the slope for year_perturbed looks like a line. This means that these parameters could in principle do the same “job”.
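We can put a number on this: the correlation between the posterior draws for these two parameters should be very nearly perfect. A sketch using as_draws_df(), which is re-exported by brms:

draws <- as_draws_df(fit_bad)
cor(draws$b_Intercept, draws$b_year_perturbed)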

This suggests a possible solution strategy. The model is too unconstrained: it allows these two parameters to meander to wherever they want (or so it seems). We could therefore try reining them in by specifying priors, like so:

fit_bad_wPrior <- brm(
  formula = avg_temp ~ s(year) + year_perturbed, 
  data = aida::data_WorldTemp |> mutate(year_perturbed = rnorm(1,year,0.001)),
  seed = 1969,
  prior = prior("student_t(1,0,5)", coef = "year_perturbed")
) 

summary(fit_bad_wPrior)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: avg_temp ~ s(year) + year_perturbed 
   Data: mutate(aida::data_WorldTemp, year_perturbed = rnor (Number of observations: 269) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(syear_1)     3.62      1.08     2.04     6.16 1.00     1074     1808

Population-Level Effects: 
               Estimate Est.Error  l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept       -187.55  58370.12 -91895.80 96032.53 1.02     2239      613
year_perturbed     0.11     33.35    -54.87    52.52 1.02     2239      613
syear_1           14.60      2.32      9.93    19.16 1.00     2006     2539

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.33      0.01     0.30     0.36 1.00     3180     2729

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Well, alright! That isn’t too bad anymore. But it is still clear from the posterior pairs plot that this model has two parameters that steal each other’s show. The model remains a bad model … for our data.

pairs(fit_bad_wPrior)

Here’s what’s wrong: year_perturbed is a constant! The call rnorm(1, year, 0.001) requests a single draw, so R uses only the first element of year as the mean and recycles that one number across all 269 rows. The model is a crappy model of the data, because the data is not what we thought it would be. Check it out:

aida::data_WorldTemp |> mutate(year_perturbed = rnorm(1,year,0.001))
# A tibble: 269 × 5
    year anomaly uncertainty avg_temp year_perturbed
   <dbl>   <dbl>       <dbl>    <dbl>          <dbl>
 1  1750  -1.41        NA        7.20          1750.
 2  1751  -1.52        NA        7.09          1750.
 3  1753  -1.07         1.3      7.54          1750.
 4  1754  -0.614        1.09     8.00          1750.
 5  1755  -0.823        1.24     7.79          1750.
 6  1756  -0.547        1.28     8.06          1750.
 7  1757  -0.438        1.31     8.17          1750.
 8  1758  -2.42         1.76     6.19          1750.
 9  1759  -1.53         2.25     7.08          1750.
10  1760  -2.46         2.75     6.14          1750.
# … with 259 more rows

So, we basically ran a model with two intercepts!?!
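Here is a minimal demonstration of the recycling behavior (the numbers are arbitrary):

# n = 1 requests a single draw, so only the first mean is used
rnorm(1, mean = c(1750, 1751, 1753), sd = 0.001)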

Let’s try again:

data_WorldTemp_perturbed <- aida::data_WorldTemp |> 
    mutate(year_perturbed = rnorm(nrow(aida::data_WorldTemp),year, 50))
data_WorldTemp_perturbed
# A tibble: 269 × 5
    year anomaly uncertainty avg_temp year_perturbed
   <dbl>   <dbl>       <dbl>    <dbl>          <dbl>
 1  1750  -1.41        NA        7.20          1813.
 2  1751  -1.52        NA        7.09          1697.
 3  1753  -1.07         1.3      7.54          1688.
 4  1754  -0.614        1.09     8.00          1675.
 5  1755  -0.823        1.24     7.79          1687.
 6  1756  -0.547        1.28     8.06          1778.
 7  1757  -0.438        1.31     8.17          1674.
 8  1758  -2.42         1.76     6.19          1766.
 9  1759  -1.53         2.25     7.08          1672.
10  1760  -2.46         2.75     6.14          1806.
# … with 259 more rows

That’s more like what we thought it was: year_perturbed is supposed to be a noisy version of the actual year. So, let’s try again, leaving out the smoothing, just for some more chaos-loving fun:

fit_bad_2 <- brm(
  formula = avg_temp ~ year + year_perturbed, 
  data = data_WorldTemp_perturbed,
  seed = 1969,
  prior = prior("student_t(1,0,5)", coef = "year_perturbed")
) 

summary(fit_bad_2)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: avg_temp ~ year + year_perturbed 
   Data: data_WorldTemp_perturbed (Number of observations: 269) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -3.45      0.58    -4.58    -2.32 1.00     4456     3133
year               0.01      0.00     0.01     0.01 1.00     4059     2558
year_perturbed    -0.00      0.00    -0.00     0.00 1.00     4094     2708

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.40      0.02     0.37     0.44 1.00     1456     1573

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

There are no warnings, so this model must be good, right? – No!

If we check the pairs plot, we see that we have now introduced a fair correlation between the coefficients for the two predictor variables.

pairs(fit_bad_2)
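The collinearity is baked into the data itself, as a quick check shows:

# year and its noisy copy are strongly correlated by construction
cor(data_WorldTemp_perturbed$year, data_WorldTemp_perturbed$year_perturbed)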

We should just not have included year_perturbed; it’s nonsense, and it shows in the diagnostics.
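Dropping the redundant predictor takes us back to a model that should sample without complaint; a minimal sketch (fit_sane is our name for it, not a fit from above):

fit_sane <- brm(
  formula = avg_temp ~ year, 
  data = aida::data_WorldTemp,
  seed = 1969
) 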

You can diagnose more using shinystan:

shinystan::launch_shinystan(fit_bad)