Bayesian regression: theory & practice

03a: Generalized linear models

Author

Michael Franke

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

Code
# install packages from CRAN (unless installed)
pckgs_needed <- c(
  "tidyverse",
  "brms",
  "remotes",
  "tidybayes",
  "bridgesampling",
  "shinystan"
)
pckgs_installed <- installed.packages()[,"Package"]
pckgs_2_install <- pckgs_needed[!(pckgs_needed %in% pckgs_installed)]
if(length(pckgs_2_install)) {
  install.packages(pckgs_2_install)
} 

# install additional packages from GitHub (unless installed)
if (! "aida" %in% pckgs_installed) {
  remotes::install_github("michael-franke/aida-package")
}
if (! "faintr" %in% pckgs_installed) {
  remotes::install_github("michael-franke/faintr")
}
if (! "cspplot" %in% pckgs_installed) {
  remotes::install_github("CogSciPrag/cspplot")
}

# load the required packages
x <- lapply(pckgs_needed, library, character.only = TRUE)
library(aida)
library(faintr)
library(cspplot)

# these options help Stan run faster
options(mc.cores = parallel::detectCores())

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors()[c(1,3,4,5,2,6:14),"hex", drop = TRUE]

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
   scale_fill_manual(..., values = project_colors)
}
dolphin <- aida::data_MT

This tutorial covers common types of generalized linear regression models (GLMs): logistic, multinomial, ordinal, and Poisson regression.

The shared form of all of these GLMs is the following “feed-forward computation” (here illustrated for a single datum of the predicted variable \(y\), a vector \(x\) of predictor variables, and a vector of coefficients \(\beta\)):

  1. compute a linear predictor: \(\xi = x \cdot \beta\);
  2. compute a predictor of central tendency using an appropriate link function \(\text{LF}\): \(\eta = \text{LF}(\xi ; \theta_{\text{LF}})\);
  3. determine the likelihood function \(\text{LH}\): \(y \sim \text{LH}(\eta; \theta_{\text{LH}})\).

Link function and likelihood function may have additional free parameters, \(\theta_{\text{LF}}\) and \(\theta_{\text{LH}}\), to be fitted alongside the regression coefficients.

Simple linear regression is the special case of this scheme where the link function is just the identity function and the likelihood is given by \(y \sim \text{Normal}(\eta; \sigma)\). Different types of regression are used to account for different kinds of predicted variables \(y\):

type of \(y\) | (inverse) link function | likelihood function
--- | --- | ---
metric | \(\eta = \xi\) | \(y \sim \text{Normal}(\eta; \sigma)\)
binary | \(\eta = \text{logistic}(\xi)\) | \(y \sim \text{Bernoulli}(\eta)\)
nominal | \(\eta = \text{soft-max}(\xi)\) | \(y \sim \text{Categorical}(\eta)\)
ordinal | \(\eta = \text{cumulative-logit}(\xi; \delta)\) | \(y \sim \text{Categorical}(\eta)\)
count | \(\eta = \exp(\xi)\) | \(y \sim \text{Poisson}(\eta)\)
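
To make the three-step scheme concrete, here is a minimal sketch in R of the feed-forward computation for a single datum, using the logistic/Bernoulli case from the table; the values of \(x\) and \(\beta\) are made up purely for illustration.

# a single feed-forward pass for one datum (illustrative numbers)
x    <- c(1, 2.0, 0.5)                   # predictor vector (first entry: intercept term)
beta <- c(0.2, 0.4, -0.3)                # hypothetical coefficient vector
xi   <- sum(x * beta)                    # 1. linear predictor
eta  <- plogis(xi)                       # 2. link function (here: logistic) -> probability
y    <- rbinom(1, size = 1, prob = eta)  # 3. sample from the likelihood (here: Bernoulli)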

Logistic regression

Explanation

In logistic regression, the response variable \(y\) is binary, i.e., we want to predict the probability \(p\) with which one of two possible outcomes (henceforth: the reference outcome) occurs. The likelihood function for this case is the Bernoulli distribution. This requires a link function \(\text{LF}\) that maps real-valued linear predictor values \(\xi\) onto the unit interval. A common choice is the logistic function:

\[ \text{logistic}(\xi) = \frac{1}{1+ \exp(-\xi)} = \eta \]

The logistic regression model is then defined as:

\[ \begin{align*} \xi &= x \cdot \beta && \color{gray}{\text{[linear predictor]}} \\ \eta &= \text{logistic}(\xi) && \color{gray}{\text{[predictor of central tendency]}} \\ y & \sim \text{Bernoulli}(\eta) && \color{gray}{\text{[likelihood]}} \end{align*} \]

The linear predictor value \(\xi\) can be interpreted directly as the log odds of the predicted probability \(\eta\). This is because the inverse of the logistic function is the logit function, which has the following form:

\[ \text{logit}(\eta) = \log \frac{\eta}{1-\eta} = \xi \]

# the logit function is the inverse of the logistic function
logit <- function(x) log(x / (1 - x))

# plot the logit function (axis labels use the 'latex2exp' package, called via ::)
ggplot(data.frame(x = c(0.001, 1 - 0.001)), aes(x)) +
  stat_function(fun = logit, color = project_colors[2], linewidth = 2) +
  labs(x = latex2exp::TeX("$\\eta$"), y = latex2exp::TeX("$\\xi$ = logit($\\eta$)")) +
  ggtitle("logit function")

That also means that a difference between linear predictor values, e.g., between two groups in a logistic regression with a single binary categorical predictor variable (group A vs. group B), can be interpreted directly as something like an “evidence ratio” or “Bayes factor”. It is the log of the factor by which the odds change (e.g., when changing beliefs from \(\eta_2\) to \(\eta_1\)):

\[ \begin{align*} & \xi_1 - \xi_2 = \log \frac{\eta_1}{1-\eta_1} - \log \frac{\eta_2}{1-\eta_2} = \log \left ( \frac{\eta_1}{1-\eta_1} \frac{1-\eta_2}{\eta_2}\right ) \\ \Leftrightarrow & \frac{\eta_1}{1-\eta_1} = \exp (\xi_1 - \xi_2) \ \frac{\eta_2}{1-\eta_2} \end{align*} \]

For the purposes of understanding which priors are weakly or strongly informative, a unit difference in the linear predictor can be interpreted as a log Bayes factor (the factor changing prior odds into posterior odds). So a unit difference in the linear predictor corresponds to a Bayes factor of about \(\exp(1) \approx 2.72\).
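
Here is a quick numerical check of this interpretation (a minimal sketch; plogis() is base R's logistic function):

# probabilities for two linear predictor values one unit apart
eta_1 <- plogis(1)   # xi = 1
eta_2 <- plogis(0)   # xi = 0

# the ratio of the corresponding odds is exp(1), roughly 2.72
(eta_1 / (1 - eta_1)) / (eta_2 / (1 - eta_2))
exp(1)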

Example

Our hypothesis is that typical examples are easier to classify, so they should have higher accuracy than atypical ones. We are also interested in additional effects of group on accuracy.

As usual, we begin by plotting the relevant data.

# bootstrapped mean accuracy and CIs per group & condition
# (uses the 'tidyboot' package, called via ::; install it if needed)
sum_stats <- dolphin |> 
  group_by(group, condition) |> 
  tidyboot::tidyboot_mean(correct) |> 
  rename(accuracy = empirical_stat)
  
sum_stats
# A tibble: 4 × 7
# Groups:   group [2]
  group condition     n accuracy ci_lower  mean ci_upper
  <chr> <chr>     <int>    <dbl>    <dbl> <dbl>    <dbl>
1 click Atypical    318    0.874    0.833 0.874    0.910
2 click Typical     689    0.964    0.947 0.963    0.976
3 touch Atypical    330    0.909    0.878 0.909    0.940
4 touch Typical     715    0.941    0.923 0.941    0.958
sum_stats |> 
  ggplot(aes(x = condition, y = accuracy, group = group, color = group)) +
  geom_line(size = 1, position = position_dodge(0.2)) +
  geom_point(size = 3, position = position_dodge(0.2)) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), 
                width = 0.1, size = 0.35, position = position_dodge(0.2))

Visually, there might be a hint that typical trials had higher accuracy, but we cannot judge with the naked eye whether this is substantial.

A logistic regression, regressing correct against group * condition, may tell us more. To run the logistic regression, we must tell brms::brm() that we want to treat 0 and 1 as categories. To be explicit, and also to directly dictate which of the two categories is the reference level, we use a factor (of strings) with explicit ordering.

fit_logistic <- brm(
  formula = correct ~ group * condition,
  data = dolphin |> 
    mutate(correct = factor(ifelse(correct, "correct", "incorrect"),
                            levels = c("incorrect", "correct"))),
  family = bernoulli()
)

summary(fit_logistic)
 Family: bernoulli 
  Links: mu = logit 
Formula: correct ~ group * condition 
   Data: mutate(dolphin, correct = factor(ifelse(correct, " (Number of observations: 2052) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                       1.95      0.17     1.63     2.30 1.00     2679
grouptouch                      0.36      0.25    -0.12     0.86 1.00     1989
conditionTypical                1.34      0.27     0.83     1.87 1.00     1817
grouptouch:conditionTypical    -0.87      0.37    -1.59    -0.16 1.00     1671
                            Tail_ESS
Intercept                       2859
grouptouch                      2252
conditionTypical                2346
grouptouch:conditionTypical     2169

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

To test whether typical examples had credibly higher accuracy, the faintr package can be used like so:

compare_groups(
  fit_logistic,
  higher = condition == "Typical",
  lower  = condition == "Atypical"
)
Outcome of comparing groups: 
 * higher:  condition == "Typical" 
 * lower:   condition == "Atypical" 
Mean 'higher - lower':  0.9049 
95% HDI:  [ 0.5539 ; 1.283 ]
P('higher - lower' > 0):  1 
Posterior odds:  Inf 

Based on these results, we may conclude that, given the model and the data, we should believe that typical examples had higher accuracy.

Exercise 1a

Test whether there is reason to believe, given model and data, that the click group was more accurate than the touch group. (After all, the click group could change their minds until the very last moment.)

Show solution
compare_groups(
  fit_logistic,
  higher = group == "click",
  lower  = group == "touch"
)

# there is no reason to believe (given model and data) that this conjecture is true

Exercise 1b

If you look back at the plot of accuracy, it looks as if the change from atypical to typical condition does not have the same effect, at least not at the same level of strength, for the click and the touch group, i.e., it seems that there is an interaction between these two variables (group and condition). Use the function brms::hypothesis() to examine the interaction term of the model fit. What do you conclude from this?

Show solution
brms::hypothesis(fit_logistic, "grouptouch:conditionTypical < 0")

# given model and data, it is very plausible to believe that there is an interaction between these two variables.

Multinomial regression

Explanation

In multinomial regression, the predicted variable is categorical with more than two levels: \(c_1, \dots, c_k\), \(k > 2\). We want to predict the probabilities \(p_1, \dots, p_k\) of each category (with linear predictors, more on this in a moment). To obtain these probabilities, we estimate a set of weights (so-called logits): \(s_1, \dots, s_k\). By default, we set \(s_1 = 0\), since we only need \(k-1\) numbers to define a \(k\)-place probability vector (given that it must sum to one). For all \(1 \le i \le k\), we define the probability \(p_i\) of category \(c_i\) via the following (so-called soft-max) operation:

\[ p_i = \frac{\exp s_i}{ \sum_{j=1}^k \exp s_j} \]

This entails that for every \(1 < i \le k\), the score \(s_i\) can be interpreted as the log-odds of category \(c_i\) over the reference category \(c_1\):

\[ s_i = \log \frac{p_i}{p_1} \]

Finally, we do not just estimate any-old vector of logits, but we assume that each logit \(s_i\) (\(1 < i \le k\)) is estimated as a linear predictor (based on the usual linear regression predictor coefficients, appropriate to the type of the \(l\) explanatory variables):

\[ s_i = \beta^i_0 + \beta^i_1 x_1 + \beta^i_2 x_2 + \dots + \beta^i_l x_l \]

Two things are important for interpreting the outcome of a multinomial regression fit:

  1. each category (beyond the reference category) receives its own (independent) set of regression coefficients;
  2. the linear predictor \(s_i\) for category \(c_i\) can be interpreted as the log-odds of the \(i\)-th category over the first (reference) category.
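
Before turning to the example, here is a minimal sketch of the soft-max step and the log-odds interpretation just described, using hypothetical scores for \(k=3\) categories (the numbers are made up for illustration):

s <- c(0, 1, -0.5)          # logits; s_1 is fixed to 0 (reference category)
p <- exp(s) / sum(exp(s))   # soft-max: map logits to a probability vector
p                           # category probabilities (sum to 1)
log(p[2] / p[1])            # recovers s_2 as the log-odds against the reference category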

Example

Our next research question is slightly diffuse: we want to explore whether the distribution of trajectory types is affected by whether the correct target was on the right or the left. We only consider three types of categories (curved, straight and ‘change of mind’) and prepare the data to also give us the information whether the ‘correct’ target was left or right.

dolphin_prepped <-
  dolphin |>
  mutate(
    prototype_label = case_when(
     prototype_label %in% c('curved', 'straight') ~ prototype_label,
     TRUE ~ 'CoM'
    ),
    prototype_label = factor(prototype_label,
                             levels = c('straight', 'curved', 'CoM')),
    target_position = ifelse(category_left == category_correct, "left", "right")
    )

The relevant data now looks as follows:

dolphin_prepped |> 
  select(prototype_label, target_position)
# A tibble: 2,052 × 2
   prototype_label target_position
   <fct>           <chr>          
 1 straight        left           
 2 straight        right          
 3 curved          right          
 4 curved          left           
 5 CoM             left           
 6 CoM             right          
 7 CoM             right          
 8 straight        left           
 9 straight        left           
10 straight        left           
# … with 2,042 more rows

The counts and proportions we care about are these:

sum_stats <- dolphin_prepped |> 
  count(target_position, prototype_label) |>
  group_by(target_position) |> 
  mutate(proportion = n / sum(n))

sum_stats
# A tibble: 6 × 4
# Groups:   target_position [2]
  target_position prototype_label     n proportion
  <chr>           <fct>           <int>      <dbl>
1 left            straight          751     0.734 
2 left            curved            136     0.133 
3 left            CoM               136     0.133 
4 right           straight          793     0.771 
5 right           curved             93     0.0904
6 right           CoM               143     0.139 

And here is a plot that might be useful for addressing our question:

sum_stats |> 
  ggplot(aes(x = prototype_label, y = proportion, fill = prototype_label)) +
  geom_col() +
  facet_grid(. ~ target_position)

It is hard to say from visual inspection alone whether there are any noteworthy differences. We might consider the following:

  • Conjecture: the difference in probability between straight and curved trajectories is higher when the target is on the right than when it is on the left.

This is not a real “research hypothesis” but a conjecture about the data. Let’s still run a multinomial regression model to address this conjecture.

fit_multinom <- brm(
  formula = prototype_label ~ target_position,
  data = dolphin_prepped,
  family = categorical()
)

The summary of this model fit is a bit unwieldy:

summary(fit_multinom)
 Family: categorical 
  Links: mucurved = logit; muCoM = logit 
Formula: prototype_label ~ target_position 
   Data: dolphin_prepped (Number of observations: 2052) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                              Estimate Est.Error l-95% CI u-95% CI Rhat
mucurved_Intercept               -1.71      0.09    -1.90    -1.53 1.00
muCoM_Intercept                  -1.71      0.09    -1.89    -1.53 1.00
mucurved_target_positionright    -0.44      0.14    -0.71    -0.16 1.00
muCoM_target_positionright       -0.01      0.13    -0.26     0.26 1.00
                              Bulk_ESS Tail_ESS
mucurved_Intercept                4924     2908
muCoM_Intercept                   4054     2982
mucurved_target_positionright     4103     3054
muCoM_target_positionright        3979     2908

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

For better visibility here is a plot of the posteriors over relevant model parameters.

# there MUST be a nicer way of doing this, but ...
ordered_names <- c(
  "b_mucurved_Intercept", 
  "b_muCoM_Intercept",
  "b_mucurved_target_positionright",
  "b_muCoM_target_positionright"
)

fit_multinom |> 
  tidybayes::tidy_draws() |> 
  pivot_longer(cols = starts_with("b_")) |> 
  select(name, value) |> 
  mutate(name = factor(name, levels = rev(ordered_names))) |> 
  ggplot(aes(x = value, y = name)) +
  tidybayes::stat_halfeye() +
  geom_vline(aes(xintercept = 0), color = project_colors[3], alpha= 1, size = 1)

Exercise 2a

Look at the names of the coefficients in the fit summary to find out: What is the reference level for the categorical predictor variable?

Show solution
# It's the 'left' position, because there is a coefficient for the 'right' position.

Exercise 2b

Look at the names of the coefficients in the fit summary to find out: What is the reference level of the categories to be predicted in the multinomial model?

Show solution
# The reference category is 'straight' because there are regression coefficients for all but the 'straight' category.

Exercise 2c

Can you extract information about our conjecture from this plot (or the summary of the model fit)?

Show solution
# Yes! Our conjecture is about the difference in probability of the 'straight' vs. the 'curved' category. This difference is directly encoded in the regression coefficients. Concretely, the coefficient 'mucurved_Intercept' gives us the log odds of the 'curved' vs. the 'straight' category for the 'left'-position cases. The change in these log odds for the 'right'-position cases is simply the coefficient 'mucurved_target_positionright'. This coefficient is credibly smaller than zero (by some margin), so we may conclude that model and data provide support for our conjecture.

Exercise 2d

Use the posterior means of the regression coefficients to compute the corresponding scores \(s_i\) and class probabilities \(p_i\). Compare these to the observed frequencies.

Show solution
# extract posterior means of all regression coefficients
posterior_means <- fit_multinom |> 
  tidybayes::summarise_draws() |> 
  select(variable, mean) |> 
  pivot_wider(names_from = variable, values_from = mean)

# scores for 'left' targets (reference category 'straight' is fixed at 0)
scores_left <- c(
  0,
  posterior_means[["b_mucurved_Intercept"]],
  posterior_means[["b_muCoM_Intercept"]]
)

# scores for 'right' targets: intercept plus the 'target_positionright' coefficient
scores_right <- c(
  0,
  posterior_means[["b_mucurved_Intercept"]] + posterior_means[["b_mucurved_target_positionright"]],
  posterior_means[["b_muCoM_Intercept"]] + posterior_means[["b_muCoM_target_positionright"]]
)

# soft-max: exponentiate and normalize to obtain class probabilities
probabilities_left  <- prop.table(exp(scores_left))
probabilities_right <- prop.table(exp(scores_right))

# compare model-based probabilities to the observed proportions
sum_stats |> ungroup() |> 
  mutate(prediction = c(probabilities_left, probabilities_right))

Ordinal regression

Explanation

When \(k>2\) categories have a natural ordering, the problem of predicting probabilities for each category can be simplified by taking this ordering into account. A common choice of link function for this case is the cumulative logit function which takes the linear predictor and a vector \(\delta\) of \(k-1\) thresholds as arguments to return a probability vector, here denoted as \(\eta\), whose components are defined like so:

\[ \eta_i = \text{cumulative-logit}(\xi; \delta) = \begin{cases} \text{logistic}(\delta_1 - \xi) & \text{if } i=1 \\ \text{logistic}(\delta_{i} - \xi) - \text{logistic}(\delta_{i-1} - \xi) & \text{if } 1 < i < k \\ 1 - \text{logistic}(\delta_{k-1} - \xi) & \text{if } i=k \\ \end{cases} \] To see what is going on, consider a case with three categories. Fix the two thresholds \(\delta_1=-0.75\) and \(\delta_2=1.6\) just for illustration. Now assume that we have a case where the linear predictor value \(\xi\) is zero. The cumulative-logit function above then yields the category probabilities as shown in this plot, as the lengths of the colored bar segments:

If the linear predictor \(\xi\) is estimated to be bigger than zero, this intuitively means that we shift all of the thresholds to the left (by the same amount). For example, the plot below shows the case of \(\xi=1\), where the probability of the first category decreases while that of the third increases. The short sketch below computes the category probabilities for both cases.
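
Here is a minimal sketch in R of this computation, for the illustrative thresholds \(\delta = (-0.75, 1.6)\) and the linear predictor values \(\xi = 0\) and \(\xi = 1\) (the helper function cumulative_logit_probs() is just for illustration; plogis() is base R's logistic function):

# category probabilities implied by the cumulative-logit link
cumulative_logit_probs <- function(xi, delta) {
  cum <- c(plogis(delta - xi), 1)  # cumulative probabilities P(y <= i)
  diff(c(0, cum))                  # individual category probabilities
}

cumulative_logit_probs(xi = 0, delta = c(-0.75, 1.6))  # roughly 0.32, 0.51, 0.17
cumulative_logit_probs(xi = 1, delta = c(-0.75, 1.6))  # roughly 0.15, 0.50, 0.35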

In sum, the cumulative-logit model for ordinal regression is defined as follows:

\[ \begin{align*} \xi &= X \beta && \color{gray}{\text{[linear predictor]}} \\ \eta &= \text{cumulative-logit}(\xi; \delta) && \color{gray}{\text{[predictor of central tendency]}} \\ y & \sim \text{Categorical}(\eta) && \color{gray}{\text{[likelihood]}} \end{align*} \]

Example

The kinds of mouse trajectories, as categorized in the variable prototype_label, are plausibly ordered by the “amount of deviation”. The following therefore tries to predict the ordered category prototype_label from the numerical measure MAD. Here is a plot of what this looks like:

# prepare data by making 'prototype_label' an ordered factor
dolphin_prepped2 <- dolphin_prepped |> 
    mutate(prototype_label = factor(prototype_label, ordered = T))

# plotting the ordered categories as a function of MAD
dolphin_prepped2 |> 
  ggplot(aes(x = MAD, y = prototype_label, 
             color = prototype_label)) +
  geom_jitter(alpha = 0.3,height = 0.3, width = 0)

To run an ordinal regression model, we specify family = cumulative(). This runs the default cumulative-logit model introduced at the beginning of the session.

fit_ordinal <- brm(
  formula = prototype_label ~ MAD,
  data = dolphin_prepped2,
  family = cumulative()
)

The summary output for this fitted model gives information about the slope of the predictor variable MAD as usual. But it also supplies information about two (!) intercepts: these are the threshold (cutoff) points \(\delta_1\) and \(\delta_2\) of the cumulative-logit link function.

summary(fit_ordinal)
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: prototype_label ~ MAD 
   Data: dolphin_prepped2 (Number of observations: 2052) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]     4.04      0.18     3.71     4.40 1.00     2283     2303
Intercept[2]     9.49      0.50     8.52    10.47 1.00     1890     1597
MAD              0.02      0.00     0.02     0.03 1.00     2248     2486

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Concretely, this means that if MAD is \(x\), we compute the linear predictor \(\xi = \beta_{\text{MAD}} \, x\) (without an intercept!), and then “imagine around it” a standard logistic distribution, i.e., consider the distribution \(\text{Logistic}(\xi, 1)\) (the model uses a logit link, as the summary shows). The probabilities of the three categories are given by the areas under the density of this distribution, namely (see the short sketch after this list):

  • probability category 1: area from \(-\infty\) to first threshold,

  • probability category 2: area between first and second threshold,

  • probability category 3: area between second threshold and \(\infty\).
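
As a minimal sketch of this computation, here are the category probabilities implied by the posterior means reported above, for a hypothetical MAD value of 100 (the value 100 is made up for illustration):

delta <- c(4.04, 9.49)             # posterior means of Intercept[1] and Intercept[2] (thresholds)
b_MAD <- 0.02                      # posterior mean of the MAD slope
xi    <- b_MAD * 100               # linear predictor for MAD = 100 (no intercept)
cum   <- c(plogis(delta - xi), 1)  # cumulative probabilities P(y <= i)
diff(c(0, cum))                    # probabilities of 'straight', 'curved', 'CoM'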

We can work with the linear regression coefficients as usual, e.g., asking whether there is any reason to believe, given model and data, that the higher MAD, the higher the probability of seeing a more ‘uncertain’ trajectory type.

fit_ordinal |> 
  tidybayes::gather_draws(b_MAD) |> 
  ggplot(aes(x = .value, y = .variable)) +
  tidybayes::stat_halfeye() +
  ylab("") + xlab("") + ggplot2::xlim(0,0.03)

brms::hypothesis(fit_ordinal, "MAD > 0")
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1  (MAD) > 0     0.02         0     0.02     0.03        Inf         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Poisson regression

Explanation

The Poisson distribution is the common choice for count data. It is defined as:

\[ \text{Poisson}(k ; \lambda) = \frac{\lambda^k \ \exp( -\lambda)} {k!} \]
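
As a quick sanity check of this formula, base R's dpois() computes the same values (the numbers \(k=3\) and \(\lambda=2.5\) are arbitrary):

k <- 3; lambda <- 2.5
dpois(k, lambda)                         # built-in Poisson pmf
lambda^k * exp(-lambda) / factorial(k)   # the formula above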

The link function is the exponential function (so the inverse link function is the logarithmic function). The Poisson regression model is defined as:

\[ \begin{align*} \xi &= X \beta && \color{gray}{\text{[linear predictor]}} \\ \eta_i &= \exp(\xi_i) && \color{gray}{\text{[predictor of central tendency]}} \\ y_i & \sim \text{Poisson}(\eta_i) && \color{gray}{\text{[likelihood]}} \end{align*} \]
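
As a minimal illustration of this model, one could simulate count data from it and fit it with brms; the simulated data, coefficient values, and object names below are made up for the sketch.

# simulate count data from a Poisson regression with hypothetical coefficients
set.seed(123)
n  <- 200
x  <- rnorm(n)
xi <- 0.5 + 0.8 * x        # linear predictor
y  <- rpois(n, exp(xi))    # inverse link (exp), then Poisson likelihood
sim_data <- tibble(x = x, y = y)

# fit the Poisson regression model (the log link is the default)
fit_poisson <- brm(
  formula = y ~ x,
  data    = sim_data,
  family  = poisson()
)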

Example

There are examples in the next exercise sheet. For a tutorial on Poisson regression specifically geared towards linguists see here.