Bayesian regression: theory & practice

02b: Categorical predictors (tutorial)

Author

Michael Franke & Polina Tsvilodub

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

Code
# install packages from CRAN (unless installed)
pckgs_needed <- c(
  "tidyverse",
  "brms",
  "remotes",
  "tidybayes"
)
pckgs_installed <- installed.packages()[,"Package"]
pckgs_2_install <- pckgs_needed[!(pckgs_needed %in% pckgs_installed)]
if(length(pckgs_2_install)) {
  install.packages(pckgs_2_install)
} 

# install additional packages from GitHub (unless installed)
if (! "aida" %in% pckgs_installed) {
  remotes::install_github("michael-franke/aida-package")
}
if (! "faintr" %in% pckgs_installed) {
  remotes::install_github("michael-franke/faintr")
}
if (! "cspplot" %in% pckgs_installed) {
  remotes::install_github("CogSciPrag/cspplot")
}

# load the required packages
x <- lapply(pckgs_needed, library, character.only = TRUE)
library(aida)
library(faintr)
library(cspplot)

# these options help Stan run faster
options(mc.cores = parallel::detectCores())

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors()[c(1,3,4,5,2,6:14),"hex", drop = TRUE]

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
   scale_fill_manual(..., values = project_colors)
}

Generalized (non-)linear mixed effect models are a powerful statistical tool that has gained increasing popularity for data analysis in cognitive science and many other disciplines. This tutorial provides an overview of different coding schemes for categorical variables in such models. We will look at two example data sets from factorial-design experiments with categorical predictors and a continuous dependent variable, which we will analyze using a Bayesian approach.

Introduction

Whereas models involving continuous predictors like height, weight or reaction time can be interpreted rather intuitively, models involving the (very common!) categorical variables like gender or experimental condition (did the participant read text1 or text2?) are often confusing. It is nevertheless important to understand how categorical variables are handled mathematically, because the decisions we make about their numerical representation influence the interpretation of the regression results.

This hands-on tutorial provides an overview of different coding schemes for categorical variables and walks you through applying them and interpreting the output of a regression model fitted to variables coded in those different schemes. It presupposes basic familiarity with R, linear regression, and Bayesian data analysis with the brms R package (Bürkner, 2017); however, not all details of the Bayesian analysis need to be understood. The focus of this tutorial is conceptual rather than mathematical - to convey in intuitive terms how categorical variables are translated into numbers and how differences in this ‘translation’ influence the interpretation of a regression model. These conceptual ideas are not bound to Bayesian analysis and apply directly to frequentist modeling tools as well (e.g., the lme4 package in R).

Most importantly, we should remember: any statistics of the data are just a tool for us to be able to draw conclusions about our research questions. The numbers that R (or any other application) computes don’t tell us anything - it’s our understanding of those tools and the interpretation of the numbers that provides answers to scientific questions!

First, we will briefly recap what linear regression models look like. Then we will familiarize ourselves with the data sets. Finally, we will go through the different coding schemes step by step: how they are applied in R and how to interpret the resulting regression output.

Estimated reading time: 1.5h

Coding Schemes for Categorical Variables

Why do we need to code categorical variables like gender or experimental conditions? ‘We don’t have to code continuous variables like reaction time!’, you might say. The reason is that categorical variables (also called factors) are usually represented in a data set as strings, e.g. the factor gender could contain the levels (i.e. distinct categories) female and male. But we cannot represent different strings in a regression model. So we need to numerically indicate that a predictor has different levels - by (implicitly) recoding the levels of a variable to numbers, e.g. a 1 for each occurrence of female and -1 for each occurrence of male in our data set. How we choose these numbers is the focus of this tutorial.

In common terms, by recoding the levels numerically we indicate the contrasts between the levels. Generally, we need N-1 numeric variables to represent the contrasts of a categorical variable with N levels.
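For instance, a factor with three levels needs 3 - 1 = 2 numeric contrast variables. A quick illustration with R's default scheme (treatment coding, introduced below):

contr.treatment(3)
  2 3
1 0 0
2 1 0
3 0 1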

To get a better understanding of what this means, we will dive right into an example.

Preparing the tools

We will use the following R packages in this tutorial; most of them are already loaded by the setup code above. If you are missing one of them, you can install it by running e.g. install.packages('languageR'). NB: we will fit the models using brm - this might take some time to compile!

library(tidyverse)
library(brms)
library(languageR)  # provides the latinsquare data set used below

Dataset

The first part of this tutorial is based on a data set from an experiment by Winter and Grawunder (2012). You can get the data set by running:

politeness_df <- faintr::politeness

# get a look at the data set
head(politeness_df)
# A tibble: 6 × 5
  subject gender sentence context pitch
  <chr>   <chr>  <chr>    <chr>   <dbl>
1 F1      F      S1       pol      213.
2 F1      F      S1       inf      204.
3 F1      F      S2       pol      285.
4 F1      F      S2       inf      260.
5 F1      F      S3       pol      204.
6 F1      F      S3       inf      287.

The data set contains recordings of the voice pitch of speakers in different social contexts (polite and informal). Winter and Grawunder investigated whether the mean voice pitch differs across the factor gender of the speakers (F and M) and across the factor context - resulting in four different condition combinations (gender × context). Such a design is called a factorial design, and the single combinations are called design cells.

Exploring the data visually

Before we dive into any statistical analyses of our dataset it is helpful to get a rough idea of what the data looks like. For example, we can start by exploring the dataset visually.
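For instance, a simple jittered scatter plot of pitch per design cell gives a first impression (a minimal sketch, using the columns of politeness_df loaded above):

politeness_df %>%
  ggplot(aes(x = context, y = pitch, color = gender)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.5))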

Furthermore, we can compute some basic statistics - e.g. the mean of the different design cells, before we turn to more complex linear models. We can also compute the overall mean pitch across all the conditions - the grand mean. These values will be helpful for a sanity check when interpreting the linear models later on.

tibble_means <- politeness_df %>%
  group_by(context, gender) %>%
  summarize(mean = mean(pitch))
head(tibble_means)
# A tibble: 4 × 3
# Groups:   context [2]
  context gender  mean
  <chr>   <chr>  <dbl>
1 inf     F       261.
2 inf     M       144.
3 pol     F       233.
4 pol     M       133.
mean(tibble_means$mean)
[1] 192.8605

Regression Models

While this tutorial presupposes some basic understanding of regression models, let us recap what exactly we compute when fitting one. As usual for factorial designs, the regression model works under the assumption that the observations in each cell are sampled from a normal distribution with a certain mean - and each design cell has its own mean. Research questions about data from factorial designs are usually about the estimated means of these design cells, e.g.: is this design cell’s mean different from zero, or from that design cell’s mean?

Let us stick to our example data. We want to estimate the mean pitch (\(y\)) depending on gender (\(x_1\)), context (\(x_2\)) and the interaction of the two factors (\(x_1x_2\)). So what we estimate is the regression function described by

\[y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_1x_2 + \epsilon\]

\(\beta_0\) is the intercept and \(\beta_1, \beta_2\) and \(\beta_3\) are the slopes that are estimated when fitting the model; \(x_1\) and \(x_2\) represent the categorical variables and are exactly what is ‘coded’ when applying different variable coding schemes. For simplicity we will omit the error term \(\epsilon\) below. We will come back to this regression function when interpreting the regression output under the respective coding schemes.

Coding Categorical Variables

In our example, gender and context are categorical variables, so they need to be coded numerically. R handles this by treating such variables as factors. Let’s make this explicit and convert them manually:

politeness_df <- politeness_df %>%
  mutate(gender = factor(gender),   # you could rename or reorder the levels here
         context = factor(context))
# check
head(politeness_df)
# A tibble: 6 × 5
  subject gender sentence context pitch
  <chr>   <fct>  <chr>    <fct>   <dbl>
1 F1      F      S1       pol      213.
2 F1      F      S1       inf      204.
3 F1      F      S2       pol      285.
4 F1      F      S2       inf      260.
5 F1      F      S3       pol      204.
6 F1      F      S3       inf      287.

Dummy (Treatment) Coding

Dummy coding, or treatment coding, is the default coding scheme used by R. The name ‘treatment coding’ is a helpful mnemonic for what this scheme does: imagine a medical experiment with a single control group (which receives a placebo) and several treatment groups, each of which gets a different treatment (e.g., different drugs), and where we want to compare each treatment group to the single, pivotal control group. Consequently, the effects of dummy-coded variables are estimated by comparing all levels of the variable to a reference level. The intercept of a linear model containing dummy-coded variables is the mean of the reference level.

Our variables only have two levels, so the effect of gender can be estimated by treating female as the reference level and estimating the effect of being male compared to that reference level - basically estimating the difference in pitch it takes to “get from female to male”. Similarly, we can estimate the effect of context: the informal context can be treated as the reference level and the effect of politeness can be estimated against it. By default, the first level of a factor is treated as the reference level (for unordered factors that is the first string in alphanumeric order) - but in principle any level can serve as the reference. It makes sense to choose the level which is in some sense the ‘control’ in your experimental design.
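Should you want a different reference level than the default, you can reorder the factor levels yourself, e.g. (shown for illustration only - we keep female and informal as the reference levels below):

# make M the reference level for gender (not assigned, so politeness_df stays unchanged)
politeness_df %>%
  mutate(gender = relevel(gender, ref = "M"))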

Because R uses dummy-coding by default, we can look at the default numerical coding right away. The function contrasts() displays the contrast matrix for the respective variable:

contrasts(politeness_df$gender)
  M
F 0
M 1
contrasts(politeness_df$context)
    pol
inf   0
pol   1

But if we wish to explicitly assign a dummy (treatment) coding to a variable, we can do so with a built-in R function:

contrasts(politeness_df$gender) <- contr.treatment(2) # insert the number of levels here
# check
contrasts(politeness_df$gender)
  2
F 0
M 1

So both variables \(x_1\) and \(x_2\) can take either the value 0 or 1 (because we dummy-code both categorical variables; see below for more). Having defined the reference levels of the single variables, we can now identify the overall reference level of our model (by combining the two individual reference levels): it is the mean pitch of female speakers in informal contexts.

Having set all the basics, we can now turn to computing a linear model of the mean pitch as predicted by the factors gender and context:

# here, we only use fixed effects
lm.dummy.FE <- brm(
  pitch ~ gender * context,
  data = politeness_df,
  cores = 4,
  iter = 1000
)
lm.dummy.FE.coefs <- fixef(lm.dummy.FE)[,1] %>% as.numeric() # get the estimated coefficients
summary(lm.dummy.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: pitch ~ gender * context 
   Data: politeness_df (Number of observations: 83) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            260.60      8.26   243.90   275.93 1.00     1106     1406
gender2             -115.76     11.62  -138.42   -92.27 1.00      822     1084
contextpol           -27.05     11.77   -49.81    -2.97 1.00     1041     1188
gender2:contextpol    15.39     16.54   -18.71    47.66 1.00      832      843

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.18      2.89    31.12    42.18 1.00     2375     1461

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Now how do we interpret the estimated coefficients?

Let us recall the regression equation that is hidden behind this output: \[y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_1x_2\]

In order to help us interpret the output, R assigns names to the estimated coefficients, built from the variable names and the column names of the contrast matrices. The (Intercept) corresponds to \(\beta_0\), gender2 corresponds to \(\beta_1\) (the contrast column we assigned above is simply named 2; with the default coding it would have been called genderM), contextpol corresponds to \(\beta_2\), and gender2:contextpol (the interaction term) to \(\beta_3\).

Further, let us recall the numerical coding of our variables: for \(x_1\) (gender) a 0 means female, a 1 means male; for \(x_2\) (context) a 0 means informal, a 1 means polite. So the computed values are the estimates for conditions differing from the respective reference conditions - i.e. when the respective \(x\) is a 1.

To get an estimate of a certain design cell (\(y_i\)) - let’s start with the mean pitch of female speakers (0 for \(x_1\)) in informal contexts (0 for \(x_2\)) - we just insert the corresponding numeric values for the corresponding \(x\) and the estimated value for the corresponding \(\beta\). Thus we get:

y1 = lm.dummy.FE.coefs[1] + lm.dummy.FE.coefs[2]*0 +
  lm.dummy.FE.coefs[3]*0 + lm.dummy.FE.coefs[4]*(0*0)
y1
[1] 260.6013

Hence, the mean pitch of female speakers in informal contexts corresponds to the intercept. As a sanity check, recall that for dummy-coded variables the model intercept is just the mean of the reference cell (in our case, female speakers in informal contexts!).
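We can double-check this against the empirical cell means computed earlier:

# empirical mean of the reference cell (roughly 261, close to the intercept)
tibble_means %>% filter(gender == "F", context == "inf")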

Let’s now calculate the mean pitch of male speakers (1 for \(x_1\)) in informal contexts (0 for \(x_2\)):

y2 = lm.dummy.FE.coefs[1] + lm.dummy.FE.coefs[2]*1 +
  lm.dummy.FE.coefs[3]*0 + lm.dummy.FE.coefs[4]*(1*0)
y2
[1] 144.8391
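The remaining two cells work analogously; e.g., for female speakers (0 for \(x_1\)) in polite contexts (1 for \(x_2\)):

y3 = lm.dummy.FE.coefs[1] + lm.dummy.FE.coefs[2]*0 +
  lm.dummy.FE.coefs[3]*1 + lm.dummy.FE.coefs[4]*(0*1)
y3  # roughly 233, matching the empirical female-polite cell mean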

Simple (Contrast) Coding

Another common coding scheme is the simple coding (also called contrast coding). Simple coded variables are also compared to a reference level (just like dummy-coded ones). However, the intercept of a simple coded model is the grand mean – the mean of all cells (i.e. the mean of female-informal & female-polite & male-informal & male-polite cells).

Generally, this kind of coding can be created by subtracting \(1/k\) from the dummy coding contrast matrix, where \(k\) is the number of levels a variable has (in our case, both have two). Hence, the reference level will always only have negative values in the contrast matrix. The general rule is that the contrasts within a column have to add up to 0. R does not provide a built-in function for simple coding, but we can easily create the respective matrix ourselves by subtracting \(1/k\) (i.e. 1/2) from the dummy coding matrix:

# manual creation of contrasts
contr.matrix <- matrix( rep(0.5, 2))
dummy.matrix <- contr.treatment(2)
contr.coding <- dummy.matrix - contr.matrix

# we should duplicate the values to not overwrite previous contrasts
politeness_df <- politeness_df %>%
  mutate(context_contr = context,
         gender_contr = gender)
contrasts(politeness_df$context_contr) <- contr.coding
contrasts(politeness_df$gender_contr)  <- contr.coding

Hence gender is now coded as -0.5 for female and 0.5 for male; context is coded as -0.5 for informal and 0.5 for polite.

Let’s again look at our regression model:

lm.contr.FE <- brm(
  pitch ~ gender_contr * context_contr,
  data = politeness_df,
  cores = 4,
  iter =  1000
)
lm.contr.FE.coefs <- fixef(lm.contr.FE)[,1] %>% as.numeric() # get vector of estimated coefficients
summary(lm.contr.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: pitch ~ gender_contr * context_contr 
   Data: politeness_df (Number of observations: 83) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                      192.83      4.04   184.91   200.43 1.00     2433
gender_contr2                 -108.34      7.95  -123.08   -92.79 1.00     2346
context_contr2                 -19.73      7.65   -33.97    -4.72 1.00     3264
gender_contr2:context_contr2    15.85     15.77   -15.16    47.94 1.00     2310
                             Tail_ESS
Intercept                        1285
gender_contr2                    1535
context_contr2                   1653
gender_contr2:context_contr2     1666

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.21      3.00    30.92    42.56 1.00     2337     1577

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

In order to compute the mean pitch of a specific cell, we proceed just as with dummy-coded variables and insert the respective estimates and values for \(x\). Let us start with female speakers (\(x_1\) is -0.5) in informal contexts (\(x_2\) is -0.5):

y1 = lm.contr.FE.coefs[1] + lm.contr.FE.coefs[2]*(-0.5) + lm.contr.FE.coefs[3]*(-0.5) + lm.contr.FE.coefs[4]*(-0.5)*(-0.5)
y1
[1] 260.8351

We get essentially the same result as before (up to MCMC sampling noise) - as we should: the estimated cell means should not depend on the coding scheme, only on the data. As a sanity check, we can again look at the intercept - it matches the grand mean we computed at the beginning of this tutorial, as it should.
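A quick numerical check of this claim:

# estimated intercept vs. empirical grand mean - both roughly 192.8
c(lm.contr.FE.coefs[1], mean(tibble_means$mean))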

Your turn! Compute the pitch means for the other three conditions.

# your code here

Deviation (Sum) Coding

Deviation coding (also called sum coding) is a very popular coding scheme and is often considered a good choice for getting a clear picture of the presence (or absence) of an effect and a clean interpretation of random effects.

It is slightly different from the previous schemes. It compares the mean of the predicted variable for a specific condition to the grand mean. So the estimates do not tell you the difference between the reference level and another level anymore. The intercept of linear models with sum coded variables is the grand mean.

R has a built-in function for creating sum coded variables:

# again create a new variable
politeness_df %>%
  mutate(context_dev = context,
         gender_dev = gender) -> politeness_df
contrasts(politeness_df$context_dev) <- contr.sum(2) # insert number of levels
contrasts(politeness_df$gender_dev)  <- contr.sum(2)

Now gender is coded as 1 for female and -1 for male; context is coded as 1 for informal and -1 for polite.

Below we fit a model with the sum-coded variables:

lm.dev.FE <- brm(pitch ~ context_dev * gender_dev,
                data = politeness_df,
                cores = 4,
                iter = 1000)
lm.dev.FE.coefs <- fixef(lm.dev.FE)[,1] %>% as.numeric()
summary(lm.dev.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: pitch ~ context_dev * gender_dev 
   Data: politeness_df (Number of observations: 83) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                  192.81      3.96   184.79   200.85 1.00     2544
context_dev1                 9.72      4.16     1.37    18.08 1.00     2389
gender_dev1                 54.08      4.12    46.13    61.82 1.00     2431
context_dev1:gender_dev1     3.93      3.98    -3.89    11.49 1.00     2477
                         Tail_ESS
Intercept                    1203
context_dev1                 1337
gender_dev1                  1348
context_dev1:gender_dev1     1558

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.32      2.95    31.02    42.36 1.00     2258     1428

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The coefficients now denote the difference between the grand mean (i.e. the intercept) and the mean of the respective condition.

We apply the same idea to estimate the pitch means for specific cells; e.g., for female speakers in informal contexts we do:

y1 = lm.dev.FE.coefs[1] + lm.dev.FE.coefs[2]*1 + lm.dev.FE.coefs[3]*1 + lm.dev.FE.coefs[4]*1*1
y1
[1] 260.5424

Since the intercept is now the grand mean and not a specific reference level, let us think about the interpretation of the single estimates. The estimate of e.g. the context effect now denotes the value by which the mean pitch in informal (estimate * 1, remember our coding!) or polite contexts (estimate * -1) differs from the grand mean. So if we wish to calculate the mean pitch in polite contexts (across genders; the gender and interaction terms average out over the ±1 coding of gender), we would do:

yPol = lm.dev.FE.coefs[1] + lm.dev.FE.coefs[2] * (-1)
yPol
[1] 183.0851

This means that the single estimates are in some sense ‘independent’ of each other (in contrast to e.g. dummy-coded variables, where the estimates are bound to the reference levels of the two variables) and tell us whether a specific effect is credibly different from 0. Similarly, if we wish to calculate the mean pitch of male speakers, we would calculate:

yM = lm.dev.FE.coefs[1] + lm.dev.FE.coefs[3] * (-1)
yM
[1] 138.7232

Helmert Coding

In this coding scheme, a level of a variable is compared to its subsequent levels. In our dataset, e.g. for gender the level female is compared to the subsequent level male.

Generally, to create such a coding, in order to compare the first level to the subsequent levels you would assign \((k-1)/k\) to the first level and \(-1/k\) to all subsequent levels, where \(k\) is the total number of levels. To compare the second level to the subsequent levels you would assign 0 to the first level, \((i-1)/i\) to the second level and \(-1/i\) to all subsequent levels, where \(i = k-1\), and so on. The difference between this coding scheme and the previous ones becomes clearer for variables with more than two levels (see below). The intercept of such a linear model corresponds to the grand mean.

R does not have a built-in function for this scaled Helmert coding (the built-in contr.helmert() implements an unscaled variant of what is introduced below as reverse Helmert coding), so we create the matrix manually:

# with politeness data
helm.matrix <- matrix(c(0.5, -0.5))
politeness_df <-
 politeness_df %>%
 mutate(gender_helm = gender,
         context_helm = context)
contrasts(politeness_df$gender_helm)  <- helm.matrix
contrasts(politeness_df$context_helm) <- helm.matrix

The linear model looks like this:

lm.helmert.FE <- brm(pitch ~ context_helm * gender_helm,
                data = politeness_df,
                cores = 4,
                iter = 1000)
lm.helmert.FE.coefs <- fixef(lm.helmert.FE)[,1] %>% as.numeric()
summary(lm.helmert.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: pitch ~ context_helm * gender_helm 
   Data: politeness_df (Number of observations: 83) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                    193.01      3.95   185.03   200.73 1.00     2453
context_helm1                 19.30      8.03     3.82    35.68 1.00     2830
gender_helm1                 108.45      7.84    93.06   123.73 1.00     2252
context_helm1:gender_helm1    16.34     16.76   -16.66    49.85 1.00     2303
                           Tail_ESS
Intercept                      1683
context_helm1                  1691
gender_helm1                   1572
context_helm1:gender_helm1     1501

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.26      2.99    30.84    42.68 1.00     2262     1430

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The contrast estimate for the first level versus the remaining levels is calculated by taking the mean of the dependent variable for the first level and subtracting the mean of the dependent variable for the remaining levels (in our case, just the mean of the second level). In other words, the context coefficient denotes the difference between the mean pitch in informal contexts and the mean pitch in polite contexts.
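We can sanity-check this against the empirical cell means:

# informal minus polite context means (averaged over gender) - roughly 19.5,
# close to the context_helm1 estimate
mean(tibble_means$mean[tibble_means$context == "inf"]) -
  mean(tibble_means$mean[tibble_means$context == "pol"])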

Mixing Coding Schemes

If you have several categorical predictor variables, it is also possible (and often useful!) to use different coding schemes for the different variables. It might, for example, make sense to use dummy coding for a variable which has a control and a treatment condition, and to use e.g. simple coding for a variable which has two ‘equal’ levels.

For example, we could use dummy-coding for context and simple-coding for gender.

When you mix coding schemes or define your own schemes, there may be no pre-defined answers to questions like what the single coefficients or the intercept mean. But knowing how the different schemes work, you can easily figure it out!

Let us explore how the interpretation of the model changes if we mix coding schemes:

lm.mixedCode.FE <- brm(pitch ~ context * gender_contr,
                data = politeness_df,
                cores = 4,
                iter = 1000)
lm.mixedCode.FE.coefs <- fixef(lm.mixedCode.FE)[,1] %>% as.numeric()
summary(lm.mixedCode.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: pitch ~ context * gender_contr 
   Data: politeness_df (Number of observations: 83) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                  202.51      5.48   192.32   213.72 1.00     2463
contextpol                 -19.42      7.96   -34.89    -3.92 1.00     1949
gender_contr2             -116.24     11.12  -138.49   -95.16 1.00     1970
contextpol:gender_contr2    16.23     16.07   -14.57    47.91 1.00     1831
                         Tail_ESS
Intercept                    1436
contextpol                   1295
gender_contr2                1679
contextpol:gender_contr2     1619

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    36.18      2.96    30.95    42.45 1.00     2394     1532

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Generally, the interpretation is just the combination of what we have learned about the individual coding schemes. Recall that the intercept of a dummy-coded model is the mean of the reference level - since we dummy-coded context, the reference level is the informal context. But that is not the full story of the intercept yet! We have a second predictor in our model - the simple-coded gender. In simple-coded models, the intercept is the mean across the levels of that variable. So the intercept of our model with the two differently coded predictors is the mean pitch in informal contexts - averaged across genders.
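A quick check of this interpretation of the intercept:

# empirical mean pitch in informal contexts, averaged across genders -
# roughly 202.5, close to the estimated intercept
mean(tibble_means$mean[tibble_means$context == "inf"])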

Following this logic, the context estimate denotes the difference between informal and polite contexts - still averaged across genders. The gender estimate denotes the difference between male and female speakers in informal contexts; multiplied by -0.5 (recall our coding above) it gives the deviation of female speakers from the intercept, and multiplied by 0.5 the deviation of male speakers.

Variables with three levels

In order to see the differences yielded by the different coding schemes more clearly, it is useful to look at an example data set containing predictors with three levels. But don’t worry, all the concepts we have learned so far work in just the same way - independently of the number of levels a predictor has.

The data set latinsquare is provided in the package languageR which you need to install in order to access it.

# alternative dataset with a three-level variable
require(languageR)
latinsquare_df <- as_tibble(latinsquare)
head(latinsquare_df)
# A tibble: 6 × 6
  Group Subject Word     RT SOA   List 
  <fct> <fct>   <fct> <int> <fct> <fct>
1 G1    S1      W1      532 short L1   
2 G1    S2      W1      542 short L1   
3 G1    S3      W1      615 short L1   
4 G1    S4      W1      547 short L1   
5 G2    S5      W9      553 short L3   
6 G2    S6      W9      464 short L3   

In this example data set, the reaction times (RT) of subjects in a lexical decision experiment were simulated. The predictors are, among others, the stimulus-onset asynchrony (SOA) (with the levels long, medium, short) and different lists of prime-target pairs of words (L1, L2, L3) that are presented to the participants. Without going into the details of the experimental paradigm, let’s imagine we would like to predict the participants’ mean reaction time given the list type, the SOA type and their interaction.

Again, it is helpful to start by trying to get a rough idea of what the data looks like.

latinsquare_df %>%
ggplot(., aes(x = List, y = RT)) +
  geom_point( aes(color = SOA),
              position = position_jitterdodge(dodge.width=0.5))

We can also look at the empirical means of the different conditions:

latinsquare_df %>% group_by(List, SOA) %>% summarize(mean = mean(RT)) -> tibble_baseStats
tibble_baseStats
# A tibble: 9 × 3
# Groups:   List [3]
  List  SOA     mean
  <fct> <fct>  <dbl>
1 L1    long    530.
2 L1    medium  537.
3 L1    short   544.
4 L2    long    543.
5 L2    medium  556.
6 L2    short   534.
7 L3    long    529.
8 L3    medium  515.
9 L3    short   522.
mean(tibble_baseStats$mean)
[1] 534.5139

Now we can turn to modelling!

Dummy (Treatment) Coding

Generally, the coding schemes work in the very same way independently of the number of levels. The only difference to our old data set is that now we need two numeric variables coding the contrasts between the levels of one variable. If we look at the default (dummy) coding of e.g. the variable List we see a contrast matrix with two columns, each denoting the comparisons between two levels:

contrasts(latinsquare_df$SOA)
       medium short
long        0     0
medium      1     0
short       0     1
contrasts(latinsquare_df$List)
   L2 L3
L1  0  0
L2  1  0
L3  0  1

So now the recoding of the categorical variable takes two numeric variables: e.g. \(x_{1_2}\) and \(x_{1_3}\), where both can take the values 0 or 1; the single levels are denoted by the combination of the two numeric variables. Again there is a reference level - List 1 - described by \(x_{1_2}\) and \(x_{1_3}\) being 0. \(x_{1_2}\) being 1 describes the difference between the reference level and List 2; \(x_{1_3}\) being 1 describes the difference between the reference level and List 3. Correspondingly, there is an individual \(\beta\) for each numeric variable estimated in the regression model. The coding of the SOA factor works just the same way. The interactions between specific levels are described by combining the respective numeric variables \(x\). So the model we are fitting is described by:

\[y = \beta_0 + \beta_1*x_{1_2} + \beta_2*x_{1_3} + \beta_3*x_{2_2} + \beta_4*x_{2_3} + \beta_5*x_{1_2}x_{2_2} + \beta_6*x_{1_3}x_{2_2} + \beta_7*x_{1_2}x_{2_3} + \beta_8*x_{1_3}x_{2_3}\]

lm3.dummy.FE <- brm(RT ~ List * SOA,
                   data = latinsquare_df,
                   cores = 4,
                   iter = 1000)
lm3.dummy.FE.coefs <- fixef(lm3.dummy.FE)[,1] %>% as.numeric()
summary(lm3.dummy.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ List * SOA 
   Data: latinsquare_df (Number of observations: 144) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept          529.61     11.36   507.89   552.76 1.01      789     1202
ListL2              13.14     16.44   -19.40    44.81 1.01      874      953
ListL3              -0.40     16.35   -33.13    30.09 1.01      903     1185
SOAmedium            6.87     16.41   -25.60    37.40 1.01      818     1075
SOAshort            13.64     16.40   -18.45    44.95 1.01      850     1101
ListL2:SOAmedium     6.65     23.22   -37.61    53.57 1.01      924      784
ListL3:SOAmedium   -20.96     22.80   -66.02    21.79 1.01      872     1184
ListL2:SOAshort    -21.61     23.23   -68.84    20.75 1.01      840     1214
ListL3:SOAshort    -20.46     22.89   -64.46    24.99 1.00     1017     1421

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    46.52      2.84    41.40    52.54 1.00     1876     1404

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Since both predictors are dummy-coded, the intercept represents the reference level - the mean RT for List 1 and a long SOA. Following the same procedure as for the two-level variables, you can calculate the estimated mean RTs for specific conditions.
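For instance, the estimated mean RT for List 2 with a short SOA would be (a sketch, using the coefficient order of the summary above):

y_L2_short = lm3.dummy.FE.coefs[1] + # intercept (List 1, long SOA)
  lm3.dummy.FE.coefs[2] +            # ListL2
  lm3.dummy.FE.coefs[5] +            # SOAshort
  lm3.dummy.FE.coefs[8]              # ListL2:SOAshort
y_L2_short  # close to the empirical mean of ~534 for that cell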

Simple (Contrast) Coding

Simple coding only slightly differs from dummy-coding – the intercept of the model is the grand mean, not the mean RT of the reference level. Otherwise, the coefficients still denote the difference between the reference level and other specific levels.

latinsquare_df %>%
  mutate(List_contr = factor(List),
         SOA_contr = factor(SOA)) -> latinsquare_df
dummy.matrix3 <- contr.treatment(3)
contr.matrix3 <- matrix(c(1/3, 1/3, 1/3, 1/3, 1/3, 1/3), ncol=2)
contrasts(latinsquare_df$List_contr) <- dummy.matrix3 - contr.matrix3
contrasts(latinsquare_df$SOA_contr) <-  dummy.matrix3 - contr.matrix3
lm3.simple.FE <- brm(RT ~ List_contr * SOA_contr,
                   data = latinsquare_df,
                   cores = 4,
                   iter = 1000)
lm3.simple.FE.coefs <- fixef(lm3.simple.FE)[,1] %>% as.numeric()
summary(lm3.simple.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ List_contr * SOA_contr 
   Data: latinsquare_df (Number of observations: 144) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                534.46      3.72   527.04   541.65 1.01     2125
List_contr2                7.81      9.36   -10.87    26.23 1.00     1743
List_contr3              -14.75      9.22   -33.54     2.55 1.00     1683
SOA_contr2                 2.01      9.39   -16.54    20.58 1.00     1754
SOA_contr3                -0.63      9.58   -18.85    18.39 1.00     1809
List_contr2:SOA_contr2     5.51     22.32   -37.79    49.93 1.00     1364
List_contr3:SOA_contr2   -21.79     23.57   -68.16    23.81 1.00     1630
List_contr2:SOA_contr3   -22.88     22.85   -68.18    22.88 1.00     1580
List_contr3:SOA_contr3   -21.87     23.85   -68.44    24.95 1.00     1486
                       Tail_ESS
Intercept                  1225
List_contr2                1629
List_contr3                1289
SOA_contr2                 1366
SOA_contr3                 1688
List_contr2:SOA_contr2     1459
List_contr3:SOA_contr2     1598
List_contr2:SOA_contr3     1527
List_contr3:SOA_contr3     1371

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    46.42      2.97    41.19    52.35 1.00     2278     1362

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Deviation (Sum) Coding

As the number of levels within the factors increases, interpreting all the pairwise differences between levels becomes considerably more complex. Hence it can make a lot of sense to use the deviation coding scheme, which provides estimates of effects compared to the grand mean. We can again use the built-in R function to assign deviation coding to our three-level factors:

latinsquare_df %>%
  mutate(List_dev = List,
         SOA_dev = SOA) -> latinsquare_df
contrasts(latinsquare_df$List_dev) <- contr.sum(3) # insert number of levels
contrasts(latinsquare_df$SOA_dev) <- contr.sum(3)
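To see the numeric coding this assigns, look at the sum-coding contrast matrix for a three-level factor (rows correspond to the levels in order, here long, medium, short):

contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1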

For e.g. SOA our numeric variables now denote the effect of long SOA compared to the grand mean when \(x_{2_2}\) is a 1 and \(x_{2_3}\) is a 0; they denote the effect of medium SOA compared to the grand mean when \(x_{2_2}\) is a 0 and \(x_{2_3}\) is a 1; the effect of short SOA is never compared to the grand mean since it is always assigned a -1.

lm3.dev.FE <- brm(RT ~ List_dev * SOA_dev,
                   data = latinsquare_df,
                 cores = 4,
                 iter = 1000)
lm3.dev.FE.coefs <- fixef(lm3.dev.FE)[,1] %>% as.numeric()
summary(lm3.dev.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ List_dev * SOA_dev 
   Data: latinsquare_df (Number of observations: 144) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            534.44      4.06   526.59   542.17 1.00     2855     1438
List_dev1              2.16      5.59    -9.08    13.26 1.00     1843     1378
List_dev2             10.24      5.68    -1.03    21.43 1.00     2023     1473
SOA_dev1              -0.80      5.43   -11.32     9.57 1.00     2019     1705
SOA_dev2               1.77      5.44    -9.18    12.95 1.00     1935     1375
List_dev1:SOA_dev1    -6.68      7.90   -21.87     8.56 1.00     1700     1523
List_dev2:SOA_dev1    -1.07      7.62   -15.57    13.77 1.00     1506     1550
List_dev1:SOA_dev2    -1.74      7.66   -16.22    13.54 1.00     1639     1612
List_dev2:SOA_dev2    10.25      7.99    -5.43    25.73 1.00     1651     1624

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    46.61      2.77    41.54    52.39 1.00     2743     1771

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Think about what the single estimates mean!

Helmert Coding

To recap: in this coding scheme, a level of a variable is compared to its subsequent levels. What does this mean for three-level factors?

latinsquare_df %>%
  mutate(List_helm = List,
         SOA_helm = SOA) -> latinsquare_df
helm.matrix3 <- matrix(c(2/3, -1/3, -1/3, 0, 1/2, -1/2 ), ncol = 2)
contrasts(latinsquare_df$List_helm) <- helm.matrix3
contrasts(latinsquare_df$SOA_helm) <- helm.matrix3
lm3.helm.FE <- brm(RT ~ List_helm * SOA_helm,
                   data = latinsquare_df,
                  cores = 4,
                  iter = 1000)
lm3.helm.FE.coefs <- fixef(lm3.helm.FE)[,1] %>% as.numeric()
summary(lm3.helm.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ List_helm * SOA_helm 
   Data: latinsquare_df (Number of observations: 144) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept              534.52      3.75   527.02   541.90 1.00     3724
List_helm1               3.50      7.85   -12.25    18.85 1.00     4213
List_helm2              22.61      9.62     3.19    41.61 1.00     3414
SOA_helm1               -0.66      8.14   -16.34    15.47 1.00     4152
SOA_helm2                2.78      9.27   -15.41    20.26 1.00     4254
List_helm1:SOA_helm1   -14.19     17.40   -48.50    19.01 1.00     4911
List_helm2:SOA_helm1   -12.63     20.07   -51.14    26.18 1.00     4658
List_helm1:SOA_helm2   -14.35     20.56   -55.10    24.37 1.00     3622
List_helm2:SOA_helm2    28.63     23.36   -17.57    74.72 1.00     4086
                     Tail_ESS
Intercept                1609
List_helm1               1593
List_helm2               1336
SOA_helm1                1632
SOA_helm2                1308
List_helm1:SOA_helm1     1480
List_helm2:SOA_helm1     1617
List_helm1:SOA_helm2     1466
List_helm2:SOA_helm2     1456

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    46.50      2.82    41.36    52.28 1.01     3328     1396

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The main effect estimates denote the differences between the mean of List1 and the mean of (List2 + List3); and between the mean of List2 and the mean of List3. Respectively, they denote the differences between the mean of SOA long and the mean of (medium + short); and between the mean of SOA medium and the mean of short. The intercept is the grand mean.

Reverse Helmert Coding

The reverse Helmert coding scheme (also called difference coding) is quite similar to the Helmert coding, but compares the mean of a level to its previous levels. Since we basically reverse the coding we used in the previous scheme, we also ‘reverse’ the contrast matrix to create such a coding.

latinsquare_df %>%
  mutate(List_rhelm = List,
         SOA_rhelm = SOA) -> latinsquare_df
rhelm.matrix3 <- matrix(c(-1/2, 1/2, 0, -1/3, -1/3, 2/3 ), ncol = 2)
contrasts(latinsquare_df$List_rhelm) <- rhelm.matrix3
contrasts(latinsquare_df$SOA_rhelm) <- rhelm.matrix3
lm3.rhelm.FE <- brm(RT ~ List_rhelm * SOA_rhelm,
                   data = latinsquare_df,
                   iter = 1000,
                   cores = 4)
lm3.rhelm.FE.coefs <- fixef(lm3.rhelm.FE)[,1] %>% as.numeric()
summary(lm3.rhelm.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ List_rhelm * SOA_rhelm 
   Data: latinsquare_df (Number of observations: 144) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept                534.67      3.77   527.30   541.88 1.00     3705
List_rhelm1                8.03      9.15   -10.11    25.75 1.00     4267
List_rhelm2              -18.54      8.14   -34.37    -1.97 1.00     3516
SOA_rhelm1                 2.10      9.42   -16.31    20.16 1.00     4381
SOA_rhelm2                -1.78      7.72   -17.62    13.25 1.00     3787
List_rhelm1:SOA_rhelm1     6.07     23.22   -39.41    51.21 1.01     3217
List_rhelm2:SOA_rhelm1   -23.75     20.15   -64.52    15.70 1.00     3619
List_rhelm1:SOA_rhelm2   -25.73     20.31   -64.91    11.85 1.01     3683
List_rhelm2:SOA_rhelm2     2.53     17.56   -31.29    36.84 1.00     3847
                       Tail_ESS
Intercept                  1476
List_rhelm1                1542
List_rhelm2                1332
SOA_rhelm1                 1575
SOA_rhelm2                 1651
List_rhelm1:SOA_rhelm1     1394
List_rhelm2:SOA_rhelm1     1344
List_rhelm1:SOA_rhelm2     1502
List_rhelm2:SOA_rhelm2     1597

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    46.50      2.85    41.29    52.60 1.01     3326     1376

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

In our example, the first estimate denotes the difference between the mean of List2 and the mean of List1; the second - the difference between the mean of List3 and the mean of (List1 + List2). The main effects of SOA can be interpreted similarly.
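As before, we can check the first estimate against the empirical means:

# mean RT of List2 minus mean RT of List1 (averaged over SOA) - roughly 7,
# in the vicinity of the List_rhelm1 estimate
mean(tibble_baseStats$mean[tibble_baseStats$List == "L2"]) -
  mean(tibble_baseStats$mean[tibble_baseStats$List == "L1"])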

Mixed Schemes: Dummy and Deviation Coding

Just like with two-level factors, we might wish to use different coding schemes for different predictors. It might make sense to use dummy coding for a variable which has a control and two different treatment conditions, and to use deviation coding for a variable which has ‘equal’ levels.

For example, we could use dummy-coding for List and deviation-coding for SOA.

lm3.mixedCode.FE <- brm(RT ~ List * SOA_dev,
                   data = latinsquare_df,
                   cores = 4,
                   iter = 1000)
lm3.mixedCode.FE.coefs <- fixef(lm3.mixedCode.FE)[,1] %>% as.numeric()
summary(lm3.mixedCode.FE)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: RT ~ List * SOA_dev 
   Data: latinsquare_df (Number of observations: 144) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         536.89      6.69   523.92   549.97 1.00     2364     1659
ListL2              8.06      9.20   -10.53    25.72 1.00     2178     1350
ListL3            -14.52      9.50   -33.73     4.04 1.00     2621     1655
SOA_dev1           -7.01      9.67   -26.33    12.12 1.00     1129     1160
SOA_dev2           -0.24      9.60   -19.11    18.17 1.00     1035     1114
ListL2:SOA_dev1     5.30     13.54   -21.12    30.72 1.00     1214     1291
ListL3:SOA_dev1    13.81     13.89   -13.61    41.10 1.00     1343     1402
ListL2:SOA_dev2    12.14     13.64   -14.74    38.96 1.00     1081     1388
ListL3:SOA_dev2    -7.00     13.42   -33.44    19.08 1.00     1316     1391

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    46.44      2.89    40.98    52.46 1.00     2364     1476

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Towards the end of this tutorial, the main take-away is this: when you look at the estimates of (any) model, you could ask yourself a couple of questions like these to make sure you understand what was calculated:

  • What does the intercept represent?
  • What do the single estimates mean?
  • What do they tell me about my hypotheses?

Of course, you will also encounter experimental designs which combine two-level and three-level categorical predictors - but the conceptual basics regarding how to choose the contrasts and how to interpret linear models remain the same.

Conclusion

This tutorial provides an outline of the most common coding schemes, but there are others out there. However, with a basic understanding of the procedures described above, you are well prepared to deal with a broad spectrum of experimental data in psychology and other disciplines. There are heuristics as to which coding scheme to choose; but most importantly, you should choose the statistical tools that best fit your specific research question. It is often helpful to write up the results of a pilot model to check whether their interpretation speaks clearly to your hypotheses.

References

P.-C. Bürkner, 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan”. In Journal of Statistical Software 80(1), pp. 1–28.

J. G. W. Raaijmakers et al., 1999. “How to Deal with ‘The Language-as-Fixed-Effect Fallacy’: Common Misconceptions and Alternative Solutions”. In Journal of Memory and Language 41(3), pp. 416–426.

B. Winter & S. Grawunder, 2012. “The Phonetic Profile of Korean Formal and Informal Speech Registers”. In Journal of Phonetics 40(6), pp. 808–815.