
overview

 

  • hierarchical regression / mixed effects
  • maximal vs. parsimonious mixed effects structures
  • cross-validation, loo and information criteria

mixed effects

motivating example

[figure: mixed-effects illustration]

Barr et al. (2013) Random effects structure for confirmatory hypothesis testing: Keep it maximal JML

mixed effects

Fixed effects (main effects, population-level effects) are, roughly, the effects associated with our experimental manipulation. We expect fixed effects to be identical in an exact repetition of the experiment.

Random effects (group-level effects) capture additional variation through randomness in the realization of the experiment (e.g., randomness from the subjects we sampled or the items we used); models containing both kinds of effects are called mixed-effects models.

Two types of random effects, plus one aspect of their relationship, are relevant:

  1. random intercepts
    • variance in mean between subjects ::: "by-subject varying intercepts"
    • variance in mean between items ::: "by-item varying intercepts"
  2. random slopes
    • variance in main effects between subjects ::: "by-subject varying slopes"
    • variance in main effects between items ::: "by-item varying slopes"
  3. correlations between 1 and 2
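In lme4/brms formula syntax, these three ingredients correspond to the following shorthands (a sketch with hypothetical variables `y`, `x`, `subj`, `item`):

```r
# hypothetical variables: y (outcome), x (predictor), subj & item (grouping factors)
f_intercepts <- y ~ x + (1 | subj) + (1 | item)            # 1. varying intercepts only
f_slopes     <- y ~ x + (1 + x || subj) + (1 + x || item)  # 2. + varying slopes, no correlations
f_correlated <- y ~ x + (1 + x | subj) + (1 + x | item)    # 3. + intercept-slope correlations

# formulas are first-class objects in R and can be built without any data
class(f_correlated)  # "formula"
```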

case study

background on example data

 

  • in most languages, subject relative clauses are easier to process than object relative clauses

  • but Chinese seems to be an exception

 

 

subject relative clause

The senator who interrogated the journalist …

object relative clause

The senator who the journalist interrogated …

Sorensen, Hohenstein, Vasishth (2016) Bayesian linear mixed models using Stan QMP

data: self-paced reading times

37 subjects each read 15 sentences, presented either with an SRC or an ORC, in a self-paced reading task

 

library(tidyverse) # for %>%, the dplyr verbs and ggplot2
rt_data = readr::read_delim('data/06_gibsonwu2012data.txt', delim = " ") %>% 
  filter(region == "headnoun") %>%                        # keep only the head-noun region
  mutate(so = ifelse(type == "subj-ext", "-1", "1")) %>%  # contrast-code SRC (-1) vs. ORC (1)
  select(subj, item, so, rt)
head(rt_data)
## # A tibble: 6 x 4
##    subj  item so       rt
##   <dbl> <dbl> <chr> <dbl>
## 1     1    13 1      1561
## 2     1     6 -1      959
## 3     1     5 1       582
## 4     1     9 1       294
## 5     1    14 -1      438
## 6     1     4 -1      286

\(\Leftarrow\) contrast coding of categorical predictor so

data from Gibson & Wu (2013)
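The contrast coding above can be illustrated on a toy vector (numeric variant; the data set stores it as character, and the `type` values below are made up but follow the data set's labels):

```r
# toy illustration of the +/-1 contrast coding used above
type <- c("subj-ext", "obj-ext", "obj-ext", "subj-ext")
so <- ifelse(type == "subj-ext", -1, 1)
so  # -1  1  1 -1
```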

inspect data

 

rt_data %>% group_by(so) %>% 
  summarize(mean_log_rt = mean(log(rt)))
## # A tibble: 2 x 2
##   so    mean_log_rt
##   <chr>       <dbl>
## 1 -1           6.10
## 2 1            6.02
rt_data %>% 
  ggplot(aes(x = so, y = log(rt))) + 
  geom_violin() + 
  geom_point(position = "jitter", 
             color = "skyblue")

fixed effects model

  • predict log-reading times as affected by treatment so

  • assume improper priors for fixed effects coefficients

 

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \beta_1 \mathtt{so}_i \\ \sigma_{err} & \sim \mathcal{Student\text{-}T}(3,0,10)T(0,\infty) & \beta_1 & \sim \mathcal{U}(-\infty, \infty) \\ \beta_0 & \sim \mathcal{Student\text{-}T}(\dots) \end{align*} \]

FE = brms::brm(log(rt) ~ so, data = rt_data) # flat (improper) priors on slope coefficients by default

fixed effects model: results

 

summary(FE)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log(rt) ~ so 
##    Data: rt_data (Number of observations: 547) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     6.10      0.04     6.03     6.17       3638 1.00
## so1          -0.08      0.05    -0.18     0.02       3680 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     0.60      0.02     0.57     0.64       3767 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

varying intercepts model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be "slower" or "faster" throughout

 

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + \beta_1 \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ \sigma_{err}, \sigma_{u_0}, \sigma_{w_0} & \sim \mathcal{Student\text{-}T}(3,0,10)T(0,\infty) & \beta_1 & \sim \mathcal{U}(-\infty, \infty) \\ \beta_0 & \sim \mathcal{Student\text{-}T}(\dots) \end{align*} \]

VarInt = brms::brm(log(rt) ~ (1 | subj + item) + so, data = rt_data) # varying intercepts for `subj` and `item`

varying intercepts model: results

 

summary(VarInt)$fixed
##              Estimate  Est.Error   l-95% CI   u-95% CI Eff.Sample
## Intercept  6.09923790 0.07561936  5.9484681 6.24571616   1295.830
## so1       -0.07110043 0.04545781 -0.1613091 0.01584177   6737.225
##                Rhat
## Intercept 1.0034653
## so1       0.9997101
summary(VarInt)$random
## $item
##                Estimate  Est.Error  l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept) 0.2025315 0.05247435 0.1237281 0.3279515   1215.134 1.001315
## 
## $subj
##                Estimate Est.Error  l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept) 0.2519749 0.0401475 0.1805333 0.3407266   1452.592 1.001043

varying intercepts & slopes model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be "slower" or "faster" throughout

  • assume that different subjects and items react more or less to so manipulation

 

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + (\beta_1 + \underbrace{u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}}_{\text{varying slopes}} ) \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ u_{1,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_1}) & w_{1,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_1}) \end{align*} \]

VarIntSlo = brms::brm(log(rt) ~ so + (1 + so || subj + item), data = rt_data)
# intercept and slope for `so` vary with the values of `subj` and `item`
# double || means: no correlation between varying intercepts and varying slopes

varying intercepts & slopes model: results

 

summary(VarIntSlo)$fixed
##              Estimate  Est.Error   l-95% CI  u-95% CI Eff.Sample      Rhat
## Intercept  6.09897718 0.07657434  5.9531126 6.2508252   2181.359 1.0010255
## so1       -0.07137108 0.05260568 -0.1785197 0.0301434   5252.381 0.9993491
summary(VarIntSlo)$random
## $item
##                 Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample
## sd(Intercept) 0.19677301 0.05024977 0.116561353 0.3120138   1736.560
## sd(so1)       0.07415912 0.05458555 0.003018347 0.2031322   1692.854
##                   Rhat
## sd(Intercept) 1.000953
## sd(so1)       1.001259
## 
## $subj
##                 Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample
## sd(Intercept) 0.25530898 0.04042325 0.184988855 0.3439405   1515.931
## sd(so1)       0.06594608 0.04790860 0.002615597 0.1796807   1323.166
##                   Rhat
## sd(Intercept) 1.000782
## sd(so1)       1.003443

varying intercepts & slopes model with correlation

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be "slower" or "faster" throughout

  • assume that different subjects and items react more or less to so manipulation

  • assume that random intercepts and slopes might be correlated

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i} + \left (\beta_1 + u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i} \right ) \mathtt{so}_i \\ \begin{pmatrix}u_{0,\mathtt{subj}_i} \\ u_{1,\mathtt{subj}_i} \end{pmatrix} & \sim \mathcal{N} \left (\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{u} \right ) & \Sigma_{u} & = \begin{pmatrix} \sigma_{u_0}^2 & \rho_u\sigma_{u_0}\sigma_{u_1} \\ \rho_u\sigma_{u_0}\sigma_{u_1} & \sigma_{u_1}^2 \end{pmatrix} \ \ \ \text{same for } \mathtt{item} \end{align*} \]

MaxME = brms::brm(log(rt) ~ so + (1 + so | subj + item), data = rt_data)
# intercept and slope for `so` vary with the values of `subj` and `item`
# single | means: varying intercepts and slopes may be correlated (separately per grouping factor)

varying intercepts & slopes model with correlation: results

 

summary(MaxME)$fixed
##              Estimate  Est.Error   l-95% CI  u-95% CI Eff.Sample     Rhat
## Intercept  6.09604948 0.07810505  5.9332789 6.2471472   1648.000 1.003476
## so1       -0.07194685 0.05432166 -0.1807734 0.0364145   4318.558 1.000453

varying intercepts & slopes model with correlation: results

 

summary(MaxME)$random
## $item
##                       Estimate  Est.Error     l-95% CI  u-95% CI
## sd(Intercept)       0.20565843 0.05746807  0.117118868 0.3443971
## sd(so1)             0.06888409 0.05528479  0.002535385 0.2031245
## cor(Intercept,so1) -0.04863071 0.53578337 -0.936757183 0.9095697
##                    Eff.Sample      Rhat
## sd(Intercept)        1710.324 1.0017317
## sd(so1)              1621.609 0.9992771
## cor(Intercept,so1)   4237.833 1.0003288
## 
## $subj
##                      Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample
## sd(Intercept)       0.2982016 0.05125165  0.20982175 0.4098908   1935.553
## sd(so1)             0.1274528 0.06383748  0.01245766 0.2550966   1360.811
## cor(Intercept,so1) -0.6755960 0.32072761 -0.99135634 0.2751746   2977.401
##                        Rhat
## sd(Intercept)      1.001748
## sd(so1)            1.001602
## cor(Intercept,so1) 1.000139

which random effect structure?

maximal vs. parsimonious RE structures

maximal RE structure

Barr et al. (2013) recommend using the maximal RE structure licensed by the design (that still converges) for confirmatory hypothesis testing.

parsimonious RE structure

Bates et al. (2016) recommend using, roughly, the maximal RE structure that is necessary for explaining variance in the data (i.e., the maximal RE structure justified by the data).

parsimonious RE structure for the case study

summary(MaxME)$random
## $item
##                       Estimate  Est.Error     l-95% CI  u-95% CI
## sd(Intercept)       0.20565843 0.05746807  0.117118868 0.3443971
## sd(so1)             0.06888409 0.05528479  0.002535385 0.2031245
## cor(Intercept,so1) -0.04863071 0.53578337 -0.936757183 0.9095697
##                    Eff.Sample      Rhat
## sd(Intercept)        1710.324 1.0017317
## sd(so1)              1621.609 0.9992771
## cor(Intercept,so1)   4237.833 1.0003288
## 
## $subj
##                      Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample
## sd(Intercept)       0.2982016 0.05125165  0.20982175 0.4098908   1935.553
## sd(so1)             0.1274528 0.06383748  0.01245766 0.2550966   1360.811
## cor(Intercept,so1) -0.6755960 0.32072761 -0.99135634 0.2751746   2977.401
##                        Rhat
## sd(Intercept)      1.001748
## sd(so1)            1.001602
## cor(Intercept,so1) 1.000139
  • the posteriors for by-subject and by-item sd(so1), as well as for both correlation coefficients, are rather diffuse and (almost) include zero
  • maybe these parameters are not needed to explain the data? \(\leadsto\) model comparison

a recipe for balancing parsimony (in Bayesian analyses)

[caveat: this is only one possible recipe]

  1. fit (if possible) a maximal model
  2. inspect the posteriors and identify parameters whose estimates do not appear conclusively different from the value they would take in a simpler model that omits them (e.g., sd = 0 or cor = 0)
  3. remove these parameters one by one, comparing each reduced model to the previous one

never follow a recipe

   

"There is no cook-book substitute for theoretical considerations and developing statistical understanding. Each data set deserves the exercise of judgement on part of the researcher."

Bates et al. (2016) Parsimonious Mixed Models preprint

(leave-one-out) cross-validation

motivation

Bayes factors are conceptually great, but they compare models ex ante and are computationally heavy.

When comparing many complex hierarchical models, e.g., for parsimony of RE structures, we might want ex post comparisons that are very efficient to compute / approximate.

information criteria

deviance score

\[ D(y^{(\text{rep})}, \theta) = -2 \log P(y^{(\text{rep})} \mid \theta) \ \ \ \ \ \ \ \ \ \ \ \ \text{(lower is better)} \]
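For a concrete feel, the deviance of a normal likelihood can be computed in base R (a sketch with made-up data and parameter values):

```r
# deviance D = -2 * log P(y | theta) for a normal likelihood (base R only)
deviance_score <- function(y, mu, sigma) {
  -2 * sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
y <- c(1.2, 0.8, 1.5)
deviance_score(y, mu = 1.1, sigma = 0.5)  # parameters close to the data: small deviance
deviance_score(y, mu = 5.0, sigma = 0.5)  # parameters far from the data: large deviance
```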

un-Bayesian IC

\[ \begin{align*} \text{AIC} & = D(y^{(\text{rep})}, \theta^*) + 2 p_{\text{AIC}} & \text{with } \theta^* \in \arg\max_\theta P(y \mid \theta) \text{ and } p_{\text{AIC}} = \text{no. free parameters} \end{align*} \]
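AIC can likewise be checked by hand against R's built-in `AIC` for an intercept-only model (toy data; note that `logLik.lm` uses the ML variance estimate, i.e., dividing by \(n\) rather than \(n-1\)):

```r
y <- c(1.2, 0.8, 1.5, 1.1)
mu_hat    <- mean(y)                    # ML estimate of the mean
sigma_hat <- sqrt(mean((y - mu_hat)^2)) # ML estimate of the sd (divide by n, not n-1)
D <- -2 * sum(dnorm(y, mu_hat, sigma_hat, log = TRUE))
aic_by_hand <- D + 2 * 2                # two free parameters: mu and sigma
all.equal(aic_by_hand, AIC(lm(y ~ 1)))  # TRUE
```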

pseudo-Bayesian IC

\[ \begin{align*} \text{DIC} & = D(y^{(\text{rep})}, \bar{\theta}) + 2 p_{\text{DIC}} && \text{with } \bar{\theta} = E_{P(\theta \mid y)}\theta \\ &&& \text{ and } p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]

rather Bayesian IC

\[ \begin{align*} \text{WAIC} & = E_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) + 2 p_{\text{WAIC}} && \text{with } p_{\text{WAIC}} = \sum_{i=1}^n Var_{P(\theta \mid y)} \log P(y^{(\text{rep})}_i \mid \theta) \end{align*} \]

fix a model with \(P(Y \mid \theta)\) & \(P(\theta)\); \(y\) is the data set for conditioning; \(y^{(\text{rep})}\) is new data or \(y^{(\text{rep})} = y\)

leave-one-out cross-validation

log point-wise density

how (log-)likely is each (new) datum \(y^{(\text{rep})}_i\) under the posterior predictive distribution given \(y\)?

\[ \begin{align*} \text{LPD} & = \sum_{i=1}^n \log P(y^{(\text{rep})}_i \mid y) \ \ \ \ = \sum_{i=1}^n \log \int P(y^{(\text{rep})}_i \mid \theta) \ P(\theta \mid y) \ \text{d}\theta \\ & \approx \sum_{i=1}^n \log \left ( \frac{1}{S} \sum_{s = 1}^S P(y^{(\text{rep})}_i \mid \theta^s) \right ) \ \ \ \ \ \ \theta^s \sim P(\theta \mid y) \ \ \ \text{(from MCMC sampling)} \end{align*} \]
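The Monte Carlo approximation in the last line is easy to implement; a sketch with fabricated "posterior samples" standing in for real MCMC output:

```r
set.seed(1)
theta_s <- rnorm(1000, mean = 0, sd = 0.1)  # stand-in posterior samples for the mean
y_rep   <- c(-0.2, 0.1)                     # two "new" observations
# note: average the likelihoods over samples first, then take the log --
# not the other way around
lpd <- sum(sapply(y_rep, function(yi) log(mean(dnorm(yi, mean = theta_s, sd = 1)))))
lpd
```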

 

leave-one-out cross-validation

how (log-)likely is each old datum \(y_i\) under the posterior predictive distribution given \(y_{-i}\)?

\[ \text{LOO} = \sum_{i=1}^n \log P(y_i \mid y_{-i}) \ \ \ \ = \sum_{i=1}^n \log \int P(y_i \mid \theta) \ P(\theta \mid y_{-i}) \ \text{d}\theta \]

package loo

efficiently estimates LOO score (with estimate of standard error) from raw MCMC output

loo_scores = loo(FE, VarInt, VarIntSlo, MaxME)
show(loo_scores)
##                     LOOIC    SE
## FE                 995.41 49.72
## VarInt             881.79 53.17
## VarIntSlo          883.00 53.23
## MaxME              882.02 52.63
## FE - VarInt        113.62 21.56
## FE - VarIntSlo     112.41 21.60
## FE - MaxME         113.39 21.82
## VarInt - VarIntSlo  -1.21  1.63
## VarInt - MaxME      -0.23  4.03
## VarIntSlo - MaxME    0.98  3.53

NB:

  • brms::kfold implements \(k\)-fold cross-validation
  • loo uses a clever approximation (Pareto-smoothed importance sampling) and is much faster than exact LOO-CV, i.e., kfold with \(k = n\)

wrap-up

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear\_combination}(X) \\ \mu & = \text{link\_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh\_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]
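As a concrete instance of the blueprint, logistic regression assembles the three pieces as follows (coefficients and predictor values are made up for illustration):

```r
X   <- c(-1, 0, 1)          # predictor values
eta <- 0.5 + 2 * X          # linear combination of X
mu  <- plogis(eta)          # link function: inverse logit maps eta into (0, 1)
set.seed(42)
y   <- rbinom(length(mu), size = 1, prob = mu)  # likelihood function: Bernoulli(mu)
```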

[figure: glm_scheme]

common link & likelihood function

mixed effects

Fixed effects (main effects, population-level effects) are, roughly, the effects associated with our experimental manipulation. We expect fixed effects to be identical in an exact repetition of the experiment.

Random effects (group-level effects) capture additional variation through randomness in the realization of the experiment (e.g., randomness from the subjects we sampled or the items we used); models containing both kinds of effects are called mixed-effects models.

Two types of random effects, plus one aspect of their relationship, are relevant:

  1. random intercepts
    • variance in mean between subjects ::: "by-subject varying intercepts"
    • variance in mean between items ::: "by-item varying intercepts"
  2. random slopes
    • variance in main effects between subjects ::: "by-subject varying slopes"
    • variance in main effects between items ::: "by-item varying slopes"
  3. correlations between 1 and 2