\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \]
\[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \]
\[ \newcommand{\set}[1]{\{#1\}} \]
\[ \newcommand{\tuple}[1]{\langle#1\rangle} \]
\[\newcommand{\States}{{T}}\]
\[\newcommand{\state}{{t}}\]
\[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]
Barr et al. (2013) Random effects structure for confirmatory hypothesis testing: Keep it maximal, JML
Fixed effects (main effects, population-level effects) are, roughly, the effects associated with our experimental intervention. We expect fixed effects to be exactly recreatable in a repetition of the experiment.
Mixed effects (random effects, group-level effects) capture additional variation through randomness in the realization of the experiment (e.g., randomness from the subjects we sampled or the items we used).
There are two types of mixed effects (varying intercepts and varying slopes) and one aspect (their possible correlation) of relevance:
In most languages, subject relative clauses are easier to process than object relative clauses, but Chinese seems to be an exception.
subject relative clause: The senator who interrogated the journalist …
object relative clause: The senator who the journalist interrogated …
Sorensen, Hohenstein & Vasishth (2016) Bayesian linear mixed models using Stan: A tutorial, QMP
37 subjects read 15 sentences, each presented either with an SRC or with an ORC, in a self-paced reading task
rt_data = readr::read_delim('data/06_gibsonwu2012data.txt', delim = " ") %>%
  filter(region == "headnoun") %>%
  mutate(so = ifelse(type == "subj-ext", "-1", "1")) %>%
  select(subj, item, so, rt)
head(rt_data)
## # A tibble: 6 x 4
##    subj  item so       rt
##   <dbl> <dbl> <chr> <dbl>
## 1     1    13 1      1561
## 2     1     6 -1      959
## 3     1     5 1       582
## 4     1     9 1       294
## 5     1    14 -1      438
## 6     1     4 -1      286
\(\Leftarrow\) contrast coding of categorical predictor so
data from Gibson & Wu (2013)
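To see what the ±1 ("sum") contrast coding buys us, here is a minimal sketch with made-up log-RT values (not the Gibson & Wu data): the intercept then estimates the grand mean and the slope half the condition difference.

```r
# Minimal sketch of +/-1 ("sum") contrast coding, with made-up values:
# the intercept becomes the grand mean, the slope half the condition difference.
so     = c(-1, -1, 1, 1)            # -1 = subj-ext (SRC), 1 = obj-ext (ORC)
log_rt = c(6.0, 6.2, 6.4, 6.6)      # hypothetical log reading times
fit = lm(log_rt ~ so)
unname(coef(fit)["(Intercept)"])    # 6.3 = grand mean
unname(coef(fit)["so"])             # 0.2 = (mean(ORC) - mean(SRC)) / 2
```

Note that `so` is stored as a character vector in `rt_data`, so brms treats it as a factor; this is why the coefficient is labelled `so1` in the model summaries.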
rt_data %>%
  group_by(so) %>%
  summarize(mean_log_rt = mean(log(rt)))
## # A tibble: 2 x 2
##   so    mean_log_rt
##   <chr>       <dbl>
## 1 -1           6.10
## 2 1            6.02
rt_data %>%
  ggplot(aes(x = so, y = log(rt))) +
  geom_violin() +
  geom_point(position = "jitter", color = "skyblue")
predict log-reading times as affected by treatment so
assume a flat (improper) prior for the slope coefficient (brms's defaults: Student-\(t\) priors on the intercept and on \(\sigma_{err}\))
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \beta_1 \mathtt{so}_i \\ \sigma_{err} & \sim \mathcal{Student\text{-}T}(3,0, 10)T(0,\infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_1 & \sim \mathcal{U}(- \infty, \infty) \\ \beta_0 & \sim \mathcal{Student\text{-}T}( \dots ) \end{align*} \]
FE = brms::brm(log(rt) ~ so, data = rt_data) # flat (improper) priors on population-level slopes by default
summary(FE)
##  Family: gaussian
##   Links: mu = identity; sigma = identity
## Formula: log(rt) ~ so
##    Data: rt_data (Number of observations: 547)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     6.10      0.04     6.03     6.17       3638 1.00
## so1          -0.08      0.05    -0.18     0.02       3680 1.00
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     0.60      0.02     0.57     0.64       3767 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + \beta_1 \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ \sigma_{err}, \sigma_{u_0}, \sigma_{w_0} & \sim \mathcal{Student\text{-}T}(3,0, 10)T(0,\infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_1 & \sim \mathcal{U}(- \infty, \infty) \\ \beta_0 & \sim \mathcal{Student\text{-}T}( \dots ) \end{align*} \]
VarInt = brms::brm(log(rt) ~ (1 | subj + item) + so, data = rt_data) # varying intercepts for `subj` and `item`
summary(VarInt)$fixed
##              Estimate  Est.Error   l-95% CI   u-95% CI Eff.Sample      Rhat
## Intercept  6.09923790 0.07561936  5.9484681 6.24571616   1295.830 1.0034653
## so1       -0.07110043 0.04545781 -0.1613091 0.01584177   6737.225 0.9997101
summary(VarInt)$random
## $item
##                Estimate  Est.Error  l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept) 0.2025315 0.05247435 0.1237281 0.3279515   1215.134 1.001315
##
## $subj
##                Estimate Est.Error  l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept) 0.2519749 0.0401475 0.1805333 0.3407266   1452.592 1.001043
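To make the varying-intercepts model concrete, here is a base-R sketch that simulates data from it; the parameter values are made up (chosen near the posterior means above), and the condition assignment is arbitrary.

```r
# Simulate data from the varying-intercepts model (made-up parameter values)
set.seed(1)
n_subj = 37; n_item = 15
beta_0 = 6.10; beta_1 = -0.07; sigma_err = 0.60
u_0 = rnorm(n_subj, 0, 0.25)   # by-subject intercept offsets
w_0 = rnorm(n_item, 0, 0.20)   # by-item intercept offsets
sim = expand.grid(subj = 1:n_subj, item = 1:n_item)
sim$so = ifelse(sim$item %% 2 == 0, -1, 1)   # arbitrary condition assignment
eta = beta_0 + u_0[sim$subj] + w_0[sim$item] + beta_1 * sim$so
sim$log_rt = rnorm(nrow(sim), eta, sigma_err)
head(sim)
```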
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
assume that different subjects and items react more or less strongly to the so manipulation
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + (\beta_1 + \underbrace{u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}}_{\text{varying slopes}} ) \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ u_{1,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_1}) & w_{1,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_1}) \end{align*} \]
VarIntSlo = brms::brm(log(rt) ~ so + (1 + so || subj + item), data = rt_data)
# intercept and slope for `so` vary with the values of `subj` and `item`
# double || means: no correlation between varying intercepts and slopes
summary(VarIntSlo)$fixed
##              Estimate  Est.Error   l-95% CI  u-95% CI Eff.Sample      Rhat
## Intercept  6.09897718 0.07657434  5.9531126 6.2508252   2181.359 1.0010255
## so1       -0.07137108 0.05260568 -0.1785197 0.0301434   5252.381 0.9993491
summary(VarIntSlo)$random
## $item
##                 Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept) 0.19677301 0.05024977 0.116561353 0.3120138   1736.560 1.000953
## sd(so1)       0.07415912 0.05458555 0.003018347 0.2031322   1692.854 1.001259
##
## $subj
##                 Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept) 0.25530898 0.04042325 0.184988855 0.3439405   1515.931 1.000782
## sd(so1)       0.06594608 0.04790860 0.002615597 0.1796807   1323.166 1.003443
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
assume that different subjects and items react more or less strongly to the so manipulation
assume that random intercepts and slopes might be correlated
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i} + \left (\beta_1 + u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i} \right ) \mathtt{so}_i \\ \begin{pmatrix}u_{0,\mathtt{subj}_i} \\ u_{1,\mathtt{subj}_i} \end{pmatrix} & \sim \mathcal{N} \left (\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{u} \right ) & \Sigma_{u} & = \begin{pmatrix} \sigma_{u_{0}}^2 & \rho_u\sigma_{u_{0}}\sigma_{u_{1}} \\ \rho_u\sigma_{u_{0}}\sigma_{u_{1}} & \sigma_{u_{1}}^2 \end{pmatrix} \ \ \ \text{same for } \mathtt{item} \end{align*} \]
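The bivariate-normal assumption can be made concrete with MASS::mvrnorm. A sketch with made-up values for the standard deviations and the correlation; `empirical = TRUE` makes the sample moments match the specified ones exactly.

```r
# Draw correlated by-subject intercept and slope offsets (made-up parameters)
sigma_u0 = 0.30; sigma_u1 = 0.13; rho_u = -0.68
Sigma_u = matrix(c(sigma_u0^2,                  rho_u * sigma_u0 * sigma_u1,
                   rho_u * sigma_u0 * sigma_u1, sigma_u1^2),
                 nrow = 2)
u = MASS::mvrnorm(37, mu = c(0, 0), Sigma = Sigma_u, empirical = TRUE)
cor(u[, 1], u[, 2])   # exactly rho_u, since empirical = TRUE
```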
MaxME = brms::brm(log(rt) ~ so + (1 + so | subj + item), data = rt_data)
# intercept and slope for `so` vary with the values of `subj` and `item`
# single | means: varying intercepts and slopes may be correlated
summary(MaxME)$fixed
##              Estimate  Est.Error   l-95% CI  u-95% CI Eff.Sample     Rhat
## Intercept  6.09604948 0.07810505  5.9332789 6.2471472   1648.000 1.003476
## so1       -0.07194685 0.05432166 -0.1807734 0.0364145   4318.558 1.000453
summary(MaxME)$random
## $item
##                       Estimate  Est.Error     l-95% CI  u-95% CI Eff.Sample      Rhat
## sd(Intercept)       0.20565843 0.05746807  0.117118868 0.3443971   1710.324 1.0017317
## sd(so1)             0.06888409 0.05528479  0.002535385 0.2031245   1621.609 0.9992771
## cor(Intercept,so1) -0.04863071 0.53578337 -0.936757183 0.9095697   4237.833 1.0003288
##
## $subj
##                      Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept)       0.2982016 0.05125165  0.20982175 0.4098908   1935.553 1.001748
## sd(so1)             0.1274528 0.06383748  0.01245766 0.2550966   1360.811 1.001602
## cor(Intercept,so1) -0.6755960 0.32072761 -0.99135634 0.2751746   2977.401 1.000139
maximal RE structure
Barr et al. (2013) recommend using the maximal RE structure licensed by the design (that still converges) for confirmatory hypothesis testing.
parsimonious RE structure
Bates et al. (2016) recommend using, roughly, the RE structure that is necessary for explaining variance in the data (i.e., the maximal RE structure justified by the data).
summary(MaxME)$random
## $item
##                       Estimate  Est.Error     l-95% CI  u-95% CI Eff.Sample      Rhat
## sd(Intercept)       0.20565843 0.05746807  0.117118868 0.3443971   1710.324 1.0017317
## sd(so1)             0.06888409 0.05528479  0.002535385 0.2031245   1621.609 0.9992771
## cor(Intercept,so1) -0.04863071 0.53578337 -0.936757183 0.9095697   4237.833 1.0003288
##
## $subj
##                      Estimate  Est.Error    l-95% CI  u-95% CI Eff.Sample     Rhat
## sd(Intercept)       0.2982016 0.05125165  0.20982175 0.4098908   1935.553 1.001748
## sd(so1)             0.1274528 0.06383748  0.01245766 0.2550966   1360.811 1.001602
## cor(Intercept,so1) -0.6755960 0.32072761 -0.99135634 0.2751746   2977.401 1.000139
The posteriors for sd(so1) in both groups, as well as both correlation coefficients, seem rather diffuse and (almost) include the zero point (sd = 0 or cor = 0). [Caveat: this is only one possible recipe.]
"There is no cook-book substitute for theoretical considerations and developing statistical understanding. Each data set deserves the exercise of judgement on part of the researcher."
Bates et al. (2016) Parsimonious mixed models, arXiv preprint
Bayes factors are conceptually great, but they compare models ex ante and are computationally heavy.
When comparing many complex hierarchical models, e.g., for parsimony of RE structures, we might want ex post comparisons that are very efficient to compute / approximate.
deviance score
\[ D(y^{(\text{rep})}, \theta) = -2 \log(P(y^{(\text{rep})}\mid \theta)) \ \ \ \ \ \ \ \ \ \ \ \ \text{(lower is better)} \]
un-Bayesian IC
\[ \begin{align*} \text{AIC} & = D(y^{(\text{rep})}, \theta^*) + 2 p_{\text{AIC}} & \text{with } \theta^* \in \arg\max_\theta P(y \mid \theta) \text{ and } p_{\text{AIC}} = \text{no. free parameters} \end{align*} \]
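As a sanity check on this formula, a toy example with made-up data: for a Gaussian linear model, the deviance at the ML estimate plus twice the number of free parameters (intercept, slope, \(\sigma\)) reproduces R's built-in AIC().

```r
# Recompute AIC by hand for a toy linear model (made-up data)
y = c(1.2, 0.8, 1.9, 2.4)
x = c(0, 1, 2, 3)
fit = lm(y ~ x)
D_at_mle = -2 * as.numeric(logLik(fit))   # deviance at the ML estimate
aic_manual = D_at_mle + 2 * 3             # p_AIC = 3: intercept, slope, sigma
all.equal(aic_manual, AIC(fit))           # TRUE
```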
pseudo-Bayesian IC
\[ \begin{align*} \text{DIC} & = D(y^{(\text{rep})}, \bar{\theta}) + 2 p_{\text{DIC}} && \text{with } \bar{\theta} = E_{P(\theta \mid y)}\theta \\ &&& \text{ and } p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]
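The DIC ingredients can be read directly off posterior samples. A toy sketch with stand-in "posterior" draws for the mean of a normal with known \(\sigma = 1\) (all values made up, not from any fitted model):

```r
# DIC from toy posterior draws: deviance at the posterior mean plus twice
# p_DIC, where p_DIC is half the posterior variance of the deviance
set.seed(42)
y = c(0.1, -0.4, 0.3)
theta = rnorm(4000, 0, 0.2)   # stand-in posterior draws of the mean mu
D = sapply(theta, function(m) -2 * sum(dnorm(y, m, 1, log = TRUE)))
p_DIC = 0.5 * var(D)
DIC = -2 * sum(dnorm(y, mean(theta), 1, log = TRUE)) + 2 * p_DIC
```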
rather Bayesian IC
\[ \begin{align*} \text{WAIC} & = E_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) + 2 p_{\text{WAIC}} && \text{with } p_{\text{WAIC}} = \sum_{i=1}^n Var_{P(\theta \mid y)} \log P(y^{(\text{rep})}_i \mid \theta) \end{align*} \]
fix a model with \(P(Y \mid \theta)\) & \(P(\theta)\); \(y\) is the data set for conditioning; \(y^{(\text{rep})}\) is new data or \(y^{(\text{rep})} = y\)
log point-wise density
how (log-)likely is each (new) datum \(y^{(\text{rep})}_i\) under the posterior predictive distribution given \(y\)?
\[ \begin{align*} \text{LPD} & = \sum_{i=1}^n \log P(y^{(\text{rep})}_i \mid y) \ \ \ \ = \sum_{i=1}^n \log \int P(y^{(\text{rep})}_i \mid \theta) \ P(\theta \mid y) \ \text{d}\theta \\ & \approx \sum_{i=1}^n \log \left ( \frac{1}{S} \sum_{s = 1}^S P(y^{(\text{rep})}_i \mid \theta^s) \right ) \ \ \ \ \ \ \theta^s \sim P(\theta \mid y) \ \ \ \text{(from MCMC sampling)} \end{align*} \]
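The Monte Carlo approximation in the last line is easy to spell out. A toy sketch for a normal model with known \(\sigma = 1\); the "posterior" draws are stand-ins, not from MCMC:

```r
# Approximate the LPD by averaging pointwise likelihoods over posterior draws
set.seed(42)
y_rep = c(0.1, -0.4, 0.3)         # "new" data points
theta = rnorm(4000, 0, 0.2)       # stand-in posterior draws of the mean mu
lpd = sum(log(sapply(y_rep, function(yi) mean(dnorm(yi, mean = theta, sd = 1)))))
lpd   # negative: each averaged density is below 1
```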
leave-one-out cross-validation
how (log-)likely is each old datum \(y_i\) under the posterior predictive distribution given \(y_{-i}\)?
\[ \text{LOO} = \sum_{i=1}^n \log P(y_i \mid y_{-i}) \ \ \ \ = \sum_{i=1}^n \log \int P(y_i \mid \theta) \ P(\theta \mid y_{-i}) \ \text{d}\theta \]
loo
efficiently estimates the LOO score (with an estimate of its standard error) from raw MCMC output
loo_scores = loo(FE, VarInt, VarIntSlo, MaxME)
show(loo_scores)
##                      LOOIC    SE
## FE                  995.41 49.72
## VarInt              881.79 53.17
## VarIntSlo           883.00 53.23
## MaxME               882.02 52.63
## FE - VarInt         113.62 21.56
## FE - VarIntSlo      112.41 21.60
## FE - MaxME          113.39 21.82
## VarInt - VarIntSlo   -1.21  1.63
## VarInt - MaxME       -0.23  4.03
## VarIntSlo - MaxME     0.98  3.53
NB: brms::kfold implements \(k\)-fold cross-validation; loo uses a clever approximation and is leagues faster than computing leave-one-out exactly via kfold (with folds of size 1).
terminology
blueprint of a GLM
\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]
| type of \(y\) | (inverse) link function | likelihood function |
|---|---|---|
| metric | \(\mu = \eta\) | \(y \sim \text{Normal}(\mu, \sigma)\) |
| binary | \(\mu = \text{logistic}(\eta, \theta, \gamma) = (1 + \exp(-\gamma (\eta - \theta)))^{-1}\) | \(y \sim \text{Binomial}(\mu)\) |
| nominal | \(\mu_k = \text{soft-max}(\eta_k, \lambda) \propto \exp(\lambda \eta_k)\) | \(y \sim \text{Multinomial}({\mu})\) |
| ordinal | \(\mu_k = \text{threshold-Phi}(\eta_k, \sigma, {\delta})\) | \(y \sim \text{Multinomial}({\mu})\) |
| count | \(\mu = \exp(\eta)\) | \(y \sim \text{Poisson}(\mu)\) |
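The count row of the table, for instance, corresponds to a standard Poisson GLM with a log link, which can be fitted in base R (with made-up count data):

```r
# Poisson regression: mu = exp(eta), y ~ Poisson(mu)  (made-up count data)
x = c(0, 1, 2, 3, 4)
y = c(1, 2, 4, 9, 20)
fit = glm(y ~ x, family = poisson(link = "log"))
eta = predict(fit, type = "link")        # linear predictor
mu  = fitted(fit)                        # inverse link applied: exp(eta)
all.equal(unname(mu), exp(unname(eta)))  # TRUE
```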
Fixed effects (main effects, population-level effects) are, roughly, the effects associated with our experimental intervention. We expect fixed effects to be exactly recreatable in a repetition of the experiment.
Mixed effects (random effects, group-level effects) capture additional variation through randomness in the realization of the experiment (e.g., randomness from the subjects we sampled or the items we used).
There are two types of mixed effects (varying intercepts and varying slopes) and one aspect (their possible correlation) of relevance: