Hierarchical GLMs

Michael Franke

Case study: processing relative clauses

 

  • in most languages, subject relative clauses
    are easier to process than object relative clauses

  • but Chinese seems to be an exception

 

 

subject relative clause

The senator who interrogated the journalist …

object relative clause

The senator who the journalist interrogated …


data: self-paced reading times

37 subjects read 15 sentences each, presented either with an SRC or with an ORC, in a self-paced reading task

 

# A tibble: 15 × 4
    subj  item so       rt
   <dbl> <dbl> <chr> <dbl>
 1     1    13 1      1561
 2     1     6 -1      959
 3     1     5 1       582
 4     1     9 1       294
 5     1    14 -1      438
 6     1     4 -1      286
 7     1     8 -1      438
 8     1    10 -1      278
 9     1     2 -1      542
10     1    11 1       494
11     1     7 1       270
12     1     3 1       406
13     1    16 -1      374
14     1    15 1       286
15     1     1 1       246

\(\Leftarrow\) contrast coding of the categorical predictor so (-1 = SRC, 1 = ORC)

data from Gibson & Wu (2013)

inspect data

# A tibble: 2 × 2
  so    mean_log_rt
  <chr>       <dbl>
1 -1           6.10
2 1            6.02
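The summary above can be reproduced directly from the raw data; a minimal dplyr sketch, assuming the data live in a tibble called rt_data with columns so and rt (the name rt_data matches the brms output further below):

library(dplyr)

# mean log-reading time per level of the contrast-coded predictor `so`
rt_data |>
  group_by(so) |>
  summarize(mean_log_rt = mean(log(rt)))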

fixed effects model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

 

\[
\begin{align*}
\log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) &
\eta_{i} & = \beta_0 + \beta_1 \mathtt{so}_i \\
\sigma_{err} & \sim \mathcal{U}(0, \infty) &
\beta_0, \beta_1 & \sim \mathcal{U}(-\infty, \infty)
\end{align*}
\]
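In brms this model corresponds to a plain regression formula; a minimal sketch, reusing the data set name rt_data from the output below (brms' defaults give flat, improper priors on the coefficients; its default prior on sigma is weakly informative rather than strictly uniform, so the match to the specification above is approximate):

library(brms)

# fixed effects only: log-reading time as a function of `so`
fit_FE <- brm(
  formula = log(rt) ~ so,
  data    = rt_data
)
summary(fit_FE)  # produces the output shown below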

fixed effects model: results

 

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log(rt) ~ so 
   Data: rt_data (Number of observations: 547) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     6.10      0.04     6.03     6.17 1.00     4102     3064
so1          -0.08      0.05    -0.17     0.02 1.00     4434     2854

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.60      0.02     0.57     0.64 1.00     3996     2748

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).


varying intercepts model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be “slower” or “faster” throughout

 

\[
\begin{align*}
\log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) &
\eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + \beta_1 \mathtt{so}_i \\
u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) &
w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\
\sigma_{err}, \sigma_{u_0}, \sigma_{w_0} & \sim \mathcal{U}(0, \infty) &
\beta_0, \beta_1 & \sim \mathcal{U}(-\infty, \infty)
\end{align*}
\]
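A sketch of the corresponding brms call; the terms (1 | subj) and (1 | item) add varying intercepts per subject and per item:

# varying intercepts for subjects and items
fit_VarInt <- brm(
  formula = log(rt) ~ so + (1 | subj) + (1 | item),
  data    = rt_data
)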

varying intercepts model: results

 

interc.+slopes model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be “slower” or “faster” throughout

  • assume that different subjects and items react more or less strongly to the so manipulation

 

\[
\begin{align*}
\log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err})\\
\eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + (\beta_1 + \underbrace{u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}}_{\text{varying slopes}}) \, \mathtt{so}_i
\end{align*}
\]
\[
\begin{align*}
u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\
u_{1,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_1}) & w_{1,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_1}) \\
\sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(-\infty, \infty)
\end{align*}
\]
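A sketch of the corresponding brms call; the double-bar syntax || adds varying intercepts and slopes while suppressing the correlation parameters, matching the model above:

# varying intercepts and slopes, without correlations (`||`)
fit_VarIntSlo <- brm(
  formula = log(rt) ~ so + (1 + so || subj) + (1 + so || item),
  data    = rt_data
)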

interc.+slopes model: results

 

interc.+slopes model w/ correlation

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be “slower” or “faster” throughout

  • assume that different subjects and items react more or less strongly to the so manipulation

  • assume that random intercepts and slopes might be correlated

\[
\begin{align*}
\log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) \\
\eta_{i} & = \beta_0 + u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i} + \left(\beta_1 + u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}\right) \mathtt{so}_i \\
\begin{pmatrix} u_{0,\mathtt{subj}_i} \\ u_{1,\mathtt{subj}_i} \end{pmatrix} & \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{u}\right) \\
\Sigma_{u} & = \begin{pmatrix} \sigma_{u_0}^2 & \rho_u \sigma_{u_0} \sigma_{u_1} \\ \rho_u \sigma_{u_0} \sigma_{u_1} & \sigma_{u_1}^2 \end{pmatrix} \quad \text{same for } \mathtt{item} \\
\beta_0, \beta_1 & \sim \mathcal{U}(-\infty, \infty) \qquad \rho_u, \rho_w \sim \mathcal{U}(-1, 1) \qquad \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} \sim \mathcal{U}(0, \infty)
\end{align*}
\]
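A sketch of the corresponding brms call; with the single-bar syntax |, brms also estimates the correlation between varying intercepts and slopes:

# varying intercepts and slopes, with correlations (`|`)
fit_MaxRE <- brm(
  formula = log(rt) ~ so + (1 + so | subj) + (1 + so | item),
  data    = rt_data
)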

interc.+slopes model w/ corr.: results

 

How to choose RE-structure

  • two approaches:
    1. “keep it maximal”
       • include the maximal RE structure that “makes sense”
       • what makes sense can depend on a priori conceptual considerations
       • the data might not be sufficient to estimate some RE coefficients
    2. “let the data decide”
       • fit models with varying RE structures and compare them (see the sketch below)
  • the former is more careful / prudent in a scientific context (learning about the world from the model and the data); the latter may be more adequate in an engineering context (predicting well enough with efficient models)
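A sketch of what “letting the data decide” could look like, comparing the four models by approximate leave-one-out cross-validation (the fit names are the hypothetical ones introduced in the sketches above):

# compare predictive performance of the four RE structures via LOO-CV
loo_compare(
  loo(fit_FE),
  loo(fit_VarInt),
  loo(fit_VarIntSlo),
  loo(fit_MaxRE)
)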