\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

overview

 

  • GLMs with other types of predictor variables
    • binary outcomes: logistic regression
    • nominal outcomes: multi-logit regression
    • ordinal outcomes: ordinal (logit/probit) regression
  • mixed effects
  • cross-validation, loo and information criteria

recap

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

glm_scheme

linear regression: a Bayesian approach

Bayes: likelihood + prior

inspect posterior distribution over \(\beta_0\), \(\beta_1\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_1 & \sim \mathcal{N}(0, \sigma_{\beta}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]

model{
  # JAGS parameterizes the normal by precision: tau_e = 1/sigma_e^2
  sigma_e <- 1/sqrt(tau_e)
  tau_e ~ dgamma(0.1, 0.1)
  # vague priors on the regression coefficients (tiny precision = huge variance)
  b0 ~ dnorm(0, 1/10000000)
  b1 ~ dnorm(0, 1/10000000)
  for (i in 1:k){
    yPred[i] <- b0 + b1 * x[i]
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}
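To run this model from R, the code above could be passed to JAGS via the rjags package, roughly as sketched below (the file name lin_reg.jags and the data vectors x and y are illustrative assumptions, not part of the original model):

library(rjags)   # sketch: assumes JAGS and rjags are installed

# illustrative data; any numeric predictor x and outcome y of equal length work
x = cars$speed
y = cars$dist

jags_data = list(x = x, y = y, k = length(y))    # k = number of observations
jags_model = jags.model(file = "lin_reg.jags",   # hypothetical file holding the model code above
                        data = jags_data, n.chains = 2)
samples = coda.samples(jags_model,
                       variable.names = c("b0", "b1", "sigma_e"),
                       n.iter = 10000)
summary(samples)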

GLM link functions

common link & likelihood functions

logistic regression for binomial data

 

\[ \begin{align*} \eta & = X \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]

 

\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]

threshold \(\theta\)

gain \(\gamma\)
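a small R sketch of how threshold \(\theta\) and gain \(\gamma\) reshape the logistic link (the parameter values are arbitrary):

# logistic link with threshold theta and gain gamma
logistic = function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}

eta = seq(-6, 6, by = 0.1)
plot(eta, logistic(eta), type = "l")               # standard logistic
lines(eta, logistic(eta, theta = 2), lty = 2)      # shifted threshold
lines(eta, logistic(eta, gamma = 3), lty = 3)      # steeper gain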

multi-logit regression for multinomial data

 

  • each datum \(y_i \in \set{1, \dots, k}\), unordered
  • one linear predictor \(\eta_j\) for each category \(j \in \set{1, \dots, k}\)

 

\[ \begin{align*} \eta_j & = X \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
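the proportionality in the second line is a softmax normalization; a minimal R sketch with made-up numbers:

# softmax: turn linear predictors eta_j into category probabilities mu_j
softmax = function(eta) {
  exp_eta = exp(eta - max(eta))   # subtract max for numerical stability
  exp_eta / sum(exp_eta)
}

eta = c(0.5, -1.2, 2.0)   # made-up linear predictors for k = 3 categories
softmax(eta)              # probabilities proportional to exp(eta_j), summing to 1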

ordinal (probit) regression

  • each datum \(y_i \in \set{1, \dots, k}\), ordered
  • \(k+1\) latent threshold parameters: \(-\infty = \theta_0 < \theta_1 < \dots < \theta_k = \infty\)

\[ \begin{align*} \eta & = X \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
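a minimal R sketch of how the category probabilities arise from the thresholds (\(\Phi\) is pnorm; the threshold values are made up):

# ordinal probit: category probabilities from latent thresholds (made-up values)
theta = c(-Inf, -1, 0.5, 2, Inf)   # thresholds theta_0 .. theta_k for k = 4 categories
eta = 0.3                          # linear predictor for one observation
mu = pnorm(theta[-1] - eta) - pnorm(theta[-length(theta)] - eta)
mu                                 # one probability per category; sums to 1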

threshold-Phi model

threshPhi

tools for Bayesian regression

package BayesFactor

pro: very fast, nice comparison of nested models, good default priors on coefficients, mixed effects

con: only metric predicted variables in regression, not (easily) expandable

 

package brms (Bayesian Regression Models using Stan)

pro: efficient HMC (Stan), supports full GLM family with mixed effects, Bayes factor computation for nested models, Stan code inspection

con: slow pre-sampling phase

 

package rstanarm (Applied Regression Modelling)

pro: efficient HMC (Stan), Stan code inspection, many sources & strong development team

con: slow pre-sampling phase, not all link functions supported with mixed effects yet (?)

mixed effects

background on example data

 

  • in most languages, subject relative clauses are easier than object relative clauses

  • but Chinese seems to be an exception

 

 

subject relative clause

The senator who interrogated the journalist …

object relative clause

The senator who the journalist interrogated …

data: self-paced reading times

37 subjects read 15 sentences either with an SRC or an ORC in a self-paced reading task

 

library(tidyverse)  # provides %>%, the dplyr verbs, and ggplot2 used below
rt_data = readr::read_delim('../data/08_gibsonwu2012data.txt', delim = " ") %>% 
  filter(region == "headnoun") %>% 
  mutate(so = ifelse(type == "subj-ext", "-1", "1")) %>% 
  select(subj, item, so, rt)
head(rt_data)
## # A tibble: 6 x 4
##    subj  item    so    rt
##   <int> <int> <chr> <int>
## 1     1    13     1  1561
## 2     1     6    -1   959
## 3     1     5     1   582
## 4     1     9     1   294
## 5     1    14    -1   438
## 6     1     4    -1   286

\(\Leftarrow\) contrast coding of categorical predictor so

data from Gibson & Wu (2013)

inspect data

 

rt_data %>% group_by(so) %>% 
  summarize(mean_log_rt = rt%>%log%>%mean)
## # A tibble: 2 x 2
##      so mean_log_rt
##   <chr>       <dbl>
## 1    -1    6.099562
## 2     1    6.022660
rt_data %>% 
  ggplot(aes(x = so, y = log(rt))) + 
  geom_violin() + 
  geom_point(position = "jitter", 
             color = "skyblue")

fixed effects model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

 

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \beta_1 \mathtt{so}_i \\ \sigma_{err} & \sim \mathcal{U}(0, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]  

FE = brms::brm(log(rt) ~ so, data = rt_data) # flat (improper) priors on coefficients by default
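if informative priors are preferred over the flat defaults on the coefficients, they could be set explicitly; a sketch (the particular prior choices are illustrative, not recommendations):

FE_prior = brms::brm(log(rt) ~ so, data = rt_data,
                     prior = c(brms::set_prior("normal(0, 10)", class = "Intercept"),
                               brms::set_prior("normal(0, 10)", class = "b")))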

fixed effects model: results

 

summary(FE)
##  Family: gaussian(identity) 
## Formula: log(rt) ~ so 
##    Data: rt_data (Number of observations: 547) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; 
##          total post-warmup samples = 4000
##     ICs: LOO = Not computed; WAIC = Not computed
##  
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     6.10      0.04     6.03     6.17       3611    1
## so1          -0.08      0.05    -0.18     0.02       4000    1
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma      0.6      0.02     0.56     0.64       4000    1
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

fixed effects model: results

brms::stanplot(FE, type = "dens_overlay")

underlying Stan code

brms::stancode(FE)
## // generated with brms 1.7.0
## functions { 
## } 
## data { 
##   int<lower=1> N;  // total number of observations 
##   vector[N] Y;  // response variable 
##   int<lower=1> K;  // number of population-level effects 
##   matrix[N, K] X;  // population-level design matrix 
##   int prior_only;  // should the likelihood be ignored? 
## } 
## transformed data { 
##   int Kc; 
##   matrix[N, K - 1] Xc;  // centered version of X 
##   vector[K - 1] means_X;  // column means of X before centering 
##   Kc = K - 1;  // the intercept is removed from the design matrix 
##   for (i in 2:K) { 
##     means_X[i - 1] = mean(X[, i]); 
##     Xc[, i - 1] = X[, i] - means_X[i - 1]; 
##   } 
## } 
## parameters { 
##   vector[Kc] b;  // population-level effects 
##   real temp_Intercept;  // temporary intercept 
##   real<lower=0> sigma;  // residual SD 
## } 
## transformed parameters { 
## } 
## model { 
##   vector[N] mu; 
##   mu = Xc * b + temp_Intercept; 
##   // prior specifications 
##   sigma ~ student_t(3, 0, 10); 
##   // likelihood contribution 
##   if (!prior_only) { 
##     Y ~ normal(mu, sigma); 
##   } 
## } 
## generated quantities { 
##   real b_Intercept;  // population-level intercept 
##   b_Intercept = temp_Intercept - dot_product(means_X, b); 
## }

varying intercepts model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be "slower" or "faster" throughout

 

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + \beta_1 \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ \sigma_{err}, \sigma_{u_0}, \sigma_{w_0} & \sim \mathcal{U}(0, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]  

VarInt = brms::brm(log(rt) ~ (1 | subj + item) + so, data = rt_data) # intercepts conditional on `subj` & `item`

varying intercepts model: results

 

brms::stanplot(VarInt, type = "dens_overlay")

varying intercepts & slopes model

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be "slower" or "faster" throughout

  • assume that different subjects and items react more or less to so manipulation

 

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + (\beta_1 + \underbrace{u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}}_{\text{varying slopes}} ) \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ u_{1,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_1}) & w_{1,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_1}) \\ \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]  

VarIntSlo = brms::brm(log(rt) ~ so + (1 + so || subj + item), data = rt_data)
# intercept and slope for `so` varies with values for `subj` and `item`
# double || means: no correlation between varying intercepts and slopes

varying intercepts & slopes model: results

 

brms::stanplot(VarIntSlo, type = "dens_overlay")

varying intercepts & slopes model with correlation

  • predict log-reading times as affected by treatment so

  • assume improper priors for parameters

  • assume that different subjects and items could be "slower" or "faster" throughout

  • assume that different subjects and items react more or less to so manipulation

  • assume that random intercepts and slopes might be correlated

\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i} + \left (\beta_1 + u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i} \right ) \mathtt{so}_i \\ \begin{pmatrix}u_{0,\mathtt{subj}_i} \\ u_{1,\mathtt{subj}_i} \end{pmatrix} & \sim \mathcal{N} \left (\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{u} \right ) & \Sigma_{u} & = \begin{pmatrix} \sigma_{u_{0}}^2 & \rho_u\sigma_{u_{0}}\sigma_{u_{1}} \\ \rho_u\sigma_{u_{0}}\sigma_{u_{1}} & \sigma_{u_{1}}^2 \end{pmatrix} \ \ \ \text{same for } \mathtt{item} \\ \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \ \ \ \ \ \ \ \ \rho_u, \rho_w \sim \mathcal{U}(-1,1) \end{align*} \]

MaxME = brms::brm(log(rt) ~ so + (1 + so | subj + item), data = rt_data)
# intercept and slope for `so` varies with values for `subj` and `item`
# single | means: correlation between varying intercepts and slopes is estimated

varying intercepts & slopes model with correlation: results

 

brms::stanplot(MaxME, type = "dens_overlay")

leave-one-out cross-validation

information criteria

deviance score

\[ D(y^{(\text{rep})}, \theta) = -2 \log P(y^{(\text{rep})} \mid \theta) \ \ \ \ \ \ \ \ \ \ \ \ \text{(lower is better)} \]
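a concrete R sketch of the deviance of a normal likelihood (data and parameter values are made up):

# deviance D = -2 * log P(y | theta) for a normal likelihood (made-up values)
y = c(5.1, 4.8, 6.2)    # observed data
mu = 5; sigma = 1       # candidate parameter values theta
D = -2 * sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
D                       # lower is better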

un-Bayesian IC

\[ \begin{align*} \text{AIC} & = D(y^{(\text{rep})}, \theta^*) + 2 p_{\text{AIC}} & \text{with } \theta^* \in \arg\max_\theta P(y \mid \theta) \text{ and } p_{\text{AIC}} = \text{no. free parameters} \end{align*} \]

pseudo-Bayesian IC

\[ \begin{align*} \text{DIC} & = D(y^{(\text{rep})}, \bar{\theta}) + 2 p_{\text{DIC}} && \text{with } \bar{\theta} = E_{P(\theta \mid y)}\theta \\ &&& \text{ and } p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]

rather Bayesian IC

\[ \begin{align*} \text{WAIC} & = E_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) + 2 p_{\text{WAIC}} && \text{with } p_{\text{WAIC}} = \sum_{i=1}^n \text{Var}_{P(\theta \mid y)} \log P(y_i \mid \theta) \end{align*} \]

fix a model with \(P(Y \mid \theta)\) & \(P(\theta)\); \(y\) is the data set for conditioning; \(y^{(\text{rep})}\) is new data or \(y^{(\text{rep})} = y\)

leave-one-out cross-validation

log point-wise predictive density (LPD)

how (log-)likely is each (new) datum \(y^{(\text{rep})}_i\) under the posterior predictive distribution given \(y\)?

\[ \begin{align*} \text{LPD} & = \sum_{i=1}^n \log P(y^{(\text{rep})}_i \mid y) \ \ \ \ = \sum_{i=1}^n \log \int P(y^{(\text{rep})}_i \mid \theta) \ P(\theta \mid y) \ \text{d}\theta \\ & \approx \sum_{i=1}^n \log \left ( \frac{1}{S} \sum_{s = 1}^S P(y^{(\text{rep})}_i \mid \theta^s) \right ) \ \ \ \ \ \ \theta^s \sim P(\theta \mid y) \ \ \ \text{(from MCMC sampling)} \end{align*} \]
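the Monte Carlo approximation in the last line can be computed from an \(S \times n\) matrix of point-wise log-likelihoods; a sketch, assuming such a matrix is available (e.g. from brms::log_lik applied to a fitted model, with \(y^{(\text{rep})} = y\)):

# sketch: estimate LPD from a matrix of point-wise log-likelihoods
# (rows = posterior samples theta^s, columns = data points y_i)
log_lik_matrix = brms::log_lik(FE)   # assumption: y_rep = y, model FE from above
lpd_i = apply(log_lik_matrix, 2, function(ll) {
  m = max(ll)
  m + log(mean(exp(ll - m)))   # log( (1/S) * sum_s exp(ll_s) ), computed stably
})
LPD = sum(lpd_i)
LPD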

 

leave-one-out cross-validation

how (log-)likely is each old datum \(y_i\) under the posterior predictive distribution given \(y_{-i}\)?

\[ \text{LOO} = \sum_{i=1}^n \log P(y_i \mid y_{-i}) \ \ \ \ = \sum_{i=1}^n \log \int P(y_i \mid \theta) \ P(\theta \mid y_{-i}) \ \text{d}\theta \]

package loo

efficiently estimates LOO score (with estimate of standard error) from raw MCMC output

loo_scores = loo(FE, VarInt, VarIntSlo, MaxME)
loo_scores
##                     LOOIC    SE
## FE                 995.55 49.70
## VarInt             881.73 53.20
## VarIntSlo          883.68 53.29
## MaxME              882.33 52.83
## FE - VarInt        113.82 21.55
## FE - VarIntSlo     111.87 21.53
## FE - MaxME         113.22 21.87
## VarInt - VarIntSlo  -1.95  1.71
## VarInt - MaxME      -0.60  3.98
## VarIntSlo - MaxME    1.35  3.45

wrap-up

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

glm_scheme

Stan course

 

Introduction to Bayesian Modeling using Stan

taught by Shravan Vasishth and Bruno Nicenboim

on September 17 2017 in Tübingen (part of Methoden & Evaluation)

more information here