\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]
terminology
blueprint of a GLM
\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]
Bayes: likelihood + prior
inspect the posterior distribution over \(\beta_0\), \(\beta_1\), and \(\sigma_{\text{err}}\) given the data \(y\) and the model:
\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0, \beta_1 & \sim \mathcal{N}(0, \sigma_{\beta}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]
model{
  sigma_e = 1/sqrt(tau_e)
  tau_e ~ dgamma(0.1, 0.1)
  b0 ~ dnorm(0, 1/10000000)
  b1 ~ dnorm(0, 1/10000000)
  for (i in 1:k){
    yPred[i] = b0 + b1 * x[i]
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}
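To draw samples from this JAGS model, the model block is passed to an MCMC backend. A minimal sketch using the rjags package, assuming the model block above has been saved to a file model.txt and that vectors x and y already live in the workspace:

library(rjags)

dataList = list(x = x, y = y, k = length(y))  # hypothetical predictor & outcome vectors
jagsModel = jags.model("model.txt", data = dataList, n.chains = 2)
update(jagsModel, n.iter = 1000)              # burn-in
codaSamples = coda.samples(jagsModel,
                           variable.names = c("b0", "b1", "sigma_e"),
                           n.iter = 5000)
summary(codaSamples)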
type of \(y\) | (inverse) link function | likelihood function |
---|---|---|
metric | \(\mu = \eta\) | \(y \sim \text{Normal}(\mu, \sigma)\) |
binary | \(\mu = \text{logistic}(\eta, \theta, \gamma) = (1 + \exp(-\gamma (\eta - \theta)))^{-1}\) | \(y \sim \text{Binomial}(\mu)\) |
nominal | \(\mu_k = \text{soft-max}(\eta_k, \lambda) \propto \exp(\lambda \eta_k)\) | \(y \sim \text{Multinomial}({\mu})\) |
ordinal | \(\mu_k = \text{threshold-Phi}(\eta_k, \sigma, {\delta})\) | \(y \sim \text{Multinomial}({\mu})\) |
count | \(\mu = \exp(\eta)\) | \(y \sim \text{Poisson}(\mu)\) |
\[ \begin{align*} \eta & = X \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]
\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]
threshold \(\theta\)
gain \(\gamma\)
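To make the roles of threshold and gain concrete, here is a minimal R sketch of this logistic function (function name and example values are illustrative only):

# logistic link with threshold theta and gain gamma
logistic = function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}
logistic(0)                               # 0.5 exactly at the threshold
curve(logistic(x, theta = 1, gamma = 5),  # higher gain -> steeper transition
      from = -2, to = 4)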
\[ \begin{align*} \eta_j & = X \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
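A minimal R sketch of the soft-max link (names and values are illustrative):

# soft-max: maps a vector of linear predictors to category probabilities
softmax = function(eta, lambda = 1) {
  exp(lambda * eta) / sum(exp(lambda * eta))
}
softmax(c(1, 2, 3))               # category probabilities, sum to 1
softmax(c(1, 2, 3), lambda = 10)  # large lambda -> near winner-take-all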
\[ \begin{align*} \eta & = X \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
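A minimal R sketch of the threshold-Phi link for, say, four ordinal categories (the threshold values are made up; the outer thresholds are fixed at \(\pm\infty\) so that the probabilities sum to one):

theta = c(-Inf, -1, 0, 1.5, Inf)  # ordered thresholds for 4 categories
eta   = 0.3
mu    = pnorm(theta[-1] - eta) - pnorm(theta[-length(theta)] - eta)
mu        # one probability per category
sum(mu)   # 1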
package BayesFactor
pro: very fast, nice comparison of nested models, good default priors on coefficients, mixed effects
con: only metric predicted variables in regression, not (easily) expandable
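A sketch of typical BayesFactor usage, with a hypothetical data frame d containing a metric outcome logy, a factor A, and a subject identifier subj (BayesFactor formulas expect plain column names):

library(BayesFactor)
bfFull = lmBF(logy ~ A + subj, data = d, whichRandom = "subj")
bfNull = lmBF(logy ~ subj,     data = d, whichRandom = "subj")
bfFull / bfNull  # Bayes factor for the effect of A over the null model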
package brms
(Bayesian Regression Models using Stan)
pro: efficient HMC (Stan), supports full GLM family with mixed effects, Bayes factor computation for nested models, Stan code inspection
con: slow pre-sampling phase
package rstanarm
(Applied Regression Modelling)
pro: efficient HMC (Stan), Stan code inspection, many sources & strong development team
con: slow pre-sampling phase, not all link functions supported with mixed effects yet (?)
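For comparison, a sketch of how a varying-intercepts model like the one in the case study below would look in rstanarm (assuming a data frame like rt_data):

library(rstanarm)
fit = stan_lmer(log(rt) ~ so + (1 | subj) + (1 | item), data = rt_data)
summary(fit)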
in most languages, subject relative clauses are easier than object relative clauses, but Chinese seems to be an exception
subject relative clause
The senator who interrogated the journalist …
object relative clause
The senator who the journalist interrogated …
37 subjects read 15 sentences either with an SRC or an ORC in a self-paced reading task
rt_data = readr::read_delim('../data/08_gibsonwu2012data.txt', delim = " ") %>%
  filter(region == "headnoun") %>%
  mutate(so = ifelse(type == "subj-ext", "-1", "1")) %>%
  select(subj, item, so, rt)
head(rt_data)
## # A tibble: 6 x 4
##    subj  item so       rt
##   <int> <int> <chr> <int>
## 1     1    13 1      1561
## 2     1     6 -1      959
## 3     1     5 1       582
## 4     1     9 1       294
## 5     1    14 -1      438
## 6     1     4 -1      286
\(\Leftarrow\) contrast coding of categorical predictor so
data from Gibson & Wu (2013)
rt_data %>%
  group_by(so) %>%
  summarize(mean_log_rt = rt %>% log %>% mean)
## # A tibble: 2 x 2
##   so    mean_log_rt
##   <chr>       <dbl>
## 1 -1       6.099562
## 2 1        6.022660
rt_data %>% ggplot(aes(x = so, y = log(rt))) +
  geom_violin() +
  geom_point(position = "jitter", color = "skyblue")
predict log-reading times as affected by treatment so
assume improper priors for parameters
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \beta_1 \mathtt{so}_i \\ \sigma_{err} & \sim \mathcal{U}(0, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]
FE = brms::brm(log(rt) ~ so, data = rt_data)  # assumes improper priors by default
summary(FE)
## Family: gaussian(identity)
## Formula: log(rt) ~ so
##    Data: rt_data (Number of observations: 547)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
##     ICs: LOO = Not computed; WAIC = Not computed
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     6.10      0.04     6.03     6.17       3611    1
## so1          -0.08      0.05    -0.18     0.02       4000    1
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma      0.6      0.02     0.56     0.64       4000    1
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
brms::stanplot(FE, type = "dens_overlay")
brms::stancode(FE)
## // generated with brms 1.7.0
## functions {
## }
## data {
##   int<lower=1> N; // total number of observations
##   vector[N] Y; // response variable
##   int<lower=1> K; // number of population-level effects
##   matrix[N, K] X; // population-level design matrix
##   int prior_only; // should the likelihood be ignored?
## }
## transformed data {
##   int Kc;
##   matrix[N, K - 1] Xc; // centered version of X
##   vector[K - 1] means_X; // column means of X before centering
##   Kc = K - 1; // the intercept is removed from the design matrix
##   for (i in 2:K) {
##     means_X[i - 1] = mean(X[, i]);
##     Xc[, i - 1] = X[, i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b; // population-level effects
##   real temp_Intercept; // temporary intercept
##   real<lower=0> sigma; // residual SD
## }
## transformed parameters {
## }
## model {
##   vector[N] mu;
##   mu = Xc * b + temp_Intercept;
##   // prior specifications
##   sigma ~ student_t(3, 0, 10);
##   // likelihood contribution
##   if (!prior_only) {
##     Y ~ normal(mu, sigma);
##   }
## }
## generated quantities {
##   real b_Intercept; // population-level intercept
##   b_Intercept = temp_Intercept - dot_product(means_X, b);
## }
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + \beta_1 \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ \sigma_{err}, \sigma_{u_0}, \sigma_{w_0} & \sim \mathcal{U}(0, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]
VarInt = brms::brm(log(rt) ~ so + (1 | subj + item), data = rt_data)
# varying intercepts conditional on `subj` and `item`
brms::stanplot(VarInt, type = "dens_overlay")
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
assume that different subjects and items react more or less to the so manipulation
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + (\beta_1 + \underbrace{u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}}_{\text{varying slopes}} ) \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ u_{1,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_1}) & w_{1,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_1}) \\ \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]
VarIntSlo = brms::brm(log(rt) ~ so + (1 + so || subj + item), data = rt_data)
# intercept and slope for `so` vary with `subj` and `item`
# double || means: no correlation between varying intercepts and slopes
brms::stanplot(VarIntSlo, type = "dens_overlay")
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
assume that different subjects and items react more or less to the so manipulation
assume that random intercepts and slopes might be correlated
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i} + \left (\beta_1 + u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i} \right ) \mathtt{so}_i \\ \begin{pmatrix}u_{0,\mathtt{subj}_i} \\ u_{1,\mathtt{subj}_i} \end{pmatrix} & \sim \mathcal{N} \left (\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{u} \right ) & \Sigma_{u} & = \begin{pmatrix} \sigma_{u_{0}}^2 & \rho_u \sigma_{u_{0}}\sigma_{u_{1}} \\ \rho_u \sigma_{u_{0}}\sigma_{u_{1}} & \sigma_{u_{1}}^2 \end{pmatrix} \ \ \ \text{same for } \mathtt{item} \\ \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \ \ \ \ \ \ \ \ \rho_u, \rho_w \sim \mathcal{U}(-1,1) \end{align*} \]
MaxME = brms::brm(log(rt) ~ so + (1 + so | subj + item), data = rt_data)
# intercept and slope for `so` vary with `subj` and `item`
# single | means: varying intercepts and slopes may be correlated
brms::stanplot(MaxME, type = "dens_overlay")
deviance score
\[ D(y^{(\text{rep})}, \theta) = -2 \log P(y^{(\text{rep})} \mid \theta) \ \ \ \ \ \ \ \ \ \ \ \ \text{(lower is better)} \]
un-Bayesian IC
\[ \begin{align*} \text{AIC} & = D(y^{(\text{rep})}, \theta^*) + 2 p_{\text{AIC}} & \text{with } \theta^* \in \arg\max_\theta P(y \mid \theta) \text{ and } p_{\text{AIC}} = \text{no. free parameters} \end{align*} \]
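To see the definition at work: a sketch computing AIC by hand for a non-Bayesian fit of the fixed-effects model from the case study above (note that the residual standard deviation counts as a free parameter):

fit = lm(log(rt) ~ so, data = rt_data)  # maximum-likelihood fit
D = -2 * as.numeric(logLik(fit))        # deviance at the best-fitting theta*
D + 2 * (length(coef(fit)) + 1)         # p_AIC = 2 coefficients + sigma
AIC(fit)                                # built-in check: same value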
pseudo-Bayesian IC
\[ \begin{align*} \text{DIC} & = D(y^{(\text{rep})}, \bar{\theta}) + 2 p_{\text{DIC}} && \text{with } \bar{\theta} = E_{P(\theta \mid y)}\theta \\ &&& \text{ and } p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]
rather Bayesian IC
\[ \begin{align*} \text{WAIC} & = E_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) + 2 p_{\text{WAIC}} && \text{with } p_{\text{WAIC}} = p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]
fix a model with \(P(Y \mid \theta)\) & \(P(\theta)\); \(y\) is the data set for conditioning; \(y^{(\text{rep})}\) is new data or \(y^{(\text{rep})} = y\)
log point-wise density
how (log-)likely is each (new) datum \(y^{(\text{rep})}_i\) under the posterior predictive distribution given \(y\)?
\[ \begin{align*} \text{LPD} & = \sum_{i=1}^n \log P(y^{(\text{rep})}_i \mid y) \ \ \ \ = \sum_{i=1}^n \log \int P(y^{(\text{rep})}_i \mid \theta) \ P(\theta \mid y) \ \text{d}\theta \\ & \approx \sum_{i=1}^n \log \left ( \frac{1}{S} \sum_{s = 1}^S P(y^{(\text{rep})}_i \mid \theta^s) \right ) \ \ \ \ \ \ \theta^s \sim P(\theta \mid y) \ \ \ \text{(from MCMC sampling)} \end{align*} \]
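A sketch of this Monte Carlo approximation for the fixed-effects model FE, assuming the posterior samples are extracted with brms::posterior_samples (the column names b_Intercept, b_so1, and sigma follow the brms naming for this model, but should be checked against the actual fit):

post = brms::posterior_samples(FE)  # one row per posterior sample
x = as.numeric(rt_data$so == "1")   # dummy coding, matching the brms design matrix
y = log(rt_data$rt)
lpd = sum(sapply(seq_along(y), function(i) {
  # average the likelihood of y_i over samples, then take the log
  log(mean(dnorm(y[i], mean = post$b_Intercept + post$b_so1 * x[i],
                 sd = post$sigma)))
}))
lpd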
leave-one-out cross-validation
how (log-)likely is each old datum \(y_i\) under the posterior predictive distribution given \(y_{-i}\)?
\[ \text{LOO} = \sum_{i=1}^n \log P(y_i \mid y_{-i}) \ \ \ \ = \sum_{i=1}^n \log \int P(y_i \mid \theta) \ P(\theta \mid y_{-i}) \ \text{d}\theta \]
loo
efficiently estimates LOO score (with estimate of standard error) from raw MCMC output
loo_scores = loo(FE, VarInt, VarIntSlo, MaxME)
loo_scores
##                      LOOIC    SE
## FE                  995.55 49.70
## VarInt              881.73 53.20
## VarIntSlo           883.68 53.29
## MaxME               882.33 52.83
## FE - VarInt         113.82 21.55
## FE - VarIntSlo      111.87 21.53
## FE - MaxME          113.22 21.87
## VarInt - VarIntSlo   -1.95  1.71
## VarInt - MaxME       -0.60  3.98
## VarIntSlo - MaxME     1.35  3.45
Introduction to Bayesian Modeling using Stan
taught by Shravan Vasishth and Bruno Nicenboim
on September 17, 2017 in Tübingen (part of Methoden & Evaluation)
more information here