\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]
terminology
blueprint of a GLM
\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]
Bayes: likelihood + prior
inspect the posterior distribution over \(\beta_0\), \(\beta_1\), and \(\sigma_{\text{err}}\) given the data \(y\) and the model:
\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0, \beta_1 & \sim \mathcal{N}(0, \sigma_{\beta}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]
model{
  sigma_e = 1/sqrt(tau_e)
  tau_e ~ dgamma(0.1, 0.1)
  b0 ~ dnorm(0, 1/10000000)
  b1 ~ dnorm(0, 1/10000000)
  for (i in 1:k){
    yPred[i] = b0 + b1 * x[i]
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}
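To draw samples from this JAGS model, the model block is passed to an MCMC backend. A minimal sketch using the rjags package, assuming the model block above has been saved to a file model.txt and that vectors x and y already live in the workspace:

library(rjags)

dataList = list(x = x, y = y, k = length(y))  # hypothetical predictor & outcome vectors
jagsModel = jags.model("model.txt", data = dataList, n.chains = 2)
update(jagsModel, n.iter = 1000)              # burn-in
codaSamples = coda.samples(jagsModel,
                           variable.names = c("b0", "b1", "sigma_e"),
                           n.iter = 5000)
summary(codaSamples)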
type of \(y\) | (inverse) link function | likelihood function |
---|---|---|
metric | \(\mu = \eta\) | \(y \sim \text{Normal}(\mu, \sigma)\) |
binary | \(\mu = \text{logistic}(\eta, \theta, \gamma) = (1 + \exp(-\gamma (\eta - \theta)))^{-1}\) | \(y \sim \text{Binomial}(\mu)\) |
nominal | \(\mu_k = \text{soft-max}(\eta_k, \lambda) \propto \exp(\lambda \eta_k)\) | \(y \sim \text{Multinomial}({\mu})\) |
ordinal | \(\mu_k = \text{threshold-Phi}(\eta_k, \sigma, {\delta})\) | \(y \sim \text{Multinomial}({\mu})\) |
count | \(\mu = \exp(\eta)\) | \(y \sim \text{Poisson}(\mu)\) |
\[ \begin{align*} \eta & = X \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]
\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]
threshold \(\theta\)
gain \(\gamma\)
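To make the roles of threshold and gain concrete, here is a minimal R sketch of this logistic function (function name and example values are illustrative only):

# logistic link with threshold theta and gain gamma
logistic = function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}
logistic(0)                               # 0.5 exactly at the threshold
curve(logistic(x, theta = 1, gamma = 5),  # higher gain -> steeper transition
      from = -2, to = 4)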
\[ \begin{align*} \eta_j & = X \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
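A minimal R sketch of the soft-max link (names and values are illustrative):

# soft-max: maps a vector of linear predictors to category probabilities
softmax = function(eta, lambda = 1) {
  exp(lambda * eta) / sum(exp(lambda * eta))
}
softmax(c(1, 2, 3))               # category probabilities, sum to 1
softmax(c(1, 2, 3), lambda = 10)  # large lambda -> near winner-take-all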
\[ \begin{align*} \eta & = X \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
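A minimal R sketch of the threshold-Phi link for, say, four ordinal categories (the threshold values are made up; the outer thresholds are fixed at \(\pm\infty\) so that the probabilities sum to one):

theta = c(-Inf, -1, 0, 1.5, Inf)  # ordered thresholds for 4 categories
eta   = 0.3
mu    = pnorm(theta[-1] - eta) - pnorm(theta[-length(theta)] - eta)
mu        # one probability per category
sum(mu)   # 1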
package BayesFactor
pro: very fast, nice comparison of nested models, good default priors on coefficients, mixed effects
con: only metric predicted variables in regression, not (easily) expandable
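A sketch of typical BayesFactor usage, with a hypothetical data frame d containing a metric outcome logy, a factor A, and a subject identifier subj (BayesFactor formulas expect plain column names):

library(BayesFactor)
bfFull = lmBF(logy ~ A + subj, data = d, whichRandom = "subj")
bfNull = lmBF(logy ~ subj,     data = d, whichRandom = "subj")
bfFull / bfNull  # Bayes factor for the effect of A over the null model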
package brms
(Bayesian Regression Models using Stan)
pro: efficient HMC (Stan), supports full GLM family with mixed effects, Bayes factor computation for nested models, Stan code inspection
con: slow pre-sampling phase
package rstanarm
(Applied Regression Modelling)
pro: efficient HMC (Stan), Stan code inspection, many sources & strong development team
con: slow pre-sampling phase, not all link functions supported with mixed effects yet (?)
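For comparison, a sketch of how a varying-intercepts model like the one in the case study below would look in rstanarm (assuming a data frame like rt_data):

library(rstanarm)
fit = stan_lmer(log(rt) ~ so + (1 | subj) + (1 | item), data = rt_data)
summary(fit)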
in most languages, subject relative clauses are easier than object relative clauses, but Chinese seems to be an exception
subject relative clause
The senator who interrogated the journalist …
object relative clause
The senator who the journalist interrogated …
37 subjects read 15 sentences either with an SRC or an ORC in a self-paced reading task
rt_data = readr::read_delim('../data/08_gibsonwu2012data.txt', delim = " ") %>%
  filter(region == "headnoun") %>%
  mutate(so = ifelse(type == "subj-ext", "-1", "1")) %>%
  select(subj, item, so, rt)
head(rt_data)
## # A tibble: 6 x 4
##    subj  item so       rt
##   <int> <int> <chr> <int>
## 1     1    13 1      1561
## 2     1     6 -1      959
## 3     1     5 1       582
## 4     1     9 1       294
## 5     1    14 -1      438
## 6     1     4 -1      286
\(\Leftarrow\) contrast coding of categorical predictor so
data from Gibson & Wu (2013)
rt_data %>%
  group_by(so) %>%
  summarize(mean_log_rt = rt %>% log %>% mean)
## # A tibble: 2 x 2
##   so    mean_log_rt
##   <chr>       <dbl>
## 1 -1       6.099562
## 2 1        6.022660
rt_data %>% ggplot(aes(x = so, y = log(rt))) +
  geom_violin() +
  geom_point(position = "jitter", color = "skyblue")
predict log-reading times as affected by treatment so
assume improper priors for parameters
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \beta_1 \mathtt{so}_i \\ \sigma_{err} & \sim \mathcal{U}(0, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]
FE = brms::brm(log(rt) ~ so, data = rt_data)  # assumes improper priors by default
summary(FE)
## Family: gaussian(identity)
## Formula: log(rt) ~ so
##    Data: rt_data (Number of observations: 547)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
##     ICs: LOO = Not computed; WAIC = Not computed
##
## Population-Level Effects:
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     6.10      0.04     6.03     6.17       3611    1
## so1          -0.08      0.05    -0.18     0.02       4000    1
##
## Family Specific Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma      0.6      0.02     0.56     0.64       4000    1
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
brms::stanplot(FE, type = "dens_overlay")
brms::stancode(FE)
## // generated with brms 1.7.0
## functions {
## }
## data {
##   int<lower=1> N; // total number of observations
##   vector[N] Y; // response variable
##   int<lower=1> K; // number of population-level effects
##   matrix[N, K] X; // population-level design matrix
##   int prior_only; // should the likelihood be ignored?
## }
## transformed data {
##   int Kc;
##   matrix[N, K - 1] Xc; // centered version of X
##   vector[K - 1] means_X; // column means of X before centering
##   Kc = K - 1; // the intercept is removed from the design matrix
##   for (i in 2:K) {
##     means_X[i - 1] = mean(X[, i]);
##     Xc[, i - 1] = X[, i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b; // population-level effects
##   real temp_Intercept; // temporary intercept
##   real<lower=0> sigma; // residual SD
## }
## transformed parameters {
## }
## model {
##   vector[N] mu;
##   mu = Xc * b + temp_Intercept;
##   // prior specifications
##   sigma ~ student_t(3, 0, 10);
##   // likelihood contribution
##   if (!prior_only) {
##     Y ~ normal(mu, sigma);
##   }
## }
## generated quantities {
##   real b_Intercept; // population-level intercept
##   b_Intercept = temp_Intercept - dot_product(means_X, b);
## }
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + \beta_1 \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ \sigma_{err}, \sigma_{u_0}, \sigma_{w_0} & \sim \mathcal{U}(0, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]
VarInt = brms::brm(log(rt) ~ so + (1 | subj + item), data = rt_data)
# varying intercepts conditional on `subj` and `item`
brms::stanplot(VarInt, type = "dens_overlay")
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
assume that different subjects and items react more or less to the so manipulation
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + \underbrace{u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i}}_{\text{varying intercepts}} + (\beta_1 + \underbrace{u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i}}_{\text{varying slopes}} ) \mathtt{so}_i \\ u_{0,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_0}) & w_{0,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_0}) \\ u_{1,\mathtt{subj}_i} & \sim \mathcal{N}(0, \sigma_{u_1}) & w_{1,\mathtt{item}_i} & \sim \mathcal{N}(0, \sigma_{w_1}) \\ \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \end{align*} \]
VarIntSlo = brms::brm(log(rt) ~ so + (1 + so || subj + item), data = rt_data)
# intercept and slope for `so` vary with `subj` and `item`
# double || means: no correlation between varying intercepts and slopes
brms::stanplot(VarIntSlo, type = "dens_overlay")
predict log-reading times as affected by treatment so
assume improper priors for parameters
assume that different subjects and items could be "slower" or "faster" throughout
assume that different subjects and items react more or less to the so manipulation
assume that random intercepts and slopes might be correlated
\[ \begin{align*} \log(\mathtt{rt}_i) & \sim \mathcal{N}(\eta_i, \sigma_{err}) & \eta_{i} & = \beta_0 + u_{0,\mathtt{subj}_i} + w_{0,\mathtt{item}_i} + \left (\beta_1 + u_{1,\mathtt{subj}_i} + w_{1,\mathtt{item}_i} \right ) \mathtt{so}_i \\ \begin{pmatrix}u_{0,\mathtt{subj}_i} \\ u_{1,\mathtt{subj}_i} \end{pmatrix} & \sim \mathcal{N} \left (\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Sigma_{u} \right ) & \Sigma_{u} & = \begin{pmatrix} \sigma_{u_{0}}^2 & \rho_u \sigma_{u_{0}}\sigma_{u_{1}} \\ \rho_u \sigma_{u_{0}}\sigma_{u_{1}} & \sigma_{u_{1}}^2 \end{pmatrix} \ \ \ \text{same for } \mathtt{item} \\ \sigma_{err}, \sigma_{u_{0|1}}, \sigma_{w_{0|1}} & \sim \mathcal{U}(0, \infty) & \beta_0, \beta_1 & \sim \mathcal{U}(- \infty, \infty) \ \ \ \ \ \ \ \ \rho_u, \rho_w \sim \mathcal{U}(-1,1) \end{align*} \]
MaxME = brms::brm(log(rt) ~ so + (1 + so | subj + item), data = rt_data)
# intercept and slope for `so` vary with `subj` and `item`
# single | means: varying intercepts and slopes may be correlated
brms::stanplot(MaxME, type = "dens_overlay")
deviance score
\[ D(y^{(\text{rep})}, \theta) = -2 \log P(y^{(\text{rep})} \mid \theta) \ \ \ \ \ \ \ \ \ \ \ \ \text{(lower is better)} \]
un-Bayesian IC
\[ \begin{align*} \text{AIC} & = D(y^{(\text{rep})}, \theta^*) + 2 p_{\text{AIC}} & \text{with } \theta^* \in \arg\max_\theta P(y \mid \theta) \text{ and } p_{\text{AIC}} = \text{no. free parameters} \end{align*} \]
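To see the definition at work: a sketch computing AIC by hand for a non-Bayesian fit of the fixed-effects model from the case study above (note that the residual standard deviation counts as a free parameter):

fit = lm(log(rt) ~ so, data = rt_data)  # maximum-likelihood fit
D = -2 * as.numeric(logLik(fit))        # deviance at the best-fitting theta*
D + 2 * (length(coef(fit)) + 1)         # p_AIC = 2 coefficients + sigma
AIC(fit)                                # built-in check: same value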
pseudo-Bayesian IC
\[ \begin{align*} \text{DIC} & = D(y^{(\text{rep})}, \bar{\theta}) + 2 p_{\text{DIC}} && \text{with } \bar{\theta} = E_{P(\theta \mid y)}\theta \\ &&& \text{ and } p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]
rather Bayesian IC
\[ \begin{align*} \text{WAIC} & = E_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) + 2 p_{\text{WAIC}} && \text{with } p_{\text{WAIC}} = p_{\text{DIC}} = \frac{1}{2} Var_{P(\theta \mid y)} D(y^{(\text{rep})}, \theta) \end{align*} \]
fix a model with \(P(Y \mid \theta)\) & \(P(\theta)\); \(y\) is the data set for conditioning; \(y^{(\text{rep})}\) is new data or \(y^{(\text{rep})} = y\)
log point-wise density
how (log-)likely is each (new) datum \(y^{(\text{rep})}_i\) under the posterior predictive distribution given \(y\)?
\[ \begin{align*} \text{LPD} & = \sum_{i=1}^n \log P(y^{(\text{rep})}_i \mid y) \ \ \ \ = \sum_{i=1}^n \log \int P(y^{(\text{rep})}_i \mid \theta) \ P(\theta \mid y) \ \text{d}\theta \\ & \approx \sum_{i=1}^n \log \left ( \frac{1}{S} \sum_{s = 1}^S P(y^{(\text{rep})}_i \mid \theta^s) \right ) \ \ \ \ \ \ \theta^s \sim P(\theta \mid y) \ \ \ \text{(from MCMC sampling)} \end{align*} \]
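A sketch of this Monte Carlo approximation for the fixed-effects model FE, assuming the posterior samples are extracted with brms::posterior_samples (the column names b_Intercept, b_so1, and sigma follow the brms naming for this model, but should be checked against the actual fit):

post = brms::posterior_samples(FE)  # one row per posterior sample
x = as.numeric(rt_data$so == "1")   # dummy coding, matching the brms design matrix
y = log(rt_data$rt)
lpd = sum(sapply(seq_along(y), function(i) {
  # average the likelihood of y_i over samples, then take the log
  log(mean(dnorm(y[i], mean = post$b_Intercept + post$b_so1 * x[i],
                 sd = post$sigma)))
}))
lpd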
leave-one-out cross-validation
how (log-)likely is each old datum \(y_i\) under the posterior predictive distribution given \(y_{-i}\)?
\[ \text{LOO} = \sum_{i=1}^n \log P(y_i \mid y_{-i}) \ \ \ \ = \sum_{i=1}^n \log \int P(y_i \mid \theta) \ P(\theta \mid y_{-i}) \ \text{d}\theta \]
loo
efficiently estimates LOO score (with estimate of standard error) from raw MCMC output
loo_scores = loo(FE, VarInt, VarIntSlo, MaxME)
loo_scores
##                      LOOIC    SE
## FE                  995.55 49.70
## VarInt              881.73 53.20
## VarIntSlo           883.68 53.29
## MaxME               882.33 52.83
## FE - VarInt         113.82 21.55
## FE - VarIntSlo      111.87 21.53
## FE - MaxME          113.22 21.87
## VarInt - VarIntSlo   -1.95  1.71
## VarInt - MaxME       -0.60  3.98
## VarIntSlo - MaxME     1.35  3.45
Introduction to Bayesian Modeling using Stan
taught by Shravan Vasishth and Bruno Nicenboim
on September 17, 2017 in Tübingen (part of Methoden & Evaluation)
more information here