\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

recap

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

(figure: GLM scheme)

common link & likelihood function

linear regression: a Bayesian approach

Bayes: likelihood + prior

inspect posterior distribution over \(\beta_0\), \(\beta_1\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_i & \sim \mathcal{U}(-\infty, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{err} & \sim \mathcal{U}(0, \infty) \end{align*} \]

data { 
  int<lower=1> N;  // total number of observations 
  vector[N] murder_rate;  // dependent variable 
  vector[N] low_income;  // independent variable
} 
parameters { 
  real Intercept;  real beta;  
  real<lower=0> sigma;  
} 
model { 
  // predictor value = expected value
  vector[N] mu = Intercept + low_income * beta;
  // likelihood 
  target += normal_lpdf(murder_rate | mu, sigma);
} 
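A minimal sketch of how this Stan program could be fit from R with rstan, assuming the model code is saved in a file (the name murder_model.stan is hypothetical) and murder_data is loaded:

library(rstan)

# compile the model and draw posterior samples;
# names in the data list must match the Stan data block
stan_fit = stan(
  file = "murder_model.stan",
  data = list(
    N           = nrow(murder_data),
    murder_rate = murder_data$murder_rate,
    low_income  = murder_data$low_income
  )
)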

overview


 

  • linear models with several metric predictors
    • multiplicative interactions
    • correlated predictors
  • robust regression
  • one categorical predictor
  • GLMs with other types of predicted variables
    • binary outcomes: logistic regression
    • nominal outcomes: multi-logit regression
    • ordinal outcomes: ordinal (logit/probit) regression
  • case study: logistic regression with multiple categorical predictors

LM with multiple predictors

linear model

data

  • \(y\): \(n \times 1\) vector of predicted variables (metric)

  • \(X\): \(n \times k\) matrix of predictor variables (metric)

parameters

  • \(\beta\): \(k \times 1\) vector of coefficients
  • \(\sigma_{err}\): standard deviation of Gaussian noise

model

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]

example

murder_data %>% head
## # A tibble: 6 x 4
##   murder_rate low_income unemployment population
##         <dbl>      <dbl>        <dbl>      <dbl>
## 1        11.2       16.5          6.2     587000
## 2        13.4       20.5          6.4     643000
## 3        40.7       26.3          9.3     635000
## 4         5.3       16.5          5.3     692000
## 5        24.8       19.2          7.3    1248000
## 6        12.7       16.5          5.9     643000
  • predicted variable \(y\): murder_data[,1]
  • predictor matrix \(X\): murder_data[,2:4]
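As a sketch, the design matrix \(X\) (including an intercept column of ones, cf. below) and the linear predictor \(\eta = X \beta\) can be computed explicitly; the coefficient vector here is hypothetical, for illustration only:

# design matrix: first column is all ones (intercept), remaining columns are the predictors
X = model.matrix(murder_rate ~ low_income + unemployment + population,
                 data = murder_data)
beta = c(-35, 1.2, 4.4, 0)   # hypothetical coefficient vector, for illustration only
eta  = X %*% beta            # linear predictor: one value per observation
head(eta)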

predictor value for two predictor variables

(figure: multiple metric predictors)

transformed predictor variables

 

intercept

\(X_{i,1} = 1\) for all \(i\)

 

multiplicative interaction

\(X_{i,l} = X_{i,j} \cdot X_{i,k}\) for all \(i\) with \(j \neq k\)

transformed variables

\(X_{i,k} = F(X_{i,j})\) (e.g., \(F(x) = \log(x)\) or \(F(x) = x^2\))  

 

crazy terms\(^*\)

\(X_{i,l} = \exp((X_{i,j} - X_{i,k})^2 + \sqrt{X_{i,m}})\)

\(^*\) at this point it hardly makes sense to speak of a "linear model" any more, even if it is formally still subsumable
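As a sketch, such terms can be specified directly in R's formula syntax (assuming murder_data as above): * expands into both main effects plus their multiplicative interaction, and transformations can be applied inline:

# interaction between two predictors plus a log-transformed predictor
glm(murder_rate ~ low_income * unemployment + log(population),
    data = murder_data) %>% 
  broom::tidy()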

multiplicative interaction

(figure: multiplicative interactions)

recall: murder data

GGally::ggpairs(murder_data, 
                title = "Murder rate data")

multiple predictors

point estimates

glm(murder_rate ~ low_income + population, 
    data = murder_data) %>% 
  broom::tidy()
## # A tibble: 3 x 5
##   term             estimate   std.error statistic    p.value
##   <chr>               <dbl>       <dbl>     <dbl>      <dbl>
## 1 (Intercept) -31.2         8.30           -3.76  0.00156   
## 2 low_income    2.60        0.403           6.44  0.00000615
## 3 population    0.000000420 0.000000767     0.547 0.591

Bayes

BF_fit_1 = BayesFactor::lmBF(
  murder_rate ~ low_income + population, 
  data = murder_data %>% as.data.frame)
post_samples_1 = BayesFactor::posterior(
  model = BF_fit_1, 
  iterations = 5000)
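The returned samples form an MCMC object (coda), so they can be summarized and plotted directly; a minimal sketch:

summary(post_samples_1)   # posterior means, SDs and quantiles per parameter
plot(post_samples_1)      # trace plots and marginal densities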

correlated predictors

point estimates

glm(murder_rate ~ low_income + unemployment, 
    data = murder_data) %>% 
  broom::tidy()
## # A tibble: 3 x 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    -34.1      6.73      -5.07 0.0000956
## 2 low_income       1.22     0.568      2.15 0.0459   
## 3 unemployment     4.40     1.53       2.88 0.0103

Bayes

BF_fit_2 = BayesFactor::lmBF(
  murder_rate ~ low_income + unemployment, 
  data = murder_data %>% as.data.frame)
post_samples_2 = BayesFactor::posterior(
  model = BF_fit_2, 
  iterations = 5000)

known model with (un)correlated predictors

uncorrelated

(figure: Kruschke 2015, Fig. 18.1)

correlated

(figure: Kruschke 2015, Fig. 18.1)

joint posterior distribution

 

murder_rate ~ low_income + population

murder_rate ~ low_income + unemployment

model comparison

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + unemployment + population,
  data = murder_data %>% as.data.frame)
plot(BF_group_fit)

model comparison: perfect correlation (Bayes)

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + inv_income,
  data = murder_data %>% 
    mutate(inv_income = -low_income) %>% 
    as.data.frame)
plot(BF_group_fit)

method comparison: point-estimates vs. Bayes

murder_data = murder_data %>% mutate(inv_income = -low_income)
glm(murder_rate ~ low_income + inv_income, data = murder_data) %>% broom::tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)   -29.9      7.79      -3.84 0.00120   
## 2 low_income      2.56     0.390      6.56 0.00000364
m = brm(murder_rate ~ low_income + inv_income, data = murder_data)
broom::tidy(m)
##           term  estimate   std.error        lower       upper
## 1  b_Intercept -30.10417    8.442471   -43.810329  -16.042556
## 2 b_low_income 222.92690 3737.368597 -6061.069129 5405.508617
## 3 b_inv_income 220.36852 3737.365239 -6063.784900 5403.503213
## 4        sigma   5.88133    1.367383     4.107823    8.354835
## 5         lp__ -67.65668    1.338055   -70.020305  -66.075633

posterior of perfectly (inversely) correlated predictors

n = posterior_samples(m) %>% as_tibble()
qplot(x = n$b_low_income, y = n$b_inv_income) +
  xlab("low income") + ylab("inverse low-income")

correlated predictors: summary

 

general problem

correlated predictors distort parameter inference in linear models
(same for point-estimation and Bayes)

 

point estimates

  • perfect correlation \(\Rightarrow\) no unique best fit \(\Rightarrow\) inapplicable
  • strong correlation \(\Rightarrow\) estimation algorithm may implode (silently)

Bayes

  • perfect correlation \(\Rightarrow\) no conceptual problem
  • problem clearly visible in diffuse marginal posteriors and correlated joint posterior densities

robust regression

noise model in the likelihood function

 

linear model (so far)

standard likelihood function: normal distribution

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]

\(\Rightarrow\) encodes an assumption about noise

 

robust regression

uses Student's \(t\) distribution instead

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \text{Student-}t(\mu = \eta_i, \sigma_{err}, \nu) \\ \end{align*} \]
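A minimal sketch of a robust regression with brms (assuming murder_data as above): only the likelihood family changes, and \(\nu\) is estimated along with the other parameters:

# robust regression: Student's t likelihood instead of a normal one
fit_robust = brm(
  murder_rate ~ low_income, 
  data   = murder_data, 
  family = student()
)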

Student's \(t\) distribution

 

robustness against outliers

 

(figure: Kruschke 2015, Fig. 16.1)

many robust noise models conceivable

 

systematic and independent outlier group

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim (1-\epsilon) \mathcal{N}(\mu = \eta_i, \sigma_{err}) + \epsilon \mathcal{N}(\mu = \mu_{\text{outlier}}, \sigma_{err}) \\ \end{align*} \]

 

scale-dependent noise

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma = a\eta_i + b) \\ \end{align*} \]

caveat: these are just gestures towards possibilities; not necessarily fully functional!

\(t\)-test scenario

IQ data

iq_data = readr::read_csv('data/03_Kruschke_TwoGroupIQ.csv')
summary(iq_data)
##      Score           Group          
##  Min.   : 50.00   Length:120        
##  1st Qu.: 92.75   Class :character  
##  Median :102.00   Mode  :character  
##  Mean   :104.13                     
##  3rd Qu.:112.00                     
##  Max.   :208.00

possible research questions?

  1. is average IQ-score higher than 100 in treatment group?
  2. is average IQ-score higher in treatment group than in control?

from Kruschke (2015, Chapter 16)

Case 1: IQ-score higher than 100 in treatment group?

Bayesian GLM:

inspect posterior distribution over \(\beta_0\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, \frac{1}{10000}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{err} & \sim \mathcal{N}(0,10000) \ T[0;\infty] \end{align*} \]

posterior inference: results

## # A tibble: 1 x 4
##   Parameter HDI_lo  mean HDI_hi
##   <fct>      <dbl> <dbl>  <dbl>
## 1 b0          101.  108.   114.

t-test subsumed under GLM
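The vector iq_score_treatment used below is not defined on this slide; a sketch of how it could be obtained from iq_data (assuming the treatment group is everything that is not the control group):

# treatment-group scores, as used in the intercept-only model below
iq_score_treatment = iq_data %>% 
  filter(Group != "Control") %>% 
  pull(Score)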

brm_fit_ttest = brm(iq_score_treatment ~ 1, 
                    data = tibble(iq_score_treatment = iq_score_treatment))
brm_fit_ttest
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: iq_score_treatment ~ 1 
##    Data: tibble(iq_score_treatment = iq_score_treatment) (Number of observations: 63) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   107.85      3.23   101.47   114.25       2824 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma    25.63      2.34    21.48    30.65       2867 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Case 2: difference between groups

Bayesian GLM:

  • predictor variable for group: x = ifelse(Group == "Control", 0, 1) (dummy coding)
  • coefficients are the intercept \(\beta_0\) (control-group mean) and the slope \(\beta_1\) (group difference), with prior knowledge encoded in the priors:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, 15) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{err} & \sim \mathcal{N}(0,100)T[0;\infty] \\ \beta_1 & \sim \mathcal{N}(0, 30) \end{align*} \]

posterior inference: results

prior1 = c(set_prior("normal(0,30)", class = "b", coef= "GroupSmartDrug"),
            set_prior("normal(0, 100)", class = "sigma"),
            set_prior("normal(100,15)", class = "Intercept"))

brm_fit = brm(
  formula = Score ~ Group, 
  data = iq_data, 
  prior = prior1
  ) 
##               term    estimate std.error       lower      upper
## 1      b_Intercept  100.100380  2.946309   95.262982  105.05456
## 2 b_GroupSmartDrug    7.618633  4.097243    1.007568   14.36550
## 3            sigma   22.427686  1.489774   20.101028   24.95942
## 4             lp__ -552.498173  1.241103 -554.954430 -551.15690
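To address research question 2 directly, one could (as a sketch) query the posterior for a positive group difference with brms' hypothesis():

# posterior evidence for a positive treatment effect (one-sided)
hypothesis(brm_fit, "GroupSmartDrug > 0")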

GLM link functions

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

(figure: GLM scheme)

common link & likelihood function

logistic regression for binomial data

 

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]

logistic function

\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]
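As a sketch, the parameterized logistic function can be written and plotted in R to see the effect of threshold \(\theta\) and gain \(\gamma\):

logistic = function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}
# larger gain -> steeper curve; the threshold shifts the curve along the x-axis
curve(logistic(x, gamma = 1), from = -5, to = 5, ylab = "logistic(x)")
curve(logistic(x, gamma = 3), add = TRUE, lty = 2)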

(figures: logistic function — panels: threshold \(\theta\), gain \(\gamma\))

multi-logit regression for multinomial data

 

  • each datum \(y_i \in \set{1, \dots, k}\), unordered
  • one linear predictor \(\eta_j\) for each category \(j \in \set{1, \dots, k}\)

 

\[ \begin{align*} \eta_j & = X_i \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
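A sketch of how such a model could be fit with brms, assuming a data frame df with an unordered factor outcome y and a metric predictor x (all names hypothetical):

# multi-logit (softmax) regression: one linear predictor per non-reference category
fit_multilogit = brm(
  y ~ x, 
  data   = df, 
  family = categorical(link = "logit")
)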

ordinal (probit) regression

  • each datum \(y_i \in \set{1, \dots, k}\), ordered
  • \(k+1\) latent threshold parameters: \(-\infty = \theta_0 < \theta_1 < \dots < \theta_k = \infty\)

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
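Analogously, a sketch of an ordinal probit regression with brms, assuming a data frame df with an ordered factor outcome rating and a predictor x (names hypothetical):

# ordinal probit regression: latent thresholds theta_j are estimated from the data
fit_ordinal = brm(
  rating ~ x, 
  data   = df, 
  family = cumulative(link = "probit")
)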

threshold-Phi model

(figure: threshold-Phi model)

summary

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

(figure: GLM scheme)

outlook

 

Tuesday

  • case study (cont.)
  • multi-level / hierarchical / mixed effects regression

 

after that

  • practice