\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

recap

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

common link & likelihood function

linear regression: a Bayesian approach

Bayes: likelihood + prior

inspect posterior distribution over \(\beta_0\), \(\beta_1\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_i & \sim \mathcal{N}(0, \sigma_{\beta}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]

model{
  sigma_e <- 1/sqrt(tau_e)      # convert precision to standard deviation
  tau_e ~ dgamma(0.1,0.1)       # Gamma prior on the precision 1/sigma_err^2
  b0 ~ dnorm(0, 1/10000000)     # vague priors; dnorm takes a precision, not an sd
  b1 ~ dnorm(0, 1/10000000)
  for (i in 1:k){
    yPred[i] <- b0 + b1 * x[i]
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}
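A minimal sketch of how this JAGS model can be run from R via rjags (assuming the model block above is stored as the string model_string, and that x and y are the observed data vectors):

library(rjags)
# compile the model; JAGS needs all data referenced in the model block
jags_model <- jags.model(textConnection(model_string),
                         data = list(x = x, y = y, k = length(y)),
                         n.chains = 2)
update(jags_model, 1000)   # burn-in
samples <- coda.samples(jags_model,
                        variable.names = c("b0", "b1", "sigma_e"),
                        n.iter = 5000)
summary(samples)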

overview

 

  • linear models with several metric predictors
    • correlated predictors
  • robust regression
  • GLMs with other types of predictor variables
    • binary outcomes: logistic regression
    • nominal outcomes: multi-logit regression
    • ordinal outcomes: ordinal (logit/probit) regression

LM with multiple predictors

linear model

data

  • \(y\): \(n \times 1\) vector of predicted values (metric)

  • \(X\): \(n \times k\) matrix of predictor values (metric)

parameters

  • \(\beta\): \(k \times 1\) vector of coefficients
  • \(\sigma_{err}\): standard deviation of Gaussian noise

model

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]
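As a sanity check, a minimal R sketch simulating data from this model (all numbers made up):

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))    # n x k predictor matrix, incl. an intercept column
beta <- c(2, 0.5, -1)                # k x 1 coefficient vector
sigma_err <- 1
y <- as.vector(X %*% beta) + rnorm(n, 0, sigma_err)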

example

murder_data %>% head
## # A tibble: 6 x 4
##   murder_rate low_income unemployment population
##         <dbl>      <dbl>        <dbl>      <int>
## 1        11.2       16.5          6.2     587000
## 2        13.4       20.5          6.4     643000
## 3        40.7       26.3          9.3     635000
## 4         5.3       16.5          5.3     692000
## 5        24.8       19.2          7.3    1248000
## 6        12.7       16.5          5.9     643000
  • predicted variable \(y\): murder_data[,1]
  • predictor matrix \(X\): murder_data[,2:4]

predictor variables beyond the raw data

 

intercept

\(X_{i,1} = 1\) for all \(i\)

 

interaction terms

\(X_{i,l} = X_{i,j} \cdot X_{i,k}\) for all \(i\) with \(j \neq k\)

transformed variables

\(X_{i,k} = F(X_{i,j})\) (e.g., \(F(x) = \log(x)\) or \(F(x) = x^2\))  

 

crazy terms\(^*\)

\(X_{i,l} = \exp((X_{i,j} - X_{i,k})^2 + \sqrt{X_{i,m}})\)

\(^*\) at which point it becomes pointless to still speak of a "linear model", even if it is formally subsumable
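For illustration, R's model.matrix builds such extended predictor matrices from a formula; a small sketch with the murder data (an intercept, an interaction term, and a log-transformed variable):

model.matrix(~ low_income * unemployment + log(population),
             data = murder_data) %>% head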

recall: murder data

GGally::ggpairs(murder_data, 
                title = "Murder rate data")

multiple predictors

point estimates

glm(murder_rate ~ low_income + population, 
    data = murder_data) %>% 
  broom::tidy()
##          term      estimate    std.error statistic      p.value
## 1 (Intercept) -3.121522e+01 8.300600e+00 -3.760598 0.0015586420
## 2  low_income  2.595519e+00 4.032796e-01  6.436029 0.0000061507
## 3  population  4.198061e-07 7.674560e-07  0.547010 0.5914808122

Bayes

BF_fit_1 = BayesFactor::lmBF(
  murder_rate ~ low_income + population, 
  data = murder_data %>% as.data.frame)
post_samples_1 = BayesFactor::posterior(
  model = BF_fit_1, 
  iterations = 5000)
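The returned samples behave like a coda MCMC object, so the standard summaries apply, e.g.:

summary(post_samples_1)   # posterior means, sds and quantiles per parameter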

correlated predictors

point estimates

glm(murder_rate ~ low_income + unemployment, 
    data = murder_data) %>% 
  broom::tidy()
##           term   estimate std.error statistic      p.value
## 1  (Intercept) -34.072533 6.7265451 -5.065384 9.558945e-05
## 2   low_income   1.223931 0.5682047  2.154031 4.587777e-02
## 3 unemployment   4.398936 1.5261685  2.882340 1.034228e-02

Bayes

BF_fit_2 = BayesFactor::lmBF(
  murder_rate ~ low_income + unemployment, 
  data = murder_data %>% as.data.frame)
post_samples_2 = BayesFactor::posterior(
  model = BF_fit_2, 
  iterations = 5000)

known model with (un)correlated predictors

uncorrelated

[figure: Kruschke_Fig18_1 — posterior for uncorrelated predictors]

correlated

[figure: Kruschke_Fig18_1 — posterior for correlated predictors]

joint posterior distribution

 

murder_rate ~ low_income + population

murder_rate ~ low_income + unemployment
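A sketch of how such joint posterior plots can be produced from the samples drawn above; the column names are an assumption (check colnames(post_samples_2) first):

library(ggplot2)
as.data.frame(as.matrix(post_samples_2)) %>%
  ggplot(aes(x = low_income, y = unemployment)) +
  geom_point(alpha = 0.2)   # correlated predictors: anti-correlated slope samples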

model comparison

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + unemployment + population,
  data = murder_data %>% as.data.frame)
plot(BF_group_fit)

model comparison: perfect correlation

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + inv_income,
  data = murder_data %>% 
    mutate(inv_income = -low_income) %>% 
    as.data.frame)
plot(BF_group_fit)

model comparison: perfect correlation

BF_complex = BayesFactor::regressionBF(
  murder_rate ~ low_income + inv_income,
  rscaleCont = 5, # set a wider prior for all slopes
  data = murder_data %>% 
    mutate(inv_income = -low_income) %>% 
    as.data.frame)
plot(BF_complex)

correlated predictors: summary

 

general problem

correlated predictors distort parameter inference in linear models
(same for point-estimation and Bayes)

 

point estimates

  • perfect correlation \(\Rightarrow\) no unique best fit \(\Rightarrow\) inapplicable
  • strong correlation \(\Rightarrow\) estimation algorithm may implode

Bayes

  • perfect correlation \(\Rightarrow\) no conceptual problem
  • computation can be harder, but priors can mitigate the problem

robust regression

noise model in the likelihood function

 

linear model (so far)

standard likelihood function: normal distribution

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]

\(\Rightarrow\) encodes an assumption about noise

 

robust regression

uses Student's \(t\) distribution instead

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \text{Student-}t(\mu = \eta_i, \sigma_{err}, \nu) \\ \end{align*} \]

Student's \(t\) distribution

 

robustness against outliers

 

[figure: Kruschke (2015), Fig. 16.1 — normal vs. robust regression in the presence of outliers]
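To see numerically why the \(t\) likelihood is more tolerant of outliers, compare tail densities (a minimal base-R sketch; \(\nu = 3\) chosen arbitrarily):

x <- seq(-6, 6, length.out = 201)
plot(x, dnorm(x), type = "l", ylab = "density")   # normal density
lines(x, dt(x, df = 3), lty = 2)                  # Student's t, nu = 3: heavier tails
dnorm(5); dt(5, df = 3)   # an observation 5 sds out is far less surprising under t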

many robust noise models conceivable

 

systematic and independent outlier group

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim (1-\epsilon) \mathcal{N}(\mu = \eta_i, \sigma_{err}) + \epsilon \mathcal{N}(\mu = \mu_{\text{outlier}}, \sigma_{err}) \\ \end{align*} \]

 

scale-dependent noise

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma = a\eta_i + b) \\ \end{align*} \]

caveat: these are just gestures towards possibilities; not necessarily fully functional!

\(t\)-test scenario

IQ data

iq_data = readr::read_csv('../data/07_Kruschke_TwoGroupIQ.csv')
summary(iq_data)
##      Score           Group          
##  Min.   : 50.00   Length:120        
##  1st Qu.: 92.75   Class :character  
##  Median :102.00   Mode  :character  
##  Mean   :104.13                     
##  3rd Qu.:112.00                     
##  Max.   :208.00

possible research questions?

  1. is the average IQ-score in the treatment group higher than 100?
  2. is the average IQ-score higher in the treatment group than in the control group?

from Kruschke (2015, Chapter 16)

Case 1: IQ-score higher than 100 in treatment group?

Bayesian GLM:

only coefficient is intercept \(\beta_0\), with prior knowledge encoded in priors:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, 15) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]

model{
  sigma_e <- 1/sqrt(tau_e)   # convert precision to standard deviation
  tau_e ~ dgamma(0.1,0.1)
  b0 ~ dnorm(100, 1/15^2)    # prior: IQ-scores are normed to mean 100, sd 15 (dnorm takes a precision)
  for (i in 1:k){
    yPred[i] <- b0           # intercept-only model
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}

posterior inference: results

## # A tibble: 1 x 4
##   Parameter   HDI_lo     mean   HDI_hi
##      <fctr>    <dbl>    <dbl>    <dbl>
## 1        b0 96.11298 102.8397 109.3491
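A table like this can be computed from MCMC output with coda; a sketch assuming samples is a coda::mcmc object containing a column b0:

coda::HPDinterval(samples, prob = 0.95)   # HDI_lo / HDI_hi
colMeans(as.matrix(samples))              # posterior means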

Case 2: difference between groups

Bayesian GLM:

  • predictor variable for group x = ifelse(Group == "Control", 0, 1) (dummy coding)
  • coefficients are the intercept \(\beta_0\) and the group difference \(\beta_1\), with prior knowledge encoded in priors:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, 15) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \\ \beta_1 & \sim \mathcal{N}(0, 30) \end{align*} \]

model{
  sigma_e <- 1/sqrt(tau_e)         # convert precision to standard deviation
  tau_e ~ dgamma(0.1,0.1)
  b0 ~ dnorm(100, 1/15^2)          # prior sd 15 (dnorm takes a precision)
  b1 ~ dnorm(0, 1/30^2)            # prior sd 30
  for (i in 1:k){
    yPred[i] <- b0 + b1 * x[i]     # group difference enters via dummy-coded x
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}

posterior inference: results

## # A tibble: 2 x 4
##   Parameter     HDI_lo       mean    HDI_hi
##      <fctr>      <dbl>      <dbl>     <dbl>
## 1        b0 94.4408927 100.098591 105.73580
## 2        b1 -0.3854978   7.678485  15.38381

GLM link functions

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

common link & likelihood function

logistic regression for binomial data

 

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]

logistic function

\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]

[figure: logistic function for varying threshold \(\theta\)]

[figure: logistic function for varying gain \(\gamma\)]
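A minimal R sketch reproducing such curves directly from the definition above:

logistic <- function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}
eta <- seq(-5, 5, length.out = 201)
plot(eta, logistic(eta), type = "l", ylab = "logistic(eta)")
lines(eta, logistic(eta, theta = 2), lty = 2)   # shifted threshold theta = 2
lines(eta, logistic(eta, gamma = 3), lty = 3)   # steeper gain gamma = 3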

multi-logit regression for multinomial data

 

  • each datum \(y_i \in \set{1, \dots, k}\), unordered
  • one linear predictor \(\eta_j\) for each category \(j \in \set{1, \dots, k}\)

 

\[ \begin{align*} \eta_j & = X_i \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
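A one-observation sketch of the softmax normalization in R (all numbers made up):

eta <- c(1.2, 0.3, -0.5)            # linear predictors, one per category
mu  <- exp(eta) / sum(exp(eta))     # normalize so that mu sums to 1
sample(1:3, size = 1, prob = mu)    # one draw from Categorical(mu)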

ordinal (probit) regression

  • each datum \(y_i \in \set{1, \dots, k}\), ordered
  • \(k+1\) latent threshold parameters: \(-\infty = \theta_0 < \theta_1 < \dots < \theta_k = \infty\)

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
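A quick R sketch of these category probabilities (k = 4; threshold and predictor values made up):

theta <- c(-Inf, -1, 0.5, 2, Inf)   # theta_0 .. theta_k with the ends fixed
eta   <- 0.8
mu    <- pnorm(theta[-1] - eta) - pnorm(theta[-5] - eta)   # one mu_j per category
sum(mu)                             # the mu_j sum to 1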

threshold-Phi model

[figure: threshPhi — threshold-Phi model]

summary

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

outlook

 

Tuesday

  • packages & tools for Bayesian regression
  • multi-level / hierarchical / mixed effects regression
  • WAIC & LOO

 

Friday

  • rounding off
  • projects