\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

overview

 

  • generalized linear model (GLM)
    • types of variables
      • metric, nominal, ordinal, count
  • linear model
    • ordinary least squares regression
    • maximum likelihood regression
    • Bayesian approaches
  • generalization to simplest \(t\)-test scenario

generalized linear model

probabilistic models

 

standard notion

model = likelihood function \(P(D \mid \theta)\)

Bayesian

model = likelihood \(P(D \mid \theta)\) + prior \(P(\theta)\)

approaches to modeling

 

[figure: flavors of modeling]

[figure: the Generalized Context Model (GCM) as an example model]

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]
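for instance, ordinary least squares regression (developed below) instantiates this blueprint with a linear predictor, the identity link function, and a normal likelihood function:

\[ \begin{align*} \eta & = \beta_0 + \beta_1 x \\ \mu & = \eta \\ y & \sim \mathcal{N}(\mu, \sigma) \end{align*} \]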

[figure: GLM scheme]

types of variables

 

type    | examples
--------|------------------------------------------------------------------
metric  | speed of a car, reading time, average time spent cooking p.d., …
binary  | coin flip, truth-value judgement, experience with R, …
nominal | gender, political party, favorite philosopher, …
ordinal | level of education, rating scale judgement, …
count   | number of cars passing under bridge in 1h, …

common link & likelihood function
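for orientation, one common pairing of variable type, link function, and likelihood function per type (other combinations are possible and common):

type of \(y\) | link function    | likelihood function
--------------|------------------|--------------------
metric        | identity         | normal
binary        | logit            | Bernoulli
nominal       | softmax          | categorical
ordinal       | cumulative logit | categorical
count         | logarithm        | Poisson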

linear regression

murder rate data set

 

library(tidyverse)   # provides %>%, rename, select, ggplot, ...
murder_data = readr::read_csv('data/02_murder_rates.csv') %>% 
  rename(murder_rate = annual_murder_rate_per_million_inhabitants,
         low_income = percentage_low_income, 
         unemployment = percentage_unemployment) %>% 
  select(murder_rate, low_income, unemployment, population)
murder_data %>% head
## # A tibble: 6 x 4
##   murder_rate low_income unemployment population
##         <dbl>      <dbl>        <dbl>      <int>
## 1       11.2        16.5         6.20     587000
## 2       13.4        20.5         6.40     643000
## 3       40.7        26.3         9.30     635000
## 4        5.30       16.5         5.30     692000
## 5       24.8        19.2         7.30    1248000
## 6       12.7        16.5         5.90     643000

visualize data

GGally::ggpairs(murder_data, 
                title = "Murder rate data")

zooming in

murder_sub = select(murder_data, murder_rate, low_income) 
rate_income_plot = ggplot(murder_sub, 
    aes(x = low_income, y = murder_rate)) + geom_point()
rate_income_plot

linear regression

 

given

  • predicted variable \(y\): murder rates (metric)
  • predictor variable \(x\): percentage low income (metric)

 

question

which linear function, in terms of intercept \(\beta_0\) and slope \(\beta_1\), approximates the data best?

\[ y_{\text{pred}} = \beta_0 + \beta_1 x \]

ordinary linear regression

 

“geometric” approach

find \(\beta_0\) and \(\beta_1\) that minimize the squared error between predicted \(y_\text{pred}\) and observed \(y\)

 

mse = function(y, x, beta_0, beta_1) {
  yPred = beta_0 + x * beta_1   # model predictions
  mean((y - yPred)^2)           # mean squared error
}
fit_mse = optim(par = c(0, 1), 
  fn = function(par) with(murder_data,
    mse(murder_rate,low_income, 
        par[1], par[2])))
fit_mse$par
## [1] -29.905188   2.559588
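for simple linear regression, the least-squares problem also has a closed-form solution, \(\hat{\beta}_1 = \text{Cov}(x, y) / \text{Var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\); a quick check in base R:

beta_1_hat = with(murder_data, cov(low_income, murder_rate) / var(low_income))
beta_0_hat = with(murder_data, mean(murder_rate) - beta_1_hat * mean(low_income))
c(beta_0_hat, beta_1_hat)  # matches fit_mse$par up to optimization tolerance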

linear regression: MLE

maximum likelihood approach

find \(\beta_0\), \(\beta_1\), and \(\sigma\) that maximize the likelihood of the observed \(y\):

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma) \end{align*} \]

 

nll = function(y, x, beta_0, beta_1, sd) {
  if (sd <= 0) { return( Inf ) }   # rule out invalid standard deviations
  yPred = beta_0 + x * beta_1      # model predictions
  nll = -dnorm(y, mean = yPred, sd = sd, log = TRUE)  # pointwise negative log-likelihood
  sum(nll)
}
fit_lh = optim(par = c(0, 1, 1), 
  fn = function(par) with(murder_data, 
    nll(murder_rate, low_income,
        par[1], par[2], par[3])))
fit_lh$par
## [1] -29.901445   2.559601   5.234126
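the ML estimates of \(\beta_0\) and \(\beta_1\) coincide with the least-squares estimates (for a normal likelihood, maximizing the likelihood in \(\mu\) amounts to minimizing squared error), and the ML estimate of \(\sigma\) is the root of the mean squared residual (divisor \(n\), not \(n-1\)); a quick check at the least-squares optimum:

# sigma_ML = root mean squared residual at the least-squares estimates
with(murder_data,
     sqrt(mean((murder_rate - (fit_mse$par[1] + fit_mse$par[2] * low_income))^2)))
# close to fit_lh$par[3]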

compare homebrew to the “real thing”

fit_lm  =  lm(formula = murder_rate ~ low_income, data = murder_data)
fit_glm = glm(formula = murder_rate ~ low_income, data = murder_data)

tibble(parameter = c("beta_0", "beta_1"),
       manual_mse = fit_mse$par,
       manual_lh = fit_lh$par[1:2],
       lm  = fit_lm %>% coefficients,
       glm = fit_glm %>% coefficients) %>% show
## # A tibble: 2 x 5
##   parameter manual_mse manual_lh     lm    glm
##   <chr>          <dbl>     <dbl>  <dbl>  <dbl>
## 1 beta_0        -29.9     -29.9  -29.9  -29.9 
## 2 beta_1          2.56      2.56   2.56   2.56

linear regression: a Bayesian approach

Bayes: likelihood + prior

inspect posterior distribution over \(\beta_0\), \(\beta_1\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{\text{err}}) \\ \beta_i & \sim \mathcal{U}(-\infty, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{\text{err}} & \sim \mathcal{U}(0, \infty) \end{align*} \]

data { 
  int<lower=1> N;  // total number of observations 
  vector[N] murder_rate;  // dependent variable 
  vector[N] low_income;  // independent variable
} 
parameters { 
  real Intercept;  real beta;  
  real<lower=0> sigma;  
} 
model { 
  // predictor value = expected value
  vector[N] mu = Intercept + low_income * beta;
  // likelihood 
  target += normal_lpdf(murder_rate | mu, sigma);
} 
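a minimal sketch of running this model from R with the rstan package, assuming the Stan code above is saved in a file murder_model.stan (the file name is illustrative):

stan_data = list(N = nrow(murder_data),
                 murder_rate = murder_data$murder_rate,
                 low_income = murder_data$low_income)
fit_stan = rstan::stan(file = "murder_model.stan", data = stan_data)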

linear regression: a Bayesian approach

## # A tibble: 3 x 4
##   Parameter HDI_lo   mean HDI_hi
##   <chr>      <dbl>  <dbl>  <dbl>
## 1 beta        1.69   2.57   3.39
## 2 Intercept -47.0  -30.2  -12.8 
## 3 sigma       4.12   5.97   8.18

Using the brms package

fit_lm  =        lm(formula = murder_rate ~ low_income, data = murder_data)
fit_glm =       glm(formula = murder_rate ~ low_income, data = murder_data)
fit_brm = brms::brm(formula = murder_rate ~ low_income, data = murder_data)

tibble(parameter = c("beta_0", "beta_1"),
       manual_mse = fit_mse$par,
       manual_lh = fit_lh$par[1:2],
       lm  = fit_lm %>% coefficients,
       glm = fit_glm %>% coefficients,
       brm = c(summary(fit_brm)$fixed["Intercept", "Estimate"],
               summary(fit_brm)$fixed["low_income", "Estimate"])) %>% 
show
## # A tibble: 2 x 6
##   parameter manual_mse manual_lh     lm    glm    brm
##   <chr>          <dbl>     <dbl>  <dbl>  <dbl>  <dbl>
## 1 beta_0        -29.9     -29.9  -29.9  -29.9  -29.9 
## 2 beta_1          2.56      2.56   2.56   2.56   2.56

Stan model code generated from brms::brm

stancode(fit_brm)
data { 
  int<lower=1> N;  
  vector[N] Y;  
  int<lower=1> K; 
  matrix[N, K] X;  
  int prior_only; 
} 
transformed data { 
  int Kc = K - 1; 
  matrix[N, K - 1] Xc;  
  vector[K - 1] means_X;  
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 
parameters { 
  vector[Kc] b; 
  real temp_Intercept;  
  real<lower=0> sigma;  
} 
model { 
  vector[N] mu = temp_Intercept + Xc * b;
  // priors including all constants 
  target += student_t_lpdf(
    temp_Intercept | 3, 20, 10
    ); 
  target += student_t_lpdf(
    sigma | 3, 0, 10
    ) - 1 * student_t_lccdf(
      0 | 3, 0, 10
      ); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma);
  } 
} 
generated quantities { 
  // actual population-level intercept 
  real b_Intercept = temp_Intercept - 
    dot_product(means_X, b); 
} 
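the transformed data block centers the predictor matrix, which typically improves sampling; the generated quantities block then undoes the centering: writing \(\alpha_c\) for temp_Intercept,

\[ \mu = \alpha_c + (X - \bar{X}) \, b = \underbrace{(\alpha_c - \bar{X} \, b)}_{b_{\text{Intercept}}} + X \, b \]

so b_Intercept is the intercept on the original scale of the predictors.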

inference with regression models

inference with regression models

 

parameter estimation

  • is coefficient \(\beta_i\) significantly/credibly different from, e.g., 0?

 

model comparison

  • compare model with \(\beta_i\) fixed to model with \(\beta_i\) free

 

prediction/model criticism

  • predict future data from model
  • could the model have plausibly generated the observed data? (see the sketch below)
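for the last point, brms offers posterior predictive checks out of the box; applied to the regression fit from above (fit_brm), this overlays data simulated from the fitted model on the observed data:

pp_check(fit_brm)  # density overlay of replicated vs. observed murder_rate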

significance of point-estimates of coefficients

fit_glm = glm(formula = murder_rate ~ low_income, data = murder_data)
fit_glm %>% summary
## 
## Call:
## glm(formula = murder_rate ~ low_income, data = murder_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.1663  -2.5613  -0.9552   2.8887  12.3475  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.901      7.789  -3.839   0.0012 ** 
## low_income     2.559      0.390   6.562 3.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 30.38125)
## 
##     Null deviance: 1855.20  on 19  degrees of freedom
## Residual deviance:  546.86  on 18  degrees of freedom
## AIC: 128.93
## 
## Number of Fisher Scoring iterations: 2
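for comparison with the Bayesian credible intervals on the next slide, confidence intervals for the coefficients can be obtained with confint (profile-likelihood intervals for a glm fit):

confint(fit_glm)  # 95% confidence intervals for (Intercept) and low_income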

credible intervals for posteriors over coefficients

 

bayes_regression_samples %>% 
  group_by(Parameter) %>% 
  summarize(HDI_lo = coda::HPDinterval(coda::as.mcmc(value))[1],
            mean = mean(value),
            HDI_hi = coda::HPDinterval(coda::as.mcmc(value))[2]) %>% 
  show
## # A tibble: 3 x 4
##   Parameter HDI_lo   mean HDI_hi
##   <chr>      <dbl>  <dbl>  <dbl>
## 1 beta        1.69   2.57   3.39
## 2 Intercept -47.0  -30.2  -12.8 
## 3 sigma       4.12   5.97   8.18

model comparison

frequentist

fit_glm_simple = glm(formula = murder_rate ~ 1, data = murder_data)
AIC(fit_glm_simple, fit_glm)
##                df      AIC
## fit_glm_simple  2 151.3579
## fit_glm         3 128.9268
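the AIC difference of about 22 strongly favors the model with low_income as predictor; since the two models are nested, a classical F-test is also available:

anova(fit_glm_simple, fit_glm, test = "F")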

Bayesian

# package does not 'like' tibbles
BayesFactor::regressionBF(formula = murder_rate ~ low_income, 
            data = as.data.frame(murder_sub))
## Bayes factor analysis
## --------------
## [1] low_income : 3429.181 ±0.01%
## 
## Against denominator:
##   Intercept only 
## ---
## Bayes factor type: BFlinearModel, JZS
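a Bayes factor of roughly 3400 in favor of including low_income counts as decisive evidence by conventional standards (e.g., Jeffreys' threshold of 100).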

\(t\)-test scenario

IQ data

iq_data = readr::read_csv('data/03_Kruschke_TwoGroupIQ.csv')
summary(iq_data)
##      Score           Group          
##  Min.   : 50.00   Length:120        
##  1st Qu.: 92.75   Class :character  
##  Median :102.00   Mode  :character  
##  Mean   :104.13                     
##  3rd Qu.:112.00                     
##  Max.   :208.00

possible research questions?

  1. is the average IQ-score higher than 100 in the treatment group?
  2. is the average IQ-score higher in the treatment group than in the control group? (next time)

from Kruschke (2015, Chapter 16)

Case 1: IQ-score higher than 100 in treatment group?

Bayesian GLM:

inspect posterior distribution over \(\beta_0\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{\text{err}}) \\ \beta_0 & \sim \mathcal{N}(100, \frac{1}{10000}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{\text{err}} & \sim \mathcal{N}(0, 10000) \, I[0, \infty) \end{align*} \]

posterior inference: results

## # A tibble: 1 x 4
##   Parameter HDI_lo  mean HDI_hi
##   <fct>      <dbl> <dbl>  <dbl>
## 1 b0          101.  108.   114.

t-test subsumed under GLM
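the brm fit below presupposes a vector iq_score_treatment that the slides do not construct; a minimal sketch of the extraction (the group label 'Smart Drug' is an assumption about this data set), together with the classical one-sample \(t\)-test that the GLM subsumes:

iq_score_treatment = iq_data$Score[iq_data$Group == "Smart Drug"]  # label assumed
t.test(iq_score_treatment, mu = 100)  # classical test of H0: mean IQ-score = 100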

brm_fit_ttest = brm(iq_score_treatment ~ 1, 
                    data = tibble(iq_score_treatment = iq_score_treatment))
brm_fit_ttest
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: iq_score_treatment ~ 1 
##    Data: tibble(iq_score_treatment = iq_score_treatment) (Number of observations: 63) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   107.77      3.17   101.66   114.01       3083 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma    25.66      2.34    21.68    30.66       3323 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

summary

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

outlook

 

Next

  • robust regression
  • GLM with nominal predictors
  • GLM with nominal & ordinal predicted variables

 

After

  • mixed effects models
  • model comparison by cross-validation