\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

recap

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

(figure: GLM scheme)

common link & likelihood function

linear regression: a Bayesian approach

Bayes: likelihood + prior

inspect posterior distribution over \(\beta_0\), \(\beta_1\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_i & \sim \mathcal{U}(-\infty, \infty) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{err} & \sim \mathcal{U}(0, \infty) \end{align*} \]

data { 
  int<lower=1> N;  // total number of observations 
  vector[N] murder_rate;  // dependent variable 
  vector[N] low_income;  // independent variable
} 
parameters { 
  real Intercept;  real beta;  
  real<lower=0> sigma;  
} 
model { 
  // predictor value = expected value
  vector[N] mu = Intercept + low_income * beta;
  // likelihood 
  target += normal_lpdf(murder_rate | mu, sigma);
} 
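A minimal sketch of how this Stan program could be fit from R with rstan, assuming the model code is saved in a file (the name murder_model.stan is hypothetical) and murder_data is loaded:

library(rstan)

# compile the model and draw posterior samples;
# names in the data list must match the Stan data block
stan_fit = stan(
  file = "murder_model.stan",
  data = list(
    N           = nrow(murder_data),
    murder_rate = murder_data$murder_rate,
    low_income  = murder_data$low_income
  )
)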

overview


 

  • linear models with several metric predictors
    • multiplicative interactions
    • correlated predictors
  • robust regression
  • one categorical predictor
  • GLMs with other types of predicted variables
    • binary outcomes: logistic regression
    • nominal outcomes: multi-logit regression
    • ordinal outcomes: ordinal (logit/probit) regression
  • case study: logistic regression with multiple categorical predictors

LM with multiple predictors

linear model

data

  • \(y\): \(n \times 1\) vector of predicted variables (metric)

  • \(X\): \(n \times k\) matrix of predictor variables (metric)

parameters

  • \(\beta\): \(k \times 1\) vector of coefficients
  • \(\sigma_{err}\): standard deviation of Gaussian noise

model

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]

example

murder_data %>% head
## # A tibble: 6 x 4
##   murder_rate low_income unemployment population
##         <dbl>      <dbl>        <dbl>      <dbl>
## 1        11.2       16.5          6.2     587000
## 2        13.4       20.5          6.4     643000
## 3        40.7       26.3          9.3     635000
## 4         5.3       16.5          5.3     692000
## 5        24.8       19.2          7.3    1248000
## 6        12.7       16.5          5.9     643000
  • predicted variable \(y\): murder_data[,1]
  • predictor matrix \(X\): murder_data[,2:4]
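As a sketch, the design matrix \(X\) (including an intercept column of ones, cf. below) and the linear predictor \(\eta = X \beta\) can be computed explicitly; the coefficient vector here is hypothetical, for illustration only:

# design matrix: first column is all ones (intercept), remaining columns are the predictors
X = model.matrix(murder_rate ~ low_income + unemployment + population,
                 data = murder_data)
beta = c(-35, 1.2, 4.4, 0)   # hypothetical coefficient vector, for illustration only
eta  = X %*% beta            # linear predictor: one value per observation
head(eta)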

predictor value for two predictor variables

(figure: multiple metric predictors)

transformed predictor variables

 

intercept

\(X_{i,1} = 1\) for all \(i\)

 

multiplicative interaction

\(X_{i,l} = X_{i,j} \cdot X_{i,k}\) for all \(i\) with \(j \neq k\)

transformed variables

\(X_{i,k} = F(X_{i,j})\) (e.g., \(F(x) = \log(x)\) or \(F(x) = x^2\))  

 

crazy terms\(^*\)

\(X_{i,l} = \exp((X_{i,j} - X_{i,k})^2 + \sqrt{X_{i,m}})\)

\(^*\) at this point it hardly makes sense to speak of a "linear model" any more, even if it is formally still subsumable
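As a sketch, such terms can be specified directly in R's formula syntax (assuming murder_data as above): * expands into both main effects plus their multiplicative interaction, and transformations can be applied inline:

# interaction between two predictors plus a log-transformed predictor
glm(murder_rate ~ low_income * unemployment + log(population),
    data = murder_data) %>% 
  broom::tidy()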

multiplicative interaction

(figure: multiplicative interactions)

recall: murder data

GGally::ggpairs(murder_data, 
                title = "Murder rate data")

multiple predictors

point estimates

glm(murder_rate ~ low_income + population, 
    data = murder_data) %>% 
  broom::tidy()
## # A tibble: 3 x 5
##   term             estimate   std.error statistic    p.value
##   <chr>               <dbl>       <dbl>     <dbl>      <dbl>
## 1 (Intercept) -31.2         8.30           -3.76  0.00156   
## 2 low_income    2.60        0.403           6.44  0.00000615
## 3 population    0.000000420 0.000000767     0.547 0.591

Bayes

BF_fit_1 = BayesFactor::lmBF(
  murder_rate ~ low_income + population, 
  data = murder_data %>% as.data.frame)
post_samples_1 = BayesFactor::posterior(
  model = BF_fit_1, 
  iterations = 5000)
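The returned samples form an MCMC object (coda), so they can be summarized and plotted directly; a minimal sketch:

summary(post_samples_1)   # posterior means, SDs and quantiles per parameter
plot(post_samples_1)      # trace plots and marginal densities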

correlated predictors

point estimates

glm(murder_rate ~ low_income + unemployment, 
    data = murder_data) %>% 
  broom::tidy()
## # A tibble: 3 x 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    -34.1      6.73      -5.07 0.0000956
## 2 low_income       1.22     0.568      2.15 0.0459   
## 3 unemployment     4.40     1.53       2.88 0.0103

Bayes

BF_fit_2 = BayesFactor::lmBF(
  murder_rate ~ low_income + unemployment, 
  data = murder_data %>% as.data.frame)
post_samples_2 = BayesFactor::posterior(
  model = BF_fit_2, 
  iterations = 5000)

known model with (un)correlated predictors

uncorrelated

(figure: Kruschke 2015, Fig. 18.1)

correlated

(figure: Kruschke 2015, Fig. 18.1)

joint posterior distribution

 

murder_rate ~ low_income + population

murder_rate ~ low_income + unemployment

model comparison

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + unemployment + population,
  data = murder_data %>% as.data.frame)
plot(BF_group_fit)

model comparison: perfect correlation (Bayes)

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + inv_income,
  data = murder_data %>% 
    mutate(inv_income = -low_income) %>% 
    as.data.frame)
plot(BF_group_fit)

method comparison: point-estimates vs. Bayes

murder_data = murder_data %>% mutate(inv_income = -low_income)
glm(murder_rate ~ low_income + inv_income, data = murder_data) %>% broom::tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)   -29.9      7.79      -3.84 0.00120   
## 2 low_income      2.56     0.390      6.56 0.00000364
m = brm(murder_rate ~ low_income + inv_income, data = murder_data)
broom::tidy(m)
##           term  estimate   std.error        lower       upper
## 1  b_Intercept -30.10417    8.442471   -43.810329  -16.042556
## 2 b_low_income 222.92690 3737.368597 -6061.069129 5405.508617
## 3 b_inv_income 220.36852 3737.365239 -6063.784900 5403.503213
## 4        sigma   5.88133    1.367383     4.107823    8.354835
## 5         lp__ -67.65668    1.338055   -70.020305  -66.075633

posterior of perfectly (inversely) correlated predictors

n = posterior_samples(m) %>% as_tibble()
qplot(x = n$b_low_income, y = n$b_inv_income) +
  xlab("low income") + ylab("inverse low-income")

correlated predictors: summary

 

general problem

correlated predictors distort parameter inference in linear models
(same for point-estimation and Bayes)

 

point estimates

  • perfect correlation \(\Rightarrow\) no unique best fit \(\Rightarrow\) inapplicable
  • strong correlation \(\Rightarrow\) estimation algorithm may implode (silently)

Bayes

  • perfect correlation \(\Rightarrow\) no conceptual problem
  • problem clearly visible in diffuse marginal posteriors and correlated joint posterior densities

robust regression

noise model in the likelihood function

 

linear model (so far)

standard likelihood function: normal distribution

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]

\(\Rightarrow\) encodes an assumption about noise

 

robust regression

uses Student's \(t\) distribution instead

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \text{Student-}t(\mu = \eta_i, \sigma_{err}, \nu) \\ \end{align*} \]
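A minimal sketch of a robust regression with brms (assuming murder_data as above): only the likelihood family changes, and \(\nu\) is estimated along with the other parameters:

# robust regression: Student's t likelihood instead of a normal one
fit_robust = brm(
  murder_rate ~ low_income, 
  data   = murder_data, 
  family = student()
)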

Student's \(t\) distribution

 

robustness against outliers

 

(figure: Kruschke 2015, Fig. 16.1)

many robust noise models conceivable

 

systematic and independent outlier group

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim (1-\epsilon) \mathcal{N}(\mu = \eta_i, \sigma_{err}) + \epsilon \mathcal{N}(\mu = \mu_{\text{outlier}}, \sigma_{err}) \\ \end{align*} \]

 

scale-dependent noise

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma = a\eta_i + b) \\ \end{align*} \]

caveat: these are just gestures towards possibilities; not necessarily fully functional!

\(t\)-test scenario

IQ data

iq_data = readr::read_csv('data/03_Kruschke_TwoGroupIQ.csv')
summary(iq_data)
##      Score           Group          
##  Min.   : 50.00   Length:120        
##  1st Qu.: 92.75   Class :character  
##  Median :102.00   Mode  :character  
##  Mean   :104.13                     
##  3rd Qu.:112.00                     
##  Max.   :208.00

possible research questions?

  1. is average IQ-score higher than 100 in treatment group?
  2. is average IQ-score higher in treatment group than in control?

from Kruschke (2015, Chapter 16)

Case 1: IQ-score higher than 100 in treatment group?

Bayesian GLM:

inspect posterior distribution over \(\beta_0\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, \frac{1}{10000}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{err} & \sim \mathcal{N}(0,10000) \ T[0;\infty] \end{align*} \]

posterior inference: results

## # A tibble: 1 x 4
##   Parameter HDI_lo  mean HDI_hi
##   <fct>      <dbl> <dbl>  <dbl>
## 1 b0          101.  108.   114.

t-test subsumed under GLM
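The vector iq_score_treatment used below is not defined on this slide; a sketch of how it could be obtained from iq_data (assuming the treatment group is everything that is not the control group):

# treatment-group scores, as used in the intercept-only model below
iq_score_treatment = iq_data %>% 
  filter(Group != "Control") %>% 
  pull(Score)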

brm_fit_ttest = brm(iq_score_treatment ~ 1, 
                    data = tibble(iq_score_treatment = iq_score_treatment))
brm_fit_ttest
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: iq_score_treatment ~ 1 
##    Data: tibble(iq_score_treatment = iq_score_treatment) (Number of observations: 63) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   107.85      3.23   101.47   114.25       2824 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma    25.63      2.34    21.48    30.65       2867 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Case 2: difference between groups

Bayesian GLM:

  • predictor variable for group: x = ifelse(Group == "Control", 0, 1) (dummy coding)
  • coefficients are the intercept \(\beta_0\) (control-group mean) and the slope \(\beta_1\) (group difference), with prior knowledge encoded in the priors:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, 15) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma_{err} & \sim \mathcal{N}(0,100)T[0;\infty] \\ \beta_1 & \sim \mathcal{N}(0, 30) \end{align*} \]

posterior inference: results

prior1 = c(set_prior("normal(0,30)", class = "b", coef= "GroupSmartDrug"),
            set_prior("normal(0, 100)", class = "sigma"),
            set_prior("normal(100,15)", class = "Intercept"))

brm_fit = brm(
  formula = Score ~ Group, 
  data = iq_data, 
  prior = prior1
  ) 
##               term    estimate std.error       lower      upper
## 1      b_Intercept  100.100380  2.946309   95.262982  105.05456
## 2 b_GroupSmartDrug    7.618633  4.097243    1.007568   14.36550
## 3            sigma   22.427686  1.489774   20.101028   24.95942
## 4             lp__ -552.498173  1.241103 -554.954430 -551.15690
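To address research question 2 directly, one could (as a sketch) query the posterior for a positive group difference with brms' hypothesis():

# posterior evidence for a positive treatment effect (one-sided)
hypothesis(brm_fit, "GroupSmartDrug > 0")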

GLM link functions

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

(figure: GLM scheme)

common link & likelihood function

logistic regression for binomial data

 

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]

logistic function

\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]
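As a sketch, the parameterized logistic function can be written and plotted in R to see the effect of threshold \(\theta\) and gain \(\gamma\):

logistic = function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}
# larger gain -> steeper curve; the threshold shifts the curve along the x-axis
curve(logistic(x, gamma = 1), from = -5, to = 5, ylab = "logistic(x)")
curve(logistic(x, gamma = 3), add = TRUE, lty = 2)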

(figures: logistic function — panels: threshold \(\theta\), gain \(\gamma\))

multi-logit regression for multinomial data

 

  • each datum \(y_i \in \set{1, \dots, k}\), unordered
  • one linear predictor \(\eta_j\) for each category \(j \in \set{1, \dots, k}\)

 

\[ \begin{align*} \eta_j & = X_i \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
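A sketch of how such a model could be fit with brms, assuming a data frame df with an unordered factor outcome y and a metric predictor x (all names hypothetical):

# multi-logit (softmax) regression: one linear predictor per non-reference category
fit_multilogit = brm(
  y ~ x, 
  data   = df, 
  family = categorical(link = "logit")
)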

ordinal (probit) regression

  • each datum \(y_i \in \set{1, \dots, k}\), ordered
  • \(k+1\) latent threshold parameters: \(-\infty = \theta_0 < \theta_1 < \dots < \theta_k = \infty\)

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
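Analogously, a sketch of an ordinal probit regression with brms, assuming a data frame df with an ordered factor outcome rating and a predictor x (names hypothetical):

# ordinal probit regression: latent thresholds theta_j are estimated from the data
fit_ordinal = brm(
  rating ~ x, 
  data   = df, 
  family = cumulative(link = "probit")
)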

threshold-Phi model

(figure: threshold-Phi model)

summary

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

(figure: GLM scheme)

outlook

 

Tuesday

  • case study (cont.)
  • multi-level / hierarchical / mixed effects regression

 

after that

  • practice