\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

recap

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

common link & likelihood function

linear regression: a Bayesian approach

Bayes: likelihood + prior

inspect posterior distribution over \(\beta_0\), \(\beta_1\) and \(\sigma_{\text{err}}\) given the data \(y\) and the model:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_i & \sim \mathcal{N}(0, \sigma_{\beta}) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]

model{
  sigma_e <- 1/sqrt(tau_e)      # convert precision to standard deviation
  tau_e ~ dgamma(0.1,0.1)       # Gamma prior on the precision 1/sigma_err^2
  b0 ~ dnorm(0, 1/10000000)     # vague priors; dnorm takes a precision, not an sd
  b1 ~ dnorm(0, 1/10000000)
  for (i in 1:k){
    yPred[i] <- b0 + b1 * x[i]
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}
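A minimal sketch of how this JAGS model can be run from R via rjags (assuming the model block above is stored as the string model_string, and that x and y are the observed data vectors):

library(rjags)
# compile the model; JAGS needs all data referenced in the model block
jags_model <- jags.model(textConnection(model_string),
                         data = list(x = x, y = y, k = length(y)),
                         n.chains = 2)
update(jags_model, 1000)   # burn-in
samples <- coda.samples(jags_model,
                        variable.names = c("b0", "b1", "sigma_e"),
                        n.iter = 5000)
summary(samples)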

overview

 

  • linear models with several metric predictors
    • correlated predictors
  • robust regression
  • GLMs with other types of predictor variables
    • binary outcomes: logistic regression
    • nominal outcomes: multi-logit regression
    • ordinal outcomes: ordinal (logit/probit) regression

LM with multiple predictors

linear model

data

  • \(y\): \(n \times 1\) vector of predicted values (metric)

  • \(X\): \(n \times k\) matrix of predictor values (metric)

parameters

  • \(\beta\): \(k \times 1\) vector of coefficients
  • \(\sigma_{err}\): standard deviation of Gaussian noise

model

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]
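As a sanity check, a minimal R sketch simulating data from this model (all numbers made up):

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))    # n x k predictor matrix, incl. an intercept column
beta <- c(2, 0.5, -1)                # k x 1 coefficient vector
sigma_err <- 1
y <- as.vector(X %*% beta) + rnorm(n, 0, sigma_err)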

example

murder_data %>% head
## # A tibble: 6 x 4
##   murder_rate low_income unemployment population
##         <dbl>      <dbl>        <dbl>      <int>
## 1        11.2       16.5          6.2     587000
## 2        13.4       20.5          6.4     643000
## 3        40.7       26.3          9.3     635000
## 4         5.3       16.5          5.3     692000
## 5        24.8       19.2          7.3    1248000
## 6        12.7       16.5          5.9     643000
  • predicted variable \(y\): murder_data[,1]
  • predictor matrix \(X\): murder_data[,2:4]

predictor variables beyond the raw data

 

intercept

\(X_{i,1} = 1\) for all \(i\)

 

interaction terms

\(X_{i,l} = X_{i,j} \cdot X_{i,k}\) for all \(i\) with \(j \neq k\)

transformed variables

\(X_{i,k} = F(X_{i,j})\) (e.g., \(F(x) = \log(x)\) or \(F(x) = x^2\))  

 

crazy terms\(^*\)

\(X_{i,l} = \exp((X_{i,j} - X_{i,k})^2 + \sqrt{X_{i,m}})\)

\(^*\) at which point it becomes pointless to still speak of a "linear model", even if it is formally subsumable
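For illustration, R's model.matrix builds such extended predictor matrices from a formula; a small sketch with the murder data (an intercept, an interaction term, and a log-transformed variable):

model.matrix(~ low_income * unemployment + log(population),
             data = murder_data) %>% head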

recall: murder data

GGally::ggpairs(murder_data, 
                title = "Murder rate data")

multiple predictors

point estimates

glm(murder_rate ~ low_income + population, 
    data = murder_data) %>% 
  broom::tidy()
##          term      estimate    std.error statistic      p.value
## 1 (Intercept) -3.121522e+01 8.300600e+00 -3.760598 0.0015586420
## 2  low_income  2.595519e+00 4.032796e-01  6.436029 0.0000061507
## 3  population  4.198061e-07 7.674560e-07  0.547010 0.5914808122

Bayes

BF_fit_1 = BayesFactor::lmBF(
  murder_rate ~ low_income + population, 
  data = murder_data %>% as.data.frame)
post_samples_1 = BayesFactor::posterior(
  model = BF_fit_1, 
  iterations = 5000)
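The returned samples behave like a coda MCMC object, so the standard summaries apply, e.g.:

summary(post_samples_1)   # posterior means, sds and quantiles per parameter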

correlated predictors

point estimates

glm(murder_rate ~ low_income + unemployment, 
    data = murder_data) %>% 
  broom::tidy()
##           term   estimate std.error statistic      p.value
## 1  (Intercept) -34.072533 6.7265451 -5.065384 9.558945e-05
## 2   low_income   1.223931 0.5682047  2.154031 4.587777e-02
## 3 unemployment   4.398936 1.5261685  2.882340 1.034228e-02

Bayes

BF_fit_2 = BayesFactor::lmBF(
  murder_rate ~ low_income + unemployment, 
  data = murder_data %>% as.data.frame)
post_samples_2 = BayesFactor::posterior(
  model = BF_fit_2, 
  iterations = 5000)

known model with (un)correlated predictors

uncorrelated

[figure: Kruschke_Fig18_1 — posterior for uncorrelated predictors]

correlated

[figure: Kruschke_Fig18_1 — posterior for correlated predictors]

joint posterior distribution

 

murder_rate ~ low_income + population

murder_rate ~ low_income + unemployment
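A sketch of how such joint posterior plots can be produced from the samples drawn above; the column names are an assumption (check colnames(post_samples_2) first):

library(ggplot2)
as.data.frame(as.matrix(post_samples_2)) %>%
  ggplot(aes(x = low_income, y = unemployment)) +
  geom_point(alpha = 0.2)   # correlated predictors: anti-correlated slope samples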

model comparison

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + unemployment + population,
  data = murder_data %>% as.data.frame)
plot(BF_group_fit)

model comparison: perfect correlation

BF_group_fit = BayesFactor::regressionBF(
  murder_rate ~ low_income + inv_income,
  data = murder_data %>% 
    mutate(inv_income = -low_income) %>% 
    as.data.frame)
plot(BF_group_fit)

model comparison: perfect correlation

BF_complex = BayesFactor::regressionBF(
  murder_rate ~ low_income + inv_income,
  rscaleCont = 5, # set a wider prior for all slopes
  data = murder_data %>% 
    mutate(inv_income = -low_income) %>% 
    as.data.frame)
plot(BF_complex)

correlated predictors: summary

 

general problem

correlated predictors distort parameter inference in linear models
(same for point-estimation and Bayes)

 

point estimates

  • perfect correlation \(\Rightarrow\) no unique best fit \(\Rightarrow\) inapplicable
  • strong correlation \(\Rightarrow\) estimation algorithm may implode

Bayes

  • perfect correlation \(\Rightarrow\) no conceptual problem
  • computation can be harder, but priors can mitigate the problem

robust regression

noise model in the likelihood function

 

linear model (so far)

standard likelihood function: normal distribution

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma_{err}) \\ \end{align*} \]

\(\Rightarrow\) encodes an assumption about noise

 

robust regression

uses Student's \(t\) distribution instead

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \text{Student-}t(\mu = \eta_i, \sigma_{err}, \nu) \\ \end{align*} \]

Student's \(t\) distribution

 

robustness against outliers

 

[figure: Kruschke (2015), Fig. 16.1 — normal vs. robust regression in the presence of outliers]
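To see numerically why the \(t\) likelihood is more tolerant of outliers, compare tail densities (a minimal base-R sketch; \(\nu = 3\) chosen arbitrarily):

x <- seq(-6, 6, length.out = 201)
plot(x, dnorm(x), type = "l", ylab = "density")   # normal density
lines(x, dt(x, df = 3), lty = 2)                  # Student's t, nu = 3: heavier tails
dnorm(5); dt(5, df = 3)   # an observation 5 sds out is far less surprising under t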

many robust noise models conceivable

 

systematic and independent outlier group

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim (1-\epsilon) \mathcal{N}(\mu = \eta_i, \sigma_{err}) + \epsilon \mathcal{N}(\mu = \mu_{\text{outlier}}, \sigma_{err}) \\ \end{align*} \]

 

scale-dependent noise

\[ \begin{align*} \eta_i & = X_i \cdot \beta & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y_i & \sim \mathcal{N}(\mu = \eta_i, \sigma = a\eta_i + b) \\ \end{align*} \]

caveat: these are just gestures towards possibilities; not necessarily fully functional!

\(t\)-test scenario

IQ data

iq_data = readr::read_csv('../data/07_Kruschke_TwoGroupIQ.csv')
summary(iq_data)
##      Score           Group          
##  Min.   : 50.00   Length:120        
##  1st Qu.: 92.75   Class :character  
##  Median :102.00   Mode  :character  
##  Mean   :104.13                     
##  3rd Qu.:112.00                     
##  Max.   :208.00

possible research questions?

  1. is the average IQ-score in the treatment group higher than 100?
  2. is the average IQ-score higher in the treatment group than in the control group?

from Kruschke (2015, Chapter 16)

Case 1: IQ-score higher than 100 in treatment group?

Bayesian GLM:

only coefficient is intercept \(\beta_0\), with prior knowledge encoded in priors:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, 15) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \end{align*} \]

model{
  sigma_e <- 1/sqrt(tau_e)   # convert precision to standard deviation
  tau_e ~ dgamma(0.1,0.1)
  b0 ~ dnorm(100, 1/15^2)    # prior: IQ-scores are normed to mean 100, sd 15 (dnorm takes a precision)
  for (i in 1:k){
    yPred[i] <- b0           # intercept-only model
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}

posterior inference: results

## # A tibble: 1 x 4
##   Parameter   HDI_lo     mean   HDI_hi
##      <fctr>    <dbl>    <dbl>    <dbl>
## 1        b0 96.11298 102.8397 109.3491
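A table like this can be computed from MCMC output with coda; a sketch assuming samples is a coda::mcmc object containing a column b0:

coda::HPDinterval(samples, prob = 0.95)   # HDI_lo / HDI_hi
colMeans(as.matrix(samples))              # posterior means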

Case 2: difference between groups

Bayesian GLM:

  • predictor variable for group x = ifelse(Group == "Control", 0, 1) (dummy coding)
  • coefficients are the intercept \(\beta_0\) and the group difference \(\beta_1\), with prior knowledge encoded in priors:

\[ \begin{align*} y_{\text{pred}} & = \beta_0 + \beta_1 x & \ \ \ \ \ \ \ \ \ \ \ \ \ \ y & \sim \mathcal{N}(\mu = y_{\text{pred}}, \sigma_{err}) \\ \beta_0 & \sim \mathcal{N}(100, 15) & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \frac{1}{\sigma_{err}^2} & \sim \text{Gamma}(0.1,0.1) \\ \beta_1 & \sim \mathcal{N}(0, 30) \end{align*} \]

model{
  sigma_e <- 1/sqrt(tau_e)         # convert precision to standard deviation
  tau_e ~ dgamma(0.1,0.1)
  b0 ~ dnorm(100, 1/15^2)          # prior sd 15 (dnorm takes a precision)
  b1 ~ dnorm(0, 1/30^2)            # prior sd 30
  for (i in 1:k){
    yPred[i] <- b0 + b1 * x[i]     # group difference enters via dummy-coded x
    y[i] ~ dnorm(yPred[i], tau_e)
  }
}

posterior inference: results

## # A tibble: 2 x 4
##   Parameter     HDI_lo       mean    HDI_hi
##      <fctr>      <dbl>      <dbl>     <dbl>
## 1        b0 94.4408927 100.098591 105.73580
## 2        b1 -0.3854978   7.678485  15.38381

GLM link functions

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

common link & likelihood function

logistic regression for binomial data

 

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu &= \text{logistic}(\eta, \theta = 0, \gamma = 1) = (1 + \exp(-\eta))^{-1} \\ y & \sim \text{Bernoulli}(\mu) \end{align*} \]

logistic function

\[\text{logistic}(\eta, \theta, \gamma) = \frac{1}{(1 + \exp(-\gamma (\eta - \theta)))}\]

[figure: logistic function for varying threshold \(\theta\)]

[figure: logistic function for varying gain \(\gamma\)]
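A minimal R sketch reproducing such curves directly from the definition above:

logistic <- function(eta, theta = 0, gamma = 1) {
  1 / (1 + exp(-gamma * (eta - theta)))
}
eta <- seq(-5, 5, length.out = 201)
plot(eta, logistic(eta), type = "l", ylab = "logistic(eta)")
lines(eta, logistic(eta, theta = 2), lty = 2)   # shifted threshold theta = 2
lines(eta, logistic(eta, gamma = 3), lty = 3)   # steeper gain gamma = 3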

multi-logit regression for multinomial data

 

  • each datum \(y_i \in \set{1, \dots, k}\), unordered
  • one linear predictor \(\eta_j\) for each category \(j \in \set{1, \dots, k}\)

 

\[ \begin{align*} \eta_j & = X_i \cdot \beta_j \\ \mu_j & \propto \exp(\eta_j) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
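A one-observation sketch of the softmax normalization in R (all numbers made up):

eta <- c(1.2, 0.3, -0.5)            # linear predictors, one per category
mu  <- exp(eta) / sum(exp(eta))     # normalize so that mu sums to 1
sample(1:3, size = 1, prob = mu)    # one draw from Categorical(mu)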

ordinal (probit) regression

  • each datum \(y_i \in \set{1, \dots, k}\), ordered
  • \(k+1\) latent threshold parameters: \(-\infty = \theta_0 < \theta_1 < \dots < \theta_k = \infty\)

\[ \begin{align*} \eta & = X_i \cdot \beta \\ \mu_j & = \Phi(\theta_{j} - \eta) - \Phi(\theta_{j-1} - \eta) \\ y & \sim \text{Categorical}(\mu) \end{align*} \]
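A quick R sketch of these category probabilities (k = 4; threshold and predictor values made up):

theta <- c(-Inf, -1, 0.5, 2, Inf)   # theta_0 .. theta_k with the ends fixed
eta   <- 0.8
mu    <- pnorm(theta[-1] - eta) - pnorm(theta[-5] - eta)   # one mu_j per category
sum(mu)                             # the mu_j sum to 1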

threshold-Phi model

[figure: threshPhi — threshold-Phi model]

summary

generalized linear model

 

terminology

  • \(y\) predicted variable, data, observation, …
  • \(X\) predictor variables for \(y\), explanatory variables, …

 

blueprint of a GLM

\[ \begin{align*} \eta & = \text{linear_combination}(X) \\ \mu & = \text{link_fun}( \ \eta, \theta_{\text{LF}} \ ) \\ y & \sim \text{lh_fun}( \ \mu, \ \theta_{\text{LH}} \ ) \end{align*} \]

[figure: GLM scheme]

outlook

 

Tuesday

  • packages & tools for Bayesian regression
  • multi-level / hierarchical / mixed effects regression
  • WAIC & LOO

 

Friday

  • rounding off
  • projects