Hierarchical modeling

Michael Franke

\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

a hierarchical coin flip model

recap: inferring a coin’s bias with Stan

Simple model

data { // declare data variables
  int<lower=0> N ;
  int<lower=0,upper=N> k ;
}
parameters { // declare parameter variables
  real<lower=0,upper=1> theta ;
} 
model { // prior and likelihood
  theta ~ beta(1,1) ;
  k ~ binomial(N, theta) ;
}

 

  • we modelers have uncertainty about \(\theta\)
  • a priori, each value of \(\theta\) is equally likely
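
To actually draw posterior samples for this model from R, a minimal sketch (the file name single_coin.stan and the data values k = 7, N = 24 are made-up assumptions for illustration):

library(rstan)

# hypothetical observations: 7 successes in 24 flips
coin_data <- list(N = 24, k = 7)

# compile and sample; the file name is an assumption, any name works
stan_fit_simple <- stan(file = "single_coin.stan", data = coin_data,
                        iter = 2000, chains = 4)

# posterior summary for the coin's bias
print(stan_fit_simple, pars = "theta")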

model as Bayesian network

model_graph

a hierarchical extension

Hierarchical model

data {
  int<lower=0> N ;
  int<lower=0,upper=N> k ;
}
parameters {
  real<lower=0> a ;
  real<lower=0> b ;
  real<lower=0,upper=1> theta ;
} 
model {
  a ~ gamma(1,1) ;
  b ~ gamma(1,1) ;
  theta ~ beta(a,b) ;
  k ~ binomial(N, theta) ;
}
  • we modelers have uncertainty about \(\theta\)
  • we also have higher-order uncertainty about the prior \(P(\theta)\) itself, i.e., about how likely each value of \(\theta\) is
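
What do these hyperpriors imply for \(\theta\) before seeing any data? A quick prior-predictive sketch in R, simply forward-sampling the priors defined above:

# forward-sample the hierarchical prior: a, b ~ Gamma(1,1), theta ~ Beta(a,b)
set.seed(42)
n_sim <- 1e5
a     <- rgamma(n_sim, shape = 1, rate = 1)
b     <- rgamma(n_sim, shape = 1, rate = 1)
theta <- rbeta(n_sim, a, b)

# look at the implied (marginal) prior on theta
hist(theta, breaks = 50, main = "implied prior on theta", xlab = "theta")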

model as Bayesian network

model_graph

results

data {
  int<lower=0> N ; int<lower=0,upper=N> k ;
}
parameters {
  real<lower=0> a ;
  real<lower=0> b ;
  real<lower=0,upper=1> theta ;
} 
model {
  a ~ gamma(1,1) ;
  b ~ gamma(1,1) ;
  theta ~ beta(a,b) ;
  k ~ binomial(N, theta) ;
}
generated quantities {
  real<lower=0> a_prior ;
  real<lower=0> b_prior ;
  real<lower=0,upper=1> theta_prior ;
  a_prior = gamma_rng(1,1) ;
  b_prior = gamma_rng(1,1) ;
  theta_prior = beta_rng(a_prior,b_prior) ;
}
plot(stan_fit, plotfun = 'dens', ncol = 1,
     pars = c("a", "b", "theta"))

hierarchical models

  • latent variables can depend on other latent variables
  • captures modeler’s higher-order uncertainty and important structural relations between variables
  • likelihood can be a (direct) function of only some latent variables

\[ \begin{align*} P(\theta, a, b \, | \, D) & \propto P(\theta, a, b) \ P(D \, | \, \theta, a, b)\\ & = P(a) \ P(b) \ P(\theta \mid a, b) \ P(D \, | \, \theta) \end{align*} \]


probMod

deterministic dependencies

probMod

\[ \begin{align*} P(\theta, s \, | \, D) & \propto P(\theta, s) \ P(D \, | \, \theta, s)\\ & = P(s) \ \underbrace{P(\theta \mid s)}_{\text{degenerate!}} \ P(D \, | \, \theta) \end{align*} \]

 

CAVEAT: this is tricky to implement in Stan (why?)
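
The particular graph in the figure is not reproduced here, but the general pattern can be illustrated by forward sampling in R (the normal prior and the logistic link below are made-up choices, purely for illustration): \(s\) is the only stochastic latent variable, and \(\theta\) is a deterministic function of it.

# minimal sketch of a deterministic dependency (all distributional choices hypothetical)
set.seed(123)
n_sim <- 10000
s     <- rnorm(n_sim, 0, 1)     # stochastic node: s ~ Normal(0,1) (assumed)
theta <- plogis(s)              # deterministic node: theta = logistic(s)
k     <- rbinom(n_sim, size = 24, prob = theta)   # likelihood depends on theta only

In Stan, such a deterministic node would typically be written in the transformed parameters block, with a prior placed on \(s\) only.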

graph conventions

probMod

 

  • type of variable:
    • continuous: circle
    • discrete: square
  • information status:
    • observed: grey shade
    • latent: white
  • dependency:
    • stochastic: single line
    • deterministic: double line
  • indices & boxes:
    • vectors, matrices, arrays

Lee & Wagenmakers (2013)

priors in hierarchical models

  • priors are complex objects
  • should make sense at every level in the hierarchy:
    • check dependent priors of intermediate latent variables!
  • can have interesting effects
    • e.g., shrinkage (later in this class)

revisit previous example

// (...)
model {
  a ~ gamma(1,1) ;
  b ~ gamma(1,1) ;
  theta ~ beta(a,b) ;
  k ~ binomial(N, theta) ;
}
generated quantities {
  // (...)
  a_prior = gamma_rng(1,1) ;
  b_prior = gamma_rng(1,1) ;
  theta_prior = beta_rng(a_prior,b_prior) ;
}

probMod

plot(stan_fit, plotfun = 'dens', ncol = 1,
     pars = c("a_prior", "b_prior", "theta_prior"))

reparameterization

  • \(\text{Beta}(a,b)\) has mean \(\mu\) and size \(v\): \[ \begin{align*} \mu & = \frac{a}{a+b} & v &= a + b \end{align*} \]
  • equivalently: \[ \begin{align*} a & = \mu \ v & b & = (1-\mu) v \end{align*} \]
  • using \(\mu\) and \(v\) is more intuitive / convenient
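
A quick sanity check of this reparameterization in R (the values of \(\mu\) and \(v\) are arbitrary):

# check: Beta(a, b) with a = mu * v and b = (1 - mu) * v has mean mu
mu <- 0.3    # arbitrary example values
v  <- 20
a  <- mu * v
b  <- (1 - mu) * v

mean(rbeta(1e6, a, b))   # approximately 0.3
a / (a + b)              # exactly mu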

parameters {
  real<lower=0,upper=1> mu ; 
  real<lower=1,upper=100> v ;
  real<lower=0,upper=1> theta ;
} 
transformed parameters {
  real<lower=0> a ; real<lower=0> b ;
  a = mu*v ; b = (1-mu)*v ;
}
model {
  mu ~ beta(1,1) ; theta ~ beta(a,b) ;
  k ~ binomial(N,theta) ;
}
generated quantities {
  real<lower=0> mu_prior ;
  real<lower=1,upper=100> v_prior ;
  real<lower=0> a_prior; real<lower=0> b_prior;
  real<lower=0,upper=1> theta_prior;
  mu_prior = beta_rng(1,1) ;
  v_prior = uniform_rng(1,100);
  a_prior = mu_prior * v_prior;
  b_prior = (1- mu_prior) * v_prior ;
  theta_prior = beta_rng(a_prior,b_prior);
}

puzzle: where is the prior for \(v\)?

results

plot(stan_fit, 
 plotfun="dens", 
 nrow=2)

latent mixture models

latent mixtures

  • hierarchical models with latent parameters:
    • individual-level differences
    • item-level differences

probMod

think: mixed effects regression

therapeutic touchers

therapeuticTouches

 

  • TT practitioners claim to sense energy fields
  • in the test, they have to sense which of their hands is near another person's body

(Kruschke, chapter 9.2.4)

TT data

TTdata = readr::read_csv('data/04_therapeutic_touchers.csv') %>% 
         group_by(s) %>% 
         summarize(k = sum(y), N = length(y))
# k - number of successes per subject
# N - number of trials per subject
head(TTdata)
## # A tibble: 6 x 3
##   s         k     N
##   <chr> <int> <int>
## 1 S01       1    10
## 2 S02       2    10
## 3 S03       3    10
## 4 S04       3    10
## 5 S05       3    10
## 6 S06       3    10
ggplot(TTdata, aes(x  = s, y = k/N)) + 
  geom_bar(stat='identity')

latent mixture model

data {
  int<lower=0> N[28] ;
  int<lower=0> k[28] ;
}
parameters {
  real<lower=0,upper=1> mu ;
  real<lower=1,upper=100> v ;
  real<lower=0,upper=1> theta[28] ;
} 
transformed parameters {
  real<lower=0> a;
  real<lower=0> b;
  a = mu * v;
  b = (1- mu) * v ;
}
model {
  mu ~ beta(1,1) ;
  theta ~ beta(a,b) ;
  k ~ binomial(N,theta) ;
}

probMod

posterior on subject-level guessing parameters

samples = ggs(stan_fit_TT) %>%  filter(! Parameter %in% c("mu", "v", "a", "b"))
samples$subject = with(samples,factor(regmatches(Parameter, regexpr("[0-9]+", Parameter)), levels = 1:28)) 
ggplot(samples, aes(x = subject, y = value)) + geom_boxplot() + ylab("bias level (posterior)")

shrinkage

what’s going on?

shrinkage vs. population-level generalization

  • careful: “shrinkage” is a misleading term!
    • shrinkage of variance in low-level parameters (compared to non-hierarchical)
    • variance can also increase, depending on the particular model
  • hierarchical modeling can “bind together” low-level parameters
    • this is an additional assumption (can be good or bad; true or false)
  • hierarchical “binding” reduces “freedom” of free variables
    • can guard us against overfitting
  • hierarchical group-level binding gives us population-level estimates
    • generalize to new observations

discrete latent parameters

discrete latent parameters

problem

  • Stan's sampler (HMC / NUTS) needs gradients of the log-density
  • it therefore cannot sample discrete latent parameters directly

 

(sometimes) practical solution

  • marginalize out

probMod

exam-scores example

15 people answered 40 exam questions. Here are their success rates:

  • assume that some guessed randomly, others gave informed answers
  • who is a guesser?
  • what is the probability of answering correctly for non-guessers?
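
The actual scores are in the figure above; a hypothetical simulation of this kind of data set, with all parameter values made up, looks as follows:

# simulate exam-score-style data: some guessers, some informed responders
set.seed(2018)
p   <- 15          # people
n   <- 40          # questions
z   <- rbinom(p, 1, 0.5)     # 1 = informed, 0 = guesser (made up)
phi <- 0.85                  # success probability of informed responders (made up)
k   <- rbinom(p, n, ifelse(z == 1, phi, 0.5))
round(k / n, 2)    # observed success rates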

latent discrete mixture model

modelGraph

Lee & Wagenmakers, Chapter 6.1

latent discrete mixture model

modelGraph

data { 
  int<lower=1> p;  int<lower=0> k[p];   int<lower=1> n;
}
transformed data {
  real psi = 0.5;                  // fixed guessing probability
}
parameters {
  real<lower=0.5,upper=1> phi;     // success probability of informed responders
} 
transformed parameters {
  vector[2] log_alpha[p];
  for (i in 1:p) {
    log_alpha[i,1] = log(0.5) + binomial_lpmf(k[i] | n, phi);   // z = 1: informed
    log_alpha[i,2] = log(0.5) + binomial_lpmf(k[i] | n, psi);   // z = 0: guesser
  }
}
model {
  for (i in 1:p)
    target += log_sum_exp(log_alpha[i]);   // marginalize over z
}
generated quantities {
  int<lower=0,upper=1> z[p];
  for (i in 1:p) {
    z[i] = bernoulli_rng(softmax(log_alpha[i])[1]);   // posterior draw: 1 = informed
  }
}
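
Once this model is fitted (into, say, stan_fit_exam; the name is hypothetical), the posterior probability that each person answered informedly rather than guessed can be read off the generated z draws:

# posterior Pr(z = 1), i.e., probability of being an informed responder, per person
z_samples <- rstan::extract(stan_fit_exam)$z   # iterations x persons matrix
colMeans(z_samples)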

log-scores & marginalizing out

log score

  • \(\log P(D, \theta) = \log \left( P(\theta) \ P(D \mid \theta) \right ) = \log P(\theta) + \log P(D \mid \theta)\)
  • log-score for one participant in our example:

\[\log P(k_i, n, z_i, \phi) = \log \text{Bern}(z_i \mid 0.5) + \log \text{Binom}(k_i, n, \text{ifelse}(z_i=1, \phi, 0.5))\]

marginalizing out

\[ \begin{align*} P(k_i, n, \phi) & = \sum_{z_i \in \set{0,1}} \text{Bern}(z_i \mid 0.5) \ \text{Binom}(k_i, n, \text{ifelse}(z_i=1, \phi, 0.5)) \\ & = \text{Bern}(0 \mid 0.5) \ \text{Binom}(k_i, n, 0.5) + \\ & \ \ \ \ \ \ \ \ \ \ \text{Bern}(1 \mid 0.5) \ \text{Binom}(k_i, n, \phi) \\ & = \underbrace{0.5 \ \text{Binom}(k_i, n, 0.5)}_{\alpha_0} + \underbrace{0.5 \ \text{Binom}(k_i, n, \phi)}_{\alpha_1} \\ \log P(k_i, n, \phi) & = \log \sum_{j=0}^1 \exp \log \alpha_j \end{align*}\]
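
A quick numeric check in R that the log-sum-exp route recovers the same marginal likelihood (the values of \(k_i\), \(n\), and \(\phi\) are arbitrary; log_sum_exp is a hand-rolled helper mirroring Stan's built-in):

k_i <- 30 ; n <- 40 ; phi <- 0.8    # arbitrary example values

# direct marginal likelihood: sum over z in {0, 1}
direct <- 0.5 * dbinom(k_i, n, 0.5) + 0.5 * dbinom(k_i, n, phi)

# log-sum-exp route, as in the Stan program
log_alpha <- c(log(0.5) + dbinom(k_i, n, 0.5, log = TRUE),
               log(0.5) + dbinom(k_i, n, phi, log = TRUE))
log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))

all.equal(log(direct), log_sum_exp(log_alpha))   # TRUE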

Stan model: results

posterior

data & model

modelGraph

fini

outlook

 

Thursday

  • latent mixture models in Stan based on Lee & Wagenmakers (2013), ch. 6
    • relevant Stan code is provided here

in 2019

  • model comparison
  • model criticism
  • generalized mixed effects regression