Bayesian data analysis & cognitive modeling

\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

introduction

3 pillars of BDA (recap)

parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]

model comparison

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

prediction

prior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]

posterior predictive

\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]
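
A worked instance may make the two integrals concrete. As a sketch, assume a binomial likelihood with \(N\) flips and a flat \(\text{Beta}(1,1)\) prior on \(\theta\) (both chosen here for illustration; \(\text{B}(\cdot,\cdot)\) denotes the Beta function). The prior predictive for \(k\) successes is then uniform over all possible outcomes:

\[ P(k) = \int_0^1 \binom{N}{k} \theta^{k} \, (1-\theta)^{N-k} \ \text{d}\theta = \binom{N}{k} \, \text{B}(k+1, N-k+1) = \frac{1}{N+1} \]

Under the same assumptions, after observing \(k'\) successes in \(N'\) flips the posterior is \(\text{Beta}(k'+1, N'-k'+1)\), and the posterior predictive probability that one further flip is a success equals the posterior mean:

\[ P(\text{success} \mid D') = \int_0^1 \theta \ P(\theta \mid D') \ \text{d}\theta = \frac{k'+1}{N'+2} \]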

why model predictions?

model-based beliefs about what is likely to happen
- practical decision making
model comparison
- information criteria
- Bayes factor
model criticism
- is the model any good?
- given belief in the model, should we be shocked by this new observation?
  [ \(p\)-values ]

overview

	estimation	comparison	criticism
goal	which \(\theta\), given \(M\) & \(D\)?	which better: \(M_0\) or \(M_1\)?	\(M\) good model of \(D\)?
method	Bayes rule	Bayes factor	\(p\)-value
no. of models	1	2	1
\(H_0\)	subset of \(\theta\)	\(P(\theta \mid M_0), P(D \mid \theta, M_0)\)	\(P(\theta), P(D \mid \theta)\)
\(H_1\)	—	\(P(\theta \mid M_1), P(D \mid \theta, M_1)\)	—
prerequisites	\(P(\theta), \alpha \times P(D \mid \theta)\)	—	test statistic
pros	lean, easy	intuitive, plausible, Ockham's razor	absolute
cons	vagueness in ROPE	prior dependence, computational load	sample space?

road map for today

NHST

inference, HDIs & ROPEs
nested model comparison
\(p\)-values

model criticism

posterior predictive checks
prior/posterior \(p\)-values

testing a null

up next

compare 3 methods for testing a null hypothesis:

\(p\)-values
parameter inference with ROPEs
nested model comparison

running example:

\(k=7\), \(N=24\) \(\rightarrow\) \(\theta = 0.5?\)

boring

definition \(p\)-value

in the general case, the \(p\)-value of observation \(x\) under null hypothesis \(H_0\), with sample space \(X\), sampling distribution \(P(\cdot \mid H_0) \in \Delta(X)\) and test statistic \(t \colon X \rightarrow \mathbb{R}\) is:

\[ p(x ; H_0, X, P(\cdot \mid H_0), t) = \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x}) \ge t(x) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]

intuitive slogan: probability of at least as extreme outcomes

for an exact test we get:

\[ p(x ; H_0, X, P(\cdot \mid H_0)) = \int_{\left\{ \tilde{x} \in X \ \mid \ P(\tilde{x} \mid H_0) \le P(x \mid H_0) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]

intuitive slogan: probability of at least as unlikely outcomes

notation: \(\Delta(X)\) – set of all probability measures over \(X\)

NHST by \(p\)-value

data: we flip a coin \(n=24\) times and observe \(k = 7\) successes
null hypothesis: \(\theta = 0.5\)
sampling distribution: binomial distribution

\[ B(k ; n = 24, \theta = 0.5) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]

binom.test(7,24)$p.value

## [1] 0.06391466
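
The exact-test definition (probability of at least as unlikely outcomes) can be checked by hand against `binom.test`. The following is a minimal sketch; the variable names and the small relative tolerance for floating-point ties are assumptions of this illustration.

```r
# Exact two-sided binomial p-value "by hand" for k = 7, n = 24, theta = 0.5
n <- 24
k <- 7
theta0 <- 0.5

# sampling distribution over the whole sample space {0, ..., n}
probs <- dbinom(0:n, size = n, prob = theta0)

# outcomes that are at least as unlikely as the observed one under H0
# (the factor 1 + 1e-7 guards against floating-point ties; an assumption of this sketch)
as_unlikely <- probs <= dbinom(k, size = n, prob = theta0) * (1 + 1e-7)

p_manual  <- sum(probs[as_unlikely])
p_builtin <- binom.test(k, n, p = theta0)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```

Both numbers should agree with the value printed above, since the null distribution is symmetric and the tail \(k \le 7\) is mirrored by \(k \ge 17\).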

posterior inference

observed: \(k = 7\) out of \(N = 24\) flips came up heads
goal: estimate \(P(\theta \mid D)\) & determine posterior 95% HDI
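
One way to obtain the posterior and its 95% HDI numerically, as a sketch: assume a flat \(\text{Beta}(1,1)\) prior (the prior used for \(M_1\) in this deck), so that the posterior is \(\text{Beta}(k+1, N-k+1)\). The helper `hdi_beta` is a hypothetical name introduced here; it searches for the narrowest interval with 95% posterior mass.

```r
# Posterior for k = 7 successes in N = 24 flips under an assumed flat Beta(1, 1) prior
k <- 7
N <- 24
a_post <- k + 1       # 8
b_post <- N - k + 1   # 18

# Hypothetical helper: narrowest interval containing `mass` of a Beta(a, b) distribution
hdi_beta <- function(a, b, mass = 0.95) {
  # width of the interval that leaves probability `lo_p` in the lower tail
  width <- function(lo_p) qbeta(lo_p + mass, a, b) - qbeta(lo_p, a, b)
  lo_p <- optimize(width, interval = c(0, 1 - mass))$minimum
  c(lower = qbeta(lo_p, a, b), upper = qbeta(lo_p + mass, a, b))
}

hdi <- hdi_beta(a_post, b_post)
print(round(hdi, 2))
```

The search works because a unimodal density has a single narrowest interval for a given mass, so minimizing the width over the lower-tail probability is sufficient.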

ROPEs and credible values

regions of practical equivalence

small regions \([\theta - \epsilon, \theta + \epsilon]\) around each \(\theta\)
- values (practically) indistinguishable from \(\theta\)

credible values

value \(\theta\) is rejectable if its ROPE lies entirely outside of posterior HDI
value \(\theta\) is believable if its ROPE lies entirely within posterior HDI

NHST by ROPE for our example

\(\theta = 0.5\) is rejectable for all ROPEs with ca. \(\epsilon \le 0.02\)
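
The decision rule can be sketched in a few lines. The function name `rope_decision`, the label "undecided" for the remaining case, and the use of the rounded HDI \([0.14; 0.48]\) are assumptions of this illustration.

```r
# Classify a parameter value by comparing its ROPE with a posterior HDI
rope_decision <- function(hdi, theta0, eps) {
  rope <- c(theta0 - eps, theta0 + eps)
  if (rope[1] > hdi[2] || rope[2] < hdi[1]) {
    "rejectable"   # ROPE lies entirely outside the HDI
  } else if (rope[1] >= hdi[1] && rope[2] <= hdi[2]) {
    "believable"   # ROPE lies entirely within the HDI
  } else {
    "undecided"    # ROPE and HDI overlap only partially
  }
}

hdi <- c(0.14, 0.48)  # rounded 95% HDI for k = 7, N = 24

narrow <- rope_decision(hdi, theta0 = 0.5, eps = 0.01)
wide   <- rope_decision(hdi, theta0 = 0.5, eps = 0.05)
inside <- rope_decision(hdi, theta0 = 0.3, eps = 0.05)
print(c(narrow = narrow, wide = wide, inside = inside))
```

With a narrow ROPE, \(\theta = 0.5\) is rejectable; once the ROPE is wide enough to reach into the HDI, the verdict is withheld.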

Bayes factors for NHST

\(M_0\): \(\theta = 0.5\) & \(k \sim \text{Binomial}(0.5, N)\)
\(M_1\): \(\theta \sim \text{Beta}(1,1)\) & \(k \sim \text{Binomial}(\theta, N)\)

straightforward

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{{{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & \approx 0.516 \end{align*} \]
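
The two marginal likelihoods can also be evaluated numerically. This is a minimal sketch using base R only; the variable names are chosen here for illustration. Under the flat prior the integral in the denominator has the closed form \(1/(N+1)\), which serves as a check.

```r
# Bayes factor BF(M0 > M1) for k = 7, N = 24
k <- 7
N <- 24

# marginal likelihood of M0: theta is fixed at 0.5
marg_M0 <- dbinom(k, N, 0.5)

# marginal likelihood of M1: average the likelihood over the flat Beta(1, 1) prior
marg_M1 <- integrate(function(theta) dbinom(k, N, theta), lower = 0, upper = 1)$value

bf_01 <- marg_M0 / marg_M1
print(c(marg_M0 = marg_M0, marg_M1 = marg_M1, BF_01 = bf_01))
```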

Savage-Dickey
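
For nested models, the Savage-Dickey density ratio expresses the Bayes factor as the posterior density over the prior density at the null value, both taken under the encompassing model \(M_1\). A worked instance for the running example, assuming the \(\text{Beta}(1,1)\) prior of \(M_1\) so that the posterior is \(\text{Beta}(8, 18)\):

\[ \text{BF}(M_0 > M_1) = \frac{P(\theta = 0.5 \mid D, M_1)}{P(\theta = 0.5 \mid M_1)} = \frac{\text{Beta}(0.5 \mid 8, 18)}{\text{Beta}(0.5 \mid 1, 1)} \approx \frac{0.516}{1} = 0.516 \]

In R the numerator is `dbeta(0.5, 8, 18)`; the result agrees with the direct computation of the marginal likelihoods.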

summary

method	result	interpretation
\(p\)-value	\(p \approx 0.064\)	do not reject \(H_0\)
HDI+ROPE	\(\text{HDI} \approx [0.14;0.48]\)	do not adopt \(H_0\) (depends on \(\epsilon\))
Bayes Factor	\(\text{BF}(M_0 > M_1) \approx 0.516\)	mini-evidence in favor of \(H_1\)

comparison

Jeffreys-Lindley paradox

"paradox": two established methods give contradictory results

k = 49581
N = 98451

\(p\)-value NHST

binom.test(k, N)$p.value

## [1] 0.02364686

reject \(H_0\)

Savage-Dickey BF

dbeta(0.5, k+1, N - k + 1)

## [1] 19.21139

strong evidence in favor of \(H_0\)

simulation

let the true bias be \(\theta = 0.5\)
generate all possible outcomes \(k\) keeping \(N\) fixed
- \(N \in \{ 10, 100, 1000, 10000, 100000 \}\)
- true frequency of \(k\) is \(\text{Binomial}(k \mid N, \theta = 0.5)\)
look at the frequency of test results, coded thus:

	estimation	comparison	criticism
\(M_0\)	\([.5-\epsilon, 0.5+\epsilon] \sqsubseteq\) 95% HDI or v.v.	BF(\(M_0\)>\(M_1\)) > 6	\(p\) > 0.05
\(M_1\)	\([.5-\epsilon, 0.5+\epsilon] \, \cap \,\) 95% HDI \(=\emptyset\)	BF(\(M_1\)>\(M_0\)) > 6	\(p\) <= 0.05
??	otherwise	otherwise	never

(more on this here)
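
The enumeration described above can be sketched as follows. This is an illustration under stated assumptions, not the original simulation code: it covers only the Bayes factor and \(p\)-value columns, uses the flat \(\text{Beta}(1,1)\) prior for \(M_1\) (so that \(P(k \mid M_1) = 1/(N+1)\)), and restricts \(N\) to small values to keep the run time short. The function name `simulate_tests` is hypothetical.

```r
# Frequency of test results when the true bias is theta = 0.5, for fixed N
simulate_tests <- function(N, theta_true = 0.5, bf_threshold = 6, alpha = 0.05) {
  k <- 0:N
  freq <- dbinom(k, N, theta_true)   # true frequency of each outcome k
  null_prob <- dbinom(k, N, 0.5)     # P(k | M0)

  # Bayes factor in favor of M0; P(k | M1) = 1 / (N + 1) under the flat prior
  bf_01 <- null_prob * (N + 1)
  bf_result <- ifelse(bf_01 > bf_threshold, "M0",
                      ifelse(1 / bf_01 > bf_threshold, "M1", "??"))

  # exact two-sided p-value for every k: mass of outcomes at least as unlikely
  p_val <- sapply(null_prob, function(pk) sum(null_prob[null_prob <= pk * (1 + 1e-7)]))
  p_result <- ifelse(p_val > alpha, "M0", "M1")

  c(BF_M0 = sum(freq[bf_result == "M0"]),
    BF_M1 = sum(freq[bf_result == "M1"]),
    BF_undecided = sum(freq[bf_result == "??"]),
    p_M0 = sum(freq[p_result == "M0"]),
    p_M1 = sum(freq[p_result == "M1"]))
}

results <- sapply(c(10, 100, 1000), simulate_tests)
colnames(results) <- c("N=10", "N=100", "N=1000")
print(round(results, 3))
```

Each column gives the probability, under the true \(\theta = 0.5\), that a method selects \(M_0\), selects \(M_1\), or withholds judgement.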

results

BF selects \(H_0\) correctly with prob. 0.986 for \(N = 10000\), and with 0.996 for \(N = 100000\).

[cf. Lindley's solution to the 'paradox': adjust the significance threshold for \(p\) depending on \(N\); similar for ROPE's \(\epsilon\)]

model criticism

motivation

parameter estimation: what \(\theta\) to believe in?
model comparison: which model is better than another?
model criticism: is a given model plausible (enough)?

posterior predictive checks

graphically compare simulated observations with actual observation

Bayesian predictive \(p\)-values

measure surprise level of data under a model

[think: \(p\)-value for a non-trivial, serious model with potential uncertainty about parameters]

posterior predictive checks

exponential forgetting model

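y = c(.94, .77, .40, .26, .24, .16)   # observed proportions
t = c(  1,   3,   6,   9,  12,  18)   # time points of the observations
obs = y*100                           # counts out of 100 (binomial n in the model below)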

model{
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  for (i in 1: 6){
    p[i] = min(max( a*exp(-t[i]*b), 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)    # condition on data
    obsRep[i] ~ dbinom(p[i], 100) # replicate fake data
  }
}
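
For readers without a JAGS installation, the replication step can be approximated in base R. The following is a sketch under stated assumptions: the posterior over \((a, b)\) is approximated on a regular grid instead of by MCMC, the names `p_exp`, `t_obs`, `grid` and `obs_rep` are introduced here, and the intervals are equal-tailed 95% quantile intervals rather than HDIs.

```r
y <- c(.94, .77, .40, .26, .24, .16)
t_obs <- c(1, 3, 6, 9, 12, 18)
obs <- round(y * 100)   # rounded to guard against floating-point noise

# success probability of the exponential model, clipped as in the JAGS code
p_exp <- function(a, b) pmin(pmax(a * exp(-t_obs * b), 0.0001), 0.9999)

# uniform priors on (0, 1.5) for a and b: every grid point gets equal prior weight
grid <- expand.grid(a = seq(0.0025, 1.4975, by = 0.005),
                    b = seq(0.0025, 1.4975, by = 0.005))
log_lik <- mapply(function(a, b) sum(dbinom(obs, 100, p_exp(a, b), log = TRUE)),
                  grid$a, grid$b)
post <- exp(log_lik - max(log_lik))
post <- post / sum(post)   # normalized grid posterior

# draw parameter pairs from the grid posterior and replicate fake data for each
set.seed(123)
draws <- sample(nrow(grid), size = 2000, replace = TRUE, prob = post)
obs_rep <- t(sapply(draws, function(i) {
  rbinom(length(obs), 100, p_exp(grid$a[i], grid$b[i]))
}))

ppc <- rbind(observed = obs,
             rep_mean = colMeans(obs_rep),
             rep_lo   = apply(obs_rep, 2, quantile, probs = 0.025),
             rep_hi   = apply(obs_rep, 2, quantile, probs = 0.975))
print(round(ppc, 1))
```

Comparing the `observed` row with the replicated intervals gives a numerical counterpart to the graphical check.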

PPC: exponential model

black dots: data
blue dots: mean of replicated fake data
blue bars: 95% HDIs of replicated fake data

PPC: power model

black dots: data
blue dots: mean of replicated fake data
blue bars: 95% HDIs of replicated fake data

Bayesian predictive model criticism

fix a data set \(x\) from a set of possible observations \(X\)
fix a model \(M\) with \(P(\theta)\) and \(P(X = x \mid \theta)\)
- NB: \(P(\theta)\) can be conditioned on data: prior/posterior predictive \(p\)-value
fix a test statistic \(t \colon X \times \Theta \rightarrow \mathbb{R}\), where \(\Theta\) is the parameter space
- test statistic may depend on parameters
Bayesian predictive \(p\)-value:

\[ p(x ; X, M = \tuple{P(\theta), P(x\mid\theta)}, t) = \int \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x},\theta) \ge t(x, \theta) \right\}} P(\tilde{x} \mid \theta) \ \text{d}\tilde{x} \ P(\theta) \ \text{d}\theta\]

compare to standard \(p\)-value definition:

\[ p(x ; H_0, X, P(\cdot \mid H_0), t) = \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x}) \ge t(x) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]

example

obs = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k = sum(obs) # 7
N = length(obs) #20

\(p\)-value NHST:

do not reject NH \(\theta = 0.5\)

binom.test(k, N, 0.5)$p.value

## [1] 0.263176

Bayesian posterior predictive \(p\)-value

test statistic: number of switches between 0 and 1 in the sequence
\(t(d^*)\) = 3
PP-\(p\)-value \(\approx 0.028\)


Gelman et al. 2014, p.147–8
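
The posterior predictive \(p\)-value for this example can be approximated by simulation. This is a sketch under stated assumptions, not the original code: the model is i.i.d. Bernoulli with a flat \(\text{Beta}(1,1)\) prior, "extreme" is taken to mean at most as many switches as observed (too few switches being the surprising direction here), and the names `n_switches`, `t_rep` and `pp_p` are introduced for illustration.

```r
obs <- c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k <- sum(obs)      # 7
N <- length(obs)   # 20

# test statistic: number of switches between 0 and 1 in a sequence
n_switches <- function(x) sum(diff(x) != 0)
t_observed <- n_switches(obs)

set.seed(2024)
n_sim <- 10000

# posterior draws for theta under the assumed flat prior: Beta(k + 1, N - k + 1)
theta <- rbeta(n_sim, k + 1, N - k + 1)

# one replicated sequence per posterior draw, reduced to its test statistic
t_rep <- sapply(theta, function(th) n_switches(rbinom(N, 1, th)))

# proportion of replicated sequences with at most as many switches as observed
pp_p <- mean(t_rep <= t_observed)
print(c(t_observed = t_observed, pp_p = pp_p))
```

A small value indicates that the observed sequence switches less often than the independence assumption of the model would predict, even though the overall proportion of successes is unremarkable.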

summary

	estimation	comparison	criticism
goal	which \(\theta\), given \(M\) & \(D\)?	which better: \(M_0\) or \(M_1\)?	\(M\) good model of \(D\)?
method	Bayes rule	Bayes factor	\(p\)-value
no. of models	1	2	1
\(H_0\)	subset of \(\theta\)	\(P(\theta \mid M_0), P(D \mid \theta, M_0)\)	\(P(\theta), P(D \mid \theta)\)
\(H_1\)	—	\(P(\theta \mid M_1), P(D \mid \theta, M_1)\)	—
prerequisites	\(P(\theta), \alpha \times P(D \mid \theta)\)	—	test statistic
pros	lean, easy	intuitive, plausible, Ockham's razor	absolute
cons	vagueness in ROPE	prior dependence, computational load	sample space?

outlook

Friday

bootcampling GCM (Lee & Wagenmakers ch. 17)

Tuesday

introduction to Stan