3 pillars of BDA (recap)

parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]


model comparison

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]



prior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]


posterior predictive

\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]

why model predictions?


  1. model-based beliefs about what is likely to happen
    • practical decision making
  2. model comparison
    • information criteria
    • Bayes factor
  3. model criticism
    • is the model any good?
    • given belief in the model, should we be shocked by this new observation?
      [ \(p\)-values ]


|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       |  | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) |  |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) |  | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

road map for today



  • inference, HDIs & ROPEs
  • nested model comparison
  • \(p\)-values

model criticism

  • posterior predictive checks
  • prior/posterior \(p\)-values

testing a null

up next


compare 3 methods for testing a null hypothesis:

  1. \(p\)-values
  2. parameter inference with ROPEs
  3. nested model comparison


running example:

\(k=7\), \(N=24\) \(\rightarrow\) \(\theta = 0.5?\)


definition \(p\)-value


in the general case, the \(p\)-value of observation \(x\) under null hypothesis \(H_0\), with sample space \(X\), sampling distribution \(P(\cdot \mid H_0) \in \Delta(X)\) and test statistic \(t \colon X \rightarrow \mathbb{R}\) is:

\[ p(x ; H_0, X, P(\cdot \mid H_0), t) = \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x}) \ge t(x) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]

intuitive slogan: probability of at least as extreme outcomes


for an exact test we get:

\[ p(x ; H_0, X, P(\cdot \mid H_0)) = \int_{\left\{ \tilde{x} \in X \ \mid \ P(\tilde{x} \mid H_0) \le P(x \mid H_0) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]

intuitive slogan: probability of at least as unlikely outcomes

notation: \(\Delta(X)\) – set of all probability measures over \(X\)

NHST by \(p\)-value

  • data: we flip a coin \(n=24\) times and observe \(k = 7\) successes
  • null hypothesis: \(\theta = 0.5\)
  • sampling distribution: binomial distribution

\[ B(k ; n = 24, \theta = 0.5) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]


## [1] 0.06391466
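The value printed above can be reproduced in two ways. The following is a minimal R sketch, not necessarily the call that produced the slide output: it first uses the built-in `binom.test`, then applies the exact-test definition directly by summing the probabilities of all outcomes that are no more likely than the observed one. The variable names and the small tolerance for floating-point ties are assumptions of this sketch.

```r
k <- 7
n <- 24
theta0 <- 0.5

# built-in exact binomial test (two-sided by default)
p_builtin <- binom.test(k, n, p = theta0)$p.value

# exact-test definition: add up all outcomes at most as likely as k
probs <- dbinom(0:n, size = n, prob = theta0)
tol <- 1 + 1e-7   # assumption: guard against floating-point ties
p_manual <- sum(probs[probs <= probs[k + 1] * tol])

print(c(builtin = p_builtin, manual = p_manual))
```

Both numbers should agree, since `binom.test` implements the "at least as unlikely outcomes" reading of the \(p\)-value.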

posterior inference

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: estimate \(P(\theta \mid D)\) & determine posterior 95% HDI
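A minimal sketch of this step in R, assuming a flat \(\text{Beta}(1,1)\) prior (the prior is not stated on this slide), so that the posterior is \(\text{Beta}(k+1, N-k+1)\). The helper `hdi_beta` is a hypothetical name; it finds the narrowest interval that holds 95% of the posterior mass, which coincides with the HDI for a unimodal posterior.

```r
k <- 7
N <- 24

# assumption: flat Beta(1, 1) prior, so the posterior is Beta(k + 1, N - k + 1)
a_post <- k + 1
b_post <- N - k + 1

# HDI: the narrowest interval holding the requested posterior mass
hdi_beta <- function(a, b, mass = 0.95) {
  width <- function(p_low) qbeta(p_low + mass, a, b) - qbeta(p_low, a, b)
  p_low <- optimize(width, interval = c(0, 1 - mass))$minimum
  c(lower = qbeta(p_low, a, b), upper = qbeta(p_low + mass, a, b))
}

hdi <- hdi_beta(a_post, b_post)
print(hdi)
```

Packages for Bayesian analysis offer ready-made HDI functions that work on posterior samples; the closed-form route is used here only because the posterior is a known Beta distribution.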

ROPEs and credible values


regions of practical equivalence

  • small regions \([\theta - \epsilon, \theta + \epsilon]\) around each \(\theta\)
    • values (practically) indistinguishable from \(\theta\)


credible values

  • value \(\theta\) is rejectable if its ROPE lies entirely outside of posterior HDI
  • value \(\theta\) is believable if its ROPE lies entirely within posterior HDI
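The two definitions translate into a small decision rule. The sketch below is illustrative only: `rope_verdict` is a hypothetical helper, the label "undecided" for the remaining case is an assumption of this sketch, and the interval `hdi_example` is made up to exercise all three branches.

```r
# classify a value theta given an HDI (vector of lower and upper bound)
rope_verdict <- function(hdi, theta, eps) {
  rope <- c(theta - eps, theta + eps)
  if (rope[1] > hdi[2] || rope[2] < hdi[1]) {
    "rejectable"   # ROPE lies entirely outside the HDI
  } else if (rope[1] >= hdi[1] && rope[2] <= hdi[2]) {
    "believable"   # ROPE lies entirely within the HDI
  } else {
    "undecided"    # ROPE and HDI overlap only partially
  }
}

hdi_example <- c(0.2, 0.4)   # hypothetical interval, for illustration only
print(rope_verdict(hdi_example, theta = 0.5, eps = 0.05))
print(rope_verdict(hdi_example, theta = 0.3, eps = 0.05))
print(rope_verdict(hdi_example, theta = 0.4, eps = 0.05))
```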


NHST by ROPE for our example

\(\theta = 0.5\) is rejectable for all ROPEs with ca. \(\epsilon \le 0.02\)

Bayes factors for NHST

  • \(M_0\): \(\theta = 0.5\) & \(k \sim \text{Binomial}(0.5, N)\)
  • \(M_1\): \(\theta \sim \text{Beta}(1,1)\) & \(k \sim \text{Binomial}(\theta, N)\)



\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{{{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & \approx 0.516 \end{align*} \]
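The ratio above can be checked numerically. This is a minimal R sketch with assumed variable names: the numerator is the binomial likelihood at \(\theta = 0.5\), and the denominator integrates the likelihood against the \(\text{Beta}(1,1)\) prior of \(M_1\).

```r
k <- 7
N <- 24

# marginal likelihood of M0: theta is fixed at 0.5
marg_lik_m0 <- dbinom(k, N, 0.5)

# marginal likelihood of M1: average the likelihood over the Beta(1, 1) prior
marg_lik_m1 <- integrate(
  function(theta) dbinom(k, N, theta) * dbeta(theta, 1, 1),
  lower = 0, upper = 1
)$value

bf_01 <- marg_lik_m0 / marg_lik_m1
print(bf_01)
```

Under the flat prior the integral has the closed form \(1/(N+1)\), so the Bayes factor reduces to \((N+1)\binom{N}{k}0.5^N\).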




Jeffreys-Lindley paradox

"paradox": two established methods give contradictory results


k = 49581
N = 98451

\(p\)-value NHST

binom.test(k, N)$p.value
## [1] 0.02364686

reject \(H_0\)

Savage-Dickey BF

dbeta(0.5, k+1, N - k + 1)
## [1] 19.21139

strong evidence in favor of \(H_0\)
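For nested models, the Savage-Dickey method computes the Bayes factor in favor of the point null as the posterior density at the null value divided by the prior density at that value, both under the encompassing model. The sketch below, with assumed variable names and the flat \(\text{Beta}(1,1)\) prior implied by the `dbeta` call above, compares this with the direct ratio of marginal likelihoods.

```r
k <- 49581
N <- 98451

# Savage-Dickey ratio under M1 with a Beta(1, 1) prior:
# posterior density at theta = 0.5 divided by prior density at theta = 0.5
bf_savage_dickey <- dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)

# direct ratio of marginal likelihoods, on the log scale for stability;
# under the flat prior the marginal likelihood of M1 is 1 / (N + 1)
log_bf_direct <- dbinom(k, N, 0.5, log = TRUE) - log(1 / (N + 1))
bf_direct <- exp(log_bf_direct)

print(c(savage_dickey = bf_savage_dickey, direct = bf_direct))
```

The prior density at 0.5 equals 1 for the flat prior, which is why the slide's call needs no explicit denominator.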


  • let the true bias be \(\theta = 0.5\)
  • generate all possible outcomes \(k\) keeping \(N\) fixed
    • \(N \in \{ 10, 100, 1000, 10000, 100000 \}\)
    • true frequency of \(k\) is \(Binomial(k \mid N, \theta = 0.5)\)
  • look at the frequency of test results, coded thus:


(more on this here)


BF selects \(H_0\) correctly with prob. 0.986 for \(N = 10000\), and with 0.996 for \(N = 100000\).
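The procedure described above can be sketched in R as follows. The decision coding used on the slides is not recoverable from the text, so the thresholds here are assumptions: a \(p\)-value test rejects \(H_0\) when \(p < 0.05\), and the Bayes factor "selects \(H_0\)" when \(\text{BF}(M_0 > M_1) > 3\). The resulting probabilities therefore need not match the figures quoted above.

```r
bf_threshold <- 3   # assumption: evidence level counted as selecting H0
alpha <- 0.05       # assumption: conventional significance level

test_frequencies <- function(N, theta_true = 0.5) {
  k <- 0:N
  freq <- dbinom(k, N, theta_true)   # true frequency of each outcome

  # exact two-sided p-value for H0: theta = 0.5 (symmetric null)
  p_val <- pmin(1, 2 * pbinom(pmin(k, N - k), N, 0.5))

  # Bayes factor against a flat Beta(1, 1) alternative: P(D | M1) = 1 / (N + 1)
  bf_01 <- dbinom(k, N, 0.5) * (N + 1)

  c(N = N,
    p_rejects_H0 = sum(freq[p_val < alpha]),
    bf_selects_H0 = sum(freq[bf_01 > bf_threshold]))
}

results <- t(sapply(c(10, 100, 1000, 10000, 100000), test_frequencies))
print(results)
```

Because every outcome \(k\) is enumerated and weighted by its true probability, the table contains exact frequencies rather than simulation estimates.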

[cf. Lindley's solution to the 'paradox': adjust \(p\) depending on \(N\); similarly for the ROPE's \(\epsilon\)]

model criticism


  • parameter estimation: what \(\theta\) to believe in?
  • model comparison: which model is better than another?
  • model criticism: is a given model plausible (enough)?


posterior predictive checks

graphically compare simulated observations with actual observation

Bayesian predictive \(p\)-values

measure surprise level of data under a model

[think: \(p\)-value for a non-trivial, serious model with potential uncertainty about parameters]

posterior predictive checks

exponential forgetting model

y = c(.94, .77, .40, .26, .24, .16)
t = c(  1,   3,   6,   9,  12,  18)
obs = y*100
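model{
  # priors on the scale (a) and decay rate (b)
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  for (i in 1: 6){
    # retention probability at time t[i], clipped away from 0 and 1
    p[i] = min(max( a*exp(-t[i]*b), 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)    # condition on data
    obsRep[i] ~ dbinom(p[i], 100) # replicate fake data
  }
}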

PPC: exponential model

  • black dots: data
  • blue dots: mean of replicated fake data
  • blue bars: 95% HDIs of replicated fake data
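The numbers behind such a plot can be approximated without an MCMC sampler. The sketch below is an assumption-laden stand-in for the sampler-based workflow: it uses a grid approximation of the posterior over \(a\) and \(b\), and equal-tailed 95% quantile intervals in place of HDIs. Grid resolution, number of draws, and all variable names other than `y`, `t` and `obs` are choices of this sketch.

```r
set.seed(2024)
y   <- c(.94, .77, .40, .26, .24, .16)
t   <- c(  1,   3,   6,   9,  12,  18)
obs <- round(y * 100)

# grid over the uniform prior support of a and b
grid <- expand.grid(a = seq(0.005, 1.495, by = 0.01),
                    b = seq(0.005, 1.495, by = 0.01))

# retention probabilities: one row per grid point, one column per time point
p <- grid$a * exp(-outer(grid$b, t))
p <- pmin(pmax(p, 0.0001), 0.9999)

# log-likelihood of the observed counts at every grid point
obs_mat <- matrix(obs, nrow = nrow(p), ncol = length(t), byrow = TRUE)
log_lik <- rowSums(matrix(dbinom(obs_mat, 100, p, log = TRUE), nrow = nrow(p)))

# flat prior on the grid: posterior weights are proportional to the likelihood
weights <- exp(log_lik - max(log_lik))

# draw parameter pairs from the grid posterior, then replicate fake data
n_draws <- 4000
idx <- sample(nrow(grid), n_draws, replace = TRUE, prob = weights)
obs_rep <- matrix(rbinom(n_draws * length(t), 100, p[idx, ]), nrow = n_draws)

# summarize replicates per time point
ci <- apply(obs_rep, 2, quantile, probs = c(0.025, 0.975))
summary_ppc <- data.frame(t = t, observed = obs,
                          rep_mean = colMeans(obs_rep),
                          lower = ci[1, ], upper = ci[2, ])
print(summary_ppc)
```

Observed counts that fall outside their replicated interval are the points where the model's predictions and the data disagree.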

PPC: power model

  • black dots: data
  • blue dots: mean of replicated fake data
  • blue bars: 95% HDIs of replicated fake data

Bayesian predictive model criticism

  • fix a data set \(x\) from a set of possible observations \(X\)
  • fix a model \(M\) with \(P(\theta)\) and \(P(X = x \mid \theta)\)
    • NB: \(P(\theta)\) can be conditioned on data: prior/posterior predictive \(p\)-value
  • fix a test statistic \(t \colon X, \theta \rightarrow \mathbb{R}\)
    • test statistic may depend on parameters
  • Bayesian predictive \(p\)-value:

\[ p(x ; X, M = \langle P(\theta), P(x\mid\theta) \rangle, t) = \int P(\theta) \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x},\theta) \ge t(x, \theta) \right\}} P(\tilde{x} \mid \theta) \ \text{d}\tilde{x} \ \text{d}\theta\]


  • compare to standard \(p\)-value definition:

\[ p(x ; H_0, X, P(\cdot \mid H_0), t) = \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x}) \ge t(x) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]


obs = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k = sum(obs) # 7
N = length(obs) # 20

\(p\)-value NHST:

  • do not reject NH \(\theta = 0.5\)
binom.test(k, N, 0.5)$p.value
## [1] 0.263176

Bayesian posterior predictive \(p\)-value

  • test statistic: no. switches 1 <-> 0
  • \(t(d^*)\) = 3
  • PP-\(p\)-value \(\approx 0.028\)


Gelman et al. 2014, p.147–8
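A posterior predictive \(p\)-value of this kind can be estimated by simulation. The following R sketch makes several assumptions that are not spelled out on the slide: an i.i.d. Bernoulli model with a flat \(\text{Beta}(1,1)\) prior on \(\theta\), 10000 replicates, and "extreme" meaning at most as many switches as observed. The helper `n_switches` is a hypothetical name.

```r
set.seed(1)
obs <- c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k <- sum(obs)
N <- length(obs)

# test statistic: number of switches between 0 and 1 in a sequence
n_switches <- function(x) sum(diff(x) != 0)
t_obs <- n_switches(obs)

# assumption: i.i.d. Bernoulli model with a flat Beta(1, 1) prior on theta
n_rep <- 10000
theta_post <- rbeta(n_rep, k + 1, N - k + 1)
t_rep <- sapply(theta_post, function(theta) n_switches(rbinom(N, 1, theta)))

# share of replicated sequences with at most as many switches as observed
pp_p_value <- mean(t_rep <= t_obs)
print(c(t_obs = t_obs, pp_p_value = pp_p_value))
```

The estimate carries Monte Carlo error and changes slightly with the seed. The binomial test cannot detect this kind of misfit because it only looks at the number of successes, not at their order.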




  • introduction to Stan