\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]
parameter estimation:
\[\underbrace{P(\theta \, | \, D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \ \underbrace{P(D \, | \, \theta)}_{\text{likelihood}}\]
model comparison
\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
prediction
prior predictive
\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]
posterior predictive
\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]
| | estimation | comparison | criticism |
---|---|---|---|
goal | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
method | Bayes rule | Bayes factor | \(p\)-value |
no. of models | 1 | 2 | 1 |
\(H_0\) | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
\(H_1\) | — | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | — |
prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | — | test statistic |
pros | lean, easy | intuitive, plausible, Ockham's razor | absolute |
cons | vagueness in ROPE | prior dependence, computational load | sample space? |
NHST
model criticism
compare 3 methods for testing a null hypothesis: \(p\)-values, HDI+ROPE, Bayes factors
running example:
\(k=7\), \(N=24\) \(\rightarrow\) \(\theta = 0.5?\)
in the general case, the \(p\)-value of observation \(x\) under null hypothesis \(H_0\), with sample space \(X\), sampling distribution \(P(\cdot \mid H_0) \in \Delta(X)\) and test statistic \(t \colon X \rightarrow \mathbb{R}\) is:
\[ p(x ; H_0, X, P(\cdot \mid H_0), t) = \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x}) \ge t(x) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]
intuitive slogan: probability of at least as extreme outcomes
for an exact test we get:
\[ p(x ; H_0, X, P(\cdot \mid H_0)) = \int_{\left\{ \tilde{x} \in X \ \mid \ P(\tilde{x} \mid H_0) \le P(x \mid H_0) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]
intuitive slogan: probability of at least as unlikely outcomes
notation: \(\Delta(X)\) – set of all probability measures over \(X\)
\[ B(k ; n = 24, \theta = 0.5) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]
binom.test(7,24)$p.value
## [1] 0.06391466
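The exact-test definition can be checked by hand for the running example. A minimal sketch in base R: the variable names are illustrative, and the small relative tolerance is an assumption added to guard against floating-point ties between equally likely outcomes.

```r
k <- 7; N <- 24; theta0 <- 0.5

# Probability of every possible outcome 0..N under H0
probs <- dbinom(0:N, N, theta0)

# Sum the probabilities of all outcomes that are at most as likely as the observed one
p_obs   <- dbinom(k, N, theta0)
p_exact <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_exact
```

For this symmetric null the sum covers the outcomes 0 to 7 and 17 to 24, so the result should agree with the `binom.test` value above.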
regions of practical equivalence
credible values
NHST by ROPE for our example
\(\theta = 0.5\) is rejectable for all ROPEs with ca. \(\epsilon \le 0.02\)
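The HDI behind this statement can be approximated without extra packages. A minimal sketch, assuming a flat Beta(1, 1) prior (so the posterior is Beta(8, 18)) and an illustrative ROPE half-width `eps`; the search for the narrowest 95% interval is one common way to obtain an HDI for a unimodal density, not necessarily the method used for these slides.

```r
k <- 7; N <- 24
a <- k + 1; b <- N - k + 1   # Beta(8, 18) posterior under a flat prior (assumption)

# Width of the 95% interval that leaves lower-tail mass `lo` to its left
width <- function(lo) qbeta(lo + 0.95, a, b) - qbeta(lo, a, b)

# For a unimodal density the HDI is the narrowest 95% interval
lo  <- optimize(width, interval = c(0, 0.05))$minimum
hdi <- qbeta(c(lo, lo + 0.95), a, b)
round(hdi, 2)

# ROPE check: does [0.5 - eps, 0.5 + eps] lie entirely outside the HDI?
eps  <- 0.01                 # illustrative value
rope <- c(0.5 - eps, 0.5 + eps)
rope[1] > hdi[2] || rope[2] < hdi[1]
```

Varying `eps` shows how the decision depends on the width of the ROPE.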
straightforward
\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{{{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & \approx 0.516 \end{align*} \]
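The denominator has a closed form, which makes the value easy to check by hand. A worked step, assuming (as the integral above does) a uniform prior on \(\theta\) under \(M_1\), and using the Beta integral \(\int_0^1 \theta^{k} (1-\theta)^{N-k} \, \text{d}\theta = \frac{k! \, (N-k)!}{(N+1)!}\):

\[ \int_0^1 \binom{N}{k} \theta^{k} \, (1-\theta)^{N-k} \ \text{d}\theta = \binom{N}{k} \frac{k! \, (N-k)!}{(N+1)!} = \frac{1}{N+1} \]

\[ \text{BF}(M_0 > M_1) = (N+1) \binom{N}{k} \, 0.5^{N} = 25 \cdot \frac{346104}{16777216} \approx 0.516 \]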
Savage-Dickey
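For nested models, the Savage-Dickey density ratio expresses the Bayes factor as the posterior density divided by the prior density at the null value. A minimal sketch for the running example, assuming a flat Beta(1, 1) prior on \(\theta\) under \(M_1\); the variable names are illustrative.

```r
k <- 7; N <- 24

# Direct route: marginal likelihoods of M0 (theta = 0.5) and M1 (theta ~ Beta(1, 1))
ml_M0 <- dbinom(k, N, 0.5)
ml_M1 <- integrate(function(theta) dbinom(k, N, theta), lower = 0, upper = 1)$value
bf_direct <- ml_M0 / ml_M1

# Savage-Dickey route: posterior density over prior density at theta = 0.5
bf_sd <- dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)

c(direct = bf_direct, savage_dickey = bf_sd)
```

Both routes should return the same value up to numerical integration error.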
method | result | interpretation |
---|---|---|
\(p\)-value | \(p \approx 0.064\) | do not reject \(H_0\) |
HDI+ROPE | \(\text{HDI} \approx [0.14;0.48]\) | do not adopt \(H_0\) (depends on \(\epsilon\)) |
Bayes Factor | \(\text{BF}(M_0 > M_1) \approx 0.516\) | weak evidence in favor of \(H_1\) |
"paradox": two established methods give contradictory results
k = 49581
N = 98451
\(p\)-value NHST
binom.test(k, N)$p.value
## [1] 0.02364686
reject \(H_0\)
Savage-Dickey BF
dbeta(0.5, k+1, N - k + 1)
## [1] 19.21139
strong evidence in favor of \(H_0\)
| | estimation | comparison | criticism |
---|---|---|---|
\(M_0\) | \([0.5-\epsilon, 0.5+\epsilon] \subseteq\) 95% HDI or vice versa | BF(\(M_0\)>\(M_1\)) > 6 | \(p > 0.05\) |
\(M_1\) | \([0.5-\epsilon, 0.5+\epsilon] \, \cap \,\) 95% HDI \(=\emptyset\) | BF(\(M_1\)>\(M_0\)) > 6 | \(p \le 0.05\) |
?? | otherwise | otherwise | never |
BF selects \(H_0\) correctly with prob. 0.986 for \(N = 10000\), and with 0.996 for \(N = 100000\).
[cf. Lindley's solution to the 'paradox': adjust the threshold for \(p\) depending on \(N\); similarly for the ROPE's \(\epsilon\)]
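How often the Bayes factor favors a true \(H_0\) can be explored by simulation. A rough sketch only: the decision threshold, the flat prior, the number of simulated data sets, and the seed are all assumptions, so the proportions it returns need not match the figures quoted above.

```r
set.seed(2024)

# Share of data sets generated under H0 (theta = 0.5) for which the
# Savage-Dickey Bayes factor BF(M0 > M1) exceeds `threshold`
prop_bf_favors_h0 <- function(N, n_sim = 10000, threshold = 6) {
  k  <- rbinom(n_sim, N, 0.5)
  bf <- dbeta(0.5, k + 1, N - k + 1)   # flat Beta(1, 1) prior under M1 (assumption)
  mean(bf > threshold)
}

res <- sapply(c(1e4, 1e5), prop_bf_favors_h0)
res
```

Changing `threshold` shows how sensitive the selection rate is to the decision criterion.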
posterior predictive checks
graphically compare simulated observations with actual observation
Bayesian predictive \(p\)-values
measure surprise level of data under a model
[think: \(p\)-value for a non-trivial, serious model with potential uncertainty about parameters]
exponential forgetting model
y = c(.94, .77, .40, .26, .24, .16)
t = c( 1, 3, 6, 9, 12, 18)
obs = y*100
model{
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  for (i in 1:6){
    p[i] = min(max( a*exp(-t[i]*b), 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)     # condition on data
    obsRep[i] ~ dbinom(p[i], 100)  # replicate fake data
  }
}
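The model above is written for JAGS. As a dependency-free stand-in, the sketch below approximates the same posterior on a grid in base R and draws replicated data from it. The grid resolution, the seed, the number of draws, and the helper `p_of` are assumptions for illustration; this is a swapped-in approximation, not the sampler used for the slides.

```r
# Data from the slides: retention proportions at lags t, 100 trials each
y   <- c(.94, .77, .40, .26, .24, .16)
t   <- c(  1,   3,   6,   9,  12,  18)
obs <- round(y * 100)

# Grid over the support of the uniform priors on a and b
grid <- expand.grid(a = seq(0.005, 1.495, length.out = 150),
                    b = seq(0.005, 1.495, length.out = 150))

# Success probability per time point, clipped as in the JAGS model
p_of <- function(a, b) pmin(pmax(a * exp(-t * b), 0.0001), 0.9999)

# Log-likelihood at every grid point; the flat priors only add a constant
loglik <- mapply(function(a, b) sum(dbinom(obs, 100, p_of(a, b), log = TRUE)),
                 grid$a, grid$b)
post <- exp(loglik - max(loglik))
post <- post / sum(post)

# Sample (a, b) from the grid posterior and replicate the data (one column per draw)
set.seed(1)
idx    <- sample(nrow(grid), 2000, replace = TRUE, prob = post)
obsRep <- mapply(function(a, b) rbinom(length(t), 100, p_of(a, b)),
                 grid$a[idx], grid$b[idx])

# Compare the observed counts with 95% posterior predictive intervals per time point
ppi <- apply(obsRep, 1, quantile, probs = c(0.025, 0.975))
rbind(obs, ppi)
```

Observed counts that fall outside their predictive interval point to time lags the exponential model describes poorly.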
\[ p(x ; X, M = \tuple{P(\theta), P(x\mid\theta)}, t) = \int P(\theta) \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x},\theta) \ge t(x, \theta) \right\}} P(\tilde{x} \mid \theta) \ \text{d}\tilde{x} \ \text{d}\theta\]
\[ p(x ; H_0, X, P(\cdot \mid H_0), t) = \int_{\left\{ \tilde{x} \in X \ \mid \ t(\tilde{x}) \ge t(x) \right\}} P(\tilde{x} \mid H_0) \ \text{d}\tilde{x}\]
obs = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k = sum(obs)     # 7
N = length(obs)  # 20
\(p\)-value NHST:
binom.test(k, N, 0.5)$p.value
## [1] 0.263176
Bayesian posterior predictive \(p\)-value
Gelman et al. 2014, p.147–8
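A posterior predictive \(p\)-value for this sequence can be simulated directly. A minimal sketch, assuming the test statistic is the number of switches between 0 and 1 (as in the cited example from Gelman et al.) and a flat Beta(1, 1) prior on \(\theta\); the helper `n_switches`, the seed, and the number of simulations are illustrative choices.

```r
obs <- c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k <- sum(obs); N <- length(obs)

# Test statistic: number of switches between 0 and 1 in the sequence
n_switches <- function(x) sum(diff(x) != 0)

set.seed(1)
n_sim <- 10000
theta <- rbeta(n_sim, k + 1, N - k + 1)   # posterior draws under a flat prior (assumption)
t_rep <- sapply(theta, function(th) n_switches(rbinom(N, 1, th)))

# Posterior predictive p-value: share of replicates with at least as many switches
p_pp <- mean(t_rep >= n_switches(obs))
p_pp
```

Unlike the binomial test above, this check is sensitive to the order of the observations: a value close to 1 signals fewer switches in the data than independent trials would typically produce.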
| | estimation | comparison | criticism |
---|---|---|---|
goal | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
method | Bayes rule | Bayes factor | \(p\)-value |
no. of models | 1 | 2 | 1 |
\(H_0\) | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
\(H_1\) | — | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | — |
prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | — | test statistic |
pros | lean, easy | intuitive, plausible, Ockham's razor | absolute |
cons | vagueness in ROPE | prior dependence, computational load | sample space? |