At a glance

  • BDA is about what we should believe given:
    • some observable data, and
    • our model of how this data was generated.
  • Our best friend will be Bayes rule: \[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \times \underbrace{P(D \, | \, \theta)}_{likelihood}\]
  • If \(P(\theta \, | \, D)\) is hard to compute, we resort to some clever approximation tricks.

[figure: Reverend Thomas Bayes]

Example: coin flips

  • \(\theta \in [0;1]\) is the bias of a coin:
    • if we throw a coin, the outcome will be heads with probability \(\theta\)
  • we have no clue about \(\theta\) at the outset:
    • a priori we consider every possible value of \(\theta\) equally likely
  • we observe that 7 out of 24 flips were heads
  • what shall we believe about \(\theta\) now?

"Classical statistics"

  • null hypothesis significance testing (NHST)
    • e.g., is the coin fair (\(\theta = 0.5\))?
  • signals whether the NH should be rejected
    • not how likely the NH is, nor whether it should be accepted
  • relies on sampling distributions & p-values
    • standard "tests" can have rigid built-in assumptions
    • implicitly rely on experimenter's intentions
  • looks at point estimates only


Pros & Cons of BDA

Pro

  • well-founded & totally general
  • easily extensible / customizable
  • more informative / insightful


Con

  • less ready-made, thinking required
  • not yet fully digested by community
  • higher computational complexity


3 times Bayes

 

Bayesian data analysis - Bayesian analogues or alternatives to "classical" tests

 

Bayesian (cognitive) modeling - custom models of the data-generating process

 

Bayes in the head - model (human) cognition as Bayesian inference

Goals of this course

The road ahead

theory

  • posterior inference & credible values

  • Bayes factors & model comparison

  • comparison of Bayesian and "classical" NHST

 

practice

  • basics of MCMC sampling

  • tools for BDA (JAGS, Stan (rstanarm), WebPPL, JASP)

  • Bayesian cognitive modeling example

NHST & \(p\)-value logic

binomial distribution

  • take \(N\) flips of a coin with bias \(\theta\)
  • binomial distribution gives the probability of observing \(k\) successes:

\[P(k \mid N, \theta) = {{N}\choose{k}} \, \theta^{k} \, (1-\theta)^{N - k}\]

  • example: \(N=24\), \(\theta = 0.5\)
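
a minimal sketch of this example in R (reproducing the sampling distribution that was plotted here):

# probability of k heads in N = 24 flips of a fair coin
N <- 24
barplot(dbinom(0:N, N, 0.5), names.arg = 0:N,
        xlab = "k", ylab = "P(k | N, theta = 0.5)")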

NHST \(p\)-value logic

  • we observed \(k=7\) successes after \(N=24\) flips

  • null hypothesis: the coin is fair, i.e., \(\theta = 0.5\)

  • the \(p\)-value of \(k=7\) is the probability of observing an outcome that is at least as unlikely as \(k=7\) under the NH in infinite repetitions of the experiment

  • significance: reject NH if \(p\)-value is under a predetermined threshold (e.g., 0.05)
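
a minimal sketch of this computation in R (under \(\theta = 0.5\) the sampling distribution is symmetric, so the outcomes at least as unlikely as \(k=7\) are \(k \le 7\) and \(k \ge 17\)):

k <- 7
N <- 24
# two-sided p-value: total probability of outcomes at least as unlikely as k = 7
2 * pbinom(k, N, 0.5)
## [1] 0.06391466
binom.test(k, N)$p.value   # R's exact binomial test agrees
## [1] 0.06391466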

negative binomial distribution

  • flip a coin with bias \(\theta\) until we have observed \(k\) successes
  • negative binomial distribution gives the probability of observing \(N\) flips:

\[P(N \mid k, \theta) = \frac{k}{N} \, {{N}\choose{k}} \, \theta^{k} \, (1-\theta)^{N - k}\]

  • example: \(k=7\), \(\theta = 0.5\)

another \(p\)-value for "same" data set

  • we observed \(N=24\) flips for a success count of \(k=7\)
    • NB: same data set as before but obtained differently
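
a sketch of the corresponding \(p\)-value in R; the helper dnegbin is ours, mirroring the formula above, and the infinite support is truncated:

k <- 7
N_obs <- 24
# probability of needing exactly N flips to reach k successes
dnegbin <- function(N, k, theta) {
  (k / N) * choose(N, k) * theta^k * (1 - theta)^(N - k)
}
Ns <- k:5000                        # truncation of the infinite support
probs <- dnegbin(Ns, k, 0.5)
# p-value: total probability of outcomes at least as unlikely as N = 24
sum(probs[probs <= dnegbin(N_obs, k, 0.5)])
## [1] 0.01734483

same data, different sampling plan, different \(p\)-value (0.064 before vs. 0.017 now): this sample-space dependence is exactly what Bayesian inference avoids (see below).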

some properties of \(p\)-value NHST

intensional

  • hinges on what it means to repeat the experiment
    • "model" = data-generating (e.g., psychological) + data-collecting processes

 

non-doxastic

  • significance \(\neq\) evidence for "the" alternative hypothesis
    • no information about any alternative hypothesis is used anywhere
  • non-significance \(\neq\) evidence for the NH
    • evidence is a relative notion: shifting plausibility between hypotheses

Bayesian basics

key notions

conditional probability:

\[P(X \, | \, Y) = \frac{P(X \cap Y)}{P(Y)}\]

Bayes rule:

\[P(X \, | \, Y) = \frac{P(X) \times P(Y \, | \, X)}{P(Y)}\]

Bayes rule for data analysis:

\[\underbrace{P(\theta \, | \, D)}_{posterior} = \frac{\overbrace{P(\theta)}^{prior} \times \overbrace{P(D \, | \, \theta)}^{likelihood}}{\underbrace{P(D)}_{evidence}}\]

Bayes rule in multi-D

joint probability distribution as two-dimensional matrix:

##       blond brown  red black
## blue   0.03  0.04 0.00  0.41
## green  0.09  0.09 0.05  0.01
## brown  0.04  0.02 0.09  0.13

marginal distribution over eye color:

##  blue green brown 
##  0.48  0.24  0.28

conditional probability given black hair:

##  blue green brown 
##  0.75  0.02  0.24
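
these numbers can be reproduced in R (a sketch; values as in the matrix above):

joint <- matrix(c(0.03, 0.04, 0.00, 0.41,
                  0.09, 0.09, 0.05, 0.01,
                  0.04, 0.02, 0.09, 0.13),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("blue", "green", "brown"),
                                c("blond", "brown", "red", "black")))
rowSums(joint)                                       # marginal over eye color
round(joint[, "black"] / sum(joint[, "black"]), 2)   # condition on black hair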

model = prior & likelihood

model of a coin flip:

  • bias parameter \(\theta \in \{0, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, 1\}\): probability of success on single trial
  • flat prior beliefs: \(P(\theta) = .2\,, \forall \theta\)
  • likelihood \(P(D \, | \, \theta)\) of data given \(\theta\):
##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads   0  0.33   0.5  0.67   1
## tails   1  0.67   0.5  0.33   0

factoring in \(P(\theta)\) gives the joint probability distribution as a 2-d matrix:

##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads 0.0  0.07   0.1  0.13 0.2
## tails 0.2  0.13   0.1  0.07 0.0

Bayesian inference

Bayes rule: \(P(\theta \, | \, D) \propto P(\theta) \times P(D \, | \, \theta)\)

##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads 0.0  0.07   0.1  0.13 0.2
## tails 0.2  0.13   0.1  0.07 0.0

posterior probability \(P(\theta \, | \, \text{heads})\) after a toss with heads:

##   t=0 t=1/3 t=1/2 t=2/3   t=1 
##  0.00  0.13  0.20  0.27  0.40
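
a sketch of this update in R:

theta <- c(0, 1/3, 1/2, 2/3, 1)
prior <- rep(0.2, length(theta))
likelihood <- theta                  # P(heads | theta)
posterior <- prior * likelihood / sum(prior * likelihood)   # Bayes rule
round(posterior, 2)
## [1] 0.00 0.13 0.20 0.27 0.40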

generalized model of coin flips

  • infinite parameter space \(\theta \in [0;1]\)

  • likelihood of observing \(k\) successes, given \(N\) flips, is binomial distribution:

\[P(k \mid N, \theta) = {{N}\choose{k}} \, \theta^{k} \, (1-\theta)^{N - k}\]

  • example of a data-generating model:

 

[figure: graphical model of the binomial coin-flip model]

[see Lee & Wagenmakers (2015) on conventions for graphical notation]

examples

uniform prior: \(\theta \sim Beta(1,1)\)

 

examples

 

prior biased towards successes: \(\theta \sim Beta(7,3)\)

examples

prior biased towards losses: \(\theta \sim Beta(3,7)\)

 
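a sketch of these prior/posterior plots in R, using Beta-Binomial conjugacy (a \(Beta(a,b)\) prior combined with \(k\) successes in \(N\) flips yields a \(Beta(a+k, b+N-k)\) posterior); the helper plot_update is ours:

k <- 7
N <- 24
theta <- seq(0, 1, length.out = 501)
plot_update <- function(a, b) {
  plot(theta, dbeta(theta, a + k, b + N - k), type = "l",
       ylab = "density", main = sprintf("prior Beta(%g,%g)", a, b))
  lines(theta, dbeta(theta, a, b), lty = 2)   # dashed line: the prior
}
plot_update(1, 1)   # uniform prior
plot_update(7, 3)   # biased towards successes
plot_update(3, 7)   # biased towards losses
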

independence of stopping rule

 

  • \(P(\theta \mid D)\) is independent of the stopping criterion used during data collection
  • any constant factor \(X\) in the likelihood cancels out:

 

\[ \begin{align*} P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{ \frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{ \ \frac{1}{X}\ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{P(\theta) \ \frac{1}{X}\ P(D \mid \theta)}{ \int_{\theta'} P(\theta') \ \frac{1}{X}\ P(D \mid \theta')} \end{align*} \]
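
for instance, the binomial and negative binomial likelihoods for \(k=7\) successes in \(N=24\) flips differ only by the constant factor \(\frac{k}{N}\), so they yield the same posterior (a sketch over a grid of \(\theta\) values):

k <- 7
N <- 24
theta <- seq(0.005, 0.995, by = 0.005)            # grid approximation
prior <- rep(1 / length(theta), length(theta))    # flat prior
normalize <- function(x) x / sum(x)
binom_lh <- choose(N, k) * theta^k * (1 - theta)^(N - k)
negbin_lh <- (k / N) * binom_lh                   # same up to a constant
all.equal(normalize(prior * binom_lh), normalize(prior * negbin_lh))
## [1] TRUE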

properties of Bayesian inference

extensional

  • independent of methods for data collection, as long as these do not influence \(P(D \mid \theta)\)

 

doxastic

  • is about subjective beliefs:
    • naturally interpretable
    • highly informative, distributional information
    • feeds directly into Bayesian decision theory

estimation
comparison
criticism

outlook

parameter estimation: what to conclude from the data given the model?

  • maximum likelihood
  • full Bayesian inference
  • credible intervals

model comparison: which of several models is better?

  • information criteria
  • Bayes factors

model criticism: is my model any good?

  • \(p\)-values
    • prior, posterior and classical
  • posterior predictive checks

preview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

estimation

credible interval

 

for distribution \(P(x)\) over \(X\), the \(n\)% credible interval is a subset \(Y \subseteq X\) such that:

  1. \(P(Y) = \frac{n}{100}\), and
  2. no point outside of \(Y\) is more likely than any point within


intuition: the range of values we are justified to believe in (categorically).

[alternative terminology: highest density interval (HDI), credible region, …]

examples

[figure: Kruschke (2014), Fig. 5.3]

posterior credible \(\theta\)'s

  • observation: \(k = 7\), \(N = 24\)
  • model:
    • \(\theta \sim Beta(1,1)\)
    • \(k \sim Binomial(\theta, N)\)
  • posterior: \(P(\theta \mid k=7, N = 24)\)
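
a sketch for computing the 95% HDI of this posterior in R (the posterior is \(Beta(k+1, N-k+1)\); the helper hdi_width is ours and assumes a unimodal posterior):

k <- 7
N <- 24
# width of an interval holding 95% of the posterior mass, as a function
# of the mass lying below its lower end
hdi_width <- function(lo) {
  qbeta(lo + 0.95, k + 1, N - k + 1) - qbeta(lo, k + 1, N - k + 1)
}
lo <- optimize(hdi_width, c(0, 0.05))$minimum   # narrowest such interval
round(qbeta(c(lo, lo + 0.95), k + 1, N - k + 1), 2)
## [1] 0.14 0.48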

ROPEs and credible values

 

regions of practical equivalence (ROPE)

  • small regions \([\theta - \epsilon, \theta + \epsilon]\) around each \(\theta\)
    • values (practically) indistinguishable from \(\theta\)

 

credible values

  • value \(\theta\) is rejectable if its ROPE lies entirely outside the posterior HDI
  • value \(\theta\) is believable if its ROPE lies entirely within the posterior HDI

[this is mainly Kruschke's (2014) approach]
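
a tiny sketch of this decision rule, reusing the 95% HDI from the estimation sketch above (the choice of \(\epsilon\) is ours):

epsilon <- 0.01
rope <- c(0.5 - epsilon, 0.5 + epsilon)   # ROPE around theta = 0.5
hdi <- c(0.14, 0.48)                      # 95% HDI for k = 7, N = 24
rope[1] > hdi[2] || rope[2] < hdi[1]      # ROPE entirely outside the HDI?
## [1] TRUE

so \(\theta = 0.5\) counts as rejectable here.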

model comparison

model comparison

 

main question

which model is better given data? (e.g., null model vs. alternative)

 

caveat

different answers may imply different notions of "model" & purposes of modeling

key notions

information criteria

  • maximum likelihood estimation
  • free parameters
  • model complexity

Bayes factors

  • Savage-Dickey method
  • Lindley paradox

Akaike information criterion

motivation

  • a model is better the higher \(P(D \mid \hat{\theta})\)
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\)
  • a model is worse the more free parameters it has
    • principle of parsimony (Ockham's razor)
  • information-theoretic notion:
    • amount of information lost if we assume the data were generated by \(\hat{\theta}\)

definition

let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]

the smaller the AIC, the better the model

model comparison by AIC

example

k <- 7
N <- 24
# null model: theta fixed at 0.5 (no free parameters)
AIC_nh <- -2 * dbinom(k, N, prob = 0.5, log = TRUE)
# alternative model: theta free, MLE is k/N (one free parameter)
AIC_ah <- 2 - 2 * dbinom(k, N, prob = k/N, log = TRUE)
show(data.frame(model = c("null", "alt"),
                AIC = c(AIC_nh, AIC_ah)))
##   model      AIC
## 1  null 7.762076
## 2   alt 5.465598

weight of evidence

  • let \(AIC_i\) be model \(i\)'s AIC & let \(\Delta_i = AIC_i - \min_j AIC_j\)
  • the weight of evidence for model \(i\) is: \(w_i \propto \exp(-0.5 \Delta_i)\)
  • e.g., the evidence ratio in favor of the alternative model is \(\exp(0.5 \cdot 2.296) \approx 3.15\) (hardly compelling!)

[e.g., Burnham & Anderson (2002)]
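
a sketch of that computation, continuing the AIC example above:

AIC <- c(null = AIC_nh, alt = AIC_ah)
delta <- AIC - min(AIC)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))   # normalized Akaike weights
round(weights, 3)
##  null   alt 
## 0.241 0.759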

remarks on information criteria

  • given more and more data, repeated model selection by AIC does not guarantee ending up with the true model
  • "model" for AICs is just likelihood; no prior
  • not predictive, but "post-hoc" as \(\hat{\theta}\) depends on data
  • penalizing the raw number of parameters, as AIC does, ignores the effective strength of each parameter
  • there are other information criteria that address some of these problems:
    • Bayesian information criterion
    • deviance information criterion

Bayes factors

  • take two models (in the sense of "model = prior + likelihood")
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we'd want to know the absolute probability of \(M_i\) given the data
    • but then we'd need to know set of all models (for normalization)
  • alternatively, we take odds of models given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.

evidence

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

evidence of model \(M_i\) (= marginal likelihood of data)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

evidence marginalizes out parameters \(\theta_i\): function of prior and likelihood

example

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
  • model specification:
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{ {{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{k} \, (1-0.5)^{N - k}}{BetaFunction(k+1, N-k+1)} \approx 0.516 \end{align*} \]

how to interpret Bayes factors

| BF(\(M_1 > M_2\)) | interpretation |
|-------------------|----------------|
| 1                 | irrelevant data |
| 1 - 3             | hardly worth ink or breath |
| 3 - 6             | anecdotal |
| 6 - 10            | now we're talking: substantial |
| 10 - 30           | strong |
| 30 - 100          | very strong |
| 100 +             | decisive (bye, bye \(M_2\)!) |

how to calculate Bayes factors

  1. calculate each model's evidence (sketched below)
    • clever math (rather than brute force)
    • grid approximation
    • by MCMC estimation
  2. calculate the Bayes factor directly:
    • transdimensional MCMC
    • Savage-Dickey method
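
a sketch for the running example (\(k=7\), \(N=24\); \(M_0\): \(\theta = 0.5\), \(M_1\): \(\theta \sim Beta(1,1)\)), computing the evidence of \(M_1\) by clever math, by grid approximation, and by Monte Carlo sampling from the prior:

k <- 7
N <- 24
ev0 <- dbinom(k, N, 0.5)              # evidence of M0: no integral needed
ev1_math <- 1 / (N + 1)               # clever math: the integral is 1/(N+1)
theta <- seq(0.0005, 0.9995, by = 0.001)
ev1_grid <- mean(dbinom(k, N, theta)) # grid approximation of the integral
ev1_mc <- mean(dbinom(k, N, runif(1e6)))   # Monte Carlo over prior samples
round(ev0 / c(ev1_math, ev1_grid, ev1_mc), 3)   # BF(M0 > M1), all ≈ 0.516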

properly nested models

  • suppose that there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]
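
for the running example this needs only the prior and posterior density at \(\theta = 0.5\) (a sketch; under \(M_1\) the posterior is \(Beta(k+1, N-k+1)\), and the \(Beta(1,1)\) prior density at 0.5 is 1):

k <- 7
N <- 24
dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)
## [1] 0.5157352

this matches the direct computation of \(\text{BF}(M_0 > M_1) \approx 0.516\) above.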

recap

Bayes rule for parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \times \underbrace{P(D \, | \, \theta)}_{\text{likelihood}}\]

Bayes factor for model comparison:

\[ \begin{align*} \underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} & = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}} \\ \underbrace{P(D \mid M_i)}_{\text{evidence}} & = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i \end{align*} \]

\(p\)-values for null-hypothesis significance testing

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

comparison of approaches

Jeffreys-Lindley paradox

k = 49581
N = 98451
show(k/N)
## [1] 0.5036109

\(p\)-value NHST

binom.test(k, N)$p.value   # two-sided exact test of theta = 0.5
## [1] 0.02364686

Savage-Dickey BF

# Savage-Dickey: posterior density of theta = 0.5 under the alternative;
# the Beta(1,1) prior density at 0.5 is 1, so the ratio is just the numerator
dbeta(0.5, k+1, N - k + 1)
## [1] 19.21139

simulation

  • let the true bias be \(\theta = 0.5\)
  • generate all possible outcomes \(k\) keeping \(N\) fixed
    • \(N \in \{ 10, 100, 1000, 10000, 100000 \}\)
    • true frequency of \(k\) is \(Binomial(k \mid N, \theta = 0.5)\)
  • look at the frequency of test results, coded thus:
|         | estimation | comparison | criticism |
|---------|------------|------------|-----------|
| \(M_0\) | \([.5-\epsilon, .5+\epsilon] \sqsubseteq\) 95% HDI or v.v. | BF(\(M_0\)>\(M_1\)) > 6 | \(p\) > 0.05 |
| \(M_1\) | \([.5-\epsilon, .5+\epsilon] \, \cap \,\) 95% HDI \(= \emptyset\) | BF(\(M_1\)>\(M_0\)) > 6 | \(p \le\) 0.05 |
| ??      | otherwise | otherwise | never |
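
a compact sketch of the comparison and criticism columns for a single sample size (assuming the BF is computed via Savage-Dickey as above; slow but straightforward):

N <- 10000
ks <- 0:N
true_freq <- dbinom(ks, N, 0.5)            # true sampling distribution
bf0 <- dbeta(0.5, ks + 1, N - ks + 1)      # Savage-Dickey BF(M0 > M1)
p <- sapply(ks, function(k) binom.test(k, N)$p.value)
sum(true_freq[bf0 > 6])                    # probability of selecting M0
sum(true_freq[p > 0.05])                   # probability of retaining the NH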

results

Bayes factor model comparison selects \(M_0\) correctly with probability 0.986 for \(N = 10000\), and with 0.996 for \(N = 100000\).

[cf. Lindley's solution to the 'paradox': adjust the \(p\)-value threshold depending on \(N\); similarly for ROPE's \(\epsilon\)]

wrap-up

notions covered

  • NHST \(p\)-value logic

 

  • posterior inference
    • credible intervals, ROPE

 

  • model comparison
    • Bayes factors