At a glance

  • BDA is about what we should believe given:
    • some observable data, and
    • our model of how this data was generated.
  • Our best friend will be Bayes rule: \[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \times \underbrace{P(D \, | \, \theta)}_{likelihood}\]
  • If \(P(\theta \, | \, D)\) is hard to compute, we resort to some clever approximation tricks.

[figure: Reverend Thomas Bayes]

Example: coin flips

  • \(\theta \in [0;1]\) is the bias of a coin:
    • if we throw a coin, the outcome will be heads with probability \(\theta\)
  • we have no clue about \(\theta\) at the outset:
    • a priori we consider every possible value of \(\theta\) equally likely
  • we observe that 7 out of 24 flips were heads
  • what shall we believe about \(\theta\) now?

"Classical statistics"

  • null hypothesis significance testing (NHST)
    • e.g., is the coin fair (\(\theta = 0.5\))?
  • signals whether the NH should be rejected
    • not how likely the NH is, nor whether it should be accepted
  • relies on sampling distributions & p-values
    • standard "tests" can have rigid built-in assumptions
    • implicitly rely on experimenter's intentions
  • looks at point estimates only


Pros & Cons of BDA

Pro

  • well-founded & totally general
  • easily extensible / customizable
  • more informative / insightful


Con

  • less ready-made, thinking required
  • not yet fully digested by community
  • higher computational complexity


3 times Bayes

 

Bayesian data analysis - Bayesian analogues or alternatives to "classical" tests

 

Bayesian (cognitive) modeling - custom models of the data-generating process

 

Bayes in the head - model (human) cognition as Bayesian inference

Goals of this course

The road ahead

theory

  • posterior inference & credible values

  • Bayes factors & model comparison

  • comparison of Bayesian and "classical" NHST

 

practice

  • basics of MCMC sampling

  • tools for BDA (JAGS, Stan (rstanarm), WebPPL, JASP)

  • Bayesian cognitive modeling example

NHST & \(p\)-value logic

binomial distribution

  • take \(N\) flips of a coin with bias \(\theta\)
  • binomial distribution gives the probability of observing \(k\) successes:

\[P(k \mid N, \theta) = {{N}\choose{k}} \, \theta^{k} \, (1-\theta)^{N - k}\]

  • example: \(N=24\), \(\theta = 0.5\)
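
a minimal sketch of this example in R (reproducing the sampling distribution that was plotted here):

# probability of k heads in N = 24 flips of a fair coin
N <- 24
barplot(dbinom(0:N, N, 0.5), names.arg = 0:N,
        xlab = "k", ylab = "P(k | N, theta = 0.5)")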

NHST \(p\)-value logic

  • we observed \(k=7\) successes after \(N=24\) flips

  • null hypothesis: the coin is fair, i.e., \(\theta = 0.5\)

  • the \(p\)-value of \(k=7\) is the probability of observing an outcome that is at least as unlikely as \(k=7\) under the NH in infinite repetitions of the experiment

  • significance: reject NH if \(p\)-value is under a predetermined threshold (e.g., 0.05)
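
a minimal sketch of this computation in R (under \(\theta = 0.5\) the sampling distribution is symmetric, so the outcomes at least as unlikely as \(k=7\) are \(k \le 7\) and \(k \ge 17\)):

k <- 7
N <- 24
# two-sided p-value: total probability of outcomes at least as unlikely as k = 7
2 * pbinom(k, N, 0.5)
## [1] 0.06391466
binom.test(k, N)$p.value   # R's exact binomial test agrees
## [1] 0.06391466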

negative binomial distribution

  • flip a coin with bias \(\theta\) until we have observed \(k\) successes
  • negative binomial distribution gives the probability of observing \(N\) flips:

\[P(N \mid k, \theta) = \frac{k}{N} \, {{N}\choose{k}} \, \theta^{k} \, (1-\theta)^{N - k}\]

  • example: \(k=7\), \(\theta = 0.5\)

another \(p\)-value for "same" data set

  • we observed \(N=24\) flips for a success count of \(k=7\)
    • NB: same data set as before but obtained differently
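
a sketch of the corresponding \(p\)-value in R; the helper dnegbin is ours, mirroring the formula above, and the infinite support is truncated:

k <- 7
N_obs <- 24
# probability of needing exactly N flips to reach k successes
dnegbin <- function(N, k, theta) {
  (k / N) * choose(N, k) * theta^k * (1 - theta)^(N - k)
}
Ns <- k:5000                        # truncation of the infinite support
probs <- dnegbin(Ns, k, 0.5)
# p-value: total probability of outcomes at least as unlikely as N = 24
sum(probs[probs <= dnegbin(N_obs, k, 0.5)])
## [1] 0.01734483

same data, different sampling plan, different \(p\)-value (0.064 before vs. 0.017 now): this sample-space dependence is exactly what Bayesian inference avoids (see below).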

some properties of \(p\)-value NHST

intensional

  • hinges on what it means to repeat the experiment
    • "model" = data-generating (e.g., psychological) + data-collecting processes

 

non-doxastic

  • significance \(\neq\) evidence for "the" alternative hypothesis
    • no information about any alternative hypothesis is used anywhere
  • non-significance \(\neq\) evidence for the NH
    • evidence is a relative notion: shifting plausibility between hypotheses

Bayesian basics

key notions

conditional probability:

\[P(X \, | \, Y) = \frac{P(X \cap Y)}{P(Y)}\]

Bayes rule:

\[P(X \, | \, Y) = \frac{P(X) \times P(Y \, | \, X)}{P(Y)}\]

Bayes rule for data analysis:

\[\underbrace{P(\theta \, | \, D)}_{posterior} = \frac{\overbrace{P(\theta)}^{prior} \times \overbrace{P(D \, | \, \theta)}^{likelihood}}{\underbrace{P(D)}_{evidence}}\]

Bayes rule in multi-D

joint probability distribution as two-dimensional matrix:

##       blond brown  red black
## blue   0.03  0.04 0.00  0.41
## green  0.09  0.09 0.05  0.01
## brown  0.04  0.02 0.09  0.13

marginal distribution over eye color:

##  blue green brown 
##  0.48  0.24  0.28

conditional probability given black hair:

##  blue green brown 
##  0.75  0.02  0.24
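
these numbers can be reproduced in R (a sketch; values as in the matrix above):

joint <- matrix(c(0.03, 0.04, 0.00, 0.41,
                  0.09, 0.09, 0.05, 0.01,
                  0.04, 0.02, 0.09, 0.13),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("blue", "green", "brown"),
                                c("blond", "brown", "red", "black")))
rowSums(joint)                                       # marginal over eye color
round(joint[, "black"] / sum(joint[, "black"]), 2)   # condition on black hair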

model = prior & likelihood

model of a coin flip:

  • bias parameter \(\theta \in \{0, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, 1\}\): probability of success on single trial
  • flat prior beliefs: \(P(\theta) = .2\,, \forall \theta\)
  • likelihood \(P(D \, | \, \theta)\) of data given \(\theta\):
##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads   0  0.33   0.5  0.67   1
## tails   1  0.67   0.5  0.33   0

factoring in \(P(\theta)\) gives the joint probability distribution as a 2-d matrix:

##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads 0.0  0.07   0.1  0.13 0.2
## tails 0.2  0.13   0.1  0.07 0.0

Bayesian inference

Bayes rule: \(P(\theta \, | \, D) \propto P(\theta) \times P(D \, | \, \theta)\)

##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads 0.0  0.07   0.1  0.13 0.2
## tails 0.2  0.13   0.1  0.07 0.0

posterior probability \(P(\theta \, | \, \text{heads})\) after a toss with heads:

##   t=0 t=1/3 t=1/2 t=2/3   t=1 
##  0.00  0.13  0.20  0.27  0.40
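
a sketch of this update in R:

theta <- c(0, 1/3, 1/2, 2/3, 1)
prior <- rep(0.2, length(theta))
likelihood <- theta                  # P(heads | theta)
posterior <- prior * likelihood / sum(prior * likelihood)   # Bayes rule
round(posterior, 2)
## [1] 0.00 0.13 0.20 0.27 0.40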

generalized model of coin flips

  • infinite parameter space \(\theta \in [0;1]\)

  • likelihood of observing \(k\) successes, given \(N\) flips, is binomial distribution:

\[P(k \mid N, \theta) = {{N}\choose{k}} \, \theta^{k} \, (1-\theta)^{N - k}\]

  • example of a data-generating model:

 

[figure: graphical model of the binomial coin-flip model]

[see Lee & Wagenmakers (2015) on conventions for graphical notation]

examples

uniform prior: \(\theta \sim Beta(1,1)\)

 

examples

 

prior biased towards successes: \(\theta \sim Beta(7,3)\)

examples

prior biased towards losses: \(\theta \sim Beta(3,7)\)

 
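a sketch of these prior/posterior plots in R, using Beta-Binomial conjugacy (a \(Beta(a,b)\) prior combined with \(k\) successes in \(N\) flips yields a \(Beta(a+k, b+N-k)\) posterior); the helper plot_update is ours:

k <- 7
N <- 24
theta <- seq(0, 1, length.out = 501)
plot_update <- function(a, b) {
  plot(theta, dbeta(theta, a + k, b + N - k), type = "l",
       ylab = "density", main = sprintf("prior Beta(%g,%g)", a, b))
  lines(theta, dbeta(theta, a, b), lty = 2)   # dashed line: the prior
}
plot_update(1, 1)   # uniform prior
plot_update(7, 3)   # biased towards successes
plot_update(3, 7)   # biased towards losses
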

independence of stopping rule

 

  • \(P(\theta \mid D)\) is independent of the stopping criterion used during data collection
  • any constant factor \(X\) in the likelihood cancels out:

 

\[ \begin{align*} P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{ \frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{ \ \frac{1}{X}\ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{P(\theta) \ \frac{1}{X}\ P(D \mid \theta)}{ \int_{\theta'} P(\theta') \ \frac{1}{X}\ P(D \mid \theta')} \end{align*} \]
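
for instance, the binomial and negative binomial likelihoods for \(k=7\) successes in \(N=24\) flips differ only by the constant factor \(\frac{k}{N}\), so they yield the same posterior (a sketch over a grid of \(\theta\) values):

k <- 7
N <- 24
theta <- seq(0.005, 0.995, by = 0.005)            # grid approximation
prior <- rep(1 / length(theta), length(theta))    # flat prior
normalize <- function(x) x / sum(x)
binom_lh <- choose(N, k) * theta^k * (1 - theta)^(N - k)
negbin_lh <- (k / N) * binom_lh                   # same up to a constant
all.equal(normalize(prior * binom_lh), normalize(prior * negbin_lh))
## [1] TRUE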

properties of Bayesian inference

extensional

  • independent of methods for data collection, as long as these do not influence \(P(D \mid \theta)\)

 

doxastic

  • is about subjective beliefs:
    • naturally interpretable
    • highly informative, distributional information
    • feeds directly into Bayesian decision theory

estimation
comparison
criticism

outlook

parameter estimation: what to conclude from the data given the model?

  • maximum likelihood
  • full Bayesian inference
  • credible intervals

model comparison: which of several models is better?

  • information criteria
  • Bayes factors

model criticism: is my model any good?

  • \(p\)-values
    • prior, posterior and classical
  • posterior predictive checks

preview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

estimation

credible interval

 

for distribution \(P(x)\) over \(X\), the \(n\)% credible interval is a subset \(Y \subseteq X\) such that:

  1. \(P(Y) = \frac{n}{100}\), and
  2. no point outside of \(Y\) is more likely than any point within


intuition: the range of values we are justified to believe in (categorically).

[alternative terminology: highest density interval (HDI), credible region, …]

examples

[figure: Kruschke (2014), Fig. 5.3]

posterior credible \(\theta\)'s

  • observation: \(k = 7\), \(N = 24\)
  • model:
    • \(\theta \sim Beta(1,1)\)
    • \(k \sim Binomial(\theta, N)\)
  • posterior: \(P(\theta \mid k=7, N = 24)\)
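
a sketch for computing the 95% HDI of this posterior in R (the posterior is \(Beta(k+1, N-k+1)\); the helper hdi_width is ours and assumes a unimodal posterior):

k <- 7
N <- 24
# width of an interval holding 95% of the posterior mass, as a function
# of the mass lying below its lower end
hdi_width <- function(lo) {
  qbeta(lo + 0.95, k + 1, N - k + 1) - qbeta(lo, k + 1, N - k + 1)
}
lo <- optimize(hdi_width, c(0, 0.05))$minimum   # narrowest such interval
round(qbeta(c(lo, lo + 0.95), k + 1, N - k + 1), 2)
## [1] 0.14 0.48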

ROPEs and credible values

 

regions of practical equivalence (ROPE)

  • small regions \([\theta - \epsilon, \theta + \epsilon]\) around each \(\theta\)
    • values (practically) indistinguishable from \(\theta\)

 

credible values

  • value \(\theta\) is rejectable if its ROPE lies entirely outside the posterior HDI
  • value \(\theta\) is believable if its ROPE lies entirely within the posterior HDI

[this is mainly Kruschke's (2014) approach]
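
a tiny sketch of this decision rule, reusing the 95% HDI from the estimation sketch above (the choice of \(\epsilon\) is ours):

epsilon <- 0.01
rope <- c(0.5 - epsilon, 0.5 + epsilon)   # ROPE around theta = 0.5
hdi <- c(0.14, 0.48)                      # 95% HDI for k = 7, N = 24
rope[1] > hdi[2] || rope[2] < hdi[1]      # ROPE entirely outside the HDI?
## [1] TRUE

so \(\theta = 0.5\) counts as rejectable here.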

model comparison

model comparison

 

main question

which model is better given data? (e.g., null model vs. alternative)

 

caveat

different answers may imply different notions of "model" & purposes of modeling

key notions

information criteria

  • maximum likelihood estimation
  • free parameters
  • model complexity

Bayes factors

  • Savage-Dickey method
  • Lindley paradox

Akaike information criterion

motivation

  • a model is better the higher \(P(D \mid \hat{\theta})\)
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\)
  • a model is worse the more free parameters it has
    • principle of parsimony (Ockham's razor)
  • information-theoretic notion:
    • amount of information lost if we assume the data were generated by \(\hat{\theta}\)

definition

let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2\ln P(D \mid \hat{\theta})\]

the smaller the AIC, the better the model

model comparison by AIC

example

k <- 7
N <- 24
# null model: theta fixed at 0.5 (no free parameters)
AIC_nh <- -2 * dbinom(k, N, prob = 0.5, log = TRUE)
# alternative model: theta free, MLE is k/N (one free parameter)
AIC_ah <- 2 - 2 * dbinom(k, N, prob = k/N, log = TRUE)
show(data.frame(model = c("null", "alt"),
                AIC = c(AIC_nh, AIC_ah)))
##   model      AIC
## 1  null 7.762076
## 2   alt 5.465598

weight of evidence

  • let \(AIC_i\) be model \(i\)'s AIC & let \(\Delta_i = AIC_i - \min_j AIC_j\)
  • the weight of evidence for model \(i\) is: \(w_i \propto \exp(-0.5 \Delta_i)\)
  • e.g., the evidence ratio in favor of the alternative model is \(\exp(0.5 \cdot 2.296) \approx 3.15\) (hardly compelling!)

[e.g., Burnham & Anderson (2002)]
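
a sketch of that computation, continuing the AIC example above:

AIC <- c(null = AIC_nh, alt = AIC_ah)
delta <- AIC - min(AIC)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))   # normalized Akaike weights
round(weights, 3)
##  null   alt 
## 0.241 0.759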

remarks on information criteria

  • given more and more data, repeated model selection by AIC does not guarantee ending up with the true model
  • "model" for AICs is just likelihood; no prior
  • not predictive, but "post-hoc" as \(\hat{\theta}\) depends on data
  • penalizing the raw number of parameters, as AIC does, ignores the effective strength of each parameter
  • there are other information criteria that address some of these problems:
    • Bayesian information criterion
    • deviance information criterion

Bayes factors

  • take two models (in the sense of "model = prior + likelihood")
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we'd want to know the absolute probability of \(M_i\) given the data
    • but then we'd need to know set of all models (for normalization)
  • alternatively, we take odds of models given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.

evidence

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

evidence of model \(M_i\) (= marginal likelihood of data)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

evidence marginalizes out parameters \(\theta_i\): function of prior and likelihood

example

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
  • model specification:
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{ {{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{k} \, (1-0.5)^{N - k}}{BetaFunction(k+1, N-k+1)} \approx 0.516 \end{align*} \]

how to interpret Bayes factors

| BF(\(M_1 > M_2\)) | interpretation |
|-------------------|----------------|
| 1                 | irrelevant data |
| 1 - 3             | hardly worth ink or breath |
| 3 - 6             | anecdotal |
| 6 - 10            | now we're talking: substantial |
| 10 - 30           | strong |
| 30 - 100          | very strong |
| 100 +             | decisive (bye, bye \(M_2\)!) |

how to calculate Bayes factors

  1. calculate each model's evidence (sketched below)
    • clever math (rather than brute force)
    • grid approximation
    • by MCMC estimation
  2. calculate the Bayes factor directly:
    • transdimensional MCMC
    • Savage-Dickey method
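
a sketch for the running example (\(k=7\), \(N=24\); \(M_0\): \(\theta = 0.5\), \(M_1\): \(\theta \sim Beta(1,1)\)), computing the evidence of \(M_1\) by clever math, by grid approximation, and by Monte Carlo sampling from the prior:

k <- 7
N <- 24
ev0 <- dbinom(k, N, 0.5)              # evidence of M0: no integral needed
ev1_math <- 1 / (N + 1)               # clever math: the integral is 1/(N+1)
theta <- seq(0.0005, 0.9995, by = 0.001)
ev1_grid <- mean(dbinom(k, N, theta)) # grid approximation of the integral
ev1_mc <- mean(dbinom(k, N, runif(1e6)))   # Monte Carlo over prior samples
round(ev0 / c(ev1_math, ev1_grid, ev1_mc), 3)   # BF(M0 > M1), all ≈ 0.516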

properly nested models

  • suppose that there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]
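
for the running example this needs only the prior and posterior density at \(\theta = 0.5\) (a sketch; under \(M_1\) the posterior is \(Beta(k+1, N-k+1)\), and the \(Beta(1,1)\) prior density at 0.5 is 1):

k <- 7
N <- 24
dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)
## [1] 0.5157352

this matches the direct computation of \(\text{BF}(M_0 > M_1) \approx 0.516\) above.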

recap

Bayes rule for parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \times \underbrace{P(D \, | \, \theta)}_{\text{likelihood}}\]

Bayes factor for model comparison:

\[ \begin{align*} \underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} & = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}} \\ \underbrace{P(D \mid M_i)}_{\text{evidence}} & = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i \end{align*} \]

\(p\)-values for null-hypothesis significance testing

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

comparison of approaches

Jeffreys-Lindley paradox

k = 49581
N = 98451
show(k/N)
## [1] 0.5036109

\(p\)-value NHST

binom.test(k, N)$p.value   # two-sided exact test of theta = 0.5
## [1] 0.02364686

Savage-Dickey BF

# Savage-Dickey: posterior density of theta = 0.5 under the alternative;
# the Beta(1,1) prior density at 0.5 is 1, so the ratio is just the numerator
dbeta(0.5, k+1, N - k + 1)
## [1] 19.21139

simulation

  • let the true bias be \(\theta = 0.5\)
  • generate all possible outcomes \(k\) keeping \(N\) fixed
    • \(N \in \{ 10, 100, 1000, 10000, 100000 \}\)
    • true frequency of \(k\) is \(Binomial(k \mid N, \theta = 0.5)\)
  • look at the frequency of test results, coded thus:
|         | estimation | comparison | criticism |
|---------|------------|------------|-----------|
| \(M_0\) | \([.5-\epsilon, .5+\epsilon] \sqsubseteq\) 95% HDI or v.v. | BF(\(M_0\)>\(M_1\)) > 6 | \(p\) > 0.05 |
| \(M_1\) | \([.5-\epsilon, .5+\epsilon] \, \cap \,\) 95% HDI \(= \emptyset\) | BF(\(M_1\)>\(M_0\)) > 6 | \(p \le\) 0.05 |
| ??      | otherwise | otherwise | never |
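
a compact sketch of the comparison and criticism columns for a single sample size (assuming the BF is computed via Savage-Dickey as above; slow but straightforward):

N <- 10000
ks <- 0:N
true_freq <- dbinom(ks, N, 0.5)            # true sampling distribution
bf0 <- dbeta(0.5, ks + 1, N - ks + 1)      # Savage-Dickey BF(M0 > M1)
p <- sapply(ks, function(k) binom.test(k, N)$p.value)
sum(true_freq[bf0 > 6])                    # probability of selecting M0
sum(true_freq[p > 0.05])                   # probability of retaining the NH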

results

Bayes factor model comparison selects \(M_0\) correctly with probability 0.986 for \(N = 10000\), and with 0.996 for \(N = 100000\).

[cf. Lindley's solution to the 'paradox': adjust the \(p\)-value threshold depending on \(N\); similarly for ROPE's \(\epsilon\)]

wrap-up

notions covered

  • NHST \(p\)-value logic

 

  • posterior inference
    • credible intervals, ROPE

 

  • model comparison
    • Bayes factors