- basics
- joint distributions & marginalization
- conditional probability & Bayes rule
- 3 pillars of Bayesian data analysis:
  - estimation
  - comparison
  - prediction
- parameter estimation for coin flips
- conjugate priors
- highest density interval
\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \]
\[ \definecolor{mygray}{RGB}{128,128,128} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \]
\[ \newcommand{\set}[1]{\{#1\}} \]
\[ \newcommand{\tuple}[1]{\langle#1\rangle} \]
\[ \newcommand{\States}{{T}} \]
\[ \newcommand{\state}{{t}} \]
\[ \newcommand{\pow}[1]{{\mathcal{P}(#1)}} \]
definition of conditional probability:
\[P(X \, | \, Y) = \frac{P(X \cap Y)}{P(Y)}\]
Bayes rule (a direct consequence of the definition of conditional probability):
\[P(X \, | \, Y) = \frac{P(Y \, | \, X) \ P(X)}{P(Y)}\]
version for data analysis:
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
joint probability distribution as a two-dimensional matrix:
knitr::kable(prob2ds)
| | blond | brown | red | black |
|---|---|---|---|---|
| blue | 0.03 | 0.04 | 0.00 | 0.41 |
| green | 0.09 | 0.09 | 0.05 | 0.01 |
| brown | 0.04 | 0.02 | 0.09 | 0.13 |
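(prob2ds is not defined in this excerpt; a minimal sketch of how such a joint distribution could be constructed in R, with eye color as rows and hair color as columns:)

# hypothetical construction of the joint distribution P(eye color, hair color)
prob2ds = matrix(c(0.03, 0.04, 0.00, 0.41,
                   0.09, 0.09, 0.05, 0.01,
                   0.04, 0.02, 0.09, 0.13),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("blue", "green", "brown"),
                                 c("blond", "brown", "red", "black")))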
marginal distribution over eye color:
rowSums(prob2ds)
##  blue green brown
##  0.48  0.24  0.28
joint probability distribution as a two-dimensional matrix:
prob2ds
##       blond brown  red black
## blue   0.03  0.04 0.00  0.41
## green  0.09  0.09 0.05  0.01
## brown  0.04  0.02 0.09  0.13
conditional probability given blue eyes:
prob2ds["blue",] %>% (function(x) x/sum(x))
##      blond      brown        red      black
## 0.06250000 0.08333333 0.00000000 0.85416667
model likelihood \(P(D \, | \, \theta)\):
likelihood
##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ   0  0.33   0.5  0.67   1
## fail   1  0.67   0.5  0.33   0
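(likelihood is likewise not defined in this excerpt; a minimal sketch that reproduces the table above, assuming a single coin flip and five candidate bias values:)

# hypothetical construction: P(outcome | theta) for one flip at five candidate theta values
theta = c(0, 1/3, 1/2, 2/3, 1)
likelihood = rbind(succ = theta, fail = 1 - theta)
colnames(likelihood) = c("t=0", "t=1/3", "t=1/2", "t=2/3", "t=1")
round(likelihood, 2)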
weighing in the prior \(P(\theta)\) (here uniform: \(P(\theta) = 0.2\) for each of the five values):
prob2d = likelihood * 0.2
prob2d
##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ 0.0 0.066   0.1 0.134 0.2
## fail 0.2 0.134   0.1 0.066 0.0
back to start: the joint probability distribution as a two-dimensional matrix again
Bayes rule: \(P(\theta \, | \, D) \propto P(\theta) \times P(D \, | \, \theta)\)
prob2d
##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ 0.0 0.066   0.1 0.134 0.2
## fail 0.2 0.134   0.1 0.066 0.0
posterior \(P(\theta \, | \, \text{heads})\) after one success:
prob2d["succ",] %>% (function(x) x / sum(x))
##   t=0 t=1/3 t=1/2 t=2/3   t=1
## 0.000 0.132 0.200 0.268 0.400
this section is for overview and outlook only
we will deal with this in detail later
given model and data, which parameter values should we believe in?
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
which of two models is more likely, given the data?
\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
which future observations do we expect (after seeing some data)?
prior predictive
\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]
posterior predictive
\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]
requires sampling distribution (more on this later)
special case: prior/posterior predictive \(p\)-value (model criticism)
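for the discrete grid of \(\theta\) values from the coin flip example above, both integrals reduce to sums; a minimal sketch (assuming the uniform prior and single-flip likelihood used there):

theta = c(0, 1/3, 1/2, 2/3, 1)
prior = rep(0.2, 5)
# prior predictive: P(success) = sum over theta of P(theta) * P(success | theta)
sum(prior * theta)
# posterior predictive: P(success | one observed success)
posterior = prior * theta / sum(prior * theta)
sum(posterior * theta)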
focus on parameter estimation first
look at computational tools for efficiently calculating posterior \(P(\theta \mid D)\)
use clever theory to reduce model comparison to parameter estimation
recap: binomial distribution:
\[ B(k ; n, \theta) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]
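in R this is dbinom; e.g., for \(k = 7\) successes in \(n = 24\) flips (the example used below) and \(\theta = 0.5\):

dbinom(7, size = 24, prob = 0.5)        # B(7; 24, 0.5)
choose(24, 7) * 0.5^7 * (1 - 0.5)^17    # the same value, from the formula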
parameter estimation problem
\[ P(\theta \mid k, n) = \frac{P(\theta) \ B(k ; n, \theta)}{\int P(\theta') \ B(k ; n, \theta') \ \text{d}\theta'} \]
hey!?! what about the \(p\)-problems, sampling distributions etc.?
claim: estimation of \(P(\theta \mid D)\) is independent of assumptions about the sample space and the sampling procedure
proof
any normalizing constant \(X\) cancels out:
\[
\begin{align*}
P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\
& = \frac{\frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{\frac{1}{X} \ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\
& = \frac{P(\theta) \ \frac{1}{X} \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ \frac{1}{X} \ P(D \mid \theta')}
\end{align*}
\]
what if \(\theta\) is allowed to take any value \(\theta \in [0;1]\)?
two problems: the prior can no longer be given as a finite table of values, and the normalizing integral must be solved
one solution: a conjugate prior (the beta distribution, introduced next)
the beta distribution has 2 shape parameters \(a, b > 0\) and is defined over the domain \([0;1]\)
\[\text{Beta}(\theta \, | \, a, b) \propto \theta^{a-1} \, (1-\theta)^{b-1}\]
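in R the density is dbeta; a quick way to get a feel for the shape parameters:

theta = seq(0, 1, by = 0.25)
round(dbeta(theta, 1, 1), 2)   # a = b = 1: flat (uniform) over [0;1]
round(dbeta(theta, 2, 2), 2)   # a = b = 2: symmetric, peaked at 0.5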
if prior \(P(\theta)\) and posterior \(P(\theta \, | \, D)\) are of the same family, they are called conjugate, and the prior \(P(\theta)\) is called a conjugate prior for the likelihood function \(P(D \, | \, \theta)\) from which the posterior \(P(\theta \, | \, D)\) is derived
claim: the beta distribution is the conjugate prior of a binomial likelihood function
proof
\[
\begin{align*}
P(\theta \mid \tuple{k, n}) & \propto B(k ; n, \theta) \ \text{Beta}(\theta \, | \, a, b) \\
P(\theta \mid \tuple{k, n}) & \propto \theta^{k} \, (1-\theta)^{n-k} \, \theta^{a-1} \, (1-\theta)^{b-1} \ \ = \ \ \theta^{k + a - 1} \, (1-\theta)^{n-k + b - 1} \\
P(\theta \mid \tuple{k, n}) & = \text{Beta}(\theta \, | \, k + a, n-k + b)
\end{align*}
\]
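a quick numerical sanity check of this result (a sketch, using the example values \(k = 7\), \(n = 24\) and a \(\text{Beta}(1,1)\) prior from below):

theta = seq(0.01, 0.99, by = 0.01)
# grid posterior via Bayes rule, normalized over the grid
grid_post = dbeta(theta, 1, 1) * dbinom(7, 24, theta)
grid_post = grid_post / sum(grid_post)
# conjugacy result: Beta(1 + 7, 1 + 24 - 7), normalized the same way
conj_post = dbeta(theta, 8, 18)
conj_post = conj_post / sum(conj_post)
all.equal(grid_post, conj_post)   # TRUE, up to floating point error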
posterior is a "compromise" between prior and likelihood
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
given distribution \(P(\cdot) \in \Delta(X)\), the 95% highest density interval is a subset \(Y \subseteq X\) such that:
1. \(Y\) covers at least 95% of the probability: \(P(Y) \geq 0.95\)
2. every member of \(Y\) is at least as probable as any non-member: \(P(y) \geq P(x)\) for all \(y \in Y\), \(x \notin Y\)
intuition: the range of values we are justified to believe in (categorically)
caveat: NOT the same as the 2.5%-97.5% quantile range!!
observed: \(k = 7\) successes in \(n = 24\) flips;
prior: \(\theta \sim \text{Beta}(1,1)\)
problems: the posterior is now a continuous distribution, so its HDI cannot be read off a table and has no simple closed form
solution: approximate the HDI numerically, e.g., from posterior samples (see the sketch below)
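a minimal sampling-based sketch: by the conjugacy result above, the posterior is \(\text{Beta}(8, 18)\), so we can approximate its 95% HDI as the narrowest interval containing 95% of a large posterior sample:

# draw samples from the posterior Beta(1 + 7, 1 + 24 - 7)
samples = sort(rbeta(1e5, 8, 18))
n95 = floor(0.95 * length(samples))
# width of every interval spanning 95% of the samples
widths = samples[(n95 + 1):length(samples)] - samples[1:(length(samples) - n95)]
# the HDI is (approximately) the narrowest such interval
i = which.min(widths)
c(lower = samples[i], upper = samples[i + n95])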
- Tuesday
- Friday
- obligatory:
  - prepare Kruschke chapter 7
  - finish first homework set: due Friday before class