\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,34,34} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]
frequentist
Bayesian
definition of conditional probability:
\[P(X \, | \, Y) = \frac{P(X \cap Y)}{P(Y)}\]
Bayes' rule (follows from the definition of conditional probability):
\[P(X \, | \, Y) = \frac{P(Y \, | \, X) \ P(X)}{P(Y)}\]
version for data analysis:
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
likelihood \(P(D \, | \, \theta)\):
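The matrix is printed below without the code that builds it; a minimal sketch of one way it could be defined, with one column per candidate bias value \(\theta \in \{0, 1/3, 1/2, 2/3, 1\}\) and probabilities rounded to two decimals as in the printout:

# hypothetical reconstruction: P(D | theta) for a single coin flip,
# one column per candidate bias value (rounded to two decimals)
likelihood = rbind(heads = c(0, 0.33, 0.5, 0.67, 1),
                   tails = c(1, 0.67, 0.5, 0.33, 0))
colnames(likelihood) = c("t=0", "t=1/3", "t=1/2", "t=2/3", "t=1")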
likelihood
##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads   0  0.33   0.5  0.67   1
## tails   1  0.67   0.5  0.33   0
prior \(P(\theta)\):
prior = rep(1, times = ncol(likelihood)) / ncol(likelihood)
names(prior) = c("t=0", "t=1/3", "t=1/2", "t=2/3", "t=1")
prior
##   t=0 t=1/3 t=1/2 t=2/3   t=1
##   0.2   0.2   0.2   0.2   0.2
joint-probability \(P(D, \theta)\):
# element-wise product; correct here only because the prior is uniform
# (for a non-uniform prior, scale the columns explicitly, e.g. sweep(likelihood, 2, prior, "*"))
joint_probability = likelihood * prior
joint_probability
##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads 0.0 0.066   0.1 0.134 0.2
## tails 0.2 0.134   0.1 0.066 0.0
Bayes' rule: \(P(\theta \, | \, D) \propto P(\theta) \times P(D \, | \, \theta)\)
joint_probability
##       t=0 t=1/3 t=1/2 t=2/3 t=1
## heads 0.0 0.066   0.1 0.134 0.2
## tails 0.2 0.134   0.1 0.066 0.0
posterior \(P(\theta \, | \, D = \text{heads})\)
joint_probability["heads",] %>%
(function(x) x / sum(x))
##   t=0 t=1/3 t=1/2 t=2/3   t=1
## 0.000 0.132 0.200 0.268 0.400
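The same normalization can be applied to every possible observation at once; a small base-R sketch using only the objects defined above:

# divide each row of the joint probability by its row sum
# to get P(theta | D) for D = heads and D = tails in one step
sweep(joint_probability, 1, rowSums(joint_probability), "/")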
this section is for overview and outlook only
we will deal with this in detail later
given model and data, which parameter values should we believe in?
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
which of two models is more likely, given the data?
\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
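For a concrete, purely illustrative example with the coin data used later (\(k = 7\) heads in \(n = 24\) flips), here is a hedged sketch comparing a point model \(M_1\!: \theta = 0.5\) against a model \(M_2\) with a uniform prior over \(\theta\); the model labels and data are assumptions of this sketch only:

k = 7; n = 24
marg_lik_M1 = dbinom(k, size = n, prob = 0.5)        # P(D | M1): theta fixed at 0.5
marg_lik_M2 = integrate(function(theta) dbinom(k, size = n, prob = theta),
                        lower = 0, upper = 1)$value  # P(D | M2): uniform prior integrated out
bayes_factor = marg_lik_M1 / marg_lik_M2             # > 1 favors M1, < 1 favors M2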
which future observations do we expect (after seeing some data)?
prior predictive
\[ P(D_{\text{future}}) = \int P(\theta) \ P(D_{\text{future}} \mid \theta) \ \text{d}\theta \]
posterior predictive
\[ P(D_{\text{future}} \mid D_{\text{past}}) = \int P(\theta \mid D_{\text{past}}) \ P(D_{\text{future}} \mid \theta) \ \text{d}\theta \]
requires sampling distribution (more on this later)
special case: prior/posterior predictive \(p\)-value (model criticism)
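With the five discrete \(\theta\) values from the coin example above, both integrals reduce to sums; a minimal sketch reusing `likelihood`, `prior`, and `joint_probability` from before:

# prior predictive probability that the next flip lands heads
sum(prior * likelihood["heads", ])

# posterior predictive after having observed one "heads"
posterior = joint_probability["heads", ] / sum(joint_probability["heads", ])
sum(posterior * likelihood["heads", ])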
focus on parameter estimation first
look at computational tools for efficiently calculating posterior \(P(\theta \mid D)\)
use clever theory to reduce model comparison to parameter estimation
recap: binomial distribution:
\[ B(k ; n, \theta) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]
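In R this is `dbinom`; a quick check against the formula (the numbers \(k = 7\), \(n = 24\) are just for illustration):

dbinom(7, size = 24, prob = 0.5)               # built-in binomial likelihood
choose(24, 7) * 0.5^7 * (1 - 0.5)^(24 - 7)     # same value, computed from the formula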
parameter estimation problem
\[ P(\theta \mid k, n) = \frac{P(\theta) \ B(k ; n, \theta)}{\int P(\theta') \ B(k ; n, \theta') \ \text{d}\theta'} \]
hey!?! what about the \(p\)-problems, sampling distributions etc.?
claim: \(P(\theta \mid D)\) is the same whether we stopped after a fixed number of flips (\(n=24\), binomial sampling) or after a fixed number of heads (\(k=7\), negative binomial sampling)
proof
The two likelihood functions differ only by a constant factor \(X\) that does not depend on \(\theta\), and any such factor cancels out:
\[ \begin{align*} P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{ \frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{ \ \frac{1}{X}\ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{P(\theta) \ \frac{1}{X}\ P(D \mid \theta)}{ \int_{\theta'} P(\theta') \ \frac{1}{X}\ P(D \mid \theta')} \end{align*} \]
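A small numerical check of the claim, assuming a grid of \(\theta\) values and a flat prior: the binomial likelihood (stop after \(n = 24\) flips) and the negative binomial likelihood (stop after \(k = 7\) heads) yield the same normalized posterior:

theta = seq(0.01, 0.99, by = 0.01)                    # grid of theta values, flat prior
lik_binom  = dbinom(7, size = 24, prob = theta)       # stop at n = 24 flips
lik_negbin = dnbinom(24 - 7, size = 7, prob = theta)  # stop at k = 7 heads (17 tails along the way)
post_binom  = lik_binom  / sum(lik_binom)
post_negbin = lik_negbin / sum(lik_negbin)
all.equal(post_binom, post_negbin)                    # should be TRUE: the constant factor X cancels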
what if \(\theta\) is allowed to have any value \(\theta \in [0;1]\)?
two problems: we need a prior over a continuum of values (a density rather than a finite table), and the normalizing integral in Bayes' rule has to be computed
one solution: the beta distribution
2 shape parameters \(a, b > 0\), defined over domain \([0;1]\)
\[\text{Beta}(\theta \, | \, a, b) \propto \theta^{a-1} \, (1-\theta)^{b-1}\]
if the prior \(P(\theta)\) and the posterior \(P(\theta \, | \, D)\) belong to the same family of distributions, they are called conjugate distributions, and the prior \(P(\theta)\) is called a conjugate prior for the likelihood function \(P(D \, | \, \theta)\) from which the posterior \(P(\theta \, | \, D)\) is derived
claim: the beta distribution is the conjugate prior of a binomial likelihood function
proof
\[ \begin{align*} P(\theta \mid \tuple{k, n}) & \propto B(k ; n, \theta) \ \text{Beta}(\theta \, | \, a, b) \\ P(\theta \mid \tuple{k, n}) & \propto \theta^{k} \, (1-\theta)^{n-k} \, \theta^{a-1} \, (1-\theta)^{b-1} \ \ = \ \ \theta^{k + a - 1} \, (1-\theta)^{n-k +b -1} \\ P(\theta \mid \tuple{k, n}) & = \text{Beta}(\theta \, | \, k + a, n-k + b) \end{align*} \]
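A quick numerical sanity check of this result, assuming a flat \(\text{Beta}(1, 1)\) prior and the \(k = 7\), \(n = 24\) data from above:

a = 1; b = 1; k = 7; n = 24
theta = seq(0.01, 0.99, by = 0.01)
post_grid   = dbinom(k, size = n, prob = theta) * dbeta(theta, a, b)  # prior times likelihood
post_grid   = post_grid / sum(post_grid)
post_closed = dbeta(theta, k + a, n - k + b)         # claimed closed-form posterior
post_closed = post_closed / sum(post_closed)         # normalized the same way for comparison
all.equal(post_grid, post_closed)                    # should be TRUE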
posterior is a “compromise” between prior and likelihood
\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]
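One way to see the "compromise" with the conjugate Beta posterior: the posterior mean lies between the prior mean and the maximum-likelihood estimate (the prior parameters below are an arbitrary choice for illustration):

a = 4; b = 4; k = 7; n = 24             # mildly informative prior centered on 0.5
prior_mean     = a / (a + b)            # 0.5
mle            = k / n                  # ~0.29 (likelihood maximum)
posterior_mean = (k + a) / (n + a + b)  # ~0.34, between prior mean and MLE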
problems: conjugate priors exist only for special likelihood–prior pairs; in general the normalizing integral \(\int P(\theta') \ P(D \mid \theta') \ \text{d}\theta'\) has no closed form
solution:
introduction to MCMC methods (theory)
introduction to Stan (hands-on)