\[ \definecolor{firebrick}{RGB}{178,34,34} \newcommand{\red}[1]{{\color{firebrick}{#1}}} \] \[ \definecolor{mygray}{RGB}{178,178,178} \newcommand{\mygray}[1]{{\color{mygray}{#1}}} \] \[ \newcommand{\set}[1]{\{#1\}} \] \[ \newcommand{\tuple}[1]{\langle#1\rangle} \] \[\newcommand{\States}{{T}}\] \[\newcommand{\state}{{t}}\] \[\newcommand{\pow}[1]{{\mathcal{P}(#1)}}\]

key notions

  • basics
    • joint distributions & marginalization
    • conditional probability & Bayes rule
  • 3 pillars of Bayesian data analysis:
    • estimation
    • comparison
    • prediction
  • parameter estimation for coin flips
    • conjugate priors
    • highest density interval

recap

definition of conditional probability:

\[P(X \, | \, Y) = \frac{P(X \cap Y)}{P(Y)}\]

definition of Bayes rule:

\[P(X \, | \, Y) = \frac{P(Y \, | \, X) \ P(X)}{P(Y)}\]

version for data analysis:

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]

Bayes rule in multi-D

proportions of eye & hair color

joint probability distribution as a two-dimensional matrix:

  knitr::kable(prob2ds)

|      | blond| brown|  red| black|
|:-----|-----:|-----:|----:|-----:|
|blue  |  0.03|  0.04| 0.00|  0.41|
|green |  0.09|  0.09| 0.05|  0.01|
|brown |  0.04|  0.02| 0.09|  0.13|
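
the code in this section presupposes a named matrix `prob2ds`; a minimal sketch of how it could be defined (values copied from the table above; the `%>%` pipe used below comes from magrittr):

  library(magrittr)   # provides the %>% pipe used on later slides
  # rows: eye color, columns: hair color (values from the table above)
  prob2ds <- matrix(
    c(0.03, 0.04, 0.00, 0.41,
      0.09, 0.09, 0.05, 0.01,
      0.04, 0.02, 0.09, 0.13),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("blue", "green", "brown"),
                    c("blond", "brown", "red", "black"))
  )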

marginal distribution over eye color:

  rowSums(prob2ds)
##  blue green brown 
##  0.48  0.24  0.28

proportions of eye & hair color

joint probability distribution as a two-dimensional matrix:

prob2ds
##       blond brown  red black
## blue   0.03  0.04 0.00  0.41
## green  0.09  0.09 0.05  0.01
## brown  0.04  0.02 0.09  0.13

conditional probability given blue eyes:

  prob2ds["blue",] %>% (function(x) x/sum(x))
##      blond      brown        red      black 
## 0.06250000 0.08333333 0.00000000 0.85416667

model & data

  • single coin flip with unknown success bias \(\theta \in \{0, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, 1\}\)
  • flat prior beliefs: \(P(\theta) = 0.2\) for all \(\theta\)

 

model likelihood \(P(D \, | \, \theta)\):

  likelihood
##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ   0  0.33   0.5  0.67   1
## fail   1  0.67   0.5  0.33   0
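
a minimal sketch of how this matrix could be defined (using the rounded values shown above, so that later outputs match):

  theta <- c(0, 0.33, 0.5, 0.67, 1)                    # candidate bias values
  likelihood <- rbind(succ = theta, fail = 1 - theta)  # P(D | theta) per outcome
  colnames(likelihood) <- c("t=0", "t=1/3", "t=1/2", "t=2/3", "t=1")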

factoring in the prior \(P(\theta)\):

  prob2d = likelihood * 0.2
  prob2d
##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ 0.0 0.066   0.1 0.134 0.2
## fail 0.2 0.134   0.1 0.066 0.0

back to the start: a joint probability distribution as a 2D matrix again

model, data & Bayesian inference

Bayes rule: \(P(\theta \, | \, D) \propto P(\theta) \times P(D \, | \, \theta)\)

  prob2d
##      t=0 t=1/3 t=1/2 t=2/3 t=1
## succ 0.0 0.066   0.1 0.134 0.2
## fail 0.2 0.134   0.1 0.066 0.0

posterior \(P(\theta \, | \, \text{heads})\) after one success:

  prob2d["succ",] %>% 
    (function(x) x / sum(x))
##   t=0 t=1/3 t=1/2 t=2/3   t=1 
## 0.000 0.132 0.200 0.268 0.400
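
sanity check: the normalizing constant used here is just the row sum, i.e. the marginal probability of observing a success under the prior:

  sum(prob2d["succ",])
## [1] 0.5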

3 pillars of BDA

caveat

 

this section is for overview and outlook only

 

we will deal with this in detail later

estimation

 

given model and data, which parameter values should we believe in?

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]

model comparison

which of two models is more likely, given the data?

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]
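
to make this concrete, a minimal sketch for two point-hypothesis models of a coin (the models and data are chosen purely for illustration, borrowing \(k = 7\), \(n = 24\) from a later example; with equal prior odds, the posterior odds equal the Bayes factor):

  # Bayes factor for M1: theta = 0.5 vs. M2: theta = 2/3,
  # given k = 7 successes in n = 24 flips
  k <- 7; n <- 24
  dbinom(k, n, prob = 0.5) / dbinom(k, n, prob = 2/3)   # BF > 1 favors M1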

prediction

 

which future observations do we expect (after seeing some data)?

 

prior predictive

\[ P(D) = \int P(\theta) \ P(D \mid \theta) \ \text{d}\theta \]

 

posterior predictive

\[ P(D \mid D') = \int P(\theta \mid D') \ P(D \mid \theta) \ \text{d}\theta \]
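
for the binomial/Beta setting used later, both integrals have a closed form (the beta-binomial); a sketch with a hypothetical helper `predictive`:

  # P(k successes in n flips) under a Beta(a, b) belief about theta;
  # with the default flat Beta(1, 1) this is the prior predictive,
  # with updated shape parameters it is the posterior predictive
  predictive <- function(k, n, a = 1, b = 1) {
    choose(n, k) * beta(k + a, n - k + b) / beta(a, b)
  }
  predictive(7, 24)                          # flat prior: 1/25 for every k
  predictive(7, 24, a = 1 + 7, b = 1 + 17)   # after observing <7, 24>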

 

   

requires sampling distribution (more on this later)

special case: prior/posterior predictive \(p\)-value (model criticism)

outlook

 

  • focus on parameter estimation first

  • look at computational tools for efficiently calculating posterior \(P(\theta \mid D)\)

  • use clever theory to reduce model comparison to parameter estimation

coin bias estimation

likelihood function for several tosses

  • success is 1; failure is 0
  • pair \(\tuple{k,n}\) is an outcome with \(k\) successes in \(n\) flips

recap: binomial distribution:

\[ B(k ; n, \theta) = \binom{n}{k} \theta^{k} \, (1-\theta)^{n-k} \]

parameter estimation problem

\[ P(\theta \mid k, n) = \frac{P(\theta) \ B(k ; n, \theta)}{\int P(\theta') \ B(k ; n, \theta') \ \text{d}\theta'} \]

hey!? what about the \(p\)-problems, sampling distributions, etc.?

parameter estimation & normalized likelihoods

claim: estimation of \(P(\theta \mid D)\) is independent of assumptions about the sample space and the sampling procedure

proof

any normalizing constant \(X\) cancels out:

\[ \begin{align*} P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{ \frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{ \ \frac{1}{X}\ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{P(\theta) \ \frac{1}{X}\ P(D \mid \theta)}{ \int_{\theta'} P(\theta') \ \frac{1}{X}\ P(D \mid \theta')} \end{align*} \]

\(\Box\)
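
a quick numerical check on the discrete grid from above (the constant \(X = 42\) is arbitrary):

  theta <- c(0, 1/3, 1/2, 2/3, 1)
  prior <- rep(0.2, length(theta))
  lik   <- dbinom(7, 24, theta)                           # binomial likelihood
  post1 <- prior * lik / sum(prior * lik)
  post2 <- prior * (lik / 42) / sum(prior * (lik / 42))   # likelihood scaled by 1/X
  all.equal(post1, post2)                                 # TRUE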

welcome infinity

what if \(\theta\) is allowed to have any value \(\theta \in [0;1]\)?

two problems

  1. how to specify \(P(\theta)\) in a concise way?
  2. how to compute normalizing constant \(\int_0^1 P(\theta) \ P(D \, | \, \theta) \, \text{d}\theta\)?

one solution

  • use beta distribution to specify prior \(P(\theta)\) with some handy parameters
  • since this is the conjugate prior to our likelihood function, computing the posterior is as easy as pie

 

 

[figure: Kruschke, Fig. 5.3]

beta distribution

2 shape parameters \(a, b > 0\), defined over domain \([0;1]\)

\[\text{Beta}(\theta \, | \, a, b) \propto \theta^{a-1} \, (1-\theta)^{b-1}\]

[figure: Kruschke, Fig. 6.1]
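
in R the density is available as `dbeta`; a small sketch (the shape values \(a = 4\), \(b = 2\) are chosen just for illustration):

  theta <- seq(0, 1, length.out = 101)
  plot(theta, dbeta(theta, shape1 = 4, shape2 = 2), type = "l",
       ylab = "Beta(theta | 4, 2)")   # mode at (a-1)/(a+b-2) = 0.75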

conjugate distributions

 

if prior \(P(\theta)\) and posterior \(P(\theta \, | \, D)\) are of the same family, they are called conjugate distributions, and the prior \(P(\theta)\) is called a conjugate prior for the likelihood function \(P(D \, | \, \theta)\) from which the posterior \(P(\theta \, | \, D)\) is derived

claim: the beta distribution is the conjugate prior of a binomial likelihood function

proof

\[ \begin{align*} P(\theta \mid \tuple{k, n}) & \propto B(k ; n, \theta) \ \text{Beta}(\theta \, | \, a, b) \\ P(\theta \mid \tuple{k, n}) & \propto \theta^{k} \, (1-\theta)^{n-k} \, \theta^{a-1} \, (1-\theta)^{b-1} \ \ = \ \ \theta^{k + a - 1} \, (1-\theta)^{n-k +b -1} \\ P(\theta \mid \tuple{k, n}) & = \text{Beta}(\theta \, | \, k + a, n-k + b) \end{align*} \]

\(\Box\)
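
in practice, posterior updating thus reduces to adding counts to the shape parameters; a one-line sketch (the helper name is ours):

  # Beta(a, b) prior plus data <k, n> gives a Beta(a + k, b + n - k) posterior
  update_beta <- function(a, b, k, n) c(shape1 = a + k, shape2 = b + n - k)
  update_beta(1, 1, 7, 24)   # flat prior + the example data below: Beta(8, 18)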

example applications

[figure: Kruschke, Fig. 6.4]

priors, likelihood & posterior

Bayes' puppies

 

posterior is a "compromise" between prior and likelihood

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \ \underbrace{P(D \, | \, \theta)}_{likelihood}\]

 

[figure: Bayes' puppies]

influence of sample size on posterior

[figure: Kruschke, Fig. 5.2]

influence of sample size on posterior

[figure: Kruschke, Fig. 5.3]

highest density intervals

highest density interval

 

given distribution \(P(\cdot) \in \Delta(X)\), the 95% highest density interval is a subset \(Y \subseteq X\) such that:

  1. \(P(Y) = .95\), and
  2. no point outside of \(Y\) is more likely than any point within.


Intuition: the range of values we are justified to believe in (categorically).

caveat: NOT the same as the 2.5%-97.5% quantile range!!

examples

[figure: Kruschke, Fig. 5.3]

example

observed: \(k = 7\) successes in \(n = 24\) flips

prior: \(\theta \sim \text{Beta}(1,1)\)

posterior (by conjugacy): \(\theta \sim \text{Beta}(8, 18)\)
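
a sketch of how this posterior and its 95% HDI could be approximated numerically (a simple grid method; packages such as HDInterval provide an `hdi()` function for the same purpose):

  k <- 7; n <- 24
  a <- 1 + k; b <- 1 + (n - k)               # posterior: Beta(8, 18)
  theta <- seq(0, 1, length.out = 1e5)
  dens  <- dbeta(theta, a, b)
  ord   <- order(dens, decreasing = TRUE)    # most probable values first
  keep  <- ord[cumsum(dens[ord]) / sum(dens) <= 0.95]
  range(theta[keep])                         # approximate 95% HDI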

the road ahead

BDA more generally

problems:

  • conjugate priors are not always available:
    • likelihood functions can come from unwieldy beasts:
      • complex hierarchical models (e.g., regression)
      • custom-made stuff (e.g., probabilistic grammars)
  • even when available, they may not be what we want:
    • prior beliefs could be different from what a conjugate prior can capture


solution:

  • approximate posterior distribution by smart numerical simulations

fini

outlook

 

Tuesday

  • introduction to MCMC methods

 

Friday

  • introduction to JAGS

to prevent boredom

 

obligatory

  • prepare Kruschke chapter 7

  • finish first homework set: due Friday before class