17.4 Bayesian \(p\)-values & model checking

The previous section showed how to approximate a \(p\)-value with Monte Carlo sampling. Notice that nothing in this sampling-based approach hinges on the model having no free parameters. Indeed, we can use the same approach to approximate so-called Bayesian predictive \(p\)-values. Bayesian predictive \(p\)-values play a useful role in Bayesian data analysis: they are one possible tool for model checking, a.k.a. model criticism.

Suppose we have a Bayesian model for the binomial 24/7 data. The model consists of the usual Binomial likelihood function, but it also includes a prior over the coin’s bias \(\theta_c\) (perhaps motivated by previous research, or obtained by training the model on a separate training data set):

\[ \theta_c \sim \text{Beta}(11,2) \]

Notice that this is a biased prior, placing most of its weight on the idea that the coin is biased towards heads. In model checking we ask whether the given model could be a plausible model for some data at hand. We are not comparing models; we just “check” or “test” (!) the model as such. Acing the test doesn’t mean that there could not be much better models. Failing the test doesn’t mean that we know of a better model (we may just have to do more thinking).
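To get a feel for how strong this prior is, here is a small supplementary check (not part of the analysis below, using only base R’s Beta functions): the prior mean is \(11/13 \approx 0.85\), and almost all of the prior’s probability mass lies above \(\theta_c = 0.5\).

# mean of the Beta(11,2) prior on theta_c
11 / (11 + 2)
# central 95% interval of a priori plausible values of theta_c
qbeta(c(0.025, 0.975), 11, 2)
# prior probability that the coin is biased towards heads
1 - pbeta(0.5, 11, 2)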

Let’s approximate a Bayesian predictive \(p\)-value for this Bayesian model and the 24/7 data. The calculations are analogous to those in the previous section.

# the tidyverse provides map_dbl() and the pipe %>% used below
library(tidyverse)

# 24/7 data
k_obs <- 7
n_obs <- 24

# specify how many Monte Carlo samples to take
x_reps <- 500000

# build a vector of likelihoods (= the relevant test statistic)
#   for hypothetical data observations, which are 
#   sampled based on the assumption that the
#   Bayesian model to be tested is true
lhs <- map_dbl(1:x_reps, function(i) {
  # hypothetical data assuming the model is true
  #   first sample from the prior
  #   then sample from the likelihood
  theta_hyp <- rbeta(1, 11, 2)
  k_hyp <- rbinom(1, size = n_obs, prob = theta_hyp)
  # likelihood of that hypothetical observation
  dbinom(k_hyp, size = n_obs, prob = theta_hyp)
})

# likelihood (= test statistic) of the observed data
#   determined using MC sampling
lh_obs <- map_dbl(1:x_reps, function(i) {
  theta_hyp <- rbeta(1, 11, 2)
  dbinom(k_obs, size = n_obs, prob = theta_hyp)
}) %>% mean()

# proportion of samples whose likelihood is lower than
#   or equal to that of the observed data (approximate p-value)
mean(lhs <= lh_obs) %>% show()
## [1] 0.000176

This Bayesian predictive \(p\)-value is very low, suggesting that this model (prior & likelihood) is NOT a good model for the 24/7 data set.
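To build intuition for why this value is so small, here is another supplementary sketch. It uses the number of heads itself as a summary statistic, rather than the likelihood, so the resulting proportion is not directly comparable to the \(p\)-value above; the variable names are ours. We sample head counts from the model’s prior predictive distribution and ask how often the model predicts 7 or fewer heads out of 24. Since the prior mean of \(\theta_c\) is about 0.85, hypothetical head counts cluster around 20 out of 24, and an observation like 7 out of 24 lies far out in the left tail.

# supplementary check: sample head counts from the prior predictive distribution
theta_prior_samples <- rbeta(x_reps, 11, 2)
k_prior_pred <- rbinom(x_reps, size = n_obs, prob = theta_prior_samples)
# proportion of hypothetical data sets with 7 or fewer heads
mean(k_prior_pred <= k_obs) %>% show()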

We can use Bayesian \(p\)-values for any Bayesian model, whether built on a prior or a posterior distribution. A common application of Bayesian \(p\)-values in model checking is the so-called posterior predictive check. We compute a Bayesian posterior for observed data \(D_\text{obs}\) and then test, via a Bayesian posterior predictive \(p\)-value, whether the trained model is actually a good model for \(D_\text{obs}\) itself. If the \(p\)-value is high, that’s no cause for hysterical glee. It just means that there is no cause for alarm. If the Bayesian posterior predictive \(p\)-value is very low, the posterior predictive test has failed, and that means that the model, even when trained on the data \(D_\text{obs}\), is NOT a good model of that very data. The model must be missing something crucial about the data \(D_\text{obs}\). Better start researching what that is and build a better model if possible.
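As a concrete illustration (a sketch, not code from the text; helper names like a_post and lhs_post are ours): for the 24/7 data and the Beta(11,2) prior from above, conjugacy gives the posterior \(\theta_c \sim \text{Beta}(11 + 7, 2 + 17) = \text{Beta}(18, 19)\), so a posterior predictive check just reruns the Monte Carlo scheme from before with posterior samples in place of prior samples.

# posterior of theta_c after observing the 24/7 data (conjugate Beta-Binomial update)
a_post <- 11 + k_obs          # = 18
b_post <- 2 + n_obs - k_obs   # = 19
# likelihoods (= test statistic) of hypothetical data sampled from the posterior predictive
lhs_post <- map_dbl(1:x_reps, function(i) {
  theta_hyp <- rbeta(1, a_post, b_post)
  k_hyp <- rbinom(1, size = n_obs, prob = theta_hyp)
  dbinom(k_hyp, size = n_obs, prob = theta_hyp)
})
# MC estimate of the test statistic for the observed data
lh_obs_post <- map_dbl(1:x_reps, function(i) {
  theta_hyp <- rbeta(1, a_post, b_post)
  dbinom(k_obs, size = n_obs, prob = theta_hyp)
}) %>% mean()
# Bayesian posterior predictive p-value
mean(lhs_post <= lh_obs_post) %>% show()

Note that the posterior mean of \(\theta_c\) is \(18/37 \approx 0.49\), so the strong prior still pulls the model well away from the observed proportion of \(7/24 \approx 0.29\).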

Most importantly, these considerations of Bayesian \(p\)-values show that frequentist testing has a clear analog in the Bayesian realm, namely model checking.