Hypothesis testing

Bayesian regression: theory & practice

Author

Michael Franke

There are different ways of addressing research hypotheses with Bayesian models. For a more in-depth treatment, see here. This unit gives code examples that showcase approaches to hypothesis testing based on the three pillars of BDA:

  1. estimation,
  2. model criticism, and
  3. model comparison.

The running example is the 24/7 coin flip case (flipping a coin \(N=24\) times and observing \(k=7\) heads), addressing the research hypothesis that the coin is fair (\(\theta = 0.5\)).

Preamble

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

Toggle code
# install packages from CRAN (unless installed)
pckgs_needed <- c(
  "tidyverse",
  "brms",
  "rstan",
  "rstanarm",
  "remotes",
  "tidybayes",
  "bridgesampling",
  "shinystan",
  "mgcv"
)
pckgs_installed <- installed.packages()[,"Package"]
pckgs_2_install <- pckgs_needed[!(pckgs_needed %in% pckgs_installed)]
if(length(pckgs_2_install)) {
  install.packages(pckgs_2_install)
} 

# install additional packages from GitHub (unless installed)
if (! "aida" %in% pckgs_installed) {
  remotes::install_github("michael-franke/aida-package")
}
if (! "faintr" %in% pckgs_installed) {
  remotes::install_github("michael-franke/faintr")
}
if (! "cspplot" %in% pckgs_installed) {
  remotes::install_github("CogSciPrag/cspplot")
}

# load the required packages
x <- lapply(pckgs_needed, library, character.only = TRUE)
library(aida)
library(faintr)
library(cspplot)

# these options help Stan run faster
options(mc.cores = parallel::detectCores())

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors() |> pull(hex)
# names(project_colors) <- cspplot::list_colors() |> pull(name)

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
   scale_fill_manual(..., values = project_colors)
}

Estimation-based testing

In estimation-based testing we use a single model \(M\), condition it on the observed data \(D\), and then ask whether the hypothesis is supported or discredited by the posterior. For a point-valued hypothesis, like \(\theta = 0.5\), we can use a region of practical equivalence (ROPE), following Kruschke. For example, we may consider a ROPE of \(0.5 \pm 0.01\). There are two salient methods:

  1. comparison against a credible interval, and
  2. direct inspection of the posterior probability.

Credible Intervals

The first approach compares the ROPE to an \(n\)% credible interval of the posterior. The plot below shows the posterior for our case, starting from a flat prior. It also shows the 95% credible interval.
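Under the hood, this uses the conjugacy of the Beta prior and the binomial likelihood: a flat Beta(1, 1) prior updated with \(k = 7\) heads out of \(N = 24\) flips yields a Beta(8, 18) posterior, which is where the shape parameters 8 and 18 in the plotting code below come from. A quick sketch of this bookkeeping:

Toggle code
k <- 7
N <- 24
# posterior shape parameters under a flat Beta(1, 1) prior: Beta(1 + k, 1 + N - k)
c(shape1 = 1 + k, shape2 = 1 + N - k)
# posterior expectation (1 + k) / (2 + N), roughly 0.308
(1 + k) / (2 + N)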

Toggle code
hdi <- HDInterval::hdi(qbeta, shape1 = 8, shape2 = 18)
hdiData <- tibble(
  theta = rep(hdi, each = 2),
  post = c(0,dbeta(hdi, 8, 18), 0)
)
expData <- tibble(
  theta = c(8/26,8/26),
  post = c(0,dbeta(8/26, 8, 18 ))
)

tibble(
  theta = seq(0.01,1, by = 0.01),
  posterior = dbeta(seq(0.01,1, by = 0.01), 8, 18 )
) %>% 
  ggplot(aes(x = theta, y = posterior)) + 
  xlim(0,1) + 
  labs(
    x = latex2exp::TeX("Bias $\\theta$"),
    y = latex2exp::TeX("Posterior probability $P_{M}(\\theta \\, | \\, D)$"),
    title = "Posterior"
  ) +
  geom_line(data = hdiData, aes(x = theta, y = post), color = project_colors[2], size = 1.5) +
  geom_label(x = 0.7, y = 0.5, label = "Cred.Int.: 0.14 - 0.48", color = project_colors[2], size = 5) +
  geom_line(data = expData, aes(x = theta, y = post), color = project_colors[1], size = 1.5) +
  geom_label(x = 0.52, y = dbeta(8/26, 8, 18 ), label = "expectation: 0.308", color = project_colors[1], size = 5) +
  geom_line(color = "black", size = 2)

Posterior probability

The second approach directly inspects the posterior probability allocated to the ROPE:

Toggle code
k <- 7
N <- 24

ROPE <- c(0.49,0.51)

# posterior probability of the ROPE
postProb_ROPE <- pbeta(ROPE[2],8,18) - pbeta(ROPE[1],8,18)

# plot
plotData <- tibble(
  theta = seq(0,1, length.out = 200),
  posterior = dbeta(theta, 8, 18)
)
plotData |> 
  ggplot(aes(x = theta, y = posterior)) + 
  geom_ribbon(aes(ymin=0, ymax=posterior), 
              fill=project_colors[2],
              alpha=0.8, 
              data=subset(plotData, theta >= 0.485 & theta <= 0.515)) +
  geom_line(size = 2) +
  xlim(0,1) + 
  geom_label(x = 0.75, y = 0.5, 
             label = str_c("Post. prob. ROPE: ", round(postProb_ROPE, 3)), 
             color = project_colors[2], size = 3) +
  labs(
    x = latex2exp::TeX("Bias $\\theta$"),
    y = latex2exp::TeX("Posterior probability $P_{M}(\\theta \\, | \\, D)$"),
    title = "Posterior"
  ) 


Prediction-based testing

This is not a common approach (in fact, I have never seen it used in practice at all), but, in principle, we can use Bayesian \(p\)-values for hypothesis testing. In the limiting case of a point-valued hypothesis \(\theta = \theta^*\), this approach reduces (in many cases) to frequentist testing. Let’s therefore first recap how a “normal” \(p\)-value for the 24/7 data set is computed.

Frequentist \(p\)-value

For a frequentist hypothesis test with the null hypothesis that \(\theta = 0.5\), we can simply use:

Toggle code
binom.test(k, 24)

    Exact binomial test

data:  k and 24
number of successes = 7, number of trials = 24, p-value = 0.06391
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1261521 0.5109478
sample estimates:
probability of success 
             0.2916667 

We can also approximate the \(p\)-value through Monte Carlo simulation, like so:

  • obtain samples of hypothetical observations \(k_{rep}\) from the likelihood function given \(\theta = 0.5\) and \(N = 24\)
  • obtain the likelihood for each \(k_{rep}\)
  • obtain the likelihood for the observed \(k_{obs}\)
  • determine the proportion of sampled \(k_{rep}\) whose likelihood is at least as low as that of \(k_{obs}\)
Toggle code
n_samples <- 1e+7
k_reps    <- rbinom(n_samples, 24, prob=0.5)
LH_k_reps <- dbinom(k_reps, 24, prob=0.5)
LH_k_obs  <- dbinom(7, 24, prob=0.5)
mean(LH_k_reps <= LH_k_obs)
[1] 0.0640344

Bayesian \(p\)-value

To get a \(p\)-value from a Bayesian model (no matter whether prior or posterior), we follow the same recipe. The only complication is that, when using the likelihood as a test statistic as in the “exact test” above, we (usually) need to use Monte Carlo simulation to approximate the likelihood of each \(k_{rep}\).
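Written out, with \(P_M(k)\) denoting the model’s (prior or posterior) predictive probability of outcome \(k\), the Bayesian \(p\)-value based on the likelihood test statistic is:

\[ p = P_{k_{rep} \sim P_M}\big(P_M(k_{rep}) \leq P_M(k_{obs})\big), \qquad P_M(k) = \int P(k \mid \theta) \, P_M(\theta) \, \text{d}\theta \]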

Without BRMS

To see the logic of our sampling scheme more clearly, let’s first look at how to compute a Bayesian \(p\)-value with MC sampling without using the BRMS package. Notice that the number of samples used to approximate the likelihood need not be large (it could even be 1), as long as we compute the likelihood based on an independent sample of parameters from the model.

Toggle code
# standard logistic (inverse logit) function
logistic <- function(x) {
  1 / (1 + exp(-x))
}

logit <- function(x) {
  log(x / (1-x))
}

epsilon <- 0.01
ROPE = c(0.5 - epsilon, 0.5 + epsilon)

get_prior_sample_theta <- function(n_samples=1) {
  sample_prior_eta   <- runif(n_samples, logit(ROPE[1]), logit(ROPE[2]))
  # sample_prior_eta   <- rep(0,n_samples)
  sample_prior_theta <- logistic(sample_prior_eta)
  # print(sample_prior_theta)
  return(sample_prior_theta)
}

get_prior_sample_k <- function(n_samples=1) {
  sample_prior_theta <- get_prior_sample_theta(n_samples)
  sample_prior_k     <- map_dbl(sample_prior_theta, function(theta) rbinom(1, 24, theta))
  return(sample_prior_k)
}

get_LH_approx <- function(k, n_samples=1) {
  dbinom(k, 24, get_prior_sample_theta(n_samples), log = FALSE) |> mean()
}

n_samples_k         <- 10000
n_samples_LH_approx <- 100

k_reps    <- get_prior_sample_k(n_samples_k)
LH_k_reps <- map_dbl(k_reps, function(k) get_LH_approx(k, n_samples_LH_approx))
LH_k_obs  <- get_LH_approx(k, 500000)
mean(LH_k_reps <= LH_k_obs)
[1] 0.0415
Exercise 1: Exploring effects of the ROPE

The Bayesian \(p\)-value for a ROPE-d hypothesis is different from the previous \(p\)-value for the point-valued hypothesis. Set a narrower ROPE to sanity-check whether we retrieve the same value with this latter code as well. Can you explain to yourself why the wider ROPE has the categorical (!?) effect of moving the estimated \(p\)-value below the “magical” (!?!) boundary of 0.05?

Values of \(\theta\) higher than 0.5 make the observed data ever more unlikely. They therefore contribute strongly to the result.
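As a starting point for the first part of this exercise, one can simply shrink the ROPE and re-run the sampling steps from above unchanged (a sketch; the helper functions pick up the new ROPE from the global environment):

Toggle code
epsilon <- 0.0001
ROPE    <- c(0.5 - epsilon, 0.5 + epsilon)

k_reps    <- get_prior_sample_k(n_samples_k)
LH_k_reps <- map_dbl(k_reps, function(k) get_LH_approx(k, n_samples_LH_approx))
LH_k_obs  <- get_LH_approx(7, 500000)
# with such a narrow ROPE, this should land close to the point-valued
# p-value of roughly 0.064 computed earlier
mean(LH_k_reps <= LH_k_obs)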

With BRMS

We can also implement the same sampling procedure with BRMS. For the example at hand, this is actually overkill (in fact, rather inefficient). Nevertheless, this simple example shows how the logic of \(p\)-value testing can be implemented on top of BRMS also for more complicated models.

First, we need to set up the model and obtain samples from its prior. Crucially, we need to use a binomial regression model, because we want to use the binomial distribution as the likelihood function. (This is because under \(\theta = 0.5\) every sequence of failures and successes of a fixed length is exactly equally likely. It is only when we look at more global properties of the sequence, like the number of heads, that we can get a grip on whether this aspect of the sequence is compatible with \(\theta = 0.5\).)

Toggle code
data_24_7_binomial <- 
  tibble(k = 7, N = 24)

fit_logistic_prior <- brms::brm(
  formula = k | trials(N) ~ 1,
  data = data_24_7_binomial,
  family = binomial(link = "logit"),
  # this is an extremely stupid "unBayesian prior" !!!
  #   but it's the hypothesis we are interested in
  prior = brms::prior("uniform(-0.04000533, 0.04000533)", 
                      class = "Intercept", 
                      lb = -0.04000533, ub = 0.04000533),
  sample_prior = "only",
  iter = 5000,
  warmup = 1000
)

We then collect samples from the prior predictive, obtain the likelihood for all of these and compare these to the likelihood of the data, just like we did before, but now using BRMS for sampling \(k_{reps}\) and approximating their likelihood.

Toggle code
n_samples_k         <- 1e+04
n_samples_LH_approx <- 100

priorPred_samples <- tidybayes::add_predicted_draws(
  fit_logistic_prior,
  newdata = data_24_7_binomial |> select(-k),
  ndraws = n_samples_k,
  value = "k"
) |> ungroup() |> 
  select(.draw, k, N)

# extract likelihood for observations (approximated w/ MC sampling)
get_LH <- function(k, N, ndraws = n_samples_LH_approx) {
  brms::log_lik(
    object  = fit_logistic_prior,
    newdata = tibble(k = k, N = N),
    ndraws  = ndraws) |> 
    exp() |> 
    mean()
}

# get likelihood of predictive samples
# (this may take a while!!)
LH_predictions <- priorPred_samples |> 
  group_by(.draw) |> 
  summarize(LH_post_pred = get_LH(k, N)) |> 
  pull(LH_post_pred)

# get likelihood of data
LH_data <- get_LH(k = 7, N = 24, ndraws = 16000)

# Bayesian $p$-values with LH as test statistic
print(mean(LH_predictions <= LH_data))
[1] 0.0401

Comparison-based testing

While the previous approaches all used just one model, the logic of comparison-based testing is to compare two models: one that expresses the relevant hypothesis, and another that expresses a relevant alternative. The choice of alternative is not innocuous, as we will see.

Bayes factors

First, we will explore comparison with Bayes Factors, in particular using the (generalized) Savage-Dickey method. Following this approach, we consider an encompassing model \(M_e\) which contains the (interval-based) null hypothesis \(I_0\) and its alternative \(I_1\) (in the sense that it puts positive probability on both). We can then compute the Bayes factor in favor of hypothesis \(I_0\) as:

\[ BF_{01} = \frac{\text{posterior odds of $I_0$ in $M_e$}}{\text{prior odds of $I_0$ in $M_e$}} \]
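For a point-valued hypothesis \(\theta = \theta^*\) (the limit of an ever narrower ROPE), this reduces to the classical Savage-Dickey density ratio, i.e., the ratio of posterior to prior density at \(\theta^*\):

\[ BF_{01} = \frac{P_{M_e}(\theta^* \mid D)}{P_{M_e}(\theta^*)} \]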

Here is how we can calculate this for the 24/7 data using the Beta-binomial model:

Toggle code
# set the scene
k <- 7
N <- 24
theta_null <- 0.5
epsilon <- 0.01                 # epsilon margin for ROPE
upper <- theta_null + epsilon   # upper bound of ROPE
lower <- theta_null - epsilon   # lower bound of ROPE
alpha <- 1                      # prior beta parameter
beta  <- 1                      # prior beta parameter
# calculate prior odds of the ROPE-d hypothesis
prior_of_hypothesis <- pbeta(upper, alpha, beta) - pbeta(lower, alpha, beta)
prior_odds <- prior_of_hypothesis / (1 - prior_of_hypothesis)
# calculate posterior odds of the ROPE-d hypothesis
posterior_of_hypothesis <- pbeta(upper, alpha + k, beta + (N-k)) - pbeta(lower, alpha + k, beta + (N-k))
posterior_odds <- posterior_of_hypothesis / (1 - posterior_of_hypothesis)
# calculate Bayes factor
bf_ROPEd_hypothesis <- posterior_odds / prior_odds
bf_ROPEd_hypothesis
[1] 0.5133012

And here is how this can be calculated using a logistic regression instead. We first set up the model and fit it to the data. We use a normal prior for the Intercept, with values that very roughly induce a uniform distribution on the a priori predicted central tendency (= coin bias). (NB: the prior predictive of the central tendency in this model is a logit-normal distribution.) We use a lot of samples to achieve better accuracy of our estimates.
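Note that this model is fit to the 24/7 data in long (Bernoulli) format, with one row per coin flip. The tibble data_24_7 is assumed to be available from earlier parts of the course; a minimal construction consistent with the description of the data (7 heads coded as 1, 17 tails coded as 0) would be:

Toggle code
# hypothetical (re)construction of the 24/7 data in long format:
# 24 flips, 7 of which came up heads (coded as 1)
data_24_7 <- tibble(
  outcome = rep(c(1, 0), times = c(7, 17))
)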

Toggle code
fit_logistic_posterior <- brms::brm(
  formula = outcome ~ 1, 
  data = data_24_7,
  family = brms::bernoulli(link = "logit"),
  prior = brms::prior("normal(0,1.8)", 
                      class = "Intercept"),
  iter = 50000,
  warmup = 1000
)

We then collect samples from the prior over parameters as well:

Toggle code
fit_logistic_prior <- stats::update(
  fit_logistic_posterior, 
  sample_prior = "only",
  iter = 50000,
  warmup = 1000)

Let’s do a quick sanity check and visualize the prior distribution of the predictor of central tendency (the coin’s bias) implied by our choice of prior on the intercept:

Toggle code
# sanity check: prior distribution of central tendency
fit_logistic_prior |> 
  tidybayes::add_epred_draws(newdata = data_24_7) |> 
  ggplot(aes(x = .epred)) + 
  geom_density()

We then approximate the probability of \(I_0\) and \(I_1\) from both the prior and posterior samples:

Toggle code
prior_samples_Intercept <- fit_logistic_prior |> 
  tidybayes::tidy_draws() |> 
  pull(b_Intercept)

posterior_samples_Intercept <- fit_logistic_posterior |> 
  tidybayes::tidy_draws() |> 
  pull(b_Intercept)

# map the Intercept samples from the logit scale onto the probability scale
# and check whether they fall inside the ROPE [lower, upper] = [0.49, 0.51]
P_I0_prior <- mean(plogis(prior_samples_Intercept) >= lower & plogis(prior_samples_Intercept) <= upper)
P_I1_prior <- 1 - P_I0_prior

P_I0_posterior <- mean(plogis(posterior_samples_Intercept) >= lower & plogis(posterior_samples_Intercept) <= upper)
P_I1_posterior <- 1 - P_I0_posterior

# Bayes factor

(P_I0_posterior / P_I1_posterior) / (P_I0_prior / P_I1_prior)
[1] 0.3651219

LOO

As an alternative to Bayes factors, we can compare the null and the alternative model by their estimated predictive fit, using approximate leave-one-out cross-validation (LOO-CV). We first fit both models:

Toggle code
fit_logistic_alternative <- brms::brm(
  formula = outcome ~ 1, 
  data = data_24_7,
  family = brms::bernoulli(link = "logit"),
  prior = brms::prior("normal(0,1.8)", 
                      class = "Intercept"),
  iter = 50000,
  warmup = 1000
)

fit_logistic_null <- brms::brm(
  formula = outcome ~ 1, 
  data = data_24_7,
  family = brms::bernoulli(link = "logit"),
  prior = brms::prior("uniform(0.49,0.51)", 
                      class = "Intercept",
                      lb=0.49, ub=0.51),
  iter = 50000,
  warmup = 1000
)
Toggle code
loo_comp <- loo_compare(list(
  alternative = loo(fit_logistic_alternative), 
  null = loo(fit_logistic_null)))
loo_comp
            elpd_diff se_diff
alternative  0.0       0.0   
null        -4.4       3.2   

It seems that the alternative model is better.

Checking whether this difference is substantial relative to its standard error:

Toggle code
1 - pnorm(-loo_comp[2,1], sd = loo_comp[2,2])
[1] 0.1192925

The brms::hypothesis function

The brms::hypothesis function allows you to test various hypotheses in a seemingly more streamlined and convenient manner. Here is how to use it. We first fit a model which collects prior and posterior samples.

Toggle code
# sample BOTH posterior and prior
fit_logistic_post_and_prior <- brms::brm(
  formula = outcome ~ 1, 
  data = data_24_7,
  family = brms::bernoulli(link = "logit"),
  prior = brms::prior("normal(0,1.8)", 
                      class = "Intercept"),
  iter = 50000,
  warmup = 1000,
  sample_prior = "yes"
)

Then we call the function to test a point-valued or an interval-based hypothesis. NB: for interval-based hypotheses, the input has to be an inequality, but a ROPE-based hypothesis is easily translated into an inequality.

Toggle code
# point-valued hypothesis test
fit_logistic_post_and_prior |> brms::hypothesis(hypothesis = "Intercept = 0")
Hypothesis Tests for class b:
       Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
1 (Intercept) = 0    -0.87      0.44    -1.77    -0.05         NA        NA
  Star
1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.
Toggle code
# ROPE-valued hypothesis test
fit_logistic_post_and_prior |> brms::hypothesis(hypothesis = "abs(Intercept -0.5) < 0.01")
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (abs(Intercept-0.... < 0     1.36      0.44     0.66     2.11          0
  Post.Prob Star
1         0     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Sadly, this function is fickle and its output can be misleading. It may be good advice to just stay away from it. Here is why.

First, if we test a point-valued hypothesis and supply a model that has samples for both prior and posterior, the function computes a Bayes factor (using Savage-Dickey), which is displayed in the output as "Evid.Ratio". Unfortunately, this does not work for parameters for which BRMS does not provide prior samples when using sample_prior = "yes" (notice that we extracted prior samples earlier with sample_prior = "only" instead). This includes our current case of a (naively construed) intercept-only model.

Second, when testing a directional hypothesis, the output shows the posterior odds in the "Evid.Ratio" field without taking the prior into account. This is misleading terminology and an unfortunate format of presentation.
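To see the second point concretely, one can check that, for a directional hypothesis, the reported evidence ratio coincides with the posterior odds computed directly from the posterior draws (a minimal sketch; the hypothesis "Intercept < 0" is only for illustration):

Toggle code
# evidence ratio reported for a one-sided hypothesis ...
brms::hypothesis(fit_logistic_post_and_prior, "Intercept < 0")

# ... coincides (up to rounding in the printed output) with the posterior odds,
# computed from the posterior draws without any reference to the prior
post_intercept <- fit_logistic_post_and_prior |> 
  tidybayes::tidy_draws() |> 
  pull(b_Intercept)
mean(post_intercept < 0) / mean(post_intercept >= 0)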

Indeed, for our current running example, the function is not useful at all.