Stochastic independence

Bayesian regression: theory & practice

Author

Michael Franke

Preamble

Here is code to load (and if necessary, install) required packages, and to set some global options (for plotting and efficient fitting of Bayesian models).

Toggle code

# install packages from CRAN (unless installed)
pckgs_needed <- c(
  "tidyverse",
  "brms",
  "rstan",
  "rstanarm",
  "remotes",
  "tidybayes",
  "bridgesampling",
  "shinystan",
  "mgcv"
)
pckgs_installed <- installed.packages()[,"Package"]
pckgs_2_install <- pckgs_needed[!(pckgs_needed %in% pckgs_installed)]
if(length(pckgs_2_install)) {
  install.packages(pckgs_2_install)
} 

# install additional packages from GitHub (unless installed)
if (! "aida" %in% pckgs_installed) {
  remotes::install_github("michael-franke/aida-package")
}
if (! "faintr" %in% pckgs_installed) {
  remotes::install_github("michael-franke/faintr")
}
if (! "cspplot" %in% pckgs_installed) {
  remotes::install_github("CogSciPrag/cspplot")
}

# load the required packages
x <- lapply(pckgs_needed, library, character.only = TRUE)
library(aida)
library(faintr)
library(cspplot)

# these options help Stan run faster
options(mc.cores = parallel::detectCores())

# use the CSP-theme for plotting
theme_set(theme_csp())

# global color scheme from CSP
project_colors = cspplot::list_colors() |> pull(hex)
# names(project_colors) <- cspplot::list_colors() |> pull(name)

# setting theme colors globally
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = project_colors)
}
scale_fill_discrete <- function(...) {
   scale_fill_manual(..., values = project_colors)
}

Using regression modeling to test for stochastic independence

Here are three data sets, each with triples of associated observations for variables \(X\), \(Y\), and \(Z\). Ideally, you do not look at the code, so you will have more fun with the exercise below.

Toggle code

##################################################
# creating three data sets with various dependencies
##################################################

data_1 <- tibble(
  X = rnorm(500),
  Z = map_dbl(X, function(i) rnorm(1, mean = i)),
  Y = map_dbl(Z, function(i) rnorm(1, mean = i))
)

data_2 <- tibble(
  Z = rnorm(500),
  X = map_dbl(Z, function(i) rnorm(1, mean = i)),
  Y = map_dbl(Z, function(i) rnorm(1, mean = i))
)

data_3 <- tibble(
  X = rnorm(500),
  Y = rnorm(500),
  Z = map2_dbl(X, Y, function(i,j) rnorm(1, mean = i+j)),
)

Exercise 1: Use regression to assess stochastic independence

Use linear regression to investigate whether:

\(X\) and \(Y\) are stochastically independent
\(X\) and \(Y\) are stochastically independent given Z

in each of these data sets.

Ideally, write a single function that takes a data set as input and output information that tells you about whether \(X\) and \(Y\) are independent.

Solution

Here is a reusable function:

Toggle code

test_stochDep <- function(input_data) {
  
  fit_simple   <- brms::brm(Y ~ X, data = input_data)
  fit_multiple <- brms::brm(Y ~ X + Z, data = input_data)
  
  result_simple <- brms::hypothesis(fit_simple, "X>0")
  result_multiple <-brms::hypothesis(fit_multiple, "X>0")
  
  return(list(simple = result_simple, multiple = result_multiple))
}

result_1 = test_stochDep(data_1)
result_2 = test_stochDep(data_2)
result_3 = test_stochDep(data_3)

In data set 1, \(X\) and \(Y\) are not stochastically independent, but turn independent given \(Z\).

Toggle code

result_1

$simple
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1    (X) > 0     1.06      0.06     0.96     1.17        Inf         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

$multiple
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1    (X) > 0    -0.03      0.06    -0.13     0.08       0.53      0.35     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

In data set 2, \(X\) and \(Y\) are not stochastically independent, and turn independent given \(Z\).

Toggle code

result_2

$simple
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1    (X) > 0     0.53      0.04     0.46     0.59        Inf         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

$multiple
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1    (X) > 0        0      0.05    -0.07     0.08       0.99       0.5     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

In data set 3, \(X\) and \(Y\) are stochastically independent, but cease to be given \(Z\).

Toggle code

result_3

$simple
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1    (X) > 0    -0.05      0.05    -0.13     0.03       0.15      0.13     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

$multiple
Hypothesis Tests for class b:
  Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1    (X) > 0     -0.5      0.04    -0.57    -0.44          0         0     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Exercise 2: Interpret your results

Infer which (if any) of the three data sets was created with one of the following causal structures:

collider: \(X, Y \rightarrow Z\)
chain: \(X \rightarrow Z \rightarrow Y\)
fork: \(Z \rightarrow X,Y\)

Solution

Data sets 1 and 2 could be from a chain or a fork, because \(X\) and \(Y\) are not stochastically independent, but turn independent given \(Z\). We cannot say which based on these tests.

Data set 3 could be from a collider, because \(X\) and \(Y\) are stochastically independent, but cease to be given \(Z\).