Instructions

Grading scheme

Suggested readings

Required R packages

As always, load the required packages and change the seed to your last name.

library(TeachingDemos)
library(ggm)
library(maps)
library(tidyverse)
library(rstan)
library(ggmcmc)

# use as many cores as are available, but at least 4
options(mc.cores = max(parallel::detectCores(), 4)) 
# save a compiled version of the Stan model file
rstan_options(auto_write = TRUE)

lastname <- "YOURLASTNAME"

char2seed(lastname)
library(TeachingDemos)
library(ggm)
library(maps)
library(tidyverse)
library(rstan)
library(ggmcmc)

# use as many cores as are available, but at least 4
options(mc.cores = max(parallel::detectCores(), 4)) 
# save a compiled version of the Stan model file
rstan_options(auto_write = TRUE)


lastname <- "bayes"

char2seed(lastname)

2. Inferring a difference between means

Earlier in this course, we looked at a particular instantiation of a \(t\)-test as a tool for deciding whether to retain or abandon the null hypothesis that the means of two sets of samples are equal. Let’s now use Bayesian parameter inference to address the same kind of question.

Here is some data:

set.seed(2018)
N1 <- 20
N2 <- 20
dataList <- list(y1 = rnorm(N1, mean = 100, sd = 20), 
                 y2 = rnorm(N2, mean = 107, sd = 20),
                 N1 = N1,
                 N2 = N2)

Here is a (clumsy!) Stan model for these data. Notice that we estimate the means of the two vectors/groups independently and use the generated quantities block to obtain samples that approximate our posterior beliefs about the difference between these means.


data {
  int N1;
  int N2;
  real y1[N1] ;
  real y2[N2] ;
}
parameters {
  real mu1 ;
  real mu2 ;
  real sd1 ;
  real sd2 ;
} 
model {
  mu1 ~ normal(100, 100); // diffuse but roughly accurate priors
  mu2 ~ normal(100, 100); 
  sd1 ~ gamma(1, 100); // sds are assumed to be low a priori
  sd2 ~ gamma(1, 100); 
  y1 ~ normal(mu1,sd1);
  y2 ~ normal(mu2,sd2);
}
generated quantities {
  real delta;
  delta = mu1 - mu2;
}

a. Run this model by completing the following code and produce a summary of the fit.

model_string <-  "

paste the model code here

"

fit <- stan(model_code = model_string,
           iter = 10000,
           data = dataList)
model_string <-   "
data {
  int N1;
  int N2;
  real y1[N1] ;
  real y2[N2] ;
}
parameters {
  real mu1 ;
  real mu2 ;
  real sd1 ;
  real sd2 ;
} 
model {
  mu1 ~ normal(100, 100); // diffuse but roughly accurate priors
  mu2 ~ normal(100, 100); 
  sd1 ~ gamma(1, 100); // sds are assumed to be low a priori
  sd2 ~ gamma(1, 100); 
  y1 ~ normal(mu1,sd1);
  y2 ~ normal(mu2,sd2);
}
generated quantities {
  real delta;
  delta = mu1 - mu2;
}

"
  


stan_fit <- stan(model_code = model_string,
           iter = 10000,
           data = dataList)
## recompiling to avoid crashing R session
summary(stan_fit)
##  Length   Class    Mode 
##       1 stanfit      S4

b. Based on which information in this summary would you conclude that it is plausible to assume that the samples are representative of the target (posterior) distribution?

ANSWER:

SOLUTION: We look at the R-hat values and see that they are all under 1.1. This indicates that the chains have converged, so we can take the samples to be representative of the posterior distribution implied by the model.
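
For reference, the R-hat values (together with the effective sample sizes) can be pulled out of the fitted object directly; the following is a minimal sketch, assuming the stan_fit object from above.

# a sketch: per-parameter convergence diagnostics from the stanfit object
fit_summary <- summary(stan_fit)$summary   # matrix with one row per parameter
fit_summary[, c("n_eff", "Rhat")]          # effective sample sizes and R-hat values

# alternatively, plot a histogram of all R-hat values
stan_rhat(stan_fit)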

c. Produce a plot of the posterior density of all model parameters.

# your code here

# densities of all parameters via ggmcmc + ggplot2
ggplot(ggs(stan_fit), aes(x = value)) +
  geom_density() +
  facet_wrap(~ Parameter, scales = "free")

# alternatively, rstan's built-in density plot
plot(stan_fit, plotfun = "dens")

d. For each of the parameters, comment on whether the inferred values are reasonable given the data (for which we know the distributions that generated them) and the model.

SOLUTION: The inferred means are close to the ‘true’ means, and consequently so is delta. The standard deviations are estimated to be perceptibly lower than their true values because of the rather strong prior assumption that they are small.
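
As a quick numerical check (a sketch, reusing the stan_fit object from above), one can compare the posterior means against the values used to simulate the data:

# a sketch: posterior means per parameter, to be compared against the
# generating values (mu1 = 100, mu2 = 107, sd1 = sd2 = 20, hence delta = -7)
ggmcmc::ggs(stan_fit) %>%
  group_by(Parameter) %>%
  summarise(posterior_mean = mean(value))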

e. Calculate a 95% HDI for the posterior of delta. (Hint: first extract the samples from the stan_fit object into a tibble, and then use a function from the package HDInterval, which you may have to install.)

samples <- ggmcmc::ggs(stan_fit)
HDInterval::hdi( ... something with samples and delta ... )
samples <- ggmcmc::ggs(stan_fit)  # extract posterior samples into a tidy tibble
HDInterval::hdi(filter(samples, Parameter == "delta") %>% select(value))  # 95% HDI by default
##            value
## lower -12.387110
## upper  -7.164576
## attr(,"credMass")
## [1] 0.95

f. Interpret the result from the HDI computation. If the model is true, should we conclude from the data that there is likely a difference in means?

SOLUTION: The value 0 lies well outside the 95% HDI of the posterior of delta, so we can safely conclude that, given the data and the model, it is not very plausible that the means are identical.
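
A complementary way to read the posterior (a sketch, reusing the samples tibble from above) is to compute the share of posterior samples in which delta is negative, i.e. in which mu1 < mu2:

# a sketch: approximate posterior probability that delta = mu1 - mu2 < 0
samples %>%
  filter(Parameter == "delta") %>%
  summarise(prob_delta_negative = mean(value < 0))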

g. Here is the output of the corresponding \(t\)-test in a frequentist paradigm.

with(dataList, t.test(y1,y2, paired = FALSE, var.equal = FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  y1 and y2
## t = -1.5525, df = 37.525, p-value = 0.1289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -22.492760   2.971995
## sample estimates:
## mean of x mean of y 
##  97.17022 106.93060

How would you interpret this result? Would you reject the null hypothesis (that the means are equal)?

SOLUTION: No, in this case we have no grounds to reject the null hypothesis: the p-value (0.129) is above the conventional 0.05 threshold, and the 95% confidence interval for the difference in means includes 0.
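
For a side-by-side look at the two interval estimates (a sketch, reusing dataList and the samples tibble from above):

# a sketch: frequentist 95% confidence interval vs. Bayesian 95% HDI for the
# difference in means (both computed as y1 minus y2)
ttest_result <- with(dataList, t.test(y1, y2, paired = FALSE, var.equal = FALSE))
ttest_result$conf.int
HDInterval::hdi(filter(samples, Parameter == "delta") %>% select(value))

The HDI is considerably narrower than the confidence interval, in line with the strong priors on the standard deviations noted in part d.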


End of assignment