Instructions

Grading scheme

Suggested readings

Required R packages

As always, load the required packages and change the seed to your last name.

library(TeachingDemos)
library(ggm)
library(maps)
library(tidyverse)
library(rstan)
library(ggmcmc)

# use as many cores as are available, but at least 4
options(mc.cores = max(parallel::detectCores(), 4)) 
# save a compiled version of the Stan model file
rstan_options(auto_write = TRUE)

lastname <- "YOURLASTNAME"

char2seed(lastname)
library(TeachingDemos)
library(ggm)
library(maps)
library(tidyverse)
library(rstan)
library(ggmcmc)

# use as many cores as are available, but at least 4
options(mc.cores = max(parallel::detectCores(), 4)) 
# save a compiled version of the Stan model file
rstan_options(auto_write = TRUE)


lastname <- "bayes"

char2seed(lastname)

2. Inferring a difference between means

Earlier in this course, we looked at a particular instantiation of a \(t\)-test as a tool for deciding whether to retain or abandon the null hypothesis that the means of two sets of samples are equal. Let’s now use Bayesian parameter inference to address the same kind of question.

Here is some data:

set.seed(2018)
N1 <- 20
N2 <- 20
dataList <- list(y1 = rnorm(N1, mean = 100, sd = 20), 
                 y2 = rnorm(N2, mean = 107, sd = 20),
                 N1 = N1,
                 N2 = N2)

Here is a (clumsy!) Stan model for these data. Notice that we estimate the means of the two vectors/groups independently and use the generated quantities block to obtain samples that approximate our posterior beliefs about the difference between these means.


data {
  int N1;
  int N2;
  real y1[N1] ;
  real y2[N2] ;
}
parameters {
  real mu1 ;
  real mu2 ;
  real sd1 ;
  real sd2 ;
} 
model {
  mu1 ~ normal(100, 100); // diffuse but roughly accurate priors
  mu2 ~ normal(100, 100); 
  sd1 ~ gamma(1, 100); // sds are assumed to be low a priori
  sd2 ~ gamma(1, 100); 
  y1 ~ normal(mu1,sd1);
  y2 ~ normal(mu2,sd2);
}
generated quantities {
  real delta;
  delta = mu1 - mu2;
}

a. Run this model by completing the following code and produce a summary of the fit.

model_string <-  "

paste the model code here

"

fit <- stan(model_code = model_string,
           iter = 10000,
           data = dataList)
model_string <-   "
data {
  int N1;
  int N2;
  real y1[N1] ;
  real y2[N2] ;
}
parameters {
  real mu1 ;
  real mu2 ;
  real sd1 ;
  real sd2 ;
} 
model {
  mu1 ~ normal(100, 100); // diffuse but roughly accurate priors
  mu2 ~ normal(100, 100); 
  sd1 ~ gamma(1, 100); // sds are assumed to be low a priori
  sd2 ~ gamma(1, 100); 
  y1 ~ normal(mu1,sd1);
  y2 ~ normal(mu2,sd2);
}
generated quantities {
  real delta;
  delta = mu1 - mu2;
}

"
  


stan_fit <- stan(model_code = model_string,
           iter = 10000,
           data = dataList)
## recompiling to avoid crashing R session
summary(stan_fit)
##  Length   Class    Mode 
##       1 stanfit      S4

b. Based on which information in this summary would you conclude that it is plausible to assume that the samples are representative of the target (posterior) distribution?

ANSWER:

SOLUTION: We look at the R-hat values and see that they are all under 1.1. This indicates that the chains have converged, so we can take the samples to be representative of the posterior distribution implied by the model.
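
For reference, the R-hat values (together with the effective sample sizes) can be pulled out of the fitted object directly; the following is a minimal sketch, assuming the stan_fit object from above.

# a sketch: per-parameter convergence diagnostics from the stanfit object
fit_summary <- summary(stan_fit)$summary   # matrix with one row per parameter
fit_summary[, c("n_eff", "Rhat")]          # effective sample sizes and R-hat values

# alternatively, plot a histogram of all R-hat values
stan_rhat(stan_fit)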

c. Produce a plot of the posterior density of all model parameters.

# your code here

# densities of all parameters via ggmcmc + ggplot2
ggplot(ggs(stan_fit), aes(x = value)) +
  geom_density() +
  facet_wrap(~ Parameter, scales = "free")

# alternatively, rstan's built-in density plot
plot(stan_fit, plotfun = "dens")

d. For each of the parameters, comment on whether the inferred values are reasonable given the data (for which we know the distributions that generated them) and the model.

SOLUTION: The inferred means are close to the ‘true’ means, and consequently so is delta. The standard deviations are estimated to be perceptibly lower than their true values because of the rather strong prior assumption that they are small.
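
As a quick numerical check (a sketch, reusing the stan_fit object from above), one can compare the posterior means against the values used to simulate the data:

# a sketch: posterior means per parameter, to be compared against the
# generating values (mu1 = 100, mu2 = 107, sd1 = sd2 = 20, hence delta = -7)
ggmcmc::ggs(stan_fit) %>%
  group_by(Parameter) %>%
  summarise(posterior_mean = mean(value))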

e. Calculate a 95% HDI for the posterior of delta. (Hint: first extract the samples from the stan_fit object into a tibble, and then use a function from the package HDInterval, which you may have to install.)

samples <- ggmcmc::ggs(stan_fit)
HDInterval::hdi( ... something with samples and delta ... )
samples <- ggmcmc::ggs(stan_fit)  # extract posterior samples into a tidy tibble
HDInterval::hdi(filter(samples, Parameter == "delta") %>% select(value))  # 95% HDI by default
##            value
## lower -12.387110
## upper  -7.164576
## attr(,"credMass")
## [1] 0.95

f. Interpret the result from the HDI computation. If the model is true, should we conclude from the data that there is likely a difference in means?

SOLUTION: The value 0 lies well outside the 95% HDI of the posterior of delta, so we can safely conclude that, given the data and the model, it is not very plausible that the means are identical.
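
A complementary way to read the posterior (a sketch, reusing the samples tibble from above) is to compute the share of posterior samples in which delta is negative, i.e. in which mu1 < mu2:

# a sketch: approximate posterior probability that delta = mu1 - mu2 < 0
samples %>%
  filter(Parameter == "delta") %>%
  summarise(prob_delta_negative = mean(value < 0))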

g. Here is the output of the corresponding \(t\)-test in a frequentist paradigm.

with(dataList, t.test(y1,y2, paired = FALSE, var.equal = FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  y1 and y2
## t = -1.5525, df = 37.525, p-value = 0.1289
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -22.492760   2.971995
## sample estimates:
## mean of x mean of y 
##  97.17022 106.93060

How would you interpret this result? Would you reject the null hypothesis (that the means are equal)?

SOLUTION: No, in this case we have no grounds to reject the null hypothesis: the p-value (0.129) is above the conventional 0.05 threshold, and the 95% confidence interval for the difference in means includes 0.
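
For a side-by-side look at the two interval estimates (a sketch, reusing dataList and the samples tibble from above):

# a sketch: frequentist 95% confidence interval vs. Bayesian 95% HDI for the
# difference in means (both computed as y1 minus y2)
ttest_result <- with(dataList, t.test(y1, y2, paired = FALSE, var.equal = FALSE))
ttest_result$conf.int
HDInterval::hdi(filter(samples, Parameter == "delta") %>% select(value))

The HDI is considerably narrower than the confidence interval, in line with the strong priors on the standard deviations noted in part d.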


End of assignment