rmarkdown
working in order to ‘knit’ the HTML output.ctrl/cmd
+ shift
+ K
in RStudio) to produce a HTML file.rmarkdown
(comes with RStudio)tidyverse
(or ggplot2
, dplyr
, purrr
, tibble
)TeachingDemos
rstan
ggmcmc
bridgesampling
polspline
As always, load the required packages and change the seed to your last name.
library(TeachingDemos)
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 2.0.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.18.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
library(ggmcmc)
library(bridgesampling)
library(polspline)
# set cores to use to the total number of cores (minimally 4)
options(mc.cores = max(parallel::detectCores(), 4))
# save a compiled version of the Stan model file
rstan_options(auto_write = TRUE)
lastname <- "YOURLASTNAME"
char2seed(lastname)
Here is some data from a forgetting experiment. t
is the time point, and y
is the proportion of people who correctly recalled the stimulus.
y <- c(.94, .77, .40, .26, .24, .16)
t <- c(1, 3, 6, 9, 12, 18)
obs <- y * 100
forgetting_data <- tibble(obs = obs,
N = 100,
t = t)
Below is the code for the power model of forgetting we discussed in class: \[P(\text{recall} | t; c, d) = ct^{-d}\] where \(0 < c, d < 1.5\)
Implement an alternative model based on this exponential model for each participant:
\[P(\text{recall} | t; a, b) = a \exp(-bt)\] where \(0 < a, b < 1.5\)
power_model_str <- "
// power model of forgetting data
data {
int obs[6] ;
real t[6] ;
}
parameters {
real<lower=0,upper=1.5> c ;
real<lower=0,upper=1.5> d ;
}
transformed parameters {
real theta[6];
for (i in 1:6) {
theta[i] = c * t[i]^(-d) ;
}
}
model {
// increment target log score with
// probability of current samples from prior
target += uniform_lpdf(c | 0, 1.5);
target += uniform_lpdf(d | 0, 1.5);
// increment target log score with
// likelihood of data given current parameter samples
target += binomial_lpmf(obs | 100, theta);
}
"
exponential_model_str <- "
// exponential model of forgetting data
// STAN CODE HERE
"
For both models, complete the code below to draw samples from the posterior distribution. NB: you should use a large amount of samples (as indicated) so that later bridge sampling is reliable. We use the control
parameter here to fine-tune the inference for these models. Remember to set eval = TRUE
.
# complete the code below
stanfit_exponential <- stan(# specify model
iter = 50000,
warmup = 10000,
control = list(adapt_delta = 0.999),
pars = , # replace NA with parameters to inspect
data = ## provide data
)
stanfit_power <- stan(# specify model
iter = 50000,
warmup = 10000,
control = list(adapt_delta = 0.999),
pars = , # replace NA with parameters to inspect
data = ## provide data
)
Show a summary for each fit
# YOUR CODE HERE
Make traceplots for the samples for both models
# YOUR CODE HERE
Plot the density for posterior samples for both models
# YOUR CODE HERE
Approximate the marginal likelihood of each model using the bridgesampling
package (Hint: use bridgesampling::bridge_sampler()
as we did in the lecture, and use option silent = T
to suppress output in the markdown).
# YOUR CODE HERE
Determine the error of the estimated marginal likelihoods using bridgesampling::error_measures
.
# YOUR CODE HERE
The previous results only give you an estimate for the marginal likelihood of the data given each model. But you can use these estimates to compute the Bayes factor in favour of the exponential model.
# YOUR CODE HERE
Interpret the Bayes factor. What is the value? What does it mean?
ANSWER:
(your answer here)
Use the Savage-Dickey method to compute the Bayes factor in favor of the exponential model, when compared against a special case (properly nested model) of the exponential model where a = 1.
Fill in the required code:
# NB: we take only a subsest of the samples because otherwise polspline will not work
post_samples <- filter(ggs(stanfit_exponential),
Parameter == "a" & Iteration <= 1000)$value
fit.posterior <- polspline::oldlogspline(post_samples,
ubound=1.5, lbound=0)
posterior_a <- polspline::doldlogspline(1, fit.posterior)
# calculate the Bayes factor via the Savage-Dickey method
# YOUR CODE HERE
Interpret the result
Is this strong or mild evidence in favor of one of the two models?
ANSWER:
(your answer here)
We want to approximate the posterior predictive distribution. For this, we draw samples from the posterior, using basically the same scripts as above for posterior estimation. Additionally we generate a “sample imaginary prediction” of data that we’d expect to see in an exact repetition of the experiment, for each sampled vector of parameter values. To do so, we use the generated parameters
block at the end of the Stan model code.
Complete the pseudo-code below for the exponential and the power model with actual Stan code which does that. (Hint: use command binomial_rng
to take samples from a binomial distribution; check the Stan manual for its syntax.)
power_pred_model_str <- "
// Stan code here
// fill in the gaps and replace the pseudo-code
data {
// as before
}
parameters {
// as before
}
transformed parameters {
// as before
}
model {
// as before
}
## PSEUDO-CODE!
generated quantities {
// fill in for yourself
// DECLARE VARIABLE 'post_predict_sample' WITH SAME SHAPE AS THE DATA
// post_predict_sample = SAMPLE FROM BINOMIAL DISTRIBUTION, WITH CURRENT PARAMETER ESTIMATES
}
"
exponential_pred_model_str <- "
// Stan code here
// fill in the gaps and replace the pseudo-code
data {
// as before
}
parameters {
// as before
}
transformed parameters {
// as before
}
model {
// as before
}
## PSEUDO-CODE!
generated quantities {
// fill in for yourself
// DECLARE VARIABLE 'post_predict_sample' WITH SAME SHAPE AS THE DATA
// post_predict_sample = SAMPLE FROM BINOMIAL DISTRIBUTION, WITH CURRENT PARAMETER ESTIMATES
}
"
Plot the data and a visual posterior predictive check
Use the samples from variable post_predict_sample
to generate plots for a visual posterior predictive check, which should similar what we had in the lecture. Use y limits of 0 to 100.
Hint: use geom_point()
for the real data and geom_pointrange()
for the predicted values and 95% credible intervals.
# YOUR CODE HERE
To compute the posterior predictive \(p\)-value most efficiently (in the sense of having to write least code; not necessarily in the sense of fastest run time), we will expand the Stan code even more. Based on your previous code which computes samples from the posterior predictive distribution, expand the Stan code even more to also generate two more quantities. The first is the likelihood of the observed data under the currently sampled parameter values. The second is the likelihood of the post_predict_sample
under the currently sampled parameter value. The posterior predictive \(p\)-value is approximated by the mean of the variable post_pred_p_value_flag
in the pseudo-code below.
Complete the code accordingly.
power_ppp_model_str <- "
// fill in the gaps and replace the pseudo-code
data {
// as before
}
parameters {
// as before
}
transformed parameters {
// as before
}
model {
// as before
}
## PSEUDO-CODE!
generated quantities {
// fill in for yourself
// DECLARE VARIABLE 'post_predict_sample' WITH SAME SHAPE AS THE DATA
// post_predict_sample = SAMPLE FROM BIOMIAL DISTRIBUTION, WITH CURRENT PARAMETER ESTIMATES
// likelihood_data = binomial_lpmf( ... FILL ME ... )
// likelihood_post_predict_sample = binomial_lpmf( ... FILL ME ... )
// post_pred_p_value_flag = likelihood_data >= likelihood_post_predict_sample;
}
"
exponential_ppp_model_str <- "
// fill in the gaps and replace the pseudo-code
data {
// as before
}
parameters {
// as before
}
transformed parameters {
// as before
}
model {
// as before
}
## PSEUDO-CODE!
generated quantities {
// fill in for yourself
// DECLARE VARIABLE 'post_predict_sample' WITH SAME SHAPE AS THE DATA
// post_predict_sample = SAMPLE FROM BIOMIAL DISTRIBUTION, WITH CURRENT PARAMETER ESTIMATES
// likelihood_data = binomial_lpmf( ... FILL ME ... )
// likelihood_post_predict_sample = binomial_lpmf( ... FILL ME ... )
// post_pred_p_value_flag = likelihood_data >= likelihood_post_predict_sample;
}
"
Retrieve estimates of the posterior predictive \(p\)-values for both the exponential and power models.
# YOUR CODE HERE
# Run the stan models and retrieve an estimate of the posterior predictive
# p-value for both exponential and power model.
Can either of the models be criticised based on these values? If so, which and why?
ANSWER:
(your answer here)
End of assignment