You need the rmarkdown package working in order to ‘knit’ the HTML output (ctrl/cmd + shift + K in RStudio) to produce an HTML file.
Required packages:
rmarkdown (comes with RStudio)
tidyverse (or ggplot2, dplyr, purrr, tibble)
TeachingDemos
BSDA
As always, load the required packages and change the seed to your last name.
# library(tidyverse)
# library(TeachingDemos)
# library(BSDA)
#
# lastname <- "YOURLASTNAME"
#
# char2seed(lastname)
This homework sheet is about analysing IQ score results from two groups of people. One group had a placebo (sugar pill) and the other had a “smart drug” before taking the test.
You have two research questions to answer:
Is the mean IQ of the placebo group different from the supposed population mean?
Is the mean IQ of the “smart drug” group different from the placebo group?
The data file BDACM_HW2_data.csv contains the results of the experiment.
Read the data file into a tibble called iq_data using the function read_csv(). Then print the tibble.
# YOUR CODE HERE
Did you notice that the data frame is not “tidy”? The grouping variable is split into two different variables, “placebo” and “drug”, and the contents of these columns are the IQ scores.
Use the gather() function to tidy the data frame. In this case key should be “group” and value should be “IQ_score”. Make sure to specify that factor_key should be TRUE (this will set the type of the group column to factor). Put the tidy data frame into a new variable iq_data_tidy and print it.
# YOUR CODE HERE
Use ggplot to make a box plot of the IQ scores of the two groups. Show groups on the x-axis and IQ scores on the y-axis. Use ylim() of 50 to 150.
# YOUR CODE HERE
It’s often useful to plot the same data in different ways. Another way to plot this data is as a density plot. Use ggplot to plot the density of the samples of the two groups. Plot each group in a different colour using fill = group. Use an alpha = 0.5 on the density so that the distributions are visible on top of each other and an xlim() of 50 to 150.
# YOUR CODE HERE
Now that you have inspected the data visually, it’s time to take a closer look at the numbers. Use the familiar data %>% group_by() %>% summarise()
code to find the mean IQ scores and standard deviations of the two groups. Save the result into a variable called iq_summary
and print it.
# YOUR CODE HERE
To answer both of your research questions, it’s important to understand the difference between the data generation distribution and the sampling distribution.
If we suppose that IQ scores are normally distributed with a true mean of 100 and a true standard deviation of 15, then choosing a single person from the population at random is akin to sampling from a norm(100, 15).
If we sampled 21000 people and plotted their IQ scores, the resulting distribution would begin to approximate the true data generation distribution.
# # YOUR CODE HERE:
#
# mu <- ... FILL ME ...
# sigma <- ... FILL ME ...
#
# n_samples <- 21000
#
# iq_simulation <- tibble(sample = 1:n_samples,
#                         iq = NA)
#
# for(i in 1:n_samples){
#
# # sample a set of values
# sample_value <- rnorm(n = 1,
# mean = mu,
# sd = sigma)
#
# # save value to data frame
# iq_simulation$iq[i] <- sample_value
# }
#
# iq_simulation %>%
# ggplot(aes(x = iq)) +
# geom_histogram(binwidth = 1) +
# scale_x_continuous(limits = c(50, 150)) +
# xlab("IQ score") +
# ylab("Frequency") +
# ylim(0,1000) +
# theme_classic()
However, in reality, our sample sizes are nowhere near large enough to approximate the true data generation distribution. Instead, we aggregate the sample into a single value (the mean).
You can then imagine a hypothetical distribution of sample means, from which a single sample mean is drawn. This is called the sampling distribution.
A sampling distribution is the hypothetical distribution from which a sample statistic (e.g. \(\bar{X}\)) is drawn. The mean of the sampling distribution is equal to the true population mean.
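This claim is easy to check with a quick simulation (a side illustration, not part of the exercise; the numbers 100, 15 and 14 follow the IQ setting used in this sheet): the average of many simulated sample means lands very close to the true population mean.

```r
# Quick check: the mean of many sample means approximates the
# true population mean (here 100, with sd 15 and samples of size 14).
set.seed(123)

many_sample_means <- replicate(10000, mean(rnorm(n = 14, mean = 100, sd = 15)))

mean(many_sample_means)  # very close to 100
```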
We can run a simulation to visualise the sampling distribution. Suppose we have the same data generating distribution norm(100, 15).
If we repeatedly (say 1500 times) sample 14 people at random and calculate the mean IQ of each sample, the distribution of means will approximate the sampling distribution.
# # YOUR CODE HERE
#
# mu <- ... FILL ME ... # true population mean
# sigma <- ... FILL ME ... # true standard deviation
#
# sample_size <- ... FILL ME ... # number of people in each sample
# n_samples <- ... FILL ME ... # number of samples
#
# sample_means <- tibble(sample = 1:n_samples,
#                        mean = NA,
#                        std_dev = NA)
#
# for(i in 1:n_samples){
#
# # sample a set of values
# sample_values <- rnorm(n = sample_size,
# mean = mu,
# sd = sigma)
#
# # get the mean of the sample
# sample_means$mean[i] <- mean(sample_values)
#
# # get the standard deviation of the sample
# sample_means$std_dev[i] <- sd(sample_values)
# }
#
# sample_means %>%
# ggplot(aes(mean)) +
# geom_histogram(binwidth = 1) +
# scale_x_continuous(limits = c(50, 150)) +
# xlab("Mean IQ score of sample") +
# ylab("Frequency") +
# ylim(0, 200) +
# theme_classic()
The sampling distribution is much narrower than the data generating distribution. Why do you think this is?
ANSWER:
(your answer here)
To answer your first research question (whether the mean IQ in the placebo group is different from the supposed population mean), you need to perform a one-sample test.
A \(z\)-transformation is a standardization of a normal distribution such that the mean becomes 0 and the standard deviation becomes 1. We can convert the previously simulated distribution of sample means to a \(z\)-distribution like so. Fill in the correct formula for the z-transformation of sample means.
# FILL IN THE CODE:
# sample_means <- sample_means %>%
# mutate(z_score = ... FILL ME... )
#
# sample_means %>%
# ggplot(aes(x = z_score)) +
# geom_histogram(binwidth = 0.1) +
# scale_x_continuous(limits = c(-5, 5)) +
# xlab("z-score of sample mean") +
# ylab("Frequency") +
# theme_classic() +
# ylim(0, 100)
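As a sanity check, any correct z-transformation produces values with mean 0 and standard deviation 1. A minimal illustration with made-up numbers (not the homework data):

```r
# Generic z-transformation: subtract the mean, divide by the standard
# deviation. The result always has mean 0 and standard deviation 1.
x <- c(85, 100, 115, 130)

z <- (x - mean(x)) / sd(x)

round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```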
Now that we have seen the connection between the sampling distribution and a derived z distribution, we can use this information for a z-test.
Suppose that you want to know whether the mean of your sample for the placebo group is different from the expected population mean (100). You can do this by computing the probability of drawing the mean of the placebo group from the sampling distribution.
Historically, you would first convert your sample mean to a z-score. This allows for easy probability calculation (e.g. using a look-up table), because the probabilities for many z-scores are known.
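In R, pnorm() plays the role of that look-up table: it returns the probability of drawing a value below a given z-score from a standard normal distribution. For instance:

```r
# pnorm() gives the cumulative probability of the standard normal,
# i.e. the classic z look-up table.
pnorm(0)      # 0.5: half the distribution lies below the mean
pnorm(1.96)   # ~0.975
pnorm(-1.96)  # ~0.025
```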
What is the z-score for the mean of the placebo group, if we assume \(\mu = 100\) and \(\sigma = 15\)?
# YOUR CODE HERE:
(your answer here)
We now have a z-score for our sample mean, but how improbable is it? Write code to find the two-tailed p-value for the calculated z-score.
# YOUR CODE HERE:
What is the p-value? What does this mean?
(your answer here)
The package BSDA has a z.test() function. Steps b and c can be replaced with it.
# # YOUR CODE HERE:
# placebo_data <- iq_data_tidy %>%
# filter(group == ... FILL ME ...)
#
# placebo_data$IQ_score %>%
# z.test(mu = ... FILL ME ..., sigma.x = ... FILL ME ...)
\(z\)-tests are a nice tool when we know the true standard deviation; however, often this is not the case. In part 2, we assumed the true standard deviation was 15 (because that’s the standard for IQ scores). However, we probably should not rely on the standard deviation being known in this case. (Actually, the data were generated from a known mean and standard deviation, which will be revealed in class.)
In cases where the standard deviation is unknown, it can be estimated from the standard deviation of the sample. We can use the simulated sample means, calculate a t-score for each, and so make an approximate plot of the sampling distribution.
# # YOUR CODE HERE:
#
# sample_means <- sample_means %>%
# mutate(t_score = ... FILL ME ... )
#
# sample_means %>%
# ggplot( ... FILL ME ... ) +
# geom_histogram(binwidth = 0.1) +
# scale_x_continuous(limits = c(-5, 5)) +
# xlab("t-score of sample mean") +
# ylab("Frequency") +
# ylim(0, 100) +
# theme_classic()
The \(t\)-distribution looks quite similar to the z-distribution. The main difference is that the “tails” are a bit larger. Actually, the \(t\)-distribution gets closer to a z-distribution with increasing “degrees of freedom” (in this case just the sample size minus one).
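You can see the heavier tails directly by comparing tail probabilities with R’s built-in distribution functions (a side illustration, not part of the exercise):

```r
# Tail probability of the same cutoff under t vs standard normal:
pnorm(-2)         # ~0.023
pt(-2, df = 5)    # ~0.051: heavier tail at few degrees of freedom
pt(-2, df = 1000) # ~0.023: nearly normal at many degrees of freedom
```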
Just like the z-test, we can perform a t-test on our placebo sample, comparing it to the expected mean of 100. Calculate the test statistic (\(t\)-score) for the observed data.
# YOUR CODE HERE:
As before, we can compute the \(p\)-value of the data we observed, as the probability of observing a \(t\)-score that is at least as extreme (= at least as big in absolute value) as the \(t\)-score of the observed data. Use the built-in \(t\)-distribution with the function pt() to calculate the \(p\)-value.
# # YOUR CODE HERE:
Base R has a t.test() function. Steps b and c can be replaced with it.
# # YOUR CODE HERE:
In order to answer the second research question (whether or not the mean IQ in the “smart drug” group is different from the placebo group), we need to perform a two-sample test. We have not yet discussed the two-sample \(t\)-test in class, but it follows the exact same general logic. The purpose of this exercise is to run you through the two-sample \(t\)-test.
The rationale behind the two-sample \(t\)-test is to determine whether two sets of (unpaired) samples are likely to come from the same population or not. (NB: Notice that there are actually two kinds of two-sample \(t\)-tests, one for paired samples, one for unpaired samples. We have paired samples if we have a measurement of the same individual in two experimental conditions. We have unpaired samples if we have measurements from different individuals in each group.) Formally, we have two random variables \(X\) and \(Y\). The \(n_X\) data points from the placebo group are iid samples from random variable \(X\); the \(n_Y\) data points from the smart-drug group are iid samples from the random variable \(Y\). We assume that both \(X\) and \(Y\) follow a normal distribution with means \(\mu_X\) and \(\mu_Y\) and standard deviations \(\sigma_X\) and \(\sigma_Y\). We do not assume that standard deviations \(\sigma_X\) and \(\sigma_Y\) are known, but we do assume that they are identical: \(\sigma_X = \sigma_Y\). We are then interested in testing the null-hypothesis:
\[H_0 \colon \mu_X = \mu_Y\]
The corresponding research hypothesis is:
\[H_1 \colon \mu_X \neq \mu_Y\]
If the null-hypothesis is true, it can be shown that the test statistic:
\[ T = \sqrt{\frac{n_X \ n_Y \ (n_X + n_Y - 2)}{n_X + n_Y}} \cdot \frac{\bar{X} - \bar{Y}}{\sqrt{(n_X -1) S^2_X + (n_Y -1) S^2_Y}} \]
follows a \(t\)-distribution with \(f = n_X + n_Y - 2\) degrees of freedom. (NB: In the formula above, as before, \(\bar{X}\) and \(\bar{Y}\) are the sample means, and \(S_X\) and \(S_Y\) are the sample standard deviations.)
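As a sanity check on the formula (using simulated data, not the homework data), the hand-computed statistic should match the one reported by t.test() with var.equal = TRUE:

```r
set.seed(123)

# Two simulated, unpaired samples
x <- rnorm(14, mean = 100, sd = 15)
y <- rnorm(14, mean = 105, sd = 15)
n_X <- length(x)
n_Y <- length(y)

# Test statistic from the formula above; var() gives the squared
# sample standard deviation, i.e. S^2
t_manual <- sqrt((n_X * n_Y * (n_X + n_Y - 2)) / (n_X + n_Y)) *
  (mean(x) - mean(y)) / sqrt((n_X - 1) * var(x) + (n_Y - 1) * var(y))

# Built-in equal-variance, unpaired two-sample t-test
t_builtin <- unname(t.test(x, y, var.equal = TRUE, paired = FALSE)$statistic)

all.equal(t_manual, t_builtin)  # TRUE
```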
Like before, we use the above formula to compute the test statistic for the observed data.
## YOUR CODE HERE
#
# n_X <- nrow(iq_data_tidy)/2
# n_Y <- ...
# x_bar <- ...
# y_bar <- ...
# sd_X <- filter(iq_summary, group == 'placebo')$standard_deviation
# sd_Y <- ...
#
# t_score <- ...
#
# t_score
Based on what we know about the distribution of \(T\) (that it follows a \(t\)-distribution with \(f = n_X + n_Y - 2\) degrees of freedom), calculate the probability of seeing an outcome at least as extreme as the observed data.
# YOUR CODE HERE
The t.test() function can also perform two-sample tests. Make sure that you set the options for an unpaired test with equal variance.
# # YOUR CODE HERE:
# t.test( ... FILL ME ... , var.equal = TRUE, paired = FALSE)
If everything went well, you obtained a \(p\)-value of around \(0.06\). Suppose that your research hypothesis is: “This smart drug influences cognitive performance, and therefore leads to a difference in measured IQ scores (higher or lower, we do not care, just some difference).” Based on your test result, which of the following statements are true? (Present your answers by typing on separate lines strings like “(xxii) false” with one line for each consecutive statement. We assume the standard \(\alpha\) level of \(0.05\).)
Is each of these true or false?
ANSWER:
Since the frequentist paradigm is all about counterfactual data (i.e., data which we could have seen but have not), suppose the \(p\) value had turned out to be \(0.03\). Which of the following statements are true/false then? (Present your answers by typing on separate lines strings like “(xxii) false” with one line for each consecutive statement. We assume the standard \(\alpha\) level of \(0.05\).)
Is each of these true or false?
ANSWER:
End of assignment