This homework assignment is due May 12th 2017 before class. Please email a *.Rmd file and a compiled output, preferably HTML, to Christian Adam. Make sure that the text explains your code and plots; the explanation you give is equally important, if not more so. Use R for the tasks below. Plot using ggplot2. Make use of the tidyverse wherever you can, in particular use piping (%>%) and the dplyr functions filter(), mutate(), arrange(), select() and summarise(). Keep your descriptions and answers as short and concise as possible, without resorting to bullet points. Plots will be scored based on how sensible, clear and aesthetically pleasing they are.

Submit your results in a zipped archive with the name BDA+CM_HW1_YOURLASTNAME.zip. Use the same naming scheme for the *.Rmd and *.html files. All other files should start with BDA+CM_HW1_YOURLASTNAME_ as well.

There are four exercises below. You need to complete three of them. If you complete four, you get the points for your three best exercises. Whichever three exercises are counted, they contribute in equal parts to the final score.

Exercise 1: tidying up the in-class enquete

For this exercise, you will need the files 01_merged_enquetes.csv and 01_participant_info.rds, which you can find in the folder data of our GitHub repository. 01_merged_enquetes.csv is a concatenation of the csv files everyone filled out, with an extra column ID which uniquely identifies the file each row came from. To retain anonymity, usernames that matched a participant’s actual name were replaced by their MD5 hash. 01_participant_info.rds contains the gender and the message body (without the name) of each csv file sent in, along with the file identifier.

  1. Read 01_merged_enquetes.csv into a data frame, using readr::read_csv(). Make sure you understand what happens with strings and factors when you do this and check whether this is what you want later on.
  2. When dealing with data, we need to be as sure as possible that it has the structure we assume, and for large data sets, this cannot be done by manually inspecting each row of the data. So we will have to check for data integrity. This particular data set deviates from what we would expect in three or more ways - find at least two of them. (Hint: Functions that may be useful here are summary(), str(), length(), levels(), nrow() and unique(); when using RStudio, you might also find the function View() helpful.)
  3. Load 01_participant_info.rds using the function readRDS(). It would be nice to have the data from the two data frames combined. Find a way to do that, such that each row is one filled-out question with the columns (in the given order): ID, user.name, Gender, qtype, question, response, Message. (A minimal sketch of one possible approach appears after this list.)
  4. nchar() returns the number of characters in a string. Add a column message_length to the data frame that contains the length of Message in characters.
  5. Use this new column and one additional column from the data frame to make a first plot that shows their relationship. Describe what the plot suggests to you and how confident you are in that suggestion holding up.
  6. Save the data frame in the RDS format under name BDA+CM_HW1_YOURLASTNAME_Ex1_enquete.rds and send the result along with your other files in a compressed archive.
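A minimal sketch of how steps 1, 3, 4 and 6 could fit together is given below. It is only one possible approach, and it assumes that the file identifier column is called ID in both tables and that the remaining column names match the ones listed in step 3, so check this against your data before relying on it:

library(tidyverse)  # readr, dplyr and ggplot2

enquetes = read_csv("01_merged_enquetes.csv")       # step 1
participants = readRDS("01_participant_info.rds")   # step 3

combined = enquetes %>%
  left_join(participants, by = "ID") %>%            # assumes ID is the shared key in both tables
  select(ID, user.name, Gender, qtype, question, response, Message) %>%
  mutate(message_length = nchar(Message))           # step 4

saveRDS(combined, "BDA+CM_HW1_YOURLASTNAME_Ex1_enquete.rds")  # step 6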

Exercise 2: plotting data from IMDB movie ratings

For this exercise, you will need the file movie_metadata.csv, which you can obtain here. The data should be self-explanatory enough that you can start playing around with it, but you can find more information about the data and things you could do with it on the creator’s blog post.

Load the data using readr::read_csv(). Our main variable of interest will be the movie ratings imdb_score. If the tibble containing the data is stored in a variable d, you can access this column as d$imdb_score. Explore the ratings by making three different plots:

  1. Plot the distribution of movie ratings. In the same plot, overlay a normal distribution with the same mean and standard deviation as the movie ratings. Describe whether this plot tells you anything. (A minimal sketch of this kind of overlay, on simulated data, follows this list.)
  2. Make a plot of your choice that describes the movie ratings in terms of two categorical variables (i.e. variables with a fixed number of possible discrete values (levels) like genres or color). An obvious choice for this would be a facet plot, where one of the categorical variables determines the facets and the other is mapped to, say, color, but feel free to use any kind of plot that you like.
  3. There are a number of numerical variables included in the data set as well (like the number of likes). Choose two and plot the relationship between each of them and the movie scores, then answer the following questions:
    • What do they tell you?
    • Can you combine the information from both plots?
    • What if you were interested in more than two numerical variables?
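To make the idea in part 1 concrete, here is a minimal sketch of a density histogram with a matching normal curve overlaid. It uses simulated data rather than the actual ratings, and all names and values (toy, the bin width, the mean and standard deviation) are illustrative; the same pattern of density scaling plus stat_function() transfers directly to imdb_score:

library(tidyverse)

set.seed(1)
toy = tibble(score = rnorm(1000, mean = 6.5, sd = 1))  # stand-in for the real ratings

toy %>%
  ggplot(aes(x = score)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.25) +  # density scale, not counts
  stat_function(fun = dnorm,
                args = list(mean = mean(toy$score), sd = sd(toy$score)),
                colour = "red")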

Exercise 3: MC simulation of the beta binomial distribution

We learned about the beta distribution and the binomial distribution in class. Here we are going to look at the beta binomial distribution, which is a simple sequential combination of both. Although there is a precise mathematical characterization of the beta binomial distribution, and although the R package rmutil implements (a version of) it, we will explore it here by Monte Carlo sampling in order to understand it better.

The beta binomial model is this. There is a latent bias parameter \(\theta\) of a coin, but we do not know it. Our belief about possible values for \(\theta\) is described by a beta distribution \(\text{Beta}(\theta; \alpha, \beta)\) with shape parameters \(\alpha\) and \(\beta\). If \(\theta\) has a fixed value, the probability of \(k\) successful trials in \(n\) flips is given by the binomial distribution \(\text{B}(k ; n, \theta)\). The beta binomial model combines our uncertainty about \(\theta\) with the binomial model. Think of it as first sampling a random \(\theta\) from the distribution \(\text{Beta}(\alpha, \beta)\) (we will write \(\theta \sim \text{Beta}(\alpha, \beta)\)), then flipping a coin with bias \(\theta\) \(n\) times and observing the number of successes. Do this many times; the relative frequency with which each possible \(k\) occurs in this feed-forward sampling process will approximate the beta binomial distribution \(\text{BetaBinom}(k ; \alpha, \beta, n)\).

Implement a Monte Carlo simulation to approximate \(\text{BetaBinom}(k ; \alpha, \beta, n)\) with \(\alpha = 8\), \(\beta = 18\) and \(n = 24\) in this way. To do so, sample many values of \(\theta \sim \text{Beta}(\alpha, \beta)\), using the function rbeta() in R. Then, for each such \(\theta\), sample the number of successes in \(n\) coin tosses, using the function rbinom(). Store the results and make a barplot of the proportion (not frequency counts!) with which each value of \(k\) occurred. To see whether you got it right and/or have enough samples, visually compare your simulation results to the analytic solution, which is:

\[ \text{BetaBinom}(k ; \alpha, \beta, n) = \binom{n}{k} \frac{\text{Beta}(k+\alpha, n-k+\beta)}{\text{Beta}(\alpha, \beta)}, \] where \(\text{Beta}(\cdot, \cdot)\) on the right-hand side denotes the beta function (beta() in R). For our concrete values it looks like this:

library(tidyverse)  # for tibble, %>% and ggplot2

alpha = 8
beta = 18  # NB: this numeric `beta` does not mask the beta() function used below
n = 24
tibble(k = 0:24,  # NB: `tibble` allows `k` to be used in the next line; `data.frame` wouldn't
       probability = choose(n, k) * beta(k + alpha, n - k + beta) / beta(alpha, beta)) %>% 
  ggplot(aes(x = k, y = probability)) + geom_bar(stat = "identity")
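For the simulation itself, the core sampling step can be as short as the sketch below (it reuses alpha, beta and n from above; n_samples is an arbitrary choice). Turning the sampled values of \(k\) into proportions and into a plot that can be compared with the analytic solution is left to you:

n_samples = 100000
theta_samples = rbeta(n_samples, shape1 = alpha, shape2 = beta)  # one theta per simulated experiment
k_samples = rbinom(n_samples, size = n, prob = theta_samples)    # successes in n flips with that theta
# proportions of each observed k can then be obtained, e.g., via table(k_samples) / n_samples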

Exercise 4: MC validation of one interpretation of confidence intervals

Use a Monte Carlo method to assess whether it is true that the confidence interval calculated by R’s function binom.test() contains the true value in 95% of the cases. To do this, repeat the following sampling process many times. First sample a random value for a coin bias \(\theta \sim \text{Beta}(1,1)\) from a uniform distribution over the unit interval (realized here with a beta distribution with shape parameters both set to 1). Remember that you can get 1 sample from this distribution in R by writing:

theta = rbeta(1, shape1 = 1, shape2 = 1)

For each sampled value \(\theta\), sample a value of \(n \sim \text{Poisson}(\lambda = 10)\) from a Poisson distribution with mean and variance 10. In R, you get a sample from a Poisson distribution with parameter \(\lambda = 10\) by writing:

n = rpois(1, 10)

We use a Poisson distribution purely for convenience, since it is a handy way to obtain random non-negative integers in a suitable range, not for any deeper theoretical reason.

Next, for each pair of values \(\theta\) and \(n\), draw a sample from a binomial distribution, i.e., sample a number of successes \(k \sim \text{B}(n, \theta)\), using:

k = rbinom(1, n, theta)

Now calculate the confidence interval for your triple of \(\theta\), \(n\) and \(k\), using:

binom.test(k, n, theta)$conf.int

You get the lower bound of this interval with binom.test(k, n, theta)$conf.int[1] and the upper bound with binom.test(k, n, theta)$conf.int[2]. Check whether \(\theta\) lies inside this confidence interval.

When you repeat all of this many times, record the number of times \(\theta\) fell inside the confidence interval. If all is well, the proportion of cases where \(\theta\) is inside the interval should converge to 95%. To check this, give a plot of the development of the proportion of cases where \(\theta\) is inside the confidence interval over your iterations, as we did for the MC simulated \(p\)-value in class.
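One compact way to obtain such a running proportion, assuming you have stored the per-iteration results in a logical vector (the name inside_ci below is illustrative) and have the tidyverse loaded, is:

# inside_ci: one entry per iteration, TRUE if theta fell inside the confidence interval
running_proportion = cumsum(inside_ci) / seq_along(inside_ci)
tibble(iteration = seq_along(inside_ci),
       proportion = running_proportion) %>%
  ggplot(aes(x = iteration, y = proportion)) +
  geom_line() +
  geom_hline(yintercept = 0.95, linetype = "dashed")  # the 95% target, for reference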