General instructions

knitr::opts_chunk$set(
  warning = FALSE, # suppress warnings by default
  message = FALSE  # suppress messages by default
)

Exercise 1: Bootstrapped confidence interval

Small vector (10 points)

For present purposes, let’s define the 2.5% quantile of a vector \(\vec{x}\) as the smallest number \(l\) occurring in \(\vec{x}\) such that the percentage of numbers in \(\vec{x}\) that are no bigger than \(l\) exceeds 2.5%. Using this definition, compute the lower bound of the 95% bootstrapped confidence interval of the mean for the vector \(\vec{d} = \langle 1, 2, 3 \rangle\). (It is not enough to just name the result. Please spell out the computation steps that you took to get there.)
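With only three data points, the bootstrap distribution of the mean can be enumerated exactly instead of being simulated. The following is a minimal R sketch of that enumeration, assuming that each of the \(3^3 = 27\) ordered resamples (drawn with replacement) is equally likely. The variable names are illustrative and not taken from the course notes.

```r
d <- c(1, 2, 3)

# All 3^3 = 27 ordered resamples with replacement, one per row
resamples <- expand.grid(d, d, d)

# Mean of each resample (row sums are exact integers, so equal means compare equal)
resample_means <- rowSums(resamples) / length(d)

# Candidate values for the quantile: every mean that actually occurs
candidates <- sort(unique(resample_means))

# Proportion of resample means that are no bigger than each candidate
cum_prop <- sapply(candidates, function(l) mean(resample_means <= l))

# Quantile as defined above: smallest candidate whose proportion exceeds 2.5%
lower_bound <- min(candidates[cum_prop > 0.025])

# Inspect the full distribution to spell out the computation steps
data.frame(candidate = candidates, cum_prop = cum_prop)
```

Printing the table of candidates and cumulative proportions makes each step of the computation visible, which is what the exercise asks you to spell out.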

Larger vector (4 points)

Use the function bootstrapped_CI defined in the course notes to compute the 95% bootstrapped CI for the following vectors (please set n_resamples = 1e5 to get more precise estimates than the default number of resamples would give you):

d1 = c(1,2,3)
d2 = rep(d1,2)
d3 = rep(d1,10)
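If the course notes are not at hand, the general idea of a percentile bootstrap CI for the mean can be sketched as below. The function bootstrap_ci_sketch is a hypothetical stand-in, not the course's bootstrapped_CI, and its default for n_resamples is an arbitrary choice. It uses R's built-in quantile function, whose default interpolation differs from the quantile definition in the previous part. The sketch uses 1e4 resamples to keep the run time short.

```r
# Hypothetical stand-in (NOT the course's bootstrapped_CI):
# percentile bootstrap CI for the mean of a numeric vector
bootstrap_ci_sketch <- function(data_vector, n_resamples = 1000) {
  # Resample with replacement and record the mean of each resample
  resampled_means <- replicate(
    n_resamples,
    mean(sample(data_vector, size = length(data_vector), replace = TRUE))
  )
  c(
    lower = quantile(resampled_means, 0.025, names = FALSE),
    mean  = mean(data_vector),
    upper = quantile(resampled_means, 0.975, names = FALSE)
  )
}

set.seed(123)  # arbitrary seed, only for reproducibility of the sketch
d1 <- c(1, 2, 3)
d2 <- rep(d1, 2)
d3 <- rep(d1, 10)

ci_d1 <- bootstrap_ci_sketch(d1, n_resamples = 1e4)
ci_d2 <- bootstrap_ci_sketch(d2, n_resamples = 1e4)
ci_d3 <- bootstrap_ci_sketch(d3, n_resamples = 1e4)

# One row per vector: lower bound, sample mean, upper bound
rbind(d1 = ci_d1, d2 = ci_d2, d3 = ci_d3)
```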

Effect of vector size (6 points)

Describe in short, precise but intuitive terms why the results in the calculations for d1, d2 and d3 differ in the way they do.

Exercise 2: Correlation is invariant under positive linear transformation (16 points)

Give a mathematical proof of the claim that Pearson’s product-moment correlation is invariant under positive linear transformation. I.e., show that if \(x\) and \(y\) are one-dimensional vectors (of the same length \(n\)), and if \(x' = ax + b\) with \(a,b \in \mathbb{R}, a > 0\), then \(r_{xy} = r_{x'y}\).

To show this, proceed in the following steps (4 points each):

  1. Show that Pearson’s correlation can be written in terms of standardized vectors \(z_{x}\) and \(z_{y}\), where \(z_{x_i} = \frac{x_i - \mu_{x}}{s_x}\) and similarly for \(z_y\). In particular, show that \(r_{xy} = \frac{1}{n} \sum_{i = 1}^n z_{x_i} \ z_{y_i}\).
  2. Show that \(\mu_{x'} = a\ \mu_{x} + b\).
  3. Show that \(s_{x'} = a\ s_x\).
  4. Using the facts from steps 1-3, prove the invariance claim.
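A numerical check is no substitute for the proof, but it can help catch algebra slips along the way. The sketch below checks each step on simulated data. It assumes that the standard deviation in the standardization is the population version (dividing by \(n\)), which is what makes the \(\frac{1}{n}\) formula in step 1 exact; R's built-in sd divides by \(n - 1\) instead. The helper names pop_sd and z_score are illustrative.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

a <- 2.5   # any a > 0
b <- -4    # any real b
x_prime <- a * x + b

# Population standard deviation (divides by n, unlike R's sd())
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Standardized vector
z_score <- function(v) (v - mean(v)) / pop_sd(v)

# Step 1: correlation as the mean product of standardized vectors
r_xy     <- cor(x, y)
r_from_z <- mean(z_score(x) * z_score(y))

# Steps 2 and 3: effect of the transformation on mean and sd
mean_check <- all.equal(mean(x_prime), a * mean(x) + b)
sd_check   <- all.equal(pop_sd(x_prime), a * pop_sd(x))

# Step 4: invariance of the correlation
r_xprime_y <- cor(x_prime, y)

c(r_xy = r_xy, r_from_z = r_from_z, r_xprime_y = r_xprime_y)
```

Comparisons use all.equal rather than == because the quantities agree only up to floating-point rounding.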

Exercise 3: Plotting bars for the WHO data

Read the data into R (2 points)

Read the WHO data set into R from the following URL

library(tidyverse)  # provides str_c (stringr), read_csv (readr) and glimpse (dplyr)
url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
WHO_data_url  <- str_c(url_prefix, "WHO.csv")

Read the data WHO.csv and save the data set as d. Take a glimpse at it.

Make a bar plot with geom_bar (4 points)

Make a bar plot using geom_bar to answer the question of how many countries per region were investigated in this data set. Your bar plot should look roughly like the following output. Make sure to use labs to set the axis labels as in the example presented here.

Make a bar plot with geom_col (4 points)

Produce the same plot as before with two changes: first, use geom_col (which entails that you have to do an additional step of data wrangling yourself); second, order the factor Region by the inverse of the calculated counts, using the function fct_reorder. Make sure to also include the axis labels as before. Also include the plot title “Countries per region”. The output should look roughly like the plot below:
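The difference between the two geoms can be illustrated on a toy data set. In the sketch below, the tibble toy and its values are made up, and "inverse of the counts" is interpreted as the negated counts (an assumption), so that the most frequent region comes first. geom_bar counts the rows itself, whereas geom_col plots the heights it is given, so the counting has to happen beforehand.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy stand-in for the WHO data: one row per country (values are made up)
toy <- tibble(
  Country = c("A", "B", "C", "D", "E", "F"),
  Region  = c("North", "North", "North", "South", "South", "East")
)

# geom_bar() counts the rows per Region internally
p_bar <- ggplot(toy, aes(x = Region)) +
  geom_bar() +
  labs(x = "Region", y = "Number of countries")

# geom_col() needs precomputed heights: count first, then reorder the factor
region_counts <- toy %>%
  count(Region, name = "n_countries") %>%
  mutate(Region = fct_reorder(Region, -n_countries))  # negated counts: descending order

p_col <- ggplot(region_counts, aes(x = Region, y = n_countries)) +
  geom_col() +
  labs(
    x = "Region",
    y = "Number of countries",
    title = "Countries per region"
  )
```

The axis labels here are placeholders; use whatever labels the example output shows.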

Plotting population per region (4 points)

Now we want to visualize the population size per region. To do so, use geom_col with Region on the \(x\)-axis and Population on the \(y\)-axis. Add the title “Population per region”.

Combining plots (4 points)

Use the plot_grid function from the cowplot package to place the two previous plots into a single plot on top of each other. The output should look a bit like the following. (If you did not manage to produce one or both of the previous plots, it is fine if you just take any other plots to combine.)
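As a hint, stacking two plots vertically with plot_grid amounts to requesting a single column. The sketch below uses two placeholder plots built from made-up numbers; the data frame toy and the plot names are illustrative only.

```r
library(ggplot2)
library(cowplot)

# Made-up numbers, only to have two plots to combine
toy <- data.frame(
  Region      = c("North", "South", "East"),
  n_countries = c(3, 2, 1),
  Population  = c(120, 80, 45)
)

p_top <- ggplot(toy, aes(x = Region, y = n_countries)) +
  geom_col() +
  labs(title = "Countries per region")

p_bottom <- ggplot(toy, aes(x = Region, y = Population)) +
  geom_col() +
  labs(title = "Population per region")

# ncol = 1 places the plots on top of each other
combined <- plot_grid(p_top, p_bottom, ncol = 1)
```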

Exercise 4: Violin plots for the WHO data

Let us now take a closer look at the variable ChildMortality in the WHO data set.

Create summary statistics (4 points)

Create a tibble grouped by Region with the following statistics for variable ChildMortality:

  • minimum (Min),
  • 0.25 quantile (“0.25_quant”),
  • 0.5 quantile (“0.5_quant”),
  • mean (mean),
  • 0.75 quantile (“0.75_quant”),
  • maximum (Max).

If you are ambitious and want to try, you could do this using the summary function that returns all of these values, and then use nested tibbles to produce the result. (This is perhaps a bit more complex than necessary, but might be fun for some of you to try.)

## # A tibble: 6 x 7
##   Region              Min `0.25_quant` `0.5_quant`  mean `0.75_quant`   Max
##   <chr>             <dbl>        <dbl>       <dbl> <dbl>        <dbl> <dbl>
## 1 Africa             13.1        58.6         81.8  84.0        102.  182. 
## 2 Americas            5.3        13.0         17.5  19.3         22.4  75.6
## 3 Eastern Mediterr~   7.4        11.2         18.4  40.2         69.8 147. 
## 4 Europe              2.2         3.8          4.8  10.1         10.7  58.3
## 5 South-East Asia     9.6        21           40.9  35.0         48.4  56.7
## 6 Western Pacific     2.9         9.55        22.4  24.7         34.1  71.8
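One straightforward route to a table of this shape, without nested tibbles, is a single summarise call per group. The sketch below runs on a made-up tibble called toy, not on the WHO data. Column names that start with a digit must be wrapped in backticks. If the real column contained missing values, each function would additionally need na.rm = TRUE.

```r
library(dplyr)

# Made-up data with the same column names as the WHO data
toy <- tibble(
  Region         = rep(c("North", "South"), each = 5),
  ChildMortality = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 100)
)

toy_summary <- toy %>%
  group_by(Region) %>%
  summarise(
    Min          = min(ChildMortality),
    `0.25_quant` = quantile(ChildMortality, 0.25, names = FALSE),
    `0.5_quant`  = quantile(ChildMortality, 0.5, names = FALSE),
    mean         = mean(ChildMortality),
    `0.75_quant` = quantile(ChildMortality, 0.75, names = FALSE),
    Max          = max(ChildMortality)
  )

toy_summary
```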

Violin plots for group comparisons (of means) (6 points)

Our (sad) research question, to be addressed with plotting, is whether there are differences in the mean child mortality rate between different world regions. We would therefore like to plot the distributions of this (metric) variable side-by-side, using violin plots.

Calculate the means of the variable ChildMortality for each region (using group_by and mutate) in a new column mean_cm of WHO_data. (Hint: it may be good to ungroup afterwards.)

Plot the variable ChildMortality on the \(y\)-axis, the variable Region on the \(x\)-axis using geom_violin, after reordering the variable Region in ascending order based on the calculated means (using function fct_reorder).

Adding means and confidence intervals to the violin plot (6 points)

Calculate the means and 95% bootstrapped confidence intervals of the mean of ChildMortality for each region. Store the result in a variable ci_means_cm. You can use the nested tibbles approach from the lecture, or you can glue the individual results together by hand.

Redo the previous plot and add another layer using the function geom_pointrange to draw the 95% CIs for each region inside of the violin plots. (Hint: You would supply ci_means_cm as the data for the layer defined by geom_pointrange. You would pass the mapping aes(x = Region, y = mean, ymin = lower, ymax = upper) to the function geom_pointrange.) If you are extra ambitious, try to fill the violins with a different color for each region, hide any legend that might pop up and draw the point ranges in a color that is visible on top of the colors of the violin plots. Roughly like in the plot below.
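The overall structure of such a layered plot can be sketched on simulated data. Everything below is illustrative: the tibble toy is randomly generated, boot_ci is a hypothetical helper rather than the course's function, and group_modify is just one possible way to get one CI row per region.

```r
library(dplyr)
library(forcats)
library(ggplot2)

set.seed(2024)  # arbitrary seed for reproducibility
# Simulated stand-in data (NOT the WHO data)
toy <- tibble(
  Region         = rep(c("North", "South", "East"), each = 40),
  ChildMortality = c(rnorm(40, 20, 5), rnorm(40, 60, 15), rnorm(40, 35, 8))
)

# Hypothetical helper: percentile bootstrap CI of the mean as a one-row tibble
boot_ci <- function(x, n_resamples = 2000) {
  resampled_means <- replicate(
    n_resamples,
    mean(sample(x, size = length(x), replace = TRUE))
  )
  tibble(
    lower = quantile(resampled_means, 0.025, names = FALSE),
    mean  = mean(x),
    upper = quantile(resampled_means, 0.975, names = FALSE)
  )
}

# Per-region means as a new column, then reorder Region by these means
toy <- toy %>%
  group_by(Region) %>%
  mutate(mean_cm = mean(ChildMortality)) %>%
  ungroup() %>%
  mutate(Region = fct_reorder(Region, mean_cm))

# One row per region with lower bound, mean and upper bound
ci_means_cm <- toy %>%
  group_by(Region) %>%
  group_modify(~ boot_ci(.x$ChildMortality)) %>%
  ungroup()

# Violins from the raw data, point ranges from the summary table
p_violin <- ggplot(toy, aes(x = Region, y = ChildMortality)) +
  geom_violin(aes(fill = Region), show.legend = FALSE) +
  geom_pointrange(
    data    = ci_means_cm,
    mapping = aes(x = Region, y = mean, ymin = lower, ymax = upper),
    color   = "black"
  )
```

The second layer receives its own data argument, so the violins are drawn from the raw observations while the point ranges are drawn from the summary table. Setting show.legend = FALSE on the violin layer hides the fill legend.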
