General instructions

If you need help, take a look at the suggested readings in the lecture, and make use of the cheat sheets and the built-in help in R.
Create an Rmd-file with your group number (equivalent to StudIP group) in the ‘author’ heading and answer the following questions.
When all answers are ready, ‘Knit’ the document to produce a HTML file.
Create a ZIP archive called “IDA_HW4-Group-XYZ.zip” (where ‘XYZ’ is your group number) containing:
- an R Markdown file “IDA_HW4-Group-XYZ.Rmd”
- a knitted HTML document “IDA_HW4-Group-XYZ.html”
Upload the ZIP archive on Stud.IP in your group folder before the deadline. You may upload as many times as you like before the deadline; only your final submission will count.
Include an R code chunk in your R Markdown file (the preamble) in which you set the following global options for the document. Set the options for this code chunk to echo = F so that it does not show up in your output:

knitr::opts_chunk$set(
  warning = FALSE, # suppress warnings by default
  message = FALSE  # suppress messages by default
)

Then include a code chunk which loads all required packages (just tidyverse and cowplot). Make sure that this code chunk, too, does not show in your output, using echo = F.
When chaining operations, please try to use the pipe %>% wherever reasonable. We will not indicate in a task explicitly that the pipe should be used, but we expect that you do it as a default of elegance.

Exercise 1: Bootstrapped confidence interval

Small vector (10 points)

For present purposes, let’s define a 2.5% quantile of a vector \(\vec{x}\) as the smallest number \(l\) (which occurs in \(\vec{x}\)) such that the percentage of numbers in \(\vec{x}\) which are no bigger than \(l\) is bigger than 2.5%. Using this definition, compute the lower bound of the 95% bootstrapped confidence interval of the mean for the vector \(\vec{d} = \langle 1, 2, 3 \rangle\). (It is not enough to just name the result. Please spell out the computation steps that you took to get to that result.)
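The quantile definition above can be sketched as a small helper in base R. The function name `smallest_quantile` and the example vector are illustrative assumptions, not part of the course material, and the example deliberately uses a different vector than \(\vec{d}\).

```r
# Hypothetical helper: smallest value l occurring in x such that the
# proportion of entries of x that are <= l is strictly larger than p.
smallest_quantile <- function(x, p = 0.025) {
  candidates <- sort(unique(x))
  # proportion of entries no bigger than each candidate value
  props <- sapply(candidates, function(l) mean(x <= l))
  candidates[which(props > p)[1]]
}

x <- 1:100
smallest_quantile(x)  # 3, because 3% of the entries are <= 3 and 3% > 2.5%
```

Applied to a vector of resampled means, the same helper would give the lower bound of the interval under this definition.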

Larger vector (4 points)

Use the function bootstrapped_CI defined in the course notes to compute the 95% bootstrapped CI for the following vectors (please use as parameter n_resamples = 1e5 for more precision in your estimates than the default number would give you):

d1 = c(1,2,3)
d2 = rep(d1,2)
d3 = rep(d1,10)
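For orientation, a percentile bootstrap CI of the mean can be sketched in base R as follows. This is not the bootstrapped_CI function from the course notes (use that one for your answer); the name `boot_ci_sketch`, its return format, and the use of `quantile` with its default settings are assumptions made for illustration.

```r
# Hypothetical sketch of a percentile bootstrap CI for the mean.
boot_ci_sketch <- function(data_vector, n_resamples = 1000) {
  # resample with replacement and record the mean of each resample
  resampled_means <- replicate(
    n_resamples,
    mean(sample(data_vector, size = length(data_vector), replace = TRUE))
  )
  c(
    lower = unname(quantile(resampled_means, 0.025)),
    mean  = mean(data_vector),
    upper = unname(quantile(resampled_means, 0.975))
  )
}

set.seed(123)  # only for reproducibility of this example
boot_ci_sketch(c(2, 4, 4, 6, 9), n_resamples = 1e4)
```

Note that `quantile` interpolates between values by default, so its results need not coincide with the quantile definition used in the first part of this exercise.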

Effect of vector size (6 points)

Describe in short, precise but intuitive terms why the results in the calculations for d1, d2 and d3 differ in the way they do.

Exercise 2: Correlation is invariant under positive linear transformation (16 points)

Give a mathematical proof of the claim that Pearson’s product-moment correlation is invariant under positive linear transformation. I.e., show that if \(x\) and \(y\) are one-dimensional vectors (of the same length \(n\)), and if \(x' = ax + b\) with \(a,b \in \mathbb{R}, a > 0\), then \(r_{xy} = r_{x'y}\).

To show this, proceed in the following steps (4 points each):

a. Show that Pearson’s correlation can be written in terms of standardized vectors \(z_{x}\) and \(z_{y}\), where \(z_{x_i} = \frac{x_i - \mu_{x}}{s_x}\) and similarly for \(z_y\). In particular, show that \(r_{xy} = \frac{1}{n} \sum_{i = 1}^n z_{x_i} \ z_{y_i}\).
b. Show that \(\mu_{x'} = a\ \mu_{x} + b\).
c. Show that \(s_{x'} = a\ s_x\).
d. Using the facts in a.-c., prove the invariance claim.
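A numerical sanity check does not replace the proof, but it can help catch algebra slips. The sketch below uses made-up vectors and arbitrarily chosen constants `a` and `b`; `sd_pop` is a hypothetical helper for the standard deviation with denominator \(n\), which is the version for which the \(\frac{1}{n}\) formula holds exactly (R's built-in `sd` divides by \(n - 1\)).

```r
x <- c(1.2, 3.4, 2.2, 5.9, 4.1)
y <- c(2.0, 2.9, 2.5, 6.3, 3.8)
a <- 2.5   # any a > 0
b <- -7

# standard deviation with denominator n (hypothetical helper)
sd_pop <- function(v) sqrt(mean((v - mean(v))^2))
# standardized vector
z <- function(v) (v - mean(v)) / sd_pop(v)

x_prime <- a * x + b

all.equal(cor(x, y), mean(z(x) * z(y)))    # correlation via z-scores
all.equal(mean(x_prime), a * mean(x) + b)  # mean of the transformed vector
all.equal(sd_pop(x_prime), a * sd_pop(x))  # sd of the transformed vector
all.equal(cor(x, y), cor(x_prime, y))      # invariance of the correlation
```

Each call should return TRUE up to floating-point tolerance. Trying a negative `a` shows why the claim is restricted to \(a > 0\).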

Exercise 3: Plotting bars for the WHO data

Read the data into R (2 points)

Read the WHO data set into R from the following URL:

url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
WHO_data_url  <- str_c(url_prefix, "WHO.csv")

Read the data WHO.csv and save the data set as d. Take a glimpse at it.

Make a bar plot with `geom_bar` (4 points)

Make a bar plot using geom_bar to answer the question of how many countries per region were investigated in this data set. Your bar plot should look roughly like the following output. Make sure to use labs to set the axis labels as in the example presented here.

Make a bar plot with `geom_col` (4 points)

Produce the same plot as before with two changes: first, use geom_col (which entails that you have to do an additional step of data wrangling yourself); second, order the factor region by the inverse of the calculated counts, using the function fct_reorder. Make sure to also include the axis labels as before. Also include the plot title “Countries per region”. The output should look roughly like the plot below:
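The difference between geom_bar and geom_col can be illustrated on made-up toy data. The tibble `toy`, its column `animal`, and the plot labels below are assumptions for illustration only and have nothing to do with the WHO data.

```r
library(tidyverse)

# made-up toy data, not the WHO data
toy <- tibble(
  animal = c("cat", "dog", "dog", "bird", "dog", "cat")
)

# geom_bar counts the rows per category itself
p_bar <- toy %>%
  ggplot(aes(x = animal)) +
  geom_bar() +
  labs(x = "animal", y = "count")

# geom_col expects the bar heights to be computed beforehand
toy_counts <- toy %>%
  count(animal) %>%
  mutate(animal = fct_reorder(animal, -n))  # largest count first

p_col <- toy_counts %>%
  ggplot(aes(x = animal, y = n)) +
  geom_col() +
  labs(x = "animal", y = "count", title = "Animals per kind")
```

Here `count` does the wrangling step that geom_bar performs implicitly, and negating `n` inside `fct_reorder` sorts the bars from most to least frequent.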

Plotting population per region (4 points)

Now we want to visualize the population size per region. To do so, use geom_col with Region on the \(x\)-axis and Population on the \(y\)-axis. Add the title “Population per region”.

Combining plots (4 points)

Use the plot_grid function from the cowplot package to place the two previous plots into a single plot on top of each other. The output should look a bit like the following. (If you did not manage to produce one or both of the previous plots, it is fine if you just take any other plots to combine.)
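As a minimal sketch of stacking two plots with plot_grid, consider the following example. The toy data and the two placeholder plots `p1` and `p2` are assumptions for illustration; in your solution you would pass your own two plots instead.

```r
library(tidyverse)
library(cowplot)

# made-up toy data
toy <- tibble(group = c("a", "b", "c"), value = c(3, 1, 2))

p1 <- ggplot(toy, aes(x = group, y = value)) +
  geom_col() +
  labs(title = "First plot")

p2 <- ggplot(toy, aes(x = group, y = value)) +
  geom_point() +
  labs(title = "Second plot")

# ncol = 1 places the plots on top of each other
combined <- plot_grid(p1, p2, ncol = 1)
combined
```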

Exercise 4: Violin plots for the WHO data

Let us now take a closer look at the variable ChildMortality in the WHO data set.

Create summary statistics (4 points)

Create a tibble grouped by Region with the following statistics for variable ChildMortality:

minimum (Min),
0.25 quantile (“0.25_quant”),
0.5 quantile (“0.5_quant”),
mean (mean),
0.75 quantile (“0.75_quant”),
maximum (Max).

If you are ambitious and want to try, you could do this using the summary function that returns all of these values, and then use nested tibbles to produce the result. (This is perhaps a bit more complex than necessary, but might be fun for some of you to try.)

## # A tibble: 6 x 7
##   Region              Min `0.25_quant` `0.5_quant`  mean `0.75_quant`   Max
##   <chr>             <dbl>        <dbl>       <dbl> <dbl>        <dbl> <dbl>
## 1 Africa             13.1        58.6         81.8  84.0        102.  182. 
## 2 Americas            5.3        13.0         17.5  19.3         22.4  75.6
## 3 Eastern Mediterr~   7.4        11.2         18.4  40.2         69.8 147. 
## 4 Europe              2.2         3.8          4.8  10.1         10.7  58.3
## 5 South-East Asia     9.6        21           40.9  35.0         48.4  56.7
## 6 Western Pacific     2.9         9.55        22.4  24.7         34.1  71.8
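A grouped summary of this shape can be sketched with group_by and summarise. The toy tibble and its columns `group` and `value` below are made up for illustration; the column names of the result mirror the ones requested above.

```r
library(tidyverse)

# made-up toy data, not the WHO data
toy <- tibble(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  value = c(1, 2, 3, 4, 10, 20, 30, 40)
)

toy_summary <- toy %>%
  group_by(group) %>%
  summarise(
    Min          = min(value),
    `0.25_quant` = quantile(value, 0.25),
    `0.5_quant`  = quantile(value, 0.5),
    mean         = mean(value),
    `0.75_quant` = quantile(value, 0.75),
    Max          = max(value)
  )

toy_summary
```

Column names that start with a digit, such as `0.25_quant`, need backticks. This is the plain approach; the variant based on summary and nested tibbles mentioned above is not shown here.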

Violin plots for group comparisons (of means) (6 points)

Our (sad) research question, to be addressed with plotting, is whether there are differences in the mean child mortality rate between different world regions. We would therefore like to plot the distributions of this (metric) variable side-by-side, using violin plots.

Calculate the means of the variable ChildMortality for each region (using group_by and mutate) in a new column mean_cm of WHO_data, i.e., the WHO data set you read in above. (Hint: it may be good to ungroup afterwards.)

Plot the variable ChildMortality on the \(y\)-axis, the variable Region on the \(x\)-axis using geom_violin, after reordering the variable Region in ascending order based on the calculated means (using function fct_reorder).

Adding means and confidence intervals to the violin plot (6 points)

Calculate the means and 95% bootstrapped confidence intervals for the mean of ChildMortality for each region. Store the result in a variable ci_means_cm. You can use the nested tibbles approach from the lecture, or you can glue each result together by hand.

Redo the previous plot and add another layer using the function geom_pointrange to draw the 95% CIs for each region inside of the violin plots. (Hint: You would supply ci_means_cm as the data for the layer defined by geom_pointrange. You would pass the mapping aes(x = Region, y = mean, ymin = lower, ymax = upper) to the function geom_pointrange.) If you are extra ambitious, try to fill the violins with a different color for each region, hide any legend that might pop up and draw the point ranges in a color that is visible on top of the colors of the violin plots. Roughly like in the plot below.
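How a geom_pointrange layer with its own data sits on top of geom_violin can be sketched on simulated toy data. Everything below is an assumption for illustration: the tibbles `toy` and `toy_ci`, their columns, and in particular the interval bounds, which are simply mean plus or minus 1 and stand in for bootstrapped CIs.

```r
library(tidyverse)

set.seed(2019)  # only so that the toy data is reproducible
toy <- tibble(
  group = rep(c("a", "b", "c"), each = 30),
  value = c(rnorm(30, 10, 2), rnorm(30, 14, 3), rnorm(30, 8, 1))
)

# made-up interval bounds (mean +/- 1), standing in for bootstrapped CIs
toy_ci <- toy %>%
  group_by(group) %>%
  summarise(mean = mean(value)) %>%
  mutate(lower = mean - 1, upper = mean + 1)

p <- toy %>%
  ggplot(aes(x = group, y = value)) +
  # one fill color per group, without a legend
  geom_violin(aes(fill = group), show.legend = FALSE) +
  # second layer with its own data and its own mapping
  geom_pointrange(
    data = toy_ci,
    mapping = aes(x = group, y = mean, ymin = lower, ymax = upper),
    color = "black"
  )

p
```

Because the point-range layer gets its own `data` and `mapping`, it does not need the `value` column of the main data set.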


Homework Sheet 4 – Plotting

Due: Friday, December 06 by 11:59 CET
