echo = F (so as not to have it show up in your output):
knitr::opts_chunk$set(
  warning = FALSE, # suppress warnings by default
  message = FALSE  # suppress messages by default
)
Then include a code chunk which loads all required packages (which are just tidyverse and cowplot). Make sure that this code chunk, too, will not show in your output, using echo = F.
When chaining operations, please use the pipe %>% wherever reasonable. We will not state explicitly in each task that the pipe should be used, but we expect you to use it by default, as a matter of elegance.
For present purposes, let’s define a 2.5% quantile of a vector \(\vec{x}\) as the smallest number \(l\) (which occurs in \(\vec{x}\)) such that the percentage of numbers in \(\vec{x}\) which are no bigger than \(l\) is greater than 2.5%. Using this definition, compute the lower bound of the 95% bootstrapped confidence interval of the mean for the vector \(\vec{d} = \langle 1, 2, 3 \rangle\). (It is not enough to just name the result. Please spell out the computation steps that you took to get to that result.)
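The quantile definition above can be written as a small helper. The following is a minimal sketch in base R; the function name smallest_quantile and the example vector 1:100 are illustrative assumptions and not part of the course material.

```r
# Hypothetical helper implementing the sheet's quantile definition:
# the smallest value l occurring in x such that the proportion of
# entries of x that are no bigger than l is strictly greater than p.
smallest_quantile <- function(x, p = 0.025) {
  candidates <- sort(unique(x))
  # proportion of entries no bigger than each candidate value
  props <- sapply(candidates, function(l) mean(x <= l))
  # first (i.e. smallest) candidate whose proportion exceeds p
  candidates[which(props > p)[1]]
}

# Illustrative call on the numbers 1 to 100
smallest_quantile(1:100)
```

For 1:100 this returns 3: only 2% of the entries are no bigger than 2, whereas 3% are no bigger than 3. To use it for a bootstrapped confidence interval, x would be the vector of resampled means.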
Use the function bootstrapped_CI defined in the course notes to compute the 95% bootstrapped CI for the following vectors (please use the parameter n_resamples = 1e5 for more precision in your estimates than the default number would give you):
d1 = c(1,2,3)
d2 = rep(d1,2)
d3 = rep(d1,10)
Describe in short, precise but intuitive terms why the results in the calculations for d1, d2 and d3 differ in the way they do.
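For readers without the course notes at hand, the general resampling idea might be sketched as follows. The function boot_ci_sketch is a hypothetical stand-in, not the course's bootstrapped_CI: its name, its return format and its use of R's default quantile() (rather than the quantile definition given earlier on this sheet) are all assumptions. The seed and the reduced number of resamples are likewise only for illustration.

```r
# Hypothetical stand-in for the course's bootstrapped_CI; NOT the original.
# Uses R's default quantile(), not the quantile definition from this sheet.
boot_ci_sketch <- function(x, n_resamples = 1000) {
  # mean of each resample, drawn with replacement and of the same size as x
  resampled_means <- replicate(
    n_resamples,
    mean(sample(x, size = length(x), replace = TRUE))
  )
  c(
    lower = unname(quantile(resampled_means, 0.025)),
    mean  = mean(x),
    upper = unname(quantile(resampled_means, 0.975))
  )
}

set.seed(2023)  # arbitrary seed, only to make the sketch reproducible
d1 <- c(1, 2, 3)
d3 <- rep(d1, 10)
ci_d1 <- boot_ci_sketch(d1, n_resamples = 2000)
ci_d3 <- boot_ci_sketch(d3, n_resamples = 2000)
ci_d1
ci_d3
```

For the actual task, the course function with n_resamples = 1e5 should be used; the sketch only shows which quantities are being computed.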
Give a mathematical proof of the claim that Pearson’s product-moment correlation is invariant under positive linear transformation. I.e., show that if \(x\) and \(y\) are one-dimensional vectors (of the same length \(n\)), and if \(x' = ax + b\) with \(a,b \in \mathbb{R}, a > 0\), then \(r_{xy} = r_{x'y}\).
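A numerical check is no substitute for the proof, but it can help to see what the claim says. The sketch below uses made-up data and arbitrarily chosen values for a and b; all names and numbers are illustrative assumptions.

```r
# Numerical sanity check of the invariance claim (not a proof).
set.seed(1)                  # arbitrary seed for reproducibility
x <- rnorm(50)               # made-up data
y <- 0.5 * x + rnorm(50)     # made-up data, correlated with x

a <- 2.5                     # any a > 0
b <- -4                      # any real b
x_prime <- a * x + b         # positive linear transformation of x

# Both correlations should agree up to floating-point error
all.equal(cor(x, y), cor(x_prime, y))
```

Trying a negative value for a shows why the restriction a > 0 matters: the correlation then changes sign.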
To show this, proceed in the following steps (4 points each):
Read the WHO data set into R from the following URL
url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
WHO_data_url <- str_c(url_prefix, "WHO.csv")
Read the data WHO.csv and save the data set as d. Take a glimpse at it.
geom_bar (4 points)
Make a bar plot using geom_bar to answer the question of how many countries per region were investigated in this data set. Your bar plot should look roughly like the following output. Make sure to use labs to set the axis labels as in the example presented here.
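The general pattern for such a plot might look like the sketch below. It uses a small made-up tibble instead of the WHO data, and the axis label texts are assumptions, since the example plot is not reproduced here.

```r
library(tidyverse)

# Toy data standing in for the WHO data set (values are made up)
toy <- tibble(
  Country = c("A", "B", "C", "D", "E", "F"),
  Region  = c("Africa", "Africa", "Africa", "Europe", "Europe", "Americas")
)

# geom_bar counts the rows per Region by itself
p_bar <- toy %>%
  ggplot(aes(x = Region)) +
  geom_bar() +
  labs(x = "Region", y = "Number of countries")  # label texts are assumptions

p_bar
```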
geom_col (4 points)
Produce the same plot as before with two changes: first, use geom_col (which entails that you have to do an additional step of data wrangling yourself); second, order the factor region by the inverse of the calculated counts, using the function fct_reorder. Make sure to also include the axis labels as before. Also include the plot title “Countries per region”. The output should look roughly like the plot below:
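The extra wrangling step and the reordering could be sketched as follows, again on made-up toy data. The name region_counts and the axis label texts are assumptions; negating the counts inside fct_reorder is one way to put the largest group first.

```r
library(tidyverse)

# Toy data standing in for the WHO data set (values are made up)
toy <- tibble(
  Region = c("Africa", "Africa", "Africa", "Europe", "Europe", "Americas")
)

# geom_col needs the counts as a column, so compute them first
region_counts <- toy %>%
  count(Region) %>%                        # one row per Region, counts in column n
  mutate(Region = fct_reorder(Region, -n)) # largest count becomes the first level

p_col <- region_counts %>%
  ggplot(aes(x = Region, y = n)) +
  geom_col() +
  labs(
    x = "Region",                # label texts are assumptions
    y = "Number of countries",
    title = "Countries per region"
  )

p_col
```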
Now we want to visualize the population size per region. To do so, use geom_col with Region on the \(x\) axis and Population on the \(y\) axis. Add the title “Population per region”.
Use the plot_grid function from the cowplot package to place the two previous plots into a single plot on top of each other. The output should look a bit like the following. (If you did not manage to produce one or both of the previous plots, it is fine if you just take any other plots to combine.)
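Stacking two plots vertically with plot_grid might look like the sketch below. The toy numbers and the object names p_top, p_bottom and p_both are made up for illustration.

```r
library(tidyverse)
library(cowplot)

# Toy data (made-up values) standing in for the wrangled WHO data
toy <- tibble(
  Region     = c("Africa", "Europe", "Americas"),
  n          = c(3, 2, 1),
  Population = c(300, 200, 100)
)

p_top <- ggplot(toy, aes(x = Region, y = n)) +
  geom_col() +
  labs(title = "Countries per region")

p_bottom <- ggplot(toy, aes(x = Region, y = Population)) +
  geom_col() +
  labs(title = "Population per region")

# ncol = 1 places the plots on top of each other
p_both <- plot_grid(p_top, p_bottom, ncol = 1)

p_both
```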
Let us now take a closer look at the variable ChildMortality in the WHO data set.
Create a tibble grouped by Region with the following statistics for the variable ChildMortality: minimum, 25% quantile, median, mean, 75% quantile and maximum.
If you are ambitious and want to try, you could do this using the summary function that returns all of these values, and then use nested tibbles to produce the result. (This is perhaps a bit more complex than necessary, but might be fun for some of you to try.)
## # A tibble: 6 x 7
##   Region              Min `0.25_quant` `0.5_quant`  mean `0.75_quant`   Max
##   <chr>             <dbl>        <dbl>       <dbl> <dbl>        <dbl> <dbl>
## 1 Africa             13.1         58.6        81.8  84.0         102.  182.
## 2 Americas            5.3         13.0        17.5  19.3         22.4  75.6
## 3 Eastern Mediterr~   7.4         11.2        18.4  40.2         69.8  147.
## 4 Europe              2.2          3.8         4.8  10.1         10.7  58.3
## 5 South-East Asia     9.6           21        40.9  35.0         48.4  56.7
## 6 Western Pacific     2.9         9.55        22.4  24.7         34.1  71.8
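A direct route to a table of this shape, without the summary function or nested tibbles, might be sketched as follows. The toy values are made up; the column names follow the output shown above. If the real column contained missing values, na.rm = TRUE would have to be added to each call.

```r
library(tidyverse)

# Toy data standing in for the WHO data set (values are made up)
toy <- tibble(
  Region         = rep(c("Africa", "Europe"), each = 4),
  ChildMortality = c(10, 20, 30, 40, 2, 4, 6, 8)
)

toy_summary <- toy %>%
  group_by(Region) %>%
  summarise(
    Min          = min(ChildMortality),
    `0.25_quant` = quantile(ChildMortality, 0.25),
    `0.5_quant`  = quantile(ChildMortality, 0.5),
    mean         = mean(ChildMortality),
    `0.75_quant` = quantile(ChildMortality, 0.75),
    Max          = max(ChildMortality)
  )

toy_summary
```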
Our (sad) research question, to be addressed with plotting, is whether there are differences in the mean child mortality rate between different world regions. We would therefore like to plot the distributions of this (metric) variable side-by-side, using violin plots.
Calculate the means of the variable ChildMortality for each region (using group_by and mutate) in a new column mean_cm of WHO_data. (Hint: it may be good to ungroup afterwards.)
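The difference between mutate and summarise in this step is that mutate keeps every row and repeats the group mean within each group. A minimal sketch on made-up toy data (the name toy_with_means is an assumption):

```r
library(tidyverse)

# Toy data standing in for the WHO data set (values are made up)
toy <- tibble(
  Region         = rep(c("Africa", "Europe"), each = 4),
  ChildMortality = c(10, 20, 30, 40, 2, 4, 6, 8)
)

toy_with_means <- toy %>%
  group_by(Region) %>%
  mutate(mean_cm = mean(ChildMortality)) %>%  # group mean, repeated per row
  ungroup()                                   # drop the grouping again

toy_with_means
```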
Plot the variable ChildMortality on the \(y\)-axis and the variable Region on the \(x\)-axis using geom_violin, after reordering the variable Region in ascending order based on the calculated means (using the function fct_reorder).
Calculate the means and 95% bootstrapped confidence intervals for the mean of ChildMortality for each region. Store the result in a variable ci_means_cm. You can use the nested tibbles approach from the lecture, or you can glue each result together by hand.
Redo the previous plot and add another layer using the function geom_pointrange to draw the 95% CIs for each region inside of the violin plots. (Hint: You would supply ci_means_cm as the data for the layer defined by geom_pointrange. You would pass the mapping aes(x = Region, y = mean, ymin = lower, ymax = upper) to the function geom_pointrange.) If you are extra ambitious, try to fill the violins with a different color for each region, hide any legend that might pop up and draw the point ranges in a color that is visible on top of the colors of the violin plots. Roughly like in the plot below.
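How the two layers fit together might be sketched as below. Everything here runs on simulated toy data, and the interval table ci_toy uses the mean plus or minus two standard errors as a placeholder; these are not bootstrapped CIs, which the task asks for. Only the column names mean, lower and upper follow the hint above.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed for reproducibility

# Simulated toy data standing in for the WHO data set
toy <- tibble(
  Region = rep(c("Africa", "Europe", "Americas"), each = 30),
  ChildMortality = c(rnorm(30, 80, 20), rnorm(30, 10, 3), rnorm(30, 20, 5))
) %>%
  group_by(Region) %>%
  mutate(mean_cm = mean(ChildMortality)) %>%
  ungroup()

# Placeholder intervals: mean +/- 2 standard errors, NOT bootstrapped CIs
ci_toy <- toy %>%
  group_by(Region) %>%
  summarise(
    mean = mean(ChildMortality),
    se   = sd(ChildMortality) / sqrt(n())
  ) %>%
  mutate(lower = mean - 2 * se, upper = mean + 2 * se)

p_violin <- toy %>%
  mutate(Region = fct_reorder(Region, mean_cm)) %>%  # ascending by group mean
  ggplot(aes(x = Region, y = ChildMortality)) +
  geom_violin(aes(fill = Region), show.legend = FALSE) +
  geom_pointrange(
    data    = ci_toy,  # separate data for this layer
    mapping = aes(x = Region, y = mean, ymin = lower, ymax = upper),
    color   = "black"
  )

p_violin
```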