Start your document with a setup chunk that uses the chunk option `echo = F` (so as not to have it show up in your output):

```r
knitr::opts_chunk$set(
  warning = FALSE, # suppress warnings by default
  message = FALSE  # suppress messages by default
)
```

Then include a code chunk which loads all required packages (which is just `tidyverse`). Make sure that this code chunk, too, will not show in your output, using `echo = F`.
When chaining operations, please use the pipe `%>%` wherever reasonable. We will not state explicitly in a task that the pipe should be used, but we expect you to use it by default, for the sake of elegance.
In this exercise we will be exploring data on views and likes/dislikes from YouTube users in the US and Germany. The data consists of three data sets which we will load, plug together and then explore.
Read the data into R from the following URLs. Store the data in the variables `YouTube_data_US`, `YouTube_data_DE` and `YouTube_data_categories`. Careful: the data in the "categories" data set is stored with the delimiter `;`, not `,`, despite the file ending ".csv". You therefore need to use the function `read_delim` and specify the correct delimiter.

(NB: There might well be warnings about parsing failures, but you do not need to worry about them.)
```r
url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
url_us  <- str_c(url_prefix, "YouTube-US.csv")
url_de  <- str_c(url_prefix, "YouTube-DE.csv")
url_cat <- str_c(url_prefix, "YouTube-categories.csv")
```
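Assuming the `tidyverse` is loaded and the URL variables from above are defined, the reading step could be sketched like this (note `read_delim` with `delim = ";"` for the categories file):

```r
library(tidyverse)

# regular comma-separated files
YouTube_data_US <- read_csv(url_us)
YouTube_data_DE <- read_csv(url_de)

# the categories file uses ";" as delimiter despite its ".csv" ending
YouTube_data_categories <- read_delim(url_cat, delim = ";")
```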
Print a glimpse of all three data sets.
Discard all columns except `title`, `channel_title`, `category_id`, `tags`, `views`, `likes`, `dislikes` and `comment_count` from `YouTube_data_US` and `YouTube_data_DE`.
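A minimal sketch of this step, using a hypothetical toy stand-in for the real data (the column values here are made up):

```r
library(tidyverse)

# toy stand-in with one extra column (hypothetical values)
YouTube_data_US <- tibble(
  video_id = "xyz", title = "t", channel_title = "c", category_id = 10,
  tags = "x", views = 100, likes = 10, dislikes = 1, comment_count = 2
)

# keep only the requested columns
YouTube_data_US <- YouTube_data_US %>%
  select(title, channel_title, category_id, tags,
         views, likes, dislikes, comment_count)

names(YouTube_data_US)
```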
(2 points) Add a new column that indicates the country to each of `YouTube_data_US` and `YouTube_data_DE`. Concretely, add a column `country` with entry "US" to `YouTube_data_US` and entry "GER" to `YouTube_data_DE`.

(Hint: If you specify a vector of length 1 inside of `mutate`, it will be expanded to a vector of the length of the data you are adding to.)
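The hint can be illustrated with a small hypothetical tibble (the titles and views are made up):

```r
library(tidyverse)

# toy stand-in for YouTube_data_US (hypothetical values)
d_us <- tibble(title = c("video 1", "video 2"), views = c(10, 20))

# a length-1 vector inside mutate is recycled to the full length of the data
d_us <- d_us %>% mutate(country = "US")

d_us
```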
Create a new data set out of `YouTube_data_US` and `YouTube_data_DE` by combining them row-wise. In other words, glue both data sets together vertically and save the new combined data set as `YouTube_data_combined`. Print a count of the number of rows in the new data set.
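Row-wise combination of tibbles can be sketched with `bind_rows` on toy stand-ins (hypothetical values):

```r
library(tidyverse)

# toy stand-ins (hypothetical values)
d_us <- tibble(title = "video 1", country = "US")
d_de <- tibble(title = "video 2", country = "GER")

# glue the two data sets together vertically
YouTube_data_combined <- bind_rows(d_us, d_de)

nrow(YouTube_data_combined)
```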
The `YouTube_data_categories` data set has three columns, one of which, namely `category_id`, it shares with the data set `YouTube_data_combined`. The columns `category_name` and `category_description` of the data set `YouTube_data_categories` might be helpful in the analysis later. Therefore, we want to join the information from both sources into a single data set.
Join the information of the data sets `YouTube_data_combined` and `YouTube_data_categories` and save the new data set as `YouTube_data_full`. Take a glimpse at it.

(Hint: Make use of `full_join` and use the appropriate column for the parameter `by` of that function.)
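The join can be sketched on hypothetical toy stand-ins that share the `category_id` column (all values made up):

```r
library(tidyverse)

# toy stand-ins (hypothetical values)
combined <- tibble(
  title       = c("video 1", "video 2"),
  category_id = c(10, 24)
)
categories <- tibble(
  category_id          = c(10, 24),
  category_name        = c("Music", "Entertainment"),
  category_description = c("desc 1", "desc 2")
)

# join on the shared column
YouTube_data_full <- full_join(combined, categories, by = "category_id")

glimpse(YouTube_data_full)
```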
The outcome of this preprocessing is also stored in a data set available online, with which the following exercise will continue.
Load the pre-processed YouTube data into the variable `YouTube_data_full` from the following URL:

```r
url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
url_full <- str_c(url_prefix, "YouTube-full.csv")
```
Calculate the mean values of `likes` (to be stored in column `mean_likes`) and `dislikes` (to be stored in column `mean_dislikes`) for each combination of entries in the columns `category_name` and `country`. Order the resulting tibble by `mean_likes` in descending order. The output should look roughly as follows:
```
## # A tibble: 28 x 4
## # Groups: category_name [15]
##    category_name        country mean_likes mean_dislikes
##    <chr>                <chr>        <dbl>         <dbl>
##  1 Music                GER        478364.        12604.
##  2 Music                US         105296.         3099.
##  3 Comedy               US          69876.         1813.
##  4 Science & Technology GER         43893          1231.
##  5 Comedy               GER         24846.         1391.
##  6 Howto & Style        US          22811.         1688.
##  7 Entertainment        US          19300.          939.
##  8 Howto & Style        GER         18911          1340.
##  9 Education            GER         18574.          234
## 10 Entertainment        GER         16541.          677.
## # ... with 18 more rows
```
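The grouped-summary pattern can be sketched on a small hypothetical stand-in for `YouTube_data_full` (all numbers made up):

```r
library(tidyverse)

# toy stand-in for YouTube_data_full (hypothetical values)
d <- tibble(
  category_name = c("Music", "Music", "Comedy"),
  country       = c("GER",   "GER",   "US"),
  likes         = c(100, 300, 50),
  dislikes      = c(10,  30,  5)
)

result <- d %>%
  group_by(category_name, country) %>%
  summarise(mean_likes    = mean(likes),
            mean_dislikes = mean(dislikes)) %>%
  arrange(desc(mean_likes))

result
```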
Find the title of the video with the most views in the category “Music” in Germany and the number of views and likes it has.
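One possible approach, sketched on a hypothetical toy data set (titles and counts made up); `slice_max` is one alternative to filtering for the maximum:

```r
library(tidyverse)

# toy stand-in data (hypothetical values)
d <- tibble(
  title         = c("song A", "song B"),
  category_name = c("Music",  "Music"),
  country       = c("GER",    "GER"),
  views         = c(5, 9),
  likes         = c(1, 3)
)

top_video <- d %>%
  filter(category_name == "Music", country == "GER") %>%
  filter(views == max(views)) %>%   # slice_max(views, n = 1) also works
  select(title, views, likes)

top_video
```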
How many instances are there for each category (column `category_name`) in the data set? Sort the list of counts in ascending order.

Now find the category whose number of occurrences is the median of all counts of category occurrences.
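A sketch of both steps on a hypothetical toy data set (category labels made up); note that with an even number of categories the median count need not be attained by any category:

```r
library(tidyverse)

# toy stand-in data (hypothetical values)
d <- tibble(category_name = c("A", "A", "B", "C", "C", "C"))

category_counts <- d %>%
  count(category_name) %>%
  arrange(n)

# category whose count equals the median of all counts
median_category <- category_counts %>%
  filter(n == median(n))

category_counts
median_category
```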
Select the columns `country`, `likes`, `dislikes` and `category_name`, then group the data set by `country` and `category_name`. Filter only the categories "Music" and "Science & Technology" and summarize the data set by calculating the mean and median of `likes` (name the summary columns in a reasonable manner). The output should look (roughly) like this:
```
## # A tibble: 4 x 4
## # Groups: country [2]
##   country category_name        likes_mean likes_median
##   <chr>   <chr>                     <dbl>        <dbl>
## 1 GER     Music                   478364.        17124
## 2 GER     Science & Technology     43893         34929
## 3 US      Music                   105296.        14902
## 4 US      Science & Technology     10736.         2826.
```
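The select–group–filter–summarise chain can be sketched on a hypothetical toy data set (all values made up):

```r
library(tidyverse)

# toy stand-in data (hypothetical values)
d <- tibble(
  country       = c("GER", "GER", "US", "US"),
  category_name = c("Music", "Music", "Music", "Sports"),
  likes         = c(10, 30, 20, 5),
  dislikes      = c(1, 2, 3, 1)
)

result <- d %>%
  select(country, likes, dislikes, category_name) %>%
  group_by(country, category_name) %>%
  filter(category_name %in% c("Music", "Science & Technology")) %>%
  summarise(likes_mean = mean(likes), likes_median = median(likes))

result
```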
What could be a reasonable explanation for the difference between the values for median and mean for the category “Music” in the German data?
In this exercise you will write a function that recovers the mode of a categorical variable, which is supplied either as a character vector or a factor. There are many ways to do this, but for this exercise we will use the tools of the tidyverse, in particular counting and filtering for maximal counts.

Concretely, write a function `mode_of_factor` which takes as input a single vector (character or factor) and returns the elements that occur most frequently in this vector. If there is more than one element with the highest number of occurrences, the function returns all of these values. To achieve this, you could use `count` (or similar) to count the number of occurrences of all elements in the vector, and then filter for those elements whose count is maximal.

Give an example of a metric vector and a single number such that adding the number to the vector does not change the median at all, but does change the mean dramatically. (Use R for the calculation of mean and median, so that there are no lingering doubts about how exactly to compute the median in case of ties etc.)
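The counting-and-filtering approach for the mode could be sketched as follows (one possible implementation; the internal column names are not prescribed by the exercise):

```r
library(tidyverse)

# sketch: count occurrences, keep all elements with the maximal count
mode_of_factor <- function(x) {
  tibble(value = as.character(x)) %>%
    count(value) %>%
    filter(n == max(n)) %>%
    pull(value)
}

mode_of_factor(c("a", "b", "b", "c"))          # a unique mode: "b"
mode_of_factor(factor(c("a", "a", "b", "b")))  # ties return all modal values
```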
Produce the formulas for the definition of the mean and variance of a vector \(\vec{x}\), as they appear in the course script, inside of R Markdown.
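As a sketch of what such R Markdown math could look like, the standard textbook definitions (the course script may differ in detail, e.g. normalizing by \(n-1\) instead of \(n\) in the variance) can be typeset like this:

```latex
$$
\mu_{\vec{x}} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
\mathrm{Var}(\vec{x}) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \mu_{\vec{x}} \right)^2
$$
```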