Start your document with a setup chunk that uses the chunk option `echo = F` (so as not to have it show up in your output):

```r
knitr::opts_chunk$set(
  warning = FALSE, # suppress warnings by default
  message = FALSE  # suppress messages by default
)
```

Then include a code chunk which loads all required packages (which is just `tidyverse`). Make sure that this code chunk, too, will not show in your output, using `echo = F`.
When chaining operations, please use the pipe `%>%` wherever reasonable. We will not state explicitly in a task that the pipe should be used, but we expect you to use it by default, for the sake of elegance.
In this exercise we will be exploring data on views and likes/dislikes from YouTube users in the US and Germany. The data consists of three data sets which we will load, plug together and then explore.
Read the data into R from the following URLs. Store the data in the variables `YouTube_data_US`, `YouTube_data_DE` and `YouTube_data_categories`. Careful: the data in the "categories" data set is stored with the delimiter `;`, not `,`, despite the file ending ".csv". You therefore need to use the function `read_delim` and specify the correct delimiter.

(NB: There might well be warnings about parsing failures, but you do not need to worry about them.)
```r
url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
url_us  <- str_c(url_prefix, "YouTube-US.csv")
url_de  <- str_c(url_prefix, "YouTube-DE.csv")
url_cat <- str_c(url_prefix, "YouTube-categories.csv")
```
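Assuming the `tidyverse` is loaded and the URL variables from above are defined, the reading step could be sketched like this (note `read_delim` with `delim = ";"` for the categories file):

```r
library(tidyverse)

# regular comma-separated files
YouTube_data_US <- read_csv(url_us)
YouTube_data_DE <- read_csv(url_de)

# the categories file uses ";" as delimiter despite its ".csv" ending
YouTube_data_categories <- read_delim(url_cat, delim = ";")
```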
Print a glimpse of all three data sets.
Discard all columns except `title`, `channel_title`, `category_id`, `tags`, `views`, `likes`, `dislikes` and `comment_count` from `YouTube_data_US` and `YouTube_data_DE`.
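A minimal sketch of this step, using a hypothetical toy stand-in for the real data (the column values here are made up):

```r
library(tidyverse)

# toy stand-in with one extra column (hypothetical values)
YouTube_data_US <- tibble(
  video_id = "xyz", title = "t", channel_title = "c", category_id = 10,
  tags = "x", views = 100, likes = 10, dislikes = 1, comment_count = 2
)

# keep only the requested columns
YouTube_data_US <- YouTube_data_US %>%
  select(title, channel_title, category_id, tags,
         views, likes, dislikes, comment_count)

names(YouTube_data_US)
```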
(2 points) Add a new column that indicates the country to each of `YouTube_data_US` and `YouTube_data_DE`. Concretely, add a column `country` with entry "US" to `YouTube_data_US` and entry "GER" to `YouTube_data_DE`.

(Hint: If you specify a vector of length 1 inside of `mutate`, it will be expanded to a vector of the length of the data you are adding to.)
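The hint can be illustrated with a small hypothetical tibble (the titles and views are made up):

```r
library(tidyverse)

# toy stand-in for YouTube_data_US (hypothetical values)
d_us <- tibble(title = c("video 1", "video 2"), views = c(10, 20))

# a length-1 vector inside mutate is recycled to the full length of the data
d_us <- d_us %>% mutate(country = "US")

d_us
```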
Create a new data set out of `YouTube_data_US` and `YouTube_data_DE` by combining them row-wise. In other words, glue both data sets together vertically and save the new combined data set as `YouTube_data_combined`. Print a count of the number of rows in the new data set.
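Row-wise combination of tibbles can be sketched with `bind_rows` on toy stand-ins (hypothetical values):

```r
library(tidyverse)

# toy stand-ins (hypothetical values)
d_us <- tibble(title = "video 1", country = "US")
d_de <- tibble(title = "video 2", country = "GER")

# glue the two data sets together vertically
YouTube_data_combined <- bind_rows(d_us, d_de)

nrow(YouTube_data_combined)
```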
The `YouTube_data_categories` data set has three columns, one of which, namely `category_id`, it shares with the data set `YouTube_data_combined`. The columns `category_name` and `category_description` of the data set `YouTube_data_categories` might be helpful in the analysis later. Therefore, we want to join the information from both sources into a single data set.
Join the information of the data sets `YouTube_data_combined` and `YouTube_data_categories` and save the new data set as `YouTube_data_full`. Take a glimpse at it.

(Hint: Make use of `full_join` and use the appropriate column for the parameter `by` of that function.)
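The join can be sketched on hypothetical toy stand-ins that share the `category_id` column (all values made up):

```r
library(tidyverse)

# toy stand-ins (hypothetical values)
combined <- tibble(
  title       = c("video 1", "video 2"),
  category_id = c(10, 24)
)
categories <- tibble(
  category_id          = c(10, 24),
  category_name        = c("Music", "Entertainment"),
  category_description = c("desc 1", "desc 2")
)

# join on the shared column
YouTube_data_full <- full_join(combined, categories, by = "category_id")

glimpse(YouTube_data_full)
```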
The outcome of this preprocessing is also stored in a data set available online, with which the following exercise will continue.
Load the pre-processed YouTube data into the variable `YouTube_data_full` from the following URL:

```r
url_prefix <- "https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/"
url_full <- str_c(url_prefix, "YouTube-full.csv")
```
Calculate the mean values of `likes` (to be stored in column `mean_likes`) and `dislikes` (to be stored in column `mean_dislikes`) for each combination of entries in the columns `category_name` and `country`. Order the resulting tibble by `mean_likes` in descending order. The output should look roughly as follows:
```
## # A tibble: 28 x 4
## # Groups: category_name [15]
##    category_name        country mean_likes mean_dislikes
##    <chr>                <chr>        <dbl>         <dbl>
##  1 Music                GER        478364.        12604.
##  2 Music                US         105296.         3099.
##  3 Comedy               US          69876.         1813.
##  4 Science & Technology GER         43893          1231.
##  5 Comedy               GER         24846.         1391.
##  6 Howto & Style        US          22811.         1688.
##  7 Entertainment        US          19300.          939.
##  8 Howto & Style        GER         18911          1340.
##  9 Education            GER         18574.          234
## 10 Entertainment        GER         16541.          677.
## # ... with 18 more rows
```
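The grouped-summary pattern can be sketched on a small hypothetical stand-in for `YouTube_data_full` (all numbers made up):

```r
library(tidyverse)

# toy stand-in for YouTube_data_full (hypothetical values)
d <- tibble(
  category_name = c("Music", "Music", "Comedy"),
  country       = c("GER",   "GER",   "US"),
  likes         = c(100, 300, 50),
  dislikes      = c(10,  30,  5)
)

result <- d %>%
  group_by(category_name, country) %>%
  summarise(mean_likes    = mean(likes),
            mean_dislikes = mean(dislikes)) %>%
  arrange(desc(mean_likes))

result
```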
Find the title of the video with the most views in the category “Music” in Germany and the number of views and likes it has.
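One possible approach, sketched on a hypothetical toy data set (titles and counts made up); `slice_max` is one alternative to filtering for the maximum:

```r
library(tidyverse)

# toy stand-in data (hypothetical values)
d <- tibble(
  title         = c("song A", "song B"),
  category_name = c("Music",  "Music"),
  country       = c("GER",    "GER"),
  views         = c(5, 9),
  likes         = c(1, 3)
)

top_video <- d %>%
  filter(category_name == "Music", country == "GER") %>%
  filter(views == max(views)) %>%   # slice_max(views, n = 1) also works
  select(title, views, likes)

top_video
```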
How many instances are there for each category (column `category_name`) in the data set? Sort the list of counts in ascending order.

Now find the category whose number of occurrences is the median of all counts of category occurrences.
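A sketch of both steps on a hypothetical toy data set (category labels made up); note that with an even number of categories the median count need not be attained by any category:

```r
library(tidyverse)

# toy stand-in data (hypothetical values)
d <- tibble(category_name = c("A", "A", "B", "C", "C", "C"))

category_counts <- d %>%
  count(category_name) %>%
  arrange(n)

# category whose count equals the median of all counts
median_category <- category_counts %>%
  filter(n == median(n))

category_counts
median_category
```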
Select the columns `country`, `likes`, `dislikes` and `category_name`, then group the data set by `country` and `category_name`. Filter only the categories "Music" and "Science & Technology" and summarize the data set by calculating the mean and median of `likes` (name the summary columns in a reasonable manner). The output should look (roughly) like this:
```
## # A tibble: 4 x 4
## # Groups: country [2]
##   country category_name        likes_mean likes_median
##   <chr>   <chr>                     <dbl>        <dbl>
## 1 GER     Music                   478364.        17124
## 2 GER     Science & Technology     43893         34929
## 3 US      Music                   105296.        14902
## 4 US      Science & Technology     10736.         2826.
```
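The select–group–filter–summarise chain can be sketched on a hypothetical toy data set (all values made up):

```r
library(tidyverse)

# toy stand-in data (hypothetical values)
d <- tibble(
  country       = c("GER", "GER", "US", "US"),
  category_name = c("Music", "Music", "Music", "Sports"),
  likes         = c(10, 30, 20, 5),
  dislikes      = c(1, 2, 3, 1)
)

result <- d %>%
  select(country, likes, dislikes, category_name) %>%
  group_by(country, category_name) %>%
  filter(category_name %in% c("Music", "Science & Technology")) %>%
  summarise(likes_mean = mean(likes), likes_median = median(likes))

result
```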
What could be a reasonable explanation for the difference between the values for median and mean for the category “Music” in the German data?
In this exercise you will write a function that recovers the mode of a categorical variable, which is supplied either as a character vector or a factor. There are many ways to do this, but for this exercise we will use the tools of the tidyverse, in particular counting and filtering for maximal counts.

Concretely, write a function `mode_of_factor` which takes as input a single vector (character or factor) and returns the elements that occur most frequently in this vector. If there is more than one element with the highest number of occurrences, the function returns all of these values. To achieve this, you could use `count` (or similar) to count the number of occurrences of all elements in the vector, and then filter for those elements whose count is maximal.

Give an example of a metric vector and a single number such that adding the number to the vector does not change the median at all, but does change the mean dramatically. (Use R for the calculation of mean and median, so that there are no lingering doubts about how exactly to compute the median in case of ties etc.)
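The counting-and-filtering approach for the mode could be sketched as follows (one possible implementation; the internal column names are not prescribed by the exercise):

```r
library(tidyverse)

# sketch: count occurrences, keep all elements with the maximal count
mode_of_factor <- function(x) {
  tibble(value = as.character(x)) %>%
    count(value) %>%
    filter(n == max(n)) %>%
    pull(value)
}

mode_of_factor(c("a", "b", "b", "c"))          # a unique mode: "b"
mode_of_factor(factor(c("a", "a", "b", "b")))  # ties return all modal values
```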
Produce the formulas for the definition of the mean and variance of a vector \(\vec{x}\), as they appear in the course script, inside of R Markdown.
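As a sketch of what such R Markdown math could look like, the standard textbook definitions (the course script may differ in detail, e.g. normalizing by \(n-1\) instead of \(n\) in the variance) can be typeset like this:

```latex
$$
\mu_{\vec{x}} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
\mathrm{Var}(\vec{x}) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \mu_{\vec{x}} \right)^2
$$
```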