First Homework: Data wrangling & summary statistics

Instructions

knitr::opts_chunk$set(
  warning = FALSE, # suppress warnings by default
  message = FALSE  # suppress messages by default
)

Exercise 1: Fictitious data from a button-press reaction time experiment

Exercise 1.A: Tidy up the mess (20 points)

Here’s a messy data set from an experiment in which participants saw three critical conditions and had to respond by pressing a button for either option A or option B. There were four participants in the experiment, identified anonymously in the variable subject_id. The button presses and associated reaction times (in milliseconds) from the three trials are stored, respectively, in the columns choices and reaction_times, each as a single string that separates the data from different trials with either a comma (for choices) or a single white space (for reaction_times).

messy_data <- tribble(
  ~subject_id,  ~choices,  ~reaction_times,
  1,            "A,B,B",   "312 433 365",
  2,            "B,A,B",   "393 491 327",
  3,            "B,A,A",   "356 313 475",
  4,            "A,B,B",   "292 352 378"
)

Use tidyverse tools to tidy up this data set. Please make sure that your output looks exactly like this:

## # A tibble: 12 x 4
##    subject_id condition response    RT
##         <dbl> <chr>     <chr>    <int>
##  1          1 C_1       A          312
##  2          1 C_2       B          433
##  3          1 C_3       B          365
##  4          2 C_1       B          393
##  5          2 C_2       A          491
##  6          2 C_3       B          327
##  7          3 C_1       B          356
##  8          3 C_2       A          313
##  9          3 C_3       A          475
## 10          4 C_1       A          292
## 11          4 C_2       B          352
## 12          4 C_3       B          378

Hint: Many roads lead to Rome. One road leading to the Rome shown above is to tidy up messy_data in two steps. Create a tidy data set for the choice data (using some combination of separate, a pivoting function, and possibly select), and another one for the reaction time data (using basically the same chain of operations). Then use a joining operation, e.g., full_join, possibly followed by massaging the output one more time with select. Careful: make sure that the column RT in the final output is of a numeric type (integer or double does not matter).
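To illustrate the separate-then-pivot part of the hint without giving away the full solution, here is a minimal sketch on a made-up toy tibble (the names toy, C_1 etc. are invented for this illustration; they are not part of the exercise):

```r
library(tidyverse)

# Toy tibble (NOT the exercise data): one string column holding
# comma-separated values for three hypothetical conditions
toy <- tribble(
  ~id, ~values,
  1,   "10,20,30",
  2,   "40,50,60"
)

tidy_toy <- toy %>%
  # split the string column into one column per condition
  separate(values, into = c("C_1", "C_2", "C_3"), sep = ",") %>%
  # reshape from wide to long: one row per id-condition pair
  pivot_longer(
    cols      = starts_with("C_"),
    names_to  = "condition",
    values_to = "value"
  ) %>%
  # separate() yields character columns, so convert explicitly
  mutate(value = as.integer(value))
```

The same chain, applied once to the choices column and once to reaction_times, produces the two intermediate tibbles the hint asks you to join.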

Exercise 1.B: Summarize the reaction times (8 points)

Use the final tidy representation of the messy_data from the previous exercise, stored in a variable tidy_data. If you have not managed to produce this representation with tools from the tidyverse, you can write the desired tibble by hand (without loss of points for this exercise). Produce a summary table of mean reaction times per condition, using the tools from the tidyverse. Your output should look like this:

## # A tibble: 3 x 2
##   condition mean_RT
##   <chr>       <dbl>
## 1 C_1          338.
## 2 C_2          397.
## 3 C_3          386.

Now produce a table giving the mean reaction times for each participant. Make sure that, in this case, the mean reaction times are rounded to full integers. (Hint: you can use mutate in a final step, or apply round inside of a call to summarise.) The output should look like this:

## # A tibble: 4 x 2
##   subject_id mean_RT
##        <dbl>   <dbl>
## 1          1     370
## 2          2     404
## 3          3     381
## 4          4     341
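The grouping-and-summarising pattern behind both tables can be sketched on a toy data set (the names and numbers below are made up for illustration, not taken from the exercise data):

```r
library(tidyverse)

# Toy data (NOT the exercise data): two groups, two measurements each
toy_rt <- tribble(
  ~group, ~RT,
  "a",    310,
  "a",    434,
  "b",    393,
  "b",    491
)

# group, then summarise; rounding happens inside summarise
rt_summary <- toy_rt %>%
  group_by(group) %>%
  summarise(mean_RT = round(mean(RT)))
```

Grouping by condition instead of group (or by subject_id for the per-participant table) gives the summaries asked for above.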

Exercise 2: The King of France visits IDA

We will work with the King of France experiment, in particular with the data generated by participants of this course. For a detailed description of the theoretical background and the procedure, look into Appendix D.4 of your lecture script.

Here is a condensed description of the materials. The data set consists of five vignettes:

  • V1. The King of France is bald.
  • V2. The Emperor of Canada is fond of sushi.
  • V3. The Pope’s wife is a lawyer.
  • V4. The Belgian rainforest provides a habitat for many species.
  • V5. The volcanoes of Germany dominate the landscape.

Each vignette consists of five critical conditions. The following five sentences are examples of the critical conditions for the first vignette.

  • C0. The king of France is bald.
  • C1. France has a king, and he is bald.
  • C6. The King of France isn’t bald.
  • C9. The King of France, he did not call Emmanuel Macron last night.
  • C10. Emmanuel Macron, he did not call the King of France last night.

Additionally, for each vignette there is a background check. This sentence is intended to find out whether participants know whether the relevant presupposition is true. The five background checks are:

  • BC1. France has a king.
  • BC2. The Pope is currently not married.
  • BC3. Canada is a democracy.
  • BC4. Belgium has rainforests.
  • BC5. Germany has volcanoes.

Finally, there are also 110 filler sentences, which do not carry a presupposition but still require common world knowledge for a correct answer. We will also use the filler sentences as controls, because each of them has a “correct” answer.

Exercise 2.A: Experimental design (10 points)

Look into the procedure described in Appendix D.4 of your script and answer the following questions:

  1. Is the “King of France” experiment an instance of a factorial design? If so, what is/are the factor(s), and what are the levels of each factor?
  2. Is this experiment a within-subjects or a between-subjects design?
  3. Give one advantage and one disadvantage of this design type (within- vs. between-subjects).
  4. Is this experiment a repeated-measures design?
  5. Indicate the dependent variable of the experiment (give the column name in the data representation) and the corresponding variable type.

Exercise 2.B: Exploring IDA’s King of France (14 points)

Load and inspect the data

Load the data from the in-class replication, using the following code:

data_KoF_raw_IDA <- 
  read_csv(url('https://raw.githubusercontent.com/michael-franke/intro-data-analysis/master/data_sets/king-of-france_data_raw_IDA.csv'))

At this point you should familiarize yourself with the data, e.g., by using glimpse or View (the latter only works in RStudio). After getting familiar with the data, answer the following questions (using appropriate and concise R code, which you should also reproduce as part of your submitted answers):

  1. How many rows does the data set in data_KoF_raw_IDA contain? (Hint: use the nrow function!)

  2. How many participants took part in the study? (Hint: use a sequence of operations pull, unique and length.)

  3. Print and include in your HTML document a list of all comments given in the experiment, printing each unique comment only once.

  4. Print and include in your HTML document a list of all answers given to the languages question, printing each unique answer only once.

  5. Calculate the grand average of the variable age, i.e., the average age across all participants. (Hint: As soon as a vector contains missing data (an entry NA), its mean is NA as well. Try removing the missing values when calculating the mean, e.g., by checking the documentation of the function mean for anything helpful.)

  6. Use the summary function to produce the five-number summary of the variable age. (NB: you do not need to remove NAs for this function.) The output of the summary function shows a set of descriptive statistics that is often referred to as the five-number summary (for further explanation see, e.g., the entry “Five-number summary”). It consists of the mean and the five most important sample percentiles: the sample minimum, the 0.25 quantile or first quartile, the 0.5 quantile or median, the 0.75 quantile or third quartile, and the sample maximum.

  7. Give the type of each of the following variables included in the data set (i.e., state whether it is ordinal, metric, etc.).

  • submission_id:
  • RT:
  • correct:
  • education:
  • item_version:
  • question:
  • response:
  • timeSpent:
  • trial_name:
  • trial_number:
  • trial_type:
  • vignette
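The hints in questions 2 and 5 can be illustrated on a small made-up data frame (the values below are invented; the column names merely mirror those in the real data set):

```r
library(tidyverse)

# Toy data (NOT the real experiment data): two rows per participant,
# one participant with a missing age
toy <- tibble(
  submission_id = c(101, 101, 102, 103, 103),
  age           = c(25, 25, NA, 31, 31)
)

# Question 2's hint: count distinct participants via pull / unique / length
n_participants <- toy %>% pull(submission_id) %>% unique() %>% length()

# Question 5's hint: mean() returns NA as soon as any entry is NA,
# unless missing values are removed with na.rm = TRUE
mean(toy$age)                             # NA
mean_age <- mean(toy$age, na.rm = TRUE)   # (25 + 25 + 31 + 31) / 4
```

The same two idioms, applied to data_KoF_raw_IDA, answer questions 2 and 5 above.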

Preprocessing the data

Selecting and creating the relevant columns (6 points)

Follow the preprocessing steps executed in the script up to (but not including) Section 4.5.3 (on cleaning). That is, copy-paste the code from the last code box in Section D.4.2 of the script to add the new column condition, just as done in the script. Store the result in a variable called data_KoF_preprocessed_IDA and select only the columns submission_id, trial_number, condition, vignette, question, correct, response (in that order).

Your output should look like this:

data_KoF_preprocessed_IDA
## # A tibble: 2,040 x 7
##    submission_id trial_number condition  vignette question correct response
##            <dbl>        <dbl> <ord>      <chr>    <chr>    <lgl>   <lgl>   
##  1           277            1 filler     none     Big Ben~ FALSE   FALSE   
##  2           277            2 filler     none     The Gre~ TRUE    FALSE   
##  3           277            3 Condition~ 5        The vol~ FALSE   TRUE    
##  4           277            4 filler     none     The Uni~ TRUE    TRUE    
##  5           277            5 filler     none     Elvis P~ FALSE   FALSE   
##  6           277            6 filler     none     William~ FALSE   FALSE   
##  7           277            7 Condition~ 1        Emmanue~ FALSE   TRUE    
##  8           277            8 filler     none     There a~ TRUE    TRUE    
##  9           277            9 Condition~ 2        The Emp~ FALSE   TRUE    
## 10           277           10 filler     none     Monkeys~ TRUE    TRUE    
## # ... with 2,030 more rows

Tidy? (4 points)

Is this last data representation tidy? Why (not)?

Towards testing a hypothesis (20 points)

Section D.4.1.2 of the script lists a number of research questions that we could raise for this data set. Let’s focus on the second, reproduced here:

  2. Is there a difference in (binary) truth-value judgements (aggregated over all vignettes) between C0 (with presupposition) and C1 (where the presupposition is part of the at-issue / asserted content)?

While we are still far from performing a statistical analysis, we do already have the tools to get at least an indicative pair of numbers that might help address this question, namely the proportion of “true”-judgements in condition C0 and those in C1. Compute these proportions by:

  • starting with the data stored in data_KoF_preprocessed_IDA
  • filtering the data so that only rows from the critical conditions C0 and C1 remain (Hint: you might find the operator %in% very useful; it tests whether an element is included in a vector, e.g., as in the expression condition %in% c("Condition 0", "Condition 1"))
  • grouping by the variable condition
  • using summarise to obtain the proportion of true judgements (Hint: if x is a Boolean vector, then mean(x) treats each entry of TRUE as 1 and each entry of FALSE as 0, so that the mean is exactly the proportion of occurrences of TRUE in x.)
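The bullet points above can be sketched on a toy data set (the values below are invented for illustration; the real proportions come from data_KoF_preprocessed_IDA):

```r
library(tidyverse)

# Toy data (NOT the real experiment data): condition labels and
# Boolean truth-value judgements
toy <- tibble(
  condition = c("Condition 0", "Condition 0", "Condition 1",
                "Condition 1", "filler"),
  response  = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

proportions <- toy %>%
  # keep only the two critical conditions
  filter(condition %in% c("Condition 0", "Condition 1")) %>%
  group_by(condition) %>%
  # mean of a Boolean vector = proportion of TRUE entries
  summarise(proportion_true = mean(response))
```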

Your final output should look like this:

## # A tibble: 2 x 2
##   condition   proportion_true
##   <ord>                 <dbl>
## 1 Condition 0          0.153 
## 2 Condition 1          0.0941

Notice that there is a perceptible difference between these numbers, but we still need to learn how to translate such numbers into mental currency, i.e., methods of translating such numbers (or the data that produced them) into statements of evidence (such as: “The data provide evidence that the conditions are different.”) or into decision criteria about whether to act as if we knew beyond doubt that the proportions are equal or not. This is what we will learn in the remainder of this course.