4.5 Case study: the King of France


Let’s go through one case study of data preprocessing and cleaning. We look at the example introduced and fully worked out in Appendix D.3. (Please read Section D.3.1 to find out more about where this data set comes from.)

The raw data set is part of the aida package and can be loaded using:

data_KoF_raw <- aida::data_KoF_raw

We then take a glimpse at the data:

glimpse(data_KoF_raw)

## Rows: 2,813
## Columns: 16
## $ submission_id  <dbl> 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, …
## $ RT             <dbl> 8110, 35557, 3647, 16037, 11816, 6024, 4986, 13019, 538…
## $ age            <dbl> 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57,…
## $ comments       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ item_version   <chr> "none", "none", "none", "none", "none", "none", "none",…
## $ correct_answer <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
## $ education      <chr> "Graduated College", "Graduated College", "Graduated Co…
## $ gender         <chr> "female", "female", "female", "female", "female", "fema…
## $ languages      <chr> "English", "English", "English", "English", "English", …
## $ question       <chr> "World War II was a global war that lasted from 1914 to…
## $ response       <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
## $ timeSpent      <dbl> 39.48995, 39.48995, 39.48995, 39.48995, 39.48995, 39.48…
## $ trial_name     <chr> "practice_trials", "practice_trials", "practice_trials"…
## $ trial_number   <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ trial_type     <chr> "practice", "practice", "practice", "practice", "practi…
## $ vignette       <chr> "undefined", "undefined", "undefined", "undefined", "un…

The variables in this data set are:

submission_id: unique identifier for each participant
RT: the reaction time for each decision
age: the (self-reported) age of the participant
comments: the (optional) comments each participant may have given
item_version: the condition which the test sentence belongs to (only given for trials of type main and special)
correct_answer: for trials of type filler and special what the true answer should have been
education: the (self-reported) education level with options Graduated College, Graduated High School, Higher Degree
gender: (self-reported) gender
languages: (self-reported) native languages
question: the sentence to be judged true or false
response: the answer (“TRUE” or “FALSE”) on each trial
trial_name: whether the trial is a main or a practice trial (levels main_trials and practice_trials)
trial_number: consecutive numbering of each participant’s trial
trial_type: whether the trial was of the category filler, main, practice or special, where the latter encodes the “background checks”
vignette: the current item’s vignette number (applies only to trials of type main and special)

Let’s have a brief look at the comments (sometimes helpful, usually entertaining) and the self-reported native languages:

data_KoF_raw %>% pull(comments) %>% unique

##  [1] NA                                                                                                                                                         
##  [2] "I hope I was right most of the time!"                                                                                                                     
##  [3] "My level of education is Some Highschool, not finished. So I couldn't input what was correct, so I'm leaving a comment here."                             
##  [4] "It was interesting, and made re-read questions to make sure they weren't tricks. I hope I got them all correct."                                          
##  [5] "Worked well"                                                                                                                                              
##  [6] "A surprisingly tricky study! Thoroughly enjoyed completing it, despite several red herrings!!"                                                            
##  [7] "n/a"                                                                                                                                                      
##  [8] "Thank you for the opportunity."                                                                                                                           
##  [9] "this was challenging"                                                                                                                                     
## [10] "I'm not good at learning history so i might of made couple of mistakes. I hope I did well. :)"                                                            
## [11] "Interesting survey - thanks!"                                                                                                                             
## [12] "no"                                                                                                                                                       
## [13] "Regarding the practice question - I'm aware that Alexander Bell invented the telephone, but in reality, it was a collaborative effort by a team of people"
## [14] "Fun study!"                                                                                                                                               
## [15] "Fun stuff"

data_KoF_raw %>% pull(languages) %>% unique

##  [1] "English"             "english"             "English, Italian"   
##  [4] "English/ ASL"        "English and Polish"  "Chinese"            
##  [7] "English, Mandarin"   "Polish"              "Turkish"            
## [10] NA                    "English, Sarcasm"    "English, Portuguese"

In some studies, we might wish to exclude people who do not list “English” among their native languages. Here, we do not, since we also apply strong, more specific filters on comprehension (see below). Since we are not going to use this information later on, we might as well discard it now, together with the other columns we will not need:

data_KoF_raw <- data_KoF_raw %>% 
  select(-languages, -comments, -age, -RT, -education, -gender)
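
For studies in which such a language filter is wanted, a minimal sketch could look as follows. The tibble `d_toy` and its values are made up for illustration, and the matching rule (a case-insensitive match on "english") is an assumption, not a procedure used in this case study.

```r
library(dplyr)
library(stringr)
library(tibble)

# hypothetical toy data: one row per participant (values invented for illustration)
d_toy <- tibble(
  submission_id = 1:5,
  languages     = c("English", "english", "Polish", NA, "English, Italian")
)

# keep participants whose self-reported native languages mention English;
# str_detect() returns NA for missing entries, and filter() drops those rows too
d_english <- d_toy %>%
  filter(str_detect(languages, regex("english", ignore_case = TRUE)))

d_english
```

Whether participants with a missing entry should be excluded as well is a design decision; here they are dropped simply because `filter()` keeps only rows for which the condition is `TRUE`.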

But even after pruning irrelevant columns, this data set is still not ideal. We need to preprocess it further to make it easier to work with. For example, the column trial_name does not give the trial’s name in any intuitive sense, but rather its type: whether it is a practice or a main trial. This information, and more, is also represented in the column trial_type. The column item_version, in turn, contains information about the experimental condition. To see this (mess), the code below prints selected columns for all non-practice trials of a single participant, arranged so that it is easier to see what is what.

data_KoF_raw %>% 
  # ignore practice trials for the moment
  # focus on one participant only
  filter(trial_type != "practice", submission_id == 192) %>% 
  select(trial_type, item_version, question) %>% 
  arrange(desc(trial_type), item_version) %>% 
  print(n = Inf)

## # A tibble: 24 × 3
##    trial_type item_version question                                             
##    <chr>      <chr>        <chr>                                                
##  1 special    none         The Pope is currently not married.                   
##  2 special    none         Germany has volcanoes.                               
##  3 special    none         France has a king.                                   
##  4 special    none         Canada is a democracy.                               
##  5 special    none         Belgium has rainforests.                             
##  6 main       0            The volcanoes of Germany dominate the landscape.     
##  7 main       1            Canada has an emperor, and he is fond of sushi.      
##  8 main       10           Donald Trump, his favorite nature spot is not the Be…
##  9 main       6            The King of France isn’t bald.                       
## 10 main       9            The Pope’s wife, she did not invite Angela Merkel fo…
## 11 filler     none         The Solar System includes the planet Earth.          
## 12 filler     none         Vatican City is the world's largest country by land …
## 13 filler     none         Big Ben is a very large building in the middle of Pa…
## 14 filler     none         Harry Potter is a series of fantasy novels written b…
## 15 filler     none         Taj Mahal is a mausoleum on the bank of the river in…
## 16 filler     none         James Bond is a spanish dancer from Madrid.          
## 17 filler     none         The Pacific Ocean is a large ocean between Japan and…
## 18 filler     none         Australia has a very large border with Brazil.       
## 19 filler     none         Steve Jobs was an American inventor and co-founder o…
## 20 filler     none         Planet Earth is part of the galaxy ‘Milky Way’.      
## 21 filler     none         Germany shares borders with France, Belgium and Denm…
## 22 filler     none         Antarctica is a continent covered almost completely …
## 23 filler     none         The Statue of Liberty is a colossal sculpture on Lib…
## 24 filler     none         English is the main language in Australia, Britain a…

We see that the information in item_version specifies the critical condition. To make this easier to work with, we would like to have a column called condition which, ideally, also contains useful information for the cases where trial_type is not main or special. We will therefore remove the column trial_name completely and create an informative column condition, which tells us for every row whether it belongs to one of the five experimental conditions and, if not, whether it is a filler or a “background check” (= special) trial.

data_KoF_processed <- data_KoF_raw %>% 
  # drop redundant information in column `trial_name`
  select(-trial_name) %>% 
  # discard practice trials
  filter(trial_type != "practice") %>% 
  mutate(
    # add a 'condition' variable
    condition = case_when(
      trial_type == "special" ~ "background check",
      trial_type == "main" ~ str_c("Condition ", item_version),
      TRUE ~ "filler"
    ) %>% 
      # make the new 'condition' variable a factor
      factor( 
        ordered = TRUE,
        levels = c(
          str_c("Condition ", c(0, 1, 6, 9, 10)), 
          "background check", "filler"
        )
      )
  )
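
To see what this recoding does, here is a minimal sketch on a made-up tibble. The names `d_toy` and `condition_levels` and all values are hypothetical; only the recoding logic mirrors the code above.

```r
library(dplyr)
library(stringr)
library(tibble)

# hypothetical toy data mimicking the two relevant columns (values invented)
d_toy <- tibble(
  trial_type   = c("main", "main", "special", "filler"),
  item_version = c("0", "10", "none", "none")
)

# the intended order of the factor levels
condition_levels <- c(
  str_c("Condition ", c(0, 1, 6, 9, 10)),
  "background check", "filler"
)

d_toy <- d_toy %>%
  mutate(
    condition = case_when(
      trial_type == "special" ~ "background check",
      trial_type == "main"    ~ str_c("Condition ", item_version),
      TRUE                    ~ "filler"
    ) %>%
      factor(ordered = TRUE, levels = condition_levels)
  )

# cross-tabulate the old and the new coding to verify the mapping
d_toy %>% count(trial_type, condition)

# values not listed in `levels` would silently become NA, so count them
sum(is.na(d_toy$condition))
```

Spelling out the levels explicitly also fixes their order. Sorted as plain strings, "Condition 10" would come before "Condition 6", just as item_version is sorted 0, 1, 10, 6, 9 in the output above.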

4.5.1 Cleaning the data

We clean the data in two consecutive steps:

Remove all data from any participant who got more than 50% of the answers to the filler material wrong.
Remove individual main trials if the corresponding “background check” question was answered wrongly.

4.5.1.1 Cleaning by-participant

# look at error rates for filler sentences by subject
# mark every subject as an outlier when they 
# have a proportion of correct responses of less than 0.5 
subject_error_rate <- data_KoF_processed %>% 
  filter(trial_type == "filler") %>% 
  group_by(submission_id) %>% 
  summarise(
    proportion_correct = mean(correct_answer == response),
    outlier_subject = proportion_correct < 0.5
  ) %>% 
  arrange(proportion_correct)

Apply the cleaning step:

# add info about error rates and exclude outlier subject(s)
d_cleaned <- 
  full_join(data_KoF_processed, subject_error_rate, by = "submission_id") %>% 
  filter(outlier_subject == FALSE)
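
The by-participant step can be checked by hand on a small made-up example. The tibble `d_toy` and its values are invented for illustration. The final `anti_join()` is an alternative way of dropping flagged participants, shown here as an assumption about what might be convenient, not as the method used above.

```r
library(dplyr)
library(tibble)

# hypothetical toy data: two participants with four filler trials each
d_toy <- tibble(
  submission_id  = rep(c(1, 2), each = 4),
  trial_type     = "filler",
  correct_answer = c(TRUE, FALSE, TRUE, TRUE,   TRUE, TRUE, FALSE, FALSE),
  response       = c(TRUE, FALSE, TRUE, FALSE,  FALSE, FALSE, TRUE, FALSE)
)

error_rate_toy <- d_toy %>%
  filter(trial_type == "filler") %>%
  group_by(submission_id) %>%
  summarise(
    proportion_correct = mean(correct_answer == response),
    outlier_subject    = proportion_correct < 0.5
  )

# participant 1: 3 of 4 correct (0.75); participant 2: 1 of 4 correct (0.25)
error_rate_toy

# alternative: anti_join() removes all rows of flagged participants in one step
d_toy_cleaned <- d_toy %>%
  anti_join(filter(error_rate_toy, outlier_subject), by = "submission_id")

d_toy_cleaned
```

Note that a participant with exactly half of the fillers correct is not flagged, which matches the criterion of "more than 50% wrong". Unlike the `full_join()` version, the `anti_join()` variant does not add the columns proportion_correct and outlier_subject to the data.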

4.5.1.2 Cleaning by-trial

# exclude every critical trial whose 'background' test question was answered wrongly
d_cleaned <- d_cleaned %>% 
  # select only the 'background question' trials
  filter(trial_type == "special") %>% 
  # is the background question answered correctly?
  mutate(
    background_correct = correct_answer == response
  ) %>%
  # select only the relevant columns
  select(submission_id, vignette, background_correct) %>%
  # right join lines to original data set 
  right_join(d_cleaned, by = c("submission_id", "vignette")) %>% 
  # keep only main trials with a correctly answered background check
  # (this also drops all filler and special trials)
  filter(trial_type == "main" & background_correct == TRUE)
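
The logic of the by-trial step may be easier to follow on a small made-up example. The tibble `d_toy` and its values are invented, and the main trials are simply given `NA` as correct_answer here. The sketch builds the lookup table first and then uses a `left_join()` from the full data, which should keep the same rows as the `right_join()` above.

```r
library(dplyr)
library(tibble)

# hypothetical toy data: one participant, two vignettes, each with
# one background check ("special") and one main trial
d_toy <- tibble(
  submission_id  = 1,
  vignette       = c("1", "1", "2", "2"),
  trial_type     = c("special", "main", "special", "main"),
  correct_answer = c(TRUE, NA, FALSE, NA),
  response       = c(TRUE, FALSE, TRUE, TRUE)
)

# one row per participant and vignette: was the background check passed?
background_toy <- d_toy %>%
  filter(trial_type == "special") %>%
  mutate(background_correct = correct_answer == response) %>%
  select(submission_id, vignette, background_correct)

# attach that information to every trial, then keep only main trials
# whose background check was answered correctly
d_toy_main <- d_toy %>%
  left_join(background_toy, by = c("submission_id", "vignette")) %>%
  filter(trial_type == "main", background_correct)

d_toy_main
```

In this toy data, the background check for vignette 1 is answered correctly and the one for vignette 2 is not, so only the main trial of vignette 1 survives.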

For later reuse, both the preprocessed and the cleaned data set are included in the aida package as well. They are loaded by calling aida::data_KoF_preprocessed and aida::data_KoF_cleaned, respectively.