4.5 Case study: the King of France

Let’s go through one case study of data preprocessing and cleaning. We look at the example introduced and fully worked out in Appendix D.3. (Please read Section D.3.1 to find out more about where this data set comes from.)
The raw data set is part of the aida package and can be loaded using:
data_KoF_raw <- aida::data_KoF_raw
We then take a glimpse at the data:
glimpse(data_KoF_raw)
## Rows: 2,813
## Columns: 16
## $ submission_id <dbl> 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, …
## $ RT <dbl> 8110, 35557, 3647, 16037, 11816, 6024, 4986, 13019, 538…
## $ age <dbl> 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57,…
## $ comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ item_version <chr> "none", "none", "none", "none", "none", "none", "none",…
## $ correct_answer <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
## $ education <chr> "Graduated College", "Graduated College", "Graduated Co…
## $ gender <chr> "female", "female", "female", "female", "female", "fema…
## $ languages <chr> "English", "English", "English", "English", "English", …
## $ question <chr> "World War II was a global war that lasted from 1914 to…
## $ response <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
## $ timeSpent <dbl> 39.48995, 39.48995, 39.48995, 39.48995, 39.48995, 39.48…
## $ trial_name <chr> "practice_trials", "practice_trials", "practice_trials"…
## $ trial_number <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ trial_type <chr> "practice", "practice", "practice", "practice", "practi…
## $ vignette <chr> "undefined", "undefined", "undefined", "undefined", "un…
The variables in this data set are:
- submission_id: unique identifier for each participant
- RT: the reaction time for each decision
- age: the (self-reported) age of the participant
- comments: the (optional) comments each participant may have given
- item_version: the condition which the test sentence belongs to (only given for trials of type main and special)
- correct_answer: for trials of type filler and special, what the true answer should have been
- education: the (self-reported) education level with options Graduated College, Graduated High School, Higher Degree
- gender: (self-reported) gender
- languages: (self-reported) native languages
- question: the sentence to be judged true or false
- response: the answer (“TRUE” or “FALSE”) on each trial
- trial_name: whether the trial is a main or practice trial (levels main_trials and practice_trials)
- trial_number: consecutive numbering of each participant’s trials
- trial_type: whether the trial was of the category filler, main, practice or special, where the latter encodes the “background checks”
- vignette: the current item’s vignette number (applies only to trials of type main and special)
Let’s have a brief look at the comments (sometimes helpful, usually entertaining) and the self-reported native languages:
data_KoF_raw %>% pull(comments) %>% unique
## [1] NA
## [2] "I hope I was right most of the time!"
## [3] "My level of education is Some Highschool, not finished. So I couldn't input what was correct, so I'm leaving a comment here."
## [4] "It was interesting, and made re-read questions to make sure they weren't tricks. I hope I got them all correct."
## [5] "Worked well"
## [6] "A surprisingly tricky study! Thoroughly enjoyed completing it, despite several red herrings!!"
## [7] "n/a"
## [8] "Thank you for the opportunity."
## [9] "this was challenging"
## [10] "I'm not good at learning history so i might of made couple of mistakes. I hope I did well. :)"
## [11] "Interesting survey - thanks!"
## [12] "no"
## [13] "Regarding the practice question - I'm aware that Alexander Bell invented the telephone, but in reality, it was a collaborative effort by a team of people"
## [14] "Fun study!"
## [15] "Fun stuff"
data_KoF_raw %>% pull(languages) %>% unique
## [1] "English" "english" "English, Italian"
## [4] "English/ ASL" "English and Polish" "Chinese"
## [7] "English, Mandarin" "Polish" "Turkish"
## [10] NA "English, Sarcasm" "English, Portuguese"
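The language entries above are free text with inconsistent capitalization ("English" vs. "english") and one missing value. A study that did want to filter on native language could flag entries with a case-insensitive match. The following is only a minimal sketch in base R, using a vector copied from the output above; the name mentions_english is made up for this illustration and is not part of the data set.

```r
# values copied from the output of `unique` on the `languages` column above
languages <- c(
  "English", "english", "English, Italian", "English/ ASL",
  "English and Polish", "Chinese", "English, Mandarin", "Polish",
  "Turkish", NA, "English, Sarcasm", "English, Portuguese"
)

# TRUE if the entry mentions English in any capitalization;
# `grepl` returns FALSE for NA, so missing entries count as "no English"
mentions_english <- grepl("english", languages, ignore.case = TRUE)

# entries that such a filter would flag for exclusion
languages[!mentions_english]
```

On this vector, the flagged entries are "Chinese", "Polish", "Turkish" and the missing value. Whether a missing entry should lead to exclusion is a design decision that such a filter would have to make explicit.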
In some studies, we might wish to exclude people who do not list “English” among their native languages. Here, we do not, since we also have strong, more specific filters on comprehension (see below). Since we are not going to use this information later on, we might as well discard it now:
data_KoF_raw <- data_KoF_raw %>%
  select(-languages, -comments, -age, -RT, -education, -gender)
But even after pruning irrelevant columns, this data set is still not ideal. We need to preprocess it more thoroughly to make it more intuitively manageable. For example, the information in column trial_name does not give the trial’s name in an intuitive sense, but its type: whether it is a practice or a main trial. But this information, and more, is also represented in the column trial_type. The column item_version contains information about the experimental condition. To see this (mess), the code below prints the relevant columns for all non-practice trials of a single participant, in an order that makes it easier to see what is what.
data_KoF_raw %>%
  # ignore practice trials for the moment
  # focus on one participant only
  filter(trial_type != "practice", submission_id == 192) %>%
  select(trial_type, item_version, question) %>%
  arrange(desc(trial_type), item_version) %>%
  print(n = Inf)
## # A tibble: 24 × 3
## trial_type item_version question
## <chr> <chr> <chr>
## 1 special none The Pope is currently not married.
## 2 special none Germany has volcanoes.
## 3 special none France has a king.
## 4 special none Canada is a democracy.
## 5 special none Belgium has rainforests.
## 6 main 0 The volcanoes of Germany dominate the landscape.
## 7 main 1 Canada has an emperor, and he is fond of sushi.
## 8 main 10 Donald Trump, his favorite nature spot is not the Be…
## 9 main 6 The King of France isn’t bald.
## 10 main 9 The Pope’s wife, she did not invite Angela Merkel fo…
## 11 filler none The Solar System includes the planet Earth.
## 12 filler none Vatican City is the world's largest country by land …
## 13 filler none Big Ben is a very large building in the middle of Pa…
## 14 filler none Harry Potter is a series of fantasy novels written b…
## 15 filler none Taj Mahal is a mausoleum on the bank of the river in…
## 16 filler none James Bond is a spanish dancer from Madrid.
## 17 filler none The Pacific Ocean is a large ocean between Japan and…
## 18 filler none Australia has a very large border with Brazil.
## 19 filler none Steve Jobs was an American inventor and co-founder o…
## 20 filler none Planet Earth is part of the galaxy ‘Milky Way’.
## 21 filler none Germany shares borders with France, Belgium and Denm…
## 22 filler none Antarctica is a continent covered almost completely …
## 23 filler none The Statue of Liberty is a colossal sculpture on Lib…
## 24 filler none English is the main language in Australia, Britain a…
We see that the information in item_version specifies the critical condition. To make this more intuitively manageable, we would like to have a column called condition which, ideally, also contains useful information for the cases where trial_type is not main or special. We therefore remove the column trial_name completely and create an informative column condition, which tells us for every row whether it belongs to one of the five experimental conditions and, if not, whether it is a filler or a “background check” (= special) trial.
data_KoF_processed <- data_KoF_raw %>%
  # drop redundant information in column `trial_name`
  select(-trial_name) %>%
  # discard practice trials
  filter(trial_type != "practice") %>%
  mutate(
    # add a 'condition' variable
    condition = case_when(
      trial_type == "special" ~ "background check",
      trial_type == "main" ~ str_c("Condition ", item_version),
      TRUE ~ "filler"
    ) %>%
      # make the new 'condition' variable a factor
      factor(
        ordered = TRUE,
        levels = c(
          str_c("Condition ", c(0, 1, 6, 9, 10)),
          "background check", "filler"
        )
      )
  )
4.5.1 Cleaning the data
We clean the data in two consecutive steps:
- Remove all data from any participant who got more than 50% of the answers to the filler material wrong.
- Remove individual main trials if the corresponding “background check” question was answered wrongly.
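The two steps above can be traced by eye on a small example. The following is a minimal sketch on a made-up toy data set (two hypothetical participants, not taken from the experiment). The object names toy, keep_ids, toy_step1, checks and toy_cleaned are invented for this illustration, and the joins are written slightly differently from the code used for the real data in the following subsections.

```r
library(dplyr)

# Hypothetical toy data: two participants, each with four fillers,
# two background checks ("special") and two main trials.
toy <- data.frame(
  submission_id  = rep(c(1, 2), each = 8),
  trial_type     = rep(c(rep("filler", 4), "special", "special", "main", "main"), times = 2),
  vignette       = rep(c(rep("undefined", 4), "1", "2", "1", "2"), times = 2),
  correct_answer = rep(c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, NA, NA), times = 2),
  response       = c(
    TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE,   # participant 1
    FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE   # participant 2
  ),
  stringsAsFactors = FALSE
)

# Step 1: keep participants with at least 50% correct filler answers
keep_ids <- toy %>%
  filter(trial_type == "filler") %>%
  group_by(submission_id) %>%
  summarise(proportion_correct = mean(correct_answer == response)) %>%
  filter(proportion_correct >= 0.5) %>%
  pull(submission_id)

toy_step1 <- toy %>% filter(submission_id %in% keep_ids)

# Step 2: keep main trials whose matching background check was answered correctly
checks <- toy_step1 %>%
  filter(trial_type == "special") %>%
  mutate(background_correct = correct_answer == response) %>%
  select(submission_id, vignette, background_correct)

toy_cleaned <- toy_step1 %>%
  filter(trial_type == "main") %>%
  inner_join(checks, by = c("submission_id", "vignette")) %>%
  filter(background_correct)

toy_cleaned
```

In this toy data, participant 2 answers only one of four fillers correctly and is dropped in the first step. Participant 1 answers the background check for vignette "2" incorrectly, so only the main trial for vignette "1" survives the second step.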
4.5.1.1 Cleaning by-participant
# look at error rates for filler sentences by subject
# mark every subject as an outlier when they
# have a proportion of correct responses of less than 0.5
subject_error_rate <- data_KoF_processed %>%
  filter(trial_type == "filler") %>%
  group_by(submission_id) %>%
  summarise(
    proportion_correct = mean(correct_answer == response),
    outlier_subject = proportion_correct < 0.5
  ) %>%
  arrange(proportion_correct)
Apply the cleaning step:
# add info about error rates and exclude outlier subject(s)
d_cleaned <-
  full_join(data_KoF_processed, subject_error_rate, by = "submission_id") %>%
  filter(outlier_subject == FALSE)
4.5.1.2 Cleaning by-trial
# exclude every critical trial whose 'background' test question was answered wrongly
d_cleaned <- d_cleaned %>%
  # select only the 'background question' trials
  filter(trial_type == "special") %>%
  # is the background question answered correctly?
  mutate(
    background_correct = correct_answer == response
  ) %>%
  # select only the relevant columns
  select(submission_id, vignette, background_correct) %>%
  # right join lines to original data set
  right_join(d_cleaned, by = c("submission_id", "vignette")) %>%
  # keep only main trials with a correct background check
  # (this also drops all filler and special trials)
  filter(trial_type == "main" & background_correct == TRUE)
For later reuse, both the preprocessed and the cleaned data sets are included in the aida package as well. They are available as aida::data_KoF_preprocessed and aida::data_KoF_cleaned, respectively.