4.5 Case study: the King of France
Let’s go through one case study of data preprocessing and cleaning. We look at the example introduced and fully worked out in Appendix D.3. (Please read Section D.3.1 to find out more about where this data set comes from.)
The raw data set is part of the aida package and can be loaded using:
data_KoF_raw <- aida::data_KoF_raw
We then take a glimpse at the data:
glimpse(data_KoF_raw)
## Rows: 2,813
## Columns: 16
## $ submission_id <dbl> 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, …
## $ RT <dbl> 8110, 35557, 3647, 16037, 11816, 6024, 4986, 13019, 538…
## $ age <dbl> 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57, 57,…
## $ comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ item_version <chr> "none", "none", "none", "none", "none", "none", "none",…
## $ correct_answer <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
## $ education <chr> "Graduated College", "Graduated College", "Graduated Co…
## $ gender <chr> "female", "female", "female", "female", "female", "fema…
## $ languages <chr> "English", "English", "English", "English", "English", …
## $ question <chr> "World War II was a global war that lasted from 1914 to…
## $ response <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
## $ timeSpent <dbl> 39.48995, 39.48995, 39.48995, 39.48995, 39.48995, 39.48…
## $ trial_name <chr> "practice_trials", "practice_trials", "practice_trials"…
## $ trial_number <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ trial_type <chr> "practice", "practice", "practice", "practice", "practi…
## $ vignette <chr> "undefined", "undefined", "undefined", "undefined", "un…
The variables in this data set are:
- submission_id: unique identifier for each participant
- RT: the reaction time for each decision
- age: the (self-reported) age of the participant
- comments: the (optional) comments each participant may have given
- item_version: the condition which the test sentence belongs to (only given for trials of type main and special)
- correct_answer: for trials of type filler and special, what the true answer should have been
- education: the (self-reported) education level with options Graduated College, Graduated High School, Higher Degree
- gender: (self-reported) gender
- languages: (self-reported) native languages
- question: the sentence to be judged true or false
- response: the answer (“TRUE” or “FALSE”) on each trial
- trial_name: whether the trial is a main or a practice trial (levels main_trials and practice_trials)
- trial_number: consecutive numbering of each participant’s trials
- trial_type: whether the trial was of the category filler, main, practice or special, where the latter encodes the “background checks”
- vignette: the current item’s vignette number (applies only to trials of type main and special)
Let’s have a brief look at the comments (sometimes helpful, usually entertaining) and the self-reported native languages:
data_KoF_raw %>% pull(comments) %>% unique
## [1] NA
## [2] "I hope I was right most of the time!"
## [3] "My level of education is Some Highschool, not finished. So I couldn't input what was correct, so I'm leaving a comment here."
## [4] "It was interesting, and made re-read questions to make sure they weren't tricks. I hope I got them all correct."
## [5] "Worked well"
## [6] "A surprisingly tricky study! Thoroughly enjoyed completing it, despite several red herrings!!"
## [7] "n/a"
## [8] "Thank you for the opportunity."
## [9] "this was challenging"
## [10] "I'm not good at learning history so i might of made couple of mistakes. I hope I did well. :)"
## [11] "Interesting survey - thanks!"
## [12] "no"
## [13] "Regarding the practice question - I'm aware that Alexander Bell invented the telephone, but in reality, it was a collaborative effort by a team of people"
## [14] "Fun study!"
## [15] "Fun stuff"
data_KoF_raw %>% pull(languages) %>% unique
## [1] "English" "english" "English, Italian"
## [4] "English/ ASL" "English and Polish" "Chinese"
## [7] "English, Mandarin" "Polish" "Turkish"
## [10] NA "English, Sarcasm" "English, Portuguese"
In some studies, we might wish to exclude participants who do not list “English” among their native languages. Here, we do not, since we also have strong, more specific filters on comprehension (see below). Since we are not going to use this information, or the other demographic and timing columns, later on, we might as well discard it now:
data_KoF_raw <- data_KoF_raw %>%
  select(-languages, -comments, -age, -RT, -education, -gender)
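For studies where a language-based exclusion is wanted, the filter would have to run before the languages column is dropped. The following is only a sketch on made-up data: the tibble toy_languages and its values are hypothetical, and matching on the substring "english" is an assumption about how free-text answers should be interpreted.

```r
library(dplyr)
library(stringr)
library(tibble)

# toy data (hypothetical), mimicking the free-text 'languages' column
toy_languages <- tibble(
  submission_id = 1:5,
  languages = c("English", "english", "English/ ASL", "Polish", NA)
)

# keep participants whose answer mentions "english", ignoring case;
# coalesce() makes explicit that a missing answer counts as "no match"
english_speakers <- toy_languages %>%
  filter(coalesce(str_detect(str_to_lower(languages), "english"), FALSE))

english_speakers$submission_id
```

Lower-casing before matching handles spelling variants such as "english", which do occur in the raw data.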
But even after pruning irrelevant columns, this data set is still not ideal. We need to preprocess it more thoroughly to make it more intuitively manageable. For example, the information in column trial_name does not give the trial’s name in an intuitive sense, but its type: whether it is a practice or a main trial. But this information, and more, is also represented in the column trial_type. The column item_version contains information about the experimental condition. To see this (mess), the code below prints the selected information from the non-practice trials of a single participant in an order that makes it easier to see what is what.
data_KoF_raw %>%
  # ignore practice trials for the moment
  # focus on one participant only
  filter(trial_type != "practice", submission_id == 192) %>%
  select(trial_type, item_version, question) %>%
  arrange(desc(trial_type), item_version) %>%
  print(n = Inf)
## # A tibble: 24 × 3
## trial_type item_version question
## <chr> <chr> <chr>
## 1 special none The Pope is currently not married.
## 2 special none Germany has volcanoes.
## 3 special none France has a king.
## 4 special none Canada is a democracy.
## 5 special none Belgium has rainforests.
## 6 main 0 The volcanoes of Germany dominate the landscape.
## 7 main 1 Canada has an emperor, and he is fond of sushi.
## 8 main 10 Donald Trump, his favorite nature spot is not the Be…
## 9 main 6 The King of France isn’t bald.
## 10 main 9 The Pope’s wife, she did not invite Angela Merkel fo…
## 11 filler none The Solar System includes the planet Earth.
## 12 filler none Vatican City is the world's largest country by land …
## 13 filler none Big Ben is a very large building in the middle of Pa…
## 14 filler none Harry Potter is a series of fantasy novels written b…
## 15 filler none Taj Mahal is a mausoleum on the bank of the river in…
## 16 filler none James Bond is a spanish dancer from Madrid.
## 17 filler none The Pacific Ocean is a large ocean between Japan and…
## 18 filler none Australia has a very large border with Brazil.
## 19 filler none Steve Jobs was an American inventor and co-founder o…
## 20 filler none Planet Earth is part of the galaxy ‘Milky Way’.
## 21 filler none Germany shares borders with France, Belgium and Denm…
## 22 filler none Antarctica is a continent covered almost completely …
## 23 filler none The Statue of Liberty is a colossal sculpture on Lib…
## 24 filler none English is the main language in Australia, Britain a…
We see that the information in item_version specifies the critical condition. To make this more intuitively manageable, we would like to have a column called condition, and it should, ideally, also contain useful information for the cases where trial_type is not main or special. We will therefore remove the column trial_name completely and create an informative column condition, which tells us for every row whether it belongs to one of the five experimental conditions and, if not, whether it is a filler or a “background check” (= special) trial.
data_KoF_processed <- data_KoF_raw %>%
  # drop redundant information in column `trial_name`
  select(-trial_name) %>%
  # discard practice trials
  filter(trial_type != "practice") %>%
  mutate(
    # add a 'condition' variable
    condition = case_when(
      trial_type == "special" ~ "background check",
      trial_type == "main" ~ str_c("Condition ", item_version),
      TRUE ~ "filler"
    ) %>%
      # make the new 'condition' variable a factor
      factor(
        ordered = TRUE,
        levels = c(
          str_c("Condition ", c(0, 1, 6, 9, 10)),
          "background check", "filler"
        )
      )
  )
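To see what the recoding does in isolation, here is a minimal sketch that applies the same case_when() and factor() logic to a tiny made-up table. The tibble toy_trials and its four rows are hypothetical and only stand in for the real data.

```r
library(dplyr)
library(stringr)
library(tibble)

# toy data (hypothetical): one row per trial
toy_trials <- tibble(
  trial_type   = c("main", "special", "filler", "main"),
  item_version = c("6", "none", "none", "10")
)

toy_processed <- toy_trials %>%
  mutate(
    condition = case_when(
      trial_type == "special" ~ "background check",
      trial_type == "main" ~ str_c("Condition ", item_version),
      TRUE ~ "filler"
    ) %>%
      factor(
        ordered = TRUE,
        levels = c(
          str_c("Condition ", c(0, 1, 6, 9, 10)),
          "background check", "filler"
        )
      )
  )

# tabulate trials per condition; unused levels show up with a count of 0
table(toy_processed$condition)
```

Spelling out the levels explicitly keeps "Condition 10" after "Condition 9" instead of the alphabetical default, under which it would sort before "Condition 6".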
4.5.1 Cleaning the data
We clean the data in two consecutive steps:
- Remove all data from any participant who got more than 50% of the answers to the filler material wrong.
- Remove individual main trials if the corresponding “background check” question was answered wrongly.
4.5.1.1 Cleaning by-participant
# look at error rates for filler sentences by subject
# mark every subject as an outlier when they
# have a proportion of correct responses of less than 0.5
subject_error_rate <- data_KoF_processed %>%
  filter(trial_type == "filler") %>%
  group_by(submission_id) %>%
  summarise(
    proportion_correct = mean(correct_answer == response),
    outlier_subject = proportion_correct < 0.5
  ) %>%
  arrange(proportion_correct)
Apply the cleaning step:
# add info about error rates and exclude outlier subject(s)
d_cleaned <-
  full_join(data_KoF_processed, subject_error_rate, by = "submission_id") %>%
  filter(outlier_subject == FALSE)
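The by-participant step can be traced on a small example. The sketch below uses a made-up table, toy_fillers, with two hypothetical participants and four filler trials each; it mirrors the two code chunks above but is not run on the actual data.

```r
library(dplyr)
library(tibble)

# toy data (hypothetical): participant 1 gets 3 of 4 fillers right,
# participant 2 only 1 of 4
toy_fillers <- tibble(
  submission_id  = c(1, 1, 1, 1, 2, 2, 2, 2),
  trial_type     = "filler",
  correct_answer = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  response       = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# proportion of correct filler responses per participant
toy_error_rate <- toy_fillers %>%
  group_by(submission_id) %>%
  summarise(
    proportion_correct = mean(correct_answer == response),
    outlier_subject = proportion_correct < 0.5
  ) %>%
  arrange(proportion_correct)

# attach the per-participant summary to every trial, then drop outliers
toy_cleaned <- toy_fillers %>%
  full_join(toy_error_rate, by = "submission_id") %>%
  filter(outlier_subject == FALSE)

toy_error_rate
```

Because the comparison is strict (< 0.5), a participant with exactly half of the fillers correct would be retained, in line with the rule of excluding only those with more than 50% wrong.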
4.5.1.2 Cleaning by-trial
# exclude every critical trial whose 'background' test question was answered wrongly
d_cleaned <- d_cleaned %>%
  # select only the 'background question' trials
  filter(trial_type == "special") %>%
  # is the background question answered correctly?
  mutate(
    background_correct = correct_answer == response
  ) %>%
  # select only the relevant columns
  select(submission_id, vignette, background_correct) %>%
  # right join lines to original data set
  right_join(d_cleaned, by = c("submission_id", "vignette")) %>%
  # remove all special trials, as well as main trials with incorrect background check
  filter(trial_type == "main" & background_correct == TRUE)
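The join in this step is easier to follow on a few rows. The sketch below uses a made-up table toy_d for one hypothetical participant with two vignettes; it assumes that correct_answer is missing for main trials and that there is exactly one background check per participant and vignette.

```r
library(dplyr)
library(tibble)

# toy data (hypothetical): background check passed for vignette 3,
# failed for vignette 5
toy_d <- tribble(
  ~submission_id, ~trial_type, ~vignette,   ~correct_answer, ~response,
  1,              "special",   "3",         FALSE,           FALSE,
  1,              "main",      "3",         NA,              TRUE,
  1,              "special",   "5",         TRUE,            FALSE,
  1,              "main",      "5",         NA,              FALSE,
  1,              "filler",    "undefined", TRUE,            TRUE
)

# one row per background check: was it answered correctly?
toy_background <- toy_d %>%
  filter(trial_type == "special") %>%
  mutate(background_correct = correct_answer == response) %>%
  select(submission_id, vignette, background_correct)

# attach the check result to every trial of the same vignette,
# then keep only main trials whose check was passed
toy_main_cleaned <- toy_background %>%
  right_join(toy_d, by = c("submission_id", "vignette")) %>%
  filter(trial_type == "main" & background_correct == TRUE)

toy_main_cleaned
```

Rows without a matching background check (here the filler trial) receive NA in background_correct after the right join and are removed by the final filter.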
For later reuse, both the preprocessed and the cleaned data set are included in the aida package as well. They are loaded by calling aida::data_KoF_preprocessed and aida::data_KoF_cleaned, respectively.