The tidyverse strongly encourages the use of tidy data, or at least almost tidy data. If your data is (almost) tidy, you can be reasonably sure that you can plot and analyze the data without additional wrangling. If your data is not (almost) tidy because it is too wide or too long (see below), what is required is a joyful round of pivoting. There are two directions of pivoting: making data longer, and making data wider.
4.3.1.1 Making too wide data longer with pivot_longer
Consider the previous example of messy data again:
## # A tibble: 2 × 4
## exam Rozz Andrew Siouxsie
## <chr> <chr> <chr> <chr>
## 1 midterm 1.3 2.0 1.7
## 2 final 2.3 1.7 1.0
This data is “too wide”. We can make it longer with the function pivot_longer from the tidyr package. Check out the example below before we plunge into a description of pivot_longer.
exam_results_visual %>%pivot_longer(# pivot every column except the first # (a negative number here means "exclude column with that index number")cols =-1,# name of new column which contains the# names of the columns to be "gathered"names_to ="student",# name of new column which contains the values# of the cells which now form a new columnvalues_to ="grade" ) %>%# optional reordering of columns (to make # the output exactly like `exam_results_tidy`)select(student, exam, grade)
## # A tibble: 6 × 3
## student exam grade
## <chr> <chr> <chr>
## 1 Rozz midterm 1.3
## 2 Andrew midterm 2.0
## 3 Siouxsie midterm 1.7
## 4 Rozz final 2.3
## 5 Andrew final 1.7
## 6 Siouxsie final 1.0
What pivot_longer does, in general, is take a bunch of columns and gather the values of all cells in these columns into a single, new column, the so-called value column, i.e., the column with the values of the cells to be gathered. If pivot_longer stopped here, we would lose information about which cell values belonged to which original column. Therefore, pivot_longer also creates a second new column, the so-called name column, i.e., the column with the names of the original columns that we gathered together. Consequently, in order to do its job, pivot_longer minimally needs three pieces of information:18
which columns to spin around (function argument cols)
the name of the to-be-created new value column (function argument values_to)
the name of the to-be-created new name column (function argument names_to)
For different ways of selecting columns to pivot around, see Section 4.3.3 below.
4.3.1.2 Making too long data wider with pivot_wider
Consider the following example of data which is untidy because it is too long:
mixed_results_too_long <-tibble(student =rep(c('Rozz', 'Andrew', 'Siouxsie'), times =2),what =rep(c('grade', 'participation'), each =3),howmuch =c(2.7, 2.0, 1.0, 75, 93, 33))mixed_results_too_long
This data is untidy because it lumps two types of different measurements (a course grade, and the percentage of participation) in a single column. These are different variables, and so should be represented in different columns.
To fix a data representation that is too long, we can make it wider with the help of the pivot_wider function from the tidyr package. We look at an example before looking at the general behavior of the pivot_wider function.
mixed_results_too_long %>%pivot_wider(# column containing the names of the new columnsnames_from = what,# column containing the values of the new columnsvalues_from = howmuch )
In general, pivot_wider picks out two columns, one column of values to distribute into new to-be-created columns, and one vector of names or groups which contains the information about the, well, names of the to-be-created new columns. There are more refined options for pivot_wider, some of which we will encounter in the context of concrete cases of application.
4.3.2 Subsetting rows & columns
If a data set contains too much information for your current purposes, you can discard irrelevant (or unhelpful) rows and columns. The function filter takes a Boolean expression and returns only those rows of which the Boolean expression is true:
exam_results_tidy %>%# keep only entries with grades better than # or equal to 1.7filter(grade <=1.7)
## # A tibble: 4 × 3
## student exam grade
## <chr> <chr> <dbl>
## 1 Rozz midterm 1.3
## 2 Siouxsie midterm 1.7
## 3 Andrew final 1.7
## 4 Siouxsie final 1
To select rows by an index or a vector of indeces, use the slice function:
exam_results_tidy %>%# keep only entries from rows with an even indexslice(c(2, 4, 6))
## # A tibble: 3 × 3
## student exam grade
## <chr> <chr> <dbl>
## 1 Andrew midterm 2
## 2 Rozz final 2.3
## 3 Siouxsie final 1
The function select allows to pick out a subset of columns. Interestingly, it can also be used to reorder columns, because the order in which column names are specified matches the order in the returned tibble.
exam_results_tidy %>%# select columns `grade` and `exam`select(grade, exam)
## # A tibble: 6 × 2
## grade exam
## <dbl> <chr>
## 1 1.3 midterm
## 2 2 midterm
## 3 1.7 midterm
## 4 2.3 final
## 5 1.7 final
## 6 1 final
4.3.3 Tidy selection of column names
To select the columns in several functions within the tidyverse, such as pivot_longer or select, there are useful helper functions from the tidyselect package. Here are some examples:19
# bogus code for illustration of possibilities!SOME_DATA %>%select( ... # could be one of the following# all columns indexed 2, 3, ..., 102:10# all columns except the one called "COLNAME"- COLNAME# all columns with names starting with "STRING"starts_with("STRING")# all columns with names ending with "STRING"ends_with("STRING")# all columns with names containing "STRING"contains("STRING")# all columns with names of the form "Col_i" with i = 1, ..., 10num_range("Col_", 1:10) )
4.3.4 Adding, changing and renaming columns
To add a new column, or to change an existing one use the function mutate, like so:
exam_results_tidy %>%mutate(# add a new column called 'passed' depending on grade# [NB: severe passing conditions in this class!!]passed = grade <=1.7, # change an existing column; here: change# character column 'exam' to ordered factorexam =factor(exam, ordered = T) )
## # A tibble: 6 × 4
## student exam grade passed
## <chr> <ord> <dbl> <lgl>
## 1 Rozz midterm 1.3 TRUE
## 2 Andrew midterm 2 FALSE
## 3 Siouxsie midterm 1.7 TRUE
## 4 Rozz final 2.3 FALSE
## 5 Andrew final 1.7 TRUE
## 6 Siouxsie final 1 TRUE
If you want to rename a column, function rename is what you want:
exam_results_tidy %>%# rename existing column "student" to new name "participant"# [NB: rename takes the new name first]rename(participant = student)
## # A tibble: 6 × 3
## participant exam grade
## <chr> <chr> <dbl>
## 1 Rozz midterm 1.3
## 2 Andrew midterm 2
## 3 Siouxsie midterm 1.7
## 4 Rozz final 2.3
## 5 Andrew final 1.7
## 6 Siouxsie final 1
This is not a useful representation format. Results of three homework sets are mushed together in a single column. Each value is separated by a comma, but it is all stored as a character vector.
To disentangle information in a single column, use the separate function:
homework_results_untidy %>%separate(# which column to split upcol = results,# names of the new column to store resultsinto =str_c("HW_", 1:3),# separate by which character / reg-expsep =",",# automatically (smart-)convert the type of the new colsconvert = T )
If you have a reason to perform the reverse operation, i.e., join together several columns, use the unite function.
4.3.6 Sorting a data set
If you want to indicate a fixed order of the reoccurring elements in a (character) vector, e.g., for plotting in a particular order, you should make this column an ordered factor. But if you want to order a data set along a column, e.g., for inspection or printing as a table, then you can do that by using the arrange function. You can specify several columns to sort alpha-numerically in ascending order, and also indicate a descending order using the desc function:
## # A tibble: 6 × 3
## student exam grade
## <chr> <chr> <dbl>
## 1 Siouxsie final 1
## 2 Siouxsie midterm 1.7
## 3 Rozz midterm 1.3
## 4 Rozz final 2.3
## 5 Andrew final 1.7
## 6 Andrew midterm 2
4.3.7 Combining tibbles
There are frequent occasions on which data from two separate variables need to be combined. The simplest case is where two entirely disjoint data sets merely need to be glued together, either horizontally (binding columns together with function cbind) or vertically (binding rows together with function rbind).
## # A tibble: 9 × 3
## student exam grade
## <chr> <chr> <dbl>
## 1 Rozz midterm 1.3
## 2 Andrew midterm 2
## 3 Siouxsie midterm 1.7
## 4 Rozz final 2.3
## 5 Andrew final 1.7
## 6 Siouxsie final 1
## 7 Rozz bonus 1.7
## 8 Andrew bonus 2.3
## 9 Siouxsie bonus 1
If two data sets have information in common, and the combination should respect that commonality, the join family of functions is of great help. Consider the case of distributed information again that we looked at to understand the third constraint of the concept of “tidy data”. There are two tibbles, both of which contain information about the same students. They share the column student (this does not necessarily have to be in the same order!) and we might want to join the information from both sources into a single (messy but almost tidy) representation, using full_join. We have seen an example already, which is repeated here:
# same as beforeexam_results_tidy <-tribble(~student, ~exam, ~grade,"Rozz", "midterm", 1.3,"Andrew", "midterm", 2.0,"Siouxsie", "midterm", 1.7,"Rozz", "final", 2.3,"Andrew", "final", 1.7,"Siouxsie", "final", 1.0)# additional table with student numbersstudent_numbers <-tribble(~student, ~student_number,"Rozz", "666", "Andrew", "1969","Siouxsie", "3.14")full_join(exam_results_tidy, student_numbers, by ="student")
## # A tibble: 6 × 4
## student exam grade student_number
## <chr> <chr> <dbl> <chr>
## 1 Rozz midterm 1.3 666
## 2 Andrew midterm 2 1969
## 3 Siouxsie midterm 1.7 3.14
## 4 Rozz final 2.3 666
## 5 Andrew final 1.7 1969
## 6 Siouxsie final 1 3.14
If two data sets are to be joined by a column that is not exactly shared by both sets (one contains entries in this column that the other doesn’t) then a full_join will retain all information from both. If that is not what you want, check out alternative functions like right_join, semi_join etc. using the data wrangling cheat sheet.
Exercise 4.2: Data Wrangling in R
We are working with the same example as in the earlier exercise:
Selecting two columns (subject_id and choices) out of the data set.
In the data set, each cell in the choices column contains more than one value. To separate them, we take this column and divide the strings by the “,”. The names are then given for each line from one to three.
Now we are making the data set longer, so that each condition is its own row. We are pivoting each column apart from the first. The names of the columns are combined in a column called condition and the values are put into a column called response.
The result:
choice_data
## # A tibble: 12 × 3
## subject_id condition response
## <dbl> <chr> <chr>
## 1 1 C_1 A
## 2 1 C_2 B
## 3 1 C_3 B
## 4 2 C_1 B
## 5 2 C_2 A
## 6 2 C_3 B
## 7 3 C_1 B
## 8 3 C_2 A
## 9 3 C_3 A
## 10 4 C_1 A
## 11 4 C_2 B
## 12 4 C_3 B
There are alternative possibilities for specifying names of the value and name column, which allow for more dynamic construction of strings. We will not cover all of these details here, but we will use some of these alternative specifications in subsequent examples.↩︎
The helpers from the tidyselect package also accept regular expressions.↩︎