4.3 Data manipulation: the basics

4.3.1 Pivoting

The tidyverse strongly encourages the use of tidy data, or at least almost tidy data. If your data is (almost) tidy, you can be reasonably sure that you can plot and analyze the data without additional wrangling. If your data is not (almost) tidy because it is too wide or too long (see below), what is required is a joyful round of pivoting. There are two directions of pivoting: making data longer, and making data wider.

4.3.1.1 Making too wide data longer with pivot_longer

Consider the previous example of messy data again:

exam_results_visual <- tribble(
  ~exam,       ~"Rozz",   ~"Andrew",   ~"Siouxsie",
  "midterm",   "1.3",     "2.0",       "1.7",
  "final"  ,   "2.3",     "1.7",       "1.0"
)
exam_results_visual
## # A tibble: 2 × 4
##   exam    Rozz  Andrew Siouxsie
##   <chr>   <chr> <chr>  <chr>   
## 1 midterm 1.3   2.0    1.7     
## 2 final   2.3   1.7    1.0

This data is “too wide”. We can make it longer with the function pivot_longer from the tidyr package. Check out the example below before we plunge into a description of pivot_longer.

exam_results_visual %>% 
  pivot_longer(
    # pivot every column except the first 
    # (a negative number here means "exclude column with that index number")
    cols = - 1,
    # name of new column which contains the
    # names of the columns to be "gathered"
    names_to = "student",
    # name of new column which contains the values
    # of the cells which now form a new column
    values_to = "grade"
  ) %>% 
  # optional reordering of columns (to make 
  # the output exactly like `exam_results_tidy`)
  select(student, exam, grade)
## # A tibble: 6 × 3
##   student  exam    grade
##   <chr>    <chr>   <chr>
## 1 Rozz     midterm 1.3  
## 2 Andrew   midterm 2.0  
## 3 Siouxsie midterm 1.7  
## 4 Rozz     final   2.3  
## 5 Andrew   final   1.7  
## 6 Siouxsie final   1.0

What pivot_longer does, in general, is take a bunch of columns and gather the values of all cells in these columns into a single, new column, the so-called value column, i.e., the column with the values of the cells to be gathered. If pivot_longer stopped here, we would lose information about which cell values belonged to which original column. Therefore, pivot_longer also creates a second new column, the so-called name column, i.e., the column with the names of the original columns that we gathered together. Consequently, in order to do its job, pivot_longer minimally needs three pieces of information:18

  1. which columns to spin around (function argument cols)
  2. the name of the to-be-created new value column (function argument values_to)
  3. the name of the to-be-created new name column (function argument names_to)

For different ways of selecting columns to pivot around, see Section 4.3.3 below.

4.3.1.2 Making too long data wider with pivot_wider

Consider the following example of data which is untidy because it is too long:

mixed_results_too_long <- 
  tibble(student = rep(c('Rozz', 'Andrew', 'Siouxsie'), times = 2),
         what    = rep(c('grade', 'participation'), each = 3),
         howmuch = c(2.7, 2.0, 1.0, 75, 93, 33))
mixed_results_too_long
## # A tibble: 6 × 3
##   student  what          howmuch
##   <chr>    <chr>           <dbl>
## 1 Rozz     grade             2.7
## 2 Andrew   grade             2  
## 3 Siouxsie grade             1  
## 4 Rozz     participation    75  
## 5 Andrew   participation    93  
## 6 Siouxsie participation    33

This data is untidy because it lumps two types of different measurements (a course grade, and the percentage of participation) in a single column. These are different variables, and so should be represented in different columns.

To fix a data representation that is too long, we can make it wider with the help of the pivot_wider function from the tidyr package. We look at an example before looking at the general behavior of the pivot_wider function.

mixed_results_too_long %>% 
  pivot_wider(
    # column containing the names of the new columns
    names_from = what,
    # column containing the values of the new columns
    values_from = howmuch
  )
## # A tibble: 3 × 3
##   student  grade participation
##   <chr>    <dbl>         <dbl>
## 1 Rozz       2.7            75
## 2 Andrew     2              93
## 3 Siouxsie   1              33

In general, pivot_wider picks out two columns, one column of values to distribute into new to-be-created columns, and one vector of names or groups which contains the information about the, well, names of the to-be-created new columns. There are more refined options for pivot_wider, some of which we will encounter in the context of concrete cases of application.

4.3.2 Subsetting rows & columns

If a data set contains too much information for your current purposes, you can discard irrelevant (or unhelpful) rows and columns. The function filter takes a Boolean expression and returns only those rows of which the Boolean expression is true:

exam_results_tidy %>% 
  # keep only entries with grades better than 
  # or equal to 1.7
  filter(grade <= 1.7)
## # A tibble: 4 × 3
##   student  exam    grade
##   <chr>    <chr>   <dbl>
## 1 Rozz     midterm   1.3
## 2 Siouxsie midterm   1.7
## 3 Andrew   final     1.7
## 4 Siouxsie final     1

To select rows by an index or a vector of indeces, use the slice function:

exam_results_tidy %>% 
  # keep only entries from rows with an even index
  slice(c(2, 4, 6)) 
## # A tibble: 3 × 3
##   student  exam    grade
##   <chr>    <chr>   <dbl>
## 1 Andrew   midterm   2  
## 2 Rozz     final     2.3
## 3 Siouxsie final     1

The function select allows to pick out a subset of columns. Interestingly, it can also be used to reorder columns, because the order in which column names are specified matches the order in the returned tibble.

exam_results_tidy %>% 
  # select columns `grade` and `exam`
  select(grade, exam)
## # A tibble: 6 × 2
##   grade exam   
##   <dbl> <chr>  
## 1   1.3 midterm
## 2   2   midterm
## 3   1.7 midterm
## 4   2.3 final  
## 5   1.7 final  
## 6   1   final

4.3.3 Tidy selection of column names

To select the columns in several functions within the tidyverse, such as pivot_longer or select, there are useful helper functions from the tidyselect package. Here are some examples:19

# bogus code for illustration of possibilities!
SOME_DATA %>% 
  select( ... # could be one of the following
        # all columns indexed 2, 3, ..., 10
        2:10
        # all columns except the one called "COLNAME"
        - COLNAME
        # all columns with names starting with "STRING"
        starts_with("STRING")
        # all columns with names ending with "STRING"
        ends_with("STRING")
        # all columns with names containing "STRING"
        contains("STRING")
        # all columns with names of the form "Col_i" with i = 1, ..., 10
        num_range("Col_", 1:10)
  )

4.3.4 Adding, changing and renaming columns

To add a new column, or to change an existing one use the function mutate, like so:

exam_results_tidy %>% 
  mutate(
    # add a new column called 'passed' depending on grade
    # [NB: severe passing conditions in this class!!]
    passed = grade <= 1.7, 
    # change an existing column; here: change
    # character column 'exam' to ordered factor
    exam = factor(exam, ordered = T)
  )
## # A tibble: 6 × 4
##   student  exam    grade passed
##   <chr>    <ord>   <dbl> <lgl> 
## 1 Rozz     midterm   1.3 TRUE  
## 2 Andrew   midterm   2   FALSE 
## 3 Siouxsie midterm   1.7 TRUE  
## 4 Rozz     final     2.3 FALSE 
## 5 Andrew   final     1.7 TRUE  
## 6 Siouxsie final     1   TRUE

If you want to rename a column, function rename is what you want:

exam_results_tidy %>% 
  # rename existing column "student" to new name "participant"
  # [NB: rename takes the new name first]
  rename(participant = student)
## # A tibble: 6 × 3
##   participant exam    grade
##   <chr>       <chr>   <dbl>
## 1 Rozz        midterm   1.3
## 2 Andrew      midterm   2  
## 3 Siouxsie    midterm   1.7
## 4 Rozz        final     2.3
## 5 Andrew      final     1.7
## 6 Siouxsie    final     1

4.3.5 Splitting and uniting columns

Here is data from course homework:

homework_results_untidy <- 
  tribble(
    ~student,      ~results,
    "Rozz",        "1.0,2.3,3.0",
    "Andrew",      "2.3,2.7,1.3",
    "Siouxsie",    "1.7,4.0,1.0"
  )

This is not a useful representation format. Results of three homework sets are mushed together in a single column. Each value is separated by a comma, but it is all stored as a character vector.

To disentangle information in a single column, use the separate function:

homework_results_untidy %>% 
  separate(
    # which column to split up
    col = results,
    # names of the new column to store results
    into = str_c("HW_", 1:3),
    # separate by which character / reg-exp
    sep = ",",
    # automatically (smart-)convert the type of the new cols
    convert = T 
    )
## # A tibble: 3 × 4
##   student   HW_1  HW_2  HW_3
##   <chr>    <dbl> <dbl> <dbl>
## 1 Rozz       1     2.3   3  
## 2 Andrew     2.3   2.7   1.3
## 3 Siouxsie   1.7   4     1

If you have a reason to perform the reverse operation, i.e., join together several columns, use the unite function.

4.3.6 Sorting a data set

If you want to indicate a fixed order of the reoccurring elements in a (character) vector, e.g., for plotting in a particular order, you should make this column an ordered factor. But if you want to order a data set along a column, e.g., for inspection or printing as a table, then you can do that by using the arrange function. You can specify several columns to sort alpha-numerically in ascending order, and also indicate a descending order using the desc function:

exam_results_tidy %>% 
  arrange(desc(student), grade)
## # A tibble: 6 × 3
##   student  exam    grade
##   <chr>    <chr>   <dbl>
## 1 Siouxsie final     1  
## 2 Siouxsie midterm   1.7
## 3 Rozz     midterm   1.3
## 4 Rozz     final     2.3
## 5 Andrew   final     1.7
## 6 Andrew   midterm   2

4.3.7 Combining tibbles

There are frequent occasions on which data from two separate variables need to be combined. The simplest case is where two entirely disjoint data sets merely need to be glued together, either horizontally (binding columns together with function cbind) or vertically (binding rows together with function rbind).

new_exam_results_tidy <- tribble(
  ~student,    ~exam,      ~grade,
  "Rozz",      "bonus",  1.7,
  "Andrew",    "bonus",  2.3,
  "Siouxsie",  "bonus",  1.0
)
rbind(
  exam_results_tidy, 
  new_exam_results_tidy
)
## # A tibble: 9 × 3
##   student  exam    grade
##   <chr>    <chr>   <dbl>
## 1 Rozz     midterm   1.3
## 2 Andrew   midterm   2  
## 3 Siouxsie midterm   1.7
## 4 Rozz     final     2.3
## 5 Andrew   final     1.7
## 6 Siouxsie final     1  
## 7 Rozz     bonus     1.7
## 8 Andrew   bonus     2.3
## 9 Siouxsie bonus     1

If two data sets have information in common, and the combination should respect that commonality, the join family of functions is of great help. Consider the case of distributed information again that we looked at to understand the third constraint of the concept of “tidy data”. There are two tibbles, both of which contain information about the same students. They share the column student (this does not necessarily have to be in the same order!) and we might want to join the information from both sources into a single (messy but almost tidy) representation, using full_join. We have seen an example already, which is repeated here:

# same as before
exam_results_tidy <- tribble(
  ~student,    ~exam,      ~grade,
  "Rozz",      "midterm",  1.3,
  "Andrew",    "midterm",  2.0,
  "Siouxsie",  "midterm",  1.7,
  "Rozz",      "final",    2.3,
  "Andrew",    "final",    1.7,
  "Siouxsie",  "final",    1.0
)
# additional table with student numbers
student_numbers <- tribble(
  ~student,    ~student_number,
  "Rozz",      "666",   
  "Andrew",    "1969",
  "Siouxsie",  "3.14"
)
full_join(exam_results_tidy, student_numbers, by = "student")
## # A tibble: 6 × 4
##   student  exam    grade student_number
##   <chr>    <chr>   <dbl> <chr>         
## 1 Rozz     midterm   1.3 666           
## 2 Andrew   midterm   2   1969          
## 3 Siouxsie midterm   1.7 3.14          
## 4 Rozz     final     2.3 666           
## 5 Andrew   final     1.7 1969          
## 6 Siouxsie final     1   3.14

If two data sets are to be joined by a column that is not exactly shared by both sets (one contains entries in this column that the other doesn’t) then a full_join will retain all information from both. If that is not what you want, check out alternative functions like right_join, semi_join etc. using the data wrangling cheat sheet.

Exercise 4.2: Data Wrangling in R

We are working with the same example as in the earlier exercise:

data <- tribble(
~subject_id,  ~choices,  ~reaction_times, 
1,            "A,B,B",   "312 433 365", 
2,            "B,A,B",   "393 491 327",
3,            "B,A,A",   "356 313 475", 
4,            "A,B,B",   "292 352 378" 
)

Take a look at the following code snippet. Explain what the individual parts (indicated by the numbers) do. What will the result look like?

choice_data <- data %>%  
#1 
select(subject_id, choices) %>%  
#2
separate(
  col = choices,
  into = str_c("C_", 1:3),  
  sep = ",") %>%  
#3 
pivot_longer( 
  cols = -1, 
  names_to = "condition",
  values_to = "response")
  1. Selecting two columns (subject_id and choices) out of the data set.

  2. In the data set, each cell in the choices column contains more than one value. To separate them, we take this column and divide the strings by the “,”. The names are then given for each line from one to three.

  3. Now we are making the data set longer, so that each condition is its own row. We are pivoting each column apart from the first. The names of the columns are combined in a column called condition and the values are put into a column called response.

The result:

choice_data
## # A tibble: 12 × 3
##    subject_id condition response
##         <dbl> <chr>     <chr>   
##  1          1 C_1       A       
##  2          1 C_2       B       
##  3          1 C_3       B       
##  4          2 C_1       B       
##  5          2 C_2       A       
##  6          2 C_3       B       
##  7          3 C_1       B       
##  8          3 C_2       A       
##  9          3 C_3       A       
## 10          4 C_1       A       
## 11          4 C_2       B       
## 12          4 C_3       B

  1. There are alternative possibilities for specifying names of the value and name column, which allow for more dynamic construction of strings. We will not cover all of these details here, but we will use some of these alternative specifications in subsequent examples.↩︎

  2. The helpers from the tidyselect package also accept regular expressions.↩︎