4.4 Grouped operations

A frequently occurring problem in data analysis is to obtain a summary statistic (see Chapter 5) for different subsets of the data. For example, we might want to calculate the average grade for each student in our class. We could do that by filtering like so (note that pull returns the specified column as a vector):

# extracting mean grade for Rozz
mean_grade_Rozz <- exam_results_tidy %>% 
  filter(student == "Rozz") %>% 
  pull(grade) %>% 
  mean
mean_grade_Rozz

## [1] 1.8

But then we would need to do that two more times. Since we shouldn’t copy-paste code, we write a function and use map_dbl to compute the mean for each student:

# return the mean grade of the student with the given name
get_mean_for_student <- function(student_name) {
  exam_results_tidy %>% 
    filter(student == student_name) %>% 
    pull(grade) %>% 
    mean
}

map_dbl(
  exam_results_tidy %>% pull(student) %>% unique,
  get_mean_for_student
)

## [1] 1.80 1.85 1.35

This is still not quite satisfactory: it is clumsy and error-prone. Enter grouping in the tidyverse. If we want to apply a particular operation to all combinations of levels of different variables (no matter whether they are encoded as factors or not when we group), we can do this with the function group_by, followed by a call to either mutate or summarise. Consider this example:

exam_results_tidy %>% 
  group_by(student) %>% 
  summarise(
    student_mean = mean(grade)
  )

## # A tibble: 3 × 2
##   student  student_mean
##   <chr>           <dbl>
## 1 Andrew           1.85
## 2 Rozz             1.8 
## 3 Siouxsie         1.35
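
To illustrate grouping by more than one variable, here is a minimal sketch. It assumes a hypothetical tibble task_results, which is not part of the exam data above; its column names and values are invented purely for illustration. The argument .groups = "drop" asks summarise to return an ungrouped result.

```r
library(dplyr)

# Hypothetical data (an assumption, not from the text):
# two tasks per student and exam
task_results <- tibble(
  student = rep(c("Rozz", "Andrew"), each = 4),
  exam    = rep(rep(c("midterm", "final"), each = 2), times = 2),
  points  = c(8, 6, 9, 7, 5, 7, 10, 8)
)

# one row per combination of student and exam
task_means <- task_results %>% 
  group_by(student, exam) %>% 
  summarise(
    mean_points = mean(points),
    .groups = "drop"
  )
task_means
```

The result should contain four rows, one for each of the 2 × 2 combinations of student and exam that occur in the data.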

The function summarise returns a single row for each combination of levels of grouping variables. If we use the function mutate instead, the summary statistic is added (repeatedly) in each of the original rows:

exam_results_tidy %>% 
  group_by(student) %>% 
  mutate(
    student_mean = mean(grade)
  )

## # A tibble: 6 × 4
## # Groups:   student [3]
##   student  exam    grade student_mean
##   <chr>    <chr>   <dbl>        <dbl>
## 1 Rozz     midterm   1.3         1.8 
## 2 Andrew   midterm   2           1.85
## 3 Siouxsie midterm   1.7         1.35
## 4 Rozz     final     2.3         1.8 
## 5 Andrew   final     1.7         1.85
## 6 Siouxsie final     1           1.35

The latter can be handy, for instance when overlaying a plot of the data with grouped means.
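
Another possible use of a grouped mutate, sketched here under assumptions, is to compare each observation with the mean of its own group. The tibble exam_results_tidy is reconstructed from the output printed above (an assumption about how it was originally built), and the column name deviation is made up for this illustration.

```r
library(dplyr)

# Data reconstructed from the printed output above (an assumption)
exam_results_tidy <- tibble(
  student = rep(c("Rozz", "Andrew", "Siouxsie"), times = 2),
  exam    = rep(c("midterm", "final"), each = 3),
  grade   = c(1.3, 2.0, 1.7, 2.3, 1.7, 1.0)
)

# difference between each grade and the mean grade of the same student
exam_results_centered <- exam_results_tidy %>% 
  group_by(student) %>% 
  mutate(
    student_mean = mean(grade),
    deviation    = grade - student_mean
  ) %>% 
  ungroup()
exam_results_centered
```

Because the mean is computed within each group, the deviations of each student add up to zero.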

It is important to remember that after a call to group_by, the resulting tibble retains the grouping information for subsequent operations (summarise, by default, drops the last grouping variable from its result). To remove grouping information, use the function ungroup.
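
The following sketch shows what difference the retained grouping makes. It assumes that exam_results_tidy looks like the output printed above (the tibble is reconstructed here so that the code runs on its own); the names with_means, still_grouped, no_longer_grouped and overall_mean are chosen for illustration only.

```r
library(dplyr)

# Data reconstructed from the printed output above (an assumption)
exam_results_tidy <- tibble(
  student = rep(c("Rozz", "Andrew", "Siouxsie"), times = 2),
  exam    = rep(c("midterm", "final"), each = 3),
  grade   = c(1.3, 2.0, 1.7, 2.3, 1.7, 1.0)
)

with_means <- exam_results_tidy %>% 
  group_by(student) %>% 
  mutate(student_mean = mean(grade))

# still grouped: mean() is again computed per student
still_grouped <- with_means %>% 
  mutate(overall_mean = mean(grade))

# after ungroup: mean() is computed over all six rows
no_longer_grouped <- with_means %>% 
  ungroup() %>% 
  mutate(overall_mean = mean(grade))

group_vars(still_grouped)
group_vars(no_longer_grouped)
```

In still_grouped, the column overall_mean merely repeats student_mean, because the grouping is still in effect. In no_longer_grouped, it holds the mean over all grades in every row.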