4.4 Grouped operations
A common task in data analysis is to obtain a summary statistic (see Chapter 5) for different subsets of the data. For example, we might want to calculate the average grade for each student in our class. We could do that by filtering like so (notice that the function pull returns the specified column as a vector):
# extracting mean grade for Rozz
mean_grade_Rozz <- exam_results_tidy %>%
  filter(student == "Rozz") %>%
  pull(grade) %>%
  mean
mean_grade_Rozz
## [1] 1.8
But then we would need to do that two more times, once for each remaining student. Since we shouldn't copy-paste code, we write a function instead and use map_dbl to compute the mean for each student:
get_mean_for_student <- function(student_name) {
  exam_results_tidy %>%
    filter(student == student_name) %>%
    pull(grade) %>%
    mean
}
map_dbl(
  exam_results_tidy %>% pull(student) %>% unique,
  get_mean_for_student
)
## [1] 1.80 1.85 1.35
This is still not quite satisfactory: it is clumsy and error-prone. Enter grouping in the tidyverse. If we want to apply a particular operation to all combinations of levels of different variables (no matter whether they are encoded as factors or not when we group), we can do this with the function group_by, followed by a call to either mutate or summarise. Consider this example:
exam_results_tidy %>%
  group_by(student) %>%
  summarise(
    student_mean = mean(grade)
  )
## # A tibble: 3 × 2
## student student_mean
## <chr> <dbl>
## 1 Andrew 1.85
## 2 Rozz 1.8
## 3 Siouxsie 1.35
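As a minimal sketch of the same pattern, a single call to summarise can compute several summary statistics per group at once. The tibble below is a reconstruction of exam_results_tidy from the values printed in this section, and the column names n_exams, min_grade and max_grade are illustrative choices, not names used elsewhere in the text.

```r
library(dplyr)

# Reconstruction of exam_results_tidy from the values printed in this section
# (an assumption: the original object is created earlier in the chapter)
exam_results_tidy <- tibble(
  student = rep(c("Rozz", "Andrew", "Siouxsie"), times = 2),
  exam    = rep(c("midterm", "final"), each = 3),
  grade   = c(1.3, 2.0, 1.7, 2.3, 1.7, 1.0)
)

# One row per student, with several summary columns computed in one go
student_summary <- exam_results_tidy %>%
  group_by(student) %>%
  summarise(
    n_exams      = n(),         # number of rows (exams) in each group
    student_mean = mean(grade), # mean grade per student
    min_grade    = min(grade),  # lowest numerical grade per student
    max_grade    = max(grade)   # highest numerical grade per student
  )

student_summary
```

Each argument to summarise becomes one column of the result, and each is evaluated separately within every group.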
The function summarise returns a single row for each combination of levels of the grouping variables. If we use the function mutate instead, the summary statistic is added (repeatedly) to each of the original rows:
exam_results_tidy %>%
  group_by(student) %>%
  mutate(
    student_mean = mean(grade)
  )
## # A tibble: 6 × 4
## # Groups: student [3]
## student exam grade student_mean
## <chr> <chr> <dbl> <dbl>
## 1 Rozz midterm 1.3 1.8
## 2 Andrew midterm 2 1.85
## 3 Siouxsie midterm 1.7 1.35
## 4 Rozz final 2.3 1.8
## 5 Andrew final 1.7 1.85
## 6 Siouxsie final 1 1.35
The latter can be handy, for example, when overlaying a plot of the data with grouped means.
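One way such an overlay could look is sketched below. This is only an illustration under assumptions: it uses ggplot2, reconstructs exam_results_tidy from the values printed in this section, and the names with_means and grade_plot as well as the choice of point shapes are arbitrary.

```r
library(dplyr)
library(ggplot2)

# Reconstruction of exam_results_tidy from the values printed in this section
# (an assumption: the original object is created earlier in the chapter)
exam_results_tidy <- tibble(
  student = rep(c("Rozz", "Andrew", "Siouxsie"), times = 2),
  exam    = rep(c("midterm", "final"), each = 3),
  grade   = c(1.3, 2.0, 1.7, 2.3, 1.7, 1.0)
)

# Add the per-student mean to every row, then drop the grouping again
with_means <- exam_results_tidy %>%
  group_by(student) %>%
  mutate(student_mean = mean(grade)) %>%
  ungroup()

# Individual grades as colored points, per-student means as large crosses
grade_plot <- ggplot(with_means, aes(x = student)) +
  geom_point(aes(y = grade, color = exam)) +
  geom_point(aes(y = student_mean), shape = 4, size = 4) +
  labs(y = "grade")

grade_plot
```

Because mutate keeps all six rows, the raw grades and the repeated group means live in the same tibble and can be mapped to different layers of one plot.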
It is important to remember that after a call to group_by, the resulting tibble retains the grouping information for subsequent operations (see the line "# Groups: student [3]" in the output above). To remove the grouping information, use the function ungroup.
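The effect of retained grouping can be seen in a small sketch. It again reconstructs exam_results_tidy from the values printed in this section, and the names grouped, still_grouped and overall are illustrative only.

```r
library(dplyr)

# Reconstruction of exam_results_tidy from the values printed in this section
# (an assumption: the original object is created earlier in the chapter)
exam_results_tidy <- tibble(
  student = rep(c("Rozz", "Andrew", "Siouxsie"), times = 2),
  exam    = rep(c("midterm", "final"), each = 3),
  grade   = c(1.3, 2.0, 1.7, 2.3, 1.7, 1.0)
)

grouped <- exam_results_tidy %>% group_by(student)
group_vars(grouped)  # the grouping variable is stored with the tibble

# Still grouped: mean(grade) is computed separately for each student
still_grouped <- grouped %>%
  mutate(mean_grade = mean(grade))

# After ungroup(): mean(grade) is computed over all six rows
overall <- grouped %>%
  ungroup() %>%
  mutate(mean_grade = mean(grade))

group_vars(overall)  # no grouping variables left
```

The same mutate call yields per-student means in still_grouped but one overall mean in overall, which is why forgetting to ungroup can silently change later results.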