Plot 1: Code 4

Plot 2: Code 1

Plot 3: Code 2

`ggplot`

In this section, we will work towards a first plot with `ggplot`

. It will be a scatter plot (more on different kinds of plots in Section 6.4) for the avocado price data. Check out the ggplot cheat sheet for a quick overview of the nuts and bolts of ggplot.

The following paragraphs introduce the key concepts of `ggplot`

:

**incremental composition**: adding elements or changing attributes of a plot incrementally**convenience functions & defaults**: a closer look at high-level convenience functions (like`geom_point`

) and what they actually do**layers**: seeing how layers are stacked when we call, e.g. different`geom_`

functions in sequence**grouping**: what happens when we use grouping information (e.g., for color, shape or in facets)

The section finishes with a first full example of a plot that has different layers, uses grouping, and customizes a few other things.

To get started, let’s first load the (preprocessed) avocado data set used for plotting:

`<- aida::data_avocado avocado_data `

The “gg” in the package name `ggplot`

is short for “grammar of graphs”. It provides functions for describing scientific data plots in a compositional manner, i.e., for dealing with different recurrent elements in a plot in an additive way. As a result of this approach, we will use the symbol `+`

to *add* more and more elements (or to override the implicit defaults in previously evoked elements) to build a plot. For example, we can obtain a scatter plot for the avocado price data simply by first calling the function `ggplot`

, which just creates an empty plot:

`<- ggplot() incrementally_built_plot `

The plot stored in variable `incrementally_built_plot`

is very boring. Take a look:

` incrementally_built_plot`

As you can see, you do not see anything except a (white) canvas. But we can add some stuff. Don’t get hung up on the details right now, just notice that we use `+`

to add stuff to our plot:^{29}

```
+
incrementally_built_plot # add a geom of type `point` (=> scatter plot)
geom_point(
# what data to use
data = avocado_data,
# supply a mapping (in the form of an 'aesthetic' (see below))
mapping = aes(
# which variable to map onto the x-axis
x = total_volume_sold,
# which variable to map onto the y-axis
y = average_price
) )
```

You see that the function `geom_point`

is what makes the points appear. You tell it which data to use and which mapping of variables from the data set to elements in the plot you like. That’s it, at least to begin with.

We can also supply the information about the data to use and the aesthetic mapping in the `ggplot`

function call. Doing so will make this information the default for any subsequently added layer. Notice also that the `data`

argument in function `ggplot`

is the first argument, so we will frequently make use of piping, like in the following code which is equivalent to the previous in terms of output:

```
%>%
avocado_data ggplot(aes(x = total_volume_sold, y = average_price)) +
geom_point()
```

Let’s take a step back. Actually, the function `geom_point`

is a convenience function that does a lot of things automatically for us. It helps to understand subsequent code if we peek under the hood at least for a brief moment initially, if only to just realize where some of the terminology in and around the “grammar of graphs” comes from.

The `ggplot`

package defines a **layered grammar of graphs** (Wickham 2010). This is a structured description language for plots (relevant for data science). It uses a smart system of defaults so that it suffices to often just call a convenience wrapper like `geom_point`

. But underneath, there is the possibility of tinkering with (almost?) all of the (layered) elements and changing the defaults if need be.

The process of mapping data onto a visualization essentially follows this route:

data -> statistical transformation -> geom. object -> aesthetics

You supply (tidy) data. The data is then transformed (e.g., by computing a summary statistic) in some way or another. This could just be an “identity map” in which case you will visualize the data exactly as it is. The resulting data representation is mapped onto some spatial (geometric) appearance, like a line, a dot, or a geometric shape. Finally, there is room to alter the specific aesthetics of this mapping from data to visual object, like adjusting the size or the color of a geometric object, possibly depending on some other properties it has (e.g., whether it is an observation for a conventional or an organically grown avocado).

To make explicit the steps which are implicitly carried out by `geom_point`

in the example above, here is a fully verbose but output-equivalent sequence of commands that builds the same plot by defining all the basic components manually:

```
%>%
avocado_data ggplot() +
# plot consists of layers (more on this soon)
layer(
# how to map columns onto ingredients in the plot
mapping = aes(x = total_volume_sold, y = average_price),
# what statistical transformation should be used? - here: none
stat = "identity",
# how should the transformed data be visually represented? - here: as points
geom = "point",
# should we tinker in any other way with the positioning of each element?
# - here: no, thank you!
position = "identity"
+
) # x and y axes are non-transformed continuous
scale_x_continuous() +
scale_y_continuous() +
# we use a cartesian coordinate system (not a polar or a geographical map)
coord_cartesian()
```

In this explicit call, we still need to specify the data and the mapping (which variable to map onto which axis). But we need to specify much more. We tell `ggplot`

that we want standard (e.g., not log-transformed) axes. We also tell it that our axes are continuous, that the data should not be transformed and that the visual shape (= geom) to which the data is to be mapped is a point (hence the name `geom_point`

).

It is not important to understand all of these components right now. It is important to have seen them once, and to understand that `geom_point`

is a wrapper around this call which assumes reasonable defaults (such as non-transformed axes, points for representation etc.).

`ggplot`

is the “grammar of *layered* graphs”. Plots are compositionally built by combining different layers, if need be. For example, we can use another function from the `geom_`

family of functions to display a different visualization derived from the same data on top of our previous scatter plot.^{30}

```
%>%
avocado_data ggplot(
mapping = aes(
# notice that we use the log (try without it to understand why)
x = log(total_volume_sold),
y = average_price
)+
) # add a scatter plot
geom_point() +
# add a linear regression line
geom_smooth(method = "lm")
```

Notice that layering is really sequential. To see this, just check what happens when we reverse the calls of the `geom_`

functions in the previous example:

```
%>%
avocado_data ggplot(
mapping = aes(
# notice that we use the log (try without it to understand why)
x = log(total_volume_sold),
y = average_price
)+
) # FIRST: add a linear regression line
geom_smooth(method = "lm") +
# THEN: add a scatter plot
geom_point()
```

If you want lower layers to be visible behind layers added later, one possibility is to tinker with opacity, via the `alpha`

parameter. Notice that the example below also changes the colors. The result is quite toxic, but at least you see the line underneath the semi-transparent points.

```
%>%
avocado_data ggplot(
mapping = aes(
# notice that we use the log (try without it to understand why)
x = log(total_volume_sold),
y = average_price
)+
) # FIRST: add a linear regression line
geom_smooth(method = "lm", color = "darkgreen") +
# THEN: add a scatter plot
geom_point(alpha = 0.1, color = "orange")
```

The aesthetics defined in the initial call to `ggplot`

are global defaults for all layers to follow, unless they are overwritten. This also holds for the data supplied to `ggplot`

. For example, we can create a second layer using another call to `geom_point`

from a second data set (e.g., with a summary statistic), like so:

```
# create a small tibble with the means of both
# variables of interest
<-
avocado_data_means %>%
avocado_data summarize(
mean_volume = mean(log(total_volume_sold)),
mean_price = mean(average_price)
) avocado_data_means
```

```
## # A tibble: 1 × 2
## mean_volume mean_price
## <dbl> <dbl>
## 1 11.3 1.41
```

```
%>%
avocado_data ggplot(
aes(x = log(total_volume_sold),
y = average_price)
+
) # first layer uses globally declared data & mapping
geom_point() +
# second layer uses different data set & mapping
geom_point(
data = avocado_data_means,
mapping = aes(
x = mean_volume,
y = mean_price
), # change shape of element to display (see below)
shape = 9,
# change size of element to display
size = 12,
color = "skyblue"
)
```

Categorical distinction is frequently important in data analysis. Just think of the different combinations of factor levels in a factorial design, or the difference between conventionally grown and organically grown avocados. `ggplot`

understands grouping very well and acts on appropriately, if you tell it to in the right way.

Grouping can be relevant for different aspects of a plot: the color of points or lines, their shape, or even whether to plot everything together or separately. For instance, we might want to display different types of avocados in a different color. We can do this like so:

```
%>%
avocado_data ggplot(
aes(
x = log(total_volume_sold),
y = average_price,
# use a different color for each type of avocado
color = type
)+
) geom_point()
```

Notice that we added the grouping information inside of `aes`

to the call of `ggplot`

. This way the grouping is the global default for the whole plot. Check what happens when we then add another layer, like `geom_smooth`

:

```
%>%
avocado_data ggplot(
aes(
x = log(total_volume_sold),
y = average_price,
# use a different color for each type of avocado
color = type
)+
) geom_point() +
geom_smooth(method = "lm")
```

The regression lines will also be shown in the colors of the underlying scatter plot. We can change this by overwriting the `color`

attribute locally, but then we lose the grouping information:

```
%>%
avocado_data ggplot(
aes(
x = log(total_volume_sold),
y = average_price,
# use a different color for each type of avocado
color = type
)+
) geom_point() +
geom_smooth(method = "lm", color = "black")
```

To retrieve the grouping information, we can change the explicit keyword `group`

(which just treats data from the relevant factor levels differently without directly changing their appearance):

```
%>%
avocado_data ggplot(
aes(
x = log(total_volume_sold),
y = average_price,
# use a different color for each type of avocado
color = type
)+
) geom_point() +
geom_smooth(
# tell the smoother to deal with avocados types separately
aes(group = type),
method = "lm",
color = "black"
)
```

Finally, we see that the lines are not uniquely associable with the avocado type, so we can also change the regression line’s `shape`

attribute conditional on avocado type:

```
%>%
avocado_data ggplot(
aes(
x = log(total_volume_sold),
y = average_price,
# use a different color for each type of avocado
color = type
)+
) geom_point() +
geom_smooth(
# tell the smoother to deal with avocados types separately
aes(group = type, linetype = type),
method = "lm",
color = "black"
)
```

If done with the proper mind and heart, plots intended to share (and to communicate a point, following the idea of hypothesis-driven visualization) will usually require a lot of tweaking. We will cover some of the most frequently relevant tweaks in Section 6.6.

To nevertheless get a feeling of where the journey is going, at least roughly, here is an example of a plot of the avocado data which is much more tweaked and honed. No claim is intended regarding the false idea that this plot is in any sense optimal. There is not even a clear hypothesis or point to communicate. This just showcases some functionality. Notice, for instance, that this plot uses two layers, invoked by `geom_point`

which shows the scatter plot of points and `geom_smooth`

which layers on top the point cloud regression lines (one for each level in the grouping variable).

```
# pipe data set into function `ggplot`
%>%
avocado_data # reverse factor level so that horizontal legend entries align with
# the majority of observations of each group in the plot
mutate(
type = fct_rev(type)
%>%
) # initialize the plot
ggplot(
# defined mapping
mapping = aes(
# which variable goes on the x-axis
x = total_volume_sold,
# which variable goes on the y-axis
y = average_price,
# which groups of variables to distinguish
group = type,
# color and fill to change by grouping variable
fill = type,
linetype = type,
color = type
)+
) # declare that we want a scatter plot
geom_point(
# set low opacity for each point
alpha = 0.1
+
) # add a linear model fit (for each group)
geom_smooth(
color = "black",
method = "lm"
+
) # change the default (normal) of x-axis to log-scale
scale_x_log10() +
# add dollar signs to y-axis labels
scale_y_continuous(labels = scales::dollar) +
# change axis labels and plot title & subtitle
labs(
x = 'Total volume sold (on a log scale)',
y = 'Average price',
title = "Avocado prices against amount sold",
subtitle = "With linear regression lines"
)
```

**Exercise 6.1: Find the match**

Determine which graph was created with which code:

Code 1:

```
<- ggplot(avocado_data,
code_1 mapping = aes(
x = average_price,
y = log(total_volume_sold),
color = type
)+
) geom_point() +
geom_smooth(method = "lm")
```

Code 2:

```
<- ggplot(avocado_data,
code_2 mapping = aes(
x = log(total_volume_sold),
y = average_price,
color = type
)+
) geom_point() +
geom_smooth(method = "lm")
```

Code 3:

```
<- ggplot(avocado_data,
code_3 mapping = aes(
x = log(total_volume_sold),
y = average_price
)+
) geom_smooth(method = "lm", color = "black") +
geom_point(alpha = 0.1, color = "blue") +
labs(
x = 'Total volume sold (on a log scale)',
y = 'Average price',
title = "Avocado prices against amount sold"
)
```

Code 4:

```
<- ggplot(avocado_data,
code_4 mapping = aes(
x = log(total_volume_sold),
y = average_price,
linetype = type
)+
) geom_smooth(method = "lm", color = "black") +
geom_point(alpha = 0.1, color = "blue") +
labs(
x = 'Total volume sold (on a log scale)',
y = 'Average price',
title = "Avocado prices against amount sold"
)
```

Plot 1:

Plot 2:

Plot 3:

Plot 1: Code 4

Plot 2: Code 1

Plot 3: Code 2

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” *Journal of Computational and Graphical Statistics* 19 (1): 3–28.

If you run this code for yourself, the output is likely to look different from what is shown here. This is because this web-book uses a default theme for all of its plots. We will come back to customization with themes later.↩︎

Notice that, as soon as we add the linear regression line, it makes sense to use the logarithm of

`total_volume_sold`

because otherwise, the fit is quite ridiculous. The logarithm helps to spread out the large number of data points where`total_volume_sold`

is very low, and to “bring back to the flock” the data points where`total_volume_sold`

is outliery high. It can be quite useful to use such transformations, if they are well understood. It is controversial whether such transformations should precede statistical analyses, but that is not important right now.↩︎