6 Data Visualization

Numerical summaries of complex data always incur information loss. Visualization is also lossy but, if done well, less so. Any serious data analysis should start with a process in which the analyst becomes intimate with the data at hand. Visualization is an integral part of this data-intimacy.

Section 6.1 demonstrates how summary statistics can be misleading and how a simple visualization can be much more revealing. Section 6.2 offers some reflection on what makes a data visualization successful. Section 6.3 introduces the basics of data visualization with the ggplot2 package, an integral part of the tidyverse.²⁴ This first exposition is based on a scatter plot for the avocado price data. Going beyond scatter plots, Section 6.4 looks at some common types of plots and how to realize them using the geom_ family of functions in ggplot2.

The learning goals for this chapter are:

- obtain a basic understanding of better/worse plotting
  - understand the idea of hypothesis-driven visualization
- develop a basic understanding of the ‘grammar of graphics’
- get familiar with frequent visualization strategies
  - bar plots, densities, violins, error bars, etc.
- be able to fine-tune graphs for better visualization

²⁴ It is possible to create ggplot-like graphs with similar syntax in Python. The Gadfly package for Julia uses very similar ideas (of incremental composition) but a different syntax.