5 Summary statistics


A summary statistic is a single number that represents one aspect of a possibly much more complex chunk of data. This single number might, for example, be the maximum or minimum value of a vector of one billion observations. The large data set (one billion observations) is thereby reduced to a single number that captures one aspect of that data. Summary statistics are, as a general (but violable) rule, many-to-one functions: many different data sets are mapped to the same value. They compress complex information into a simpler representation.
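To make the many-to-one idea concrete, here is a minimal sketch. It uses Python purely for illustration; the two vectors and the helper `summarize` are made up for this example and are not taken from any data set used in this chapter. Two different vectors end up with exactly the same minimum, maximum and mean, so these numbers alone do not let us recover the original data.

```python
from statistics import mean

# Two different, made-up data vectors (hypothetical, for illustration only)
data_a = [2, 4, 6, 8, 10]
data_b = [6, 6, 6, 2, 10]


def summarize(xs):
    """Reduce a vector of numbers to three summary statistics."""
    return {"min": min(xs), "max": max(xs), "mean": mean(xs)}


print(summarize(data_a))
print(summarize(data_b))

# The vectors differ, but their summaries coincide: the mapping is many-to-one.
print(data_a != data_b and summarize(data_a) == summarize(data_b))
```

Each summary statistic here answers one narrow question about the data (smallest value, largest value, average value) and discards everything else.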

Summary statistics are useful for understanding the data at hand and for communicating about a data set, but also for subsequent statistical analyses. As we will see later on, many statistical tests look at a summary statistic \(x\), which is a single value derived from data set \(D\), and compare \(x\) to an expectation of what \(x\) should be like if the process that generated \(D\) really had a particular property. For the moment, however, we use summary statistics only to get comfortable with data: to understand it better and to gain competence in manipulating it.

Section 5.1 first uses the Bio-Logic Jazz-Metal data set to look at a very intuitive class of summary statistics for categorical data, namely counts and proportions. Section 5.2 introduces summary statistics for simple, one-dimensional vectors with numeric information. Section 5.3 looks at measures of the relation between two numerical vectors, namely covariance and correlation. These last two sections use the avocado data set.

The learning goals for this chapter are:

  • become able to compute counts and frequencies for categorical data
  • understand and be able to compute summary statistics for one-dimensional metric data:
    • measures of central tendency
      • mean, mode, median
    • measures of dispersion
      • variance, standard deviation, quantiles
    • non-parametric estimates of confidence
      • bootstrapped CI of the mean
  • understand and be able to compute summary statistics for two-dimensional metric data:
    • covariance
    • Bravais-Pearson correlation