3.2 Different kinds of data

There are different kinds of data. Figure 3.1 shows some basic distinctions, represented in a conceptual hierarchy.

Hierarchy of different kinds of data relevant for 'data science'.

Figure 3.1: Hierarchy of different kinds of data relevant for ‘data science’.

It is easy but wrong to think that data always has to be information based on observations of the world. It is easy to think this because empirical data, i.e., data obtained from empirical observation, is the most common form of data (given that it is, arguably, most relevant for decision making and argument). But it is wrong to think this because we can just as well look at virtual data. For example, virtual data, which is of interest to a data analyst, could be data obtained from computer simulation studies, e.g., from, say, one billion runs of a multi-agent simulation intended to shed light on the nature of cooperative interaction. It makes sense to analyze such data with the same tools as data from an experiment. For instance, we might find out that some parameter constellations in the simulation run are (statistically) most conducive to producing cooperative behavior among our agents. Another example of virtual data is data generated as predictions of a model, which we can use to test whether that model is any good, in so-called model criticism.14 Finally, we should also include logically possible sample data in this list, because of its importance to central ideas of statistical inference (especially \(p\)-values, see Section 16). Logically possible sample data are those that were neither observed nor predicted by a model, but something that could have been observed hypothetically, something that it is merely logically possible to observe, even if it would almost never happen in reality or would not be predicted by any serious model.

The most frequent form of data, empirical data about the actual world, comes in two major variants. Observational data is data gathered by (passively) observing and recording what would have happened even if we had not been interested in it, so to speak. Examples of observational data are collections of socio-economic variables, like gender, education, income, number of children, etc. In contrast, experimental data is data recorded in a strict regime of manipulation-and-observation, i.e., a scientific experiment. Some pieces of information can only be recorded in an observational study (annual income), and others can only be obtained through experimentation (memory span). Both methods of data acquisition have their own pros and cons. Here are some of the more salient ones:

Table 3.1: Comparison of the pros and cons of observational data and experimental data.
observational experimental
ecologically valid possibly artificial
easy/easier to obtain hard/harder to obtain
correlation & causation hard to tease apart may yield information on causation vs. correlation

No matter what kind of data we have at hand, there are at least two prominent purposes for which data can be useful: explanation and prediction. Though related, it is useful to keep these purposes cleanly apart. Data analysis for explanation uses the data to better understand the source of the data (the world, a computer simulation, a model, etc.). Data analysis for prediction tries to extract regularities from the data gathered so far to make predictions (as accurately as possible) about future or hitherto unobserved data.


  1. We will later speak of prior/posterior predictions for this kind of data. Other applicable terms are repeat data or sometimes fake data.↩︎