3.1 What is data?

Some say we live in the data age. But what is data, actually? Purist pedants say: “The plural of datum” and add that a datum is just an observation. But when we say “data”, we usually mean a bit more than a bunch of observations. The observation that Jones had an apple and a banana for breakfast may be interesting, but it is not what we usually call “data”.

The Merriam-Webster dictionary offers the following definition:

Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.

This is a teleological definition in the sense that it refers to a purpose: data is something that is “used as a basis for reasoning, discussion, or calculation”. So, what we mean by “data” is, in large part, defined by what we intend to do with it. Another important aspect of what we mean by “data” is that we usually consider data to be systematically structured in some way or another. Even when we speak of “raw data”, we expect there to be some structure (maybe labels, categories, or the notion of a “variable”, discussed in Section 3.3) that distinguishes data from uninterpretable noise. In sum, we can say that data is a representation of information stored in a systematic way for the purpose of inference, argument or decision making.

Let us consider an example of data from a behavioral experiment, namely the King of France experiment. It is not important to know about this experiment for now. We just want to have a first glimpse at what data frequently looks like. Using R (in ways that we will discuss in the next chapter), we can show the content of part of the data as follows:

## # A tibble: 6 × 4
##   submission_id trial_number trial_type response
##           <dbl>        <dbl> <chr>      <lgl>   
## 1           192            1 practice   FALSE   
## 2           192            2 practice   TRUE    
## 3           192            3 practice   FALSE   
## 4           192            4 practice   TRUE    
## 5           192            5 practice   TRUE    
## 6           192            1 filler     TRUE

We see that the data is represented as a tibble and that there are different kinds of columns with different kinds of information. The submission_id is an anonymous identifier for the person whose data is shown here. The trial_number is a consecutive numbering of the different stages of the experiment (at each of which the participant gave one response, listed in the response column). The trial_type tells us which kind of trial each observation is from. There are more columns in this data set, but this is just for a first, rough impression of what “data” might look like. The most important thing to see here is that, following the definition above, data is “information stored in a systematic way”.
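To make the idea of “systematic structure” a little more tangible, here is a minimal sketch that re-creates the six rows shown above by hand. It is only an illustration: the values are typed in from the printout rather than loaded from the actual data set, the name kof_excerpt is made up for this example, and a plain base-R data frame is used as a stand-in for the tibble (so no extra packages are needed). The column types correspond to the <dbl>, <chr> and <lgl> labels in the printout, i.e., numeric, character and logical.

```r
# Hand-typed copy of the six rows printed above
# (an illustration only, not loaded from the real data set).
kof_excerpt <- data.frame(
  submission_id = rep(192, 6),
  trial_number  = c(1, 2, 3, 4, 5, 1),
  trial_type    = c(rep("practice", 5), "filler"),
  response      = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Every column holds values of exactly one type;
# this is part of what makes the data "systematic".
column_types <- sapply(kof_excerpt, class)
print(column_types)

# Because the structure is known, simple questions are easy to ask,
# e.g., how many TRUE responses were given in the practice trials.
is_practice     <- kof_excerpt$trial_type == "practice"
n_true_practice <- sum(kof_excerpt$response[is_practice])
print(n_true_practice)
```

The point of the sketch is not the particular question asked at the end, but that such a question can only be posed so easily because each row is one observation and each column records one kind of information.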