3.4 Basics of experimental design

The most basic template for an experiment is to measure a quantity of interest (the dependent variable) without taking any variation in independent variables into account. For instance, we measure the time it takes for an object with a specific shape and weight to hit the ground when dropped from a height of exactly 2 meters. To filter out measurement noise, we do not record just one observation but, ideally, as many as we practically can. We use the measurements (in our concrete example, time measurements) to test a theory about acceleration and gravity. Data from such a simple measurement experiment would be just a single vector of numbers.
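As a minimal sketch of what such data could look like in R, the block below stores a handful of fall times in a single vector and compares their mean to a textbook prediction. The five measurements are invented for illustration, and the prediction \(t = \sqrt{2h/g}\) assumes free fall from rest without air resistance and \(g = 9.81\ \mathrm{m/s^2}\); none of these details come from an actual experiment.

```r
# Hypothetical fall times in seconds; these numbers are made up for
# illustration and are not real measurements.
fall_times <- c(0.64, 0.62, 0.66, 0.63, 0.65)

# Averaging over repeated observations reduces the influence of noise.
mean_fall_time <- mean(fall_times)

# Prediction for free fall from rest, ignoring air resistance:
# t = sqrt(2 * h / g), with h = 2 m and g = 9.81 m/s^2 (assumed value).
height <- 2
g <- 9.81
predicted_time <- sqrt(2 * height / g)

# Difference between the observed mean and the prediction, in seconds.
mean_fall_time - predicted_time
```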

A more elaborate kind of experiment would allow for at least one independent variable. Another archetypical example of an empirical experiment would be a medical study, e.g., one in which we are interested in the effect of a particular drug on the blood pressure of patients. We would then randomly allocate each participant to one of two groups. One group, the treatment group, receives the drug in question; the other group, the control group, receives a placebo (and nobody, not even the experimenter, knows who receives what). After a pre-defined exposure to either drug or placebo, blood pressure (for simplicity, just systolic blood pressure) is measured. The interesting question is whether there is a difference between the measurements across groups. This is a simple example of a one-factor design. The factor in question is which group any particular measurement belongs to. Data from such an experiment could look like this:

library(tibble)  # provides tribble(); also attached by library(tidyverse)

tribble(
  ~subj_id,     ~group,        ~systolic,
  1,            "treatment",   118,
  2,            "control",     132,
  3,            "control",     116,
  4,            "treatment",   127,
  5,            "treatment",   122
)

## # A tibble: 5 × 3
##   subj_id group     systolic
##     <dbl> <chr>        <dbl>
## 1       1 treatment      118
## 2       2 control        132
## 3       3 control        116
## 4       4 treatment      127
## 5       5 treatment      122
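To make the question about a difference between groups concrete, here is a small sketch that computes the mean systolic pressure per group for the five rows shown above. The names `bp_data` and `group_means` are chosen here for illustration only. With so few observations, the resulting numbers are a purely descriptive summary and say nothing yet about whether the drug has an effect.

```r
library(tibble)
library(dplyr)

# The example data from the one-factor design above.
bp_data <- tribble(
  ~subj_id, ~group,      ~systolic,
  1,        "treatment", 118,
  2,        "control",   132,
  3,        "control",   116,
  4,        "treatment", 127,
  5,        "treatment", 122
)

# Number of participants and mean systolic pressure for each level
# of the factor `group`.
group_means <- bp_data %>%
  group_by(group) %>%
  summarise(n = n(), mean_systolic = mean(systolic))

group_means
```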

For the purposes of this course, which is not a course on experimental design, it is enough to be aware of just a few key concepts of experimental design. We will go through some of them in the following.

3.4.1 What to analyze? – Dependent variables

To begin with, it is important to realize that there is quite some variation in what counts as a dependent variable. Not only can there be more than one dependent variable, but each dependent variable can also be of quite a different type (nominal, ordinal, metric, …), as discussed in the previous section. Moreover, we need to carefully distinguish between the actual measurement/observation and the dependent variable itself. The dependent variable is (usually) what we plot, analyze and discuss, but very often, we measure much more or something else. The dependent variable (of analysis) could well just be one part of the measurement. For example, a standard measure of blood pressure has a number for systolic and another for diastolic pressure. Focussing on just one of these numbers is a (hopefully: theoretically motivated; possibly: arbitrary; in the worst case: result-oriented) decision of the analyst. More interesting examples of such data preprocessing frequently arise in the cognitive sciences, for example:

eye-tracking: the measured data are triples consisting of a time-point and two spatial coordinates, but what might be analyzed is just the relative proportion of looks at a particular spatial region of interest (some object on the screen) in a particular temporal region of interest (up to 200 ms after the image appeared)
EEG: individual measurements obtained by EEG are very noisy, so that the dependent measure in many analyses is an aggregation over the mean voltage recorded by selected electrodes, where averages are taken for a particular subject over many trials of the same condition (repeated measures) that this subject has seen
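The eye-tracking case can be sketched in a few lines of R. Everything in the block below is hypothetical: the gaze samples, the rectangular region of interest and its pixel coordinates are invented for illustration, and real eye-tracking pipelines involve many more steps (e.g., fixation detection). The point is only to show how a rich measurement (time plus two coordinates) is reduced to a single dependent variable.

```r
# Hypothetical gaze samples: time since image onset (ms) and screen
# coordinates (pixels). All values are invented for illustration.
gaze <- data.frame(
  time_ms = c(20, 60, 100, 140, 180, 220, 260),
  x       = c(512, 300, 310, 640, 320, 305, 700),
  y       = c(384, 210, 220, 400, 190, 200, 500)
)

# Assumed spatial region of interest: a rectangle around one object.
roi <- list(x_min = 250, x_max = 350, y_min = 150, y_max = 250)

# Temporal region of interest: up to 200 ms after image onset.
in_window <- gaze$time_ms <= 200

# Which samples fall inside the rectangle?
in_roi <- gaze$x >= roi$x_min & gaze$x <= roi$x_max &
  gaze$y >= roi$y_min & gaze$y <= roi$y_max

# Dependent variable: proportion of looks at the spatial region of
# interest among all samples inside the temporal window.
prop_looks <- mean(in_roi[in_window])
prop_looks
```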

But we do not need to go fancy in our experimental methods to see how issues of data processing affect data analysis at its earliest stages, namely by selecting the dependent variable (that which is to be analyzed). Just take the distinction between closed questions and open questions in text-based surveys. In closed questions, participants select an answer from a finite (usually small) number of choices. In open questions, however, they can write text freely, or they can draw, sing, pronounce, gesture, etc. Open response formats are great and naturalistic, but they, too, often require the analyst to carve out a particular aspect of the (rich, natural) observed reality to enter the analysis.

3.4.2 Conditions, trials, items

A factorial design is an experiment with at least two independent variables, all of which are (ordered or unordered) factors.¹⁶ Many psychological studies are factorial designs. Whole batteries of analysis techniques have been developed specifically tuned to these kinds of experiments.

Factorial designs are often described in terms of short abbreviations. For example, an experiment described as a “\(n \times m\) factorial design” would have two factors of interest, the first of which has \(n\) levels, the second of which has \(m\) levels. For example, a \(2 \times 3\) factorial design could have one independent variable recording a binary distinction between control and treatment group, and another independent variable representing an orthogonal distinction of gender in categories ‘male’, ‘female’ and ‘non-binary’.
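The combinations of factor levels in such a design can be enumerated directly. The following sketch uses base R's `expand.grid()` for the \(2 \times 3\) example just described; the name `design_cells` is chosen here for illustration.

```r
# All combinations of the levels of the two factors from the
# 2 x 3 example: group (2 levels) and gender (3 levels).
design_cells <- expand.grid(
  group  = c("control", "treatment"),
  gender = c("male", "female", "non-binary"),
  stringsAsFactors = FALSE
)

design_cells

# The number of rows is the product of the numbers of levels.
nrow(design_cells)
```

Adding a third argument to `expand.grid()` would enumerate the combinations of a three-factor design in the same way.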

For a \(2 \times 2 \times 3\) factorial design, there are \(2 \times 2 \times 3 = 12\) different experimental conditions (also sometimes called design cells). An important distinction in experimental design is whether all participants contribute data to all of the experimental conditions, or whether each only contributes to a part of them. If participants only contribute data to a part of all experimental conditions, this is called a between-subjects design. If all participants contribute data to all experimental conditions, we speak of a within-subjects design. Clearly, sometimes the nature of a design factor determines whether the study can be within-subjects. For example, switching gender for the purpose of a medical study on blood pressure drugs is perhaps a tad much to ask of a participant (though possibly a very enlightening experience). If there is room for the experimenter’s choice of study type, it pays to be aware of some of the clear advantages and drawbacks of either method, as listed in Table 3.3.

Table 3.3: Comparison of the pros and cons of between- and within-subjects designs.
between-subjects	within-subjects
no confound between conditions	possible cross-contamination between conditions
more participants needed	fewer participants needed
less associated data for analysis	more associated data for analysis

No matter whether we are dealing with a between- or within-subjects design, another important question is whether each participant gives us only one observation per design cell, or more than one. If participants contribute more than one observation to a design cell, we speak of a repeated-measures design. Such designs are useful as they help separate the signal from the noise (recall the initial example of time measurement from physics). They are also economical because getting several observations’ worth of relevant data from a single participant for each design cell means that we (normally) have to get fewer people to do the experiment.

However, exposing a participant repeatedly to the same experimental condition can be detrimental to an experiment’s purpose. Participants might recognize the repetition and develop quick coping strategies to deal with the boredom, for example. For this reason, repeated-measures designs usually include different kinds of trials:

Critical trials are, roughly put, the trials of the actual experiment, i.e., each of them belongs to one of the experiment’s design cells.
Filler trials are packaged around the critical trials to prevent blatant repetition, predictability or recognition of the experiment’s purpose.
Control trials are trials whose data is not used for statistical inference but for checking the quality of the data (e.g., attention checks or tests of whether a participant understood the task correctly).

When participants are exposed to several different kinds of trials and even several instances of the same experimental condition, it is also often important to introduce some variability between the instances of the same types of trials. Therefore, psychological experiments often use different items, i.e., different (theoretically exchangeable) instantiations of the same (theoretically important) pattern. For example, if a careful psycholinguist designs a study on the processing of garden-path sentences, she will include not just one example (“The horse raced past the barn fell”) but several (e.g., “Since Jones frequently jogs a mile is a short distance to her”). Item-variability is also important for statistical analyses, as we will see when we talk about hierarchical modeling.

In longer experiments, especially within-subjects repeated-measures designs in which participants encounter a lot of different items for each experimental condition, clever regimes of randomization are important to minimize the possible effect of carry-over artifacts, for example. A frequent method is pseudo-randomization, where the trial sequence is not completely arbitrary but arbitrary within certain constraints, such as a particular block design, where each block presents an identical number of trials of each type, but each block shuffles the sequence of its types completely at random.
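A block design of this kind can be sketched in a few lines of base R. The numbers of blocks and of trials per type, the seed, and the helper `make_block()` are all assumptions made for this illustration; actual experiments typically add further constraints (e.g., no two critical trials in a row).

```r
set.seed(2024)  # arbitrary seed, only to make the sketch reproducible

trial_types <- c("critical", "filler", "control")  # trial kinds from the text
n_per_type  <- 2  # assumed number of trials of each type per block
n_blocks    <- 3  # assumed number of blocks

# Hypothetical helper: build one block with an identical number of
# trials of each type, in a completely random order within the block.
make_block <- function(block_id) {
  trials <- rep(trial_types, each = n_per_type)
  data.frame(
    block = block_id,
    trial_type = sample(trials),  # random permutation of the block's trials
    stringsAsFactors = FALSE
  )
}

# Stack the blocks to obtain the full pseudo-randomized trial sequence.
trial_sequence <- do.call(rbind, lapply(seq_len(n_blocks), make_block))

# Check the constraint: every block has the same count of each type.
block_counts <- table(trial_sequence$block, trial_sequence$trial_type)
block_counts
```

The sequence as a whole is not a free shuffle of all trials: the order is random only within each block, which is what keeps trial types evenly spread over the course of the experiment.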

The complete opposite of a within-subjects repeated-measures design is a so-called single-shot experiment, in which each participant gives exactly one data point for one experimental condition.

3.4.3 Sample size

A very important question for experimental design is that of the sample size: how many data points do we need (per experimental condition)? We will come back to this issue only much later in this course when we talk about statistical inference. This is because the decision of how many, say, participants to invite for a study should ideally be influenced not only by the available time and money, but also by statistical considerations of the kind: how many data points do I need in order to obtain a reasonable level of confidence in the resulting statistical inferences I care about?

Exercise 3.2: Experimental Design

Suppose that we want to investigate the effect of caffeine ingestion and time of day on reaction times in solving simple math tasks.

The following table shows the measurements of two participants:

## # A tibble: 12 × 4
##    subject_id `RT (ms)` caffeine `time of day`
##         <dbl>     <dbl> <chr>    <chr>        
##  1          1     43490 none     morning      
##  2          1     35200 medium   morning      
##  3          1     33186 high     morning      
##  4          1     26350 none     afternoon    
##  5          1     27004 medium   afternoon    
##  6          1     26492 high     afternoon    
##  7          2     42904 none     morning      
##  8          2     36129 medium   morning      
##  9          2     30340 high     morning      
## 10          2     28455 none     afternoon    
## 11          2     40593 medium   afternoon    
## 12          2     23992 high     afternoon

Is this experiment a one-factor or a full factorial design? What is/are the factor(s)? How many levels does each factor have?

This experiment is a \(3 \times 2\) full factorial design. It has two factors, caffeine (levels: none, medium, high) and time of day (levels: morning, afternoon).

How many experimental conditions are there?

There are \(3 \times 2 = 6\) different experimental conditions.

Is it a between- or within-subjects design?

Within-subjects design (each participant contributes data to all experimental conditions).

What is the dependent variable, what is/are the independent variable(s)?

Dependent variable: RT (the reaction time)
Independent variable 1: caffeine (the caffeine dosage)
Independent variable 2: time of day

Is this experiment a repeated-measures design? Explain your answer.

No, each participant contributes exactly one data point per design cell.
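The answers above can also be checked against the data. The following is one possible sketch, not a prescribed solution; it re-enters the table with syntactic column names (`rt_ms`, `time_of_day`) in place of `RT (ms)` and `time of day`, and all object names are chosen here for illustration.

```r
library(tibble)
library(dplyr)

# The measurements of the two participants from the exercise.
rt_data <- tribble(
  ~subject_id, ~rt_ms, ~caffeine, ~time_of_day,
  1,           43490,  "none",    "morning",
  1,           35200,  "medium",  "morning",
  1,           33186,  "high",    "morning",
  1,           26350,  "none",    "afternoon",
  1,           27004,  "medium",  "afternoon",
  1,           26492,  "high",    "afternoon",
  2,           42904,  "none",    "morning",
  2,           36129,  "medium",  "morning",
  2,           30340,  "high",    "morning",
  2,           28455,  "none",    "afternoon",
  2,           40593,  "medium",  "afternoon",
  2,           23992,  "high",    "afternoon"
)

# Number of experimental conditions (design cells) present in the data.
n_cells <- rt_data %>%
  distinct(caffeine, time_of_day) %>%
  nrow()

# Number of observations per participant and design cell.
obs_per_cell <- rt_data %>%
  count(subject_id, caffeine, time_of_day, name = "n_obs")

# Number of different design cells each participant contributes to.
cells_per_subject <- obs_per_cell %>%
  group_by(subject_id) %>%
  summarise(n_cells_seen = n())

# Within-subjects: every participant contributes to every design cell.
is_within <- all(cells_per_subject$n_cells_seen == n_cells)

# Repeated measures: at least one participant has more than one
# observation in some design cell.
is_repeated <- any(obs_per_cell$n_obs > 1)

c(n_cells = n_cells, is_within = is_within, is_repeated = is_repeated)
```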

¹⁶ The archetypical medical experiment discussed above is a one-factor design. In contrast, the term ‘factorial design’ is usually used to refer to what is also often called a full factorial design. These are designs with at least two independent variables.