8.1 Statistical models
In its most common natural sense, a “model” is a model of something: it is intended to represent something else in a condensed, abstract, and more practical form, where what counts as practical depends on a given purpose. For any given purpose, a good model represents the relevant aspects of reality and abstracts away from irrelevant features that might otherwise blur our vision. The most common purpose of a statistical model is either to learn something about reality by drawing inferences from data (possibly with the further goal of making an informed practical decision) or to make predictions about unknown events (future, present, or past unknowns).
A statistical model \(M\) is a model of a random process \(R\) that could have generated some kind of observable data that we are interested in.[^43] The model \(M\) is then a formally precise formulation of our assumptions about this random process \(R\).
Often, we want to explain some part of our data observations, the dependent variable(s) \(D_{\text{DV}}\), in terms of some other observations, the independent variables \(D_{\text{IV}}\) (see Chapter 3.3 for more on the notion of (in-)dependent variables). But it is also possible that there are no independent variables in terms of which we would like to model the dependent variable \(D_{\text{DV}}\).
A model \(M\) for data \(D\) fixes a likelihood function for \(D_\text{DV}\). The likelihood function determines how likely any potential data observation \(D_\text{DV}\) is, given the corresponding observations in \(D_\text{IV}\). Most often, the likelihood function also has free parameters, represented by a parameter vector \(\theta\). The basic (and yet rather uninformative) notation for a likelihood function of model \(M\) for data \(D\) with parameter vector \(\theta\) is therefore:[^44]
\[ P_M(D_\text{DV} \mid D_\text{IV}, \theta) \]
Bayesian models have an additional component, namely a prior distribution over parameter values, commonly written as:
\[ P_M(\theta) \]
The Bayesian prior over parameter values can be used to regularize inference and/or to represent any motivated and justifiable a priori assumptions about parameter values that are plausible given our knowledge so far. Section 8.3 elaborates on parameters and priors. But first, we should take a look at an example, which we will use in the remainder of this chapter for further illustration.
Example: Binomial Model. The data we are interested in comes from a sequence of flips of a coin with bias \(\theta_c \in [0;1]\). We have observed that \(k\) of the \(N\) flips turned out to be heads. We know \(N\) and \(k\), but we do not know \(\theta_c\). We will use the Binomial Model in later sections to infer the latent (= not directly observable) coin bias \(\theta_c\).
The coin’s bias \(\theta_c\) is the only parameter of this model. The dependent variable is \(k\). \(N\) is another data observation (treated here as an independent variable[^45]).
The likelihood function for this model is the Binomial distribution:
\[ P_M(k \mid \theta_c, N) = \text{Binomial}(k, N, \theta_c) = \binom{N}{k}\theta_c^k(1-\theta_c)^{N-k} \]
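To make this concrete, here is a minimal sketch in Python (assuming scipy is available; the values \(N = 24\) and \(k = 7\) are hypothetical) that evaluates the Binomial likelihood for a few candidate coin biases:

```python
from scipy.stats import binom

# Hypothetical observations: N = 24 flips, k = 7 heads.
N, k = 24, 7

# Evaluate the Binomial likelihood P_M(k | theta_c, N) for candidate biases.
for theta_c in [0.25, 0.5, 0.75]:
    print(f"P_M({k} | theta_c={theta_c}, N={N}) = {binom.pmf(k, N, theta_c):.4f}")
```

Viewed as a function of \(\theta_c\) for fixed \(k\) and \(N\), these values show which candidate coin biases make the observed data more or less likely.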
For reasons that will become clear later, we use a Beta distribution for the prior of \(\theta_c\). For example, we can use parameters so that the ensuing distribution is flat (a so-called “uninformative prior”; more on this below):
\[ P_M(\theta_c) = \text{Beta}(\theta_c, 1, 1) \]
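A quick check (a sketch using scipy; the grid of evaluation points is arbitrary) confirms that this prior is flat: the Beta(1, 1) density is 1 everywhere on \([0;1]\):

```python
import numpy as np
from scipy.stats import beta

# The Beta(1, 1) density assigns the same value to every theta_c in [0, 1].
theta_grid = np.linspace(0, 1, 5)
print(beta.pdf(theta_grid, a=1, b=1))  # -> [1. 1. 1. 1. 1.]
```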
There are three main uses for models in statistical data analysis:
- Prediction: Models can be used to make predictions about future or hypothetical data observations. We will see an example of this in Section 8.3 of this chapter; a small code sketch also follows this list.
- Parameter estimation: Based on model \(M\) and data \(D\), we try to infer which value of the parameter vector \(\theta\) we should believe in or work with (e.g., base our decision on). Parameter estimation can also serve knowledge gain, especially if (some component of) \(\theta\) is theoretically interesting. We will deal with parameter estimation in Chapter 9.
- Model comparison: If we formulate at least two alternative models, we can ask which model better explains or better predicts some data. In some of its guises, model comparison helps answer the question of whether a given data set provides evidence in favor of one model and against another, and if so, how much. Model comparison is the topic of Chapter 10.
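As a small preview of the prediction use case (treated fully in Section 8.3): samples from the Binomial Model’s prior predictive distribution can be generated by first drawing a coin bias \(\theta_c\) from the prior and then drawing \(k\) from the likelihood. A minimal sketch in Python, using numpy and a hypothetical \(N = 24\):

```python
import numpy as np

rng = np.random.default_rng(2024)
N = 24  # hypothetical number of flips

# Prior predictive sampling: draw a coin bias from the Beta(1, 1) prior,
# then draw a number of heads from the Binomial likelihood.
theta_samples = rng.beta(1, 1, size=10_000)
k_samples = rng.binomial(N, theta_samples)

# With a flat prior on theta_c, every k in 0..N is (roughly) equally likely.
values, counts = np.unique(k_samples, return_counts=True)
print(values)                 # 0, 1, ..., 24
print(counts / counts.sum())  # each close to 1/25 = 0.04
```

This illustrates what the model, before seeing any data, expects the observations to look like.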
[^43]: In most common parlance, however, we often speak of “a model of the data” or of “modeling the data”, but this is only sloppy shorthand for “a model of (what we assume is) a random process that could generate data of the relevant kind”.
[^44]: Since in many contexts the meaning will be clear enough, we follow common practice and write \(P(D \mid \theta)\) as a shortcut for the, strictly speaking, correct but cumbersome \(P(\mathcal{D} = D \mid \Theta = \theta)\). In this latter notation, \(\mathcal{D}\) is the class of all relevant observable data and \(\Theta\) is the range of a possibly high-dimensional vector of parameter values. We diverge from the common practice of using capital roman letters for random variables and small roman letters for their values because parameter vectors are traditionally written as \(\theta\) and the small letter \(\textrm{d}\) (albeit non-italic) is reserved for differentials.
[^45]: It is fair to treat \(N\) as an independent variable if it was determined at the beginning of the experiment (= sequence of flips), so that the only dependent measure is the number \(k\) of head outcomes for fixed \(N\).