12.4 Comparison of approaches

We saw three conceptually different approaches to linear regression: (i) based on ordinary least squares, (ii) based on the likelihood alone, and (iii) based on Bayesian inference (with non-informative priors). At the core of linear regression lies the linear predictor \(\xi = X \beta\), which all three approaches use. The three approaches differ in how they determine the regression coefficients \(\beta\) that feed this linear predictor. Another crucial difference is in how these approaches make predictions about new data observations \(y_\text{new}\) given some (hypothetical or actually observed) vector of predictor variables \(x_\text{new}\). Let’s go through these differences in more detail.

The OLS-based approach determines the coefficients based on a geometric notion of distance, namely squared loss. The prediction of an OLS regression model for a new data set’s dependent variables would just be:¹

\[y_\text{new} = \hat\xi = X_\text{new} \hat \beta\]

In words, the OLS model predicts that \(y_\text{new}\) lies exactly on the best-fitting regression line. This is a deterministic, clear-cut, point-valued prediction, and as such almost certainly false. The OLS approach, insofar as we have seen it, does not come with a measure of spread around this best prediction.
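
To make this concrete, here is a minimal sketch in Python with NumPy (not code from this chapter); the toy data `X` and `y`, the estimate `beta_hat`, and the new predictor matrix `X_new` are all made up for illustration. It computes the least-squares estimate and the resulting point-valued prediction:

```python
import numpy as np

# Illustrative sketch only: toy data, not from the chapter.
rng = np.random.default_rng(2024)
X = np.column_stack([np.ones(100), rng.normal(size=100)])   # intercept + one predictor
y = X @ np.array([2.0, 0.5]) + rng.normal(scale=1.0, size=100)

# OLS estimate of the coefficients (minimizes squared loss)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hypothetical new predictor matrix (one row: intercept term plus predictor value)
X_new = np.array([[1.0, 1.5]])

# Deterministic point prediction: a single value on the regression line, no spread
y_new_point = X_new @ beta_hat
```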

The MLE-based approach uses a normal distribution to also quantify the likely spread of observations around the best predictor line. The predictions of a trained MLE-based regression model are therefore probabilistic. They are samples from a normal distribution whose central tendency is the best linear predictor:

\[ \begin{align*} \hat \xi & = X_\text{new} \hat \beta \\ y_\text{new} & \sim \text{Normal}(\hat \xi, \hat \sigma) \end{align*} \]
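
Continuing the illustrative sketch from above (reusing `X`, `y`, `beta_hat`, `X_new`, and `rng`), a possible realization of this scheme estimates \(\hat\sigma\) from the residuals and then samples predictions around the linear predictor:

```python
# Continuing the sketch above: the MLE of sigma is the root mean squared residual
residuals = y - X @ beta_hat
sigma_hat = np.sqrt(np.mean(residuals ** 2))   # note: divides by n, not n - 1

# Probabilistic predictions: samples around the best linear predictor
xi_hat = (X_new @ beta_hat).item()
y_new_samples = rng.normal(loc=xi_hat, scale=sigma_hat, size=1000)
```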

Finally, the Bayesian approach is, so to speak, even more stochastic than the MLE-based approach. It does not assume a single best linear-predictor vector \(\hat{\xi}\) for its (posterior) predictions, but rather gives us a probability distribution over linear predictors. In vague terms, we could say that Bayesian regression gives us not a single regression line but a weighted cloud of (usually infinitely many) regression lines. A schematic representation of the posterior predictive for the new data point \(y_\text{new}\) given \(x_\text{new}\) in Bayesian regression is:

\[ \begin{align*} \beta_\text{sample}, \sigma_\text{sample} & \sim \text{Bayesian posterior given data} \\ \xi_\text{sample} & = X_\text{new} \beta_\text{sample} \\ y_\text{new} & \sim \text{Normal}(\xi_\text{sample}, \sigma_\text{sample}) \end{align*} \]
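
As an illustrative sketch of this sampling scheme (again reusing `X`, `y`, `beta_hat`, `X_new`, and `rng` from above), the standard non-informative prior \(p(\beta, \sigma^2) \propto 1/\sigma^2\) allows posterior samples to be drawn directly, without an MCMC sampler. This is just one way to realize the schema, not necessarily how the chapter’s models are fit:

```python
# Continuing the sketch: posterior sampling under the standard
# non-informative prior p(beta, sigma^2) proportional to 1/sigma^2.
n, k = X.shape
residuals = y - X @ beta_hat
s2 = np.sum(residuals ** 2) / (n - k)          # classical variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

n_draws = 1000
# sigma^2 | y  ~  scaled inverse chi-square with n - k degrees of freedom
sigma2_samples = (n - k) * s2 / rng.chisquare(df=n - k, size=n_draws)
# beta | sigma^2, y  ~  Normal(beta_hat, sigma^2 (X'X)^{-1})
beta_samples = np.array([
    rng.multivariate_normal(mean=beta_hat, cov=s2_i * XtX_inv)
    for s2_i in sigma2_samples
])

# A "cloud" of linear predictors, one per posterior sample ...
xi_samples = beta_samples @ X_new.ravel()
# ... and posterior predictive draws for the new observation
y_new_pred = rng.normal(loc=xi_samples, scale=np.sqrt(sigma2_samples))
```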


  1. Here, \(X_\text{new}\) is the predictor matrix for the new predictor vector \(x_\text{new}\).