Proof. Using the more explicit formulation of likelihood-based regression, we can rewrite the likelihood function in terms of the probability of “sampling” the error terms \(\epsilon_i\) for each \(y_i\), where \(\epsilon_i = y_i - \xi_i = y_i - (X \beta)_i\):
\[ \begin{align*} LH(\beta) & = \prod_{i = 1}^n \text{Normal}(\epsilon_i \mid \mu = 0, \sigma) \\ & = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi} \sigma} \exp\left[{-\frac{1}{2}\left(\frac{\epsilon_i^2}{\sigma^2}\right)}\right] & \text{[by def. of normal distr.]} \end{align*} \]
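To make this product concrete, here is a minimal numerical sketch that evaluates it directly. The data set is purely hypothetical: `X`, `y`, `beta_true`, and `sigma` are illustrative choices, not part of the text.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example data (illustrative assumptions, not from the text):
# an intercept column plus one predictor, with normally distributed noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta_true = np.array([1.0, 2.0])
sigma = 0.5
y = X @ beta_true + rng.normal(scale=sigma, size=50)

def likelihood(beta, X, y, sigma):
    """LH(beta): product of normal densities of the residuals eps_i = y_i - (X beta)_i."""
    eps = y - X @ beta
    return np.prod(norm.pdf(eps, loc=0.0, scale=sigma))

print(likelihood(beta_true, X, y, sigma))
```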
Since we are only interested in the maximum of this function, we can equivalently look for the maximum of \(\log LH(\beta)\), because the logarithm is a strictly monotone increasing function. This is useful because the logarithm turns the product into a sum.
\[ \begin{align} LLH(\beta)&=\log \left(LH(\beta)\right)\\ &=\sum_{i=1}^n \left[ \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right) - \frac{\epsilon_i^2}{2\sigma^2} \right]\\ &=-\left( \frac{n}{2}\right) \log(2\pi)-\left( \frac{n}{2}\right) \log(\sigma^2)-\frac{1}{2\sigma^2} \sum_{i=1}^n(\epsilon_i)^2 \tag{2.7} \end{align} \]
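As a sanity check on equation (2.7), the closed form can be compared against the logarithm of the product computed in the previous sketch, reusing the hypothetical `X`, `y`, `beta_true`, and `sigma` from above.

```python
def log_likelihood(beta, X, y, sigma):
    """LLH(beta) as in equation (2.7)."""
    n = len(y)
    eps = y - X @ beta
    return (-(n / 2) * np.log(2 * np.pi)
            - (n / 2) * np.log(sigma**2)
            - np.sum(eps**2) / (2 * sigma**2))

# Should agree (up to floating-point error) with the log of the product form.
print(np.isclose(log_likelihood(beta_true, X, y, sigma),
                 np.log(likelihood(beta_true, X, y, sigma))))
```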
Since only the last term depends on \(\beta\), and since dropping the positive factor \(\frac{1}{2\sigma^2}\) does not change where the maximum is attained, we obtain:
\[ \arg \max_\beta LLH(\beta) = \arg \max_\beta \left(- \sum_{i=1}^n(\epsilon_i)^2\right) \]
If we substitute for \(\epsilon_i\) and multiply by \(-1\), turning the maximization into a minimization, we see that we are back at the original problem of finding the OLS solution:
\[ \arg \min_\beta \left(-LLH(\beta)\right) = \arg \min_\beta \sum_{i=1}^n(y_i - (X \beta)_i)^2 \]
Notice that this result holds independently of \(\sigma\): it enters the log-likelihood only through an additive constant and the positive factor \(\frac{1}{2\sigma^2}\), both of which drop out of the optimization.
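This equivalence, including the independence from \(\sigma\), can also be checked numerically: minimizing \(-LLH(\beta)\) for any fixed \(\sigma\) should recover the least-squares solution. Below is a sketch continuing the example above, using `scipy.optimize.minimize` as a generic optimizer (an illustrative choice).

```python
from scipy.optimize import minimize

# OLS solution via least squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum-likelihood solution: minimize -LLH(beta) for two different (arbitrary) sigmas.
for s in (0.5, 3.0):
    res = minimize(lambda b: -log_likelihood(b, X, y, s), x0=np.zeros(X.shape[1]))
    print(s, np.allclose(res.x, beta_ols, atol=1e-4))  # same beta regardless of sigma
```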