Sheet 2.2: ML-estimation#

Author: Michael Franke

This tutorial is meant to introduce some basics of PyTorch by looking at a simple case study: how to find the best-fitting parameter for the mean of a normal (Gaussian) distribution. The training data is a set of samples from a “true” distribution. The loss function is the negative likelihood that a candidate parameter value for the “true mean” assigns to the training data. By using stochastic gradient descent to minimize the loss, we seek the parameter value that maximizes the likelihood of the training data. This is, therefore, a maximum likelihood estimation.

Packages#

We will need to import the `torch` package for the main functionality. We also will use `seaborn` for plotting, and `matplotlib` for showing the plots. Finally, we use the `warnings` package to suppress all warning messages in the notebook.

import torch
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

True distribution & training data#

The “true distribution” that generates the data is a normal distribution with a mean (location) stored in the variable `trueLocation`. (We keep the scale parameter (standard deviation) fixed at a known value of 1.) The `torch.distributions` package contains ready-made probability distributions for sampling. So, here we define the true distribution, and take `nObs` samples from it, the set of which we call “training data”.

nObs           = 10000
trueLocation   = 0 # mean of a normal
trueDist       = torch.distributions.Normal(loc=trueLocation, scale=1.0)
trainData      = trueDist.sample([nObs])

The mean of the training data is the so-called empirical mean. The empirical mean need not be identical to the true mean!

empirical_mean = torch.mean(trainData)
print('Empirical mean (mean of training data): %.5f' % empirical_mean.item())
Empirical mean (mean of training data): 0.00001

Here is a density plot of the training data:

sns.kdeplot(trainData)
plt.show()
../_images/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png

Optimizing a parameter: gradients, optimizers, loss & backprop#

We want an estimate of the true mean of the training data. For that, we define a “trainable” parameter in PyTorch, which we set to some initial value. Subsequently, we will update the value of this parameter in a series of training steps, so that it will become “better” over time. Being “good” means having a small “loss”. The loss function we are interested in is the likelihood of the training data.

To being with, we define the parameter which is to be trained. Since we want to “massage it” through updating, we must tell PyTorch that it should compute the gradient for this parameter (and the ones derived from it). (NB: For numerical stability we require this parameter to be a 64 bit float. You may try out the exercises below with the default float32 format and compare.)

location = torch.tensor(1.0, requires_grad=True, dtype=torch.float64)
print( location)
tensor(1., requires_grad=True)

To prepare for training, we first instantiate an optimizer, which will do the updating behind the scenes. Here, we choose the stochastic gradient descent (SGD) optimizer. To instantiate it, we need to tell it two things:

  1. which parameters to optimize;

  2. how aggressively to update (=> the so-called learning rate)

learningRate = 0.0000001
opt          = torch.optim.SGD([location], lr=learningRate)

Let us now go manually through a single training step. A training step consists of the following parts:

  1. compute the predictions for the current parameter(s)

    • what do we predict in the current state?

  2. compute the loss for this prediction

    • how good is this prediction (for the training data)?

  3. backpropagate the error (using the gradients)

    • in which direction would we need to change the relevant parameters to make the prediction better?

  4. update step

    • change the parameters (to a certain degree, the so-called learning rate) in the direction that should make them better

  5. zero the gradient

    • reset the information about “which direction to tune” for the next training step

Part 1: Compute the predictions for current parameter value#

The prediction for the current parameter value is a Gaussian with the location parameter set to our current parameter value. We obtain our “current best model” by instantiating a distribution like so:

prediction = torch.distributions.Normal(loc=location, scale=1.0)

Part 2: Computing the loss for the current prediction#

How good is our current model? Goodness can be measured in many ways. Here we consider the likelihood: how likely is the training data under the current model?

loss     = -torch.sum(prediction.log_prob(trainData))
print(loss)
tensor([-1.6429, -2.2677, -0.9448,  ..., -1.3587, -0.9517, -0.9194],
       grad_fn=<SubBackward0>)

Notice that the `loss` variable is a single-numbered tensor (containing the information how bad (we want to minimize it) the current parameter value is). Notice that PyTorch has also added information on how to compute gradients, i.e., it keeps track of way in which values for the variable `location` influence the values for the variable `loss`.

Part 3: Backpropagate the error signal#

In the next step, we will use the information stored about the functional relation between `location` and `loss` to infer how the `location` parameter would need to be changed to make `loss` higher or lower. This is the so-called backpropagation step.

Concretely, at the outset, the gradient information for `location` is “NONE”.

print(f"Value (initial)                = { location.item()}")
print(f"Gradient information (initial) = { location.grad}")
Value (initial)                = 1.0
Gradient information (initial) = None

We must actively tell the system to backpropagate the information in the gradients, like so:

loss.backward()
print(f"Value (after backprop)                = { location.item()}")
print(f"Gradient information (after backprop) = { location.grad}")
Value (after backprop)                = 1.0
Gradient information (after backprop) = 9800.26171875

Part 4: Update the parameter values#

Next, we use the information in the gradient to actually update the trainable parameter values. This is what the optimizer does. It knows which parameters to update (we told it), so the relevant update function is one associated with the optimizer itself.

opt.step()
print(f"Value (after step)                = { location.item()}")
print(f"Gradient information (after step) = { location.grad}")
Value (after step)                = 0.999019980430603
Gradient information (after step) = 9800.26171875

Part 5: Reset the gradient information#

If we want to repeat the updating process, we need to erase information about gradients for the last prediction. This is because otherwise information would just accumulate in the gradients. This zero-ing of the gradients is again something we do holistically (for all parameters to train) through the optimizer object:

opt.zero_grad()
print(f"Value (after zero-ing)                = { location.item()}")
print(f"Gradient information (after zero-ing) = { location.grad}")
Value (after zero-ing)                = 0.999019980430603
Gradient information (after zero-ing) = 0.0

Training loop#

After having gone through our cycle of parameter updating step-by-step, let’s iterate this in a training loop consisting of `nTrainingSteps`.

nTrainingSteps= 10000
print('\n%5s %24s %15s %15s' %
      ("step", "loss", "estimate", "diff. target") )
for i in range(nTrainingSteps):
    prediction = torch.distributions.Normal(loc=location, scale=1.0)
    loss       = -torch.sum(prediction.log_prob(trainData))
    loss.backward()
    if (i+1) % 500 == 0:
        print('%5d %24.3f %15.5f %15.5f' %
              (i + 1, loss.item(), location.item(),
               abs(location.item() - empirical_mean) ) )
    opt.step()
    opt.zero_grad()
#+begin_example

 step                     loss        estimate    diff. target
  500                16027.112         0.60699         0.60698
 1000                14862.321         0.36807         0.36806
 1500                14434.033         0.22319         0.22318
 2000                14276.555         0.13534         0.13533
 2500                14218.649         0.08207         0.08206
 3000                14197.358         0.04977         0.04976
 3500                14189.528         0.03018         0.03017
 4000                14186.650         0.01830         0.01830
 4500                14185.593         0.01110         0.01110
 5000                14185.203         0.00673         0.00673
 5500                14185.061         0.00409         0.00408
 6000                14185.007         0.00248         0.00247
 6500                14184.988         0.00151         0.00150
 7000                14184.981         0.00092         0.00091
 7500                14184.979         0.00056         0.00055
 8000                14184.978         0.00034         0.00033
 8500                14184.977         0.00021         0.00020
 9000                14184.977         0.00013         0.00012
 9500                14184.978         0.00008         0.00007
10000                14184.977         0.00005         0.00005
#+end_example

Exercise 2.2.1: Explore the optimization process

This exercise is intended to make you play around with the parameters of the training procedure, namely `learningRate` and `nTrainingSteps`, and to develop a feeling for what they do. There is not necessarily a single “true” solution. Report the values that you found to work best for each of the following cases:

  1. Change the initial value of the parameter `location` to -5000.

  2. Revert to initial conditions. Change the true mean (parameter `trueLocation`) to 5000.

  3. Revert to initial conditions. Use only 100 samples for the training set (using variable `nObs`).