Sheet 2.2: ML-estimation#
Author: Michael Franke
This tutorial is meant to introduce some basics of PyTorch by looking at a simple case study: how to find the best-fitting parameter for the mean of a normal (Gaussian) distribution. The training data is a set of samples from a “true” distribution. The loss function is the negative log-likelihood that a candidate value for the “true mean” assigns to the training data. By using stochastic gradient descent to minimize this loss, we seek the parameter value that maximizes the likelihood of the training data. This is, therefore, a maximum-likelihood estimation.
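In symbols: if the training data are $x_1, \dots, x_n$ and $\mu$ is a candidate value for the mean, the loss to be minimized is

$$\mathcal{L}(\mu) = - \sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, 1)$$

Since the logarithm is strictly increasing, minimizing this quantity is the same as maximizing the likelihood of the data.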
Packages#
We will need to import the `torch` package for the main functionality. We will also use `seaborn` for plotting and `matplotlib` for showing the plots. Finally, we use the `warnings` package to suppress all warning messages in the notebook.
import torch
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
True distribution & training data#
The “true distribution” that generates the data is a normal distribution with a mean (location) stored in the variable `trueLocation`. (We keep the scale parameter (standard deviation) fixed at a known value of 1.) The `torch.distributions` package contains ready-made probability distributions for sampling. So, here we define the true distribution, and take `nObs` samples from it, the set of which we call “training data”.
nObs = 10000
trueLocation = 0 # mean of a normal
trueDist = torch.distributions.Normal(loc=trueLocation, scale=1.0)
trainData = trueDist.sample([nObs])
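Since `sample()` draws fresh random numbers, the exact values shown below will differ from run to run. If you want reproducible results, you can fix PyTorch’s random seed before sampling (the seed value below is an arbitrary choice):

torch.manual_seed(2024)  # optional: makes sampling reproducible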
The mean of the training data is the so-called empirical mean. The empirical mean need not be identical to the true mean!
empirical_mean = torch.mean(trainData)
print('Empirical mean (mean of training data): %.5f' % empirical_mean.item())
Empirical mean (mean of training data): 0.00001
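As an aside: for a normal distribution with known scale, the maximum-likelihood estimate of the mean is exactly the empirical mean. Setting the derivative of the negative log-likelihood to zero gives

$$\frac{d}{d\mu} \left( - \sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, 1) \right) = \sum_{i=1}^{n} (\mu - x_i) = 0 \quad \Longrightarrow \quad \mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

This is why the training loop below reports the distance between the current estimate and the empirical mean (column “diff. target”), not the distance to the true mean.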
Here is a density plot of the training data:
sns.kdeplot(trainData)
plt.show()
Optimizing a parameter: gradients, optimizers, loss & backprop#
We want an estimate of the true mean of the training data. For that, we define a “trainable” parameter in PyTorch, which we set to some initial value. Subsequently, we will update the value of this parameter in a series of training steps, so that it becomes “better” over time. Being “good” means having a small “loss”. The loss function we are interested in is the negative log-likelihood of the training data.
To begin with, we define the parameter which is to be trained. Since we want to “massage it” through updating, we must tell PyTorch to compute gradients for this parameter (and for all values derived from it). (NB: For numerical stability we require this parameter to be a 64-bit float. You may try out the exercises below with the default float32 format and compare.)
location = torch.tensor(1.0, requires_grad=True, dtype=torch.float64)
print(location)
tensor(1., dtype=torch.float64, requires_grad=True)
To prepare for training, we first instantiate an optimizer, which will do the updating behind the scenes. Here, we choose the stochastic gradient descent (SGD) optimizer. To instantiate it, we need to tell it two things:
- which parameters to optimize;
- how aggressively to update (=> the so-called learning rate).
learningRate = 0.0000001
opt = torch.optim.SGD([location], lr=learningRate)
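For a single parameter, SGD’s update rule is a small step against the gradient of the loss:

$$\mu_{t+1} = \mu_t - \eta \cdot \frac{\partial \mathcal{L}}{\partial \mu} \Big|_{\mu_t}$$

where $\eta$ is the learning rate. The very small learning rate chosen here compensates for the fact that our loss (defined below) sums over all 10,000 observations, which makes its gradients correspondingly large.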
Let us now go manually through a single training step. A training step consists of the following parts:
1. Compute the predictions for the current parameter(s): what do we predict in the current state?
2. Compute the loss for this prediction: how good is this prediction (for the training data)?
3. Backpropagate the error (using the gradients): in which direction would we need to change the relevant parameters to make the prediction better?
4. Update step: change the parameters (by an amount scaled by the learning rate) in the direction that should make them better.
5. Zero the gradient: reset the information about “which direction to tune” for the next training step.
Part 1: Compute the predictions for current parameter value#
The prediction for the current parameter value is a Gaussian with the location parameter set to our current parameter value. We obtain our “current best model” by instantiating a distribution like so:
prediction = torch.distributions.Normal(loc=location, scale=1.0)
Part 2: Computing the loss for the current prediction#
How good is our current model? Goodness can be measured in many ways. Here we consider the likelihood: how likely is the training data under the current model? Since the optimizer minimizes the loss, we take the negative log-likelihood of the training data as our loss, so that minimizing the loss means maximizing the likelihood:
loss = -torch.sum(prediction.log_prob(trainData))
print(loss)
tensor(18991.4609, dtype=torch.float64, grad_fn=<NegBackward0>)
Notice that the `loss` variable is a single-number (scalar) tensor, containing the information of how bad the current parameter value is (we want to minimize it). Notice also that PyTorch has added information on how to compute gradients, i.e., it keeps track of the way in which values of the variable `location` influence the value of the variable `loss`.
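A side remark: because the loss sums over all observations, its gradient grows with `nObs`, which is why such a tiny learning rate is needed. A common alternative (just a sketch, not what we use below) is to average instead of sum; this has the same minimizer but keeps gradients roughly independent of the size of the data set:

loss = -torch.mean(prediction.log_prob(trainData))  # average instead of sum; allows a larger learning rate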
Part 3: Backpropagate the error signal#
In the next step, we will use the information stored about the functional relation between `location` and `loss` to infer how the `location` parameter would need to be changed to make `loss` higher or lower. This is the so-called backpropagation step.
Concretely, at the outset, the gradient information for `location` is `None`.
print(f"Value (initial) = { location.item()}")
print(f"Gradient information (initial) = { location.grad}")
Value (initial) = 1.0
Gradient information (initial) = None
We must actively tell the system to backpropagate, i.e., to compute and store the gradients, like so:
loss.backward()
print(f"Value (after backprop) = { location.item()}")
print(f"Gradient information (after backprop) = { location.grad}")
Value (after backprop) = 1.0
Gradient information (after backprop) = 9800.26171875
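We can sanity-check this number with the derivation above: the gradient of the summed negative log-likelihood with respect to the mean is `nObs * (location - empirical mean)`. In your run, the following cross-check should therefore agree with `location.grad`:

print(f"Analytic gradient = {nObs * (location.item() - empirical_mean.item())}")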
Part 4: Update the parameter values#
Next, we use the information in the gradient to actually update the trainable parameter values. This is what the optimizer does. It knows which parameters to update (we told it when instantiating it), so the update is performed by a method of the optimizer object itself:
opt.step()
print(f"Value (after step) = { location.item()}")
print(f"Gradient information (after step) = { location.grad}")
Value (after step) = 0.999019980430603
Gradient information (after step) = 9800.26171875
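The new value is just the old value minus the learning rate times the gradient. A quick cross-check by hand, using the values printed in this run (small discrepancies can arise from rounding in the printed numbers):

old_value, grad_value = 1.0, 9800.26171875  # values reported above
print(f"Manual update = {old_value - learningRate * grad_value}")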
Part 5: Reset the gradient information#
If we want to repeat the updating process, we need to erase the gradient information from the last prediction; otherwise, gradients would just accumulate across training steps. This zero-ing of the gradients is again something we do globally (for all trainable parameters) through the optimizer object:
opt.zero_grad()
print(f"Value (after zero-ing) = { location.item()}")
print(f"Gradient information (after zero-ing) = { location.grad}")
Value (after zero-ing) = 0.999019980430603
Gradient information (after zero-ing) = 0.0
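A note on this output: depending on your PyTorch version, `opt.zero_grad()` may reset the gradient to `None` rather than to `0.0` (the default changed to `set_to_none=True` in PyTorch 2.0). Both variants work for our purposes; to reproduce the `0.0` shown above, you can call:

opt.zero_grad(set_to_none=False)  # keep a zero-valued gradient tensor instead of resetting to None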
Training loop#
After having gone through our cycle of parameter updating step-by-step, let’s iterate this in a training loop of `nTrainingSteps` iterations.
nTrainingSteps = 10000
print('\n%5s %24s %15s %15s' %
      ("step", "loss", "estimate", "diff. target"))
for i in range(nTrainingSteps):
    prediction = torch.distributions.Normal(loc=location, scale=1.0)
    loss = -torch.sum(prediction.log_prob(trainData))
    loss.backward()
    if (i + 1) % 500 == 0:
        print('%5d %24.3f %15.5f %15.5f' %
              (i + 1, loss.item(), location.item(),
               abs(location.item() - empirical_mean)))
    opt.step()
    opt.zero_grad()
step loss estimate diff. target
500 16027.112 0.60699 0.60698
1000 14862.321 0.36807 0.36806
1500 14434.033 0.22319 0.22318
2000 14276.555 0.13534 0.13533
2500 14218.649 0.08207 0.08206
3000 14197.358 0.04977 0.04976
3500 14189.528 0.03018 0.03017
4000 14186.650 0.01830 0.01830
4500 14185.593 0.01110 0.01110
5000 14185.203 0.00673 0.00673
5500 14185.061 0.00409 0.00408
6000 14185.007 0.00248 0.00247
6500 14184.988 0.00151 0.00150
7000 14184.981 0.00092 0.00091
7500 14184.979 0.00056 0.00055
8000 14184.978 0.00034 0.00033
8500 14184.977 0.00021 0.00020
9000 14184.977 0.00013 0.00012
9500 14184.978 0.00008 0.00007
10000 14184.977 0.00005 0.00005
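A terminological note: although the optimizer is called “stochastic gradient descent”, our loop computes the gradient on the full training set in every step, so it is really plain (full-batch) gradient descent. A genuinely stochastic variant updates on random mini-batches. Here is a minimal sketch, assuming a hypothetical batch size of 100 (the learning rate would need re-tuning, since gradient magnitudes then scale with the batch size, not with `nObs`):

batchSize = 100
for i in range(nTrainingSteps):
    idx = torch.randint(0, nObs, (batchSize,))               # draw a random mini-batch of indices
    prediction = torch.distributions.Normal(loc=location, scale=1.0)
    loss = -torch.sum(prediction.log_prob(trainData[idx]))   # loss on the mini-batch only
    loss.backward()
    opt.step()
    opt.zero_grad()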
Exercise 2.2.1: Explore the optimization process
This exercise is intended to make you play around with the parameters of the training procedure, namely `learningRate` and `nTrainingSteps`, and to develop a feeling for what they do. There is not necessarily a single “true” solution. Report the values that you found to work best for each of the following cases:
1. Change the initial value of the parameter `location` to -5000.
2. Revert to initial conditions. Change the true mean (parameter `trueLocation`) to 5000.
3. Revert to initial conditions. Use only 100 samples for the training set (using variable `nObs`).