Sheet 6.1: Anatomy of a single LSTM forward pass#

Author: Michael Franke

In this tutorial we will dissect a single forward pass of an LSTM model defined using PyTorch’s built-in LSTM class. We will not define a whole LSTM class or even train a model, but showcase how raw text data needs to be wrangled so that we can feed it into the 'nn.LSTM' functionality supplied by PyTorch. We will also learn about the output of that function.

Packages & global parameters#

We import torch and the neural network library.

import torch
import torch.nn as nn
# import time
# import math

Preparing the input data#

Our raw input consists of just two sentences.

# starting point: raw text input
input_raw1 = "Colin was here."
input_raw2 = "Colin went away subito."

Our goal is to represent each of these sentences, eventually, as a sequence of integers, such that each integer represents the word at that position in the sequence. In order to achieve this, we need to collect all words (that are relevant) and assign each of them an integer, e.g., by enumerating them in a list.

We also include some special tokens:

  • [UNK] represents any occurrence of a word that the vocabulary does not know (e.g., very rare words, foreign words, unrecognizable stuff, …)

  • [EOS] is the end-of-sequence token

  • [PAD] is a padding token (to make sequences of equal length, see below)

# vocab definition: all units that we want to model
# include special tokens:
# like "[UNK]" for unknown words (here: "subito")
vocabulary = ["Colin", "was", "here", "went", "away", "[UNK]", "[EOS]", "[PAD]"]
n_vocab = len(vocabulary)

Using the elements of the vocabulary, we can represent the raw input sentences as the following two sequences of tokens, and then as sequences of integers. These steps are known as tokenization and indexing, respectively. NB: we make sure that all input sequences are of equal length by adding [PAD] tokens at the end. (It is also possible to process variable-length sequences, but here we will use padding.)

# tokenization: sequence of tokens from the vocabulary
input_tok1 = ["Colin", "was", "here", "[EOS]", "[PAD]"]
input_tok2 = ["Colin", "went", "away", "subito", "[EOS]"]

# indexing: represent input as tensor of indices
input_ind1 = torch.tensor([0,1,2,6,7])
input_ind2 = torch.tensor([0,3,4,5,6])
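
The index tensors above are written out by hand. For illustration, here is a minimal sketch of how they could be derived from the vocabulary instead; the dictionary 'word_to_index' and the helper 'encode' are not part of this sheet's pipeline, just an aside. Unknown words (here: “subito”) are mapped to the index of [UNK].

# map each word onto its position in the vocabulary
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def encode(tokens):
    # words not in the vocabulary are mapped to the index of "[UNK]"
    return torch.tensor([word_to_index.get(t, word_to_index["[UNK]"]) for t in tokens])

# reproduces the hand-written index tensors above
print(encode(input_tok1))  # tensor([0, 1, 2, 6, 7])
print(encode(input_tok2))  # tensor([0, 3, 4, 5, 6])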

Finally, we create a batch of inputs. A batch is, essentially, a list of index sequences.

batch_size = 2
sequence_length = 5

batched_input = torch.cat([input_ind1, input_ind2]).reshape(batch_size, sequence_length)
print(batched_input)
tensor([[0, 1, 2, 6, 7],
        [0, 3, 4, 5, 6]])
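
As an aside, padding by hand is not the only option: PyTorch’s 'torch.nn.utils.rnn.pad_sequence' can batch sequences of different lengths by right-padding them with a chosen value. A minimal sketch (the list 'unpadded' is just an illustration; it contains the two index sequences without the [PAD] token):

from torch.nn.utils.rnn import pad_sequence

unpadded = [torch.tensor([0, 1, 2, 6]),     # "Colin was here [EOS]"
            torch.tensor([0, 3, 4, 5, 6])]  # "Colin went away [UNK] [EOS]"
# right-pad shorter sequences with the index of "[PAD]" (= 7)
padded = pad_sequence(unpadded, batch_first=True, padding_value=7)
print(padded)  # same as batched_input above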

A single forward pass#

A minimal single forward pass of an LSTM model applied to a single input vector (=one sequence of indices) consists of the following steps:

  1. word embedding: each index is mapped onto an embedding vector; so the input vector is mapped onto a matrix of word embeddings;

  2. LSTM encoding: the word embeddings are mapped onto a hidden vector representation

  3. linear projection: the hidden vector representation is mapped (linearly) onto a vector of weights (one weight for each word in the vocabulary)

  4. softmax: the next-word probabilities are obtained by taking a softmax on the weights from the previous step

Word embeddings#

We create embeddings with PyTorch’s 'nn.Embedding' class. On instantiation, the embedding is initialized randomly.

# embedding: vector-representation of each word
# needs parameter for the size of the embedding
n_embedding_size = 3
embedding = nn.Embedding(n_vocab, n_embedding_size)
# embedding is a matrix of size (n_vocab, n_embedding_size)
for p in embedding.parameters():
    print(p)
Parameter containing:
tensor([[-1.6845,  1.0559, -0.1922],
        [-0.0187, -1.0131,  0.3889],
        [ 0.1033,  0.3581, -0.6393],
        [-0.8631, -0.7322,  1.7196],
        [-0.6822, -0.2596, -1.4390],
        [ 1.5069,  1.3545, -0.8460],
        [-1.1044, -0.9878,  0.0891],
        [ 1.3549, -0.1035,  0.7188]], requires_grad=True)

Exercise 6.1.1: Check parameters of the embedding

  1. [Just for yourself] Check the dimension of this output. Where is the embedding for “Colin”? Where is the embedding for “[EOS]”?

The embedding accepts batched input. Check the dimensions of the output:

# the 'embedding' takes batched input
input_embedding = embedding(batched_input)
print(input_embedding)
tensor([[[-1.6845,  1.0559, -0.1922],
         [-0.0187, -1.0131,  0.3889],
         [ 0.1033,  0.3581, -0.6393],
         [-1.1044, -0.9878,  0.0891],
         [ 1.3549, -0.1035,  0.7188]],

        [[-1.6845,  1.0559, -0.1922],
         [-0.8631, -0.7322,  1.7196],
         [-0.6822, -0.2596, -1.4390],
         [ 1.5069,  1.3545, -0.8460],
         [-1.1044, -0.9878,  0.0891]]], grad_fn=<EmbeddingBackward0>)

Exercise 6.1.2: Check batched embedding

  1. What do the dimensions of the output represent here? How would you retrieve the embedding of the word “went” (give your answer in the form 'input_embedding[someIndex, someOtherIndex]')?

The LSTM encoding#

We instantiate an LSTM module via 'nn.LSTM'. NB: we define a stacked LSTM with two layers; we also declare 'batch_first' mode, which expects the batch in the first dimension of the input (which is how we have set up the input).

n_hidden_size = 4

# instantiate an LSTM
lstm = nn.LSTM(input_size  = n_embedding_size,
               hidden_size = n_hidden_size,  # size of hidden state
               num_layers  = 2,              # number of stacked layers
               batch_first = True            # expect input in format (batch, sequence, embedding)
               )
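
Just like the embedding, the LSTM’s parameters can be inspected. The following snippet only prints their names and shapes; for instance, 'weight_ih_l0' has shape (4 * hidden_size, input_size), because the weights of the four gates are stacked into one matrix.

# inspect the LSTM's parameters (names and shapes only)
for name, p in lstm.named_parameters():
    print(name, tuple(p.shape))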

The LSTM module takes a batch of embedded input sequences and outputs three things:

  1. the sequence of hidden states in the final layer (of stacked LSTMs), one hidden state representation for each word in the input

  2. the hidden states of the last word in the input sequence for each layer

  3. the cell states of the last word in the input sequence for each layer

# apply LSTM function (initial hidden state and initial cell state default to all 0s)
# LSTM maps a (batched) sequence of embeddings, onto a triple:
#   outcome: sequence of hidden states in final layer for each word
#   hidden:  hidden states of last word for each layer
#   cell:    cell states of last word for each layer
output, (hidden, cell) = lstm(input = input_embedding)

print("\nLSTM embeddings in last layer for each word:\n", output)
print("\nHidden state of last word for each layer:\n", hidden)
print("\nCell state of last word for each layer:\n", cell)

LSTM embeddings in last layer for each word:
 tensor([[[-0.0875, -0.0315,  0.1433, -0.0494],
         [-0.1209, -0.0712,  0.1808, -0.0598],
         [-0.1449, -0.0917,  0.2229, -0.0865],
         [-0.1400, -0.1031,  0.2199, -0.0794],
         [-0.1547, -0.1138,  0.2284, -0.0885]],

        [[-0.0875, -0.0315,  0.1433, -0.0494],
         [-0.1131, -0.0727,  0.1609, -0.0405],
         [-0.1364, -0.0876,  0.2160, -0.0749],
         [-0.1546, -0.1168,  0.2396, -0.0900],
         [-0.1496, -0.1224,  0.2237, -0.0815]]], grad_fn=<TransposeBackward0>)

Hidden state of last word for each layer:
 tensor([[[ 0.0973,  0.1651, -0.0416,  0.0965],
         [-0.1636,  0.1286, -0.2748,  0.1047]],

        [[-0.1547, -0.1138,  0.2284, -0.0885],
         [-0.1496, -0.1224,  0.2237, -0.0815]]], grad_fn=<StackBackward0>)

Cell state of last word for each layer:
 tensor([[[ 0.1286,  0.3028, -0.0894,  0.2619],
         [-0.3512,  0.2367, -0.4396,  0.2293]],

        [[-0.4054, -0.2071,  0.5762, -0.2517],
         [-0.4338, -0.2238,  0.5907, -0.2343]]], grad_fn=<StackBackward0>)
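
A quick check of the shapes helps with reading these tensors: with 'batch_first=True' the output has shape (batch, sequence, hidden), whereas the hidden and cell states have shape (layers, batch, hidden) regardless of the 'batch_first' setting. The tensors 'h0' and 'c0' below are just an illustration of how initial states could be passed explicitly; since they are all zeros, the result is the same as with the defaults.

print(output.shape)  # torch.Size([2, 5, 4]) = (batch, sequence, hidden)
print(hidden.shape)  # torch.Size([2, 2, 4]) = (layers, batch, hidden)
print(cell.shape)    # torch.Size([2, 2, 4]) = (layers, batch, hidden)

# explicit (all-zero) initial states give the same result as the defaults above
h0 = torch.zeros(2, batch_size, n_hidden_size)  # (layers, batch, hidden)
c0 = torch.zeros(2, batch_size, n_hidden_size)
output2, (hidden2, cell2) = lstm(input_embedding, (h0, c0))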

Exercise 6.1.3: Dive into structure of output representations

  1. There are two (important) vectors that are represented twice in this print-out. Which vectors are these (i.e., what do they represent)? How would you access these vectors in Python (give your answer in the form: vector A is in 'cell[1,2,3]' and 'cell[23,224,-1]')?

Linear mapping onto weights (non-normalized predictions)#

The output representation for the final word in a sequence (the LSTM embedding associated with that word, after having processed the preceding words) will be used to make predictions for the next word (in a left-to-right language model). We therefore map the LSTM embeddings onto vectors of the same length as the vocabulary.

linear_map = nn.Linear(n_hidden_size, n_vocab)
weights = linear_map(output)
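
Note that 'linear_map' is applied to the full 'output' tensor, so 'weights' contains one weight vector per position and has shape (batch, sequence, vocabulary). If we only cared about the prediction after the final word of each sequence, the same map could be applied to just the last position; 'last_word_weights' below is only an illustration of this.

print(weights.shape)  # torch.Size([2, 5, 8]) = (batch, sequence, vocabulary)

# weights for predicting only the word following the last position of each sequence
last_word_weights = linear_map(output[:, -1, :])  # shape (batch, vocabulary)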

Next-word probabilities from softmax-ing weights#

Finally, we map the non-normalized weights onto next-word probabilities with a softmax transformation:

# get probabilities from output weights
def next_word_probabilities(weights):
    softmax = nn.Softmax(dim=2)  # normalize over the vocabulary dimension
    return softmax(weights).detach().numpy().round(4)

# without any training: next word weights have high entropy
print(next_word_probabilities(weights))
[[[0.1108 0.1058 0.162  0.1153 0.1603 0.1126 0.1608 0.0724]
  [0.1132 0.1084 0.1613 0.1145 0.1644 0.1115 0.1561 0.0707]
  [0.115  0.1098 0.1609 0.1127 0.1683 0.1122 0.1521 0.069 ]
  [0.115  0.1103 0.1603 0.1135 0.1685 0.1114 0.1521 0.069 ]
  [0.1159 0.1108 0.1601 0.1128 0.1699 0.1112 0.1507 0.0685]]

 [[0.1108 0.1058 0.162  0.1153 0.1603 0.1126 0.1608 0.0724]
  [0.1126 0.1083 0.1611 0.1159 0.1628 0.1103 0.1575 0.0714]
  [0.1144 0.1096 0.161  0.1135 0.1672 0.1119 0.153  0.0693]
  [0.116  0.1112 0.16   0.1128 0.1705 0.1114 0.15   0.0682]
  [0.1158 0.1111 0.1596 0.1136 0.1698 0.1105 0.1509 0.0686]]]
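
To wrap up, the whole forward pass simply chains the four steps from above. The function 'forward_pass' below is not part of the sheet’s pipeline, just a compact sketch using the objects we have already created; the final check confirms that the next-word probabilities at each position sum to (approximately) 1.

# chain the four steps: indices -> embeddings -> LSTM states -> weights -> probabilities
def forward_pass(batched_indices):
    emb = embedding(batched_indices)   # (batch, sequence, embedding)
    out, (h, c) = lstm(emb)            # (batch, sequence, hidden)
    w = linear_map(out)                # (batch, sequence, vocabulary)
    return next_word_probabilities(w)  # (batch, sequence, vocabulary)

probs = forward_pass(batched_input)
print(probs.sum(axis=2))  # each row sums to (approximately) 1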

Exercise 6.1.4: Understand output format

  1. Which indices give you the next-word probabilities for the sequence “Colin was”?

  2. What’s the most likely word, according to this output, after the sequence “Colin”?

  3. What’s the most likely word, according to this output, for the first word in a sentence, i.e., after the empty sequence “”? [Hint: if you think that this is a strange question, or that something stands in the way of answering it directly and smoothly, don’t answer it directly and smoothly, but explain why the question is odd or what needs to change in order to be able to answer the question.]