Xavier Initialization
: All layers of the transformer are initialized with Xavier uniform (see the training-setup sketch below).
Gradient Clipping
: Gradients are clipped to avoid the exploding-gradient problem.
SGD optimizer with scheduler
: Taken from the official PyTorch implementation of transformers.
Adam optimizer with scheduler
: As described in the transformer paper.
Beam Search with length normalization
: Beam search with length normalization to avoid neural text degeneration.
Nucleus Sampling to avoid neural text degeneration
: Nucleus sampling works better than beam search (see the sampling sketch below).
Optimal No. of Heads
: Number of heads chosen based on the paper "Are Sixteen Heads Really Better than One?"
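The first few items (Xavier initialization, gradient clipping, and an Adam optimizer with the warmup schedule from the paper) could be wired up roughly as in the sketch below. This is not the repository's training code; the model, `warmup` value, and clipping norm are illustrative placeholders.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8)          # stand-in for the project's model

# Xavier-uniform initialization of every weight matrix (1-D params such as biases are skipped).
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Adam with the warmup-then-decay learning-rate schedule from "Attention Is All You Need".
d_model, warmup = 512, 4000
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: d_model ** -0.5 * min(max(step, 1) ** -0.5, max(step, 1) * warmup ** -1.5),
)

# Inside the training loop, clip gradients before stepping to avoid exploding gradients:
#   loss.backward()
#   nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Nucleus (top-p) sampling over a single step's output logits could look like the following; `nucleus_sample` is a hypothetical helper, not a function from this repository.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest set of tokens whose cumulative probability exceeds p."""
    probs = torch.softmax(logits, dim=-1)                      # logits: (vocab_size,)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < p                       # tokens inside the nucleus
    top = sorted_probs[keep] / sorted_probs[keep].sum()        # renormalize the nucleus
    return int(sorted_idx[keep][torch.multinomial(top, 1)].item())
```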
PyTorch Implementation of Transformers Explained with Comments
The Transformer is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. These models are superior in quality while being more parallelizable and requiring significantly less time to train. In this document we describe the transformer model completely, build our own transformer model in PyTorch, and test it on the Cornell Movie Dialogs Corpus to show some interesting results.
The whole input sequence is fed into the transformer at once, whereas sequential models like RNNs process it one token at a time.
As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding of this word.
For example, there is a high correlation between 'man' and 'battle' and between 'man' and 'struggle', which is captured by self-attention.
This gives the model the advantage of focusing on different words in h different ways (h is the number of heads). It broadens the model’s capability to focus on different positions and gives the attention layer multiple different representations.
In one head 'heroes' is attending to 'powers' and 'graced'
In another head 'heroes' is attending to 'path' and 'choose'
The full model architecture of the transformer. (Image source: Fig 1 & 2 in Vaswani, et al., 2017.)
First we encode every word into an embedding vector (e.g. GloVe embeddings). Since the transformer accepts whole sentences, we define a Max Length, which is the number of word embeddings passed per sentence. Finally, we process the input in batches, so a tensor of size Embedding Dimension * Max Length * Batch Size is processed.
The input to the transformer is of size embedding dimension times Max Length, and we feed batches of those.
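A minimal sketch of this input pipeline, with illustrative sizes (the real values live in config.py); note that PyTorch's batch-first layout orders the same three dimensions as Batch Size × Max Length × Embedding Dimension:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len, batch_size = 10_000, 512, 40, 32   # illustrative sizes

embedding = nn.Embedding(vocab_size, d_model)    # could also be loaded from pretrained GloVe vectors
tokens = torch.randint(0, vocab_size, (batch_size, max_len))     # a batch of padded token ids
x = embedding(tokens)                            # shape: (batch_size, max_len, d_model)
```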
This technique is used because there is no notion of word order (1st word, 2nd word, ...) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position (unlike common RNN or ConvNet architectures); thus, the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words.
A real example of positional encoding with a toy embedding size of 4 (The Illustrated Transformer by Jay Alammar)
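A sketch of the sinusoidal positional encoding from the paper, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name is illustrative:

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Returns the (max_len, d_model) position-dependent signal added to the word embeddings."""
    position = torch.arange(max_len).unsqueeze(1)                                   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# x = embedding(tokens) + positional_encoding(max_len, d_model)   # broadcasts over the batch
```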
The General Framework of Attention is given by

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_h)) V

where Q is the Query vector, K is the Key vector and V is the Value vector. Here d_h = embedding size / h, and h is the number of attention heads.

In the case of Multi-Head attention we have, for each head i:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Finally, all the attention heads are concatenated and passed through a linear layer of the same size as the input, so that the dimensions do not change. We computed h different attention heads; concatenation alone is not enough to transfer information between heads, so the concatenated heads are passed through the linear layer.
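These two formulas translate fairly directly into code. The sketch below is a generic implementation (not necessarily identical to model.py): the per-head projections W_i^Q, W_i^K, W_i^V are fused into single linear layers, and w_o is the final linear layer applied to the concatenated heads.

```python
import math
import torch
import torch.nn as nn

def attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_h)) V."""
    d_h = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_h)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed positions
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_h = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # linear layer over the concatenated heads

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project, then split d_model into h heads of size d_h: (b, h, seq_len, d_h)
        q, k, v = (w(x).view(b, -1, self.h, self.d_h).transpose(1, 2)
                   for w, x in ((self.w_q, q), (self.w_k, k), (self.w_v, v)))
        out = attention(q, k, v, mask)                                          # (b, h, seq_len, d_h)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_h)   # concatenate heads
        return self.w_o(out)
```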
We are learning what is left over (the residual) without learning a new representation; you learn only the 'remaining' part. If the block doesn't learn anything, then F(x) is 0, and that is what makes training go much faster, since learning a completely new representation is omitted. Therefore, the model can default to the identity function if the layer is not beneficial.
Either learn something useful, or don’t learn anything!
We have performed a lot of operations, which may cause the values of the layer outputs to grow. To prevent the outputs from blowing up, we use Layer Norm to normalize them back again.
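The residual connection and layer normalization are usually packaged together as the 'Add & Norm' step around each sublayer. A minimal post-norm sketch, with an illustrative class name:

```python
import torch.nn as nn

class ResidualNorm(nn.Module):
    """LayerNorm(x + Sublayer(x)), as in the original transformer."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # If the sublayer learns nothing (F(x) = 0), the block reduces to the identity of x.
        return self.norm(x + self.dropout(sublayer(x)))
```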
For self-attention in the decoder, we don't want the model to attend to future words; otherwise, it will cheat and learn to look ahead. At testing time we don't have future words: we predict one word at a time, running the decoder for a number of timesteps, just like an LSTM at test time. Attending to future positions during training would therefore be incompatible with testing (inference). So the decoder is only allowed to attend to earlier positions; during testing it can only attend to what has been generated so far, and we need to reproduce this test-time scenario during training as well.
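The standard way to enforce this is a lower-triangular ("subsequent") mask applied to the decoder's self-attention scores; a small sketch, compatible with the mask argument of the attention function above:

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    """Position i may only attend to positions <= i (True = allowed)."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# subsequent_mask(4)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```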
This project requires Python and the following Python libraries installed:
If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.
dataset.py
: Reads and tokenizes the dataset (Cornell Movie Dialogs Corpus).
model.py
: Generic implementation of PyTorch transformers.
train.py
: Training Loop
config.py
: Configuration of the model
chat.py
: Loads the model and allows interactive chatting in the terminal.