Large Language Models are unique in that they can respond to a user's query in a natural manner. This enables more natural forms of conversation and makes them invaluable for any task that requires parsing and generating text.
LLMs are powerful because of the self-attention mechanism that they've learnt how to apply. There are two forms of self-attention that are widely used.
- Causal Self-Attention: This is when a model weights the importance of the current token and every previous token with respect to itself. It cannot see tokens that come after it.
- Bidirectional Self-Attention: This is when a model is trained to utilise the full context of the tokens. It can see every single token in the context window.
GPT mainly uses Causal Self-Attention while something like BERT uses Bidirectional Self-Attention. This is because GPT is a decoder-based model while BERT is an encoder-based model.
LLMs are trained in a few stages. These can be described as
- Base Foundation Model training: We train the large language model on a large corpus of unlabelled text. Its sole job is to complete the text chunk by predicting the next token.
- Fine-Tuning: Once we've obtained a pre-trained base model, we can then fine-tune it to learn how to follow instructions or to perform classification tasks.
When training LLMs, we are able to utilise [[Self Supervised Learning]] because our models are trained to predict the next token. This means that we have natural labels generated from the data itself.
Interestingly, this gives rise to [[Emergent Behaviours]], which are capabilities that the model wasn't explicitly trained for. This is because of the diverse content that the model is exposed to.
When working with multi-modal data, we need to convert the data into a vector representation. This is done through an [[encoder]] which helps us to convert the data into a representation that our LLM can work with.
An early example of a successful encoder was [[Word2Vec]], which trained neural network architectures to generate embeddings by predicting a target word from its surrounding context words (or the context words from a target word). When we train our LLMs, they learn to generate these representations themselves during training.
This makes it more efficient to utilise an LLM's own native embeddings where possible, because they are task-specific compared to a more general implementation like Word2Vec.
When dealing with text, we need to tokenise it into known character n-grams that the model learns about during the training stage. This means that our model learns a mapping of token → embedding in its initial embedding layer, which helps it to make good predictions.
It's important here to emphasise that our model never sees raw text. It only sees the token IDs, or the raw embeddings that the encoder produces. Therefore, it's important to have a large and diverse training set so that our model builds a large vocabulary and learns a useful representation.
There are two ways that we've tried to deal with this problem of unknown tokens:
- We can have a `[UNK]` token which is used to represent unknown tokens.
- We can use an algorithm like BPE that recursively breaks down tokens into subwords and single characters. This means that for relatively rare words or non-English characters, we might end up using a lot of tokens.
In general, there is a set of special tokens that we can use to indicate to our model that it should treat them differently:
- `[BOS]`: This marks the start of a text chunk and signifies to the LLM where a piece of content begins.
- `[EOS]`: This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts.
- `[PAD]`: This token is useful when we're training LLMs with batch sizes larger than one.
Some other models might even include special tokens to be able to implement special cases like Function Calling.
Popular LLMs like GPT-2 and GPT-3 were trained using an algorithm called Byte Pair Encoding (BPE). This is most popularly implemented in a library called `tiktoken`.
On a high level, BPE works by building a vocabulary through iteratively merging frequent characters into subwords and frequent subwords into words. This helps it to build a rich vocabulary that can represent any text without unknown tokens, while common words and subwords are represented more efficiently.
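As a rough illustration, here is how the `tiktoken` GPT-2 encoding splits text. The example strings are arbitrary, and the exact token IDs depend on the learnt vocabulary:

```python
import tiktoken

# load the BPE tokeniser that GPT-2 was trained with
tokenizer = tiktoken.get_encoding("gpt2")

ids = tokenizer.encode("an unbelievably rare neologism")
print(ids)                                   # exact IDs depend on the learnt vocabulary
print([tokenizer.decode([i]) for i in ids])  # inspect how rare words split into subwords

print(tokenizer.encode("the"))               # a common word maps to a single token
```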
Since we are doing next-token prediction, our data gives us the labels for free. But since our models need to operate on a fixed number of tokens, how do we construct the training dataset?
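One common way to do this (a sketch, assuming a fixed context window) is to slide a window of `max_length` tokens over the token stream: each window becomes an input, and the same window shifted one token to the right becomes its target.

```python
import torch

def create_input_target_pairs(token_ids, max_length=4, stride=4):
    """Slide a fixed-size window over the token stream.

    Each window is an input; the target is the same window shifted
    one position to the right, so every position predicts its next token.
    """
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i : i + max_length])
        targets.append(token_ids[i + 1 : i + max_length + 1])
    return torch.tensor(inputs), torch.tensor(targets)

# toy token stream of IDs 0..19
x, y = create_input_target_pairs(list(range(20)))
print(x[0], y[0])   # tensor([0, 1, 2, 3]) tensor([1, 2, 3, 4])
```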
There are two forms of embeddings that we'd like our model to learn:
- Token Embeddings: These are the embeddings for the specific tokens that we care about.
- Positional Embeddings: These are the embeddings that represent the information for a specific position. While these can be learnt or calculated in advance, ultimately they are just used to help the model understand the concept of position.
The size of these embeddings will depend on the specific model that we're looking at.
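As a rough sketch, the two embedding tables can be combined like this; the vocabulary size, context length and embedding dimension below are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, context_length, emb_dim = 50257, 1024, 768   # illustrative sizes

token_emb = nn.Embedding(vocab_size, emb_dim)     # one learnable vector per token ID
pos_emb = nn.Embedding(context_length, emb_dim)   # one learnable vector per position

token_ids = torch.tensor([[15496, 11, 995]])      # a batch of 1 sequence with 3 arbitrary token IDs
positions = torch.arange(token_ids.shape[1])

# the model's input is the sum of token and positional embeddings
x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)   # torch.Size([1, 3, 768])
```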
Attention is an architectural choice. It was first implemented in RNNs but ultimately found its biggest use in the Transformer architecture, where it was used in an encoder-decoder framework.
There are three main variants that we see in popular usage
- Self-Attention
- Causal Attention
- Multi-Head Attention
Attention was originally created to deal with the complexity of machine translation tasks.
This task gave rise to the first [[Encoder Decoder]] models, which would first encode a source sentence into a vector representation of hidden states before passing it to a decoder that would then generate a predicted translation.
With Self Attention, each position in the input sequence attends to all positions in the same sequence. This allows it to weight the importance of information for each individual position w.r.t itself.
This is done as follows (a code sketch follows this list):
- First, we compute a hidden state for each individual token.
- Then, we take the dot product between every pair of hidden states to compute attention scores. This means that we perform `O(n^2)` operations.
- Next, for each row, we normalise the scores using a softmax operation so that they sum to 1.
- Finally, we create a new representation for each token as a weighted sum of all hidden states, weighted by these normalised scores.
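A minimal sketch of these steps, using random vectors as stand-in hidden states:

```python
import torch

torch.manual_seed(123)
hidden = torch.randn(6, 3)                 # hidden states for 6 tokens, 3 dimensions each

attn_scores = hidden @ hidden.T            # pairwise dot products -> (6, 6), O(n^2) work
attn_weights = torch.softmax(attn_scores, dim=-1)   # each row now sums to 1

context = attn_weights @ hidden            # weighted sum of all hidden states per token
print(context.shape)                       # torch.Size([6, 3])
```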
In the Transformer, they introduce the concept of `key`, `query` and `value` matrices that we multiply our hidden states by. These are useful in helping to cast our original hidden states into a better representation that we can fine-tune during training.
Once our model has computed the new matrices `Q`, `K` and `V`, we can take `Q @ K.T` to get an `n x n` matrix of attention scores, which we then scale by dividing by `sqrt(d_k)` before applying a softmax.
Once we've obtained the normalised attention weights that sum to 1, we can compute a new representation for each token by taking the weighted sum of the corresponding value vectors in the `V` matrix. A sketch of this calculation follows below.
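Here is a minimal sketch of this scaled dot-product attention; the dimensions and the use of `nn.Linear` for the projection matrices are illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
n, d_in, d_out = 6, 3, 2                   # illustrative dimensions
x = torch.randn(n, d_in)                   # hidden states for 6 tokens

# trainable projections into query, key and value spaces
W_q, W_k, W_v = (nn.Linear(d_in, d_out, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

attn_scores = Q @ K.T                                                   # n x n matrix
attn_weights = torch.softmax(attn_scores / K.shape[-1] ** 0.5, dim=-1)  # scale by sqrt(d_k)

context = attn_weights @ V                 # weighted sum over the value vectors
print(context.shape)                       # torch.Size([6, 2])
```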
By having multiple heads working in parallel, we can train our model to have different heads that learn different aspects of the data. This enables it to attend to information from different representation subspaces at different positions.
Causal Attention is a specialised form of attention that only allows the model to consider previous and current inputs.
This means that we mask out the attention weights above the diagonal before computing the softmax. This prevents the model from looking ahead and cheating. This can be done with a simple `tril` mask:
# lower-triangular matrix of ones: 1 where a position may be attended to, 0 above the diagonal
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
This creates a mask as seen below.
tensor([[1., 0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 1., 0.],
[1., 1., 1., 1., 1., 1.]])
We can then apply a simple element-wise multiplication so that we only preserve the attention scores from the positions that have a `1`.
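A common alternative to multiplying by the mask is to set the positions above the diagonal to `-inf` before the softmax, so that they receive exactly zero weight without any renormalisation. A minimal sketch of that approach:

```python
import torch

torch.manual_seed(123)
attn_scores = torch.randn(6, 6)            # unnormalised attention scores

# boolean mask that is True above the diagonal (i.e. future positions)
mask = torch.triu(torch.ones(6, 6, dtype=torch.bool), diagonal=1)

# -inf scores receive exactly zero weight after the softmax
masked_scores = attn_scores.masked_fill(mask, float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)
print(attn_weights[0])                     # only the first entry is non-zero for token 0
```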
Sometimes, we might also implement dropout in the attention calculation by randomly zeroing out some of the attention weights (typically after the softmax has been applied).
This helps our model avoid becoming dependent on any specific set of hidden units, which helps it generalise better beyond its training dataset.
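A small sketch of what this looks like; the dropout rate of 0.5 is purely illustrative:

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)            # illustrative dropout rate

attn_weights = torch.softmax(torch.randn(6, 6), dim=-1)

# in training mode, roughly half the weights are zeroed and the remainder
# are scaled by 1 / (1 - 0.5) to keep the expected magnitude constant
print(dropout(attn_weights))
```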
In practice, when we use attention in production, we use something called multi-headed attention.
If we have an attention layer that has `k` heads which takes in a batch of `n` inputs that have a dimensionality of `d`, then we are going to:
- Multiply the same input by each head. Each head has a dimensionality of `d x (d//k)`. This will cast the input down to a smaller dimensionality.
- This means that we generate `k` unique combinations of our original `QKV` matrices.
- We then perform standard attention on each of these combinations, generating a final list of `k` vectors for each input that have a dimensionality of `d//k` each.
- We concatenate each of the `k` vectors for each input, thus giving us a final result that has a dimensionality of `n x d`, as we originally passed in.
import torch
import torch.nn as nn

class MultiHeadAttentionWrapper(nn.Module):
    """Runs several causal attention heads in parallel and concatenates their outputs."""
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        # one CausalAttention module per head, each producing d_out features
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # concatenate the per-head outputs along the feature dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)
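A hypothetical usage sketch, assuming the `CausalAttention` module referenced above accepts these constructor arguments and returns tensors of shape `(batch, num_tokens, d_out)`:

```python
# hypothetical usage sketch: assumes CausalAttention (referenced above) is defined
# and returns tensors of shape (batch, num_tokens, d_out)
import torch

torch.manual_seed(123)
batch = torch.randn(2, 6, 3)   # (batch, num_tokens, d_in)
mha = MultiHeadAttentionWrapper(d_in=3, d_out=2, context_length=6,
                                dropout=0.0, num_heads=2)
print(mha(batch).shape)        # torch.Size([2, 6, 4]) -- num_heads * d_out features
```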
Now that we've walked through the attention and tokenisation aspects of these large language models, let's learn more about their remaining components:
- Layer Normalisation
- GeLU
- Skip Connections
When our inputs pass through a linear layer, their mean and variance can shift. This becomes difficult to manage when we have a large number of linear layers stacked together, as the individual activations can rapidly grow in magnitude.
Layer Normalisation addresses this by normalising each individual input vector to a mean of 0 and a variance of 1.
out = linear(x)                            # output of a linear layer, shape (batch, emb_dim)
mean = out.mean(dim=-1, keepdim=True)      # per-example mean over the feature dimension
var = out.var(dim=-1, keepdim=True)        # per-example variance
out_norm = (out - mean) / torch.sqrt(var)  # normalised to mean 0, variance 1
In an LLM, we typically implement this using the following calculation, where the final normalised value is scaled by a `scale` parameter before being adjusted by a `shift` parameter.
This is done to allow the LLM to learn these trainable parameters so that it can make better predictions.
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5                                  # avoids division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable scale
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
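A quick sanity check of this layer; the shapes are arbitrary:

```python
import torch

torch.manual_seed(123)
x = torch.randn(2, 5)                    # a batch of 2 inputs with 5 features each
ln = LayerNorm(emb_dim=5)
out = ln(x)

# with scale = 1 and shift = 0 at initialisation, every row should end up
# with mean ~0 and variance ~1
print(out.mean(dim=-1))
print(out.var(dim=-1, unbiased=False))
```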
Gaussian Error Linear Units are a type of [[Activation Function]]. We can compute the value of a GeLU activation function using the formula

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. In practice, we approximate it with the formula

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715x^{3}\right)\right]\right)$$
We can see here that GeLU gives a little more wiggle room compared to the conventional [[ReLU]] function that we're used to, since small negative inputs produce small non-zero outputs rather than being clipped to zero.
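A minimal sketch of the tanh approximation as a PyTorch module:

```python
import torch
import torch.nn as nn

class GELU(nn.Module):
    """GeLU activation using the tanh approximation shown above."""
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * x ** 3)
        ))

# unlike ReLU, small negative inputs give small non-zero outputs
print(GELU()(torch.tensor([-1.0, 0.0, 1.0])))
```

PyTorch also provides this directly as `nn.GELU()`.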
Skip Connections were originally proposed to help overcome the issue of vanishing gradients. This is done by adding the results of an earlier input to the final result of a layer.
This helps the gradients to flow better and ultimately makes it easier for us to train our model since our weights update more easily.
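A minimal sketch of the idea, where the input to a block is added back onto the block's output; the wrapped layer here is an arbitrary placeholder:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps an arbitrary layer with a skip (residual) connection."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        # the earlier input x is added back to the layer's output,
        # giving gradients a direct path through the addition
        return x + self.layer(x)

block = ResidualBlock(nn.Sequential(nn.Linear(4, 4), nn.GELU()))
print(block(torch.randn(2, 4)).shape)   # torch.Size([2, 4])
```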
We can estimate a model's size in memory using the following code snippet, where `model` is the instantiated PyTorch model and each parameter is assumed to be a 32-bit float (4 bytes):
total_params = sum(p.numel() for p in model.parameters())  # count every trainable parameter
total_size_bytes = total_params * 4                         # 4 bytes per parameter, assuming float32
# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)
print(f"Total size of the model: {total_size_mb:.2f} MB")