The Transformer Blueprint


Predecessors

MLP

Due to their feedforward design, MLPs cannot preserve the order of information in a sequence. Sequence data loses meaning when its order is lost, so this inability to preserve order makes MLPs unsuitable for sequence modelling.

MLPs also require a large number of parameters, which is another undesirable property for a neural network.

CNN

CNNs are best known for processing images, and they have also been applied to other modalities such as text and video.

RNN

Transformer

Encoder

The encoder transforms the input sequence into a compressed representation.

Each encoder block has three main layers: multi-head attention (MHA), layer normalization, and an MLP (called the feed-forward network in the paper).
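A minimal NumPy sketch of one encoder block with the post-LN ordering used in the original paper (single-head attention for brevity; parameter names, shapes, and initializations are illustrative assumptions, not taken from any particular implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def simple_self_attention(x, W_q, W_k, W_v):
    # Single-head scaled dot-product self-attention (multi-head is covered later)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v

def encoder_block(x, p):
    # Sub-layer 1: attention, then residual connection and layer norm
    x = layer_norm(x + simple_self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: position-wise feed-forward (MLP), then residual and layer norm
    h = np.maximum(0.0, x @ p["W_1"] + p["b_1"])   # ReLU
    return layer_norm(x + h @ p["W_2"] + p["b_2"])

# Toy usage: d_model = 8, d_ff = 32, sequence of 4 tokens
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in [
    ("W_q", (d_model, d_model)), ("W_k", (d_model, d_model)), ("W_v", (d_model, d_model)),
    ("W_1", (d_model, d_ff)), ("b_1", (d_ff,)), ("W_2", (d_ff, d_model)), ("b_2", (d_model,)),
]}
print(encoder_block(rng.normal(size=(4, d_model)), p).shape)  # (4, 8)
```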

Decoder

The decoder is highly similar to the encoder, with two differences: it has an additional multi-head attention layer that operates over the output of the encoder, and its self-attention is masked to prevent the model from looking at subsequent tokens when generating the current token.

The goal of the decoder is to fuse the encoder output with the target sequence and make predictions (i.e., predict the next token).

The decoder block is also typically repeated the same number of times as the encoder block.
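A minimal NumPy sketch of the causal (look-ahead) mask used in the decoder's self-attention, with illustrative shapes: forbidden positions are set to negative infinity before the softmax, so they receive zero attention weight.

```python
import numpy as np

def causal_mask(seq_len):
    # 0 where attending is allowed (j <= i), -inf where it is forbidden (j > i)
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention_weights(q, k):
    scores = q @ k.T / np.sqrt(k.shape[-1]) + causal_mask(q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Each row i only has non-zero weights for columns 0..i
rng = np.random.default_rng(0)
q = k = rng.normal(size=(4, 8))
print(np.round(masked_attention_weights(q, k), 2))
```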

Attention

In essence, attention is a mechanism that allows the neural network to pay more attention to the parts of the input that contain meaningful information and less attention to the rest.
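A minimal NumPy sketch of scaled dot-product attention, the core operation behind this mechanism (the toy shapes and random inputs are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q: (seq_len_q, d_k), k: (seq_len_k, d_k), v: (seq_len_k, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)    # how much each position attends to the others
    return weights @ v, weights           # weighted sum of values, plus the attention map

# Toy usage: 4 tokens with dimension 8
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```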

Multi-Head Attention

Multi-head attention is essentially multiple independent attention operations computed over linearly projected query, key, and value (QKV) vectors.

It does not increase the overall computation cost because the dimension of each head is the overall model dimension divided by the number of heads (in the base Transformer, d_model = 512 and h = 8, so each head has dimension 64).

Computing multiple attention heads in parallel allows the model to “jointly attend to information from different representation subspaces at different positions.”
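A minimal NumPy sketch of multi-head attention: the model dimension is split across heads, each head attends independently, and the head outputs are concatenated and projected back (parameter names and initializations are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=8):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads                        # e.g. 512 // 8 = 64
    q, k, v = x @ W_q, x @ W_k, x @ W_v                  # (seq_len, d_model) each
    # Reshape to (num_heads, seq_len, d_head) so each head attends independently
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    out = softmax(scores) @ v                            # (num_heads, seq_len, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return out @ W_o                                     # final linear projection

# Toy usage with d_model = 512 and 8 heads
rng = np.random.default_rng(0)
d_model, seq_len = 512, 6
W = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4)]
y = multi_head_attention(rng.normal(size=(seq_len, d_model)), *W)
print(y.shape)  # (6, 512)
```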

Embeddings

These embedding layers convert input or output tokens into dense vectors of a fixed size, essentially mapping each token in a sequence to a specific dense vector.

Utilizing embeddings is a standard practice in language modeling due to the semantic depth they provide. With these embedded token vectors, those bearing similar semantic meanings tend to align in the same direction.
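A minimal sketch of an embedding layer as a lookup table, assuming NumPy and an illustrative vocabulary size, dimension, and token ids:

```python
import numpy as np

vocab_size, d_model = 10_000, 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([17, 932, 4])           # a 3-token input sequence
token_vectors = embedding_table[token_ids]   # (3, 512): one dense vector per token
print(token_vectors.shape)
```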

Positional Encodings

Positional encodings serve as integral components in the initial stages of both the encoder and decoder within a Transformer model. They are used to preserve the order of tokens in a sequence.

This stems from the inherent permutation invariance of the attention mechanism, whereby modifying the order of tokens does not alter the output weighted values. Consequently, the attention mechanism, on its own, lacks awareness of the token order. As the transformer architecture does not incorporate any other recurrence methods, positional encodings are introduced to equip the model with positional awareness of the tokens in the sequence.

In essence, without positional encodings, a Transformer would indeed exhibit permutation invariance. However, such a design would fall short for tasks where sequence order holds significance, as is the case for most NLP tasks.
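A minimal NumPy sketch of the sinusoidal positional encodings from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the toy sizes below are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encodings are simply added to the token embeddings at the model's input
pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```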

Residual Connections

Residual connections alleviate unstable-gradient problems and help the model converge faster.

Residual connections also carry positional information to higher layers, among other information.
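A tiny sketch of what a residual (skip) connection looks like around a generic sub-layer (`sublayer` here is a stand-in for attention or the MLP):

```python
def residual(x, sublayer):
    return x + sublayer(x)   # the input is added back to the sub-layer output
```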

Layer Normalization

Layer normalization significantly reduces the training time by normalizing the activations of a layer with the layer mean and variance.

Unlike batch normalization (Ioffe and Szegedy, 2015), which normalizes activations with the mean and variance computed over the mini-batch, layer norm normalizes each example's activations with that example's own mean and variance.

Layer normalization maintains similar behavior during both training and testing phases, unlike batch normalization which exhibits different behaviors in these two stages.
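A minimal NumPy sketch contrasting the two normalization axes (learnable scale and shift parameters are omitted; shapes are illustrative assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # x: (batch, features); each example is normalized with its own statistics
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Each feature is normalized with statistics computed across the mini-batch
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(32, 512))   # (batch, features)
print(layer_norm(x).shape, batch_norm(x).shape)       # (32, 512) (32, 512)
```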

There are two placements of layer normalization (sketched after the list below):

  • Post-LN
  • Pre-LN
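A minimal sketch of the two placements, reusing a `layer_norm` function and a generic `sublayer` passed in as arguments: Post-LN (the original paper) normalizes after the residual addition, while Pre-LN normalizes the sub-layer input.

```python
def post_ln_block(x, sublayer, layer_norm):
    # Original Transformer: normalize after adding the residual
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer, layer_norm):
    # Pre-LN: normalize the sub-layer input; the residual path stays unnormalized
    return x + sublayer(layer_norm(x))
```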

Linear and Softmax Layers

The linear layer after the decoder takes the decoded activations and projects them to the size of the vocabulary, producing logits. The softmax layer turns those logits into next-token probabilities, and the predicted next token is simply the argmax of the softmax output.
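A minimal NumPy sketch of this final step with greedy decoding (the weight matrix and hidden vector below are random stand-ins; names and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

d_model, vocab_size = 512, 10_000
rng = np.random.default_rng(0)
W_vocab = rng.normal(scale=0.02, size=(d_model, vocab_size))  # final linear layer weights

last_hidden = rng.normal(size=(d_model,))   # decoder activation at the last position
logits = last_hidden @ W_vocab              # (vocab_size,) unnormalized scores
probs = softmax(logits)                     # next-token probability distribution
next_token = int(np.argmax(probs))          # greedy choice of the next token
print(next_token, probs[next_token])
```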

Pros

  • Every output token can directly attend to every token of the input sequence, preserving the overall context
    • Traditional RNNs (LSTM, GRU) suffer from the long-term dependency problem, so their ability to retain information weakens over time
    • A CNN’s receptive field is local and depends on the kernel size, which leads to a loss of context
  • All tokens of the input sequence can be processed in parallel
  • A degree of interpretability, since attention highlights the parts of the input sequence that matter most for generating a particular output
  • State-of-the-art (SOTA) results in NLP and other domains

Cons

  • For some kinds of attention, memory consumption and computation cost grow quadratically with the input sequence length
    • Strategies such as sparse attention and local attention have been proposed to address this
  • A sequential procedure is still required depending on the task: transformers generate output tokens one at a time, iterating until the full output sequence has been produced
  • Interpretability is still limited

Training

The training process of LLMs is broadly the same: the model is exposed to a large amount of carefully curated text data (from books, articles, code, websites, ...) and trained with self-supervised learning, since it is impossible to label data at that scale. This requires a cleverly designed training objective, because there is no reference ground truth; hence most LLMs use NTP (next-token prediction) or MLM (masked language modeling) as the common training objective.
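A minimal NumPy sketch of the NTP objective, where the target at each position is just the input shifted by one token and the loss is the cross-entropy against the model's predicted distribution (MLM instead masks random input tokens and predicts them; `model_logits` below is a random stand-in for a real model's output):

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (seq_len, vocab_size); targets: (seq_len,) integer token ids
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

token_ids = np.array([5, 17, 932, 4, 88])
inputs, targets = token_ids[:-1], token_ids[1:]   # predict token t+1 from tokens <= t

vocab_size = 10_000
rng = np.random.default_rng(0)
model_logits = rng.normal(size=(len(inputs), vocab_size))  # stand-in for model(inputs)
print(cross_entropy(model_logits, targets))
```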

After pre-training, the model can generate text with techniques such as zero-shot or few-shot learning, both of which are examples of ICL (in-context learning). ICL refers to an LLM's ability to use semantic prior knowledge to generate coherent text without any parameter updates. ICL performs well on simple tasks but falls short on more complex ones (downstream tasks that cannot simply be described with prompts, such as mathematics, medicine, and science); in those cases, the base LLM needs to be fine-tuned on downstream datasets.

Encoder-only LLMs

Encoder-only LLMs are typically used for discriminative NLP tasks such as text classification and sentiment analysis.

They are fine-tuned on various downstream tasks with an additional task-specific head.
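A minimal sketch of what such a task-specific head can look like for classification, assuming NumPy, a pooled encoder representation, and illustrative shapes (`encoder_output` is a random stand-in for real encoder activations):

```python
import numpy as np

seq_len, d_model, num_classes = 16, 512, 2            # e.g. binary sentiment analysis
rng = np.random.default_rng(0)

encoder_output = rng.normal(size=(seq_len, d_model))  # stand-in for encoder activations
pooled = encoder_output[0]                            # e.g. pool via the first ([CLS]-style) token
W_head = rng.normal(scale=0.02, size=(d_model, num_classes))
logits = pooled @ W_head                              # task-specific classification head
print(logits.shape)  # (2,)
```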

Decoder-only LLMs

Decoder-only LLMs are trained with the next-token-prediction objective. As a result, they generate one token at a time, i.e., autoregressively. Overall, decoder models are used for generative tasks.
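A minimal sketch of this autoregressive loop with greedy decoding (`next_token_logits` is a hypothetical stand-in for a real decoder-only model; the token ids are illustrative):

```python
import numpy as np

def next_token_logits(token_ids, vocab_size=100):
    # Hypothetical stand-in for a real model called on the current context
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=(vocab_size,))

def generate(prompt_ids, max_new_tokens=5, eos_id=0):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)     # one forward pass per generated token
        next_id = int(np.argmax(logits))    # greedy decoding; sampling is also common
        ids.append(next_id)
        if next_id == eos_id:               # stop once an end-of-sequence token appears
            break
    return ids

print(generate([5, 17, 932]))
```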
