How Transformers Work
The architecture behind every modern LLM
Before Transformers
Before 2017, the dominant approach for processing sequences of text was the Recurrent Neural Network (RNN) and its more capable variant, the Long Short-Term Memory network (LSTM). These networks read text one word at a time, left to right, passing a hidden state from each step to the next, like reading a book and trying to remember everything in a single running note.
The problem: RNNs struggled with long sequences. By the time you reached word 500, the network had largely forgotten word 1. They were also inherently sequential: you couldn't process word 10 until you'd processed words 1 through 9, so training couldn't be parallelized across the sequence and was slow.
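To make the "single note" idea concrete, here is a minimal sketch of the recurrence an RNN computes. The sizes, weights, and token IDs are made up for illustration; the point is that each step reuses the hidden state produced by the step before it, which is exactly what forces one-word-at-a-time processing.

```python
# Toy sketch of the RNN idea above (illustrative only, not a trained model).
import numpy as np

hidden_size, vocab_size = 8, 100
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights (made up)
W_x = np.random.randn(hidden_size, vocab_size) * 0.1   # input-to-hidden weights (made up)

def rnn_step(hidden, word_onehot):
    # The new hidden state depends on the previous one: the "single note" passed along.
    return np.tanh(W_h @ hidden + W_x @ word_onehot)

tokens = [3, 17, 42, 7]            # pretend word IDs for a short sentence
hidden = np.zeros(hidden_size)     # starts empty and must carry everything read so far
for tok in tokens:
    x = np.zeros(vocab_size)
    x[tok] = 1.0
    hidden = rnn_step(hidden, x)   # step t needs the result of step t-1: no parallelism
```

Because each pass through the loop consumes the previous hidden state, there is no way to compute all the steps at once, and anything the note fails to carry forward is lost for good.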