How Transformers Work
The architecture behind every modern LLM
Before Transformers
Before 2017, the dominant approach for processing sequences of text was the Recurrent Neural Network (RNN) and its more capable variant, the Long Short-Term Memory network (LSTM). These networks read text one word at a time, left to right, passing a hidden state from each step to the next, like reading a book and trying to remember everything in a single running note.
The problem: RNNs struggled with long sequences. By the time you reached word 500, the network had largely forgotten word 1. They were also inherently sequential: you couldn't process word 10 until you'd processed words 1 through 9, so training couldn't be parallelized across the sequence and was slow.
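To make the "single note" idea concrete, here is a minimal sketch of the recurrence an RNN computes. The sizes, weights, and token IDs are made up for illustration; the point is that each step reuses the hidden state produced by the step before it, which is exactly what forces one-word-at-a-time processing.

```python
# Toy sketch of the RNN idea above (illustrative only, not a trained model).
import numpy as np

hidden_size, vocab_size = 8, 100
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights (made up)
W_x = np.random.randn(hidden_size, vocab_size) * 0.1   # input-to-hidden weights (made up)

def rnn_step(hidden, word_onehot):
    # The new hidden state depends on the previous one: the "single note" passed along.
    return np.tanh(W_h @ hidden + W_x @ word_onehot)

tokens = [3, 17, 42, 7]            # pretend word IDs for a short sentence
hidden = np.zeros(hidden_size)     # starts empty and must carry everything read so far
for tok in tokens:
    x = np.zeros(vocab_size)
    x[tok] = 1.0
    hidden = rnn_step(hidden, x)   # step t needs the result of step t-1: no parallelism
```

Because each pass through the loop consumes the previous hidden state, there is no way to compute all the steps at once, and anything the note fails to carry forward is lost for good.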