What Every Programmer Should Know About Transformers

An interactive guide to understanding how transformer models work — from tokenization to attention to modern LLMs

~15,000 words · 9 interactive demos · Tokenization · Attention · LLMs

Introduction — Why Transformers Matter to Programmers

Every time you accept a suggestion from GitHub Copilot, ask a question to ChatGPT, run a semantic search query, or generate embeddings for a RAG pipeline, a transformer model is doing the work. Transformers aren’t just one more machine learning model — they are the architecture behind the current generation of AI tools that programmers interact with daily.

And yet, most programmers who use these tools have little idea what’s happening under the hood. That’s not a character flaw; it’s a content gap. The original “Attention Is All You Need” paper is dense. Most tutorials either oversimplify (“it’s like a lookup table!”) or assume you already know what a gradient is. Neither is helpful when you’re trying to build real intuition about why your 128k-context API call is slow, why the model hallucinates, or what “temperature” actually does to the output.

What this article covers

This article walks you through the transformer architecture step by step, from raw text input to generated output. You’ll learn:

  • How text becomes numbers (tokenization and embeddings)
  • How the model knows word order without processing words sequentially (positional encoding)
  • How attention works — the core mechanism that lets the model weigh which parts of the input matter for each part of the output
  • How attention, feed-forward networks, residual connections, and normalization combine into one “transformer block” — and why stacking many blocks creates a powerful model
  • How encoder-decoder and decoder-only architectures differ, and why GPT-style models use decoder-only
  • How models are trained (pre-training, fine-tuning, RLHF) and how inference works (the autoregressive generation loop, KV cache, sampling)
  • How reasoning models like OpenAI o3 and DeepSeek-R1 extend the transformer with chain-of-thought inference and test-time compute scaling
  • Why GPT, Claude, Gemini, Qwen, and DeepSeek all use transformers but produce different results

What you need to know beforehand

You should be comfortable with programming. You don’t need a machine learning background. When we encounter math (and we will), we’ll always show the equivalent in code. Think of this article as reading well-commented source code for a transformer — you’ll follow the data flow, understand the shapes, and see what each component does.

How to read this article

Each section builds on the previous one. We use a single running example — “The cat sat on the mat.” — throughout the entire article. You’ll see this sentence tokenized, embedded, attended to, and eventually predicted by the model. Interactive demos let you manipulate the data directly. Play with them — that’s where the real understanding happens.

Let’s start with the big picture.

Summary: Transformers are the architecture behind the AI tools you use every day. This article will take you from zero to a working understanding of how they process text, attend to context, and generate language — all explained with code, not calculus. We'll start with the big picture of what a transformer does and why it was a breakthrough.

The Big Picture — What a Transformer Does and Why It’s Hard

The one-paragraph summary

A transformer is a neural network architecture that takes a sequence of tokens (pieces of text converted to numbers), processes them through a stack of identical layers, and produces a probability distribution over which token comes next. Each layer applies two operations: an attention mechanism that lets every token look at every other token to gather context, and a feed-forward network that processes each token individually. The magic is in the attention — it lets the model capture relationships between any two positions in the input, no matter how far apart they are.

That’s it. Everything else in this article is explaining how each piece works and why it was designed that way.

Why language modeling is hard

Consider our running example: “The cat sat on the mat.”

Predicting the word “mat” requires understanding that we’re talking about a place where a cat might sit. The model needs to connect “mat” back to “cat” and “sat” — words that appeared several positions earlier. In longer text, these dependencies can span hundreds or thousands of words: a pronoun on page 3 referring to a character introduced on page 1.

Language is also deeply ambiguous. “Bank” can mean a financial institution or a riverbank. “They” can refer to different entities depending on context. The correct interpretation depends on attending to the right context — and ignoring the noise.

What came before: RNNs and their bottleneck

Before transformers (pre-2017), the dominant approach was Recurrent Neural Networks (RNNs) and their improved variants, LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units).

An RNN processes text one token at a time, left to right. At each step, it takes the current token and its “memory” of previous tokens (a hidden state vector), and produces an updated hidden state:

# RNN pseudocode
hidden_state = initial_state
for token in sequence:
    hidden_state = rnn_step(token, hidden_state)
output = predict(hidden_state)

This has two fundamental problems:

  1. Sequential bottleneck: Each step depends on the previous step. You can’t parallelize this across a GPU — you must wait for step 1 before computing step 2. Training is slow.

  2. Vanishing information: By the time the model reaches token 500, information about token 1 has been compressed through 500 sequential transformations. Each step overwrites the hidden state with a new summary, and information from early tokens gradually degrades — it doesn’t disappear at a sharp boundary, but long-range dependencies become unreliable as sequence length grows, even with LSTM gates designed to mitigate this.

How transformers solve this

The transformer takes a radically different approach. Instead of processing tokens one at a time, it processes all tokens simultaneously:

# Transformer pseudocode (simplified)
token_vectors = embed(all_tokens)        # All at once
for layer in transformer_layers:
    token_vectors = attention(token_vectors)  # Every token looks at every token
    token_vectors = feed_forward(token_vectors)  # Process each token
next_token_probs = output_head(token_vectors[-1])  # Predict next

Key differences:

  • Parallel: All tokens are processed at the same time. No waiting. GPUs love this.
  • Direct access: When processing “mat”, the model can look directly at “cat” in one step — no information needs to travel through 3 intermediate positions.
  • Constant path length: Information between any two tokens travels through the same number of layers, regardless of their distance in the sequence.

The tradeoff? Attention is O(n²) — every token looks at every other token, so doubling the sequence length quadruples the computation. We’ll cover the implications of this in the masking section.
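The quadratic growth is easy to see by counting score computations. A minimal sketch, using the number of query-key scores as a stand-in for actual FLOPs:

```python
def attention_score_count(seq_len):
    # Every token computes a similarity score against every token,
    # so the score matrix has seq_len * seq_len entries.
    return seq_len * seq_len

# Doubling the sequence length quadruples the work
assert attention_score_count(2048) == 4 * attention_score_count(1024)
```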

The running example

Throughout this article, we’ll trace “The cat sat on the mat.” through every stage of a transformer. In the next section, we’ll start with the very first step: how this sentence becomes a sequence of numbers that a neural network can process.

Summary: A transformer is a function that takes a sequence of tokens and predicts the next one. The hard part is that language has long-range dependencies, ambiguity, and structure that simple sequential processing can't capture. Transformers solve this with parallel attention over the entire input. In the next section, we'll start with the first step: how text becomes numbers through tokenization.

Tokenization — How Text Becomes Token IDs

A transformer doesn’t see text. It sees numbers. The first step in any transformer pipeline is tokenization: converting a string of characters into a sequence of integer IDs.

Why not just use characters or words?

You might think: “just split on spaces and assign each word a number.” The problem is vocabulary size. English has hundreds of thousands of words, and programming languages, proper nouns, and multilingual text make it worse. A vocabulary of 500,000 words means the model’s output layer has 500,000 neurons — expensive and wasteful since most words are rare.

Character-level tokenization goes the other direction: the vocabulary is tiny (~256 ASCII characters), but sequences become very long. “transformer” is 11 characters instead of 1 token. Longer sequences mean more computation (remember: attention is O(n²)).

Subword tokenization: the sweet spot

Modern transformers use subword tokenization — algorithms like Byte Pair Encoding (BPE) and WordPiece, or the SentencePiece library (which implements BPE or unigram algorithms) — that split text into pieces that balance vocabulary size (~32k–100k tokens) with sequence length.

The key idea: common words stay as single tokens, while rare words are split into recognizable pieces.

"transformers"  →  ["transform", "ers"]
"unhappiness"   →  ["un", "happiness"]   (or ["un", "happ", "iness"])
"The"           →  ["The"]
"."             →  ["."]

The tokenizer has a fixed vocabulary — a lookup table from string pieces to integer IDs — that was built during training. You can’t change it after the fact.
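To make the lookup-table idea concrete, here is a toy greedy longest-match tokenizer. This is an illustration only: real BPE applies learned merge rules rather than longest-match, and the vocabulary and IDs below are made up.

```python
# Toy vocabulary: string piece -> integer ID (invented for illustration)
VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4,
         " mat": 5, ".": 6, " m": 7, "at": 8}

def tokenize(text):
    ids, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

assert tokenize("The cat sat on the mat.") == [0, 1, 2, 3, 4, 5, 6]
assert tokenize(" matat") == [5, 8]  # rare string split into pieces
```

Note how " matat" falls back to two smaller pieces — the same behavior that lets real tokenizers handle words they have never seen as whole tokens.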

Our running example

Let’s tokenize “The cat sat on the mat.” Try it yourself in the demo below — type any text and see how it splits into tokens:

Interactive Tokenizer

Type any text to see how it splits into tokens. Each token maps to an integer ID.

7 tokens: "The" (ID 464), "cat" (ID 3797), "sat" (ID 3332), "on" (ID 319), "the" (ID 262), "mat" (ID 2603), "." (ID 13)
Token ID sequence
[464, 3797, 3332, 319, 262, 2603, 13]

Notice a few things:

  • “The” (capitalized) and “the” (lowercase) may be different tokens — the tokenizer is case-sensitive
  • Spaces are often attached to the following word — “ cat” (with a leading space) is a single token, distinct from “cat”
  • Punctuation is its own token — “.” is a separate token
  • Token IDs are arbitrary integers — they’re just indices into the vocabulary table

What the model actually receives

After tokenization, our sentence is a sequence of integers — say [464, 3797, 3332, 319, 262, 2603, 13] (actual IDs depend on the specific tokenizer). This is what enters the transformer. The model has no access to the original characters — it only sees these IDs.

Vocabulary sizes across real models

Different models make different tradeoffs in vocabulary size. Larger vocabularies mean more tokens are represented as single entries (better compression, shorter sequences), but also mean a larger embedding table and output layer. Here’s how major models compare:

Model               | Vocab Size      | Tokenizer Type       | Notes
GPT-2               | 50,257          | BPE                  | Byte-level BPE; standard baseline for many projects
GPT-4 / cl100k_base | ~100,256        | BPE                  | Improved multilingual and code coverage over GPT-2
Llama 3             | 128,256         | BPE                  | 4× GPT-2’s vocab; much better multilingual compression
Qwen 2.5            | 151,646         | BPE                  | Large vocab optimized for Chinese + English + code
Gemini              | ~256,000 (est.) | SentencePiece (est.) | Largest vocab among major models; strong multilingual coverage

Practical implications

Tokenization has direct consequences for anyone building on top of LLMs:

  • API billing is per-token, not per-word. The same English sentence might be 15 tokens with GPT-2’s tokenizer but only 12 tokens with Llama 3’s, because the larger vocabulary compresses common sequences more aggressively. Understanding your tokenizer helps you estimate costs.
  • Different tokenizers produce different token counts for the same text. If you’re comparing context window limits across models (e.g., “128k tokens”), keep in mind that 128k tokens in one model may cover more or less raw text than in another.
  • Multilingual efficiency varies dramatically. Tokenizers trained primarily on English data may split non-Latin text into many small pieces, using 3–5× more tokens for the same content. Models like Qwen 2.5 and Gemini, trained with larger multilingual vocabularies, handle this much better.
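The per-token billing point above reduces to a two-line helper. The rate used here is hypothetical; check your provider's pricing page for real numbers.

```python
def estimate_prompt_cost(num_tokens, usd_per_million_tokens):
    # usd_per_million_tokens is a hypothetical rate, not a real price
    return num_tokens / 1_000_000 * usd_per_million_tokens

# A 50k-token prompt at a hypothetical $2 per 1M input tokens
cost = estimate_prompt_cost(50_000, 2.0)
assert abs(cost - 0.10) < 1e-12
```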

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). “Neural Machine Translation of Rare Words with Subword Units.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).
  2. Kudo, T. & Richardson, J. (2018). “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  3. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., et al. (2016). “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” arXiv:1609.08144. (Introduces WordPiece tokenization.)
  4. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT 2019. (Uses WordPiece.)

In the next section, we’ll see how these bare integers are converted into rich, meaningful vectors through embeddings.

Summary: Before a transformer can process text, it must be split into tokens — subword units that balance vocabulary size with coverage. Each token maps to an integer ID. Our running example 'The cat sat on the mat.' becomes something like [464, 3797, 3332, 319, 262, 2603, 13] — and that's what actually enters the model. Next, in section 4, we'll see how these integer IDs are transformed into dense vectors through embeddings.

Embeddings — From Token IDs to Vectors

After tokenization, our sentence is a sequence of integers: [464, 3797, 3332, 319, 262, 2603, 13]. But these numbers are arbitrary — token 3797 (“cat”) is not mathematically closer to 3332 (“sat”) just because the IDs are nearby. The model needs a way to represent tokens as vectors that capture meaning.

The embedding table

An embedding is a learned lookup table. It’s a matrix of shape (vocab_size, d_model) — one row per token in the vocabulary, each row a vector of d_model dimensions (typically 768, 1024, or larger).

# Embedding pseudocode
embedding_table = random_matrix(vocab_size=50257, d_model=768)  # Initialized randomly

# Look up the embeddings for our tokens
token_ids = [464, 3797, 3332, 319, 262, 2603, 13]
token_vectors = embedding_table[token_ids]  # Shape: (7, 768)

Each token ID becomes a 768-dimensional vector. At initialization, these vectors are random — “cat” has no meaningful relationship to “kitten.” But during training, the model adjusts the embedding table so that tokens used in similar contexts end up with similar vectors.

What “similar” means geometrically

Think of each embedding as a point in 768-dimensional space. We can’t visualize 768 dimensions, but we can project them down to 2D. Here’s what that looks like for tokens from our running example and some related words:

Embedding Space Visualization

Hover over points to see tokens. Notice how semantically similar words cluster together.

Embedding space showing semantic clusters of tokens. Legend: running example · animals · actions · objects/places · function words.

Notice the clustering:

  • “cat”, “dog”, “kitten”, “puppy” are near each other — they’re all animals
  • “sat”, “stood”, “ran”, “walked” cluster — they’re all actions
  • “mat”, “rug”, “carpet”, “floor” cluster — they’re all surfaces/objects
  • “the”, “a”, “an” cluster — they’re function words (articles)

This structure emerges entirely from training. Nobody told the model that “cat” is an animal. The model learned it because “cat” and “dog” appear in similar contexts in the training data.
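"Near each other" is usually measured with cosine similarity, the angle-based metric used throughout embedding search. A minimal sketch with toy 3-D vectors (made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector lengths: 1.0 means
    # same direction, 0.0 means orthogonal, -1.0 means opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-D "embeddings" for illustration
cat    = [0.8, 0.6, 0.1]
kitten = [0.7, 0.7, 0.2]
sat    = [-0.5, 0.2, 0.9]

# "cat" is far more similar to "kitten" than to "sat"
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, sat)
```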

Embeddings as a programmer concept

If you’ve used word embeddings in a search system or recommendation engine, this is the same idea. The embedding table is a dictionary mapping IDs to vectors:

// Conceptually:
const embeddingTable: Map<number, number[]> = new Map();
// embeddingTable.get(3797) => [0.12, -0.34, 0.56, ...] // 768 numbers for "cat"
// embeddingTable.get(3332) => [-0.21, 0.43, 0.11, ...] // 768 numbers for "sat"

The key insight: embeddings are just a lookup table. There’s no computation here — just a table read. The intelligence is in how the table was trained. This idea has deep roots: earlier methods like Word2Vec [1] and GloVe [2] also represented words as dense vectors, but they were static — every occurrence of “bank” got the same vector regardless of context. The embeddings in a transformer are learned in-model and interact with every other component during training, giving them richer, more context-aware structure.

Shape check

After embedding, our 7 tokens become a matrix of shape (7, 768) — seven rows (one per token), each with 768 values. This matrix is the input to the rest of the transformer. Every subsequent operation works on these vectors.

Embedding dimensions across real models

The embedding dimension (d_model) is one of the most important hyperparameters in a transformer. Larger dimensions can capture more nuanced relationships but cost more memory and compute. Here’s how it varies across well-known models:

Model        | Embedding Dimension (d_model) | Parameters | Notes
GPT-2 Small  | 768                           | 117M       | Baseline small model
GPT-2 Large  | 1,280                         | 762M       | Wider embeddings, better representations
GPT-3 (175B) | 12,288                        | 175B       | 16× wider than GPT-2 Small
Llama 3 8B   | 4,096                         | 8B         | Efficient modern architecture
Llama 3 70B  | 8,192                         | 70B        | 2× wider than Llama 3 8B
Qwen 2.5 72B | 8,192                         | 72B        | Same width as Llama 3 70B

Notice that embedding dimension doesn’t scale linearly with parameter count — depth (number of layers) also increases. But wider embeddings generally mean richer per-token representations.

Practical implications

Embeddings have direct consequences for developers working with language models:

  • Embedding API costs scale with dimension. When you call an embedding API (e.g., OpenAI’s text-embedding-3-small at 1,536 dimensions vs. text-embedding-3-large at 3,072 dimensions), the vector size affects storage costs, retrieval latency, and similarity search performance. For many applications, smaller embeddings are surprisingly effective.
  • Smaller models still capture useful semantics. A 768-dimensional embedding from a well-trained small model can outperform a 12,288-dimensional embedding from a poorly trained large model. Quality of training data and training procedure matter more than raw dimension.
  • Embeddings are the foundation of RAG and search. If you’re building retrieval-augmented generation (RAG) systems, understanding that embeddings are “just” learned lookup tables — with all the biases and gaps that implies — helps you debug retrieval quality issues.

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). “Efficient Estimation of Word Representations in Vector Space.” Proceedings of the International Conference on Learning Representations (ICLR) Workshop. (Introduces Word2Vec.)
  2. Pennington, J., Socher, R., & Manning, C. D. (2014). “GloVe: Global Vectors for Word Representation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). (Introduces GloVe.)

But there’s a problem. Look at the embedding matrix: there’s nothing in it that tells the model which token came first. “The cat sat on the mat” and “mat the on sat cat The” would produce the exact same set of embedding vectors (just in different order). The model needs position information — that’s the next section.

Summary: Token IDs are meaningless integers. Embeddings convert them into dense vectors in a high-dimensional space where semantic relationships emerge: 'cat' and 'kitten' end up close together, 'cat' and 'sat' are far apart. This embedding table is learned during training. Next, in section 5, we'll tackle the problem that these embedding vectors carry no information about word order — and how positional encodings fix that.

Position Information — Teaching Order to an Order-Blind Model

After embedding, our 7 tokens are 7 vectors. But there’s nothing in those vectors that encodes position. The embedding for “cat” is the same whether it’s the second word or the hundredth word. If we fed these embeddings straight into attention, the model would process the sentence as an unordered set — like a bag of words.

Word order matters. “The cat sat on the mat” means something very different from “The mat sat on the cat.”

The solution: add position to the embedding

The fix is simple: add a position-dependent vector to each token’s embedding. If the embedding for “cat” at position 1 is e_cat, and the positional encoding for position 1 is p_1, the model receives e_cat + p_1.

# Position encoding pseudocode
for i, token_vector in enumerate(token_vectors):
    token_vectors[i] = token_vector + position_encoding(i)

Now the model can tell that “cat” at position 1 is different from “cat” at position 5, because they have different positional vectors added. The question is: what should these positional vectors look like?

Approach 1: Sinusoidal encoding (original transformer)

The original 2017 transformer paper used sinusoidal functions — sine and cosine waves at different frequencies — to generate positional encodings. Each dimension uses a different frequency:

import math

def sinusoidal_encoding(pos, d_model):
    encoding = []
    for i in range(d_model):
        # Paired dimensions (2i, 2i+1) share a frequency that decreases
        # geometrically with i
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        if i % 2 == 0:
            encoding.append(math.sin(angle))
        else:
            encoding.append(math.cos(angle))
    return encoding

Why sines and cosines? Because relative position can be expressed as a linear transformation of absolute positions in sinusoidal space — the model can learn to compute “3 positions ago” from the encoding values. And since the functions are deterministic, they work for any sequence length without training.
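That "linear transformation" claim is just the angle-addition identities: for each frequency w, the (sin, cos) pair at position p + k is a rotation of the pair at position p, and the rotation depends only on the offset k. A quick numeric check:

```python
import math

w, p, k = 0.01, 37, 5  # one frequency, a position, and an offset

s_p, c_p = math.sin(w * p), math.cos(w * p)
s_k, c_k = math.sin(w * k), math.cos(w * k)

# Rotating the encoding at position p by angle w*k yields the
# encoding at position p + k (angle-addition identities)
s_pk = s_p * c_k + c_p * s_k
c_pk = c_p * c_k - s_p * s_k

assert abs(s_pk - math.sin(w * (p + k))) < 1e-12
assert abs(c_pk - math.cos(w * (p + k))) < 1e-12
```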

Approach 2: Learned positional embeddings

GPT-2 and BERT use learned positional embeddings — a second embedding table, just like the token embedding table, but indexed by position instead of token ID.

position_table = random_matrix(max_positions=1024, d_model=768)
position_vectors = position_table[0:seq_len]  # Look up positions 0..6

This is simpler and often works just as well, but it has a hard limit: the model can only handle positions up to max_positions. You can’t extrapolate to longer sequences without retraining.

Approach 3: RoPE (Rotary Position Embedding)

Modern models like LLaMA, Qwen, and many GPT variants use RoPE — Rotary Position Embedding. Instead of adding a vector, RoPE rotates the query and key vectors by an angle proportional to their position before computing attention.

The key advantage: RoPE encodes relative position naturally. The dot product between a query at position 5 and a key at position 3 depends only on the difference (2), not on the absolute positions. This makes it easier to extend context length after training.
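A minimal sketch of the rotation trick in plain Python (a real implementation applies this across full query/key tensors inside each attention head; the vectors below are made up) that checks the relative-position property:

```python
import math

def rotate_pairs(vec, pos, base=10000.0):
    # Rotate each (even, odd) dimension pair by pos * theta_i, where
    # theta_i falls off with the pair index, as in RoPE
    out = list(vec)
    for i in range(0, len(vec), 2):
        theta = pos / (base ** (i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # toy query vector
k = [1.1, 0.4, -0.6, 0.9]   # toy key vector

# Same relative offset (2) at different absolute positions
# produces the same attention score
s1 = dot(rotate_pairs(q, 5), rotate_pairs(k, 3))
s2 = dot(rotate_pairs(q, 7), rotate_pairs(k, 5))
assert abs(s1 - s2) < 1e-9
```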

Explore the encodings

Try switching between encoding types and clicking on different token positions to see how the encoding vector changes:

Positional Encoding Visualization

Click on a token to see its positional encoding vector. Toggle between encoding types.

Encoding heatmap (all positions × 32 dimensions)

Positional encoding: sinusoidal, position 0 (token “The”), 32 dimensions

Encoding vector for position 0 (“The”), plotted by dimension
Notice in the heatmap:

  • Sinusoidal: Clean wave patterns. Low dimensions change rapidly across positions; high dimensions change slowly. Each position gets a unique “fingerprint.”
  • Learned: No visible pattern — values are whatever was useful during training.
  • RoPE: Rotation-like patterns that vary by dimension.

After positional encoding

With position information added, our token vectors now carry both meaning (from the embedding) and position (from the positional encoding). The matrix is still shape (7, 768) — same dimensions, but now each vector knows where it is in the sequence.

Positional encoding variants compared

Different models use different approaches, each with distinct tradeoffs for context length, compute cost, and generalization:

Method     | Used By                         | Type                                    | Learned Params    | Max Context Extrapolation                         | Key Tradeoff
Sinusoidal | Original Transformer (2017)     | Fixed (added to embeddings)             | None              | Theoretically unlimited, but untested in practice | Simple and elegant, but rarely used in modern models
Learned    | GPT-2, BERT                     | Learned position embeddings (added)     | max_len × d_model | Hard limit at training length                     | Easy to implement, but cannot generalize beyond trained positions
RoPE       | Llama 1/2/3, Qwen, Mistral, Phi | Rotary (applied to Q, K in attention)   | None              | Extends beyond training length with interpolation | Best generalization; dominant in modern open-source models
ALiBi      | BLOOM, MPT                      | Attention bias (no position embeddings) | None              | Extrapolates well by design                       | Simplest — no position embeddings at all; adds linear bias to attention scores

Practical implications

The choice of positional encoding has a direct impact on what you can do with a model:

  • RoPE enables context length extension after training. Techniques like NTK-aware scaling, YaRN, and dynamic scaling allow RoPE-based models to handle sequences 4–16× longer than their training length with minimal quality loss. This is why most modern models (Llama 3, Mistral, Qwen) use RoPE — it unlocks 128k+ context windows without retraining from scratch.
  • ALiBi is architecturally simpler. Since it adds a linear bias to attention scores rather than modifying embeddings, it avoids the need for a separate position encoding step entirely. For practitioners building custom models, ALiBi can simplify the implementation.
  • Learned embeddings impose a hard ceiling. If you’re working with a model that uses learned positional embeddings (like GPT-2), you cannot process sequences longer than its trained maximum (e.g., 1024 tokens) without retraining the position embedding table — a fundamental limitation that newer methods solve.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems (NeurIPS). (Introduces sinusoidal positional encoding.)
  2. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv:2104.09864. (Introduces RoPE.)
  3. Press, O., Smith, N. A., & Lewis, M. (2022). “Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization.” Proceedings of the International Conference on Learning Representations (ICLR). (Introduces ALiBi.)

We’re ready for the most important part of the transformer: attention.

Summary: Attention treats its input as a set, not a sequence — it has no built-in notion of word order. Positional encodings fix this by adding a unique position signal to each token's embedding. Sinusoidal encodings use fixed math patterns; learned encodings are trained; RoPE rotates vectors by position, enabling flexible context extension. Next, in section 6, we'll see how attention uses these position-aware vectors to let every token communicate with every other token.

Attention — From Alignment to Multi-Head Self-Attention

This is the most important section in the article. Attention is what makes transformers work. Everything before this section (tokenization, embeddings, positional encoding) was preparation. Everything after (feed-forward networks, residual connections, the full block) builds on top of attention.

We’ll build up to the full mechanism step by step: intuition first, then the history, then the math, then multi-head.

Intuition: what problem does attention solve?

Consider our running example: “The cat sat on the mat.”

When the model processes the word “sat”, it needs context. What sat? It needs to look back at “cat.” When it processes “mat”, it needs to know that “on” and “sat” establish a spatial relationship.

Attention is the mechanism that lets each token “look at” every other token in the sequence and decide how much to care about each one.

For the token “sat”, the attention mechanism might produce something like:

Token            | The  | cat  | sat  | on   | the  | mat  | .
Attention weight | 0.05 | 0.50 | 0.20 | 0.10 | 0.05 | 0.08 | 0.02

“sat” attends strongly to “cat” (weight 0.50) because that’s the subject of the verb. It attends moderately to itself (0.20) and weakly to everything else. These weights sum to 1 — it’s a probability distribution over positions.

The output for “sat” is then a weighted average of all the token vectors, using these weights. The result is a new vector that represents “sat” enriched with the context it needs — especially the information from “cat.”
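As a sketch of that weighted average, using the illustrative weights for “sat” and toy 2-D vectors standing in for the real 768-D token representations (the vector values are made up):

```python
tokens  = ["The", "cat", "sat", "on", "the", "mat", "."]
weights = [0.05, 0.50, 0.20, 0.10, 0.05, 0.08, 0.02]  # attention from "sat"

# Toy 2-D vectors, one per token (invented for illustration)
vectors = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5], [0.3, 0.3],
           [0.1, 0.9], [0.7, 0.1], [0.0, 0.0]]

# The weights form a probability distribution over positions
assert abs(sum(weights) - 1.0) < 1e-9

# New representation of "sat": a mix dominated by the "cat" vector
output = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(2)]
```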

Historical context: Bahdanau attention (2014)

Attention wasn’t invented for transformers. It was introduced by Bahdanau et al. in 2014 for machine translation with RNNs.

The problem: in an RNN encoder-decoder for translation, the entire source sentence is compressed into a single “context” vector. For long sentences, this bottleneck loses information. Bahdanau’s insight was to let the decoder look directly at all encoder states and compute a weighted average — attending to the relevant source words for each target word being generated.

This “alignment” model was a breakthrough. It solved the information bottleneck and made attention a standard component in sequence models. But it was still bolted onto RNNs.

The transformer’s key contribution (Vaswani et al., 2017) was realizing you could replace the RNN entirely — just use attention. No recurrence needed. That’s the “Attention Is All You Need” in the paper title.

Self-attention: queries, keys, and values

In the transformer, attention operates on the sequence itself — each token attends to every other token in the same sequence. This is called self-attention.

Here’s how it works. The input is a matrix X of shape (seq_len, d_model) — our 7 token vectors. The model computes three matrices from X:

Q = X @ W_Q  # Queries: "what am I looking for?"
K = X @ W_K  # Keys: "what do I contain?"
V = X @ W_V  # Values: "what information do I provide?"

This is critical: Q, K, and V are all computed from the same input X, using three different learned weight matrices (W_Q, W_K, W_V). They are NOT three different inputs — they are three different projections of the same input. Each projection extracts a different “view” of the data:

  • Query (Q): Represents what each token is looking for in the context
  • Key (K): Represents what each token offers to be found
  • Value (V): Represents the actual information that will be retrieved

Think of it like a search engine: the query is your search term, the keys are the page titles, and the values are the page contents. The attention score between a query and a key determines how much of the corresponding value gets included in the output.

The scaled dot-product formula

The full attention computation in one equation:

Attention(Q, K, V) = softmax( (Q × Kᵀ) / √d_k ) × V

And in code:

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]

    # Step 1: Compute similarity scores between all query-key pairs
    scores = Q @ K.transpose(-2, -1)  # Shape: (seq_len, seq_len)

    # Step 2: Scale down to prevent softmax saturation
    scores = scores / math.sqrt(d_k)

    # Step 3: Convert to probabilities (each row sums to 1)
    weights = torch.softmax(scores, dim=-1)  # Shape: (seq_len, seq_len)

    # Step 4: Weighted average of values
    output = weights @ V  # Shape: (seq_len, d_model)

    return output, weights

Let’s walk through each step:

  1. Q @ K^T: Compute the dot product between every query and every key. If query_i and key_j point in similar directions, their dot product is high → token i should attend to token j. Result: a (seq_len, seq_len) matrix of raw scores.

  2. / sqrt(d_k): Divide by the square root of the key dimension. Without this, dot products of high-dimensional vectors grow large, pushing softmax into saturation — when input scores are very large, softmax produces a near-one-hot distribution (one value near 1, all others near 0), and gradients vanish for every position except the winner, making training unstable. Scaling prevents this.

  3. softmax: Convert each row of scores into a probability distribution. Each row sums to 1. This is the attention weight matrix — the heatmap you see in the demo below.

  4. weights @ V: Multiply the attention weights by the value matrix. Each output token is now a weighted combination of all value vectors, weighted by relevance.
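The pseudocode above runs as-is once `sqrt` and `softmax` are defined. Here is a self-contained NumPy version with a numerically stable softmax, exercised on random toy inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating; softmax is shift-invariant,
    # and this avoids overflow for large scores
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(7, 64))
K = rng.normal(size=(7, 64))
V = rng.normal(size=(7, 64))

out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (7, 64)
assert np.allclose(w.sum(axis=-1), 1.0)  # every row is a probability distribution
```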

Explore attention patterns

Switch between modes to see how attention works at each level of abstraction:

Attention Heatmap

Click a row (query token) to highlight its attention distribution across keys.

[Interactive demo: a 7 × 7 attention heatmap over the tokens “The cat sat on the mat .” — one weight per query–key pair, with each row summing to 1.]

Alignment: Pre-computed weights showing how each query token aligns with key tokens. Notice "sat" attends strongly to "cat" (subject-verb alignment).

Multi-head attention: attending to different things in parallel

One attention head can only learn one kind of relationship. But language has many simultaneous relationships: syntactic (subject-verb), semantic (what-where), positional (adjacent words), and more.

Multi-head attention runs the Q/K/V attention mechanism multiple times in parallel, each with different learned weight matrices. Each “head” can learn to attend to a different type of pattern:

def multi_head_attention(X, num_heads=8):
    d_model = X.shape[-1]         # e.g., 768
    d_head = d_model // num_heads  # e.g., 96
    
    head_outputs = []
    for i in range(num_heads):    # Conceptual loop — real implementations compute all heads in parallel
        Q = X @ W_Q[i]  # Each head has its own W_Q, W_K, W_V
        K = X @ W_K[i]
        V = X @ W_V[i]
        head_out, _ = scaled_dot_product_attention(Q, K, V)
        head_outputs.append(head_out)
    
    # Concatenate all heads and project back
    concatenated = concat(head_outputs, dim=-1)  # Shape: (seq_len, d_model)
    output = concatenated @ W_O  # Final linear projection
    return output

Each head works in a smaller dimension (d_model / num_heads), so the total computation is the same as a single full-dimensional head. But the model gets 8 (or more) different “views” of the attention.
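In practice the per-head loop is vectorized: one full-width projection, then a reshape splits the result into heads so a single batched matmul computes all of them at once. A minimal NumPy sketch (toy sizes, random weights):

```python
import numpy as np

def split_heads(x, num_heads):
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    seq_len, d_model = x.shape
    return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 7, 64, 8
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))

# One full projection each, then split into heads
Q = split_heads(X @ rng.normal(size=(d_model, d_model)), num_heads)
K = split_heads(X @ rng.normal(size=(d_model, d_model)), num_heads)
V = split_heads(X @ rng.normal(size=(d_model, d_model)), num_heads)

# One batched matmul computes all 8 heads' attention at once
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
heads = weights @ V                                   # (num_heads, seq_len, d_head)

# Merge heads back: (num_heads, seq_len, d_head) -> (seq_len, d_model)
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
assert merged.shape == (seq_len, d_model)
```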

In the demo above, switch to “Per-Head” mode to see how different heads learn different patterns — one might attend to adjacent tokens, another to syntactically related words.

Attention head configurations in real models

Different models split attention across different numbers of heads and dimensions. Here’s how major models configure their multi-head attention:

Model        | Attention Heads | d_k (per head) | d_model | Attention Variant
GPT-2        | 12              | 64             | 768     | Multi-Head (MHA)
GPT-3 175B   | 96              | 128            | 12,288  | Multi-Head (MHA)
Llama 3 8B   | 32              | 128            | 4,096   | Grouped-Query (GQA)
Llama 3 70B  | 64 + 8 KV heads | 128            | 8,192   | Grouped-Query (GQA)
Qwen 2.5 72B | 64 + 8 KV heads | 128            | 8,192   | Grouped-Query (GQA)

Notice that modern models don’t always use standard multi-head attention. They use variants that trade off between quality and memory efficiency:

MHA vs MQA vs GQA

The original transformer uses Multi-Head Attention (MHA): each head has its own Q, K, and V projections. This is the most expressive variant, but during inference the KV cache grows linearly with the number of heads — a major memory bottleneck for long sequences.

Multi-Query Attention (MQA), introduced by Shazeer (2019), keeps multiple query heads but shares a single set of key and value heads across all queries. PaLM uses MQA. This dramatically reduces KV cache memory (by a factor equal to the number of heads), making inference much faster — but can slightly reduce model quality.

Grouped-Query Attention (GQA), proposed by Ainslie et al. (2023), is a middle ground. Instead of sharing one KV head across all queries (MQA) or having one KV head per query (MHA), GQA groups query heads and shares KV heads within each group. Llama 2 70B, Llama 3, Gemini, and Qwen 2.5 all use GQA — typically with 8 KV heads for 64 query heads (8 queries per group).

Variant             | KV Heads         | Models                       | KV Cache Size       | Quality
Multi-Head (MHA)    | Same as Q heads  | GPT-2, GPT-3                 | 1× (baseline)       | Best
Multi-Query (MQA)   | 1                | PaLM                         | 1/h × baseline      | Slightly lower
Grouped-Query (GQA) | Groups (e.g., 8) | Llama 2/3, Gemini, Qwen 2.5  | groups/h × baseline | Near MHA

Practical implications: the O(n²) cost of attention

Attention is an O(n²) operation — the attention matrix has shape (seq_len, seq_len). This has direct consequences for anyone using LLM APIs:

  • Doubling your prompt length quadruples the attention cost. A 4K-token prompt requires 16M attention scores per head per layer. An 8K-token prompt requires 64M — 4× more.
  • Context window pricing reflects this. API providers charge more per token for long-context inputs because the per-token compute cost increases with total sequence length, not just linearly with the number of tokens.
  • KV cache management matters. During autoregressive generation, the model caches key and value tensors to avoid recomputation. With MHA, this cache grows as 2 × num_layers × num_heads × seq_len × d_k — the factor of 2 covers both keys and values. For GPT-3 (96 layers, 96 heads, d_k=128), a 2K-token sequence requires ~9 GB of KV cache in fp16. GQA and MQA reduce this dramatically.
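A quick back-of-the-envelope calculator (a sketch: the factor of 2 counts both keys and values, and fp16 stores 2 bytes per value), comparing full MHA against a hypothetical GQA variant of the same model with 8 KV heads:

```python
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, d_k, bytes_per_value=2):
    # 2x for keys AND values; bytes_per_value=2 assumes fp16
    return 2 * num_layers * num_kv_heads * seq_len * d_k * bytes_per_value

# GPT-3 175B: 96 layers, 96 heads (MHA: every head keeps its own K and V), d_k=128
mha = kv_cache_bytes(num_layers=96, num_kv_heads=96, seq_len=2048, d_k=128)

# Same dimensions, but GQA with 8 KV heads (hypothetical — GPT-3 uses MHA)
gqa = kv_cache_bytes(num_layers=96, num_kv_heads=8, seq_len=2048, d_k=128)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB")
# MHA: 9.0 GiB, GQA: 0.75 GiB — a 12x reduction from 96 -> 8 KV heads
```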

These costs are why the industry is actively pursuing efficient attention methods like FlashAttention [5] (Dao et al., 2022), sliding window attention, and sparse attention patterns.

After multi-head attention, every token’s vector has been enriched with context from the entire sequence. But before the model can generate from these vectors, it needs rules about which tokens each position is allowed to look at — that’s the topic of section 7.


References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017).
  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv:1409.0473.
  3. Shazeer, N. (2019). “Fast Transformer Decoding: One Write-Head is All You Need.” arXiv:1911.02150.
  4. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” arXiv:2305.13245.
  5. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems 35 (NeurIPS 2022). arXiv:2205.14135
Summary: Attention is the core innovation of the transformer. It lets each token look at every other token and compute a weighted average, focusing on the most relevant context. Queries, keys, and values are three learned projections of the same input. Multi-head attention runs this process multiple times in parallel, letting the model attend to different types of relationships simultaneously. Modern models use Grouped-Query Attention (GQA) to reduce memory costs. Next: section 7 covers masking.

Masking — Controlling What the Model Can See

In the previous section, every token could attend to every other token. That’s fine for some tasks (like understanding a complete sentence), but it creates a problem for text generation.

Why masking matters for generation

When a GPT-style model generates text, it produces one token at a time, left to right. When generating the 5th token, the model should only use tokens 1–4 as context — it hasn’t generated tokens 6, 7, 8, … yet. They don’t exist.

But attention naturally looks at the entire sequence. Without intervention, the model at position 5 would “see” the token at position 6 — that’s cheating. During training, the model would learn to just copy the answer instead of learning to predict it.

The causal mask

The solution is a causal mask (also called an attention mask or look-ahead mask). It’s a binary matrix that blocks attention to future positions:

# Causal mask for sequence length 7
# True = allowed, False = blocked
mask = [
    [T, F, F, F, F, F, F],  # "The" can only see itself
    [T, T, F, F, F, F, F],  # "cat" sees "The" and itself
    [T, T, T, F, F, F, F],  # "sat" sees "The", "cat", "sat"
    [T, T, T, T, F, F, F],  # "on" sees positions 0-3
    [T, T, T, T, T, F, F],  # "the" sees positions 0-4
    [T, T, T, T, T, T, F],  # "mat" sees positions 0-5
    [T, T, T, T, T, T, T],  # "." sees all positions
]

This mask is applied to the attention scores before softmax. Blocked positions get set to negative infinity (-inf), which makes their softmax output exactly zero — effectively removing them from the weighted average.

# Applied in attention computation:
scores = Q @ K.T / sqrt(d_k)
scores[mask == False] = -infinity  # Blocked positions get -inf
weights = softmax(scores)          # -inf becomes 0 after softmax
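A runnable NumPy sketch of the same masking step, using np.tril to build the lower-triangular mask over random toy scores:

```python
import numpy as np

seq_len = 7
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # raw attention scores

# Lower-triangular boolean mask: True = allowed, False = future (blocked)
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax row by row; exp(-inf) = 0, so blocked positions get zero weight
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

assert np.all(weights[~mask] == 0)            # no attention to the future
assert np.allclose(weights.sum(axis=-1), 1.0) # each row still sums to 1
```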

See the mask in action

Toggle between masked and unmasked attention to see how the causal mask restricts what each token can see:

Causal Mask — Attention Matrix

Toggle the causal mask to see how future tokens are blocked. Position i can only attend to positions 0..i.

[Interactive demo: the same 7 × 7 attention matrix over “The cat sat on the mat .” with the causal mask toggled on — cells above the diagonal are blocked.]

The lower-triangular mask ensures each token can only attend to itself and preceding tokens. Cells above the diagonal are set to -∞ before softmax, producing zero attention weight.


Notice: with the causal mask on, the upper-right triangle is entirely blocked. Token at position i can only attend to positions 0 through i. This is what makes autoregressive generation possible — the model learns to predict the next token using only previous tokens.

Padding masks

There’s a second type of mask: padding masks. When processing batches of sequences with different lengths, shorter sequences are padded with a special [PAD] token to match the longest sequence. The padding mask tells the model to ignore these padding positions — they contain no real information.

# Sequence: "The cat sat [PAD] [PAD]"
# Padding mask: [True, True, True, False, False]

Padding masks are simpler than causal masks, but both serve the same purpose: restricting attention to meaningful positions.

Context window: why N × N matters

Here’s a practical consequence of attention that every programmer should understand.

The attention matrix has shape (N, N) where N is the sequence length. Every token computes a score against every other token. This means:

  • Memory: Storing the attention matrix requires O(N²) memory
  • Compute: Computing all the scores requires O(N²) operations
  • Doubling the sequence length quadruples the cost

This is why LLM APIs have context window limits. A model with a 4k context window allocates memory for a 4,096 × 4,096 attention matrix (per head, per layer). A 128k context window needs a matrix 32× wider and 32× taller — that’s over 1,000× more memory and compute.

Context Length | Attention Matrix Size     | Relative Cost
2,048          | 2,048 × 2,048 = 4.2M      | 1×
8,192          | 8,192 × 8,192 = 67M       | 16×
32,768         | 32,768 × 32,768 = 1.07B   | 256×
131,072        | 131,072 × 131,072 = 17.2B | 4,096×
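The table values can be reproduced in a few lines of plain Python — cost relative to the 2,048-token baseline grows as the square of the length ratio:

```python
base = 2_048
for n in (2_048, 8_192, 32_768, 131_072):
    entries = n * n                        # attention matrix entries
    relative = (n // base) ** 2            # cost relative to the baseline
    print(f"{n:>7,} tokens -> {entries:>14,} entries ({relative:>5,}x)")
```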

This is why:

  • Longer API calls cost more and are slower
  • Models have a maximum token limit (and reject inputs that exceed it)
  • Researchers actively work on sub-quadratic attention variants (sparse attention, linear attention, sliding window)

We’ll revisit context costs in the inference section. For now, understand that the N × N attention matrix is both the transformer’s superpower (any token can attend to any other) and its fundamental bottleneck.

Context windows in real models

The maximum context window determines how much text a model can process at once. This limit has expanded rapidly — but the quadratic cost of attention means longer windows require significant engineering effort (efficient attention kernels, KV cache compression, sliding window techniques):

Model             | Context Window  | Release Year | Notes
GPT-3.5 Turbo     | 4K / 16K        | 2023         | 16K variant added mid-2023
GPT-4             | 8K / 32K / 128K | 2023–2024    | 128K via “Turbo” variant
GPT-4.1           | 1M              | 2025         | First OpenAI model to reach 1M tokens
Claude 3.5 Sonnet | 200K            | 2024         | Largest commercial window at launch
Claude 4.5 Sonnet | 1M              | 2025         | Extended to 1M tokens
Gemini 1.5 Pro    | 1M / 2M         | 2024         | 2M in research preview; 1M in production
Gemini 2.5 Pro    | 1M              | 2025         | Maintains 1M with improved long-context recall
Llama 3.1         | 128K            | 2024         | Extended from Llama 3’s 8K via continued training
Llama 4 Scout     | 10M             | 2025         | Largest context window of any open model; 16-expert MoE
Qwen 2.5          | 32K / 128K      | 2024         | 128K for larger variants

Practical implications: working with context limits

Understanding context windows is essential for building LLM-powered applications:

  • API cost scales with context. Most providers charge per input token, but the compute cost per token increases for longer prompts due to the O(n²) attention matrix. Some providers charge a premium rate beyond a base context length (e.g., OpenAI charges more per token for GPT-4 32K than GPT-4 8K).
  • Stuffing the context window isn’t free. Even when a model supports 128K tokens, filling that window is expensive and slow. In practice, retrieval-augmented generation (RAG) — fetching only relevant chunks — is often cheaper and faster than passing entire documents.
  • Long context ≠ perfect recall. Research shows models struggle with information in the “middle” of long contexts (the “lost in the middle” phenomenon). Critical information should be placed at the beginning or end of the prompt.
  • When to use long vs. short context: Use long context for tasks requiring holistic understanding (full document summarization, cross-referencing multiple sections). Use short context with RAG for needle-in-a-haystack tasks (finding specific facts in large corpora).

With attention and masking covered, we have the core mechanism. Next, we’ll add the remaining components that make up a complete transformer block in section 8.


References

  1. OpenAI. (2023). “GPT-4 Technical Report.” arXiv:2303.08774.
  2. Anthropic. (2024). “The Claude 3 Model Family.” Anthropic blog.
  3. Google DeepMind. (2024). “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.” arXiv:2403.05530.
  4. Meta AI. (2024). “The Llama 3 Herd of Models.” arXiv:2407.21783.
  5. Alibaba Cloud. (2024). “Qwen2.5 Technical Report.” arXiv:2412.15115.
  6. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172.
  7. OpenAI. (2025). “Introducing GPT-4.1.” OpenAI blog.
  8. Meta AI. (2025). “Introducing Llama 4.” Meta AI blog.
Summary: Masking restricts which positions each token can attend to. Causal masks prevent the model from looking ahead — essential for autoregressive generation. The attention matrix is N × N, which means compute and memory scale quadratically with sequence length. This is why context windows have limits. Next: section 8 covers FFN, residuals, and layer normalization.

Feed-Forward Networks, Residuals, and Layer Normalization

Attention is the star of the transformer, but it’s not the only component. After attention, each token’s vector goes through a feed-forward network, protected by residual connections and layer normalization. These three ideas are simpler than attention, but without them the transformer wouldn’t train.

The feed-forward network (FFN)

After attention has mixed information across tokens, the FFN processes each token individually — same weights applied to every position, but no communication between positions.

The FFN is just two linear transformations with an activation function in between:

def feed_forward(x, d_model=768, d_ff=3072):
    # Step 1: Expand to a wider dimension (typically 4x)
    hidden = x @ W1 + b1        # Shape: (d_model,) -> (d_ff,)
    
    # Step 2: Non-linear activation
    hidden = gelu(hidden)        # Element-wise, keeps shape (d_ff,)
    
    # Step 3: Project back to original dimension
    output = hidden @ W2 + b2    # Shape: (d_ff,) -> (d_model,)
    
    return output

Why expand to 4× and then compress back? The expansion creates a higher-dimensional space where the model can learn more complex transformations. Think of it as the model “thinking” in a larger space before compressing back to the standard representation.

The GELU activation function (Gaussian Error Linear Unit) is like ReLU but smoother — it allows a small amount of negative values through, which helps training stability.
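For the curious, the tanh approximation of GELU (the variant GPT-2 uses) is short enough to write out. Note how small negative inputs pass through slightly attenuated rather than being zeroed the way ReLU would:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.round(gelu(x), 3))
# Large negatives are squashed near zero, gelu(-1) is a small negative value,
# and large positives pass through almost unchanged
```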

Residual connections: the skip connection from ResNet

Historical context: Residual connections were introduced by He et al. in 2015 for ResNet, a deep image recognition network. The problem they solved: very deep networks (50+ layers) were harder to train than shallow ones, even though they should be strictly more powerful. Gradients would vanish or explode as they traveled backward through many layers.

The fix is elegantly simple: add the input to the output.

# Without residual: output = f(x)
# With residual:    output = x + f(x)

This means the layer only needs to learn the difference between input and output (the “residual”), not the full transformation. If the layer can’t improve on the input, it can learn to output near-zero, and x + 0 ≈ x — the input passes through unchanged. This makes the network easy to optimize even with 100+ layers.

In the transformer, residual connections wrap both the attention sublayer and the FFN:

# After attention:
x = x + attention(x)

# After FFN:
x = x + feed_forward(x)

Layer normalization

Historical context: Layer normalization (Ba et al., 2016) was developed as an alternative to batch normalization. Batch norm normalizes across the batch dimension (all examples in a mini-batch), which is problematic for sequences of varying length. Layer norm normalizes across the feature dimension (all values in one vector), which works regardless of batch or sequence size.

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)      # per-vector mean over features
    variance = x.var(axis=-1, keepdims=True)   # per-vector variance
    return (x - mean) / sqrt(variance + eps)

Layer norm keeps the values in each vector centered and scaled, preventing them from drifting to extreme values as they pass through many layers. This is critical for training stability.

In the transformer, layer norm is applied after each residual addition:

x = layer_norm(x + attention(x))
x = layer_norm(x + feed_forward(x))

Step through the FFN pipeline

Walk through how a single token’s vector transforms through FFN, residual add, and layer norm:

Feed-Forward Network with Residual Connection

Step through the FFN sublayer: expand → activate → project → residual add → normalize.

[Interactive demo: an 8-dimensional token vector stepping through the FFN sublayer — input → expand → activate → project → residual add → normalize, with a toggle for the residual connection.]

The input token vector (8 dimensions). This is the output of the attention sublayer.

Try toggling the residual connection off to see how dramatically the output changes without the skip connection. With residuals, the output stays close to the input — the network makes a small, controlled adjustment rather than a drastic transformation.

The three ideas together

These aren’t flashy innovations. Each one is simple:

  • FFN: Two matrix multiplies with an activation. Processes each token independently.
  • Residual connections: output = input + sublayer(input). Keeps gradients flowing.
  • Layer normalization: Normalize each vector to zero mean and unit variance. Keeps values stable.

But together, they’re what makes it possible to stack the transformer block many times. Without them, a 96-layer model like GPT-3 would be impossible to train.

FFN variants in modern models

The original transformer uses the standard FFN described above — two linear layers with a ReLU (or GELU) activation. Modern architectures have introduced more sophisticated FFN variants:

FFN Variant              | Activation                | Used By                                                                     | Key Difference
Standard FFN             | ReLU / GELU               | Original Transformer, GPT-2, GPT-3                                          | Two linear layers with activation
SwiGLU                   | Swish + Gated Linear Unit | Llama 2/3/4, Mistral, Qwen 2.5/3                                            | Gating mechanism adds expressiveness; ~2/3 hidden dim to match param count
MoE (Mixture of Experts) | Varies (often SwiGLU)     | Mixtral 8x7B, DeepSeek-V3, Llama 4 Scout/Maverick, Qwen 3, Mistral Large 3  | Multiple FFN “experts”; router selects top-k per token

SwiGLU (Shazeer, 2020) replaces the single activation with a gated mechanism: the input is projected twice, one path goes through a Swish activation, and the two are multiplied element-wise. This produces richer representations. To keep the parameter count comparable, SwiGLU models typically use ⅔ of the hidden dimension — e.g., Llama 3 8B uses d_ff = 14,336 (not 4× d_model = 16,384).
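A minimal NumPy sketch of a SwiGLU FFN (toy dimensions; W_gate, W_up, and W_down are illustrative names standing in for the learned matrices):

```python
import numpy as np

def swish(x):
    # a.k.a. SiLU: x * sigmoid(x)
    return x / (1 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Two parallel projections; the swish-activated "gate" path scales
    # the "up" path element-wise before projecting back down
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 64, 172  # d_ff ~ 2/3 of 4*d_model, per the text
x = rng.normal(size=(1, d_model))

out = swiglu_ffn(x,
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_model, d_ff)),
                 rng.normal(size=(d_ff, d_model)))
assert out.shape == (1, d_model)  # same in/out shape as a standard FFN
```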

Mixture of Experts (MoE) replaces the single FFN with multiple “expert” FFNs plus a learned router that selects the top-k experts for each token (typically k=2). This lets models have massive total parameter counts while only activating a fraction per token — Mixtral 8x7B has 47B total parameters but activates only ~13B per token, so its per-token inference cost is comparable to a ~13B dense model while its quality competes with much larger dense models. The approach has scaled dramatically: DeepSeek-V3 (2024) uses 256 experts with top-8 routing across 671B total parameters but activates only 37B per token, and Llama 4 Maverick (2025) uses 128 experts across 400B parameters.
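A toy NumPy sketch of top-k routing. The experts here are plain linear maps — stand-ins for full FFNs — and the router is a single matrix (all names and sizes are illustrative):

```python
import numpy as np

def moe_layer(x, experts, router_W, k=2):
    # Route the token to its top-k experts; combine their outputs
    # weighted by a softmax over the selected router logits
    logits = x @ router_W                   # one score per expert
    top_k = np.argsort(logits)[-k:]         # indices of the k best experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                    # softmax over the selected k
    return sum(g * experts[i](x) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d_model, num_experts = 16, 8

# Toy "experts": each is just a linear map here, not a full FFN
weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
experts = [lambda x, W=W: x @ W for W in weights]
router_W = rng.normal(size=(d_model, num_experts))

x = rng.normal(size=(d_model,))
out = moe_layer(x, experts, router_W, k=2)  # only 2 of the 8 experts run
assert out.shape == (d_model,)
```

The key property: the parameter count grows with the number of experts, but per-token compute grows only with k.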

Normalization variants

Like FFN variants, normalization has also evolved:

Variant   | Operation              | Used By                          | Key Difference
Post-Norm | norm(x + sublayer(x))  | Original Transformer (2017)      | Norm after residual add
Pre-Norm  | x + sublayer(norm(x))  | GPT-2, GPT-3, most modern models | Norm before sublayer; more stable training
RMSNorm   | Root mean square only  | Llama 2/3, Gemma, Qwen 2.5       | Drops mean centering; ~30% faster than LayerNorm in practice (hardware-dependent)

The original transformer applies normalization after the residual addition (Post-Norm). GPT-2 and later models switched to Pre-Norm — normalizing the input before feeding it into the sublayer — which produces more stable gradients during training and enables training deeper models.

RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean-centering step, computing only the root mean square. This is computationally cheaper — in practice, benchmarks on memory-bandwidth-bound hardware have shown roughly ~30% wall-clock speedup over LayerNorm, though the exact gain is hardware- and implementation-dependent — with negligible quality impact. Nearly all modern open-weight models (Llama, Gemma, Qwen) use RMSNorm.

Practical implications

These variant choices have direct engineering consequences:

  • MoE models give you more capability per FLOP. Mixtral 8x7B matches or exceeds Llama 2 70B quality on many benchmarks, despite being roughly 5x faster at inference. DeepSeek-V3 takes this further with 256 experts and 671B total parameters while activating only 37B — achieving frontier-level performance at a fraction of the training cost. However, MoE models require more total memory (all experts must be loaded even though only a subset runs), making them less suitable for memory-constrained deployment.
  • SwiGLU is now standard. If you’re fine-tuning or building on a modern base model, expect SwiGLU FFNs. The gating mechanism makes the hidden dimension calculation non-obvious — check the model config for the actual d_ff.
  • RMSNorm’s ~30% speedup compounds. In a 96-layer model with 2 norm operations per layer, that’s 192 norm operations per forward pass. A ~30% speedup on each (in memory-bandwidth-bound scenarios) adds up to meaningful latency reduction at inference time.

In the next section, we’ll assemble these pieces into the complete transformer block in section 9.


References

  1. He, K., Zhang, X., Ren, S., & Sun, J. (2015). “Deep Residual Learning for Image Recognition.” arXiv:1512.03385.
  2. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). “Layer Normalization.” arXiv:1607.06450.
  3. Shazeer, N. (2020). “GLU Variants Improve Transformer.” arXiv:2002.05202.
  4. Zhang, B., & Sennrich, R. (2019). “Root Mean Square Layer Normalization.” Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
  5. Fedus, W., Zoph, B., & Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” Journal of Machine Learning Research, 23(120), 1-39.
  6. DeepSeek-AI. (2024). “DeepSeek-V3 Technical Report.” arXiv:2412.19437.
  7. Meta AI. (2025). “Introducing Llama 4.” Meta AI blog.
Summary: After attention, each token's vector passes through a feed-forward network that transforms it independently. Residual connections (from ResNet) add the input back to the output, keeping gradients healthy in deep networks. Layer normalization stabilizes the values. These three small ideas — with modern variants like SwiGLU and RMSNorm — make it possible to stack dozens of transformer blocks. Next: section 9 assembles the complete block.

The Complete Transformer Block — One Layer, Then Many

We’ve now covered every component. Let’s assemble them.

One transformer block

A single transformer block (also called a “layer”) applies four operations in sequence:

def transformer_block(x):
    # Sub-layer 1: Multi-head attention with residual + norm
    attn_output = multi_head_attention(x)
    x = layer_norm(x + attn_output)
    
    # Sub-layer 2: Feed-forward network with residual + norm
    ffn_output = feed_forward(x)
    x = layer_norm(x + ffn_output)
    
    return x  # Same shape as input: (seq_len, d_model)

That’s the whole block. Input shape: (seq_len, d_model). Output shape: (seq_len, d_model). The shapes are identical. This is crucial — it’s what makes stacking possible.

Explore the block assembly

See how the components connect, and use the slider to stack multiple layers:

Transformer Block Assembly

See how data flows through the transformer block: Attention → Add & Norm → FFN → Add & Norm. Stack layers to see the full architecture.

[Interactive demo: Input Embedding → Layer 1 (Multi-Head Attention → Add & Norm → Feed-Forward Network → Add & Norm) → Layer 2 (same four steps) → Output, with an "Animate Flow" control.]

2 transformer blocks stacked sequentially. Each layer’s output becomes the next layer’s input. GPT-2 small uses 12 layers; GPT-3 uses 96. Click "Animate Flow" to watch data propagate through the architecture.


The hidden dimension stays constant

A key architectural property that’s easy to miss: d_model stays the same through every layer. The input to layer 1 is (seq_len, 768). The output of layer 1 is (seq_len, 768). The output of layer 96 is still (seq_len, 768).

This means:

  • You can stack any number of blocks without shape mismatches
  • Each block refines the same representation, like successive passes of a compiler optimization
  • The model doesn’t “compress” or “expand” the representation — it transforms it in place

Stacking: one layer, then many

Real models stack this block many times:

def transformer(input_tokens):
    x = embed(input_tokens) + positional_encoding
    
    for layer in range(num_layers):
        x = transformer_block(x)
    
    return x  # Still (seq_len, d_model)

What do different layers learn?

  • Early layers (first ~15–25% of depth): Syntax, local patterns, part-of-speech, adjacent word relationships
  • Middle layers (middle ~50% of depth): Semantic meaning, entity tracking, co-reference resolution
  • Late layers (final ~25–35% of depth): Abstract reasoning, task-specific behavior, prediction

This isn’t a strict separation — research shows these categories overlap — but the general pattern holds: complexity builds with depth.

The output head: from vectors to words

After all transformer blocks, we have a matrix of shape (seq_len, d_model) — one vector per token position. But we need to predict the next token — an integer from the vocabulary. This is the job of the output head (sometimes called the “unembedding” layer).

# The output head
logits = x[-1] @ W_output  # Shape: (d_model,) -> (vocab_size,)
# W_output is often the transpose of the embedding table (weight tying)

# Convert logits to probabilities
probs = softmax(logits)  # Shape: (vocab_size,) — sums to 1

# Select the next token
next_token = sample(probs)  # e.g., argmax or temperature sampling

What are logits? Raw, unnormalized scores — one per vocabulary entry. A logit of 5.2 for token “cat” means the model is fairly confident about “cat.” A logit of -3.1 for “banana” means the model thinks “banana” is unlikely.

Softmax converts these raw scores into a probability distribution. All values become positive and sum to 1.

Weight tying: In many models, W_output is the transpose of the embedding table from Section 4. This means the same matrix that maps token IDs to vectors also maps vectors back to token IDs. This halves the vocabulary-related parameter count and encourages consistency between “understanding a word” and “predicting a word.”

Vocabulary size matters: W_output has shape (d_model, vocab_size). GPT-2 uses 50,257 tokens; modern models use 100k–150k. This single matrix can have 100M+ parameters.
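A NumPy sketch of the output head with weight tying (toy vocabulary size; the embedding matrix doubles as the transposed W_output):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 32

# Weight tying: one matrix serves as both the embedding table
# (token id -> vector) and the output head (vector -> logits)
embedding = rng.normal(size=(vocab_size, d_model))

x_last = rng.normal(size=(d_model,))  # final hidden state of the last position
logits = x_last @ embedding.T         # (vocab_size,) raw, unnormalized scores

# Numerically stable softmax -> probability distribution over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

assert probs.shape == (vocab_size,)
assert np.isclose(probs.sum(), 1.0)
next_token = int(np.argmax(probs))    # greedy decoding picks the top token
```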

By the numbers

Here’s what these abstractions look like in real models:

Model         | Layers | d_model | Heads | Parameters        | Vocab Size | Context Length
GPT-2 Small   | 12     | 768     | 12    | 117M              | 50,257     | 1,024
GPT-2 XL      | 48     | 1,600   | 25    | 1.5B              | 50,257     | 1,024
GPT-3 175B    | 96     | 12,288  | 96    | 175B              | 50,257     | 2,048
Llama 3 8B    | 32     | 4,096   | 32    | 8B                | 128,256    | 128K
Llama 3 70B   | 80     | 8,192   | 64    | 70B               | 128,256    | 128K
Llama 4 Scout | 48     | 5,120   | 40    | 109B (17B active) | 202,048    | 10M
DeepSeek-V3   | 61     | 7,168   | 128   | 671B (37B active) | 129,280    | 128K
Qwen 2.5 72B  | 80     | 8,192   | 64    | 72B               | 152,064    | 128K

Notice the pattern: more layers, wider embeddings, more heads, larger vocabulary, longer context. The architecture is the same — just scaled up.

A few things to note comparing the older and newer generations:

  • Vocabulary size has grown dramatically. GPT-2 and GPT-3 use ~50K tokens. Llama 3 uses 128K, Qwen 2.5 uses 152K, and Llama 4 uses 202K. Larger vocabularies improve tokenization efficiency (fewer tokens per word, especially for multilingual text) but increase embedding table size.
  • Context has exploded. GPT-2’s 1K and GPT-3’s 2K context windows were limiting. Modern models support 128K tokens — enough for entire books — enabled by RoPE (rotary positional encoding), efficient attention kernels, and continued pre-training on longer sequences. Llama 4 Scout pushes this to 10M tokens using a combination of MoE efficiency and advanced attention mechanisms.
  • MoE decouples total and active parameters. DeepSeek-V3 has 671B total parameters but only activates 37B per token. Llama 4 Scout has 109B total but activates only 17B. This is why the “Parameters” column now shows both figures — the active parameter count determines inference speed, while the total count determines model capacity.
  • Parameter counts have plateaued for dense open models. The arms race from 117M to 175B has given way to a focus on training efficiency and MoE architectures. Llama 3 8B outperforms GPT-3 175B on many benchmarks despite being 22x smaller — better training data and techniques matter more than raw size.

With the complete block assembled, we can now look at the different ways transformers are arranged at the macro level — and then trace our running example through the full pipeline in section 11.


References

  1. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). “Language Models are Unsupervised Multitask Learners.” OpenAI blog.
  2. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al. (2020). “Language Models are Few-Shot Learners.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
  3. Meta AI. (2024). “The Llama 3 Herd of Models.” arXiv:2407.21783.
  4. DeepSeek-AI. (2024). “DeepSeek-V3 Technical Report.” arXiv:2412.19437.
  5. Meta AI. (2025). “Introducing Llama 4.” Meta AI blog.

Summary: A transformer block is: multi-head attention → add & norm → FFN → add & norm. Stack this block dozens or hundreds of times, and you get a language model. Early layers learn syntax and local patterns; later layers learn semantics and abstraction. The hidden dimension stays constant through all layers, and the output head projects the final vectors to vocabulary probabilities. Next: section 10 covers architectures.

From Encoder-Decoder to Decoder-Only

We’ve now built up every component of a transformer block. But there are different ways to arrange these blocks at the macro level. Let’s look at the three main architectures.

The original: encoder-decoder (2017)

The transformer was invented for machine translation — converting “The cat sat on the mat” (English) into “Le chat s’est assis sur le tapis” (French).

The original design has two stacks of transformer blocks:

Encoder (6 layers):
  Input: Source sentence tokens (English)
  Attention: Bidirectional — every token sees every other token
  Output: Context-rich representations of the source

Decoder (6 layers):
  Input: Target sentence tokens generated so far (French)
  Attention 1: Causal self-attention (can only look at previous target tokens)
  Attention 2: Cross-attention (target tokens attend to encoder output)
  Output: Predictions for next target token

The encoder builds a rich representation of the source. The decoder generates the target one token at a time, attending to both the previous target tokens (self-attention) and the source representation (cross-attention).

Models using this architecture: the original Transformer [1], T5 [3], BART, mBART.

Encoder-only: BERT (2018)

BERT [2] dropped the decoder entirely. It uses only the encoder stack with bidirectional attention — every token sees every other token in both directions.

Encoder (12-24 layers):
  Input: Full text
  Attention: Bidirectional — no masking
  Training: Masked language model (predict hidden tokens)
  Output: Rich representations for understanding

BERT is trained to predict randomly masked tokens — “The [MASK] sat on the mat” → predict “cat.” This forces the model to understand context in all directions.

BERT is excellent for understanding tasks: classification, named entity recognition, question answering. But it can’t generate text — it has no autoregressive mechanism.

Models using this architecture: BERT [2], RoBERTa, DeBERTa, ALBERT, ELECTRA.

Decoder-only: GPT (2018)

GPT [4] took the opposite approach: only the decoder, with causal self-attention.

Decoder (12-96+ layers):
  Input: Text sequence
  Attention: Causal — each token sees only previous tokens
  Training: Next-token prediction
  Output: Probability of next token

No encoder, no cross-attention. In fact, GPT-style decoders omit both the encoder and the cross-attention sublayer that was present in the original transformer decoder. Each GPT block has just two sublayers — self-attention and FFN — versus the three in the original decoder (self-attention + cross-attention + FFN). This simplification is a deliberate design choice: without an encoder, there’s nothing to cross-attend to.

This is what GPT-1/2/3/4, Claude, Gemini, LLaMA 2/3, Qwen, DeepSeek, and essentially all modern LLMs use.

Why decoder-only won

Three reasons:

  1. Simplicity: One stack instead of two. Fewer architectural decisions. Easier to scale.

  2. Universality: Next-token prediction turns out to be a remarkably general objective. By predicting text, the model implicitly learns to understand it, reason about it, and generate it. One architecture does everything.

  3. Scaling: When you double the parameters, decoder-only models consistently get better at all tasks. This scaling predictability made the bet on “just make it bigger” viable.

Encoder-decoder models still exist for specific tasks (translation, summarization), and encoder-only models are great for embeddings and classification. But for the general-purpose AI assistants that programmers interact with daily, decoder-only is the dominant paradigm.

Architecture comparison at a glance

Architecture | Example Models | Attention | Best For
Encoder-Decoder | Original Transformer [1], T5 [3], BART | Bidirectional (encoder) + Causal + Cross (decoder) | Translation, summarization
Encoder-Only | BERT [2], RoBERTa, DeBERTa | Bidirectional | Classification, embeddings, NER
Decoder-Only | GPT-1/2/3/4 [4], LLaMA, Claude, Gemini | Causal (masked) | General-purpose generation, reasoning, chat

Practical implications

  • If you need embeddings or classification: Encoder-only models (BERT, sentence-transformers) are still the best choice — they’re faster, cheaper, and produce better representations for retrieval tasks.
  • If you need text generation: Decoder-only is the default. Every major LLM API (OpenAI, Anthropic, Google) uses decoder-only models.
  • If you need structured input→output mapping: Encoder-decoder models like T5 can be effective for tasks with clear input/output structure (translation, structured extraction), though decoder-only models have largely subsumed these tasks at scale.

References

  1. Vaswani, A., et al. “Attention Is All You Need.” NeurIPS, 2017. arXiv:1706.03762
  2. Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL, 2019. arXiv:1810.04805
  3. Raffel, C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” JMLR, 2020. arXiv:1910.10683
  4. Radford, A., et al. “Improving Language Understanding by Generative Pre-Training.” OpenAI, 2018. PDF

In the next section, we’ll trace our running example through a decoder-only transformer — start to finish.

Summary: The original transformer used an encoder-decoder architecture for translation. BERT uses encoder-only for understanding. GPT uses decoder-only for generation. Decoder-only won for modern LLMs because it's simpler, scales better, and does everything well enough — generating, understanding, and reasoning all emerge from next-token prediction. In the next section, we'll trace our running example through a complete decoder-only transformer from start to finish.

Walkthrough — “The cat sat on the mat.” Through a GPT-Style Transformer

You’ve learned every piece of the transformer. Now let’s see them all work together. We’ll trace “The cat sat on the mat.” through a GPT-style decoder-only transformer, from raw text to next-token prediction, step by step.

Use the interactive demo to walk through each stage:

Transformer Walkthrough: Decoder-Only (GPT-style)

Step through each stage of a GPT-style transformer processing "The cat sat on the mat."


Stage 1: Tokenization

The sentence enters as a string. The tokenizer splits it into subword tokens and maps each to an integer ID:

"The cat sat on the mat."
→ ["The", " cat", " sat", " on", " the", " mat", "."]
→ [464, 3797, 3332, 319, 262, 2603, 13]

Seven tokens. This sequence of integers is all the model will ever see — it has no access to the original characters.
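The real GPT-2 tokenizer applies ~50K learned BPE merges, but for this sentence its behavior looks like a greedy longest-match lookup. A toy sketch with a hand-picked mini-vocabulary (the IDs below are the real GPT-2 IDs for these seven strings, but everything else about a production tokenizer is omitted):

```python
# Hand-picked mini-vocabulary. A real tokenizer has ~50K entries and applies
# learned BPE merges rather than simple longest-match lookup.
VOCAB = {"The": 464, " cat": 3797, " sat": 3332, " on": 319,
         " the": 262, " mat": 2603, ".": 13}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest slice that matches a vocabulary entry at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return ids

ids = toy_tokenize("The cat sat on the mat.", VOCAB)
# → [464, 3797, 3332, 319, 262, 2603, 13]
```

Note how the leading space is part of the token (" cat", not "cat"): this is why token counts rarely equal word counts.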

Stage 2: Embedding lookup

Each token ID indexes into the embedding table, producing a 768-dimensional vector (we use 8 dimensions in the demo for visibility). The result is a matrix of shape (7, 768).

These vectors carry semantic meaning that emerged from training — tokens that appear in similar contexts end up with similar vectors. But they don’t yet encode position.

Stage 3: Add positional encoding

Each token’s embedding gets a position-dependent vector added to it. Now the model knows that “cat” is at position 1 and “mat” is at position 5. The matrix is still (7, 768), but each vector now encodes both meaning and position.
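Stages 2 and 3 are just table lookups plus vector addition. A minimal sketch at the demo's 8 dimensions, with random numbers standing in for trained weights (real models learn these tables during training):

```python
import random

random.seed(0)
d_model, vocab_size, max_pos = 8, 50_257, 1024

# Stand-ins for trained weight tables: one row per token ID / per position.
embedding_table = [[random.gauss(0, 0.02) for _ in range(d_model)]
                   for _ in range(vocab_size)]
position_table = [[random.gauss(0, 0.02) for _ in range(d_model)]
                  for _ in range(max_pos)]

token_ids = [464, 3797, 3332, 319, 262, 2603, 13]  # "The cat sat on the mat."

# Stage 2: embedding lookup — a (7, 8) matrix, one row per token.
x = [embedding_table[t] for t in token_ids]

# Stage 3: add the positional vector for each slot — still (7, 8),
# but each row now encodes both meaning and position.
x = [[e + p for e, p in zip(x[i], position_table[i])]
     for i in range(len(token_ids))]
```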

Stage 4: Causal self-attention

This is where the magic happens. Each token computes attention over all previous tokens (including itself), but NOT future tokens — the causal mask ensures this.

When processing “sat” (position 2):

  • Can attend to: “The” (pos 0), “cat” (pos 1), “sat” (pos 2)
  • Cannot attend to: “on” (pos 3), “the” (pos 4), “mat” (pos 5), ”.” (pos 6)

The model computes Q, K, and V as three projections of the same input X, applies scaled dot-product attention with the causal mask, and runs this in parallel across 12+ heads. Each head learns a different attention pattern.

After attention, each token’s vector has been enriched with context from all allowed positions. “sat” now carries information about “cat” (its subject).
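A single head of causal self-attention, in plain Python at the demo's 8 dimensions. Random matrices stand in for the learned projection weights; the structure (project, mask, softmax, weighted sum) is the real mechanism:

```python
import math
import random

random.seed(1)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Scores only over positions 0..i — the causal mask hides the future.
        scores = [sum(qd * kd for qd, kd in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * V[j][d] for j, w in enumerate(weights))
                    for d in range(d_k)])
    return out

d = 8
rand_mat = lambda: [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(7)]  # 7 token vectors
Y = causal_attention(X, rand_mat(), rand_mat(), rand_mat())     # still (7, 8)
```

A real model runs 12+ copies of this with different learned weights and concatenates the results (multi-head attention).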

Stage 5: Feed-forward network

Each token’s vector independently passes through the FFN:

  • Expand: 768 → 3072 (4× wider)
  • GELU activation
  • Project back: 3072 → 768

Residual connections add the input back (x + ffn(x)), and layer normalization keeps the values stable.

Stage 6: Repeat for many layers

This is one transformer block. Real models repeat it:

  • GPT-2: 12 times
  • GPT-3: 96 times
  • GPT-4: undisclosed (estimates suggest ~120)

Each layer refines the representation. Early layers capture syntax and local patterns. Later layers capture semantics, entity tracking, and reasoning. The shapes never change — every block takes (7, 768) and outputs (7, 768).

Stage 7: Output projection

After all layers, we take the vector at the last position (position 6, ”.” — the final token in our input) and project it to vocabulary size:

logits = final_vector @ W_output  # (768,) → (50257,)
probs = softmax(logits)           # 50257 probabilities summing to 1

The model produces a probability for every token in its vocabulary. We are predicting the 8th token — what comes after the complete 7-token input. The period in our input (token 6) is already consumed; the model is now asking “given this full sentence, what comes next?” The highest-probability tokens might be:

Token | Probability
. | 0.31
It | 0.12
The | 0.09
\n | 0.07
He | 0.04

The predicted ”.” here is GPT-2 token 13, which appears frequently after one sentence ends before another begins in the training data (e.g., “The cat sat on the mat. The dog lay on the rug.”). It is not the period already in the input — that period was the 7th input token, part of the sequence the model was given.

Stage 8: Sample the next token

The model selects ”.” as the predicted next token (highest probability). This token is appended to the sequence, producing 8 tokens total.

Stage 9-11: The autoregressive loop

Now the model has: ["The", " cat", " sat", " on", " the", " mat", ".", "."]

It runs the entire pipeline again — embed, add positions, attention (now an 8×8 matrix), FFN, repeat layers — and predicts the next token after the new ".". Let's say it predicts " It".

Now the model has 9 tokens. It runs again, predicts " was". Then 10 tokens, predicts " a". And so on:

"The cat sat on the mat."          → predict "."
"The cat sat on the mat.."         → predict " It"
"The cat sat on the mat.. It"      → predict " was"
"The cat sat on the mat.. It was"  → predict " a"
...

Each step requires a full forward pass through all layers. Each step’s attention matrix grows by one row and one column (8×8, then 9×9, then 10×10…). This is why generation is inherently sequential and why longer outputs are slower.

We’ll cover the generation loop, KV caching, and performance implications in detail in the inference section.

The connection to your experience

This exact process happens every time you send a message to ChatGPT, ask Claude a question, or accept a Copilot suggestion. The same pipeline: tokenize → embed → add positions → attention + FFN × N layers → project to vocabulary → sample → repeat.

The only differences at scale:

  • More layers: 96–120 instead of our illustrative 3
  • Wider vectors: 8,192–12,288 dimensions instead of our 8
  • More heads: 64–96 attention heads
  • Bigger vocabulary: 100k–150k tokens instead of 50k
  • Longer context: 8k–128k tokens instead of 7
  • Billions of parameters: The weight matrices are enormous

But the architecture is identical. You now understand the fundamental mechanism behind every modern language model.

Summary: This is the payoff. Every concept from Parts II–IV applied to a real example: tokenize 'The cat sat on the mat.' → embed → add positions → masked self-attention → FFN → repeat layers → project to vocabulary → predict '.' → append and repeat. This same process runs every time you use ChatGPT, Claude, or Copilot. Next, we'll look at how these models learn to do this through pre-training.

Pre-training — How Transformers Learn Language at Scale

You now understand the transformer’s architecture — what happens during a single forward pass. But how does the model learn to do this well? How do random weight matrices turn into a system that understands language?

The training objective: next-token prediction

Pre-training is conceptually simple. Take an enormous corpus of text (books, websites, code, conversations), and for every position in every document, ask the model: “Given the tokens so far, what comes next?”

# Pre-training pseudocode
for document in training_corpus:      # Billions of documents
    tokens = tokenize(document)
    for i in range(1, len(tokens)):
        input_sequence = tokens[:i]   # Everything before position i
        target = tokens[i]            # The actual next token
        
        predicted_probs = model(input_sequence)  # Forward pass
        loss = cross_entropy(predicted_probs, target)
        update_weights(loss)          # Make the model slightly better

That’s it. No labeled data, no human annotations, no explicit grammar rules. Just “predict the next word” on an internet-scale dataset. The model sees:

  • “The cat sat on the” → should predict “mat” (or “floor”, “rug”, etc.)
  • “def fibonacci(n):” → should predict “\n” or “return” or something syntactically valid
  • “The capital of France is” → should predict “Paris”

Through billions of these examples, the model implicitly learns syntax, semantics, facts, reasoning patterns, code structure, and more. It doesn’t memorize rules — it learns statistical structure.

What the model gains from pre-training

After pre-training on trillions of tokens, the model develops:

  • Syntax: It knows that “The cat” is more likely than “Cat the”
  • Semantics: It knows that “Paris” follows “The capital of France is”
  • World knowledge: It has absorbed facts from the training data
  • Reasoning patterns: It can follow chains of logic (to a degree)
  • Code understanding: It knows Python syntax, common patterns, library APIs
  • Multilingual ability: If trained on multiple languages, it can transfer between them

None of this was explicitly programmed. It all emerged from next-token prediction at scale.

How learning actually works: loss, gradients, and backpropagation

A programmer might wonder: “OK, it predicts tokens. But how does it get better?” Here’s the feedback loop, explained without calculus.

Step 1: Measure the error (cross-entropy loss)

After the model produces a probability distribution over the vocabulary, we compare it to the actual next token. The cross-entropy loss measures how surprised the model was by the correct answer:

# If the model predicted P("mat") = 0.01 (1% chance)
# but "mat" was the correct answer...
loss = -log(0.01)  # = 4.6  (high loss — the model was very wrong)

# If the model predicted P("mat") = 0.85 (85% chance)
loss = -log(0.85)  # = 0.16 (low loss — good prediction)

Lower loss = better prediction. The goal of training is to minimize this loss across all examples.
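The same idea in runnable form, on a toy four-token vocabulary (the distribution below is made up for illustration):

```python
import math

def cross_entropy(probs, target_index):
    """Negative log of the probability the model assigned to the correct token."""
    return -math.log(probs[target_index])

# Toy distribution over a 4-token vocabulary; the correct next token is index 2.
probs = [0.05, 0.10, 0.80, 0.05]
loss = cross_entropy(probs, 2)   # -log(0.80) ≈ 0.22 — a confident, correct prediction

bad_probs = [0.48, 0.48, 0.01, 0.03]
bad_loss = cross_entropy(bad_probs, 2)  # -log(0.01) ≈ 4.6 — the model was very surprised
```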

Step 2: Compute which weights to adjust (backpropagation)

The model is a computation graph — data flows forward through embeddings, attention, FFN, output head. Backpropagation flows the error backward through this graph, computing for each weight: “if I increase this weight slightly, does the loss go up or down, and by how much?”

Think of it like tracing an error in a pipeline: “The output was wrong. The output head contributed this much error. That came from the last FFN, which got its input from attention, which got its input from embeddings…” Each weight gets a gradient — a number that says which direction to adjust.

Step 3: Adjust the weights (gradient descent)

# For every weight in the model:
weight = weight - learning_rate * gradient

If the gradient says “increasing this weight would increase the loss,” we decrease the weight (and vice versa). The learning_rate controls how big each adjustment is — typically very small (1e-4 to 1e-5).

The programmer analogy: Think of it as an optimization loop, like a compiler optimization pass. Run the program forward, measure how far the output is from the target, then propagate corrections backward through the computation graph. Repeat billions of times.

After enough iterations (often weeks or months on thousands of GPUs), the random initial weights converge to values that make the model very good at predicting text.
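The whole loop can be seen in miniature on a one-parameter "model" (pure illustration: real training differentiates billions of parameters through the full network via backpropagation, rather than using a hand-written gradient):

```python
def loss_fn(w):
    return (w - 3.0) ** 2        # a loss surface minimized at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)       # d(loss)/dw, computed analytically here

w, learning_rate = 0.0, 0.1      # start from a "random" initial weight
for step in range(100):
    w = w - learning_rate * gradient(w)   # the gradient descent update rule

# After 100 steps, w has converged very close to the optimum at 3.0.
```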

Scale of pre-training

Modern pre-training is expensive:

Aspect | Scale
Training data | 1-30+ trillion tokens
Compute | Thousands of GPUs for weeks-months
Cost | $5M-$500M+ per training run
Model size | 7B-1.8T+ parameters

Training data scale by model

The progression of training data scale reveals how rapidly the field has grown [1][2][3]:

Model | Year | Training Tokens | Data Size | Estimated Compute Cost
GPT-2 [4] | 2019 | ~10B tokens | 40GB (WebText) | ~$50K
GPT-3 [1] | 2020 | 300B tokens | ~570GB | ~$4.6M
Chinchilla [3] | 2022 | 1.4T tokens | ~5.6TB | ~$3M (70B params)
LLaMA 1 [2] | 2023 | 1.4T tokens | ~4.8TB | ~$2-5M
LLaMA 2 [5] | 2023 | 2T tokens | ~7TB | ~$5-10M
LLaMA 3 [6] | 2024 | 15T+ tokens | ~50TB+ | ~$50-100M+
DeepSeek-V3 [7] | 2024 | 14.8T tokens | Undisclosed | ~$5.6M (2,048 H800 GPUs)
Qwen 2.5 [8] | 2024 | 18T+ tokens | Undisclosed | Undisclosed
Llama 4 [9] | 2025 | ~30T tokens (est.) | Undisclosed | Undisclosed

The Chinchilla scaling laws [3] were a turning point: Hoffmann et al. showed that most models were undertrained — using too many parameters for the amount of data. A model trained with the optimal token-to-parameter ratio (roughly 20 tokens per parameter) outperforms a larger model trained on fewer tokens. This insight drove the industry to scale training data dramatically. By 2024-2025, the pendulum swung further: DeepSeek-V3 trained its 671B MoE model on 14.8T tokens for approximately $5.6M — a fraction of comparable frontier models — by using engineering innovations like FP8 mixed-precision training and multi-token prediction objectives.
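The 20-tokens-per-parameter rule of thumb is easy to apply (it is a rough heuristic from the Chinchilla paper, not an exact law, and later models deliberately "overtrain" past it):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal training token count per Hoffmann et al. (2022)."""
    return n_params * tokens_per_param

# Chinchilla itself: 70B parameters → ~1.4T tokens, matching the table above.
tokens = chinchilla_optimal_tokens(70e9)   # 1.4e12
```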

Practical implications

Why fine-tuning is cheaper than pre-training: Pre-training a frontier model requires processing trillions of tokens through billions of parameters — thousands of GPUs running for months at costs of $10M–$100M+. Fine-tuning, by contrast, trains on thousands to millions of examples (not trillions), often modifies only a subset of weights (see LoRA), and can run on a single GPU in hours. A rough cost comparison:

Stage | Data | Compute | Time | Cost
Pre-training (70B model) | 2T tokens | 2,000+ GPUs | 3-6 months | $5-50M
Supervised fine-tuning | 50K-500K examples | 8-64 GPUs | Hours-days | $100-$10K
LoRA fine-tuning | 10K-100K examples | 1-8 GPUs | Hours | $10-$1K

This is why most developers never pre-train a model — they start with an existing pre-trained model and customize it for their use case.

References

  1. Brown, T., et al. “Language Models are Few-Shot Learners.” NeurIPS, 2020. arXiv:2005.14165
  2. Touvron, H., et al. “LLaMA: Open and Efficient Foundation Language Models.” Meta AI, 2023. arXiv:2302.13971
  3. Hoffmann, J., et al. “Training Compute-Optimal Large Language Models.” NeurIPS, 2022. arXiv:2203.15556
  4. Radford, A., et al. “Language Models are Unsupervised Multitask Learners.” OpenAI, 2019. PDF
  5. Touvron, H., et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” Meta AI, 2023. arXiv:2307.09288
  6. Dubey, A., et al. “The Llama 3 Herd of Models.” Meta AI, 2024. arXiv:2407.21783
  7. DeepSeek-AI. “DeepSeek-V3 Technical Report.” 2024. arXiv:2412.19437
  8. Qwen Team. “Qwen2.5 Technical Report.” Alibaba, 2024. arXiv:2412.15115
  9. Meta AI. “Introducing Llama 4.” 2025. Meta AI blog.

The cost is why only a handful of organizations pre-train frontier models. Most developers use pre-trained models and customize them through fine-tuning — the topic of the next section.

Summary: Pre-training teaches a transformer to predict the next token on massive text corpora. The model doesn't memorize rules — it learns statistical structure through billions of examples. Cross-entropy loss measures prediction error; gradient descent adjusts weights to reduce it; backpropagation computes which weights to adjust. This iterative optimization process is how 'learning' works. In the next section, we'll see how post-training (fine-tuning and alignment) transforms this raw capability into a helpful assistant.

Fine-tuning, Instruction Tuning, and Alignment

A pre-trained model is powerful but unrefined. Ask it “What is the capital of France?” and it might continue with “What is the capital of Germany? What is the capital of Spain?” — because in its training data, quiz questions often appear in lists. It hasn’t learned to answer questions; it’s learned to predict text.

Post-training is the umbrella term for all training that happens after base pre-training. It encompasses supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and related alignment techniques. Post-training bridges the gap between a model that can predict text and one that follows instructions and behaves helpfully: where pre-training teaches the model what language is, post-training teaches it how to behave.

Supervised fine-tuning (SFT)

The first step is supervised fine-tuning: train the model on high-quality examples of desired behavior.

# SFT training examples
examples = [
    {"input": "What is the capital of France?", "output": "The capital of France is Paris."},
    {"input": "Write a Python function to sort a list.", "output": "def sort_list(lst):\n    return sorted(lst)"},
    {"input": "Summarize this article: ...", "output": "The article discusses..."},
]

for example in examples:
    loss = model.compute_loss(example["input"], example["output"])
    model.update_weights(loss)

SFT uses the same training mechanism as pre-training (next-token prediction, cross-entropy loss, backpropagation) but on curated instruction-response pairs instead of raw internet text.

After SFT, the model follows instructions — but it might still produce harmful content, be excessively verbose, or confidently state falsehoods.

Domain adaptation

A variant of fine-tuning: train on domain-specific data to specialize the model. A medical model fine-tuned on clinical notes, a legal model fine-tuned on case law, a code model fine-tuned on programming tasks.

The architecture doesn’t change. You’re just adjusting the weights toward a specific domain’s patterns.

RLHF: Reinforcement Learning from Human Feedback

RLHF [1] is the technique that turned raw language models into the helpful assistants you interact with. The InstructGPT paper [2] demonstrated the approach at scale, showing that a 1.3B parameter model with RLHF could outperform a 175B parameter model without it (on human preference evaluations). It’s a three-step process:

Step 1: Supervised fine-tuning (as above)

Train the model to follow instructions with curated examples.

Step 2: Train a reward model

Collect human preferences: show humans two model responses to the same prompt and ask “which is better?” Use this data to train a reward model — a separate model that predicts how much a human would prefer a given response.

# Reward model training
for (prompt, response_a, response_b, human_preferred) in preference_data:
    reward_a = reward_model(prompt, response_a)
    reward_b = reward_model(prompt, response_b)
    # Train so that the preferred response gets a higher score
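The standard way to turn "A was preferred over B" into a differentiable loss is the Bradley-Terry pairwise formulation: the sigmoid of the score gap is the predicted probability that the human's choice wins, and the loss is its negative log. A sketch (the real reward model is a full transformer; the scalar rewards below are made up):

```python
import math

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss: low when the preferred response scores higher."""
    # Sigmoid of the score gap = predicted probability the preferred response wins.
    p_correct = 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))
    return -math.log(p_correct)

# If the reward model already ranks the pair correctly, the loss is small...
low = pairwise_preference_loss(2.0, -1.0)    # gap +3 → loss ≈ 0.05
# ...and large when it ranks them the wrong way around.
high = pairwise_preference_loss(-1.0, 2.0)   # gap -3 → loss ≈ 3.05
```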

Step 3: Optimize with PPO (Proximal Policy Optimization)

Use the reward model to guide the language model. Generate responses, score them with the reward model, and adjust the language model to produce higher-scoring responses:

for prompt in training_prompts:
    response = language_model.generate(prompt)
    reward = reward_model.score(prompt, response)
    language_model.update_with_rl(reward)  # Increase probability of high-reward responses

A KL divergence penalty constrains the policy to prevent the model from deviating too far from the SFT model (to avoid “reward hacking” — gaming the reward model without actually improving).

The key insight

Pre-training creates capability. Post-training shapes behavior.

Pre-training teaches the model what language is — syntax, semantics, facts, patterns. Post-training teaches the model how to behave — be helpful, be honest, follow instructions, refuse harmful requests.

This separation is why the same base model (e.g., LLaMA) can be fine-tuned into a coding assistant, a medical advisor, or a creative writing partner. The capability is already there; post-training just directs it.

Alternatives to RLHF

RLHF is effective but expensive and complex. Alternatives include:

  • DPO (Direct Preference Optimization) [3]: Skips the reward model entirely, optimizes the language model directly on preference pairs using a classification-style loss — simpler to implement, more stable to train, and widely adopted by 2025
  • GRPO (Group Relative Policy Optimization) [6]: Used by DeepSeek-R1 (2025). Eliminates the separate value function required by PPO by computing advantages relative to a group of sampled outputs, further reducing the complexity and cost of RL-based alignment
  • Constitutional AI: The model critiques and revises its own outputs using principles
  • RLAIF: Use an AI model (instead of humans) to provide feedback

These are active research areas. The core principle remains: use feedback to align the model’s behavior with human intent.

Reasoning models: alignment through reinforcement learning

A major development in 2024-2025 is the emergence of reasoning models — models trained to produce explicit chain-of-thought reasoning before giving a final answer. The key insight is that the standard transformer architecture, combined with RL-based post-training, can learn to “think” step by step at inference time.

How reasoning models are trained: Rather than training on curated instruction-response pairs, reasoning models like DeepSeek-R1 [6] and OpenAI’s o1/o3 [7] use reinforcement learning to reward correct final answers. The model learns on its own that producing intermediate reasoning steps leads to better outcomes. DeepSeek-R1 demonstrated that pure RL (using GRPO) applied to a base model — without any supervised fine-tuning — can produce emergent reasoning behaviors including self-verification and reflection.

Test-time compute scaling: Traditional scaling focuses on making models bigger or training them longer. Reasoning models introduce a third axis: spending more compute at inference time. By generating longer chains of thought, the model effectively “thinks harder” about difficult problems. OpenAI’s o3 scored 96.7% on AIME 2024 math problems — a dramatic improvement over non-reasoning models — by using significantly more tokens during generation.

This represents a shift in how post-training creates capability: instead of just shaping behavior (be helpful, be safe), RL can teach models new cognitive strategies that emerge from the training signal rather than being explicitly programmed.

Practical implications

When to fine-tune vs. prompt engineering: Fine-tuning is not always the right answer. Consider this decision framework:

Approach | Best When | Cost | Effort
Prompt engineering | You need flexibility, task is expressible in instructions | API costs only | Low
Few-shot prompting | You have good examples, task is consistent | Slightly higher token usage | Low
Full fine-tuning | You have large domain-specific datasets, need maximum quality | High (GPU compute) | High
LoRA/QLoRA [4] | You want fine-tuning benefits at lower cost, limited GPU memory | Moderate | Moderate

LoRA (Low-Rank Adaptation) [4] makes fine-tuning practical for most teams. Instead of updating all model weights, LoRA freezes the pre-trained weights and injects small trainable rank-decomposition matrices into each layer. This reduces trainable parameters by orders of magnitude — roughly 1,000× to 10,000× depending on model size and LoRA rank — while achieving comparable quality:

# Full fine-tuning: update all 7B parameters
trainable_params = 7_000_000_000  # 7B, requires massive optimizer state

# LoRA: update only ~0.1% of parameters
trainable_params = 4_194_304      # ~4M, fits on a single GPU
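
Where the ~4M figure comes from, assuming rank-8 adapters on the query and value projections of a 7B-class model (4,096-dimensional, 32 layers, as in Llama-style architectures; your numbers will vary with rank and which matrices you adapt):

```python
# Each adapted d×d weight matrix W gets two small factors: A (r × d) and B (d × r).
d_model, rank = 4096, 8
params_per_matrix = 2 * d_model * rank      # A and B together: 65,536

n_layers = 32
matrices_per_layer = 2                      # query and value projections only
trainable = params_per_matrix * matrices_per_layer * n_layers
# trainable == 4,194,304 — the ~4M figure above, vs 7B for full fine-tuning
```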

QLoRA [5] goes further by quantizing the frozen weights to 4-bit, enabling fine-tuning of a 65B model on a single 48GB GPU.

Cost comparison for fine-tuning a 7B model:

Method | GPU Memory | Time (10K examples) | Approximate Cost
Full fine-tuning | 80GB+ (A100) | 2-4 hours | $50-200
LoRA | 16-24GB | 1-2 hours | $10-50
QLoRA | 10-16GB | 1-3 hours | $5-30

References

  1. Christiano, P., et al. “Deep Reinforcement Learning from Human Preferences.” NeurIPS, 2017. arXiv:1706.03741
  2. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” NeurIPS, 2022. arXiv:2203.02155
  3. Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS, 2023. arXiv:2305.18290
  4. Hu, E., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR, 2022. arXiv:2106.09685
  5. Dettmers, T., et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS, 2023. arXiv:2305.14314
  6. DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” 2025. arXiv:2501.12948
  7. OpenAI. “Learning to Reason with LLMs.” OpenAI blog, 2024.

With the model trained and aligned, the final question is: what happens when you actually use it?

Summary: Pre-training creates raw capability — the model can predict text but doesn't know how to follow instructions or be helpful. Post-training shapes behavior: supervised fine-tuning teaches the model to follow instructions, and RLHF aligns it with human preferences. Pre-training creates capability; post-training shapes behavior. In the next section, we'll see what happens when you actually use the finished model — the inference process.

Inference — What Happens When You Actually Use the Model

When you type a prompt into ChatGPT and hit Enter, here’s what happens.

The generation loop

Remember the walkthrough? That forward pass happens for every single output token:

def generate(prompt, max_tokens=100):
    tokens = tokenize(prompt)
    
    for _ in range(max_tokens):
        # 1. Forward pass through all layers
        logits = model.forward(tokens)  # Full transformer pipeline
        
        # 2. Apply sampling strategy
        next_token = sample(logits, temperature, top_k, top_p)
        
        # 3. Append and repeat
        tokens.append(next_token)
        
        # 4. Stop if we hit the end token
        if next_token == END_TOKEN:
            break
    
    return detokenize(tokens)

Each iteration is a complete forward pass through every layer. If the model has 96 layers and the response is 200 tokens, that’s 96 × 200 = 19,200 layer computations — just for one response.
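The sample() call above hides the knobs you control through the API. A minimal sketch of temperature plus top-k sampling (illustrative, not any particular library's implementation; real servers also offer top-p and other strategies):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None):
    """Temperature scaling, optional top-k filtering, then draw from the result."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy decoding
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep only the k highest logits; everything else gets probability 0.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.5, -1.0]                     # toy 4-token vocabulary
greedy = sample(logits, temperature=0)             # always index 0
varied = sample(logits, temperature=1.5, top_k=2)  # index 0 or 1 only
```

Higher temperature flattens the distribution (more randomness); top-k caps how far down the ranked list a sample can reach.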

Why generation is slower than training

During training, all tokens in a sequence are processed simultaneously. The model sees “The cat sat on the mat” all at once and computes loss for every position in parallel.

During generation, each new token depends on the previous tokens. You can’t predict token 5 until token 4 exists. This is inherently sequential — you’re bound by latency, not throughput.

This is why:

  • The first token has noticeable latency: the “prefill” phase processes your entire prompt in one pass before anything streams back
  • Each subsequent token streams in one at a time
  • Longer responses take proportionally longer

The KV cache: avoiding redundant computation

Naive generation is wasteful. When generating token 101, the model would recompute attention for tokens 1–100 even though nothing about them has changed.

The KV cache solves this. During each forward pass, the model caches the Key and Value matrices for every layer:

# Without KV cache: recompute everything each step
for step in range(max_tokens):
    full_output = model(all_tokens)      # Processes ALL tokens every step

# With KV cache: only compute the new token's contribution
kv_cache = {}
for step in range(max_tokens):
    new_output = model(new_token, kv_cache)  # Process ONLY the new token
    kv_cache.update(new_keys, new_values)     # Append to cache

The KV cache reduces each generation step from processing n tokens to processing just one (attention still reads the n cached keys, and the cache itself grows). This is a massive speedup — but it uses memory proportional to 2 × num_layers × n_kv_heads × d_head × seq_len.

KV cache memory calculation example: For Llama 3 8B with a 128K context window:

KV cache = 2 (K and V) × n_layers × n_kv_heads × d_head × seq_len × 2 bytes (FP16)
         = 2 × 32 × 8 × 128 × 131,072 × 2
         = ~16 GB per sequence

(Note: Llama 3 8B uses GQA with 8 KV heads — not one per query head. See the attention section for why GQA reduces this number so dramatically compared to standard multi-head attention.)

That’s 16 GB of GPU memory just for the KV cache of a single request. Batching 8 concurrent requests would require ~128 GB of KV cache memory alone — before accounting for model weights. This is why efficient KV cache management is critical for serving [1].
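The arithmetic above is easy to script. A minimal sketch (the helper name `kv_cache_bytes` is ours, using the Llama 3 8B shapes from the example):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_param=2):
    """KV cache size for one sequence: 2 (K and V) x layers x KV heads
    x head dim x sequence length x bytes per parameter (2 for FP16)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_param

# Llama 3 8B (32 layers, 8 KV heads via GQA, d_head=128) at 128K context:
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=131_072)
print(f"{size / 2**30:.0f} GiB per sequence")  # 16 GiB
```

Plugging in a 4K context instead of 128K gives ~0.5 GiB per sequence — the same formula drives the serving-memory estimates later in this section.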

Prefill vs. decode: two different bottlenecks

Generation has two distinct phases with different performance characteristics:

Phase   | What Happens                         | Bottleneck                     | Speed
Prefill | Process all input tokens in parallel | Compute-bound (matrix math)    | Fast — limited by GPU FLOPS
Decode  | Generate one token at a time         | Memory-bound (loading weights) | Slow — limited by memory bandwidth

During prefill, the GPU processes your entire prompt at once — thousands of tokens in a single forward pass. This is compute-bound: the GPU’s matrix multiplication units are fully utilized.

During decode, the model generates one token per step. Each step requires loading all model weights from memory but only performs a small amount of computation (one token’s worth). The GPU is underutilized — it spends most of its time waiting for memory transfers. This is why you see a pause before the first token (prefill), then tokens streaming in one at a time (decode).
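A back-of-envelope bound makes the memory wall concrete. Assuming (illustratively) ~16 GB of FP16 weights and ~2 TB/s of HBM bandwidth, single-stream decode speed is capped by bandwidth divided by model size:

```python
def max_decode_tokens_per_sec(weight_bytes, bandwidth_bytes_per_sec):
    # At batch size 1, every decode step must stream all weights from
    # memory once, so bandwidth / model size caps tokens per second.
    return bandwidth_bytes_per_sec / weight_bytes

# ~16 GB FP16 model on a GPU with ~2 TB/s memory bandwidth (illustrative):
print(max_decode_tokens_per_sec(16e9, 2e12))  # 125.0 tokens/sec ceiling
```

No amount of extra compute raises this ceiling — only faster memory, smaller weights (quantization), or batching more requests per weight load.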

Why long contexts cost more

As discussed in the masking section, attention is O(n²). But there’s also the KV cache:

  • A 4k context stores 4,096 K/V pairs per layer
  • A 128k context stores 131,072 K/V pairs per layer

To illustrate why attention variant matters for long contexts, compare GPT-3 175B (MHA, 96 KV heads) versus a GQA model (8 KV heads) at 128K context:

GPT-3 style (MHA, 96 layers, 96 KV heads, d_head=128, FP16):
  KV cache = 2 × 96 × 96 × 128 × 131,072 × 2 bytes ≈ 618 GB per sequence

GQA model (96 layers, 8 KV heads, d_head=128, FP16):
  KV cache = 2 × 96 × 8 × 128 × 131,072 × 2 bytes ≈ 52 GB per sequence

This 12× difference is exactly why models like Llama 3 and Qwen use GQA with 8 KV heads — it makes long-context serving feasible. With MHA, a single 128K-context request would consume more GPU memory for KV cache alone than the model weights themselves.

This is why API providers charge more for longer contexts, why responses get slower as conversations grow, and why there’s active research into KV cache compression, sparse attention, and sub-quadratic alternatives.

Sampling strategies: temperature, top-k, and top-p

After the forward pass produces logits (one score per vocabulary token), we need to select the next token. The model doesn’t always pick the highest-probability token — that would make responses repetitive and deterministic. Instead, we sample from the distribution, and three parameters control how:

Temperature scales the logits before softmax:

scaled_logits = logits / temperature
probs = softmax(scaled_logits)
  • temperature < 1 → sharper distribution → more deterministic (peaks get peakier)
  • temperature = 1 → unchanged distribution (the default)
  • temperature > 1 → flatter distribution → more random (everything becomes more likely)
  • temperature → 0 → greedy decoding (always pick the highest-probability token)

Top-K keeps only the K highest-probability tokens, setting the rest to zero:

top_k_tokens = sorted(probs, reverse=True)[:k]
# Renormalize: only sample from these K tokens

Top-P (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds P:

sorted_probs = sorted(probs, reverse=True)
cumulative = cumsum(sorted_probs)
cutoff_idx = first_index_where(cumulative >= p)
# Keep only tokens up to cutoff_idx, renormalize

Top-P is adaptive: when the model is confident (one token has 90% probability), only 1-2 tokens pass the filter. When the model is uncertain, more tokens pass.
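Putting the three filters together, here is a runnable sketch of a combined sampler (the function name `sample_next_token` is ours; real implementations apply these filters to logits inside the serving stack):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()

    order = np.argsort(probs)[::-1]             # tokens by descending probability
    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:
        keep[order[top_k:]] = False             # drop everything outside the top K
    if top_p < 1.0:
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep[order[cutoff:]] = False            # smallest set with cum. prob >= p

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                        # renormalize the survivors
    return int(rng.choice(len(probs), p=probs))

# Tiny temperature collapses the distribution onto the top token:
print(sample_next_token([1.0, 5.0, 2.0], temperature=0.01))  # 1
```

Note that the filters compose: temperature reshapes the distribution first, then top-k and top-p prune it, and whatever survives is renormalized before sampling.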

Explore sampling parameters

Adjust the sliders to see how temperature, top-k, and top-p change the probability distribution:

Sampling Strategy Explorer

Adjust temperature, top-k, and top-p to see how the next-token distribution changes after "The cat sat on the ___"

[Interactive demo: a bar chart of next-token probabilities ("mat" 53.1%, "rug" 13.1%, "floor" 9.7%, "carpet" 7.2%, ...) with live sliders for temperature, top-k, and top-p.]

Temperature scales logits before softmax — low values sharpen the distribution toward the top token, high values flatten it. Top-K keeps only the K highest-probability tokens. Top-P (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p. Grayed-out tokens with strikethrough are filtered and renormalized to zero.

Sampling Parameters

These are the exact parameters you set in API calls:

  • temperature=0.7 — slightly more focused than default
  • top_p=0.9 — ignore the long tail of unlikely tokens
  • top_k=50 — hard cap on candidate tokens

Understanding what they do lets you tune generation quality for your use case: lower temperature for factual tasks, higher for creative writing.

The full picture

Every time you interact with an LLM:

  1. Your prompt is tokenized
  2. A prefill forward pass processes all input tokens in parallel
  3. The model enters the generation loop: forward pass → sample → append → repeat
  4. The KV cache avoids recomputing past tokens
  5. Temperature/top-k/top-p control the randomness of each token choice
  6. This continues until the model generates a stop token or hits the max length

Practical implications

GPU memory estimation for serving: When planning deployment, account for both model weights and KV cache:

Total GPU memory ≈ model_weights + kv_cache_per_seq × batch_size

Example: Llama 3 8B at FP16 with batch_size=4, 4K context:
  Model weights: ~16 GB
  KV cache: ~0.5 GB × 4 = ~2 GB
  Total: ~18 GB (fits on a single A100 40GB)

Same model at 128K context:
  Model weights: ~16 GB
  KV cache: ~16 GB × 4 = ~64 GB
  Total: ~80 GB (needs multiple GPUs or A100 80GB + offloading)

Streaming response behavior: The pause before the first token is prefill time (processing your entire prompt). After that, tokens arrive at a steady rate determined by decode speed. Longer prompts increase time-to-first-token but don’t affect per-token decode speed.

Batch size trade-offs: Larger batches increase throughput (more tokens/second across all requests) but increase latency per request. Systems like vLLM [1] use PagedAttention to manage KV cache memory efficiently, enabling larger batch sizes without wasting GPU memory on fragmented allocations.

Modern inference optimizations

The gap between raw model capability and practical deployment has driven major advances in inference efficiency through 2024-2025:

FlashAttention-3 [2] (2024) builds on FlashAttention-2 by exploiting GPU hardware asynchrony and low-precision (FP8) computation. Rather than materializing the full N × N attention matrix in GPU memory, FlashAttention computes attention in tiles directly from SRAM, reducing memory usage from O(N²) to O(N) and improving wall-clock speed by 1.5-2x over standard attention. FlashAttention-3 further overlaps computation with memory transfers using asynchronous operations, approaching the theoretical peak throughput of modern GPUs.
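The core trick can be sketched in NumPy: process keys tile by tile while maintaining a running row max and running softmax denominator (the "online softmax"), so the full score matrix never exists at once. This is a sketch of the idea only; the real kernels fuse these steps into GPU SRAM:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Streaming (online-softmax) attention over key tiles: mathematically
    identical to softmax(QK^T / sqrt(d)) V, but it never materializes the
    full n x n score matrix - the core idea behind FlashAttention."""
    n, d = Q.shape
    m = np.full(n, -np.inf)                 # running row-wise max
    l = np.zeros(n)                         # running softmax denominator
    acc = np.zeros((n, V.shape[-1]))        # running weighted sum of V
    for s in range(0, K.shape[0], tile):
        scores = Q @ K[s:s+tile].T / np.sqrt(d)
        m_new = np.maximum(m, scores.max(axis=-1))
        corr = np.exp(m - m_new)            # rescale old state to the new max
        p = np.exp(scores - m_new[:, None])
        l = l * corr + p.sum(axis=-1)
        acc = acc * corr[:, None] + p @ V[s:s+tile]
        m = m_new
    return acc / l[:, None]
```

The output matches standard attention exactly; only the memory traffic pattern changes, which is where the speedup comes from.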

Speculative decoding [3] addresses the fundamental bottleneck of autoregressive generation: each token requires a full forward pass. The idea is simple — use a small, fast “draft” model to generate several candidate tokens, then verify them in parallel with the large model:

# Speculative decoding pseudocode
draft_tokens = small_model.generate(prompt, n=5)  # Fast: generate 5 candidates
# Verify all 5 in a single forward pass of the large model
accepted = large_model.verify(prompt + draft_tokens)  # Parallel verification
# Accept matching tokens, reject and regenerate from first mismatch

When the draft model’s predictions match the large model (which happens frequently for predictable tokens), multiple tokens are produced per large-model forward pass. Techniques like EAGLE-3 (2025) and Saguaro (2025) have pushed acceptance rates higher, achieving 2-3x speedups over optimized non-speculative decoding for typical workloads.
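A toy version of the accept/reject loop, assuming greedy decoding on both sides (`target_next` is a stand-in for one large-model prediction; real systems obtain all of them from a single batched forward pass):

```python
def speculative_step(target_next, prefix, draft_tokens):
    """Accept draft tokens that match the target model's own greedy choice;
    on the first mismatch, substitute the target's token and stop."""
    out = []
    for tok in draft_tokens:
        expected = target_next(prefix + out)
        if tok == expected:
            out.append(tok)                    # correct guess: a "free" token
        else:
            out.append(expected)               # mismatch: take the target's token
            break
    else:
        out.append(target_next(prefix + out))  # all accepted: one bonus token
    return out

# Toy deterministic target model: next token is len(sequence) mod 3
target = lambda seq: len(seq) % 3
print(speculative_step(target, [], [0, 1, 2]))  # [0, 1, 2, 0]
print(speculative_step(target, [], [0, 9]))     # [0, 1]
```

Even on a mismatch the step still yields at least one valid token, so speculative decoding never produces worse output than plain decoding — it only changes how many tokens each expensive forward pass yields.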

Quantization: Running models in lower precision dramatically reduces memory and speeds up inference. FP16 (half-precision) is now the baseline. INT8 and INT4 quantization (via GPTQ, AWQ) cut memory by another 2-4x with modest quality loss. NVIDIA’s TensorRT-LLM supports FP8 natively on H100 GPUs, and NVFP4 (4-bit floating point) is emerging for even more aggressive compression.

Serving frameworks: vLLM [1] with PagedAttention has become the standard open-source serving stack, reducing KV cache memory waste to under 4% through virtual memory-style paging. Combined with continuous batching (dynamically adding requests as others complete), modern serving systems achieve much higher throughput than naive implementations.

References

  1. Kwon, W., et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP, 2023. arXiv:2309.06180
  2. Shah, J., et al. “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.” NeurIPS, 2024. arXiv:2407.08691
  3. Leviathan, Y., et al. “Fast Inference from Transformers via Speculative Decoding.” ICML, 2023. arXiv:2211.17192

Now you understand what’s happening inside the model. The final section explores why different models — built on the same transformer architecture — produce such different results.

Summary: When you use a language model, it runs the generation loop: tokenize your prompt, run a forward pass, sample a token, append it, and repeat. The KV cache avoids redundant computation. Generation is inherently sequential and slower than training. Temperature, top-k, and top-p control the randomness of token selection — these are the parameters you set in every API call. In the next section, we'll compare how different models built on this same architecture produce such different results.

Why GPT, Claude, Gemini, Qwen, and DeepSeek Differ

Every major language model — GPT-4, Claude, Gemini, Qwen, DeepSeek, LLaMA — is built on the decoder-only transformer architecture you’ve learned in this article. Same attention mechanism. Same feed-forward networks. Same residual connections. Same autoregressive generation.

So why do they produce such different results?

Same core, different everything else

The transformer architecture is the engine. Everything else — the fuel, the tuning, the chassis — varies:

Factor               | What It Is                                                                                       | Why Programmers Care
Tokenizer            | BPE vocabulary (50k-200k+ tokens), handling of code/math/multilingual text                       | Affects token counts, cost per request, handling of special characters
Training data        | Web crawls, books, code, proprietary data. Quality and filtering matter enormously               | Determines knowledge cutoff, coding ability, language coverage
Scale                | 7B to 1.8T+ parameters, 1T to 30T+ training tokens                                               | Bigger isn’t always better for your task — smaller models are faster and cheaper
Context window       | 4k to 10M tokens                                                                                 | Determines how much code/documentation you can send in one request
Architecture tweaks  | MoE (Mixture of Experts), GQA (Grouped Query Attention), MLA (Multi-head Latent Attention), sliding window | Affects speed, cost, and quality tradeoffs
Fine-tuning recipe   | SFT data quality, RLHF vs DPO vs GRPO, constitutional AI, reasoning RL                           | Shapes helpfulness, safety, instruction-following quality
Reasoning capability | Chain-of-thought at inference time, test-time compute scaling                                    | Some models can “think harder” on difficult problems — at the cost of more tokens and latency
Alignment policy     | What the model refuses, how cautious it is, how it handles edge cases                            | Determines usability — too cautious = frustrating, too lax = risky
Tool integration     | Function calling, code execution, web browsing, file handling                                    | Affects what the model can do beyond text generation
Inference stack      | Quantization, batching, caching, speculative decoding                                            | Affects latency, cost, and throughput you experience
Product goals        | General assistant vs. coding vs. research vs. enterprise                                         | Shapes the model’s strengths and weaknesses

Notable architectural variations

Mixture of Experts (MoE): Used by Gemini, Mixtral, DeepSeek-V3, Llama 4 Scout/Maverick, Qwen 3, Mistral Large 3, and likely GPT-4. Instead of one FFN per layer, MoE has multiple “expert” FFN networks. A gating mechanism routes each token to top-k experts (out of 8-256 total). This allows much larger total model capacity without proportional compute cost. The scale of MoE has grown dramatically: DeepSeek-V3 uses 256 experts with top-8 routing (671B total, 37B active), while Llama 4 Maverick uses 128 experts (400B total).

Multi-head Latent Attention (MLA): Introduced by DeepSeek-V2 and used in V3. Instead of caching full Key and Value matrices, MLA compresses them into a low-rank “latent” representation, dramatically reducing KV cache size. This enables longer context windows with less memory overhead than standard GQA.

Grouped Query Attention (GQA): Used by LLaMA 2/3, Mistral, Qwen. Instead of separate K/V heads for each attention head, multiple query heads share the same K/V heads. This reduces the KV cache size (important for long contexts) with minimal quality loss.

Sliding Window Attention: Used by Mistral. Each token only attends to the previous N tokens (e.g., 4,096) instead of the full sequence. Reduces the O(n²) cost to O(n × window_size). Combined with layer stacking, information can still propagate across the full context.
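A sliding-window causal mask is a small tweak to the standard causal mask; a NumPy sketch (here the window size `w` counts the token itself plus its `w-1` predecessors):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: True where query position i may attend to key position j.
    Causal (j <= i) intersected with a window of the last w positions (j > i - w)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

# With n=5, w=2: each row has at most two True entries, the token and its predecessor
print(sliding_window_mask(5, 2).astype(int))
```

Each row now has at most `w` allowed keys regardless of sequence length, which is where the O(n × window_size) cost comes from.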

Model family snapshots

GPT (OpenAI): The GPT family has evolved rapidly through 2024-2025. GPT-4 (likely MoE, ~1.8T parameters) was followed by GPT-4.1 (1M context, April 2025) and GPT-5 (August 2025). OpenAI also introduced reasoning models: o1 (December 2024), o3, and o4-mini, which use chain-of-thought reasoning at inference time to achieve dramatically better results on math and coding tasks. Proprietary training data. RLHF alignment. Function calling and code interpreter.

Claude (Anthropic): Decoder-only, trained with Constitutional AI (a self-critique alignment method). Claude 3.5 Sonnet (2024) was followed by Claude 3.7 Sonnet (February 2025, with “extended thinking” mode for reasoning), then the Claude 4 series (Haiku/Sonnet/Opus, 2025) with up to 1M context. Known for strong coding, instruction following, and safety. Claude 4.5 Sonnet and Opus 4.6 arrived in late 2025 and early 2026.

Gemini (Google): Multimodal from the ground up (text, image, audio, video). Uses MoE architecture. Gemini 2.0 Flash (December 2024) and Gemini 2.5 Pro (March 2025, with “Deep Think” reasoning mode) were followed by the Gemini 3 series (late 2025). Up to 1M token context. Trained on Google’s proprietary data including YouTube, Scholar, Search.

Qwen (Alibaba): Open-weight model family that has grown from dense models (Qwen 2.5, up to 72B) to MoE architectures. Qwen 3 (April 2025, 235B total / 22B active, MoE) brought hybrid thinking modes (fast and reasoning). Strong multilingual performance (Chinese + English). Competitive with proprietary models.

DeepSeek (DeepSeek AI): Open-weight models with remarkable training efficiency. DeepSeek-V3 (December 2024, 671B MoE / 37B active) was trained for approximately $5.6M — a fraction of comparable frontier models. DeepSeek-R1 (January 2025) demonstrated that reasoning capabilities can emerge from pure RL training using GRPO, matching OpenAI o1 on many benchmarks while being open-source (MIT license). Uses Multi-head Latent Attention (MLA) for efficient KV caching.

LLaMA/Llama (Meta): Open-weight foundation models. Llama 3 (2024, up to 405B) used GQA and RoPE. Llama 4 (April 2025) shifted to MoE: Scout (109B total / 17B active, 16 experts, 10M context) and Maverick (400B total, 128 experts). Widely fine-tuned by the community. Llama 3 70B is competitive with GPT-3.5; Llama 4 models compete with frontier models.

Mistral (Mistral AI): French AI lab producing efficient open-weight models. Mistral Large 3 (December 2025, MoE, 41B active parameters) is released under Apache 2.0. Known for strong multilingual performance and efficiency.

xAI (Grok): Grok 3 (February 2025) is trained on data from the X platform. Known for a less restrictive alignment policy compared to other frontier models.

Detailed model comparison

The following table compares architectural details across major model families. Values reflect publicly available information as of early 2026; some details (especially for closed-source models) are estimates based on published research and community analysis.

Factor            | GPT-4.1 [1]                             | Claude 4.5 Sonnet [2]                     | Gemini 2.5 Pro [3]                           | Llama 4 Scout [4]                | Qwen 3 [5]                              | DeepSeek-V3 [6]
Tokenizer         | BPE (o200k)                             | BPE                                       | SentencePiece                                | BPE (tiktoken-style)             | BPE                                     | BPE
Vocab size        | ~200K                                   | ~100K                                     | ~256K                                        | 202K                             | ~152K                                   | 128K
Attention type    | Undisclosed                             | Undisclosed                               | MHA + Multi-Query                            | GQA                              | GQA                                     | Multi-head Latent Attention (MLA)
Param count       | Undisclosed                             | Undisclosed                               | Undisclosed (MoE)                            | 109B (MoE, 17B active)           | 235B (MoE, 22B active)                  | 671B (MoE, 37B active)
Layers            | Undisclosed                             | Undisclosed                               | Undisclosed                                  | 48                               | Undisclosed                             | 61
Context window    | 1M                                      | 1M                                        | 1M                                           | 10M                              | 128K                                    | 128K
Training data     | Undisclosed                             | Undisclosed                               | Undisclosed                                  | Undisclosed                      | 36T+ tokens                             | 14.8T tokens
FFN variant       | Undisclosed                             | Undisclosed                               | MoE                                          | MoE + SwiGLU (16 experts)        | MoE + SwiGLU                            | MoE + SwiGLU (256 experts, top-8)
Normalization     | Undisclosed                             | Undisclosed                               | RMSNorm (est.)                               | RMSNorm (pre-norm)               | RMSNorm (pre-norm)                      | RMSNorm (pre-norm)
Position encoding | Undisclosed                             | Undisclosed                               | RoPE (est.)                                  | RoPE                             | RoPE                                    | RoPE
Open/closed       | Closed                                  | Closed                                    | Closed                                       | Open weights                     | Open weights                            | Open weights
Notable features  | 1M context, tool use, code interpreter  | Extended thinking, computer use, artifacts | Deep Think reasoning mode, native multimodal | 10M context, natively multimodal | Hybrid thinking (fast + reasoning modes) | Multi-token prediction, MLA, cost-efficient training

Note: Closed-source model details marked “est.” or “undisclosed” are based on best available public information and may not be exact.

Reasoning models: a new category

A major development since 2024 is the emergence of reasoning models — models that generate explicit chains of thought before producing a final answer. These are the same transformer architecture but trained with reinforcement learning to produce intermediate reasoning steps.

Model             | Release    | Approach                                             | Key Results
OpenAI o1         | Dec 2024   | RL-trained chain-of-thought, hidden reasoning tokens | Significant improvements on math, coding, science
OpenAI o3         | Early 2025 | Extended o1 approach, 200K context                   | 96.7% AIME 2024, 87.7% ARC-AGI
DeepSeek-R1       | Jan 2025   | Pure RL (GRPO) on base model, open-source (MIT)      | Matches o1 performance at ~27x lower cost
Claude 3.7 Sonnet | Feb 2025   | Extended thinking mode (optional reasoning)          | User-configurable thinking budget
Gemini 2.5 Pro    | Mar 2025   | Deep Think mode                                      | Strong math and reasoning with thinking toggle
OpenAI o4-mini    | 2025       | Efficient reasoning model with vision                | 93.4% AIME, tool use + reasoning
Qwen 3            | Apr 2025   | Hybrid thinking: fast mode + reasoning mode          | Switches between fast and deep reasoning on demand

Why this matters for programmers: Reasoning models spend more tokens (and thus more time and money) on hard problems but can solve tasks that standard models cannot. When using an API, you’re essentially choosing between “fast and cheap” (standard models) and “slow and thorough” (reasoning models). Some providers let you control the thinking budget — more thinking tokens means better answers but higher cost and latency.

Known limits of transformers

Understanding the architecture also means understanding its constraints:

O(n²) attention cost: As discussed in the masking section, attention scales quadratically with sequence length. This is a fundamental limit that requires architectural innovations (sparse attention, linear attention, MoE routing) to work around.
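A two-line illustration of that growth (token counts are illustrative):

```python
# Attention computes one score per (query, key) pair, so the score matrix
# grows quadratically: 32x more tokens means 1024x more scores.
for n in (4_096, 131_072):
    print(f"{n:>7} tokens -> {n * n:,} scores per head per layer")
```

Going from a 4K to a 128K context multiplies the token count by 32 but the attention work by 1,024.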

Hallucination: The model generates text by sampling from probability distributions. It can produce confident-sounding text about things that aren’t true. There’s nothing in the architecture that distinguishes “factual claim” from “plausible-sounding sequence.” Hallucination is inherent to the generation mechanism.

Context window limits: Even with 1M-token windows, the model can only use information in its current context. It can’t access external databases, real-time data, or information not in the context (unless augmented with tools like RAG).

Training data dependence: The model only knows what it’s seen. Knowledge cutoff dates, biased training data, and gaps in coverage all affect output quality. A model trained mostly on English text will struggle with Swahili.

Limited reasoning architecture: Transformers have no symbolic manipulation engine, no persistent memory outside the current context window, and no guarantee of consistency across separate generations. Their outputs are probability distributions over tokens — not the result of formal logical inference. Reasoning models (o3, R1) mitigate this by spending more inference-time compute on chain-of-thought, but the underlying mechanism is still next-token prediction — the model is generating text that looks like reasoning, not performing formal logic. This means they can still fail on novel problem types, and longer thinking doesn’t always help.

These aren’t failures of specific models — they’re properties of the transformer architecture itself. Understanding them helps you use these tools more effectively: verify outputs, design prompts that provide context, and choose the right model for the task.

References

  1. OpenAI. “Introducing GPT-4.1.” 2025. OpenAI blog.
  2. Anthropic. “Claude 4.5 Sonnet.” 2025. anthropic.com
  3. Google DeepMind. “Gemini 2.5: Our most intelligent AI model.” 2025. Google blog.
  4. Meta AI. “Introducing Llama 4.” 2025. Meta AI blog.
  5. Qwen Team. “Qwen3 Technical Report.” Alibaba, 2025. arXiv:2505.09388
  6. DeepSeek-AI. “DeepSeek-V3 Technical Report.” 2024. arXiv:2412.19437
  7. DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” 2025. arXiv:2501.12948
  8. OpenAI. “Learning to Reason with LLMs.” OpenAI blog, 2024.

Summary: All major LLMs use the same transformer core, but they differ in everything else: tokenizer, training data, scale, context window, architecture tweaks (MoE, GQA, MLA), fine-tuning recipes, alignment policy, tool integration, and product goals. Reasoning models add a new dimension — spending more compute at inference time to solve harder problems. These differences explain why models behave differently — and why programmers should care about more than just 'which model is biggest.'

Conclusion — What to Learn Next

You’ve now traveled the full path: from raw text to token IDs, through embeddings and positional encoding, into the attention mechanism, through feed-forward networks and residual connections, across stacked transformer blocks, and out through the output head to generated text. You’ve seen how models are trained, fine-tuned, aligned, and deployed. You understand why different models produce different results, and where the fundamental limits lie.

You know more about how these tools work than most programmers who use them every day.

What to do next

Here are four concrete next steps, in order of investment:

Study tokenizer behavior in your daily tools

Next time you use an LLM API, pay attention to token counts. Try the OpenAI Tokenizer or Hugging Face’s tokenizer playground. Notice:

  • How your code gets tokenized (each keyword, symbol, whitespace)
  • Why some prompts use more tokens than you’d expect
  • How different tokenizers handle the same text differently

Read the original paper

Attention Is All You Need (Vaswani et al., 2017) is the paper that started everything. With the understanding from this article, you can now read it and follow the architecture description. Focus on Sections 3.1–3.3 (attention mechanism) and the figures.

Implement a tiny attention module

The best way to internalize the architecture is to build it. Write a minimal attention function (shown here in Python with NumPy):

import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row max before exponentiating (the log-sum-exp trick)
    # so the softmax stays numerically stable for large scores.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

# Test with small matrices
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
output = attention(Q, K, V)
print(output.shape)  # (4, 8)

Then add masking. Then add multi-head. Then add the FFN and residual connections. By the time you’ve built a minimal transformer block, the architecture will be second nature.
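As a sketch of that first extension step, here is a causal (masked) variant of the same minimal function — our own illustration, not copied from any particular codebase:

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions: -inf scores become 0 after the softmax's exp
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out = causal_attention(Q, K, V)
print(np.allclose(out[0], V[0]))  # True: position 0 can only see itself
```

The check at the end is a useful sanity test for any causal implementation: the first position has nothing to attend to but itself, so its output must equal its own value vector.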

Look at real transformer code

Study production implementations:

  • nanoGPT by Andrej Karpathy — ~300 lines of clean PyTorch, the simplest complete GPT implementation
  • minGPT — slightly more structured, also by Karpathy
  • Hugging Face Transformers — production library, more complex but covers every model family
  • LLaMA model code — see how a real 70B model is structured

Further reading

Final thought

Transformers aren’t magic. They’re a well-designed computation pipeline: project text into vectors, let vectors attend to each other, transform them through feedforward layers, and predict the next token. The “intelligence” emerges from scale — billions of parameters, trillions of training tokens, and the remarkable fact that next-token prediction is sufficient to learn the structure of language. Even reasoning models, which appear to “think,” are generating tokens — they’ve simply learned that producing intermediate steps leads to better final answers.

Now that you understand the mechanism, you can use these tools with more confidence, debug unexpected behavior with more insight, and make better decisions about which model to use, how to prompt it, and what to trust in its output.

Summary: You now understand how transformers work — from tokenization to generation. Here's what to do next: read the original paper, implement a tiny attention module, study tokenizer behavior in your daily tools, and look at real transformer code.