What Every Programmer Should Know About CPU

An interactive guide to modern CPU microarchitecture

~20,000 words · 23 interactive demos · x86-64 · ARM · Apple Silicon

From Source Code to Execution

Every program you write, regardless of language, must eventually become a sequence of electrical signals toggling transistors inside a processor. The path from your editor to those transistors is a multi-stage transformation pipeline, and understanding each stage is fundamental to reasoning about performance.

The Compilation Pipeline

Consider a simple C function:

int sum(int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

This human-readable text passes through several distinct phases before the CPU can act on it.

Preprocessing and Compilation

The C preprocessor resolves #include directives, expands macros, and handles conditional compilation. The resulting translation unit is then parsed by the compiler frontend into an Abstract Syntax Tree (AST), which is lowered into an intermediate representation (IR). Modern compilers like GCC and Clang/LLVM perform the bulk of their optimization work on this IR — constant folding, dead code elimination, loop unrolling, vectorization, and register allocation all happen here, long before any machine code is emitted.

The compiler backend takes the optimized IR and performs instruction selection: mapping abstract operations to concrete instructions from the target ISA. For x86-64, this means choosing between hundreds of possible encodings, selecting appropriate addressing modes, and scheduling instructions to respect latency constraints.

Assembly and Machine Code

The assembler converts human-readable mnemonics into binary machine code. For x86-64, this is a particularly complex step because the ISA uses variable-length encoding. A single instruction can be anywhere from 1 to 15 bytes, with a rich prefix system that modifies operand sizes, enables SIMD extensions, or adds segment overrides.

Our loop body might compile to something like:

.L3:
    movslq  (%rdi,%rcx,4), %rax    # load arr[i] with sign extension
    add     %rax, %rdx             # total += arr[i]
    inc     %rcx                   # i++
    cmp     %rcx, %rsi             # compare i with n
    jg      .L3                    # loop while i < n

Each of these mnemonics maps to a specific byte sequence. The movslq with a SIB (Scale-Index-Base) addressing mode encodes into 4 bytes. The add between two registers is just 3 bytes. This variable-length encoding is a significant source of complexity in the instruction fetch and decode stages of the pipeline, as we will explore in the Instruction Fetch and Instruction Decode sections.

Machine Code to Micro-ops

Here is where things get interesting from a microarchitecture perspective. Modern x86 processors do not execute machine instructions directly. Instead, the decode unit translates each x86 instruction into one or more micro-operations (micro-ops or uops). This is a crucial architectural decision that has defined x86 processor design since the Intel P6 microarchitecture in 1995.

The motivation is straightforward: x86 instructions vary wildly in complexity. A simple register-to-register add maps to a single micro-op, but a complex instruction like rep movsb (memory copy) might generate hundreds. By cracking these variable-complexity instructions into uniform micro-ops, the processor can feed them into a fixed-width, high-throughput execution engine.

For example, a mov from memory followed by an add might appear as two x86 instructions, but the load-and-add can sometimes be fused into a single micro-op (see Micro-fusion). Conversely, a single push %rax instruction decomposes into two operations: a store to memory and a decrement of %rsp (although modern cores offload the %rsp update to a dedicated stack engine, so the push often occupies only one slot in the main pipeline).

Why This Matters for Performance

Understanding this transformation pipeline has direct practical implications:

Compiler optimization levels matter. The difference between -O0 and -O2 is not just “faster code” — it changes which instructions are emitted, how many micro-ops are generated, and whether the pipeline can process them efficiently. An -O2 build might vectorize your loop, replacing scalar adds with 256-bit vpaddd instructions that each process eight 32-bit integers.

Instruction count is not a proxy for performance. Two functions with the same number of x86 instructions can have dramatically different micro-op counts. A function heavy on complex addressing modes will generate more micro-ops and may bottleneck the decoder, while a function using simple register operations will flow through the frontend at full width.

The ISA is an abstraction boundary. When you profile with perf, you see x86 instructions. But the processor operates in a micro-op domain internally. Many performance counters (IDQ delivery, retire slots, port utilization) report in micro-ops, not instructions. Learning to think in both domains is essential for effective performance analysis.

Source → Assembly → Micro-ops

Click any line to see how C code maps to assembly instructions and their micro-op decomposition.

C Source

    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += arr[i];

x86-64 Assembly

    0000  XOR EAX, EAX            (1 uop)
    0002  XOR ECX, ECX            (1 uop)
    0004  CMP ECX, EDX            (1 uop)
    0006  JGE done                (1 uop)
    0008  ADD EAX, [RSI+RCX*4]    (2 uops)
    000c  INC ECX                 (1 uop)
    000e  JMP loop                (1 uop)

Micro-ops

    XOR EAX, EAX → zero register    (1c)
    XOR ECX, ECX → zero counter     (1c)
    CMP ECX, EDX → set flags        (1c)
    JGE → conditional branch        (1c)
    Load arr[i] from memory         (4c)
    ADD sum, arr[i]                 (1c)
    INC ECX → i++                   (1c)
    JMP → unconditional branch      (1c)

Port classes: Integer ALU · Memory Load · Store · Branch

The Micro-op as the Fundamental Unit

Throughout this guide, we will repeatedly return to the micro-op as the fundamental unit of work inside the processor. Every pipeline stage — from the frontend that produces them, to the out-of-order engine that schedules them, to the execution ports that consume them, to the retirement unit that commits their results — operates on micro-ops. The x86 instruction is what the programmer sees; the micro-op is what the machine executes.

In the next section, we will zoom out and look at how these micro-ops flow through the complete pipeline, introducing the stage-by-stage model that organizes the rest of this guide.

The Pipeline at a Glance

A modern CPU pipeline is the organizational backbone of instruction execution. It decomposes the work of running a program into discrete stages, allowing multiple instructions to be in flight simultaneously — much like an assembly line in a factory. Understanding the pipeline is the single most important mental model for reasoning about CPU performance.

Why Pipelining Exists

Consider a naive processor that executes one instruction at a time, start to finish, before moving to the next. If each instruction takes 5 nanoseconds across all phases (fetch, decode, execute, memory access, writeback), then the processor completes one instruction every 5ns — a throughput of 200 million instructions per second. That sounds fast, but it wastes enormous potential.

Pipelining exploits a critical observation: while instruction A is being executed, the processor can simultaneously decode instruction B and fetch instruction C. By overlapping these phases, the processor can ideally complete one instruction every stage duration rather than every full instruction latency. If each stage takes 1ns, throughput jumps to 1 billion instructions per second — a 5x improvement without making any single instruction faster.

Modern processors take this further with superpipelined designs (deeper pipelines with shorter stage durations) and superscalar execution (multiple instructions per stage per cycle). An Intel Golden Cove core, for instance, has a pipeline depth of roughly 20 stages and can process up to 6 micro-ops per cycle through its frontend.

Pipeline Width and Depth Across Real Processors

The table below compares key pipeline dimensions across representative modern microarchitectures. “Decode width” is the maximum instructions (or macro-ops) decoded per cycle from the MITE/legacy path; “rename/allocate width” is the maximum micro-ops dispatched into the out-of-order engine per cycle; “retire width” is the maximum micro-ops retired per cycle; “pipeline depth” is the approximate number of stages from fetch to retire.

Microarchitecture | Decode Width | Rename/Alloc Width | Retire Width | Approx. Pipeline Depth
Intel Skylake (2015) | 5 uops/cyc (MITE: 4+1 complex) | 4 uops/cyc | 4 uops/cyc | ~14-19 stages [Source: Intel, Optimization Reference Manual, 2023]
Intel Golden Cove (2021) | 6 uops/cyc (MITE: 6-wide) | 6 uops/cyc | 8 uops/cyc | ~20 stages [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024) | 8 uops/cyc | 8 uops/cyc | 8 uops/cyc | ~20 stages [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 4 instr/cyc | 6 macro-ops/cyc | 8 macro-ops/cyc | ~19 stages [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 4 instr/cyc | 6 macro-ops/cyc | 8 macro-ops/cyc | ~19 stages [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 5 (2024) | 4 instr/cyc (2x decode clusters) | 8 macro-ops/cyc | 8 macro-ops/cyc | ~19 stages [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023) | 10 instr/cyc | 10 uops/cyc | 8 uops/cyc | ~14 stages [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | 8 instr/cyc | 8 uops/cyc | 8+ uops/cyc | ~16 stages [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

Pipeline depth varies depending on where you draw the boundary and which paths you count (e.g., branch misprediction penalty vs. full pipeline depth). The numbers above represent approximate depth as measured by misprediction recovery latency or documented stage counts. ARM and Apple cores tend toward shorter pipelines at lower clock speeds, while Intel pushes deeper pipelines to achieve higher frequencies.

The Major Pipeline Domains

A modern x86 pipeline divides into three broad domains, each with distinct responsibilities and bottleneck characteristics.

CPU Pipeline Block Diagram

Hover for details, click to jump to the section covering each stage.

Modern Out-of-Order Superscalar Pipeline (Golden Cove / Zen 4 class, ~18-22 stages, 6-wide):

    Instruction Fetch → Decode → Rename/Allocate → Reservation Station → Execute → Retire (Commit)

Instructions flow left to right through the pipeline.

The Frontend

The frontend is responsible for supplying a steady stream of micro-ops to the execution engine. It encompasses:

  • Instruction Fetch: Reads instruction bytes from the L1 instruction cache (L1i), dealing with variable-length x86 encoding and cache line boundaries. See Instruction Fetch.
  • Branch Prediction: Predicts the direction and target of branches before they are even decoded, allowing speculative fetch to continue without stalling. See Branch Prediction.
  • Instruction Decode: Converts x86 instructions into micro-ops, handling the complexity of a CISC instruction set. See Instruction Decode.
  • Micro-op Caches: The DSB (Decoded Stream Buffer) caches decoded micro-ops, bypassing the expensive decode stage for hot code. The LSD (Loop Stream Detector) replays small loops from a buffer. See DSB and LSD.
  • Fusion: Micro-fusion and macro-fusion combine operations to increase effective pipeline width. See Micro-fusion and Macro-fusion.

The frontend’s job is to deliver micro-ops at the maximum rate the backend can consume. When it fails — due to cache misses, branch mispredictions, or decode bottlenecks — the pipeline starves.

The Out-of-Order Engine

The out-of-order (OoO) engine is the heart of a modern superscalar processor. It receives micro-ops from the frontend and orchestrates their execution, potentially reordering them to maximize throughput while preserving the illusion of sequential execution. Key structures include:

  • Register Renaming: Eliminates false dependencies (WAR and WAW hazards) by mapping architectural registers to a larger physical register file.
  • The Reorder Buffer (ROB): Tracks all in-flight micro-ops and ensures they retire in program order, even though they may execute out of order.
  • The Scheduler (Reservation Station): Holds micro-ops until their operands are ready, then dispatches them to available execution ports.

The Backend (Execution and Memory)

The backend contains the functional units that perform actual computation and the memory subsystem that feeds them data:

  • Execution Ports: Each port hosts a set of functional units (ALUs, FPUs, load/store units). The scheduler dispatches ready micro-ops to specific ports based on the operation type.
  • Memory Subsystem: Load and store buffers, the cache hierarchy (L1d, L2, L3), TLBs, and the memory ordering machine that ensures correct behavior in the presence of speculative and out-of-order execution.

Instructions Per Cycle (IPC)

IPC — instructions per cycle — is the central throughput metric for CPU-bound workloads. It measures how many instructions the processor retires, on average, per clock cycle. IPC is determined by:

IPC = Instructions Retired / CPU Cycles Elapsed

A processor with a theoretical maximum of 6 micro-ops per cycle might sustain an IPC of 2-4 on typical workloads, depending on the instruction mix, data dependencies, branch prediction accuracy, and cache behavior. Understanding why actual IPC falls short of the theoretical maximum is the core question that drives microarchitectural performance analysis.

There is a subtle distinction worth noting: IPC is typically reported in x86 instructions, but internally the processor works in micro-ops. The relationship between the two depends on the instruction mix. Code heavy on simple register operations will have a near 1:1 ratio. Code with complex memory operations or string instructions will generate more micro-ops per instruction, and the micro-op throughput becomes the true bottleneck even if the instruction-level IPC looks modest.

Instruction Flow Through the Pipeline

Watch instructions fill the pipeline and flow from Fetch to Retire. Use step mode to advance one cycle at a time.

    Fetch → Decode → Rename → Schedule → Execute → Retire

Pipeline Hazards: Why IPC Drops

Three categories of hazards prevent the pipeline from sustaining maximum throughput:

Data hazards arise when an instruction depends on the result of a prior instruction that has not yet completed. The out-of-order engine mitigates these through register renaming and dynamic scheduling, but true data dependencies (RAW hazards) impose serialization that no amount of hardware can eliminate.

Control hazards occur at branches. The processor must predict the branch outcome to continue fetching; a misprediction flushes all speculative work from the pipeline, wasting 15-20 cycles on modern cores.

Structural hazards happen when two micro-ops need the same hardware resource (e.g., both need port 1 for multiplication). The scheduler resolves these by stalling one micro-op until the resource is free.

The Performance Analysis Framework

When a workload underperforms, the pipeline model provides a structured diagnostic approach: is the frontend failing to deliver micro-ops? Is the backend stalling on data dependencies or resource conflicts? Are branch mispredictions flushing speculative work? Each bottleneck maps to specific pipeline stages and specific performance counters.

The rest of this guide walks through each pipeline stage in detail, starting with the frontend in Part 2. By the end, you will have a mental model precise enough to interpret hardware performance counters, predict the impact of code transformations, and reason about why certain optimizations work — and others do not.

Instruction Fetch

The instruction fetch stage is the entry point of the pipeline: it reads raw instruction bytes from memory (via the L1 instruction cache) and delivers them to the decode unit. While conceptually simple, fetch is surprisingly subtle on x86 processors due to the variable-length instruction encoding and the need to sustain high throughput for a superscalar backend.

Fetch Width and Throughput

On modern Intel cores, the fetch unit reads a 16-byte window from the L1 instruction cache (L1i) each cycle. This 16-byte fetch block is the fundamental unit of instruction supply from the MITE (Macro Instruction Translation Engine) path. The number of complete instructions contained in a 16-byte window varies because x86 instructions range from 1 to 15 bytes in length.

The 16-byte fetch width applies specifically to Intel Skylake-era cores. Newer and competing architectures fetch wider:

Microarchitecture | Fetch Width (bytes/cycle) | Notes
Intel Skylake (2015) | 16 B/cyc | Single 16-byte aligned fetch window [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | 32 B/cyc | Doubled fetch bandwidth; 32-byte aligned blocks [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024) | 32 B/cyc | Maintains 32-byte fetch [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 32 B/cyc | 32-byte fetch window [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 32 B/cyc | 32-byte fetch window [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 5 (2024) | 32 B/cyc (2x16 B) | Dual 16-byte fetch pipelines feeding 2 decode clusters [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023) | 16 B/cyc (64 B fetch line) | Fixed-width 32-bit instructions; 16 bytes = 4 instr fetched from 64 B lines [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | 32 B/cyc (estimated) | Fixed-width 32-bit instructions; 32 bytes = 8 instructions, sustaining 8-wide decode [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

ARM cores benefit enormously from fixed-width 32-bit (4-byte) instructions: there is no predecode complexity and instruction boundaries are always known. This enables simpler, wider fetch designs. The x86 variable-length encoding makes 16-32 bytes the practical limit, since the predecode/ILD unit must parse instruction boundaries within each fetch window.

Consider this sequence of instructions:

inc     %eax               # 2 bytes (opcode + ModRM)
add     %ebx, %ecx         # 2 bytes (opcode + ModRM)
mov     %rdx, %rsi         # 3 bytes (REX.W + opcode + ModRM)
lea     (%rdi,%r8,4), %r9  # 4 bytes (REX + opcode + ModRM + SIB)
cmp     $0x12345678, %eax  # 5 bytes (opcode + imm32)

These five instructions total 16 bytes and could all be fetched in a single cycle. But a sequence of longer instructions — particularly those with SIB bytes, displacement fields, and immediates — might yield only 2-3 instructions per fetch window. This variability means the fetch stage can become a bottleneck for code with large instruction footprints.

Fetch Alignment and Boundaries

The 16-byte fetch window aligns to 16-byte boundaries in memory. This has an important consequence: if a branch target lands in the middle of a 16-byte aligned block, the bytes before the target address are wasted. The fetch unit cannot use the bytes preceding the branch target, effectively reducing the useful fetch bandwidth for that cycle.

For example, if a branch targets address 0x1008 within the aligned block 0x1000-0x100F, the first 8 bytes of the fetch window are discarded. Only 8 bytes of useful instruction data are delivered. This is why compilers use .p2align directives and NOP padding to align hot loop headers and function entries to 16-byte or 32-byte boundaries — it maximizes the usable bytes in the first fetch window after a taken branch.

The perf counter frontend_retired.dsb_miss and related events can help identify cases where fetch alignment is impacting performance.

The L1 Instruction Cache

The L1i cache is the fetch unit’s primary data source. On recent Intel cores (Skylake and descendants), it is a 32 KB, 8-way set-associative cache with 64-byte cache lines. A hit costs a few cycles of access latency, which the pipelined fetch stages normally hide.

A few characteristics make L1i behavior distinct from L1d (data cache):

Inclusion and coherence. On many designs the L1i contents are duplicated in the L2, though inclusion policies vary by vendor and generation. Coherence between the instruction and data caches also differs by ISA: x86 hardware detects stores that hit cached instruction bytes and keeps the caches coherent automatically (at the cost of a pipeline flush), so self-modifying code and JIT compilers need no explicit cache maintenance; on ARM, software must explicitly clean the data cache and invalidate the instruction cache for the affected lines, then execute the appropriate barriers, before running freshly written code.

Access patterns. Instruction fetches are predominantly sequential with occasional jumps. This makes hardware prefetching for L1i effective for straight-line code but less so for code with many taken branches (e.g., interpreters with indirect dispatch).

Capacity pressure. A 32 KB L1i sounds generous, but large applications with sprawling hot paths — think database query engines, JIT compilers, or web browsers — can easily exceed this capacity. L1i misses stall the fetch stage and propagate as frontend starvation throughout the pipeline. The ICACHE.MISSES performance counter tracks this directly.

Predecode and Instruction Length Determination

Before instructions reach the main decoders, x86 processors perform a predecode step that determines instruction boundaries within the fetched byte stream. This is non-trivial because x86 instructions have no fixed length or regular structure — the length of an instruction depends on its opcode, prefixes, ModRM byte, SIB byte, and immediate/displacement fields.

The predecode unit marks instruction boundaries and identifies prefixes, typically processing 16 bytes per cycle. These annotations are stored in the L1i alongside the instruction bytes (or in a separate predecode buffer), so repeated fetches of the same cache line do not need to redo the work. This predecode overhead is one reason the DSB (micro-op cache) provides such a significant performance benefit for hot code: it bypasses not just decode but also predecode entirely.

Instruction Fetch Window

A 16-byte fetch window slides over variable-length x86 instructions. Notice how instruction boundaries affect how many instructions are fetched per cycle.

A 16-byte fetch window slides across a byte stream of variable-length instructions (XOR, MOV, ADD, NOP, IMUL, CMP, JE, MOV, ADD, LEA, INC, JMP, SUB, AND, CALL, RET). Each instruction spans 1-7 bytes, so the number of complete instructions inside the window changes as it advances.

Fetch Bubbles and Stalls

Several conditions cause the fetch unit to deliver fewer than 16 bytes of useful instructions:

L1i cache misses. A miss incurs the L2 access latency (typically 12-14 cycles), during which no instructions are fetched from that path. For workloads with large instruction footprints, this is a primary source of frontend starvation.

Taken branches. Every taken branch effectively ends the current fetch window. The fetch unit must redirect to the branch target (predicted by the branch predictor — see Branch Prediction), which is in a new 16-byte aligned block. This means at most one taken branch can be processed per cycle from the MITE path.

Fetch-decode bandwidth mismatch. Even when fetch delivers 16 bytes per cycle, the decode unit may not be able to consume all instructions in that window due to decode width limits or complex instructions that require multiple decode cycles.

Implications for Software

Several practical guidelines follow from how fetch works:

  • Keep hot loops compact. Fewer instruction bytes means the loop is more likely to fit in a single fetch window and in the DSB. Prefer shorter instruction encodings when you have a choice (e.g., xor %eax, %eax instead of mov $0, %eax).
  • Align branch targets. Compilers generally handle this, but hand-written assembly or JIT-generated code should align loop headers. The Linux kernel uses ALIGN macros extensively for this reason.
  • Minimize code footprint. Aggressive inlining and loop unrolling improve execution efficiency but increase L1i pressure. There is a balance point where the cost of L1i misses exceeds the benefit of eliminating call overhead.
  • Beware of instruction padding. Long NOP sequences used for alignment consume fetch bandwidth and decode slots. Modern assemblers emit multi-byte NOPs (up to 15 bytes each) to minimize this cost.

The fetch stage sets the upper bound on the pipeline’s throughput. No amount of out-of-order execution wizardry can compensate for a frontend that cannot deliver instructions fast enough. In the next section, we look at branch prediction, the mechanism that enables the fetch unit to continue streaming instructions across control flow boundaries without waiting for branches to resolve.

Branch Prediction

Branch prediction is arguably the most critical speculation mechanism in a modern processor. Without it, the pipeline would stall at every branch instruction, waiting for the condition to be evaluated before knowing where to fetch next. Given that roughly 20% of all instructions are branches, this would devastate throughput. Instead, the branch predictor guesses the outcome of each branch — both its direction (taken or not taken) and its target address — allowing the fetch unit to continue speculatively at full speed.

The Cost of Misprediction

When the predictor is wrong, the processor must flush all speculatively fetched and partially executed instructions from the pipeline and restart from the correct path. On modern Intel cores, this costs approximately 15-20 cycles — the full depth of the pipeline. During this time, the backend may continue executing independent work already in the scheduler, but no new micro-ops are delivered from the frontend.

The performance impact is significant. A branch with 95% prediction accuracy that executes 100 million times per second generates 5 million mispredictions per second. At a 15-cycle penalty each, that is 75 million wasted cycles — potentially enough to reduce IPC by 20-30% depending on the workload. This is why branch prediction accuracy is one of the most impactful microarchitectural parameters.

Predictor Components

A modern branch prediction unit consists of several cooperating structures.

Direction Predictors

The direction predictor answers the question: will this conditional branch be taken or not taken?

Bimodal (Saturating Counter) Predictors. The simplest predictor uses a table of 2-bit saturating counters indexed by the branch address. Each counter tracks the recent behavior of a branch: strongly taken (11), weakly taken (10), weakly not-taken (01), strongly not-taken (00). This design handles branches with stable behavior well but fails on branches whose direction depends on history patterns.

Two-Level Predictors. These predictors incorporate the global history of recent branch outcomes. A Global History Register (GHR) records the taken/not-taken outcomes of the last N branches. This history is combined (usually XORed) with the branch address to index into a pattern history table of saturating counters. This allows the predictor to learn correlations between branches — for example, that branch B is always taken when branch A was not taken.

TAGE (TAgged GEometric) Predictors. TAGE is the state of the art, used in Intel cores from Haswell onward. It uses multiple tables indexed by different history lengths, arranged geometrically (e.g., history lengths of 5, 10, 20, 40, 80, 160). Each table entry includes a tag for partial matching. The predictor queries all tables in parallel and uses the result from the table with the longest matching history. This design adapts naturally to branches with different history requirements — simple branches match in short-history tables, while complex patterns use longer histories.

Perceptron Predictors. Used in AMD Zen architectures and Samsung Exynos cores, perceptron predictors apply a simplified neural network. Each branch has a vector of weights corresponding to positions in the global history. The dot product of the weight vector and the history vector determines the prediction. Perceptron predictors excel at learning linearly separable patterns but require a different training algorithm and area trade-off compared to TAGE.

Branch Target Buffer (BTB)

While the direction predictor determines whether a branch is taken, the Branch Target Buffer determines where it goes. The BTB is essentially a cache of branch instruction addresses mapped to their target addresses. When the fetch unit encounters an address that hits in the BTB, it can redirect fetch to the predicted target in the same cycle.

The BTB must be consulted before decode — before the processor even knows that the bytes at a given address constitute a branch instruction. This means the BTB operates on instruction addresses, not decoded instructions. Modern BTBs have thousands of entries and may use multi-level structures (a small, fast L1 BTB and a larger, slower L2 BTB) similar to the data cache hierarchy.

A BTB miss on a taken branch means the processor does not know the target address and must wait for decode to determine it. This stalls the fetch pipeline even if the direction prediction is correct.

Return Stack Buffer (RSB)

Function returns (ret instructions) are a special case: their target address changes on every call because it depends on the call site. A generic BTB would predict the return target as the address from the most recent execution, which is wrong whenever the function is called from a different location.

The Return Stack Buffer (also called the Return Address Stack) solves this by maintaining a hardware stack. Each call instruction pushes the return address onto the RSB, and each ret pops it. This provides near-perfect prediction accuracy for returns, as long as the RSB does not overflow (typically 16-32 entries deep) and call/return pairs are matched.

RSB underflows can occur with deeply recursive functions or coroutine-style control flow. Some processors fill RSB entries with a safe address (like a pause loop) on context switches to mitigate Spectre-RSB attacks.

Branch Prediction Across Real Processors

The table below compares the branch prediction structures across representative microarchitectures. BTB sizes are approximate and based on reverse-engineering measurements where vendor documentation is incomplete.

Microarchitecture | Predictor Type | BTB Entries (L1 / L2) | RSB Depth
Intel Skylake (2015) | TAGE-based | ~4,096 L1 / ~4,096-8,192 L2 | 16 entries [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | TAGE-based (improved) | ~12,288 L1 / ~12,288+ L2 | 16 entries [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]
Intel Lion Cove (2024) | TAGE-based (improved) | Not publicly documented | 16 entries (estimated, same as Golden Cove)
AMD Zen 3 (2020) | Perceptron-based (TAGE+perceptron hybrid) | ~6,656 L1 / ~6,656+ L2 | 32 entries [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
AMD Zen 4 (2022) | Perceptron-based (TAGE+perceptron hybrid) | ~7,168+ L1 / larger L2 | 32 entries [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | Perceptron-based | Not publicly documented | 32 entries [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023) | TAGE-based | Not publicly documented | Not publicly documented
Apple M1 Firestorm (2020) | Unknown (likely TAGE-class) | Not publicly documented | Not publicly documented

Intel and AMD have taken different approaches to direction prediction: Intel uses TAGE (TAgged GEometric history length) predictors since Haswell, while AMD uses perceptron-based predictors augmented with TAGE-like components since Zen 2 [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]. Both achieve branch prediction accuracies above 97% on most workloads. AMD’s Zen cores tend to have deeper RSBs (32 entries vs. Intel’s 16), which benefits deeply nested call chains.

2-Bit Saturating Counter Branch Predictor

Step through a branch pattern to see how the predictor state machine updates. T=Taken, N=Not Taken.

The predictor has four states — SN (Strongly Not taken), WN (Weakly Not taken), WT (Weakly Taken), and ST (Strongly Taken). The first two predict not-taken, the last two predict taken. A taken outcome moves the state one step toward ST; a not-taken outcome moves it one step toward SN, saturating at both ends.
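The demo's state machine is small enough to sketch directly. The C model below (the type and function names are illustrative, not from any library) implements the four-state counter and counts mispredictions over a T/N pattern:

```c
#include <assert.h>

/* 2-bit saturating counter: 0 = SN, 1 = WN, 2 = WT, 3 = ST.
 * Predict taken when state >= 2; saturate at 0 and 3 on update. */
typedef struct { int state; } counter2;

int predict_taken(const counter2 *c) { return c->state >= 2; }

void update(counter2 *c, int taken) {
    if (taken  && c->state < 3) c->state++;
    if (!taken && c->state > 0) c->state--;
}

/* Step through a 'T'/'N' pattern, returning the misprediction count. */
int mispredicts(counter2 *c, const char *pattern) {
    int miss = 0;
    for (const char *p = pattern; *p; p++) {
        int taken = (*p == 'T');
        if (predict_taken(c) != taken) miss++;
        update(c, taken);
    }
    return miss;
}
```

Running a biased stream with one anomalous outcome through this model shows the hysteresis: a single N in a run of Ts costs one misprediction but leaves the prediction itself at taken.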

Indirect Branches

Indirect branches (jmp *%rax, call *%rax) pose a unique challenge because their target can differ on every execution. Virtual method dispatch, jump tables (from switch statements), and function pointers all generate indirect branches.

Modern processors use an Indirect Target Array (ITA) or similar structure that uses global history to predict indirect branch targets. This allows the predictor to learn patterns like “when the last three branches went T, N, T, this indirect jump targets address X.” Performance on polymorphic call sites in object-oriented code depends heavily on this predictor’s accuracy.

Static and Dynamic Hints

Before dynamic predictors existed, processors used static heuristics: forward branches default to not-taken, backward branches to taken (since backward branches are usually loop edges). Modern processors largely ignore static hints because dynamic predictors outperform them, but understanding the heuristics helps explain why some legacy optimization guides recommend structuring if-else chains with the common case first.
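The backward-taken/forward-not-taken heuristic reduces to a one-line predicate. This is a sketch only — real decoders see a signed branch displacement rather than absolute addresses:

```c
#include <assert.h>
#include <stdint.h>

/* Static BTFN heuristic: a branch whose target lies at or below the
 * branch address is a backward edge (usually a loop), predicted taken;
 * forward branches are predicted not-taken. */
int static_predict_taken(uint64_t branch_addr, uint64_t target_addr) {
    return target_addr <= branch_addr;
}
```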

Intel x86 includes branch hint prefixes (0x2E for not-taken, 0x3E for taken) that compilers can emit, though most modern cores ignore them. GCC’s __builtin_expect and C++20’s [[likely]]/[[unlikely]] attributes primarily affect code layout rather than hardware hints: the compiler arranges the likely path as the fall-through case, which improves fetch efficiency and I-cache utilization.

Practical Implications

  • Branchless code (using cmov, arithmetic masks, or SIMD) eliminates mispredictions entirely at the cost of always computing both outcomes. This is a net win when branch prediction accuracy falls below roughly 75-85%.
  • Branch-free loop exits through techniques like sentinel values avoid a branch misprediction on the final iteration.
  • Profile-guided optimization (PGO) gives the compiler real branch frequency data, enabling it to arrange code layout for optimal fetch and I-cache behavior.
  • Sorting data before processing can transform an unpredictable data-dependent branch into a perfectly predictable one. The classic example is sorting an array before a threshold filter: the branch transitions from taken to not-taken exactly once.
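The arithmetic-mask technique from the first bullet looks like this in C (function names are my own; whether the branchless form wins depends on the actual misprediction rate of the branchy one):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Branchy version: one data-dependent branch per element. */
int64_t sum_ge_branchy(const int32_t *a, size_t n, int32_t t) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= t) s += a[i];
    return s;
}

/* Branchless version: the comparison produces 0 or 1; negating it
 * yields an all-zeros or all-ones mask, and the AND keeps or drops
 * the element. No conditional branch in the loop body. */
int64_t sum_ge_branchless(const int32_t *a, size_t n, int32_t t) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(a[i] >= t);  /* 0 or 0xFFFFFFFF */
        s += a[i] & mask;
    }
    return s;
}
```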

Branch prediction is the enabler that makes deep pipelining viable. Without it, the frontend would stall roughly once every five instructions. With modern TAGE predictors achieving 97%+ accuracy on typical workloads, the misprediction penalty is amortized into a manageable overhead — but one that still dominates performance for branch-heavy code like interpreters, parsers, and decision trees.

Instruction Decode

Instruction decode is the stage where raw x86 instruction bytes are transformed into the micro-ops that the rest of the pipeline operates on. On a RISC architecture with fixed-width instructions, decoding is relatively straightforward. On x86, it is one of the most complex and power-hungry stages in the entire processor, consuming a significant fraction of the core’s transistor budget and energy.

The x86 Decoding Challenge

The x86-64 instruction format is notoriously irregular. An instruction consists of:

  1. Legacy prefixes (0-4 bytes): LOCK, segment overrides, operand-size override (0x66), address-size override (0x67), REP/REPNE
  2. REX/VEX/EVEX prefix (1-4 bytes): extends register encoding, enables AVX/AVX-512
  3. Opcode (1-3 bytes): identifies the operation
  4. ModRM byte (0-1 byte): specifies operands and addressing mode
  5. SIB byte (0-1 byte): Scale-Index-Base for complex addressing
  6. Displacement (0, 1, 2, or 4 bytes): memory offset
  7. Immediate (0, 1, 2, 4, or 8 bytes): constant operand

This means the decoder must parse a variable-length, prefix-dependent byte stream where the interpretation of each byte depends on the bytes that precede it. The opcode map itself has evolved over four decades, with multi-byte escape sequences (0F xx, 0F 38 xx, 0F 3A xx) grafting new instructions onto an already crowded encoding space.
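To make the prefix-dependence concrete, here is a sketch that pulls apart a REX prefix byte (bit layout 0100WRXB). The struct and helper names are illustrative, but the field positions are the architectural ones:

```c
#include <assert.h>
#include <stdint.h>

/* A REX prefix occupies 0x40-0x4F: fixed high nibble 0100, then
 * W (64-bit operand size), R, X, B (which extend ModRM.reg,
 * SIB.index, and ModRM.rm/SIB.base to reach registers r8-r15). */
typedef struct { int w, r, x, b; } rex_fields;

int is_rex(uint8_t byte) { return (byte & 0xF0) == 0x40; }

rex_fields decode_rex(uint8_t byte) {
    rex_fields f;
    f.w = (byte >> 3) & 1;
    f.r = (byte >> 2) & 1;
    f.x = (byte >> 1) & 1;
    f.b = byte & 1;
    return f;
}
```

For example, the common 0x48 prefix (as in `48 01 c8` for `add %rcx, %rax`) is REX.W: 64-bit operand size with no register extension.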

Simple vs. Complex Decoders

Modern Intel cores address this complexity with an asymmetric decoder design. On Skylake-era cores, the MITE (Macro Instruction Translation Engine) decode cluster contains:

  • 4 simple decoders (D1-D4): Each can decode one x86 instruction that maps to a single micro-op per cycle. Simple instructions include register-to-register add, mov, cmp, bitwise operations, and similar single-uop operations.
  • 1 complex decoder (D0): Can handle instructions that generate up to 4 micro-ops. Instructions like push %rax (store + RSP decrement) or call (push return address + jump) go through this decoder.

This yields a maximum MITE throughput of 5 instructions per cycle when the instruction stream is favorable (1 complex + 4 simple, or 5 simple, since the complex decoder can also handle simple instructions). However, this maximum is rarely sustained because it requires a favorable mix of instruction types and no fetch bottlenecks.

Instructions that generate more than 4 micro-ops cannot be handled by even the complex decoder. These are processed by the Microcode Sequencer (MS-ROM): a dedicated ROM that emits a sequence of micro-ops over multiple cycles. Examples include:

rep movsb          ; string copy - variable length micro-op sequence
cpuid              ; serializing, complex
div %rcx           ; integer division - ~10+ micro-ops

While the MS-ROM is active, the simple decoders are typically stalled, making microcoded instructions a significant throughput bottleneck. The IDQ.MS_UOPS performance counter tracks how many micro-ops were delivered via the microcode sequencer.

The Decode Pipeline

Decoding is not a single-cycle operation. On Skylake-class cores, the decode pipeline spans multiple stages:

  1. Predecode / Instruction Length Decode (ILD): Determines instruction boundaries in the 16-byte fetch window, identifies prefixes, and routes bytes to the appropriate decoder. As discussed in Instruction Fetch, this is a non-trivial task for variable-length encodings.

  2. Instruction Queue (IQ): Buffers predecoded instructions, smoothing out variations in fetch and decode rates. This queue decouples the fetch and decode stages.

  3. Decode: The actual translation from x86 instructions to micro-ops. Each decoder consumes one instruction from the IQ and produces 1 (simple decoders) or up to 4 (complex decoder) micro-ops.

  4. Micro-op Queue / IDQ: The Instruction Decode Queue buffers decoded micro-ops before they enter the rename/allocate stage. This is the point where micro-ops from the MITE path, the DSB (micro-op cache), and the LSD (Loop Stream Detector) converge.

Instruction Decode: x86 → Micro-ops

Click an instruction to see how it decodes into micro-ops. Simple instructions → 1 uop. Complex instructions → multiple uops via microcode.


Decode Bottlenecks in Practice

Several patterns commonly cause decode bottlenecks:

Long instructions. Instructions with many prefixes, large immediates, or complex addressing modes reduce the number of instructions that fit in a 16-byte fetch window, starving the decoders. The EVEX prefix on AVX-512 instructions is 4 bytes by itself, before the opcode, ModRM, SIB, and any displacement.

Instruction mix imbalance. If every instruction in a sequence requires the complex decoder or the microcode sequencer, throughput drops to 1 instruction per cycle or worse. Integer division (div/idiv) is a classic example — it occupies the decode pipeline for many cycles.

Prefix penalties. Some legacy prefix combinations cause decode delays. The 0x66 operand-size prefix before certain SSE instructions (where it acts as a mandatory prefix, not a size override) is handled efficiently, but other prefix combinations may incur a 1-cycle penalty per prefix.

Decoder serialization. Certain instructions are “serializing” at the decoder level, meaning they must be decoded alone. LOCK-prefixed instructions, some system instructions, and instructions with particular prefix combinations can force the decoder to stall other slots.

The Role of the Compiler

The compiler has significant influence over decode efficiency:

  • Instruction selection: Choosing shorter encodings when multiple options exist. For example, test %eax, %eax (2 bytes) is preferred over cmp $0, %eax (5 bytes) for checking if a register is zero. Both set flags identically, but the shorter encoding is friendlier to fetch and decode.
  • Instruction scheduling: Interleaving simple and complex instructions so the decoder slots are utilized evenly. A sequence of 4 simple instructions followed by 1 complex instruction is ideal for the decoder.
  • Avoiding microcoded operations: Replacing loop (microcoded on modern cores) with dec %ecx; jnz (two simple instructions). Replacing enter/leave with explicit stack manipulation.

NOP Handling

NOPs deserve special mention. Modern decoders can handle multi-byte NOPs (0F 1F xx) efficiently, often “zeroing them out” in the decoder without consuming execution resources. However, they still consume decode bandwidth — each NOP occupies a decode slot. The DSB can also cache NOPs as zero-latency micro-ops, but they still occupy DSB entries. This is one reason why excessive alignment padding can hurt performance: the NOPs consume frontend resources even if they are free in the backend.
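Assemblers exploit this by padding with the longest NOP encodings available rather than runs of 0x90, since every instruction costs a decode slot regardless of its length. A sketch (emit_nops is a made-up helper; the byte sequences follow Intel's recommended multi-byte NOPs, truncated at 5 bytes here for brevity):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static const uint8_t nop_enc[5][5] = {
    {0x90},                          /* 1-byte: nop                 */
    {0x66, 0x90},                    /* 2-byte: osize prefix + nop  */
    {0x0F, 0x1F, 0x00},              /* 3-byte: nop (%rax)          */
    {0x0F, 0x1F, 0x40, 0x00},        /* 4-byte: nop 0(%rax)         */
    {0x0F, 0x1F, 0x44, 0x00, 0x00},  /* 5-byte: nop 0(%rax,%rax)    */
};

/* Fill n bytes of padding using the longest NOPs available,
 * returning the number of instructions emitted. Fewer, longer NOPs
 * occupy fewer decode slots than a run of single-byte 0x90s. */
size_t emit_nops(uint8_t *out, size_t n) {
    size_t count = 0;
    while (n > 0) {
        size_t len = n < 5 ? n : 5;
        memcpy(out, nop_enc[len - 1], len);
        out += len;
        n -= len;
        count++;
    }
    return count;
}
```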

Decode Width Across Real Processors

The decode stage varies significantly across vendors, particularly because ARM and Apple cores decode fixed-width instructions (much simpler than x86’s variable-length decode).

Microarchitecture | Decoders (Simple + Complex) | Max MITE Decode/cyc | Micro-op Cache Capacity | Micro-op Cache Assoc.
Intel Skylake (2015) | 4 simple + 1 complex (D0) | 5 instr/cyc | 1,536 uops (32 sets x 8 ways x 6 uops) | 8-way [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | 6-wide decode | 6 instr/cyc | 4,096 uops | 8-way [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]
Intel Lion Cove (2024) | 8-wide decode | 8 instr/cyc | Not publicly documented | Not publicly documented [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 4-wide decode | 4 instr/cyc | 4,096 macro-ops (Op Cache) | 8-way [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 4-wide decode | 4 instr/cyc | 6,750 macro-ops (Op Cache) | Not publicly documented [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | 2x 4-wide decode clusters | 8 instr/cyc total | Not publicly documented | Not publicly documented [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023) | 10-wide decode (fixed-width) | 10 instr/cyc | N/A (no uop cache needed) | N/A [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | 8-wide decode (fixed-width) | 8 instr/cyc | N/A (no uop cache needed) | N/A [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

ARM and Apple cores do not need a micro-op cache because their fixed-width instruction encoding makes decoding simple and fast. The x86 micro-op cache (called “DSB” by Intel and “Op Cache” by AMD) exists specifically to amortize the high cost of variable-length x86 decode. See DSB for details.

From Decode to the IDQ

After decoding, micro-ops enter the Instruction Decode Queue (IDQ), which serves as a convergence point and buffer. The IDQ feeds the allocate/rename stage at up to 4 (Skylake) or 6 (Golden Cove) micro-ops per cycle. This rename width, not the decode width, is often the true frontend throughput bottleneck, and the IDQ smooths out cycle-to-cycle variations in decode throughput to keep the allocator fed consistently.

In the next sections, we examine two mechanisms — micro-fusion and macro-fusion — that effectively increase the decode bandwidth by combining multiple operations into single micro-ops.

Micro-fusion

Micro-fusion is a hardware optimization that allows the decoder to combine a memory operation and an ALU operation from a single x86 instruction into one fused micro-op in the fused domain, even though they will later be dispatched as two separate operations in the unfused domain. This effectively increases the frontend’s throughput by packing more work into each pipeline slot.

The Problem Micro-fusion Solves

Many x86 instructions combine a memory access with a computation. Consider:

add     (%rdi), %eax       ; load from memory, then add to eax

This single instruction performs two distinct operations: a load from the address in %rdi and an addition into %eax. Internally, the processor must execute both a load micro-op (dispatched to a load port) and an ALU micro-op (dispatched to an ALU port). Without micro-fusion, this instruction would consume two slots in the IDQ and two slots at the rename/allocate stage.

With micro-fusion, the decoder emits a single fused micro-op that occupies one slot through the frontend — in the IDQ, through rename/allocate, and in the ROB. Only when the scheduler dispatches it to execution ports does it unfuse into two separate operations.

Fused Domain vs. Unfused Domain

This distinction is fundamental to interpreting performance counters:

  • Fused domain: The view from the frontend and ROB. Micro-fused operations count as one micro-op. The uops_issued.any counter reports in the fused domain. The ROB tracks entries in the fused domain, so micro-fusion effectively increases the ROB’s logical capacity.
  • Unfused domain: The view from the scheduler and execution ports. Each micro-fused operation splits into its constituent parts. The uops_dispatched_port.* counters report in the unfused domain.

This means a function that issues 100 fused-domain micro-ops might dispatch 130 unfused-domain micro-ops. The throughput bottleneck might be in the frontend (fused domain) or the backend (unfused domain), and you need to check both sets of counters to diagnose correctly.
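A small helper makes the two-domain bookkeeping concrete (the function is hypothetical; the counter values would come from perf or VTune):

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of unfused-domain (dispatched) to fused-domain (issued)
 * micro-ops. A ratio near 1.0 means little micro-fusion is in
 * effect (or heavy unlamination); the figures in the text above
 * (100 issued, 130 dispatched) give 1.3. */
double unfused_expansion(uint64_t uops_issued_fused,
                         uint64_t uops_dispatched_unfused) {
    if (uops_issued_fused == 0) return 0.0;
    return (double)uops_dispatched_unfused / (double)uops_issued_fused;
}
```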

What Can Be Micro-fused

Micro-fusion applies to instructions that combine a memory operand with an arithmetic or logical operation. Typical examples:

; These micro-fuse (1 fused-domain uop each):
add     (%rdi), %eax           ; load + add
cmp     %eax, (%rdi)           ; load + compare
or      $0xFF, (%rdi)          ; load + or + store (read-modify-write)
mov     %eax, (%rdi)           ; store (address + data, fused as one)

In each case, the memory access (load or store) and the ALU operation fuse into a single frontend micro-op. For stores, the store-address and store-data micro-ops are fused together. For read-modify-write instructions like or $0xFF, (%rdi), the instruction produces a load, an ALU op, and a store — the load and ALU op micro-fuse, and the store is an additional micro-op, resulting in 2 fused-domain micro-ops total.

Micro-fusion Eligibility Rules

Not all memory-referencing instructions can be micro-fused. The rules have evolved across microarchitectures, with notable restrictions introduced in Sandy Bridge and refined since:

Indexed Addressing Mode Restriction

On Sandy Bridge through Skylake, instructions using a three-component indexed addressing mode (base + index + displacement, or base + index * scale + displacement) can be decoded as micro-fused but are unlaminated (un-fused) during the rename/allocate stage. This means they still consume two slots in the ROB and two allocate bandwidth slots, negating much of the benefit.

; Micro-fuses and STAYS fused:
add     (%rdi), %eax           ; base only
add     8(%rdi), %eax          ; base + displacement
add     (%rdi,%rsi), %eax      ; base + index (no displacement)

; Decoded as micro-fused but UNLAMINATED at rename (on SNB-SKL):
add     8(%rdi,%rsi), %eax     ; base + index + displacement
add     (%rdi,%rsi,4), %eax    ; base + index*scale (scale != 1)

The key distinction: if the addressing mode uses an index register AND either a displacement or a scale factor other than 1, the micro-fusion is undone at rename. On more recent microarchitectures (Ice Lake and later), some of these restrictions have been partially relaxed, but the general principle remains important for performance-critical inner loops.
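The SNB-SKL rule above reduces to a small predicate. This is a simplified model of the rule as stated — later cores relax it, so treat it as documentation of the rule, not of any specific CPU:

```c
#include <assert.h>
#include <stdbool.h>

/* Whether a micro-fused load-op stays fused through rename on
 * Sandy Bridge through Skylake: an index register combined with
 * either a displacement or a scale other than 1 triggers
 * unlamination. */
bool stays_fused(bool has_index, bool has_disp, int scale) {
    if (!has_index) return true;        /* (%rdi), 8(%rdi)   */
    return !has_disp && scale == 1;     /* (%rdi,%rsi) only  */
}
```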

Micro-fusion Changes Across Intel Microarchitectures

The indexed addressing mode restriction has evolved notably:

  • Sandy Bridge through Skylake (2011-2015): Instructions with indexed addressing modes (base + index*scale or base + index + displacement) are decoded as micro-fused but unlaminated at the rename stage, consuming 2 ROB entries instead of 1 [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023].
  • Golden Cove (2021) and later: Intel relaxed some unlamination restrictions. Certain indexed addressing modes that were previously unlaminated now stay micro-fused through rename, improving effective ROB utilization for code with complex addressing patterns [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]. The specific rules are complex and instruction-dependent; testing with uops_issued.any vs. uops_dispatched_port.* counters on your target hardware is recommended.
  • AMD Zen family: AMD’s macro-op (analogous to Intel’s micro-op fusion concept) handling differs. AMD’s Zen cores decode x86 instructions into macro-ops, and load-op fusion works differently from Intel’s micro-fusion. AMD does not document “unlamination” as a concept; their op-cache delivers fused macro-ops that stay fused through dispatch [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022].

The practical implication: code that was penalized by unlamination on Skylake may perform better on Golden Cove without any source changes. When tuning for a specific microarchitecture, verify micro-fusion behavior with hardware counters rather than relying on general rules.

Instruction-Specific Restrictions

Certain instruction classes cannot micro-fuse regardless of addressing mode:

  • Instructions with three or more source operands that exceed the register file read ports
  • Some VEX-encoded instructions with certain operand combinations
  • Instructions using REP prefixes (they go through the microcode sequencer)
  • Certain SIMD instructions with memory operands that have specific alignment requirements

Micro-fusion

A load + ALU operation can merge into one fused-domain micro-op, saving pipeline slots.

Example: ADD RAX, [RBX] (addressing: [base])

Unfused (2 pipeline slots):
    LOAD     tmp, [RBX]
    ADD      RAX, tmp

Fused (1 pipeline slot):
    LOAD+ADD RAX, [RBX]

Simple base addressing → fusable.

Performance Impact

Micro-fusion is most impactful in code where memory operations dominate, which is common in real-world workloads. Consider a loop that processes an array:

.loop:
    add     (%rdi), %eax       ; 1 fused-domain uop (micro-fused)
    add     $4, %rdi           ; 1 uop
    dec     %ecx               ; 1 uop (macro-fuses with jnz)
    jnz     .loop              ; 0 uops (macro-fused with dec)

Without micro-fusion, the add (%rdi), %eax would consume 2 frontend slots, making the loop body 4 fused-domain micro-ops. With micro-fusion, it consumes 1 slot, reducing the loop to 3 fused-domain micro-ops (further reduced to 2 by macro-fusion of dec/jnz — see Macro-fusion). On a core with a 4-wide rename, this means the entire loop iteration can be issued in a single cycle.

Implications for Developers

  • Prefer simple addressing modes. Using base + displacement instead of base + index*scale + displacement preserves micro-fusion. In practice, this means restructuring loops to use pointer increments rather than indexed addressing when performance is critical.
  • Check unfused micro-op counts. When profiling with perf or VTune, compare uops_issued.any (fused domain) with uops_dispatched_port.port_* sums (unfused domain). A large discrepancy indicates heavy unlamination.
  • Compiler awareness. Modern compilers are generally aware of micro-fusion rules and try to select addressing modes that preserve fusion. However, manual inspection of hot loops can still reveal opportunities, particularly in code generated from complex index expressions.
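The restructuring from the first bullet looks like this in C. Both functions compute the same sum, but the pointer-bumping form steers the compiler toward a plain (%rdi)-style load operand that stays micro-fused; compilers may transform either form, so check the generated assembly:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Indexed form: likely compiles to a (%rdi,%rcx,4)-style operand,
 * which risks unlamination on SNB-SKL. */
int64_t sum_indexed(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer-bumping form: the load uses a base-only operand and the
 * pointer advances with a separate add. */
int64_t sum_ptr(const int32_t *a, size_t n) {
    int64_t s = 0;
    for (const int32_t *end = a + n; a != end; a++)
        s += *a;
    return s;
}
```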

Micro-fusion is a frontend bandwidth optimization: it does not make any individual operation faster, but it allows the frontend to deliver more work per cycle by packing operations more efficiently. Its companion technique, macro-fusion, operates at the instruction level rather than the micro-op level, and we cover it next.

Macro-fusion

Macro-fusion (also called macro-op fusion) is a decoder optimization that merges two separate x86 instructions into a single micro-op. Unlike micro-fusion, which combines the sub-operations of a single instruction, macro-fusion operates across instruction boundaries, fusing a flag-setting instruction with a subsequent conditional jump into one compare-and-branch micro-op. This saves a decode slot, a ROB entry, and an allocate/rename slot — a pure throughput win.

How Macro-fusion Works

The most common pattern is a comparison or test instruction followed by a conditional jump:

cmp     %eax, %ebx
jne     .target

Without macro-fusion, this pair decodes into two micro-ops: one for the comparison (which sets flags) and one for the conditional branch (which reads flags and changes control flow). With macro-fusion, the decoder recognizes the pair and emits a single “compare-and-branch” micro-op that performs both operations.

This fused micro-op executes on a single branch port and retires as a single ROB entry. From the frontend’s perspective, two instructions become one micro-op, effectively making the decode-to-issue path 1 slot narrower for that pair.

Qualifying Instruction Pairs

Not all flag-setting + branch pairs qualify for macro-fusion. The rules are specific and have expanded over successive microarchitectures.

Flag-setting Instructions (First Instruction)

The following instructions can serve as the first instruction in a fused pair:

Instruction | Description | Notes
CMP | Compare | Most common fusion source
TEST | Bitwise AND test | Very common in null checks
ADD | Addition | Sandy Bridge and later
SUB | Subtraction | Sandy Bridge and later
INC | Increment | Haswell and later
DEC | Decrement | Haswell and later
AND | Bitwise AND | Sandy Bridge and later

On early Core 2 processors, only CMP and TEST qualified. Sandy Bridge added ADD, SUB, and AND. Haswell extended this to INC and DEC, which was significant because these instructions are extremely common in loop counters.

Conditional Jumps (Second Instruction)

The second instruction must be a conditional jump (Jcc). However, not all condition codes qualify with all first instructions:

  • CMP and TEST can fuse with any Jcc condition code (JE, JNE, JL, JGE, JA, JBE, JS, JNS, etc.).
  • ADD, SUB, INC, DEC, AND can only fuse with a subset: JE/JZ, JNE/JNZ, JL/JNGE, JGE/JNL, JLE/JNG, JG/JNLE (the signed comparisons and zero-flag checks). They cannot fuse with unsigned conditions like JA, JB, JAE, JBE, which depend on the carry flag — a flag that INC and DEC do not modify at all.
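The pairing rules can be captured in a small lookup function. This encodes the simplified rules as stated here, not the full per-microarchitecture tables in the vendor manuals:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { F_CMP, F_TEST, F_ADD, F_SUB, F_INC, F_DEC, F_AND } fuse_first;
typedef enum { CC_E, CC_NE, CC_L, CC_GE, CC_LE, CC_G,   /* sign/zero  */
               CC_A, CC_AE, CC_B, CC_BE, CC_S, CC_NS    /* carry/sign */
} fuse_cc;

/* CMP and TEST fuse with any condition; the arithmetic group fuses
 * only with the signed-comparison and zero-flag conditions, never
 * the carry-based unsigned ones. */
bool can_macro_fuse(fuse_first first, fuse_cc cc) {
    if (first == F_CMP || first == F_TEST) return true;
    return cc == CC_E || cc == CC_NE || cc == CC_L ||
           cc == CC_GE || cc == CC_LE || cc == CC_G;
}
```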

Additional Constraints

Several conditions prevent macro-fusion even when the instruction pair is otherwise eligible:

  1. The pair must be in the same 16-byte decode window. If the CMP is the last instruction in one 16-byte fetch block and the JCC is the first instruction of the next, they cannot fuse. Alignment of branch sequences matters.

  2. No prefix on the Jcc. If the jump has a segment override prefix, branch hint prefix, or other legacy prefix, fusion is blocked.

  3. The first instruction must not be a memory destination read-modify-write form. For example, add $1, (%rdi) followed by jnz .target typically does not macro-fuse because the add itself generates multiple micro-ops (load, add, store).

  4. Decode width interaction. On processors with asymmetric decoders, macro-fusion typically occurs at the first decoder slot (D0). This means at most one macro-fusion can occur per decode cycle.

Intel vs. AMD Macro-fusion Differences

Macro-fusion rules differ significantly between Intel and AMD:

Intel (Skylake through Lion Cove): Supports macro-fusion of CMP, TEST, ADD, SUB, INC, DEC, and AND with Jcc. On Skylake, up to one macro-fusion per cycle from the MITE path. On Golden Cove and later, the wider decode path can perform multiple fusions per cycle [Source: Intel, Optimization Reference Manual, 2023]. Intel also macro-fuses in 64-bit mode without restriction.

AMD (Zen 3 through Zen 5): AMD’s macro-op fusion is more restrictive in which instruction pairs qualify. AMD Zen cores support fusion of CMP/TEST with Jcc, and ADD/SUB/INC/DEC with Jcc (subject to condition code restrictions similar to Intel). However, AMD performs macro-fusion only in the integer pipeline (not with floating-point compare + branch sequences) and historically has more constraints on which condition codes qualify [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]. AMD can perform up to 2 macro-fusions per cycle on Zen 3/4, matching their 4-wide decode with up to 6 macro-ops output [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023].

ARM and Apple cores: ARM AArch64 uses a combined compare-and-branch instruction (CBZ, CBNZ, TBZ, TBNZ) natively in the ISA, so macro-fusion at the hardware level is less necessary. The ISA design avoids the problem that macro-fusion solves on x86. Apple’s M-series cores similarly benefit from AArch64’s compare-and-branch instructions.

Macro-fusion

Two adjacent instructions (CMP/TEST + Jcc) can merge into a single micro-op during decode.

Example:
    CMP RAX, RBX
    JE  label
        ↓
    CMP+JE (1 uop)

CMP + Jcc → macro-fuses into a single compare-and-branch uop.

Impact on Effective Decode Width

Macro-fusion effectively increases the number of x86 instructions processed per cycle. Consider a decode cluster that handles 4 instructions per cycle. If one pair macro-fuses, the decoder consumes 5 instructions but produces only 4 micro-ops. This is a 25% increase in effective instruction throughput for that cycle.

In loops, where compare-and-branch pairs appear on every iteration, macro-fusion consistently saves one micro-op per iteration. For a tight loop with 3 micro-ops (after macro-fusion), the loop runs at 1 iteration per cycle on a 4-wide machine, leaving a slot for other work. Without macro-fusion, the loop would be 4 micro-ops, fully saturating the issue width.

A Practical Example

Consider this common loop pattern:

.loop:
    mov     (%rdi,%rcx,4), %eax    ; load array element
    add     %eax, %edx             ; accumulate
    inc     %rcx                   ; increment index
    cmp     %rcx, %rsi             ; compare index with limit
    jl      .loop                  ; loop if index < limit

Without any fusion, this is 5 micro-ops per iteration (the mov is a plain load — a single micro-op, no fusion needed). With macro-fusion of cmp + jl, it becomes 4 fused-domain micro-ops. The loop fits neatly in a 4-wide issue cycle.

Now consider what happens if we restructure to use dec + jnz instead:

.loop:
    mov     (%rdi,%rcx,4), %eax    ; load array element
    add     %eax, %edx             ; accumulate
    inc     %rcx                   ; increment index
    dec     %rsi                   ; decrement counter
    jnz     .loop                  ; loop if counter != 0

On Haswell+, dec + jnz also macro-fuses, giving the same 4 fused-domain micro-ops. On earlier microarchitectures where DEC could not macro-fuse, this would remain at 5 micro-ops — a measurable performance difference.

Compiler Behavior

Modern compilers are aware of macro-fusion rules and actively arrange code to enable it:

  • Placing comparisons immediately before jumps. Compilers avoid inserting instructions between a cmp/test and its corresponding jcc. If other work must happen, flags-preserving instructions (like mov or lea) can be inserted, but any flag-clobbering instruction breaks the fusion opportunity.
  • Choosing fusible instructions. When the compiler needs to test a loop counter, it prefers cmp or test (which fuse with the widest set of conditions) over arithmetic operations that have limited fusion eligibility.
  • JCC Erratum mitigation. Intel disclosed an erratum on Skylake-era processors whose microcode fix prevents jump instructions that cross or end at a 32-byte boundary from being cached in the micro-op cache, at a measurable performance cost. Compilers (with -mbranches-within-32B-boundaries or equivalent) now insert padding to avoid such placements, which can inadvertently push cmp+jcc pairs across 16-byte boundaries and break macro-fusion. This is a real-world example of micro-architectural constraints influencing code generation.

Relationship to Branch Prediction

Macro-fused compare-and-branch micro-ops are predicted and executed as single units by the branch prediction and execution machinery. This means the branch predictor sees one operation rather than two, and the branch resolution port handles the fused operation atomically. The BTB indexes by the address of the first instruction in the pair (the CMP or TEST), which can affect BTB hit rates if code is rearranged.

Macro-fusion is one of several techniques — alongside micro-fusion, the DSB, and the LSD — that allow the frontend to deliver micro-ops at rates exceeding what the raw decode width would suggest. Together, they form a layered optimization that can sustain throughput close to the theoretical rename/allocate width.

Micro-op Cache (DSB)

The Decoded Stream Buffer (DSB), commonly called the micro-op cache or uop cache, is a critical frontend structure that caches decoded micro-ops. By serving micro-ops directly from the DSB, the processor bypasses the entire decode pipeline — predecode, instruction length determination, and the complex MITE decoders — eliminating one of the most power-hungry and latency-sensitive stages in the frontend.

Why the DSB Exists

As discussed in Instruction Decode, the x86 decode pipeline is complex and power-hungry. The variable-length instruction format requires multi-stage predecode logic, asymmetric decoders (simple vs. complex), and a microcode ROM for the most complex instructions. All of this consumes significant die area and energy.

The key insight behind the DSB is that most programs spend the majority of their execution time in hot loops and frequently called functions. If the processor can decode these instructions once and cache the resulting micro-ops, subsequent executions can skip decoding entirely. This is analogous to how the L1 instruction cache avoids re-fetching from main memory, but operating one level higher in the abstraction stack: caching decoded micro-ops rather than raw instruction bytes.

The DSB was introduced in Intel Sandy Bridge and has been present in every subsequent Intel core microarchitecture.

DSB Organization

On Skylake-class cores, the DSB has the following characteristics:

Parameter | Value
Total capacity | ~1,536 micro-ops
Associativity | 8-way set associative
Sets | 32
Entries per way | 6 micro-ops
Mapping | Indexed by instruction address (32-byte aligned regions)
Delivery bandwidth | Up to 6 micro-ops per cycle

The DSB is organized around 32-byte regions of instruction address space. Each 32-byte region maps to a set in the DSB, and each way in that set can hold up to 6 micro-ops. A given 32-byte region may occupy at most 3 ways, so a region can cache at most 3 x 6 = 18 micro-ops; code that decodes to more micro-ops per 32 bytes cannot be fully cached and is served from the legacy decode path instead.

The delivery bandwidth of 6 micro-ops per cycle (on Skylake; 8 on Golden Cove) exceeds the MITE decode path’s 4-5 micro-ops per cycle, making the DSB the preferred frontend path. When the DSB can supply all the micro-ops a code region needs, the MITE decoders are essentially powered down for that region, saving significant energy.
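As a sketch of the mapping, assuming simple modulo indexing over 32-byte regions (the real index function is undocumented; this illustrates the mapping, not the actual hardware hash):

```c
#include <assert.h>
#include <stdint.h>

#define DSB_REGION_BYTES 32u
#define DSB_SETS 32u

/* Map an instruction address to a DSB set: addresses in the same
 * 32-byte region share a set; consecutive regions hit consecutive
 * sets under this simple model. */
uint32_t dsb_set_index(uint64_t instr_addr) {
    uint64_t region = instr_addr / DSB_REGION_BYTES;
    return (uint32_t)(region % DSB_SETS);
}
```

One consequence of region-based indexing: a hot loop spread over many sparse 32-byte regions consumes more sets than the same micro-op count packed densely, which is one reason code layout affects DSB hit rates.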

DSB / Micro-op Cache Capacity Across Microarchitectures

The table below compares DSB (or equivalent micro-op/op-cache) sizes across vendors. AMD calls their equivalent structure the “Op Cache.”

Microarchitecture | Capacity | Organization | Delivery BW
Intel Skylake (2015) | 1,536 uops | 32 sets x 8 ways x 6 uops/way | 6 uops/cyc [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | 4,096 uops | 64 sets x 8 ways x 8 uops/way | 8 uops/cyc [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]
Intel Lion Cove (2024) | Not publicly documented | Not publicly documented | 8 uops/cyc [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 4,096 macro-ops (Op Cache) | 64 sets x 8 ways x 8 ops/way | 8 macro-ops/cyc [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 6,750 macro-ops (Op Cache) | Not publicly documented | 9 macro-ops/cyc [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | Not publicly documented | Not publicly documented | Not publicly documented [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]

Notable observations: AMD’s Op Cache has been larger than Intel’s DSB since Zen 2, and Zen 4 significantly expanded it to 6,750 entries. Intel closed the gap with Golden Cove’s expansion to 4,096 entries (up from Skylake’s 1,536) [Source: WikiChip, “Golden Cove”, 2021]. ARM and Apple cores do not use a micro-op cache since their fixed-width instructions make decode simple enough that caching decoded output is not cost-effective.

DSB Hit and Miss Behavior

On each cycle, the frontend attempts to serve micro-ops from the DSB. The lookup is based on the current fetch address (or predicted branch target address from the branch predictor):

DSB Hit: The micro-ops are delivered directly from the DSB to the IDQ (Instruction Decode Queue), bypassing the MITE path entirely. This is the fast path: lower latency, higher bandwidth, lower power.

DSB Miss: The frontend falls back to the MITE (legacy decode) path. The instruction bytes are fetched from the L1i, predecoded, and decoded by the MITE decoders. The resulting micro-ops are then written into the DSB for future use and simultaneously delivered to the IDQ.

The idq.dsb_uops performance counter tracks micro-ops delivered from the DSB, while idq.mite_uops tracks those from the MITE path. A high ratio of DSB-delivered micro-ops (>80%) is typical for well-optimized hot loops. A workload dominated by MITE delivery suggests either a large instruction footprint that exceeds DSB capacity or code patterns that defeat DSB caching.

Conditions That Defeat the DSB

Several scenarios cause code to miss in the DSB or prevent caching:

Capacity Pressure

With ~1536 micro-ops of capacity, the DSB can hold the working set of many tight loops but is easily overwhelmed by large function bodies, deeply inlined code, or programs with many distinct hot paths. A function body that decodes to more than ~1500 micro-ops will thrash the DSB.

32-Byte Boundary Alignment

Because the DSB indexes by 32-byte aligned regions, instructions that span a 32-byte boundary belong to two different DSB sets. If a single instruction straddles a boundary, it must be associated with one side or the other, and the micro-ops may not pack efficiently into the available ways.

This 32-byte alignment constraint is why the JCC erratum mitigation (see Macro-fusion) can impact DSB efficiency: inserting padding to prevent branches from spanning 32-byte boundaries may push other instructions across boundaries, changing the DSB mapping.

Way Limit (6 Micro-ops per Way)

Each DSB way can hold at most 6 micro-ops. If a 32-byte code region decodes to many micro-ops and they cannot be distributed across ways efficiently, some micro-ops may not fit. This causes the region to fall back to MITE decode. Dense code with many short instructions is DSB-friendly; code with many multi-uop instructions may exceed the way limit.

Microcoded Instructions

Instructions that are handled by the microcode sequencer (e.g., rep movsb, div, cpuid) cannot be cached in the DSB. When the processor encounters a microcoded instruction, the DSB line containing it is invalidated or marked as non-cacheable, and the frontend switches to MITE + MS-ROM delivery for that region.

Micro-op Cache (DSB)

The DSB caches decoded micro-ops, bypassing the decode stage on hot paths. Loops that fit get higher throughput.

[Interactive demo: micro-ops served from the DSB at up to 6 uops/cycle, with a live utilization readout.]

DSB-MITE Switching Penalty

Transitioning between DSB and MITE delivery paths incurs a penalty, typically a bubble of one or a few cycles in the frontend. If code alternates frequently between DSB-cached and non-cached regions (e.g., a hot loop that calls a cold function on every iteration), these switching penalties accumulate.

The idq.mite_all_uops and idq.dsb_cycles counters can help quantify switching overhead. The DSB2MITE_SWITCHES.PENALTY_CYCLES counter (available on some microarchitectures) directly measures this.

This switching penalty has practical implications for code layout. Link-time optimization (LTO) and profile-guided optimization (PGO) can arrange functions so that hot code is contiguous, minimizing DSB-MITE transitions. Separating hot and cold code paths (e.g., moving error handling out of the hot path) also helps keep the hot path in the DSB.

Interaction with the LSD

The Loop Stream Detector (LSD) works closely with the DSB. When the LSD detects a small loop, it locks the loop’s micro-ops in a buffer and replays them without even consulting the DSB. This saves DSB bandwidth and power. The LSD can be thought of as a level-0 cache sitting in front of the DSB, specialized for very small loops.

Practical Guidance

Keep hot loops small enough for the DSB. A loop body that decodes to fewer than ~32-48 micro-ops will comfortably fit in one or two DSB sets and enjoy full DSB delivery. Unrolling a loop beyond this point may push it out of the DSB, trading backend throughput for frontend starvation.

Minimize microcoded instructions in hot paths. Replace rep movsb with explicit mov loops (or memcpy, which the compiler will optimize) in performance-critical code. Replace div with multiplication by magic constants when the divisor is known at compile time.
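As a concrete illustration of the magic-constant technique, the sketch below divides an unsigned 32-bit value by the compile-time constant 9 with a multiply and shift. The function name div9_magic is illustrative; compilers apply this same rewrite automatically when you write n / 9.

```c
#include <stdint.h>

/* Division by the compile-time constant 9, rewritten as multiply-and-shift.
 * 0x38E38E39 == ceil(2^33 / 9). Since 9 * 0x38E38E39 == 2^33 + 1, the error
 * term is small enough that the result is exact for every uint32_t n.
 * No `div` instruction is emitted, so nothing microcoded enters the loop. */
static inline uint32_t div9_magic(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0x38E38E39ull) >> 33);
}
```

The same construction works for any fixed divisor; the compiler picks the constant and shift for you whenever the divisor is known at compile time, which is why hoisting a divisor into a constant is often enough.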

Profile DSB hit rates. Use perf stat -e idq.dsb_uops,idq.mite_uops to measure the fraction of micro-ops served from the DSB. If MITE delivery dominates for a hot workload, investigate instruction footprint reduction strategies: less aggressive inlining, smaller code alignment padding, or splitting hot and cold paths.

Consider code alignment. Function and loop alignment affects DSB set mapping. Compilers typically align to 16 bytes, but 32-byte alignment may improve DSB packing for critical loops. The trade-off is the NOP padding cost (consuming both fetch bandwidth and I-cache space).

The DSB is one of the most impactful frontend optimizations in modern Intel processors. For workloads with high code reuse (most compute-bound applications), it effectively reduces the frontend to a simple cache lookup, eliminating the decode bottleneck that has plagued x86 processors since the architecture’s inception.

Loop Stream Detector (LSD)

The Loop Stream Detector (LSD) is a frontend power and bandwidth optimization that detects small loops and replays their micro-ops from a dedicated buffer, bypassing both the MITE decode pipeline and the DSB (micro-op cache). When active, the LSD essentially turns the frontend into a simple circular buffer replay, eliminating fetch, predecode, decode, and DSB lookup power consumption for the loop’s duration.

How the LSD Works

The LSD monitors the micro-op stream entering the IDQ (Instruction Decode Queue). When it detects that the same sequence of micro-ops is being delivered repeatedly — that is, the program is executing a tight loop — it locks the loop body into a buffer and begins replaying directly from that buffer.

The detection mechanism works as follows:

  1. The LSD observes micro-ops flowing into the IDQ, whether from the DSB or the MITE path.
  2. When a backward branch (a branch whose target precedes its own address) is encountered and predicted taken, the LSD begins recording the micro-op sequence.
  3. On the second iteration, the LSD compares the incoming micro-ops against the recorded sequence. If they match exactly, the LSD locks the loop and takes over delivery.
  4. Subsequent iterations are served entirely from the LSD buffer. The fetch unit, predecode logic, MITE decoders, and DSB are all effectively idle for this code path.
  5. When the backward branch is finally predicted not-taken (loop exit), the LSD releases control and the frontend resumes normal operation.

LSD Capacity Constraints

The LSD buffer has a limited capacity, and loops exceeding this capacity cannot be replayed. The specific limits vary by microarchitecture:

Microarchitecture          LSD Capacity (micro-ops)
Sandy Bridge / Ivy Bridge  28
Haswell / Broadwell        56
Skylake                    64 (disabled via microcode update, see below)
Ice Lake / Tiger Lake      64
Golden Cove (2021)         Not publicly documented (LSD may be integrated into DSB path)
Lion Cove (2024)           Not publicly documented

Intel LSD disable history: The LSD on Skylake (client) and Kaby Lake was disabled in early 2018 via Intel microcode update (revision 0xC6 and later) to mitigate a variant of the Spectre vulnerability (branch prediction side-channel). The LSD was also disabled on Skylake-X server parts. Subsequent microarchitectures (Ice Lake and later) re-enabled the LSD with hardware-level mitigations [Source: Intel, “Microcode Revision Guidance”, 2018; Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023].

AMD: AMD Zen cores do not implement a traditional LSD in the same way as Intel. The Zen Op Cache (see DSB) serves a similar role for hot loops, and AMD’s loop detection is integrated into the Op Cache delivery path rather than being a separate named structure [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022].

ARM and Apple: ARM Cortex cores and Apple’s M-series cores handle small loops efficiently within their decode pipeline and branch predictor loop buffers, though the exact implementation varies. Apple’s Firestorm core has been observed to have loop buffer behavior consistent with an LSD-like structure [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021].

The Intel capacities in the table above refer to fused-domain micro-ops (see Micro-fusion). A loop body of exactly 64 fused-domain micro-ops on Skylake will just barely fit; at 65 micro-ops the LSD cannot engage.

The practical consequence: keep innermost loops tight. A loop that the compiler unrolls 8x might decode to 80 micro-ops and miss the LSD, whereas the 4x-unrolled version at 40 micro-ops fits comfortably. The optimal unroll factor for a given loop depends on this threshold as well as the DSB capacity and backend throughput — it is a multi-dimensional trade-off.
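To make the trade-off concrete, here is a sketch in C: a plain reduction loop next to a 4x-unrolled version. The per-iteration micro-op counts in the comments are rough estimates for a typical x86-64 compilation, not measurements.

```c
#include <stddef.h>

/* Plain loop: roughly 4-5 fused-domain micro-ops per iteration
 * (load+add, index update, compare+branch). */
long sum_plain(const int *a, size_t n) {
    long t = 0;
    for (size_t i = 0; i < n; i++) t += a[i];
    return t;
}

/* 4x-unrolled: four load+adds share one set of loop overhead, and the
 * body still decodes to well under the ~56-64 micro-op LSD budget.
 * An 8x or 16x unroll of a fatter loop body might not. */
long sum_unroll4(const int *a, size_t n) {
    long t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += a[i];     t1 += a[i + 1];
        t2 += a[i + 2]; t3 += a[i + 3];
    }
    long t = t0 + t1 + t2 + t3;
    for (; i < n; i++) t += a[i];   /* remainder iterations */
    return t;
}
```

The four separate accumulators also break the loop-carried dependency chain, so the unroll helps the backend as long as the body still fits the frontend buffers.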

Disqualifying Conditions

Even if a loop is small enough to fit in the LSD buffer, several conditions prevent the LSD from engaging:

Microcoded Instructions

If the loop body contains any instruction that requires the microcode sequencer (e.g., div, rep movsb, cpuid), the LSD cannot replay the loop. The MS-ROM delivery path is incompatible with LSD buffer replay. This is a strong incentive to avoid microcoded instructions in tight inner loops.

Mismatched Branch Predictions

The LSD locks a single taken-path through the loop body. If the loop contains inner conditional branches whose prediction changes from iteration to iteration, the micro-op sequence changes and the LSD cannot replay a fixed buffer. Loops with data-dependent inner branches typically defeat the LSD.

For example, this loop has a stable micro-op sequence (no inner branches):

.loop:
    mov     (%rdi), %eax
    add     %eax, %edx
    add     $4, %rdi
    dec     %ecx
    jnz     .loop

But this loop has a data-dependent inner branch that may change the micro-op trace:

.loop:
    mov     (%rdi), %eax
    test    %eax, %eax
    jz      .skip            ; varies per iteration
    add     %eax, %edx
.skip:
    add     $4, %rdi
    dec     %ecx
    jnz     .loop

If the inner jz .skip changes outcome between iterations, the LSD cannot establish a stable trace and will not engage. However, if the branch outcome is perfectly predictable (e.g., always taken or always not-taken), the LSD can lock onto the specific path and replay it.
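One way to restore a stable trace, sketched below in C under the assumption that the branch only guards the add: replace the data-dependent branch with branch-free arithmetic, so every iteration decodes to the same micro-op sequence. The function names are illustrative.

```c
/* Branchy: mirrors the jz/.skip pattern above. The inner branch outcome
 * depends on the data, so the micro-op trace can change per iteration. */
long sum_positives_branchy(const int *a, int n) {
    long t = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > 0) t += a[i];    /* outcome varies with the data */
    }
    return t;
}

/* Branchless: a mask selects a[i] or 0 without a conditional branch,
 * so the loop body is the same fixed micro-op sequence every iteration. */
long sum_positives_branchless(const int *a, int n) {
    long t = 0;
    for (int i = 0; i < n; i++) {
        t += a[i] & -(int)(a[i] > 0);   /* mask is all-ones or zero */
    }
    return t;
}
```

Compilers often make this transformation themselves (emitting a cmov or masked add), but writing it explicitly guarantees the branch is gone.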

Nested Loops

The LSD tracks a single loop level. If the loop body contains a nested inner loop, the LSD may lock onto the inner loop rather than the outer loop, or it may fail to detect a stable pattern if control flow alternates between loop levels. The innermost loop without further nesting is the most LSD-friendly structure.

Self-Modifying Code and Cross-Modifying Code

Any modification to the instruction stream in the loop body (detected via the machine clear mechanism) invalidates the LSD buffer and forces a re-detection.

Loop Stream Detector (LSD)

Small loops can be detected and replayed from a buffer, avoiding even the DSB. Adjust loop size and conditions to see when LSD engages.

[Interactive demo: shows which delivery path is active: the LSD (loop detector), the DSB (micro-op cache), or MITE (legacy decode).]

The LSD Disable on Skylake

It is worth noting a significant historical event: Intel disabled the LSD on Skylake and Kaby Lake processors via a microcode update in 2018 to mitigate a variant of the Spectre vulnerability. The LSD could speculatively replay micro-ops from a poisoned buffer, creating a side-channel attack vector. On these processors, the LSD is non-functional regardless of loop characteristics.

The LSD was re-enabled (with mitigations) in subsequent microarchitectures, but this episode illustrates how security considerations can override performance optimizations. If you are profiling on Skylake/Kaby Lake systems, expect to see zero lsd.uops events and higher DSB/MITE delivery counts for small loops.

You can check whether the LSD is active on your system by looking at the lsd.uops performance counter:

perf stat -e lsd.uops,idq.dsb_uops,idq.mite_uops -- ./benchmark

If lsd.uops is zero for a workload with tight loops, the LSD is likely disabled.

Performance Impact

When the LSD is active, it provides several benefits:

Power savings. The fetch unit, predecode logic, MITE decoders, and DSB are not accessed during LSD replay. For mobile and server workloads where power efficiency matters, this is significant.

Consistent delivery bandwidth. The LSD delivers micro-ops at a fixed rate (up to the rename/allocate width) without the cycle-to-cycle variability of MITE decode or DSB delivery. There are no decode bubbles, no DSB-MITE switching penalties, and no fetch alignment effects.

Reduced branch predictor pressure. The backward branch at the end of the loop is handled by the LSD rather than consuming branch predictor bandwidth. This frees predictor resources for other branches in the program.

The magnitude of the performance benefit depends on the loop characteristics. For a very tight loop (3-4 micro-ops) that fits in the LSD, the benefit over DSB delivery may be modest — perhaps a few percent from eliminating occasional DSB delivery bubbles. For a loop near the DSB capacity limit that would otherwise thrash between DSB and MITE delivery, the LSD can provide a more substantial improvement.

Practical Recommendations

Measure, do not assume. Use performance counters to verify whether your hot loops are being served by the LSD (lsd.uops), DSB (idq.dsb_uops), or MITE (idq.mite_uops). The answer may surprise you, especially on processors with the LSD disabled.

Minimize inner loop micro-op count. Aim for fewer than 56-64 fused-domain micro-ops in your innermost loops. Divide a retired micro-op count (e.g., perf stat -e uops_retired.retire_slots on Skylake-class cores) by a known iteration count to calculate the per-iteration micro-op cost.

Avoid microcoded instructions in tight loops. As noted above, any microcoded instruction disqualifies the loop from LSD replay. Replace div with multiplication by reciprocal, avoid rep-prefixed string operations, and use explicit instruction sequences instead of legacy complex instructions.
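The reciprocal idea can be sketched for floating-point data as well (function names illustrative): hoist one divide out of the loop and multiply inside it. The two versions may differ in the last ulp, which is why compilers only do this on their own under flags like -ffast-math.

```c
/* A long-latency divide per element stays inside the hot loop. */
void scale_by_div(double *x, int n, double d) {
    for (int i = 0; i < n; i++) x[i] /= d;
}

/* Hoisted: one divide up front, then one multiply per element.
 * Not bit-identical to the division in general (rounding differs). */
void scale_by_recip(double *x, int n, double d) {
    double inv = 1.0 / d;               /* loop-invariant reciprocal */
    for (int i = 0; i < n; i++) x[i] *= inv;
}
```

When the divisor is a power of two, the reciprocal is exact and the two versions agree bit-for-bit.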

Consider unroll factor trade-offs. More unrolling improves backend efficiency (fewer loop overhead micro-ops per useful work) but may push the loop body beyond the LSD or DSB capacity. Profile different unroll factors to find the sweet spot.

The LSD represents the innermost layer of the frontend’s three-tier delivery hierarchy: LSD for tiny loops, DSB for hot code, and MITE as the fallback decoder. Understanding which tier serves your hot code is a prerequisite for effective frontend optimization.

Allocation & Register Renaming

The x86-64 ISA exposes only 16 general-purpose architectural registers (RAX through R15). When multiple instructions in flight target the same register, the processor faces a fundamental conflict: how do you execute instructions out of order if they keep writing to the same small set of names? Register renaming is the mechanism that resolves this conflict, and it is one of the single most important enablers of instruction-level parallelism (ILP) in modern out-of-order cores.

The Problem: False Dependencies

Consider this short assembly sequence (written in an illustrative three-operand form for clarity; real x86-64 instructions are two-operand):

ADD  RAX, RBX, RCX    ; (1) RAX = RBX + RCX
MUL  RBX, RAX, RDX    ; (2) RBX = RAX * RDX   — RAW on RAX from (1)
SUB  RCX, RAX, RBX    ; (3) RCX = RAX - RBX   — RAW on RBX from (2)
MOV  RAX, RDI          ; (4) RAX = RDI          — WAW on RAX with (1)
ADD  RDX, RAX, RCX    ; (5) RDX = RAX + RCX   — RAW on RAX from (4)

There are three classes of data dependencies at play here:

  • RAW (Read-After-Write) — true dependencies. Instruction (2) genuinely needs the result of (1) because it reads RAX which (1) produces. These dependencies are fundamental and cannot be removed; the consumer must wait for the producer.
  • WAR (Write-After-Read) — anti-dependencies. If instruction (4) writes RAX before (2) has read the old RAX, the old value is lost. This is not a real data flow requirement; it only exists because the two instructions happen to reuse the same register name.
  • WAW (Write-After-Write) — output dependencies. Instructions (1) and (4) both write RAX. If (4) completes before (1), the final value of RAX would be wrong. Again, this is an artifact of naming, not data flow.

WAR and WAW are called false dependencies (or name dependencies). They serialize execution unnecessarily. In the example above, instruction (4) could theoretically execute in parallel with (1)–(3), but the WAW on RAX prevents that — unless the hardware eliminates the name collision.

The Solution: Physical Register File and the RAT

Modern out-of-order processors maintain a physical register file (PRF) that is much larger than the architectural register set. Intel’s Golden Cove core, for example, has approximately 280 integer physical registers backing just 16 architectural names. AMD’s Zen 4 provides a similar ratio.

Physical Register File Sizes Across Microarchitectures

The table below compares the physical register file sizes for integer and FP/SIMD registers. These sizes determine the maximum number of in-flight instructions that can write registers simultaneously.

Microarchitecture          Integer Physical Regs    FP/SIMD Physical Regs
Intel Skylake (2015)       180                      168    [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021)   280                      332    [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]
Intel Lion Cove (2024)     Not publicly documented  Not publicly documented
AMD Zen 3 (2020)           192                      160    [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
AMD Zen 4 (2022)           224                      192    [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024)           Not publicly documented  Not publicly documented
ARM Cortex-X4 (2023)       Not publicly documented  Not publicly documented
Apple M1 Firestorm (2020)  ~380 (estimated)         ~434 (estimated)    [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

Apple’s Firestorm core has remarkably large register files, consistent with its oversized ROB (~630 entries). A large register file is required to support a large ROB; otherwise, register exhaustion would stall allocation before the ROB fills. Intel’s jump from 180 to 280 integer registers between Skylake and Golden Cove reflects the corresponding ROB expansion from 224 to 512 entries [Source: WikiChip, “Golden Cove”, 2021].

At the heart of renaming sits the Register Alias Table (RAT), sometimes called the rename map. The RAT is a lookup table indexed by architectural register name. Each entry holds the physical register number that currently represents that architectural register.

When a new instruction enters the rename stage, the allocator performs the following steps:

  1. Read source mappings. For each source operand, look up the architectural register in the RAT to find the current physical register. This is the register the instruction will actually read from.
  2. Allocate a new physical register. Grab a free physical register from the free list (a pool of available physical registers).
  3. Update the RAT. Write the newly allocated physical register number into the RAT entry for the instruction’s destination architectural register. Subsequent instructions that read this architectural register will now be directed to the new physical register.

After renaming, the example above effectively becomes:

ADD  P32, P1, P2      ; (1) RAX→P32 = RBX(P1) + RCX(P2)
MUL  P33, P32, P3     ; (2) RBX→P33 = P32 * RDX(P3)     — RAW on P32
SUB  P34, P32, P33    ; (3) RCX→P34 = P32 - P33          — RAW on P33
MOV  P35, P6          ; (4) RAX→P35 = RDI(P6)            — no conflict!
ADD  P36, P35, P34    ; (5) RDX→P36 = P35 + P34          — RAW on P35

Notice that instructions (1) and (4) no longer collide on RAX — they write to distinct physical registers P32 and P35. The WAW hazard has been completely eliminated. Similarly, any WAR hazards vanish because old values persist in their original physical registers until those registers are freed at retirement.
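The three rename steps can be sketched as a toy model in C. Everything here is illustrative: the register numbering is chosen to match the example above (RBX as P1, RCX as P2, RDX as P3, RDI as P6), not real x86 encodings, and a bump allocator stands in for the free list. Real hardware does all of this in parallel across the full rename width.

```c
/* Toy rename stage: a RAT plus a trivial free-list stand-in. */
enum { RAX = 0, RBX = 1, RCX = 2, RDX = 3,
       RSP = 4, RBP = 5, RDI = 6, RSI = 7, NUM_ARCH = 16 };

static int rat[NUM_ARCH];   /* architectural name -> current physical reg */
static int next_free;       /* bump allocator standing in for the free list */

static void rename_init(void) {
    for (int r = 0; r < NUM_ARCH; r++) rat[r] = r;  /* RAX->P0, RBX->P1, ... */
    next_free = 32;                                 /* fresh regs from P32 on */
}

/* Rename "dst = op(src1, src2)"; returns the physical destination. */
static int rename_one(int dst, int src1, int src2) {
    int p1 = rat[src1];      /* 1. read current source mappings */
    int p2 = rat[src2];
    int pd = next_free++;    /* 2. pop a free physical register */
    rat[dst] = pd;           /* 3. redirect future readers of dst */
    (void)p1; (void)p2;      /* sources would be recorded in the uop */
    return pd;
}
```

Feeding the five-instruction sequence through this model reproduces the mapping shown above: the two writes to RAX land in distinct physical registers (P32 and P35), so the WAW hazard disappears.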

Register Renaming (RAT)

Step through instructions to see how a Register Alias Table eliminates false dependencies (WAR/WAW) while preserving true dependencies (RAW).

[Interactive demo: step through the five-instruction sequence above and watch the RAT map architectural to physical registers; RAW dependencies persist while WAR/WAW are eliminated.]

How Renaming Eliminates False Dependencies

The key insight is that renaming converts the dependency graph from one constrained by register names to one constrained only by actual data flow. After renaming:

  • True (RAW) dependencies remain — as they must. If instruction B needs the result of instruction A, B still waits for A to produce its value, regardless of renaming.
  • WAR dependencies disappear. The old value lives in its physical register, undisturbed by new writes to the same architectural name.
  • WAW dependencies disappear. Each write targets a unique physical register, so write ordering no longer matters for correctness.

This transformation exposes the maximum amount of parallelism inherent in the instruction stream. The out-of-order scheduler (discussed in the Reservation Stations / Scheduler section) can now look at the renamed dependency graph and issue instructions purely based on operand readiness, without artificial serialization.

Allocation Width and the Free List

Renaming must occur at the full decode width of the machine. On a 6-wide core, up to six instructions must be renamed per cycle — meaning six RAT reads for sources, six free-list pops for destinations, and six RAT writes for new mappings, all within a single clock cycle. This makes the RAT one of the most heavily ported structures in the entire CPU, and its power consumption is non-trivial.

The free list is a FIFO or bit-vector that tracks which physical registers are currently unused. When an instruction is renamed, a physical register is removed from the free list. When an instruction retires and the previous mapping for the same architectural register is no longer needed, that old physical register is returned to the free list. If the free list is empty, the allocator must stall — this is functionally equivalent to the Reorder Buffer being full, as discussed in the next section.

Checkpoint and Recovery

When a branch misprediction or exception occurs, the RAT must be restored to a consistent state. Modern processors handle this in one of two ways:

  • Checkpoint-based recovery. A snapshot of the RAT is saved at each branch. On misprediction, the saved snapshot is restored instantly (within 1–2 cycles). This is fast but storage-intensive.
  • Walk-based recovery. The processor walks the ROB backwards, undoing rename mappings. This is cheaper in area but takes several cycles — proportional to the number of instructions that must be squashed.

Intel cores have historically used a combination of checkpoints for branches and walk-based recovery for exceptions. AMD Zen cores use a checkpoint approach. The speed of recovery directly impacts the effective cost of branch mispredictions, which ties back to the branch prediction discussion in Part 2.

Why This Matters for Performance

Register renaming is what makes out-of-order execution practical. Without it, a 6-wide superscalar core would be bottlenecked by the 16 architectural register names, unable to exploit most available parallelism. The size of the physical register file sets an upper bound on the number of in-flight instructions, making it a critical microarchitectural resource alongside the Reorder Buffer. When performance counters show stalls due to “register exhaustion,” it is the physical register file and free list that are the limiting factor.

The Reorder Buffer

Out-of-order execution lets a processor execute instructions in whatever order their data dependencies allow. But the programmer — and the ISA specification — expect instructions to appear to complete in program order. When an exception fires on instruction #47, the architectural state must reflect exactly the first 47 instructions having completed and nothing more. The Reorder Buffer (ROB) is the structure that bridges this gap between out-of-order execution and in-order architectural commitment.

The Role of the ROB

Every instruction that enters the backend is assigned an entry in the ROB, strictly in program order. The ROB is typically implemented as a circular buffer with a head pointer (the oldest instruction) and a tail pointer (the newest instruction). Instructions are allocated at the tail and retire from the head.

An ROB entry tracks several pieces of information:

  • Instruction identifier and its decoded form (or a pointer to it).
  • Completion status — whether the instruction has finished execution.
  • Destination register mapping — the physical register this instruction writes, plus the old physical register it replaced (needed for freeing registers at retirement and for recovery on misprediction).
  • Exception flags — whether the instruction faulted during execution.
  • Store buffer association — for store instructions, a pointer to the corresponding store buffer entry that holds the data to be written to memory.

The critical invariant is: instructions retire from the ROB head in strict program order, only when they have completed execution and are free of exceptions. This is what makes exceptions “precise” — a hallmark of modern architectures.

ROB Capacity: The ILP Window

The ROB size determines the processor’s instruction window — the maximum number of instructions that can be in flight simultaneously. This is one of the most important microarchitectural parameters because it directly governs how much instruction-level parallelism the core can extract.

Here are representative ROB sizes across recent microarchitectures:

Microarchitecture          ROB Entries
Intel Skylake (2015)       224    [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021)   512    [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024)     576    [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020)           256    [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022)           320    [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024)           448    [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023)       ~320 (estimated)    [Source: Chips and Cheese, “Cortex-X4 Microarchitecture”, 2023]
Apple M1 Firestorm (2020)  ~630    [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]
Apple M4 (2024)            Not publicly documented

Apple’s Firestorm core stands out with an enormous instruction window, reflecting a design philosophy that aggressively trades area and power for single-threaded ILP extraction. Intel has been rapidly catching up: Golden Cove more than doubled Skylake’s ROB from 224 to 512, and Lion Cove pushed further to 576 entries. AMD’s progression has been more measured but consistent: 256 (Zen 3) to 320 (Zen 4) to 448 (Zen 5) [Source: WikiChip, “Zen 5”, 2024].

Why does ROB size matter so much? Consider a cache-miss load that takes 200 cycles to resolve from main memory. During those 200 cycles, the processor can continue fetching, decoding, and executing independent instructions — but only if the ROB has enough entries to hold all of them. A 224-entry ROB (Skylake) fills up quickly under a long-latency stall; a 512-entry ROB (Golden Cove) can sustain useful work for much longer. When the ROB is full, the frontend stalls — no new instructions can be allocated, and the processor’s throughput drops toward zero until the head instruction completes.
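A back-of-envelope calculation makes the arithmetic explicit (a sketch; real cores also stall on physical registers, load buffers, and other resources before the ROB itself fills):

```c
/* To keep allocating for the full duration of a cache miss, the ROB must
 * absorb everything the frontend delivers while the load is pending. */
static int rob_entries_to_cover(int alloc_width_uops, int miss_cycles) {
    return alloc_width_uops * miss_cycles;
}
```

At a sustained 4 uops/cycle and a 200-cycle miss, that is 800 entries: even Golden Cove's 512-entry ROB fills well before the miss resolves, and a 224-entry Skylake ROB fills in under 60 cycles.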

Reorder Buffer (ROB)

Instructions enter in order, execute out of order (varying latencies), and retire strictly in order from the head. Increase the LOAD latency to see the ROB fill up.

[Interactive demo: a 16-entry ROB; instructions dispatch in order, complete out of order, and retire strictly from the head, with a live utilization readout.]

In-Order Retirement

The demo above illustrates the core tension. Instructions enter the ROB in order, execute out of order (notice different latencies), but retire strictly from the head. When a long-latency instruction sits at the ROB head:

  1. Instructions behind it continue executing and their entries are marked “completed.”
  2. No completed instruction can retire until the head instruction finishes.
  3. As more instructions pile up, the ROB fills toward capacity.
  4. Once the ROB is full, allocation stalls and the frontend stops.

This is the classic ROB head-of-line blocking problem. It is why long-latency operations — especially cache-miss loads — are so devastating to performance. The instruction itself only occupies one execution unit, but it blocks the retirement of hundreds of other instructions that have already finished.

Retirement Width

Retirement does not happen one instruction at a time. Modern cores retire multiple instructions per cycle — typically matching or exceeding the rename/allocation width. Intel Golden Cove retires up to 8 micro-ops per cycle. AMD Zen 4 retires up to 8. This retirement width sets the theoretical maximum sustained throughput of the core: even if execution can burst to high IPC, sustained throughput is bounded by how fast instructions can leave the ROB. As discussed further in the Retirement (Commit) section, retirement is the point where architectural state becomes committed and resources are freed.

What Happens at Retirement

When the head entry is both completed and exception-free, the core performs the following at retirement:

  1. Architectural register file update. The physical register written by this instruction is now the architecturally committed value. The previous physical register for the same architectural name (saved during renaming, as described in the Allocation & Register Renaming section) can be returned to the free list.
  2. Store commit. If the retiring instruction is a store, the data in the store buffer is marked as eligible to be written out to the cache hierarchy. Stores only become visible to other cores at this point, ensuring memory ordering guarantees.
  3. Branch resolution finalization. If a branch was predicted correctly, its prediction resources can be freed. If this is where a misprediction is detected (in some misprediction detection schemes), recovery begins.
  4. Resource deallocation. The ROB entry itself is freed, and the head pointer advances.

Precise Exceptions and Speculation

The ROB is what makes speculative execution safe. The processor can execute past branches that have not yet been resolved, past loads that might fault, and past instructions that might raise floating-point exceptions. None of these speculative results become architecturally visible until the instruction reaches the ROB head and retires. If an exception occurs:

  • The faulting instruction is marked in the ROB.
  • When it reaches the head, the exception is raised.
  • All instructions after the faulting instruction are flushed (their ROB entries are discarded, their physical registers are freed, and their effects are rolled back).
  • The RAT is restored to the state it held just before the faulting instruction (via checkpoints or ROB walk, as discussed in the Register Renaming section).

This guarantees that the exception handler sees a clean architectural state: exactly the instructions before the fault have been committed, and nothing after.

ROB as ILP Bottleneck

When analyzing performance bottlenecks, the ROB is a frequent culprit. Key indicators include:

  • Backend-bound stalls due to ROB full. Performance monitoring counters on Intel (e.g., RESOURCE_STALLS.ROB) can directly report cycles where allocation stalled because the ROB was at capacity.
  • High IPC code with bursts of long-latency operations. If the average IPC is high but periodically collapses, ROB saturation during cache misses is a likely cause.
  • Loop-carried dependencies spanning many iterations. If a loop body is large and each iteration depends on the previous, the ROB must hold many iterations simultaneously.

Increasing ROB size has diminishing returns — it consumes significant die area and power, and the wakeup and tag-matching logic scales poorly. This is one reason why CPU designers complement large ROBs with better prefetching and cache hierarchies rather than making the ROB arbitrarily large.

Reservation Stations / Scheduler

Once instructions have been renamed and allocated entries in the Reorder Buffer, they move into the scheduler — also known as the reservation stations in Intel terminology, or the issue queue in academic literature. This is the structure where instructions wait until all their operands are ready, at which point they are selected for execution. The scheduler is the beating heart of dataflow execution: it is where out-of-order behavior actually happens.

The Core Idea: Dataflow Execution

In a classic in-order pipeline, instructions issue in program order, and if an instruction’s operand is not ready, the entire pipeline stalls behind it. The scheduler inverts this model. Each instruction sits in a scheduler entry with tags identifying its source operands. When a producing instruction finishes and broadcasts its result tag, all waiting instructions compare their operand tags against the broadcast. If there is a match, the operand is marked ready. When all operands for an instruction are ready, that instruction becomes eligible for issue.

This is the hardware realization of dataflow execution: instructions fire as soon as their inputs are available, regardless of program order. The constraint is not “is it my turn?” but “do I have what I need?”

Wakeup and Select

The scheduler performs two critical operations every cycle:

Wakeup

When an execution unit produces a result, it broadcasts a completion tag — the physical register number of the result. Every scheduler entry compares this tag against its source operand tags. This comparison happens in parallel across all entries. If a match is found, the corresponding operand bit is flipped to “ready.”

On a machine with 100 scheduler entries and 6 execution ports, up to 6 tags may be broadcast simultaneously, and each of the 100 entries must compare against all 6 tags — that is 600 comparisons per cycle, implemented as content-addressable memory (CAM) logic. This is expensive in power and area, and it is one reason why scheduler size is limited.

Select

Among all instructions whose operands are now fully ready, the select logic chooses which ones to actually send to execution units. The select logic must respect port constraints (only certain operations can execute on certain ports, as discussed in the Execution Ports & Execution Units section) and typically uses an oldest-first priority policy. Issuing the oldest ready instruction first tends to drain the ROB head most quickly, reducing the chance of ROB-full stalls.

The wakeup and select logic must complete within a single clock cycle for the pipeline to remain efficient. This tight timing constraint is a major design challenge — it limits both the number of scheduler entries and the number of ports.

Speculative Wakeup

To keep the pipeline running at full speed, many designs use speculative wakeup: an instruction is woken up based on the expected completion time of its producer, before the result has actually been computed. For example, if a producing ALU operation has a 1-cycle latency, the scheduler can wake up dependent instructions in the same cycle that the producer is issued — the result will be available on the bypass network just in time.

This works well for instructions with fixed, predictable latencies (ALU ops, most FP ops). It fails for variable-latency operations like cache loads. If a load is assumed to hit the L1 cache (4–5 cycle latency) but actually misses, the speculatively woken instructions arrive at the execution unit and find no valid data on the bypass. The scheduler must then replay these instructions — squash their results and re-insert them for later issue. Replay penalties are a measurable source of wasted work on Intel cores.

Unified vs. Distributed Schedulers

There are two broad approaches to scheduler organization:

Unified Scheduler

A single large scheduler holds all instructions regardless of type. Any ready instruction is selected and routed to the appropriate execution port.

  • Advantage: Maximum flexibility. A burst of ALU instructions can use all scheduler capacity; a burst of FP instructions can do the same.
  • Disadvantage: The CAM logic must be large (all entries, all comparators), consuming significant power. Wiring delays increase with size.
  • Example: Intel cores from Skylake onward use a unified scheduler, with 97 entries in Skylake (Golden Cove expanded this further). AMD's Zen architectures, by contrast, use distributed schedulers, as detailed below.

Distributed Scheduler

Multiple smaller schedulers, each dedicated to specific instruction types (integer, FP/vector, load, store). Instructions are dispatched to the appropriate scheduler at allocation time.

  • Advantage: Smaller, faster CAMs. Lower power. Each scheduler can be tuned to its instruction type.
  • Disadvantage: Fragmentation. If one scheduler fills up while others are empty, instructions stall even though overall capacity exists.
  • Example: Some ARM big-core designs and older Intel Atom cores used distributed schedulers.

In practice, most high-performance designs have converged toward a mostly unified approach, sometimes with a separate scheduler for loads and stores due to their unique interaction with the memory subsystem.

Scheduler Depth and Performance

The number of scheduler entries determines how far ahead the processor can look for independent work. Consider this code:

; Long dependency chain
IMUL  RAX, RBX        ; (1) 3 cycles
IMUL  RAX, RCX        ; (2) depends on (1), 3 cycles
IMUL  RAX, RDX        ; (3) depends on (2), 3 cycles
; Independent work
ADD   R8, R9           ; (4) independent, 1 cycle
ADD   R10, R11         ; (5) independent, 1 cycle
ADD   R12, R13         ; (6) independent, 1 cycle

While instruction (2) waits for (1) and (3) waits for (2), instructions (4)–(6) are fully independent and can issue immediately. But the scheduler must hold all of these instructions simultaneously to exploit this parallelism. If the scheduler is too small, instructions (4)–(6) cannot be allocated into it while the IMUL chain occupies its entries, and the independent work sits idle.

Typical modern scheduler sizes:

Microarchitecture | Scheduler Type | Scheduler Entries
Intel Skylake (2015) | Unified | 97 entries [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | Unified | ~160 entries [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]
Intel Lion Cove (2024) | Unified | Not publicly documented
AMD Zen 3 (2020) | Distributed (INT: 4x24, FP: 2x32) | 96 INT + 64 FP = 160 total [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | Distributed (INT: 4x24, FP: 2x32) | 96 INT + 64 FP = 160 total [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | Distributed | Not publicly documented [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
Apple M1 Firestorm (2020) | Unified (estimated) | ~330 entries (estimated) [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

Intel and AMD take fundamentally different approaches here. Intel uses a unified scheduler where all instruction types (integer, FP, load, store) share a single pool of entries. AMD uses a distributed scheduler where integer ALU operations, floating-point operations, and address generation units each have their own smaller schedulers. AMD’s Zen 3/4 has 4 integer schedulers with 24 entries each (96 total integer) and 2 FP schedulers with 32 entries each (64 total FP), plus separate address generation queues [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023].

The distributed approach means AMD cores can suffer from scheduler fragmentation: if all integer schedulers are full but FP schedulers are empty, integer instructions stall even though total scheduler capacity exists. The unified approach avoids this at the cost of larger, more power-hungry CAM logic.

Again, Apple’s design stands out with an exceptionally large scheduling window, consistent with its philosophy of maximizing single-thread ILP.

The Scheduler in the Pipeline

The scheduler sits between the rename/allocate stage and the execution units:

  1. Allocate — Instructions are placed into both the ROB (for in-order tracking) and the scheduler (for out-of-order issue).
  2. Wait — The instruction monitors tag broadcasts, waiting for operands.
  3. Select — Once ready, the instruction is selected and sent to an execution port.
  4. Execute — The instruction runs on a functional unit. Its result tag is broadcast back to the scheduler for wakeup.
  5. Deallocate — After issue, the scheduler entry is freed (the instruction still lives in the ROB until retirement).

Note that the scheduler entry is freed at issue, not at retirement. This means scheduler capacity constrains the number of instructions actively waiting for operands, not the total number in flight — that is the ROB’s job. The interplay between scheduler size and ROB size is a key microarchitectural balancing act.

Execution Ports & Execution Units

When the scheduler selects a ready instruction for execution, it dispatches it through an execution port to a functional unit (also called an execution unit). The mapping between ports and functional units defines the core’s execution capabilities — what operations it can perform in parallel, what the throughput of each operation type is, and where contention bottlenecks arise.

Ports vs. Units

A common source of confusion is the distinction between ports and units. An execution port is a dispatch slot — a point of entry into the execution backend. Each port may have one or more functional units behind it. For example, a single port might host both a simple ALU (for ADD, AND, XOR) and a branch unit (for JMP, JCC). Both cannot execute simultaneously on the same port in the same cycle, but they share the port because their instruction mixes rarely conflict heavily.

In any given cycle, each port can accept at most one micro-op. The number of ports therefore determines the maximum dispatch width — how many micro-ops can begin execution per cycle.

Intel Port Layout: Skylake

Intel’s Skylake (6th gen, 2015) features 8 execution ports [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]:

Port | Functional Units
Port 0 | ALU, Integer Multiply, FP Multiply, FP Divide, Vector Logical, Branch
Port 1 | ALU, Integer Multiply, FP Add, Slow LEA, Vector Logical
Port 2 | Load (AGU + Load Data)
Port 3 | Load (AGU + Load Data)
Port 4 | Store Data
Port 5 | ALU, Vector Shuffle, Vector Logical
Port 6 | ALU, Branch, Integer Divide
Port 7 | Store Address (AGU, simple addressing only)

Key differences from Golden Cove: Skylake has only 4 ALU-capable ports (0, 1, 5, 6) versus Golden Cove's wider set, and only 1 store data port versus Golden Cove's 2. Branch execution is split across Ports 0 and 6 [Source: Intel, Optimization Reference Manual, 2023].

Intel Port Layout: Golden Cove

Intel’s Golden Cove microarchitecture (Alder Lake P-cores, 12th/13th gen) features a wide execution engine:

Port | Functional Units
Port 0 | ALU, Integer Multiply, FP/Vec Multiply, Branch
Port 1 | ALU, Integer Multiply, FP/Vec Add, Slow LEA
Port 2 | Load (AGU)
Port 3 | Load (AGU)
Port 4 | Store Data
Port 5 | ALU, Vector Shuffle, FP/Vec Multiply
Port 6 | ALU, Branch, Integer Divide
Port 7 | Store Address (AGU)
Port 8 | Store Address (AGU)
Port 9 | Store Data

With up to 6 ports capable of ALU work and 2 load ports, Golden Cove can sustain very high throughput on integer-heavy code. The 10-port (effective) design represents one of the widest x86 backends ever built [Source: Intel, Optimization Reference Manual, 2023].

AMD Port Layout: Zen 3

AMD’s Zen 3 (Ryzen 5000, 2020) uses a partitioned integer/FP design with separate schedulers [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]:

Integer cluster:

Pipe | Functional Units
ALU 0 | Integer ALU, Branch
ALU 1 | Integer ALU, Multiply
ALU 2 | Integer ALU, Divide
ALU 3 | Integer ALU
AGU 0 | Load/Store Address
AGU 1 | Load/Store Address
AGU 2 | Store Address

FP/Vector cluster:

Pipe | Functional Units
FP 0 | FADD, FP Multiply (FMUL/FMA)
FP 1 | FADD, FP Multiply (FMUL/FMA)
FP 2 | FP Store, Vector Shuffle
FP 3 | FP Store, Vector Misc

Zen 3 increased the number of FP multipliers from 1 (Zen 2) to 2, doubling FP throughput. The integer cluster provides 4 ALU pipes and 3 AGU pipes [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023].

AMD Port Layout: Zen 4

AMD’s Zen 4 (Ryzen 7000) takes a somewhat different approach:

Port | Functional Units
ALU 0 | Integer ALU, Branch
ALU 1 | Integer ALU, Multiply
ALU 2 | Integer ALU, Divide
ALU 3 | Integer ALU
AGU 0 | Load/Store Address
AGU 1 | Load/Store Address
AGU 2 | Load/Store Address
FP 0 | FP/Vec Add, FP Multiply
FP 1 | FP/Vec Add, FP Multiply
FP 2 | FP Store, FP Shuffle

Zen 4 uses a more explicitly partitioned model: integer and FP/vector execution are separate clusters with their own schedulers and register files. This reduces cross-domain latency but means that integer and FP resources cannot be shared.

Execution Ports

Watch micro-ops dispatch to 6 execution ports cycle by cycle. Each port handles specific operation types. Observe port utilization and IPC.


Throughput vs. Latency

Two performance characteristics define each functional unit:

  • Latency — How many cycles from when the instruction begins execution until its result is available on the bypass network for dependent instructions. For a simple ALU ADD, this is 1 cycle. For an integer multiply (IMUL), it is typically 3 cycles. For a floating-point divide, it can be 10–20+ cycles depending on precision.
  • Throughput — How often a new instruction can be issued to the unit, expressed as instructions per cycle or its reciprocal (cycles per instruction). A pipelined multiplier has a latency of 3 but a throughput of 1 per cycle — a new multiply can start every cycle even though each takes 3 cycles to complete. A divider, by contrast, is typically not pipelined: a new divide cannot start until the previous one finishes.

This distinction is critical for performance analysis. Consider:

; Case A: Independent multiplies
IMUL  RAX, RBX        ; 3-cycle latency, 1/cycle throughput
IMUL  RCX, RDX        ; independent — can issue next cycle
IMUL  RSI, RDI        ; independent — can issue cycle after

; Case B: Dependent multiplies
IMUL  RAX, RBX        ; 3-cycle latency
IMUL  RAX, RCX        ; depends on previous RAX — must wait 3 cycles
IMUL  RAX, RDX        ; depends on previous RAX — must wait 3 more cycles

Case A completes all three multiplies in 5 cycles (first result at cycle 3, last at cycle 5). Case B takes 9 cycles because each multiply depends on the previous result. The throughput is the bottleneck in Case A; the latency is the bottleneck in Case B.

Port Contention

Port contention occurs when multiple ready instructions need the same port in the same cycle. Only one can win; the others must wait an additional cycle. This is a structural hazard — a resource conflict rather than a data dependency.

Port contention is a real performance concern in tight loops. Consider a loop body with three ALU operations and a branch, all of which can only execute on Ports 0 and 6 (on a Skylake-era design). Even though the scheduler has plenty of entries and all operands are ready, only 2 of these 4 operations can execute per cycle, creating a throughput bottleneck of 2 cycles per iteration.

This is why optimizing compilers care about port pressure — they try to select instruction variants that spread work across ports. For example, choosing LEA instead of ADD for address computation can move work from a congested ALU port to a different port that handles LEA operations.

You can observe port pressure using hardware performance counters:

  • Intel: UOPS_DISPATCHED_PORT.PORT_* counters report how many micro-ops were dispatched to each port.
  • AMD: Similar counters exist under the FpuPipeAssignment event family.

Tools like llvm-mca (LLVM Machine Code Analyzer) and Intel’s IACA can statically analyze a loop body and predict which ports are bottlenecked.

Multi-Cycle and Non-Pipelined Units

Most functional units are pipelined: they can accept a new instruction every cycle, even though each instruction takes multiple cycles to complete. The multiply unit is a classic example — 3 cycles of latency, but fully pipelined at 1-per-cycle throughput.

Some units are partially pipelined or non-pipelined:

  • Integer division — Typically 20–90+ cycles depending on operand size, and the unit cannot accept a new divide until the current one completes (or is nearly complete).
  • FP division and square root — Similar to integer division, though modern designs have improved throughput with partial pipelining (e.g., one DIVSS every 3–5 cycles on recent Intel).
  • Certain vector operations — Some complex shuffle or permutation operations may have multi-cycle throughput.

Non-pipelined units are a bottleneck only when the same operation appears frequently. In most code, divisions are rare enough that the unit is not a concern. But in specific workloads (e.g., normalization-heavy graphics code), division throughput can become the critical path.

IPC Impact

The number and variety of execution ports directly determine the theoretical maximum IPC of the core. A core with 6 execution ports can issue at most 6 micro-ops per cycle. But actual IPC is almost always lower due to:

  1. Data dependencies — Instructions must wait for operands (latency-bound).
  2. Port contention — Too many operations of the same type compete for the same port.
  3. Frontend bottlenecks — The frontend cannot supply micro-ops fast enough (discussed in Part 2).
  4. Memory latency — Cache misses stall execution, eventually filling the ROB (discussed in the Reorder Buffer section).

Achieving even 50–60% of theoretical peak IPC on general-purpose code is considered excellent; sustained IPC close to the dispatch width of 6+ micro-ops per cycle is seen only in carefully tuned kernels with no memory stalls. Understanding port-level behavior is essential for closing the gap between theoretical and achieved throughput.

Retirement (Commit)

Retirement — also called commit — is the final stage of an instruction’s lifecycle in an out-of-order core. It is the point at which speculative, out-of-order results become architecturally committed: visible to the programmer, to exception handlers, and to other cores in the system. Everything before retirement is provisional and can be rolled back. Everything after retirement is permanent.

Despite being conceptually simple (“mark the instruction as done and free its resources”), retirement is a critical bottleneck and the ultimate arbiter of processor throughput.

In-Order Retirement from the ROB Head

As discussed in the Reorder Buffer section, the ROB is a circular buffer where instructions are allocated in program order. Retirement occurs strictly from the head of the ROB. The retirement logic examines the head entry each cycle and checks:

  1. Has the instruction completed execution? If not, retirement stalls — no instruction behind it can retire either.
  2. Did the instruction raise an exception? If so, the exception is taken, and all younger instructions are flushed.
  3. For stores: is the store buffer ready to commit? The store’s data and address must both be resolved.

If all checks pass, the instruction retires: its results become architecturally visible, and its resources are freed. The head pointer advances to the next entry, and the process repeats — potentially retiring multiple instructions in the same cycle.

Retirement Width

Modern cores retire multiple instructions per cycle to avoid the retirement stage becoming a throughput ceiling. Representative retirement widths:

Microarchitecture | Retirement Width (micro-ops/cycle)
Intel Skylake (2015) | 4 [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | 8 [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024) | 8 [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 8 macro-ops [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 8 macro-ops [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | 8 macro-ops [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023) | 8 [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | 8+ (estimated) [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

The significance of retirement width is often underappreciated. Even if the execution backend can sustain 6 or more micro-ops per cycle, a 4-wide retirement stage (as in Skylake) means that over time, no more than 4 micro-ops per cycle can leave the pipeline. This creates back-pressure: the ROB fills up, eventually stalling allocation. Intel’s jump from 4-wide to 8-wide retirement in Golden Cove was a major contributor to that generation’s IPC improvement.

Note that retirement width is a sustained throughput constraint, not a burst constraint. Execution can burst above the retirement rate temporarily (instructions pile up as “completed” in the ROB), but sustained IPC cannot exceed the retirement rate.

What Happens During Retirement

The retirement of a single instruction involves several coordinated actions:

1. Freeing Physical Registers

During register renaming (as described in the Allocation & Register Renaming section), each instruction that writes a register causes a new physical register to be allocated. The old physical register — the one that previously mapped to the same architectural register — is not immediately freed because in-flight instructions might still be reading it.

At retirement, the old physical register is guaranteed to be no longer needed by any active instruction (all older instructions have already retired, and younger instructions reference the new mapping). So the old physical register is returned to the free list, making it available for future renaming.

This is the mechanism that prevents the physical register file from running out: retirement recycles registers at the same rate they are consumed by allocation. If retirement stalls, registers are not freed, the free list empties, and allocation eventually stalls too — another form of ROB-full backpressure.

2. Committing Stores to the Memory Hierarchy

Store instructions occupy entries in the store buffer from the time they are dispatched. During execution, the store address and data are written into the store buffer, but the data is not written to the cache. The store buffer serves as a speculative holding area.

At retirement, the store is marked as committed (also called “senior”) in the store buffer. Only committed stores are eligible to be written out to the L1 data cache. This ensures that:

  • Speculative stores never pollute the cache. If a branch misprediction is detected, speculative (uncommitted) stores are simply discarded from the store buffer.
  • Memory ordering is preserved. Stores become visible to other cores in program order, which is essential for x86’s relatively strong Total Store Order (TSO) memory model.

The actual drain from the store buffer to the cache may happen some cycles after retirement — the store buffer decouples retirement from the cache write. But the commitment at retirement is the point of no return.

3. Updating Architectural State

Retirement is when the architectural register file (or retirement register file) is updated. While the speculative, renamed state lives in the physical register file and the RAT, the architectural state is the one that is visible during exceptions and interrupts. On some implementations, the architectural register file is implicit (the RAT checkpoint at the ROB head constitutes the architectural state); on others, it is an explicit structure updated at retirement.

4. Deallocating ROB and Scheduler Resources

The ROB entry is freed, advancing the head pointer. If any associated scheduler resources were still held (uncommon — scheduler entries are typically freed at issue), those are released as well.

Retirement as the Throughput Ceiling

Consider a concrete scenario:

; Tight loop body: 5 micro-ops
.loop:
    VMULPS  YMM0, YMM1, [RSI]     ; 1 uop (load + multiply fused)
    VADDPS  YMM2, YMM0, YMM3      ; 1 uop
    VMOVUPS [RDI], YMM2           ; 1 uop (store)
    ADD     RSI, 32               ; 1 uop
    ADD     RDI, 32               ; 1 uop
    DEC     RCX                    ; (fused with JNZ below)
    JNZ     .loop                  ; 1 uop (macro-fused)

This loop has roughly 6 micro-ops per iteration. On Golden Cove with 8-wide retirement, roughly 1.3 iterations can retire per cycle (8/6). On Skylake with 4-wide retirement, only 0.67 iterations retire per cycle (4/6) — the retirement stage is the bottleneck even if the execution units could handle more.

In practice, other bottlenecks (execution port contention, memory latency, frontend throughput) usually dominate before retirement width becomes the constraint. But for highly optimized inner loops with no cache misses and no branch mispredictions, retirement width can become the limiting factor.

Retirement and Branch Misprediction Recovery

When a branch misprediction is detected, the processor must recover to a known-good state. The interaction with retirement depends on the detection mechanism:

  • Early detection (at execution time): The branch resolves during execution, and the mismatch with the prediction is detected immediately. All instructions younger than the mispredicted branch are flushed from the ROB, scheduler, and pipeline. The branch instruction itself is not yet retired — it still sits in the ROB and will retire normally once the corrected path reaches it.
  • Late detection (at retirement): Some implementations defer certain checks to retirement. This is simpler but adds latency to the recovery path.

In either case, the cost of a misprediction is proportional to the depth of the pipeline and the number of instructions that must be flushed. The ROB’s role is to enable this recovery — without it, there would be no record of the speculative instructions to discard, and no way to restore the architectural state.

Observing Retirement Bottlenecks

Performance analysis tools can reveal when retirement is the bottleneck:

  • Intel VTune / Top-Down Microarchitecture Analysis (TMA): The “Retiring” category in TMA indicates what fraction of pipeline slots were used by instructions that actually retired. A high Retiring percentage (>70%) indicates the backend is being used efficiently. A low Retiring percentage with high Backend Bound suggests resources are stalled waiting for something — often the ROB head.
  • UOPS_RETIRED.RETIRE_SLOTS: This counter measures how many micro-ops actually retired per cycle, giving you the effective retirement throughput.
  • RESOURCE_STALLS.ROB: Reports cycles where the allocator stalled because the ROB was full — a direct indicator that retirement cannot keep up with allocation.

The ROB demo in the Reorder Buffer section lets you observe retirement behavior interactively: watch how the HEAD pointer advances only when the oldest instruction completes, and how increasing the latency of an instruction at the head causes the entire buffer to fill up, blocking further progress.

Summary

Retirement is where the speculative, out-of-order world of the backend converges back to the ordered, deterministic world of the ISA. It is a deceptively simple concept — “confirm the oldest instruction and free its resources” — but its width, throughput, and interaction with the store buffer and register free list make it a fundamental constraint on processor performance. Every micro-op that enters the pipeline must eventually pass through this narrow gate, and the rate at which it can do so sets the ultimate ceiling on sustained instruction throughput.

Why Caches Exist

Main memory (DRAM) delivers data in roughly 50-100 nanoseconds — an eternity for a core that retires an instruction every fraction of a nanosecond. A 4 GHz core waiting on a 70 ns memory access wastes approximately 280 cycles doing nothing. Caches exist to bridge this gap by keeping recently or frequently accessed data in small, fast SRAM arrays physically close to the execution units. Modern CPUs organize these arrays into a multi-level cache hierarchy: L1, L2, and L3 (sometimes called LLC — Last-Level Cache).

Cache Lines: The Unit of Transfer

Caches do not operate on individual bytes. The fundamental unit is the cache line, typically 64 bytes on x86 and most ARM designs. When the CPU loads a single byte from address 0x1000, the entire 64-byte block [0x1000, 0x103F] is fetched into the cache. This design exploits spatial locality — if you touched one byte, you will likely touch its neighbors soon.

Cache line size creates practical consequences. A struct that straddles two cache lines requires two fetches. Aligning hot data structures to 64-byte boundaries avoids these split-line penalties:

struct alignas(64) HotData {
    uint64_t counter;
    uint64_t timestamp;
    // ... fits in one cache line
};

Associativity

Each cache line maps to a specific set in the cache. Associativity determines how many slots (ways) exist within each set:

  • Direct-mapped (1-way): Each address maps to exactly one slot. Fast lookup, but high conflict miss rate.
  • N-way set-associative: Each address maps to a set containing N slots. The hardware checks all N ways in parallel. Typical values: L1d is 8-way or 12-way, L2 is 8-way to 16-way, L3 is 12-way to 16-way.
  • Fully associative: A line can go anywhere. Used only for tiny structures like TLBs (see Section 17: TLB).

Higher associativity reduces conflict misses at the cost of more comparators and slightly higher access latency. The replacement policy (usually pseudo-LRU or adaptive policies in L3) decides which way to evict when the set is full.

The Three Levels

L1 Data Cache (L1d)

The L1d cache sits closest to the execution units and is optimized for latency above all else. Typical parameters on modern cores:

Property      | Typical Value
Size          | 32-48 KB per core
Associativity | 8-12 way
Latency       | 4-5 cycles
Bandwidth     | 2 loads + 1 store/cycle

The L1d must sustain two loads and one store per cycle to keep the backend fed. It achieves this through banked designs — the SRAM is divided into banks that can service independent accesses in parallel. The L1 instruction cache (L1i) is a separate structure discussed in Instruction Fetch.

L2 Cache

The L2 is a unified (instructions + data) cache, private to each core. It serves as the backstop for L1 misses; in some designs it is managed as a victim cache, filled by lines evicted from L1.

Property      | Typical Value
Size          | 256 KB - 2 MB per core
Associativity | 8-16 way
Latency       | 12-14 cycles
Bandwidth     | 64 bytes/cycle

Recent designs (Intel Golden Cove, AMD Zen 4) have pushed L2 sizes to 1.25-2 MB per core, recognizing that server and desktop workloads benefit significantly from larger private caches.

L3 Cache (LLC)

The L3 is typically shared across all cores in a die or chiplet. It serves two purposes: acting as a backstop for L2 misses, and providing a coherence point so that cores can find data modified by other cores without going to main memory.

Property      | Typical Value
Size          | 16-96+ MB shared
Associativity | 12-16 way
Latency       | 30-50+ cycles

AMD’s Zen 3/4 uses a per-CCX L3 of 32 MB (8 cores sharing). Intel’s designs slice the L3 across cores, with each slice adding ~1.5-3 MB. AMD’s 3D V-Cache stacks additional SRAM to reach 96 MB of L3, dramatically improving workloads with large working sets.

Inclusion and Exclusion Policies

The relationship between cache levels matters:

  • Inclusive: Every line in L1 is guaranteed to also be in L3. Simplifies coherence (only need to snoop L3) but wastes capacity since the small L1 contents are duplicated. Intel’s traditional designs were inclusive.
  • Exclusive (strict): A line exists in exactly one level. AMD’s Zen family uses a mostly-exclusive L3, maximizing effective capacity (L2 + L3 rather than just L3).
  • Non-inclusive (NINE): L3 does not guarantee inclusion but does not actively evict from L1 either. Intel adopted this starting with Skylake Server. A back-invalidation is sent to L1/L2 only when a line is evicted from L3.

Understanding the policy matters when reasoning about effective cache capacity. With a non-inclusive 2 MB L2 and 32 MB L3, the effective capacity is closer to 34 MB. With strict inclusion, it is just 32 MB.

Access Latency in Practice

Latency compounds at each level. A rough mental model for a modern x86 core at 4-5 GHz:

Level | Latency (cycles) | Latency (ns)
L1d   | 4-5              | ~1 ns
L2    | 12-14            | ~3 ns
L3    | 30-50            | ~10 ns
DRAM  | 200-400          | ~60-80 ns

Cache Parameters Across Real Processors

The table below compares cache sizes, associativity, and access latencies across representative microarchitectures.

Microarchitecture | L1d Size / Assoc. | L1d Latency | L2 Size / Assoc. | L2 Latency | L3 Size (per core share)
Intel Skylake (2015) | 32 KB / 8-way | 4 cyc | 256 KB / 4-way | 12 cyc | 2 MB/core (shared) [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | 48 KB / 12-way | 5 cyc | 1.25 MB / 10-way | 14 cyc | 3 MB/core (shared) [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024) | 48 KB / 12-way | 5 cyc | 2 MB / 16-way | ~14 cyc | Not publicly documented [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 32 KB / 8-way | 4 cyc | 512 KB / 8-way | 12 cyc | 4 MB/core (32 MB per 8-core CCX) [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 32 KB / 8-way | 4 cyc | 1 MB / 8-way | 12 cyc | 4 MB/core (32 MB per CCX) [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | 48 KB / 12-way | 4 cyc | 1 MB / 8-way | ~14 cyc | 4 MB/core (32 MB per CCX) [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023) | 64 KB / 4-way | ~4 cyc | 512 KB-2 MB / 8-way | ~10-12 cyc | Shared L3 (varies by SoC) [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | 128 KB / 8-way | ~3 cyc | 12 MB shared L2 (across 4 P-cores, functions as LLC) / 12-way | ~15 cyc | N/A (L2 serves as LLC) [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

All caches above use 64-byte cache lines [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023; ARM, Cortex-X4 TRM, 2023].

Notable trends: Intel expanded L1d from 32 KB to 48 KB and L2 from 256 KB to 1.25 MB between Skylake and Golden Cove. AMD doubled L2 from 512 KB to 1 MB between Zen 3 and Zen 4. Apple’s M1 Firestorm has a remarkably large 128 KB L1d — 4x larger than typical x86 L1d caches — reflecting the wider core and the absence of micro-op cache overhead [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]. ARM Cortex-X4 also uses a 64 KB L1d, larger than the 32-48 KB typical of contemporary x86 implementations [Source: ARM, Cortex-X4 TRM, 2023].

Every algorithm that fits in L1 runs an order of magnitude faster than one that spills to DRAM. This is why data layout and access patterns often matter more than instruction count. Techniques like loop tiling (covered in Section 25: Case Studies) restructure computation to maximize cache reuse.

Cache Hierarchy

Visualize how data flows through the L1/L2/L3 cache hierarchy. Misses cascade downward to slower, larger levels.


Key Takeaways

  • Cache lines are 64 bytes; align and pack hot data accordingly.
  • Associativity trades lookup speed for reduced conflict misses.
  • L3 sharing and inclusion policies vary by architecture — measure, do not assume.
  • The latency gap between L1 and DRAM is roughly 60x; keeping data in cache is the single most impactful optimization for most workloads.

The Problem Stores Create

When the CPU executes a store instruction, it cannot simply write to the L1 data cache immediately. The store must first be confirmed as non-speculative (committed), and the cache line must be in the correct coherence state (Modified or Exclusive in a MESI-family protocol). Both of these conditions may take many cycles to resolve. If the CPU stalled on every store, throughput would collapse.

The store buffer (also called the store queue) solves this by decoupling store execution from cache writes. Stores are written into the store buffer at execution time and drained to the cache in the background, after retirement. This allows the pipeline to continue executing instructions without waiting for the memory subsystem to absorb each store.

Store Buffer Structure

The store buffer is a small, ordered FIFO-like structure. Typical sizes on modern cores:

Microarchitecture | Store Buffer Entries | Load Buffer Entries
Intel Skylake (2015) | 56 | 72 [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021) | 72 | 128 [Source: Chips and Cheese, “Golden Cove Microarchitecture”, 2022]
Intel Lion Cove (2024) | Not publicly documented | Not publicly documented [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020) | 64 | 44 [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 64 | 88 [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
AMD Zen 5 (2024) | Not publicly documented | Not publicly documented [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
Apple M1 Firestorm (2020) | Not publicly documented | Not publicly documented

Intel Golden Cove significantly expanded both buffers compared to Skylake (store buffer: 56 to 72, load buffer: 72 to 128), reflecting the wider backend and larger ROB [Source: WikiChip, “Golden Cove”, 2021]. AMD Zen 4 doubled the load buffer from 44 to 88 entries compared to Zen 3, improving the core’s ability to sustain many in-flight loads [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022].

Each entry holds the address, data, and size of a pending store. Entries are allocated at dispatch (when the store micro-op enters the backend) and freed when the data is successfully written to the L1d cache.

When the store buffer fills up, the pipeline stalls — no new store micro-ops can be dispatched. This is a store buffer capacity stall and appears in performance counters as a backend-bound resource limitation. Workloads with high store rates (e.g., memcpy, zeroing buffers, scatter operations) frequently encounter this bottleneck.

The Load Buffer

Loads have a parallel structure: the load buffer (or load queue). Modern cores have 72-128 load buffer entries. Each entry tracks an in-flight load from dispatch until the data is delivered and the load retires.

The load buffer serves several purposes:

  1. Tracks pending loads awaiting cache responses.
  2. Detects memory ordering violations — if a later load executed before an earlier store to the same address, the load must be replayed (see Section 19: Memory Ordering).
  3. Supports load speculation — loads may execute before all prior store addresses are known, requiring the load buffer to perform checks when those store addresses resolve.

Store-to-Load Forwarding

One of the most performance-critical mechanisms in the memory subsystem is store-to-load forwarding (STLF). When a load executes, the hardware checks the store buffer for any pending store to the same address. If a match is found, the data is forwarded directly from the store buffer entry to the load, bypassing the cache entirely.

mov [rax], rcx      ; store to address in rax
mov rdx, [rax]      ; load from same address -- forwarded from store buffer

The load in this example does not wait for the store to reach L1d. Instead, the store buffer supplies the data in approximately 4-5 cycles, matching L1d latency. Without forwarding, the load would stall until the store drained to cache.

When Forwarding Fails

Store-to-load forwarding is not always possible. Common failure cases include:

  • Partial overlap: The load reads a range that partially overlaps but does not fully contain the stored data (e.g., a 4-byte store followed by an 8-byte load from the same base address).
  • Misaligned forwarding: Some microarchitectures cannot forward across cache line boundaries.
  • Size mismatch with offset: A narrow store followed by a wider load starting at a different offset within the same bytes.

Failed forwarding typically adds a penalty of 10-15 additional cycles, as the hardware must wait for the store to drain to cache and then service the load normally. This is a common source of unexpected latency in code that uses type-punning, union tricks, or untyped memory buffers:

// Potential STLF failure: store as float, load as int.
// Note: this pointer aliasing also violates C's strict aliasing rule;
// it is shown only to illustrate the hardware behavior.
float *fp = (float *)buf;
int   *ip = (int *)buf;
*fp = 3.14f;
int bits = *ip;  // may or may not forward depending on uarch

Store Buffer as a Performance Limiter

The store buffer capacity directly limits the number of stores the core can have in-flight. Consider a loop that writes to memory:

for (int i = 0; i < N; i++) {
    output[i] = compute(input[i]);
}

If compute is fast but each iteration produces a store, and the stores miss in L1d (requiring cache line fetches from L2 or L3), the store buffer fills while waiting for cache lines to arrive. Once full, the pipeline stalls even if compute resources are available.

Mitigation strategies include:

  • Non-temporal stores (movnt on x86): bypass the cache entirely, writing directly to a write-combining buffer. Useful for streaming writes to data that will not be re-read soon.
  • Loop tiling: restructure access patterns to stay within cache (see Section 25: Case Studies).
  • Reducing store count: merge multiple narrow stores into wider ones; use vector stores.

Memory Disambiguation

The CPU must maintain the illusion that loads and stores execute in program order, even though out-of-order execution may reorder them. The memory disambiguator determines whether a load can safely execute before an older store whose address is not yet known.

On Intel cores, the memory disambiguator uses a predictive approach: it speculates that loads do not alias with unknown-address stores. If this prediction is wrong (a memory order violation), the load and all dependent instructions are flushed and re-executed — similar to a branch misprediction penalty.

AMD’s Zen cores use a similar speculative disambiguation mechanism, tracking alias history to improve prediction accuracy over time.

Store Buffer & Forwarding

Stores accumulate in the store buffer. When a load address matches a pending store, data is forwarded directly (green path). Otherwise the load goes to cache.


Key Takeaways

  • The store buffer decouples stores from cache writes, keeping the pipeline moving.
  • Store buffer capacity (56-72 entries) limits in-flight store count; exhaustion causes stalls.
  • Store-to-load forwarding is critical for performance but fails on partial overlaps and size mismatches.
  • Memory disambiguation allows loads to execute speculatively before prior stores resolve, at the risk of pipeline flushes on misprediction.

Virtual Memory and the Translation Problem

Modern operating systems give each process its own virtual address space. Every memory access issued by the CPU uses a virtual address that must be translated to a physical address before the cache or DRAM can be accessed. This translation is defined by page tables — multi-level tree structures maintained by the OS in main memory.

On x86-64, the page table walk traverses 4 or 5 levels (PML4/PML5, PDPT, PD, PT), each requiring a memory read. A full 4-level walk means 4 dependent DRAM accesses — potentially 250+ cycles of latency. If every load and store paid this cost, the machine would be unusable. The TLB (Translation Lookaside Buffer) caches recent translations to avoid this penalty.

TLB Structure

The TLB is a small, highly-associative cache that maps virtual page numbers to physical frame numbers. Modern CPUs provide separate TLBs for instructions and data, organized into multiple levels:

Typical TLB Configuration (Intel Golden Cove / AMD Zen 4)

TLB Level | Type        | Entries   | Associativity | Page Sizes
L1 DTLB   | Data        | 64-96     | Fully assoc.  | 4 KB, 2 MB
L1 ITLB   | Instruction | 64-256    | Fully/8-way   | 4 KB, 2 MB
L2 STLB   | Unified     | 1536-2048 | 8-12 way      | 4 KB, 2 MB

TLB Sizes Across Real Processors

Microarchitecture | L1 DTLB (4 KB pages) | L1 DTLB (2 MB pages) | L2 STLB (unified)
Intel Skylake (2015) | 64 entries, fully assoc. | 32 entries, fully assoc. | 1,536 entries, 12-way [Source: Intel, Optimization Reference Manual, 2023]
Intel Golden Cove (2021) | 96 entries, fully assoc. | 32 entries, fully assoc. | 2,048 entries, 16-way [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024) | Not publicly documented | Not publicly documented | Not publicly documented
AMD Zen 3 (2020) | 64 entries, fully assoc. | 64 entries, fully assoc. | 2,048 entries, 8-way [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022) | 72 entries, fully assoc. | 72 entries, fully assoc. | 3,072 entries, 12-way [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]
ARM Cortex-X4 (2023) | 48 entries, fully assoc. | Not publicly documented | 2,048 entries, 8-way [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | ~192 entries (estimated) | Not publicly documented | Not publicly documented [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

AMD Zen 4 significantly increased the L2 STLB to 3,072 entries, providing 12 MB of TLB coverage for 4 KB pages, 50% more than Zen 3’s 2,048 entries (8 MB) [Source: Chips and Cheese, “Zen 4 Microarchitecture”, 2022]. Intel Golden Cove also expanded L1 DTLB from 64 to 96 entries and L2 STLB from 1,536 to 2,048 entries [Source: WikiChip, “Golden Cove”, 2021]. With 2 MB huge pages, Golden Cove’s STLB covers 4 GB of address space, virtually eliminating TLB misses for most workloads.

L1 TLB lookups happen in parallel with the cache access (on virtually-indexed, physically-tagged caches — see below). An L1 DTLB hit adds zero extra cycles. An L1 DTLB miss that hits in the L2 STLB costs roughly 7-8 cycles. A full TLB miss triggers a hardware page walk.

The Page Walk

When both TLB levels miss, the CPU’s page walk unit (also called the table walker) traverses the page table hierarchy in hardware. On x86-64 with 4-level paging:

Virtual Address: [PML4 index | PDPT index | PD index | PT index | Offset]
                  9 bits       9 bits       9 bits     9 bits     12 bits

Each level requires reading an 8-byte entry from memory. The walk is:

  1. Read PML4 entry using CR3 base + PML4 index.
  2. Read PDPT entry using PML4 entry’s physical address + PDPT index.
  3. Read PD entry using PDPT entry’s physical address + PD index.
  4. Read PT entry using PD entry’s physical address + PT index.
  5. Combine the physical frame from the PT entry with the page offset.

Each step depends on the result of the previous, creating a serial chain. If the intermediate page table entries are cached in L1d or L2, the walk completes in 20-30 cycles. If they miss to DRAM, the walk can take 200+ cycles.

Modern CPUs have multiple page walk units (typically 2 per core) so that several TLB misses can be serviced in parallel. This helps when the code touches many new pages in a short window (e.g., traversing a large hash table or graph).

VIPT: Virtually Indexed, Physically Tagged

Most L1d caches use a virtually indexed, physically tagged (VIPT) design. The index bits come from the virtual address (which is available immediately), while the tag comparison uses the physical address (from the TLB). This works without aliasing issues as long as the number of index + offset bits does not exceed the page offset size.

For a 32 KB, 8-way L1d with 64-byte lines:

  • Sets = 32768 / (8 x 64) = 64 sets
  • Index bits = log2(64) = 6 bits
  • Offset bits = log2(64) = 6 bits
  • Total index + offset = 12 bits = page offset for 4 KB pages

This is not a coincidence. L1d size and associativity are constrained so that the index bits fall within the page offset, avoiding the synonym problem entirely. If you see a CPU with a 48 KB L1d (like Intel Golden Cove), the associativity is increased to 12-way to keep index + offset within 12 bits (for 4 KB pages), or the design uses other techniques like way prediction. Apple’s 128 KB, 8-way L1d sidesteps the constraint differently: Apple Silicon uses 16 KB pages, giving 14 offset bits to index within.

The Impact of Huge Pages

Standard 4 KB pages mean a 1 GB working set requires 262,144 TLB entries — far more than any TLB can hold. TLB misses become a significant performance bottleneck for applications with large memory footprints (databases, scientific computing, ML inference).

Huge pages (2 MB on x86, 1 GB for the largest) reduce TLB pressure dramatically:

Page Size | Pages for 1 GB | TLB Coverage (1536 entries)
4 KB      | 262,144        | 6 MB
2 MB      | 512            | 3 GB
1 GB      | 1              | 1536 GB

With 2 MB pages, the L2 STLB alone covers 3 GB, effectively eliminating TLB misses for many workloads.

Huge pages also shorten the page walk by one level (the final PT level is skipped for 2 MB pages), reducing miss latency. On Linux, huge pages can be allocated explicitly (mmap with MAP_HUGETLB) or transparently via THP (Transparent Huge Pages), though THP can introduce latency spikes from compaction and should be evaluated per-workload.

# Check THP status on Linux
cat /sys/kernel/mm/transparent_hugepage/enabled

# Allocate explicit huge pages
echo 512 > /proc/sys/vm/nr_hugepages

TLB Shootdowns

In multiprocessor systems, when the OS modifies page table entries (e.g., during munmap or page migration), it must invalidate stale TLB entries on all cores that may have cached them. This is done via TLB shootdown — an inter-processor interrupt (IPI) that forces remote cores to flush their TLBs.

TLB shootdowns are expensive: the interrupting core stalls waiting for acknowledgment, and the interrupted cores must pause execution, flush TLB entries, and resume. Frequent shootdowns (from aggressive mmap/munmap patterns or memory-mapped I/O) can measurably degrade performance. Newer architectures support PCID (Process Context Identifiers) to tag TLB entries per-process, avoiding full flushes on context switches.

TLB & Page Walk

Visualize TLB hits (1-cycle translation) versus multi-level page table walks on misses. Toggle huge pages to see how 2 MB pages reduce walk depth and increase TLB coverage.


Key Takeaways

  • Every memory access requires virtual-to-physical translation; the TLB caches these translations.
  • A TLB miss triggers a hardware page walk costing 20-200+ cycles depending on page table cacheability.
  • Use 2 MB huge pages for workloads with large footprints to reduce TLB miss rates by orders of magnitude.
  • TLB shootdowns on multicore systems are an often-overlooked source of latency; minimizing page table mutations helps.

The Latency Problem, Again

Even with a well-tuned cache hierarchy (see Section 15: Cache Hierarchy), programs inevitably miss in cache. An L2 miss that must fetch from L3 costs 30-50 cycles; an LLC miss to DRAM costs 200+. If the CPU only begins fetching data when the load instruction executes, those cycles are wasted. Hardware prefetching attempts to predict future accesses and initiate fetches before the data is needed, overlapping memory latency with useful computation.

Types of Hardware Prefetchers

Modern CPUs employ multiple prefetching engines simultaneously, each targeting different access patterns. The exact set varies by microarchitecture, but the common categories are:

Prefetcher Availability by Vendor

Prefetcher Type | Intel (Skylake/Golden Cove) | AMD (Zen 3/Zen 4) | ARM (Cortex-X4)
L1d Next-Line (DCU) | Yes (togglable via MSR) | Yes | Yes [Source: ARM, Cortex-X4 TRM, 2023]
L1d Stride (IP-based) | Yes (togglable via MSR) | Yes | Yes
L2 Streamer | Yes (up to 20 lines ahead, togglable) | Yes (stream-based) | Yes
L2 Adjacent Line | Yes (togglable via MSR) | No | Not publicly documented
Region-based (spatial) | No (in Skylake); introduced in later designs | Yes (Zen 3+: region-based L2 prefetcher) | Yes (Cortex-X4: SMS/DCPF)

[Source: Intel, Optimization Reference Manual, 2023; AMD, Software Optimization Guide for AMD Family 19h, 2022]

Intel has four configurable prefetchers (L1d DCU, L1d IP-based stride, L2 streamer, L2 adjacent line) that can each be independently toggled via MSR 0x1A4 [Source: Intel, Optimization Reference Manual, 2023]. Golden Cove and later may add additional spatial/region prefetching.

AMD Zen 3/4 uses a different mix: L1d next-line and stride prefetchers similar to Intel, but replaces the L2 adjacent line prefetcher with a more sophisticated region-based L2 prefetcher that tracks spatial access patterns within larger memory regions [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]. AMD’s prefetchers are generally less configurable than Intel’s (fewer MSR-level toggles).

ARM Cortex-X4 and Apple M-series cores implement prefetchers tailored for mobile and SoC workloads. ARM cores typically include stride, stream, and spatial (SMS-like) prefetchers. Apple’s cores have been observed to have highly aggressive prefetching that handles both stride and irregular patterns better than typical x86 implementations [Source: ARM, Cortex-X4 TRM, 2023].

L1d Next-Line (Adjacent Line) Prefetcher

The simplest prefetcher. On an L1d cache miss at line N, the hardware also fetches line N+1 (and sometimes N-1). This exploits spatial locality and is effective for sequential access patterns.

On Intel cores, this is called the DCU prefetcher and can be toggled via MSR. It works well for streaming accesses but wastes bandwidth when access patterns are sparse (e.g., accessing every 8th cache line in a large array).

L1d Stride Prefetcher (IP-based)

Tracks the memory access history per load instruction (identified by instruction pointer / IP). If a load at IP 0x4010a0 accesses addresses A, A+128, A+256, the prefetcher detects a stride of 128 bytes and begins prefetching A+384, A+512, and so on.

The stride prefetcher is critical for:

  • Traversing arrays of structs with non-unit stride
  • Matrix column access (stride = row size in bytes)
  • Linked data structures with predictable allocation patterns

It typically tracks 32-64 streams simultaneously. If too many different strides compete, older entries are evicted and those streams lose prefetch coverage.

L2 Streamer Prefetcher

The L2 streamer detects sequential access patterns at the L2 level and prefetches multiple lines ahead into L2 (and sometimes into L1d). It can detect both forward and backward streams and maintain a larger prefetch distance (how far ahead to fetch) than the L1 prefetchers.

Intel’s L2 streamer can track up to 32 streams and prefetch up to 20 lines ahead of the current demand access. When it detects a stable stream, it progressively increases the prefetch distance — a technique called adaptive prefetch distance.

L2 Adjacent Line Prefetcher

Similar to the L1 next-line prefetcher but operates at L2 granularity. On an L2 miss for line N, it fetches line N+1 into L2. This doubles the effective L2 bandwidth utilization for sequential patterns but consumes L2 capacity more aggressively.

When Prefetching Helps

Prefetching delivers the most benefit when:

  1. Access patterns are regular: Sequential scans, constant-stride iterations, and streaming workloads see major speedups. A well-prefetched stream can hide nearly all DRAM latency.

  2. Compute-to-memory ratio is low: For “thin” loops that do little work per element, the out-of-order window alone cannot hide memory latency, and prefetching is what keeps the pipeline fed. (When each cache line’s data requires substantial computation, the core already overlaps compute with its outstanding misses.)

  3. Working set exceeds cache but fits in DRAM: Prefetching helps bridge the L3-to-DRAM gap. If data fits entirely in cache, prefetching has no role.

// Sequential scan: hardware prefetchers handle this perfectly
float sum = 0.0f;
for (int i = 0; i < N; i++) {
    sum += array[i];
}

When Prefetching Hurts

Hardware prefetching is not free. It consumes memory bandwidth, cache capacity, and power. It can actively degrade performance in these scenarios:

Cache Pollution

Prefetched lines evict useful data from the cache. If the prefetched data is never used (because the access pattern is unpredictable), the net effect is negative. This is common with pointer-chasing workloads (linked lists, trees, hash tables with chaining) where the next address depends on the loaded data.

Bandwidth Waste

On bandwidth-constrained platforms (e.g., multi-threaded server workloads sharing memory controllers), unnecessary prefetches steal bandwidth from demand requests. This can increase tail latency for other threads.

Irregular Access Patterns

Patterns like indirect array access (A[B[i]]) defeat stride detection. The hardware may detect a false stride and prefetch entirely wrong addresses:

// Indirect access: prefetchers cannot predict B[i]
for (int i = 0; i < N; i++) {
    sum += A[B[i]];  // random-looking access to A
}

For such patterns, software prefetching may help if the programmer can compute future addresses:

for (int i = 0; i < N; i++) {
    if (i + 8 < N)  // guard: B[i + 8] must stay in bounds
        __builtin_prefetch(&A[B[i + 8]], 0, 1);  // prefetch 8 iterations ahead
    sum += A[B[i]];
}

Software prefetches (prefetchnta, prefetcht0 on x86; prfm on ARM) inject explicit prefetch requests. They require manual tuning of the prefetch distance and can make code less portable, but they are sometimes the only option for irregular patterns.

Controlling Hardware Prefetchers

On x86, hardware prefetchers can be individually enabled or disabled via MSR 0x1A4 (model-specific; varies by generation). This is sometimes useful for:

  • Benchmarking: Isolating the contribution of each prefetcher.
  • Latency-sensitive workloads: Disabling L2 adjacent-line prefetcher to reduce cache pollution.
  • NUMA-aware tuning: Reducing cross-socket prefetch traffic.
# Example: disable the L2 adjacent line prefetcher on Intel
# (a set bit in MSR 0x1A4 disables the corresponding prefetcher).
# Note that wrmsr overwrites the whole MSR: read the current value
# first and OR in bit 1 to preserve the other prefetcher settings.
rdmsr -p 0 0x1A4
wrmsr -p 0 0x1A4 0x2

On server platforms, BIOS settings often expose prefetcher controls. Cloud environments typically do not allow MSR modifications, so software prefetch instructions and data layout changes are the primary tuning knobs.

Interaction with Out-of-Order Execution

The out-of-order engine (described in Reorder Buffer) provides a form of implicit prefetching. By executing loads well ahead of when their results are consumed, the OoO window naturally overlaps memory latency with computation. A core with a 256-entry reorder buffer can “see” roughly 100+ instructions ahead, covering 20-50 cycles of latency naturally.

Hardware prefetchers extend this window further. The L2 streamer may fetch 20 lines ahead — at 64 bytes per line, that is 1280 bytes of lookahead. Combined with the OoO window, the effective lookahead can cover most of the DRAM latency for regular patterns.

Key Takeaways

  • Modern CPUs run 4-6 prefetchers simultaneously, each targeting different patterns.
  • Sequential and strided accesses are well-handled by hardware; irregular patterns are not.
  • Prefetching trades bandwidth and cache capacity for latency hiding — measure to confirm it helps.
  • Software prefetch intrinsics (__builtin_prefetch, _mm_prefetch) can help with indirect or irregular access patterns but require careful tuning of prefetch distance.
  • Consider the interaction between OoO execution lookahead and hardware prefetch; together they can hide substantial memory latency.

The Ordering Illusion

A modern out-of-order CPU executes instructions in whatever order maximizes throughput, not in the order the programmer wrote them. This includes memory operations — loads and stores may execute out of their original program order. The memory model defines what reorderings the hardware is allowed to perform and what guarantees it provides to software.

Understanding memory ordering is essential for writing correct lock-free data structures, reasoning about multi-threaded programs, and understanding the performance implications of memory fences.

Single-Core Ordering

For a single thread, the CPU maintains the illusion of sequential execution. Even though loads and stores may execute out of order internally, the results visible to that thread are indistinguishable from in-order execution. This guarantee is achieved through mechanisms like the store buffer, load buffer, and memory disambiguation (see Section 16: Store Buffer).

Key single-thread guarantees:

  • Loads are never reordered past loads to the same address.
  • Store-to-load forwarding ensures a load sees the most recent store to the same address from the same thread.
  • Stores retire in order from the store buffer to the cache.

The CPU may still reorder loads with respect to other loads (to different addresses) and loads with respect to stores, as long as single-thread semantics are preserved.

Multi-Core: Where Ordering Matters

On a single core, reordering is invisible to the program. On multiple cores, it becomes visible and can cause bugs. Consider two threads:

Thread 1:                    Thread 2:
store X = 1                  store Y = 1
load r1 = Y                  load r2 = X

Under sequential consistency (the strongest model), at least one thread must see the other’s store, so r1 = 0 AND r2 = 0 is impossible. But on real hardware (x86 included), this outcome can occur because each core’s store sits in its local store buffer and is not yet visible to the other core when the load executes.

x86-TSO: Total Store Order

x86 implements a relatively strong memory model called TSO (Total Store Order). The rules are:

  1. Loads are not reordered with other loads. If load A is before load B in program order, A executes first (or at least appears to).
  2. Stores are not reordered with other stores. Stores become visible to other cores in program order.
  3. Stores may be reordered after later loads. This is the one relaxation: a load can execute before an older store (to a different address) becomes visible. This is the store buffer effect.
  4. Locked instructions have total order. LOCK-prefixed instructions (and XCHG with memory) act as full barriers.

TSO means x86 programmers rarely need explicit fences for acquire/release semantics — loads naturally have acquire semantics and stores naturally have release semantics. The only pattern that requires a fence is the store-load pattern (shown above), which needs an MFENCE or a LOCKed operation.

ARM and RISC-V: Weaker Models

ARM (ARMv8-A) and RISC-V implement weakly ordered memory models where all four reorderings are permitted:

  • Load-Load reordering
  • Load-Store reordering
  • Store-Load reordering
  • Store-Store reordering

This gives the hardware maximum freedom to optimize, but requires the programmer (or compiler) to insert explicit barriers for correctness.

Memory Model Comparison by ISA

Property                 | x86/x86-64 (TSO)                  | ARM (ARMv8-A)      | RISC-V (RVWMO)
Load-Load reordering     | Not allowed                       | Allowed            | Allowed
Load-Store reordering    | Not allowed                       | Allowed            | Allowed
Store-Store reordering   | Not allowed                       | Allowed            | Allowed
Store-Load reordering    | Allowed                           | Allowed            | Allowed
Acquire/Release built-in | Loads = acquire, Stores = release | Must use LDAR/STLR | Must use fence or .aq/.rl
Full fence instruction   | MFENCE / LOCK prefix              | DMB SY             | fence

[Source: Intel, Intel 64 and IA-32 Architectures SDM Vol 3, 2023; ARM, ARM Architecture Reference Manual ARMv8-A, 2023]

The practical consequence: x86 code rarely needs explicit fences for acquire/release patterns because TSO provides them automatically. ARM code must use LDAR/STLR (load-acquire/store-release) or DMB barriers for correct lock-free programming. This is a major source of porting bugs when moving concurrent code from x86 to ARM (e.g., when migrating to AWS Graviton or Apple Silicon) [Source: ARM, ARM Architecture Reference Manual ARMv8-A, 2023].

// ARM acquire-release pattern
STLR X0, [X1]    // store-release: all prior accesses complete before this store
LDAR X2, [X3]    // load-acquire: this load completes before any subsequent access

ARM provides DMB (Data Memory Barrier), DSB (Data Synchronization Barrier), and ISB (Instruction Synchronization Barrier) with various scope options (inner/outer shareable, load/store/full).

In practice, programmers should use language-level atomics (std::atomic in C++, Atomic* in Java/Rust) rather than raw fence instructions. The compiler maps these to the correct architecture-specific barriers.

The Store Buffer’s Role in Ordering

The store buffer (covered in Section 16) is the primary microarchitectural mechanism responsible for store-load reordering. When a core executes a store, the data enters the store buffer but is not yet visible to other cores. A subsequent load from a different address can execute immediately (reading from cache or memory), seeing a state that does not yet include the local store.

This is not a bug — it is a deliberate design choice. Making stores visible instantly would require draining the store buffer on every store, destroying performance. The store buffer lets stores drain to the cache asynchronously while the core continues executing.

An MFENCE instruction on x86 forces the store buffer to drain before any subsequent load can execute:

mov dword [X], 1   ; store to X
mfence             ; drain store buffer -- all prior stores visible to all cores
mov rax, [Y]       ; load Y -- cannot execute until the store to X is globally visible

MFENCE is expensive: it stalls the pipeline for 20-40+ cycles on modern x86, depending on store buffer occupancy and memory subsystem state.

Compiler Reordering vs. Hardware Reordering

Memory ordering has two layers:

  1. Compiler reordering: The compiler may reorder loads and stores during optimization, as long as single-thread semantics are preserved. A volatile qualifier (in C/C++) prevents the compiler from reordering accesses to that variable but does not prevent hardware reordering.

  2. Hardware reordering: The CPU reorders at runtime. Preventing this requires fence instructions.

Both must be addressed for correctness. C++11 atomics handle both: std::atomic with appropriate memory ordering (memory_order_acquire, memory_order_release, memory_order_seq_cst) inserts both compiler barriers and hardware fences as needed.

std::atomic<int> flag{0};
int data = 0;

// Producer
data = 42;
flag.store(1, std::memory_order_release);  // compiler + hardware barrier

// Consumer
while (flag.load(std::memory_order_acquire) == 0) {}  // spin
assert(data == 42);  // guaranteed to see 42

Performance Implications

Memory fences and atomic operations have real performance costs:

Operation                     | Approximate Cost (cycles)
Regular load (L1 hit)         | 4-5
Regular store                 | 4-5
MFENCE                        | 20-40
LOCK ADD (atomic RMW, L1 hit) | 15-25
LOCK CMPXCHG (CAS, L1 hit)    | 15-25
LOCK CMPXCHG (CAS, contended) | 50-200+

In the uncontended case, atomic operations add 3-5x overhead over regular accesses. Under contention (multiple cores hammering the same cache line), costs explode due to cache coherence traffic. The cache line bounces between cores in Modified state, and each access incurs the latency of the coherence protocol.

This is why high-performance concurrent code uses techniques like per-thread counters (sharded atomics), read-copy-update (RCU), and careful data structure padding to avoid false sharing (see Section 25: Case Studies).

Key Takeaways

  • x86 has a strong (TSO) model where only store-load reordering is allowed; ARM and RISC-V allow all four types.
  • Use language-level atomics, not raw fences or volatile, for portable correctness.
  • Memory fences and atomic operations are 3-50x more expensive than regular memory accesses; minimize their use in hot paths.
  • The store buffer is the microarchitectural root cause of store-load reordering.
  • False sharing (two threads writing to different variables on the same cache line) causes coherence traffic equivalent to true sharing; pad shared structures to cache line boundaries.

Branches Are Expensive When Mispredicted

As discussed in Branch Prediction, a mispredicted branch costs 15-20+ cycles of pipeline flush. For branches with a near-50/50 taken/not-taken ratio, even the best predictors struggle, resulting in frequent mispredictions and significant throughput loss.

Predication offers an alternative: instead of branching, compute both paths and select the correct result with a conditional move or select instruction. This eliminates the branch entirely, trading potential misprediction penalty for a fixed, predictable cost.

CMOVcc on x86

The x86 ISA provides conditional move instructions (CMOVcc) that move data based on condition flags without branching:

; Branching version
cmp  eax, ebx
jge  .use_eax
mov  eax, ebx       ; eax = max(eax, ebx)
.use_eax:

; Predicated version
cmp  eax, ebx
cmovl eax, ebx      ; eax = (eax < ebx) ? ebx : eax -- no branch

The CMOV version always executes both the comparison and the move. There is no branch, so there is no possibility of misprediction. The cost is fixed: the CMOV instruction itself has a latency of 1-2 cycles on most microarchitectures.

CMOVcc Latency Across Microarchitectures

Microarchitecture         | CMOVcc Latency (cycles) | CMOVcc Throughput (per cycle)
Intel Skylake (2015)      | 1 cyc                   | 2 per cyc (ports 0, 6) [Source: Agner Fog, Instruction Tables, 2023]
Intel Golden Cove (2021)  | 1 cyc                   | 3 per cyc [Source: Agner Fog, Instruction Tables, 2023]
AMD Zen 3 (2020)          | 1 cyc                   | 4 per cyc (any ALU pipe) [Source: Agner Fog, Instruction Tables, 2023]
AMD Zen 4 (2022)          | 1 cyc                   | 4 per cyc [Source: Agner Fog, Instruction Tables, 2023]
Apple M1 Firestorm (2020) | 1 cyc (CSEL equivalent) | Not publicly documented

On both Intel and AMD, CMOVcc has 1-cycle latency. A correctly predicted branch has effectively zero cost, so CMOVcc pays off only when the branch predicts poorly. The key difference between vendors is throughput: AMD Zen cores can execute CMOVcc on any of the 4 ALU pipes, while Skylake limits it to 2 ports [Source: Agner Fog, Instruction Tables, 2023]. Golden Cove improved CMOVcc throughput to 3 per cycle.

Note that CMOVcc with a memory source operand (e.g., cmovl eax, [rdi]) always performs the load regardless of the condition, adding load latency (4-5 cycles for an L1d hit). This differs from a branching version where the load is skipped if the branch is not taken.

In C/C++, the compiler can emit CMOV for simple ternary expressions:

int max = (a > b) ? a : b;    // likely compiled to CMP + CMOV

When Predication Wins

Predication is superior to branching when:

1. The Branch Is Hard to Predict

Data-dependent branches with near-random outcomes (e.g., filtering based on data values, binary search comparisons, sorting network comparisons) achieve close to 50% misprediction rate. A 15-cycle misprediction penalty every other iteration is devastating.

// Hard-to-predict branch: data-dependent with ~50% taken rate
for (int i = 0; i < N; i++) {
    if (data[i] > threshold) {
        count++;
    }
}

Converting to a branchless form:

for (int i = 0; i < N; i++) {
    count += (data[i] > threshold);  // comparison produces 0 or 1, no branch
}

This version compiles to arithmetic instructions without any branch, completely eliminating misprediction penalties. For uniformly distributed random data, the branchless version can be 2-5x faster.

2. The Conditional Operation Is Cheap

If the work inside the branch is a single register move, add, or similar low-latency operation, predication is efficient. The wasted computation of the “wrong” path is negligible.

3. The Loop Body Is Short

In tight loops, even a single misprediction per iteration can dominate execution time. Predication keeps the pipeline fully utilized.

When Predication Loses

Predication is not always the right choice. It is worse when:

1. The Branch Is Highly Predictable

If the branch is taken 99% of the time, the predictor handles it almost perfectly. Predication wastes work on the not-taken path every iteration, while branching pays the misprediction penalty only 1% of the time:

// Almost always true -- predictor handles this well
if (ptr != nullptr) {
    result = ptr->value;  // a branch is better here
}

2. The Conditional Operation Is Expensive

If the “other” path involves a memory access, a division, or other high-latency operation, predication forces the CPU to execute it unconditionally:

// Bad candidate for predication: expensive false path
int result = condition ? cheap_value : expensive_function();

With a branch, the expensive path is only executed when needed. With predication, it always executes.

3. CMOV Creates a Long Dependency Chain

CMOV consumes both source values as inputs, creating a data dependency on both paths. In a loop, this can serialize iterations:

; Loop-carried dependency through CMOV
loop:
    ; ...compute new_val...
    cmp  new_val, current_min
    cmovl current_min, new_val   ; current_min depends on BOTH new_val and old current_min
    ; next iteration depends on current_min

The branch version allows the predictor to speculate that current_min does not change (which is true most of the time for a min-scan), breaking the dependency chain. The CMOV version must wait for current_min to be resolved every iteration.

This is a subtle but important tradeoff: branches allow speculative execution down the predicted path; predication does not.

Compiler Heuristics

Modern compilers (GCC, Clang, MSVC) use heuristics to decide between branches and CMOV:

  • Profile-guided optimization (PGO): With real branch statistics, the compiler can choose optimally.
  • Branch probability annotations: __builtin_expect or [[likely]]/[[unlikely]] hints guide the compiler.
  • Cost model: The compiler estimates the cost of both paths. Short, cheap operations favor CMOV; long, expensive operations favor branches.

GCC’s -fif-conversion controls whether the compiler converts branches to conditional moves. Clang generally applies CMOV conversion more aggressively than GCC.

You can force branchless code via arithmetic tricks:

// Force branchless min (caution: a - b must not overflow)
int min = b + ((a - b) & ((a - b) >> 31));  // 32-bit signed ints, overflow-free inputs only

However, such manual tricks reduce readability and may not outperform a well-chosen CMOV. Prefer letting the compiler decide, and use PGO to give it accurate information.

Predication on ARM

ARM has a richer predication model. ARMv7 supports fully predicated instructions — almost any instruction can be conditionally executed:

CMP   R0, R1
ADDGE R2, R2, #1   ; add only if R0 >= R1

ARMv8 (AArch64) removed universal predication in favor of conditional select (CSEL, CSINC, CSNEG) and conditional compare (CCMP) instructions, which are more efficient for the out-of-order pipeline:

CMP   W0, W1
CSEL  W2, W0, W1, GT   ; W2 = (W0 > W1) ? W0 : W1

ARM’s SVE and SVE2 vector extensions use predicate registers for masked vector operations, enabling per-lane predication in SIMD (see Section 21: SIMD).

Predication vs Branching

Compare throughput of branch instructions vs predicated (CMOVcc) execution. Drag the slider to change branch predictability.

[Interactive demo: drag the predictability slider to compare branch vs. CMOVcc throughput. CMOVcc always executes both paths but sustains constant throughput; branches are faster when highly predictable and degrade with mispredictions (modeled here with a 15-cycle penalty per mispredict).]

Key Takeaways

  • Predication replaces branches with conditional data movement, eliminating misprediction penalties.
  • Use predication when the branch is hard to predict and the conditional operation is cheap.
  • Avoid predication when the branch is predictable (>90% one way) or when it creates long dependency chains.
  • Compilers make reasonable decisions but benefit greatly from PGO; manual branchless tricks should be a last resort.
  • Profile before deciding: the difference between branching and predication depends on the actual misprediction rate, which varies with runtime data.

Data-Level Parallelism

Instruction-level parallelism (ILP) allows multiple independent scalar instructions to execute in the same cycle. Data-level parallelism (DLP) takes a different approach: a single instruction operates on multiple data elements simultaneously. This is the core idea behind SIMD — Single Instruction, Multiple Data.

Instead of adding two integers with one ADD instruction, a SIMD ADD processes 4, 8, 16, or more integers in a single operation. The throughput gain is proportional to the vector width — the number of elements processed per instruction.

The SIMD Landscape

x86 SIMD Evolution

Extension | Register Width | Year | Key Features
SSE       | 128-bit (XMM)  | 1999 | 4x float
SSE2      | 128-bit        | 2001 | 2x double, integer ops
AVX       | 256-bit (YMM)  | 2011 | 8x float, 4x double
AVX2      | 256-bit        | 2013 | 256-bit integer ops, gather, FMA
AVX-512   | 512-bit (ZMM)  | 2017 | 16x float, masking, scatter, conflict detection

Register width has doubled twice (128 to 256 to 512 bits), with intermediate generations adding new capabilities at the same width. AVX2 is the de facto baseline for modern x86 performance work; AVX-512 offers further gains but with caveats (discussed below).

SIMD ISA Support by Microarchitecture

Microarchitecture            | SSE4.2 | AVX2 | AVX-512                            | Notes
Intel Skylake (2015, client) | Yes    | Yes  | No (Skylake-X server only)         | AVX-512 restricted to HEDT/server SKUs [Source: Intel, Optimization Reference Manual, 2023]
Intel Golden Cove (2021)     | Yes    | Yes  | Yes (P-cores only)                 | E-cores (Gracemont) lack AVX-512; OS may disable it for uniformity [Source: Intel, Optimization Reference Manual, 2023]
Intel Lion Cove (2024)       | Yes    | Yes  | Yes (AVX10/256 baseline)           | Moving toward AVX10 convergence [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020)             | Yes    | Yes  | No                                 | No AVX-512 support [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022)             | Yes    | Yes  | Yes (double-pumped 256-bit)        | AVX-512 via 2x 256-bit passes; no frequency penalty [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 5 (2024)             | Yes    | Yes  | Yes (native 512-bit on some units) | Full-width 512-bit for some operations [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023)         | N/A    | N/A  | N/A                                | NEON (128-bit mandatory), SVE2 (128-256 bit, implementation-defined) [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020)    | N/A    | N/A  | N/A                                | NEON (128-bit), no SVE [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

Intel’s AVX-512 story is particularly complex. AVX-512 debuted on Skylake-X (server/HEDT) in 2017. It was present on Golden Cove P-cores in Alder Lake (2021), but the heterogeneous E-cores lacked it, leading some OS configurations to disable it entirely. Intel is converging on AVX10 as the future ISA, which standardizes 256-bit as the baseline and makes 512-bit optional [Source: Intel, “AVX10 ISA Specification”, 2023]. AMD entered AVX-512 with Zen 4 using a pragmatic double-pumped approach that avoids the clock throttling issues that plagued Intel’s implementations.

ARM SIMD

Extension | Register Width | Notes
NEON      | 128-bit        | 32 registers, ubiquitous on ARMv8
SVE       | 128-2048 bit   | Scalable; vector length set by hardware
SVE2      | 128-2048 bit   | Adds fixed-point, crypto, gather/scatter

ARM’s SVE (Scalable Vector Extension) uses a vector-length-agnostic programming model: code compiles once and runs on hardware with any vector width from 128 to 2048 bits. This avoids the recompilation and dispatch problem that plagues x86’s fixed-width ISA extensions.

How SIMD Instructions Work

A SIMD add of 8 floats (AVX):

vmovaps ymm0, [rsi]        ; load 8 floats from memory into ymm0
vmovaps ymm1, [rdi]        ; load 8 floats from memory into ymm1
vaddps  ymm2, ymm0, ymm1   ; add 8 pairs of floats in parallel
vmovaps [rdx], ymm2         ; store 8 result floats

This processes 8 additions in a single instruction, versus 8 scalar addss instructions. On a core with 2 SIMD add units, the throughput is 16 float additions per cycle.

In C/C++, SIMD can be accessed via:

  1. Intrinsics — thin wrappers around SIMD instructions:

    #include <immintrin.h>
    __m256 a = _mm256_load_ps(src_a);
    __m256 b = _mm256_load_ps(src_b);
    __m256 c = _mm256_add_ps(a, b);
    _mm256_store_ps(dst, c);
  2. Auto-vectorization — the compiler transforms scalar loops into SIMD code automatically.

  3. Explicit vector types (GCC/Clang __attribute__((vector_size(32))))

Auto-Vectorization

Modern compilers (GCC, Clang, MSVC, ICC) can automatically convert scalar loops into SIMD operations. This works best when:

  • The loop has no loop-carried dependencies (each iteration is independent).
  • The trip count is known or can be bounded.
  • Memory accesses are contiguous (unit stride).
  • There are no function calls in the loop body (unless the function is inline and side-effect free).
  • There is no aliasing between input and output arrays.

// Good auto-vectorization candidate
void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}

Compiling with gcc -O2 -march=native or clang -O2 -march=native will auto-vectorize this to AVX/AVX2 on capable hardware.

Helping the Auto-Vectorizer

When auto-vectorization fails, you can often fix it without resorting to intrinsics:

  • restrict pointers: Tell the compiler that arrays do not alias.
    void add_arrays(float * restrict dst, const float * restrict a, const float * restrict b, int n)
  • #pragma omp simd: Explicitly request vectorization (OpenMP 4.0+).
  • Avoid data-dependent control flow: Branches inside loops inhibit vectorization. Use predication or masking instead (see Section 20: Predication).
  • Align data: Allocate arrays with aligned_alloc(32, size) for AVX or aligned_alloc(64, size) for AVX-512.

Use compiler reports to diagnose vectorization failures:

gcc -O2 -march=native -fopt-info-vec-missed  # GCC
clang -O2 -march=native -Rpass-missed=loop-vectorize  # Clang

SIMD Performance Considerations

Throughput vs. Latency

SIMD instructions have the same or similar latency as scalar equivalents (3-5 cycles for FP add, 4-5 for FP multiply) but process N elements per instruction. The throughput gain is roughly Nx, minus overhead from shuffle/blend operations, alignment handling, and loop preamble/epilogue.

Memory Bandwidth

SIMD increases compute throughput but also increases memory demand proportionally. A loop that processes 8 floats per iteration needs 8x the memory bandwidth. If the workload is already memory-bound (see Section 26: Measurement), wider SIMD provides no benefit — it just waits on memory faster.

AVX-512 Power and Frequency Concerns

On Intel CPUs, AVX-512 instructions can cause the core to reduce its frequency (the “AVX offset”). This happens because the wider 512-bit execution units consume more power and generate more heat. The frequency reduction can be 100-300 MHz:

  • AVX2 (256-bit): Small or no frequency reduction.
  • AVX-512 (light): ~100 MHz reduction.
  • AVX-512 (heavy FP): ~200-300 MHz reduction.

If only a small fraction of the code uses AVX-512, the frequency reduction may penalize the surrounding scalar code. The net effect can be negative. Measure the full workload, not just the SIMD kernel.

On AMD Zen 4, AVX-512 is implemented by double-pumping 256-bit datapaths, avoiding the worst frequency penalties while still providing AVX-512 programmability; Intel's AVX10/256 direction pursues a similar goal by making 256-bit the guaranteed width.

Horizontal Operations

SIMD works best for vertical operations (element-wise across vectors). Horizontal operations (reducing a vector to a scalar, like summing all elements) require shuffles and are much less efficient:

// Horizontal sum of a 256-bit vector: requires multiple shuffles
__m256 v = ...;
__m128 hi = _mm256_extractf128_ps(v, 1);
__m128 lo = _mm256_castps256_ps128(v);
__m128 sum = _mm_add_ps(lo, hi);           // 4 floats
sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));  // 2 floats
sum = _mm_add_ss(sum, _mm_movehdup_ps(sum));     // 1 float
float result = _mm_cvtss_f32(sum);

Keep horizontal reductions outside of inner loops where possible.

SIMD Vectorization

Compare scalar processing vs SIMD vectorized processing with different register widths.

[Interactive demo: scalar vs. SIMD execution of a 16-element array add (a[i] = b[i] + c[i]). Each colored box is one array element processed in a single cycle window; wider SIMD registers process more elements per iteration, cutting the 64-cycle scalar timeline by a factor of the lane count.]

Key Takeaways

  • SIMD provides near-linear throughput scaling with vector width for compute-bound, data-parallel workloads.
  • Auto-vectorization is the preferred approach; intrinsics are needed for complex patterns.
  • Memory bandwidth limits SIMD benefit for memory-bound code — wider vectors just wait faster.
  • AVX-512 has frequency reduction penalties on some Intel CPUs; measure full-application impact.
  • Align data to vector width boundaries and use restrict pointers to maximize auto-vectorization opportunities.

One Core, Multiple Threads

A single CPU core has many execution resources: ALUs, load/store units, FP/SIMD units, reorder buffer entries, reservation station slots. Most workloads do not saturate all of these resources simultaneously — a memory-bound workload leaves ALUs idle, a compute-bound workload leaves load units idle, and branch mispredictions create pipeline bubbles that waste everything.

Simultaneous Multithreading (SMT) addresses this by allowing a single physical core to execute instructions from two or more hardware threads concurrently. Each thread has its own architectural state (registers, program counter, stack pointer) but shares the core’s execution resources. The OS sees each hardware thread as a separate logical processor.

Intel markets their SMT implementation as Hyper-Threading Technology (HTT). AMD Zen cores implement 2-way SMT. IBM POWER9/10 supports up to 8-way SMT.

SMT Implementations Across Vendors

Vendor / Microarchitecture           | SMT Threads per Core | Notes
Intel Skylake through Lion Cove      | 2 (Hyper-Threading)  | Available on most SKUs; can be disabled in BIOS [Source: Intel, Optimization Reference Manual, 2023]
Intel E-cores (Gracemont, Crestmont) | 1 (no SMT)           | Efficiency cores are single-threaded [Source: Intel, Optimization Reference Manual, 2023]
AMD Zen 3 (2020)                     | 2 (SMT)              | Enabled by default; configurable in BIOS [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 4 (2022)                     | 2 (SMT)              | Enabled by default [Source: AMD, Software Optimization Guide for AMD Family 19h, 2022]
AMD Zen 5 (2024)                     | 2 (SMT)              | Enabled by default [Source: Chips and Cheese, “Zen 5 Microarchitecture”, 2024]
ARM Cortex-X4 (2023)                 | 1 (no SMT)           | ARM big cores typically do not implement SMT [Source: ARM, Cortex-X4 TRM, 2023]
ARM Cortex-A720 (2023)               | 1 (no SMT)           | DynamIQ mid-cores also lack SMT [Source: ARM, Cortex-A720 TRM, 2023]
Apple M1 Firestorm (2020)            | 1 (no SMT)           | Apple has never shipped SMT on any core [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

ARM's and Apple's design philosophies differ fundamentally from Intel's and AMD's on SMT. Rather than sharing a complex core between two threads, ARM and Apple prefer to instantiate more independent cores. Apple's M1 has 4 Firestorm (performance) + 4 Icestorm (efficiency) cores with no SMT, relying on the large core count and heterogeneous design for throughput. ARM's DynamIQ big.LITTLE approach similarly uses multiple single-threaded cores at different performance/power points [Source: ARM, DynamIQ Technology, 2023].

What Is Shared vs. Partitioned

Understanding which resources are shared, partitioned, or duplicated is critical for predicting SMT’s performance impact:

Duplicated Per Thread (Private)

  • Architectural registers (general-purpose, SIMD, FP, control)
  • Instruction pointer / program counter
  • Interrupt state

Statically Partitioned

  • Reorder buffer entries (split evenly when both threads are active)
  • Load buffer entries
  • Store buffer entries (on some designs)

Competitively Shared (Dynamic)

  • Execution ports and functional units
  • Reservation station / scheduler entries
  • L1d cache and L1i cache (shared capacity; both threads’ data competes for the same lines)
  • L2 cache
  • Branch predictor tables (shared capacity; may cause aliasing)
  • TLB entries

Fully Shared

  • L3 cache
  • Memory controllers
  • Interconnect bandwidth

When SMT Helps

SMT improves throughput when the two threads have complementary resource usage — when one thread’s idle resources can absorb the other thread’s work.

Memory-Latency-Bound Workloads

If thread A stalls waiting for a cache miss (50-200 cycles for L3/DRAM), thread B can use the execution units during that time. This is the primary value proposition of SMT and the reason it was invented.

Thread A: [execute][stall on cache miss.....................][execute]
Thread B:          [execute][execute][stall on miss........][execute]
Combined: [  A   ][   B   ][  A+B  ][       B           ][  A+B  ]

Server workloads (web serving, databases, key-value stores) with high cache miss rates typically see 15-30% throughput improvement from enabling SMT.

Heterogeneous Instruction Mix

If thread A is compute-heavy (saturating ALUs) and thread B is memory-heavy (saturating load units), they complement each other. The combined throughput exceeds what either could achieve alone, though neither reaches its single-threaded performance.

High Branch Misprediction Rates

Pipeline flushes from mispredictions leave the backend empty. SMT allows the other thread to fill some of those empty slots, partially recovering the lost throughput.

When SMT Hurts

SMT is not free. Sharing resources means each thread gets fewer resources than it would running alone on the core.

Compute-Bound Workloads

If a single thread already saturates the execution units (e.g., dense linear algebra, tight vectorized loops), adding a second thread just creates contention. Both threads run slower, and total throughput may not improve or may even decrease:

Single thread:   100% of execution resources -> 100 units of throughput
Two SMT threads: 50%/50% split -> 50 + 50 = 100 units (no gain)
                 Plus partitioning overhead -> perhaps 95 units (net loss)

LINPACK and other dense compute benchmarks typically show no benefit or slight regression from SMT.

Cache-Sensitive Workloads

Two threads sharing L1d and L2 effectively halve the available cache per thread. If a workload’s performance is highly sensitive to cache capacity (e.g., working set just barely fits in L1d), SMT can push the working set out of cache, causing dramatic slowdowns.

The branch predictor and TLB also suffer from capacity sharing. Two threads with different branch patterns pollute each other’s predictor entries, potentially increasing misprediction rates for both.

Latency-Sensitive Workloads

While SMT improves throughput, it typically degrades single-thread latency. The thread competes for resources it previously had exclusively. For latency-sensitive applications (trading systems, game engines), disabling SMT may improve worst-case latency even at the cost of throughput.

SMT and Security: Side Channels

SMT has significant security implications because threads sharing a core can observe each other’s microarchitectural side effects:

  • Cache timing attacks: Thread A can observe which cache lines thread B evicts.
  • Port contention attacks: Thread A can measure execution unit contention caused by thread B’s instructions, inferring instruction types.
  • TLB and branch predictor probing: Shared predictors leak information about control flow.

These side channels led to attacks like PortSmash and contributed to the practical exploitability of Spectre variants (see Section 23: Speculative Execution). Some security-critical deployments (cloud VMs, cryptographic workloads) disable SMT entirely to eliminate cross-thread leakage.

Measuring SMT Benefit

To determine whether SMT helps your workload:

  1. Run with SMT enabled and disabled (via BIOS, or on Linux with echo off > /sys/devices/system/cpu/smt/control, or by offlining sibling CPUs with echo 0 > /sys/devices/system/cpu/cpuN/online).
  2. Measure throughput (total work completed per unit time), not single-thread performance.
  3. Monitor resource utilization: Use perf stat or VTune to check execution port utilization, cache miss rates, and backend stalls (see Section 26: Measurement).

A rule of thumb: if single-thread IPC (instructions per cycle) is below 2.0 on a 4-6 wide core, SMT is likely beneficial. If IPC is above 3.0, SMT may provide little gain.

SMT Configuration in Practice

Platform  | Default SMT | Recommendation
Servers   | Enabled     | Keep enabled for most workloads
Desktop   | Enabled     | Keep enabled unless latency-sensitive
HPC       | Varies      | Disable for dense compute; enable for sparse
Cloud/VMs | Varies      | Security-sensitive tenants may disable
Real-time | Disabled    | Eliminate scheduling jitter

Key Takeaways

  • SMT shares one core’s execution resources across 2+ hardware threads, improving throughput for workloads that leave resources idle.
  • Memory-latency-bound and mixed workloads benefit most; compute-saturated workloads may not benefit or may regress.
  • SMT roughly halves per-thread cache and predictor capacity, which can degrade performance for cache-sensitive code.
  • Security-sensitive environments should evaluate SMT side-channel risks.
  • Always measure with and without SMT on your specific workload before deciding.

The Fundamental Gamble

Modern out-of-order processors do not stop and wait when they encounter uncertainty. When a branch outcome is unknown, the CPU predicts the result and continues executing down the predicted path. When a load address is not yet computed, the CPU speculates that it does not conflict with pending stores. When a cache miss occurs, the CPU continues executing independent instructions that do not depend on the missing data.

This is speculative execution — the CPU performs work that may turn out to be unnecessary or incorrect, betting that the speculation will usually be right and the work will pay off. When the bet wins (which is the vast majority of the time), the CPU runs far faster than a stall-on-every-uncertainty design. When the bet loses, the speculative work must be discarded and the correct path executed from scratch.

What Gets Speculated

Speculation is pervasive in modern microarchitectures. The major forms:

Branch Prediction Speculation

The most visible form. The branch predictor (covered in Branch Prediction) predicts the direction (taken/not-taken) and target of every branch instruction. The CPU fetches, decodes, and executes instructions along the predicted path without waiting for the branch condition to resolve.

All instructions executed on the speculative path are marked as speculative in the reorder buffer (ROB). They execute normally, read and write rename registers, access caches, and consume execution resources. But their results are not committed (made architecturally visible) until the branch is resolved and confirmed correct.

Memory Disambiguation Speculation

When a load executes, prior stores in the store buffer may not yet have computed their addresses. The memory disambiguator predicts that the load does not alias any pending store and lets it proceed (see Section 16: Store Buffer). If the prediction is wrong, a memory ordering violation is detected and the load (and all subsequent instructions) must be re-executed.

Value Prediction (Emerging)

Some recent designs experiment with predicting the actual value a load will return, allowing dependent instructions to execute speculatively using the predicted value. If correct, this breaks true data dependencies. If wrong, a pipeline flush occurs. Value prediction is not yet mainstream but exists in limited forms in Apple’s M-series cores.

Cache Prefetch Speculation

Hardware prefetchers (see Section 18) speculatively fetch data that might be needed. While not speculation in the pipeline-flush sense, incorrect prefetches waste bandwidth and cache capacity.

The Cost of Being Wrong: Misprediction Recovery

When a branch resolves to a different outcome than predicted, the CPU must:

  1. Identify the misprediction point in the ROB.
  2. Flush all instructions younger than the mispredicted branch from the pipeline — this includes instructions in the ROB, reservation stations, load/store buffers, and any in-flight cache requests initiated by speculative loads.
  3. Restore the register rename state to the checkpoint saved at the branch point.
  4. Redirect the frontend to fetch from the correct target address.
  5. Refill the pipeline — the frontend must fetch, decode, and dispatch instructions along the correct path before the backend has useful work again.

The total penalty is determined by the pipeline depth from fetch to execute. On modern cores:

Microarchitecture         | Misprediction Penalty
Intel Skylake (2015)      | ~15-17 cycles [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Golden Cove (2021)  | ~15-17 cycles [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
Intel Lion Cove (2024)    | ~15-17 cycles (estimated) [Source: Chips and Cheese, “Lion Cove Microarchitecture”, 2024]
AMD Zen 3 (2020)          | ~11-18 cycles [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
AMD Zen 4 (2022)          | ~11-13 cycles [Source: Agner Fog, “The Microarchitecture of Intel and AMD CPUs”, 2023]
AMD Zen 5 (2024)          | ~11-13 cycles (estimated, similar pipeline depth to Zen 4)
ARM Cortex-X4 (2023)      | ~11-13 cycles (estimated) [Source: ARM, Cortex-X4 TRM, 2023]
Apple M1 Firestorm (2020) | ~14 cycles [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]

AMD Zen cores historically have shorter misprediction penalties than Intel (11-13 cycles vs. 15-17), reflecting their somewhat shallower pipeline. However, AMD’s penalty range is wider depending on where in the pipeline the misprediction is detected. Apple’s M1 Firestorm core sits between the two at ~14 cycles, despite having a very wide core [Source: Dougall Johnson, “Apple M1 Firestorm Microarchitecture”, 2021]. ARM Cortex-X4’s shorter pipeline and narrower mispredict window also contribute to a lower penalty than Intel.

During these penalty cycles, the backend drains the incorrectly speculated work and has little useful work to do. The effective throughput during recovery is near zero.

Quantifying the Impact

The performance impact of branch mispredictions can be estimated:

Penalty = Misprediction_Rate x Branches_Per_1000_Instructions x Penalty_Cycles / 1000

For a workload with 15% of instructions being branches, a 5% misprediction rate, and a 15-cycle penalty:

Penalty = 0.05 x 150 x 15 / 1000 = 0.1125 cycles per instruction

On a core otherwise achieving 3 IPC (0.33 cycles per instruction), this adds roughly a third more cycles per instruction, about a 25% drop in effective IPC. At a 10% misprediction rate, the penalty doubles. At 20% (common in poorly predicted workloads), branch mispredictions become the dominant bottleneck.

Pipeline Flush Mechanics

The flush mechanism is more nuanced than “throw everything away.” Modern CPUs use checkpointing to speed up recovery:

Checkpoint-Based Recovery

At each predicted branch, the CPU saves a snapshot (checkpoint) of the rename table — the mapping from architectural registers to physical registers. When a misprediction is detected, the rename table is restored from the checkpoint in a single cycle, rather than rewinding instructions one by one.

The number of available checkpoints limits how many unresolved branches can be in flight simultaneously. Intel cores typically support 48-72 in-flight branches. If this limit is reached, the frontend stalls until a branch resolves.

Selective Flush

Modern out-of-order designs flush only the instructions younger than the mispredicted branch, preserving older instructions that are still in flight and known to be correct. This selective flush is the standard approach and is facilitated by the ROB’s in-order retirement property.

Spectre: When Speculation Becomes a Vulnerability

In January 2018, the Spectre family of vulnerabilities revealed that speculative execution has security implications. The core insight: even though speculative instructions are architecturally rolled back on misprediction, they leave microarchitectural side effects that persist:

  • Cache state changes: A speculative load brings data into the cache. Even after the load is flushed, the data remains cached. An attacker can detect which cache lines were loaded via timing measurements (Flush+Reload, Prime+Probe).
  • TLB state changes: Speculative page walks modify TLB state.
  • Branch predictor training: Speculative branches update predictor state.

Spectre Variant 1 (Bounds Check Bypass)

if (x < array_size) {           // branch predicted taken with malicious x
    y = array2[array1[x] * 256]; // speculative access leaks array1[x] via cache
}

The CPU speculatively executes the body with an out-of-bounds x before the bounds check resolves. The access to array1[x] brings secret data into a register. The dependent access to array2[...] loads a cache line determined by the secret value. After the misprediction is detected, the registers are rolled back — but the cache line remains. The attacker can probe array2 to determine which line was loaded, recovering the secret byte.

Mitigations and Their Costs

Spectre mitigations impose real performance overhead:

Mitigation               | Target          | Overhead
Retpoline                | Variant 2 (BTB) | 5-15%
IBRS/STIBP               | Variant 2       | 1-5%
LFENCE after branches    | Variant 1       | Variable
Array index masking      | Variant 1       | Less than 1%
Core scheduling (no SMT) | Cross-thread    | Loss of SMT

These mitigations fundamentally limit the aggressiveness of speculation — either by inserting barriers, restricting predictor training, or avoiding shared microarchitectural state. This tension between performance and security is an ongoing challenge in CPU design.

Speculation Depth and the Reorder Buffer

The ROB size determines the speculation depth — how far ahead the CPU can speculatively execute beyond an unresolved branch. Larger ROBs allow deeper speculation, which improves performance for long-latency operations but also amplifies the waste when speculation is wrong.

Core               | ROB Size | Max Speculation Depth
Intel Skylake      | 224      | ~224 uops
Intel Golden Cove  | 512      | ~512 uops
AMD Zen 4          | 320      | ~320 uops
Apple M1 Firestorm | ~630     | ~630 uops

Golden Cove’s 512-entry ROB can speculatively execute over 500 instructions past an unresolved branch. When correct, this provides massive ILP extraction. When wrong, all 500 instructions are flushed.

Key Takeaways

  • Speculation is essential for performance: without it, modern CPUs would be 2-5x slower on typical code.
  • Branch mispredictions cost 11-17 cycles of pipeline flush, proportional to pipeline depth.
  • Checkpoint-based recovery enables fast rename table restoration but limits in-flight branch count.
  • Spectre demonstrated that speculative side effects have security implications; mitigations impose measurable performance costs.
  • The tradeoff between speculation depth (ROB size) and misprediction cost is a fundamental tension in CPU design.

What Are Pipeline Hazards?

A pipelined processor overlaps the execution of multiple instructions — while one instruction is in the execute stage, the next is decoding, and the one after that is being fetched. This overlap is the source of pipelining’s throughput advantage, but it also creates situations where the next instruction cannot proceed because of a dependency or resource conflict with an instruction already in the pipeline. These situations are called pipeline hazards.

Hazards are classified into three categories: data hazards, structural hazards, and control hazards. Understanding all three is essential for reasoning about pipeline stalls and performance bottlenecks.

Data Hazards

Data hazards arise when instructions have dependencies on data produced or consumed by other instructions in the pipeline. There are three types, named by the order of the conflicting read (R) and write (W) operations:

RAW (Read After Write) — True Dependency

The most common and most important hazard. A later instruction reads a value that an earlier instruction writes:

add  rax, rbx       ; writes rax
imul rcx, rax, 5    ; reads rax -- must wait for add to produce rax

The imul cannot execute until add has produced its result. In a simple pipeline without forwarding, this would cause a multi-cycle stall. Modern CPUs resolve RAW hazards via bypass/forwarding networks — the result of the add is forwarded directly from the ALU output to the imul input, bypassing the register file writeback stage. This reduces the penalty to the instruction’s inherent latency (1 cycle for add, so the imul can execute in the very next cycle).

For high-latency instructions (division: 20-90 cycles, cache-missing loads: 200+ cycles), RAW hazards create long stall chains that even out-of-order execution may not fully hide. The out-of-order engine (see Reorder Buffer) mitigates this by executing independent instructions while waiting for the dependency to resolve.

WAR (Write After Read) — Anti-Dependency

A later instruction writes to a register that an earlier instruction reads:

mov  rbx, [rax]     ; reads rax
add  rax, 8         ; writes rax -- anti-dependency on the read above

In an in-order pipeline, this is trivially handled because the read completes before the write. In an out-of-order pipeline, WAR hazards are eliminated entirely by register renaming (see Register Renaming). The add writes to a different physical register than the one the mov read from, so there is no actual conflict.

WAW (Write After Write) — Output Dependency

Two instructions write to the same architectural register:

mov  rax, [rbx]     ; writes rax
mov  rax, [rcx]     ; writes rax -- output dependency

In an in-order pipeline, the second write must occur after the first to preserve program semantics. In an out-of-order pipeline, register renaming again eliminates this hazard — each mov writes to a different physical register, and the rename table is updated to point to the latest one.

Summary: Register Renaming Solves WAR and WAW

Hazard Type | Direction  | Solved By                   | Remaining Cost
RAW         | True dep   | Forwarding + OoO scheduling | Instruction latency
WAR         | Anti dep   | Register renaming           | None
WAW         | Output dep | Register renaming           | None

This is why register renaming is considered one of the most important innovations in out-of-order execution — it converts two of the three data hazard types from real stalls into non-issues.

Structural Hazards

Structural hazards occur when two instructions need the same hardware resource in the same cycle, and the resource cannot service both simultaneously.

Execution Port Contention

Modern backends have a limited number of execution ports, each connected to specific functional units (see Execution Ports). If two instructions both require the same port and are ready to execute in the same cycle, one must wait:

; Complex (3-component) LEA issues only on port 1 on Intel; FP add can use port 0 or 1
lea  rax, [rbx + rcx*4 + 16]   ; complex LEA -> port 1
vaddps xmm0, xmm1, xmm2        ; FP add -> port 0 or 1

Port contention is a structural hazard. The scheduler resolves it by delaying one instruction by one cycle. In throughput-critical loops, sustained port contention limits the loop’s throughput to the contended port’s capacity.

Identifying port contention requires examining which ports each instruction uses, typically via tools like llvm-mca, uiCA, or Intel’s IACA (see Section 26: Measurement).

Memory Port Contention

Most cores support 2 loads and 1 store per cycle through the L1d cache. If a loop iteration requires 3 loads, it cannot sustain one iteration per cycle even if all other resources are available. This is a structural hazard on the load ports.

Other Structural Hazards

  • Rename register exhaustion: When all physical registers are allocated, dispatch stalls.
  • ROB full: When the reorder buffer is full, no new instructions can enter the backend.
  • Store buffer full: No new stores can be dispatched (see Section 16: Store Buffer).
  • Scheduler full: Reservation station entries exhausted.
  • Division unit busy: Only 1 divider per core; back-to-back divisions serialize.

Control Hazards

Control hazards arise from branches and other instructions that alter the flow of control. The CPU must know the next instruction’s address to continue fetching, but branch outcomes and targets are not known until the branch executes.

Branch Misprediction

The most significant control hazard. When the branch predictor is wrong, all speculatively fetched and executed instructions on the wrong path must be flushed (see Section 23: Speculative Execution). The penalty is 11-17 cycles on modern cores.

Indirect Branches

Indirect branches (jmp [rax], call [vtable + offset]) have variable targets that are harder to predict. The BTB (Branch Target Buffer) and indirect branch predictor (see Branch Prediction) handle these, but misprediction rates are typically higher than for conditional branches, especially with many possible targets (virtual dispatch, switch statements with many cases, interpreted language dispatch).

Serializing Instructions

Some instructions are inherently serializing: CPUID, MFENCE (on some implementations), and writes to control registers force the pipeline to drain before continuing. These create full-pipeline control hazards.

Resolution Mechanisms Summary

Hazard Type           | Mechanism                           | Residual Cost
RAW (true data dep)   | Forwarding + OoO scheduling         | Producer latency
WAR (anti dep)        | Register renaming                   | None
WAW (output dep)      | Register renaming                   | None
Structural (ports)    | Scheduler delays conflicting insn   | 1 cycle per conflict
Structural (buffers)  | Pipeline stall until resource frees | Variable (depends on drain)
Control (branch)      | Branch prediction + speculation     | 11-17 cycles on mispredict
Control (serializing) | Pipeline drain                      | Full pipeline depth

Practical Implications

When optimizing code at the microarchitectural level:

  1. Minimize long RAW chains: Interleave independent work to let the OoO engine overlap latencies. Unrolling and software pipelining help here.

  2. Watch for port pressure: Use llvm-mca or uiCA to identify port bottlenecks. Rearranging instruction mix or using alternative instructions on different ports can improve throughput.

  3. Reduce branch mispredictions: Use branchless techniques (see Section 20: Predication) for data-dependent branches with poor predictability.

  4. Avoid resource exhaustion: Deep speculation or too many in-flight memory operations can exhaust ROB, load buffer, or store buffer entries. Shorter instruction windows (via loop splitting or tiling) can reduce pressure.

Pipeline Hazard Visualizer

Enter instructions to detect RAW, WAR, and WAW data hazards. See how the CPU resolves each type.

Instructions
  0: MUL R1, R2, R3  (writes R1; reads R2, R3)
  1: ADD R4, R1, R5  (writes R4; reads R1, R5)
  2: MOV R1, R6      (writes R1; reads R6)
  3: SUB R7, R1, R8  (writes R7; reads R1, R8)

Detected Hazards (5)
  • RAW on R1: instruction 0 (MUL) → instruction 1 (ADD). Read-After-Write (true dependency); resolved via data forwarding / bypass network, or a stall if unavoidable.
  • WAW on R1: instruction 0 (MUL) → instruction 2 (MOV). Write-After-Write (output dependency); resolved via register renaming, with each write targeting a different physical register.
  • WAR on R1: instruction 1 (ADD) → instruction 2 (MOV). Write-After-Read (anti-dependency); resolved via register renaming, with the destination getting a new physical register.
  • RAW on R1: instruction 0 (MUL) → instruction 3 (SUB). Read-After-Write (true dependency); resolved via forwarding or stall.
  • RAW on R1: instruction 2 (MOV) → instruction 3 (SUB). Read-After-Write (true dependency); resolved via forwarding or stall.

Key Takeaways

  • Data hazards (RAW, WAR, WAW) are the most common; register renaming eliminates WAR and WAW entirely, leaving only true (RAW) dependencies.
  • Structural hazards from execution port contention are a subtle but real throughput limiter in tight loops.
  • Control hazards from branch misprediction are the most expensive hazard type, costing 11-17 cycles per misprediction.
  • Out-of-order execution, register renaming, and branch prediction together eliminate or mitigate most hazards, but the residual costs (instruction latency, port contention, misprediction flushes) are what performance engineers optimize around.

Bridging Theory and Practice

The previous sections described individual microarchitectural mechanisms — caches, branch prediction, out-of-order execution, SIMD. Real optimization requires combining this knowledge to diagnose and fix performance problems in actual code. This section presents four case studies, each targeting a different microarchitectural bottleneck.


Case Study 1: Loop Tiling for Cache Locality

The Problem

Matrix multiplication is the canonical example of a cache-unfriendly algorithm. The naive triple-loop implementation:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

The inner loop iterates over k, accessing A[i][k] with stride 1 (good) but B[k][j] with stride N (bad — each access is a different row, likely a different cache line). For large N, every access to B misses in cache, generating enormous DRAM traffic.

The Fix: Tiling (Blocking)

Loop tiling restructures the computation to operate on small blocks that fit in the L1d or L2 cache:

#define TILE 64
for (int ii = 0; ii < N; ii += TILE)
    for (int jj = 0; jj < N; jj += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    for (int k = kk; k < kk + TILE; k++)
                        C[i][j] += A[i][k] * B[k][j];

Each TILE x TILE block of B is loaded into cache once and reused across TILE iterations of the outer loops. With TILE = 64 and double elements, each block is 64 x 64 x 8 = 32 KB, the size of a typical L1d by itself; a tile this large effectively targets L2, while a smaller tile (TILE = 32, 8 KB per block) leaves room in L1d for the active blocks of A, B, and C at once (see Section 15: Cache Hierarchy).

Impact

For N = 1024 with double-precision elements:

  • Naive: ~10 billion cache misses, ~0.5 GFLOPS
  • Tiled (64x64): ~100 million cache misses, ~5-8 GFLOPS
  • Tiled + SIMD + unrolled: ~15-20 GFLOPS

Loop tiling delivers 10-40x speedup by converting a memory-bound computation into a compute-bound one.

Loop Tiling (Cache Blocking)

Compare naive column-major traversal vs tiled access pattern. Tiling keeps data in L1 cache.

[Interactive demo: an 8×8 grid of memory blocks animates the access order (numbers show access order; shade indicates access time) and reports live cache miss rate and IPC for the naive column-major traversal vs. the 32×32 tiled pattern.]

Case Study 2: Branchless Programming

The Problem

Filtering an array based on data-dependent conditions:

int count = 0;
for (int i = 0; i < N; i++) {
    if (data[i] < threshold) {
        output[count++] = data[i];
    }
}

If the data distribution is random, the branch data[i] < threshold is unpredictable (~50% taken rate). On a core with a 15-cycle misprediction penalty, roughly half the iterations incur the full penalty, reducing effective throughput to a small fraction of peak.

The Fix: Branchless Filtering

Replace the branch with arithmetic that always writes and conditionally advances the output pointer:

int count = 0;
for (int i = 0; i < N; i++) {
    output[count] = data[i];
    count += (data[i] < threshold);  // 0 or 1, no branch
}

This version always writes to output[count]. If the condition is false, count is not incremented, so the next iteration overwrites the same slot. There is no branch, so there is no misprediction.

A more sophisticated approach avoids unnecessary stores by using SIMD compression instructions (AVX-512 vpcompressd) or lookup-table-based permutation patterns.

Impact

For uniformly distributed random data with 50% selectivity:

  • Branching: ~1.5 cycles/element (dominated by mispredictions)
  • Branchless: ~0.5 cycles/element

The branchless version is 3x faster. As selectivity approaches 0% or 100% (highly predictable), the branching version catches up because the predictor achieves near-perfect accuracy and the speculative path is correct (see Section 20: Predication).

Branchless Programming

Compare conditional branches vs CMOVcc (branchless) with sorted and random data.

Branchy (JCC): if (a[i] > threshold) sum += a[i]; compiles to a conditional jump. Mispredict rate 50%, IPC 0.40, 12,500 total cycles.

Branchless (CMOVcc): tmp = (a[i] > threshold) ? a[i] : 0; sum += tmp; compiles to a conditional move. Mispredict rate 0%, IPC 0.87, 5,750 total cycles.
Random data: 50% misprediction rate causes 500 pipeline flushes at 15 cycles each. CMOVcc avoids all mispredictions for a 2.2x speedup.


Case Study 3: Port Contention in Cryptographic Code

The Problem

AES encryption using a software (table-lookup) implementation:

for (int round = 0; round < 10; round++) {
    for (int i = 0; i < 4; i++) {   // byte indexing simplified for clarity
        t[i] = Te0[s[0]] ^ Te1[s[1]] ^ Te2[s[2]] ^ Te3[s[3]] ^ rkey[i];
    }
}

Each round performs 16 table lookups (loads), plus XORs and index computations. Profiling reveals the loop runs at roughly 8 cycles per round, far slower than its instruction count alone would suggest.

Diagnosis

Using perf stat with port-level counters (or llvm-mca), the bottleneck is identified as load port saturation: 16 loads per round on a core with 2 load ports sets a floor of 8 cycles per round. The observed throughput sits at this floor, confirming that the load ports, not the ALUs, are the constraint.

The Fix

Use AES-NI hardware instructions (AESENC, AESDEC) which perform the entire round in a single instruction executing on a dedicated port. The AES-NI version processes one round per cycle — a 4-8x improvement. This is a structural hazard (see Section 24: Hazards) resolved by moving computation to a different, unconstrained execution unit.

When hardware instructions are not available (e.g., non-AES algorithms), the mitigation is to interleave multiple independent blocks so that load latency is hidden by out-of-order execution across blocks. Processing 4 blocks simultaneously gives the scheduler enough independent loads to fill pipeline bubbles.


Case Study 4: AoS vs. SoA Data Layout

The Problem

Processing particle positions in a physics simulation:

// Array of Structures (AoS)
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int   type;
    // 32 bytes total
};
Particle particles[N];

// Update positions
for (int i = 0; i < N; i++) {
    particles[i].x += particles[i].vx * dt;
    particles[i].y += particles[i].vy * dt;
    particles[i].z += particles[i].vz * dt;
}

The position update only reads x, y, z, vx, vy, vz (24 bytes) but each Particle occupies 32 bytes. One cache line (64 bytes) holds 2 particles, meaning 75% of the loaded data is useful. More importantly, the auto-vectorizer struggles because the x, y, z fields are interleaved with other data, preventing clean SIMD loads.

The Fix: Structure of Arrays (SoA)

// Structure of Arrays (SoA)
struct Particles {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
    int   *type;
};

// Update positions -- trivially vectorizable (given Particles p)
for (int i = 0; i < N; i++) {
    p.x[i] += p.vx[i] * dt;
    p.y[i] += p.vy[i] * dt;
    p.z[i] += p.vz[i] * dt;
}

Now each array contains only one field. Cache lines are 100% utilized for the fields being accessed. The loops are trivially auto-vectorized (see Section 21: SIMD) — the compiler emits AVX2 code processing 8 particles per instruction.

Impact

Layout | Bandwidth Utilization | Vectorized? | Cycles/Particle
AoS    | ~75%                  | Partially   | ~4.0
SoA    | ~100%                 | Fully       | ~0.8

SoA is 5x faster for this kernel due to better cache utilization and full SIMD vectorization. The tradeoff is that SoA makes single-particle access less ergonomic and can be harder to maintain in large codebases. A common compromise is AoSoA (Array of Structures of Arrays), which groups small batches (e.g., 8 or 16 elements) into SIMD-width structures:

struct ParticleBlock {
    float x[8], y[8], z[8];
    float vx[8], vy[8], vz[8];
};

This provides SIMD-friendly access within blocks while maintaining some of AoS’s organizational benefits.

Array of Structures vs Structure of Arrays

Compare AoS and SoA memory layouts for cache utilization when accessing a subset of fields.

[Interactive demo: renders the bytes of each 64 B cache line for obj[0..7] (fields x, y, z, vx, vy, vz, mass, charge), highlighting used vs. wasted bytes for the selected layout.]

Comparison
  • AoS: 35% miss rate · 38% cache line utilization
  • SoA: 5% miss rate · 100% cache line utilization
AoS layout selected. All fields of each object are contiguous. When only accessing x, y, z (3 of 8 fields), 63% of loaded bytes are wasted on unused fields (vx, vy, vz, mass, charge).


Cross-Cutting Themes

These case studies share common patterns:

  1. Measure first: Use hardware counters to identify the actual bottleneck before optimizing (see Section 26: Measurement).
  2. Data layout drives performance: Two of the four studies are fundamentally about how data is arranged in memory, not how instructions are written.
  3. Know your microarchitecture: Port contention, branch misprediction, and cache behavior are architecture-specific. What is optimal on Intel may differ on AMD or ARM.
  4. Algorithmic changes beat micro-optimization: Loop tiling delivers 10-40x; instruction scheduling might deliver 10-20%. Always look for algorithmic wins first.

You Cannot Optimize What You Cannot Measure

Microarchitectural optimization without measurement is guesswork. A function that “looks slow” may be executing efficiently but called too often. A loop that “should be fast” may be bottlenecked on cache misses invisible in the source code. The CPU provides hardware Performance Monitoring Counters (PMCs) — special-purpose registers that count microarchitectural events — giving direct visibility into what the hardware is actually doing.

Performance Monitoring Unit (PMU)

Every modern CPU contains a Performance Monitoring Unit (PMU) — dedicated hardware that counts events such as:

  • Instructions retired
  • CPU cycles (core clock and reference clock)
  • Cache hits and misses at each level (L1d, L1i, L2, L3)
  • Branch predictions and mispredictions
  • TLB hits and misses
  • Store buffer drains and stalls
  • Execution port utilization
  • Micro-op dispatches per port

x86 CPUs typically have 4-8 programmable counters per core (configured via MSRs) plus 3 fixed counters (instructions retired, core cycles, reference cycles). ARM cores have a similar PMU with 6-8 programmable counters.

Vendor-Specific PMU Event Names for Key Counters

The same conceptual event has different names across Intel and AMD PMUs. The table below maps key counters:

Metric                | Intel Event Name                           | AMD Event Name (Zen 3/4)
Instructions retired  | INST_RETIRED.ANY (fixed counter 0)         | RETIRED_INSTRUCTIONS (PMCx0C0) [Source: AMD, PPR for AMD Family 19h, 2022]
Core cycles           | CPU_CLK_UNHALTED.THREAD (fixed counter 1)  | ACTUAL_CYCLES:NOT_HALTED (PMCx076)
Reference cycles      | CPU_CLK_UNHALTED.REF_TSC (fixed counter 2) | N/A (use mperf/aperf MSRs)
Branch mispredictions | BR_MISP_RETIRED.ALL_BRANCHES               | RETIRED_BRANCH_INSTRUCTIONS_MISPREDICTED (PMCx0C3)
L1d cache misses      | MEM_LOAD_RETIRED.L1_MISS                   | L1_DATA_CACHE_REFILLS_FROM_SYSTEM:ALL (PMCx043)
L2 cache misses       | MEM_LOAD_RETIRED.L2_MISS                   | CORE_TO_L2_CACHEABLE_REQUEST_ACCESS_STATUS:LS_RD_BLK_C
LLC (L3) misses       | LONGEST_LAT_CACHE.MISS                     | L3_CACHE_ACCESS:MISS (L3PMCx06)
TLB misses (DTLB)     | DTLB_LOAD_MISSES.WALK_COMPLETED            | L1_DTLB_MISS.ALL (PMCx045)
Micro-ops retired     | UOPS_RETIRED.RETIRE_SLOTS                  | RETIRED_UOPS (PMCx0C1)

[Source: Intel, Intel 64 and IA-32 Architectures SDM Vol 3, Chapter 19, 2023; AMD, Processor Programming Reference for AMD Family 19h, 2022]

In practice, perf on Linux abstracts most of these behind generic event names (cycles, instructions, cache-misses, branch-misses), but knowing the underlying hardware event names is essential when using raw events (perf stat -e r<event_code>) or when counter results seem inconsistent between platforms.

ARM PMU events follow the ARMv8 PMU specification with events like INST_RETIRED (0x08), BR_MIS_PRED_RETIRED (0x22), and L1D_CACHE_REFILL (0x03). ARM events are more standardized across implementations than x86, though vendors can add implementation-defined events [Source: ARM, ARM Architecture Reference Manual ARMv8-A, 2023].

Because the number of counters is limited, measuring many events simultaneously requires multiplexing — the tool periodically switches which events are counted and estimates totals via scaling. This introduces statistical error, so critical measurements should be done with a minimal set of events.

Tools of the Trade

perf (Linux)

The primary tool for hardware-level performance analysis on Linux. It provides both counting and sampling modes:

# Counting mode: aggregate event counts for the entire run
perf stat -e cycles,instructions,cache-misses,branch-misses ./my_program

# Sampling mode: record which code locations generate events
perf record -e cycles -g ./my_program
perf report

Counting mode (perf stat) is the starting point for any investigation. Key derived metrics:

  • IPC (Instructions Per Cycle): instructions / cycles. The theoretical maximum is the core’s dispatch width (4-6 on modern x86), so an IPC of 1.0 leaves 75-83% of issue slots unused.
  • Branch misprediction rate: branch-misses / branches. Above 2-3% for hot loops indicates a branch prediction problem.
  • Cache miss rate: cache-misses / cache-references. High L3 miss rates indicate memory-bound behavior.

# Example output
$ perf stat ./matrix_multiply
  12,345,678,901  cycles
   8,234,567,890  instructions              #    0.67  IPC
     234,567,890  LLC-load-misses           #   23.4%  of LLC loads
       1,234,567  branch-misses             #    0.5%  of branches

An IPC of 0.67 combined with 23.4% LLC miss rate clearly indicates a memory-bound workload. The fix is likely data layout or cache tiling (see Section 25: Case Studies).

Intel VTune Profiler

VTune provides a GUI-based analysis with pre-built analysis types:

  • Microarchitecture Exploration: Automatically runs TMA (see below) and identifies the top bottleneck category.
  • Memory Access Analysis: Tracks cache miss sources, NUMA locality, and bandwidth utilization.
  • Hotspot Analysis: Identifies the functions and source lines consuming the most CPU time.
  • Threading Analysis: Detects lock contention, load imbalance, and synchronization overhead.

VTune can attribute events to source lines and even assembly instructions, making it invaluable for understanding per-instruction behavior in critical loops.

AMD uProf

AMD’s equivalent of VTune, designed for Zen-family processors. It provides:

  • IBS (Instruction-Based Sampling): AMD’s precise event sampling technology.
  • Cache and memory hierarchy analysis tuned to Zen’s non-inclusive L3 and chiplet topology.
  • Power and thermal profiling.

LIKWID

A lightweight, command-line-focused toolkit particularly popular in HPC:

# Measure memory bandwidth and FLOPS for a specific code region
likwid-perfctr -C 0 -g MEM_DP ./my_program

# Pin threads to cores for reproducible measurements
likwid-pin -c 0-7 ./my_program

LIKWID provides pre-defined performance groups (collections of related counters) for common analyses: memory bandwidth, FLOP rates, cache utilization, branch behavior. It also supports marker API calls to measure specific code regions rather than the entire application.

llvm-mca and uiCA (Static Analysis)

For analyzing small loops at the instruction level without running the code:

# Analyze a loop body with llvm-mca (the parser defaults to AT&T syntax,
# so the .intel_syntax directive is needed for Intel-syntax input)
echo "
.intel_syntax noprefix
.loop:
vaddps ymm0, ymm1, ymm2
vmulps ymm3, ymm4, ymm5
vmovaps [rdi], ymm0
vmovaps [rsi], ymm3
add rdi, 32
add rsi, 32
cmp rdi, rdx
jl .loop
" | llvm-mca -mcpu=skylake -iterations=100

llvm-mca simulates the pipeline and reports throughput, bottleneck ports, and resource pressure. uiCA (uops.info Code Analyzer) provides more accurate modeling for Intel microarchitectures based on measured instruction data from uops.info.

These static tools are invaluable for answering “what is the theoretical throughput of this loop?” without the noise of OS scheduling, TLB misses, and memory hierarchy effects.

Top-Down Microarchitecture Analysis (TMA)

TMA is a systematic methodology developed by Intel (and adopted broadly) for categorizing pipeline slots into four top-level buckets:

  1. Retiring: Slots that performed useful work (instructions retired). Higher is better.
  2. Bad Speculation: Slots wasted on mispredicted branches or machine clears.
  3. Frontend Bound: Slots where the frontend could not deliver micro-ops to the backend (instruction cache misses, decode stalls, DSB coverage gaps — see Instruction Fetch).
  4. Backend Bound: Slots where the backend could not accept micro-ops due to resource exhaustion or long-latency operations.

Backend Bound further splits into:

  • Memory Bound: Stalls due to cache misses, TLB misses, store buffer pressure.
  • Core Bound: Stalls due to execution port contention, long-latency arithmetic, dependency chains.

Using TMA in Practice

# perf with TMA top-level metrics (Intel platforms)
perf stat -M TopdownL1 ./my_program

# Example output
   Retiring:        25.3%
   Bad Speculation:   8.1%
   Frontend Bound:   12.4%
   Backend Bound:    54.2%

This tells us immediately: the workload is backend-bound (54.2%), with only 25.3% of slots doing useful work. Drilling into the Backend Bound category:

perf stat -M TopdownL2 ./my_program
#   Memory Bound:    41.8%
#   Core Bound:      12.4%

The bottleneck is memory. Further drill-down (Level 3+) identifies whether it is L1 bound, L2 bound, L3 bound, or DRAM bound, guiding the optimization directly.

TMA is a tree — you start at the top and drill into the dominant category at each level until you reach an actionable leaf node.

[Interactive demo: Top-Down Microarchitecture Analysis (TMA)]

Explore CPU pipeline bottlenecks using Intel's TMA methodology; click a category to drill down. In the example breakdown (Frontend Bound 15%, Bad Speculation 15%, Backend Bound 35%, Retiring 35%), the primary bottleneck is Backend Bound at 35% of pipeline slots; drilling down shows whether it is memory-bound or core-bound. Retiring (35%) represents useful work, so higher is better.

Measurement Best Practices

Reduce Noise

  • Pin processes to cores: Use taskset or numactl on Linux, processor affinity on Windows.
  • Disable frequency scaling: Set the governor to performance mode (cpupower frequency-set -g performance).
  • Warm up: Run the workload once before measuring to fill caches and TLBs.
  • Repeat and take the minimum: The minimum of multiple runs best represents the true hardware capability, as noise only adds time.

Avoid Common Pitfalls

  • Do not rely on wall-clock time alone: It includes OS scheduling, I/O waits, and other threads’ interference.
  • Multiplexing distortion: If perf reports scaling factors above 2x, reduce the number of simultaneous events.
  • Short runs starve the sampler: perf virtualizes counters across context switches, but a benchmark that finishes in milliseconds yields too few samples for a meaningful profile. For short benchmarks, use perf stat in counting mode rather than sampling mode.
  • Compiler optimization levels matter: Always measure at the optimization level you will ship (-O2 or -O3). -O0 code has entirely different bottlenecks.

The Measurement Workflow

  1. Start with perf stat to get IPC, cache miss rates, and branch misprediction rates.
  2. Run TMA (TopdownL1, then L2) to identify the bottleneck category.
  3. Drill down with specific counters (e.g., l2_lines_in.all for L2 misses, br_misp_retired.all_branches for branch mispredictions).
  4. Use sampling (perf record + perf report or VTune hotspots) to locate the specific code region responsible.
  5. Analyze the loop with llvm-mca or uiCA for instruction-level understanding.
  6. Optimize and re-measure — verify the fix actually improved the metric you targeted.

Key Takeaways

  • Hardware performance counters provide direct visibility into microarchitectural behavior; always measure before and after optimizing.
  • Start with perf stat for aggregate metrics, then use TMA to systematically identify the bottleneck category.
  • Use static analysis tools (llvm-mca, uiCA) for instruction-level loop analysis without runtime noise.
  • Pin cores, disable frequency scaling, and take minimum-of-N-runs for reproducible measurements.
  • The measurement workflow is: identify bottleneck category (TMA) -> locate hot code (sampling) -> analyze loop (static analysis) -> optimize -> re-measure.