LLM Inference at Scale

A comprehensive tutorial on modern large language model inference: from token generation fundamentals to multi-GPU deployment strategies.

Core AI Interview Preparation | January 2026

MODULE 1

Inference Fundamentals

Before diving into optimizations, you need to understand what happens when an LLM generates text. Unlike training (which processes entire sequences in parallel), inference is fundamentally sequential for autoregressive models.

The Autoregressive Process

Decoder-only transformers (GPT, Llama, etc.) generate text one token at a time. Each new token depends on all previous tokens. The model:

  1. Takes the current sequence as input
  2. Computes attention over all tokens
  3. Predicts a probability distribution over the vocabulary
  4. Samples the next token
  5. Appends it to the sequence and repeats
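The five steps above can be sketched as a loop. The `toy_model` below is a made-up stand-in (a hypothetical bigram table over a 4-token vocabulary), not a real transformer; only the loop structure carries over:

```python
# Toy sketch of the autoregressive loop. The "model" is a fake bigram
# lookup conditioned on the last token only - the point is the shape of
# the loop, not the quality of the predictions.

def toy_model(tokens):
    """Return a fake probability distribution over a 4-token vocabulary."""
    bigram = {
        0: [0.1, 0.7, 0.1, 0.1],  # after token 0, token 1 is most likely
        1: [0.1, 0.1, 0.7, 0.1],
        2: [0.1, 0.1, 0.1, 0.7],
        3: [0.7, 0.1, 0.1, 0.1],
    }
    return bigram[tokens[-1]]

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        probs = toy_model(tokens)           # steps 1-3: forward pass over sequence
        next_tok = probs.index(max(probs))  # step 4: greedy "sampling"
        tokens.append(next_tok)             # step 5: append and repeat
    return tokens

print(generate([0], 3))  # [0, 1, 2, 3]
```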

The Quadratic Problem

Without optimization, generating N tokens requires N forward passes, and each pass computes attention over an increasingly long sequence. Naive attention is O(n²) in sequence length. For a 100K context, this becomes computationally prohibitive.

Two Phases of Inference

LLM inference splits into two distinct phases with very different computational characteristics:

Phase 1: Prefill (Prompt Processing)

  • What happens: Process all input tokens in parallel
  • Computational profile: Compute-bound (limited by FLOPS)
  • Memory access pattern: Sequential, predictable
  • Parallelism: High - all input tokens processed together
  • Output: KV cache populated, first token generated

Phase 2: Decode (Token Generation)

  • What happens: Generate tokens one at a time
  • Computational profile: Memory-bandwidth-bound
  • Memory access pattern: Random access to KV cache
  • Parallelism: Low - sequential token generation
  • Bottleneck: Reading model weights + KV cache from HBM

Why Decode is Memory-Bound

During decode, we generate one token but must read the entire model weights and KV cache. For a 70B parameter model in FP16, that's 140GB of data read to produce a single token. The arithmetic intensity (FLOPS per byte) is extremely low, making HBM bandwidth the bottleneck.

Key Performance Metrics

Metric      Definition               What It Measures
TTFT        Time to First Token      Prefill latency - how fast the prompt is processed
TPOT        Time Per Output Token    Decode latency - time between tokens
ITL         Inter-Token Latency      Variance in token generation time
Throughput  Tokens/second (system)   Total capacity across all requests
Latency     End-to-end time          TTFT + (output_tokens × TPOT)
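The latency identity in the last row is worth internalizing; a quick calculator (the numbers are invented for illustration):

```python
def e2e_latency(ttft_s, tpot_s, output_tokens):
    """End-to-end latency = TTFT + (output_tokens × TPOT)."""
    return ttft_s + output_tokens * tpot_s

# e.g. 200 ms prefill, 25 ms/token decode, 400 output tokens:
lat = e2e_latency(0.200, 0.025, 400)
print(lat)        # 10.2 seconds end to end
print(400 / lat)  # ~39 tokens/s as perceived by this one user
```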

The Memory Hierarchy

Understanding where data lives is crucial for optimization:

┌─────────────────────────────────────────────────────────┐
│  Registers      ~20 KB    ~10 TB/s    Fastest, smallest │
├─────────────────────────────────────────────────────────┤
│  Shared Memory  ~200 KB   ~19 TB/s    Per SM, programmer│
├─────────────────────────────────────────────────────────┤
│  L2 Cache       ~50 MB    ~5 TB/s     Shared across GPU │
├─────────────────────────────────────────────────────────┤
│  HBM (VRAM)     80-192GB  2-5 TB/s    Main GPU memory   │
├─────────────────────────────────────────────────────────┤
│  CPU RAM        512GB+    ~200 GB/s   System memory     │
├─────────────────────────────────────────────────────────┤
│  NVMe SSD       TB+       ~7 GB/s     Storage           │
└─────────────────────────────────────────────────────────┘

The 10-50x bandwidth gap between HBM and compute capability is why memory optimization dominates inference work.

MODULE 2

KV Cache: The Core Optimization

The KV cache is the single most important concept in LLM inference. It reduces the cost of each decode step from O(n²) to O(n) by caching intermediate computations.

What Gets Cached

In transformer attention, we compute Query (Q), Key (K), and Value (V) projections:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

For autoregressive generation, Q comes from the new token, but K and V from previous tokens don't change. Instead of recomputing them, we cache them.

How KV Caching Works

Step 1 (Prefill): Process "The quick brown"
  - Compute K₁, V₁ for "The"
  - Compute K₂, V₂ for "quick"
  - Compute K₃, V₃ for "brown"
  - Cache: [(K₁,V₁), (K₂,V₂), (K₃,V₃)]

Step 2 (Decode): Generate "fox"
  - Compute Q₄ for position 4
  - Retrieve cached K₁₋₃, V₁₋₃
  - Compute K₄, V₄ for "fox"
  - Attention: Q₄ attends to K₁₋₄
  - Cache: [(K₁,V₁), (K₂,V₂), (K₃,V₃), (K₄,V₄)]

Step 3 (Decode): Generate "jumps"
  - Only compute Q₅, K₅, V₅ for new position
  - Reuse all previous K, V from cache

Memory Calculation

KV Cache Size (bytes) = 2 × num_layers × num_kv_heads × head_dim × seq_len × precision_bytes

The factor of 2 accounts for both K and V. Let's calculate for Llama-3 70B:

Parameter        Llama-3 70B Value
Layers           80
KV Heads (GQA)   8
Head Dimension   128
Sequence Length  8,192 (up to 128K)
Precision        FP16 (2 bytes)

8K context:   2 × 80 × 8 × 128 × 8192 × 2   = 2.68 GB per sequence
64K context:  2 × 80 × 8 × 128 × 65536 × 2  = 21.5 GB per sequence
128K context: 2 × 80 × 8 × 128 × 131072 × 2 = 43 GB per sequence
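The formula is easy to encode; a sketch that reproduces the numbers above (note the GiB vs decimal-GB distinction):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, precision_bytes=2):
    """KV cache size per sequence; the factor of 2 covers both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * precision_bytes

# Llama-3 70B with GQA (80 layers, 8 KV heads, head_dim 128), FP16:
gib = 1024 ** 3
print(kv_cache_bytes(80, 8, 128, 8 * 1024) / gib)    # 2.5 GiB (= 2.68 decimal GB)
print(kv_cache_bytes(80, 8, 128, 128 * 1024) / gib)  # 40 GiB (= 43 decimal GB)
```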

The Real Memory Bottleneck

An 80GB GPU cannot even hold Llama-3 70B in FP16 (~140GB); quantized to INT4, the weights shrink to ~35GB, leaving about 45GB for KV cache. With 128K context (~43GB per sequence), that supports roughly one concurrent request. This is why KV cache optimization is critical.

Attention Variants That Reduce KV Cache

Modern architectures reduce KV cache size by sharing heads:

Multi-Head Attention (MHA) - Original

  • Separate K, V projections per head
  • num_kv_heads = num_attention_heads
  • Maximum expressiveness, maximum memory

Multi-Query Attention (MQA)

  • Single shared K, V across all heads
  • num_kv_heads = 1
  • 8-16x KV cache reduction
  • Used by: PaLM, Falcon

Grouped-Query Attention (GQA)

  • Groups of heads share K, V
  • num_kv_heads = num_attention_heads / group_size
  • Balance between MHA and MQA
  • Used by: Llama 2/3, Mistral, Gemma

Multi-Latent Attention (MLA)

  • Low-rank compression of KV cache
  • Compresses K, V into latent space
  • 16x+ reduction possible
  • Used by: DeepSeek-V2, DeepSeek-V3

Variant  KV Heads (70B)  Cache per 8K seq  Reduction
MHA      64              21.5 GB           1x (baseline)
GQA-8    8               2.68 GB           8x
MQA      1               0.34 GB           64x

MODULE 3

PagedAttention & Memory Virtualization

PagedAttention, introduced by vLLM in 2023, revolutionized LLM serving by applying OS virtual memory concepts to GPU memory management.

The Problem: Memory Fragmentation

Traditional serving systems pre-allocate contiguous memory for each request's maximum sequence length:

Request 1: Reserve 8192 tokens → 2.68 GB contiguous block
Request 2: Reserve 8192 tokens → 2.68 GB contiguous block
...

Reality:
Request 1 uses 500 tokens  → 7692 tokens WASTED
Request 2 uses 2000 tokens → 6192 tokens WASTED

Studies found existing systems waste 60-80% of KV cache memory.

The Solution: Virtual Memory for GPUs

PagedAttention divides KV cache into fixed-size blocks, allocated on-demand:

OS Concept            PagedAttention Analog
Page (4KB)            KV Block (16-32 tokens)
Virtual Address       Logical Block Index
Physical Address      GPU Memory Location
Page Table            Block Table
Page Fault            Block Allocation
Free Page Pool (LRU)  Free Block Pool

How It Works

Physical GPU Memory:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │FREE│FREE│
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

Request A (needs 3 blocks):
  Block Table A: [0, 3, 5]  → Physical blocks B0, B3, B5

Request B (needs 2 blocks):
  Block Table B: [1, 4]     → Physical blocks B1, B4

Request C (needs 2 blocks):
  Block Table C: [2, 6]     → Physical blocks B2, B6

Blocks are NOT contiguous! Managed like virtual memory.
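The block-table bookkeeping can be sketched in a few lines of plain Python (a toy allocator for illustration, not vLLM's actual implementation):

```python
# Minimal PagedAttention-style bookkeeping: logical block indices map to
# whatever physical blocks are free, so a sequence's blocks need not be
# contiguous in GPU memory.

class BlockAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block pool
        self.tables = {}                     # request id -> list of physical blocks
        self.tokens = {}                     # request id -> token count

    def append_token(self, req_id):
        n = self.tokens.get(req_id, 0)
        if n % self.block_size == 0:         # last block is full: grab a new one
            self.tables.setdefault(req_id, []).append(self.free.pop(0))
        self.tokens[req_id] = n + 1

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id, []))  # return blocks to the pool
        self.tokens.pop(req_id, None)

alloc = BlockAllocator(num_blocks=10, block_size=16)
for _ in range(40):                          # request A: 40 tokens -> 3 blocks
    alloc.append_token("A")
for _ in range(20):                          # request B: 20 tokens -> 2 blocks
    alloc.append_token("B")
print(alloc.tables["A"])  # [0, 1, 2]
print(alloc.tables["B"])  # [3, 4]
alloc.release("A")        # blocks 0-2 go back to the free pool
```

Only the last block of each request is partially filled, which is exactly the "minimal internal fragmentation" property listed below.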

Block Structure

Each block stores KV cache for a fixed number of tokens:

Block Size: 16 tokens (typical)
Block Memory: 16 × (2 × num_layers × num_kv_heads × head_dim × precision)
            = 16 × 2.68GB/8192 = ~5.2 MB for Llama-3 70B GQA

Benefits

  • Near-zero waste: vLLM achieves <4% memory waste vs 60-80%
  • No external fragmentation: All blocks same size
  • Minimal internal fragmentation: Only last block partially filled
  • Dynamic allocation: Blocks allocated as sequence grows

Prefix Caching (Copy-on-Write)

Multiple requests with same prefix can share physical blocks:

System Prompt: "You are a helpful assistant..." (Block 0, 1)

Request A: System prompt + "What is Python?"
  Block Table: [0, 1, 2, 3]  ← Shares blocks 0,1

Request B: System prompt + "Explain machine learning"
  Block Table: [0, 1, 4, 5]  ← Shares blocks 0,1

Physical storage: Only ONE copy of system prompt blocks!
Reference counting ensures blocks freed when all requests done.
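The reference counting can be sketched as follows (hypothetical `attach`/`detach` helpers, not vLLM's API):

```python
# Copy-on-write prefix sharing in miniature: each physical block carries
# a reference count, and shared prefix blocks are freed only when the
# last request using them finishes.

refcount = {}                       # physical block -> number of users

def attach(block_table):
    for b in block_table:
        refcount[b] = refcount.get(b, 0) + 1

def detach(block_table):
    freed = []
    for b in block_table:
        refcount[b] -= 1
        if refcount[b] == 0:
            freed.append(b)
            del refcount[b]
    return freed

prefix = [0, 1]                     # system prompt blocks, shared
req_a = prefix + [2, 3]
req_b = prefix + [4, 5]
attach(req_a)
attach(req_b)
print(refcount[0])                  # 2 -> two requests share block 0
print(detach(req_a))                # [2, 3] -> prefix blocks NOT freed yet
print(detach(req_b))                # [0, 1, 4, 5] -> now the prefix is freed too
```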

Automatic Prefix Caching (APC)

vLLM's APC automatically detects common prefixes across requests using content-based hashing. No configuration needed - system prompts, few-shot examples, and repeated context are cached automatically.

Block Size Trade-offs

Block Size          Pros                               Cons
Small (8 tokens)    Less internal fragmentation        More kernel launch overhead, larger block tables
Medium (16 tokens)  Balanced (standard choice)         -
Large (32+ tokens)  Better GPU utilization per kernel  More memory waste in last block

MODULE 4

Continuous Batching

Continuous batching (iteration-level scheduling) maximizes GPU utilization by dynamically managing requests at each decode step.

Static Batching Problems

Static Batch of 4 requests:
┌─────────────────────────────────────────────────────────┐
│ Req 1: ████████████████████████████████████ (1000 tokens)│
│ Req 2: ████████ (200 tokens) [PADDING...............]   │
│ Req 3: ████████████████ (400 tokens) [PADDING.......]   │
│ Req 4: ████ (100 tokens) [PADDING...................]   │
└─────────────────────────────────────────────────────────┘
          ↑ All must wait for longest request
          ↑ Padding wastes compute

Problems:

  • Short requests wait for long ones (head-of-line blocking)
  • Padding wastes compute on meaningless tokens
  • GPU utilization drops as requests finish
  • New requests must wait for entire batch to complete

Continuous Batching Solution

Time →
Step 1: [Req1, Req2, Req3, Req4] - all active
Step 2: [Req1, Req2, Req3, Req4] - all active
Step 3: [Req1, Req2, Req3, ----] - Req4 done, slot open
Step 4: [Req1, Req2, Req3, Req5] - New request joins!
Step 5: [Req1, ----, Req3, Req5] - Req2 done
Step 6: [Req1, Req6, Req3, Req5] - Another joins
...
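The timeline above can be simulated with a toy iteration-level scheduler (pure Python; remaining token counts stand in for real generation):

```python
# Each loop pass is one decode step: finished requests free their slot
# immediately, and waiting requests join mid-flight.

from collections import deque

def run(requests, max_batch):
    """requests: list of (name, tokens_to_generate).
    Returns the batch composition at each decode step."""
    waiting = deque(requests)
    active = {}                                # name -> tokens still to generate
    trace = []
    while waiting or active:
        while waiting and len(active) < max_batch:
            name, need = waiting.popleft()     # admit new requests into free slots
            active[name] = need
        trace.append(sorted(active))
        for name in list(active):              # one decode step for every active request
            active[name] -= 1
            if active[name] == 0:
                del active[name]               # done: slot opens this very step
    return trace

print(run([("A", 3), ("B", 1), ("C", 2), ("D", 2)], max_batch=2))
# [['A', 'B'], ['A', 'C'], ['A', 'C'], ['D'], ['D']]
```

With static batching the same workload would run A and B to completion before C and D could even start.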

Benefits:

  • Requests return immediately when done
  • New requests join without waiting
  • GPU stays fully utilized
  • No padding waste

Ragged Batching

To avoid padding, sequences are concatenated into a single "super-sequence":

Traditional (padded):
  Seq 1: [A, B, C, PAD, PAD]
  Seq 2: [D, E, F, G, H]
  Seq 3: [I, J, PAD, PAD, PAD]

Ragged (concatenated):
  Super-sequence: [A, B, C, D, E, F, G, H, I, J]
  Position IDs:   [0, 1, 2, 0, 1, 2, 3, 4, 0, 1]
  Sequence IDs:   [1, 1, 1, 2, 2, 2, 2, 2, 3, 3]

  Attention mask ensures each sequence only attends to itself.
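Building that metadata from per-sequence lengths is mechanical:

```python
# Construct ragged-batch position and sequence IDs from sequence lengths.

def ragged_metadata(lengths):
    position_ids, sequence_ids = [], []
    for seq_id, n in enumerate(lengths, start=1):
        position_ids.extend(range(n))        # positions restart per sequence
        sequence_ids.extend([seq_id] * n)    # which sequence owns each slot
    return position_ids, sequence_ids

pos, seq = ragged_metadata([3, 5, 2])
print(pos)  # [0, 1, 2, 0, 1, 2, 3, 4, 0, 1]
print(seq)  # [1, 1, 1, 2, 2, 2, 2, 2, 3, 3]
```

The attention mask then permits slot i to attend to slot j only when seq[i] == seq[j] and pos[j] <= pos[i] (causal within each sequence).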

Chunked Prefill

Long prompts can monopolize an iteration, delaying all other requests:

Head-of-Line Blocking

A 100K token prompt takes significant time to prefill. Without chunking, all decode requests wait, causing latency spikes.

Solution: Split long prefills into chunks, interleaved with decodes:

Without chunking:
  Step 1: Prefill 100K tokens (Request A) - BLOCKS EVERYONE
  Step 2: Decode all

With chunked prefill (chunk_size=8192):
  Step 1: Prefill 8K (Req A chunk 1) + Decode (Req B, C, D)
  Step 2: Prefill 8K (Req A chunk 2) + Decode (Req B, C, D)
  ...continues interleaved...
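The chunk boundaries are simple to generate (a sketch; real schedulers also budget how many decode tokens share each step):

```python
# Split a long prefill into token ranges that can be interleaved with
# decode iterations.

def prefill_chunks(prompt_len, chunk_size=8192):
    start = 0
    while start < prompt_len:
        end = min(start + chunk_size, prompt_len)
        yield (start, end)                  # token range processed this iteration
        start = end

chunks = list(prefill_chunks(100_000, 8192))
print(len(chunks))   # 13 chunks
print(chunks[0])     # (0, 8192)
print(chunks[-1])    # (98304, 100000) - the final, partial chunk
```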

Scheduling Policies

Policy      Description                               Use Case
FCFS        First-come, first-served                  Fair, simple
SJF         Shortest job first (by estimated output)  Minimize average latency
Priority    Based on SLO or customer tier             Multi-tenant serving
Preemptive  Pause long requests for short ones        Strict latency SLOs

MODULE 5

FlashAttention & Kernel Optimization

FlashAttention revolutionized attention computation by being IO-aware - minimizing HBM reads/writes through algorithmic restructuring.

Standard Attention Problem

Naive attention materializes large intermediate matrices in HBM:

1. S = Q @ K^T           # Shape: (seq_len, seq_len) - WRITE to HBM
2. P = softmax(S)        # READ S, WRITE P to HBM
3. O = P @ V             # READ P, WRITE O to HBM

For seq_len=8192: S and P are 8192×8192 matrices = 268MB each in FP32!
Memory traffic: Multiple reads/writes of these large matrices.

FlashAttention Solution: Tiling

FlashAttention computes attention in tiles that fit in SRAM (shared memory):

Algorithm (simplified):
1. Divide Q, K, V into blocks that fit in SRAM
2. For each Q block:
   a. Load Q block to SRAM
   b. For each K, V block:
      - Load K, V blocks to SRAM
      - Compute partial attention in SRAM
      - Accumulate results (online softmax)
   c. Write final output block to HBM

Key insight: Never materialize full S or P matrices!
Only final output O written to HBM.

IO Complexity Reduction

Standard attention: O(N² + Nd) HBM accesses
FlashAttention: O(N²d² / M) where M is SRAM size
For typical configs, this is 5-20x fewer memory accesses.

Online Softmax

The trick that makes tiling work - compute softmax incrementally:

Standard softmax requires all values to compute denominator.
Online softmax maintains running max and sum:

For each new block of scores:
  1. Update running max: m_new = max(m_old, block_max)
  2. Rescale previous sum: sum_new = sum_old × exp(m_old - m_new)
  3. Add new contributions: sum_new += sum(exp(block - m_new))
  4. Update running output with correction factor
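The recurrence can be checked numerically against a direct softmax (a pure-Python sketch of the math, not a GPU kernel):

```python
# Online softmax: scores arrive in blocks, yet the result matches a
# softmax computed over all scores at once.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def online_softmax(blocks):
    m, s, exps = float("-inf"), 0.0, []
    for block in blocks:
        m_new = max(m, max(block))            # 1. update running max
        scale = math.exp(m - m_new)           # exp(-inf) == 0.0 on the first block
        s *= scale                            # 2. rescale previous sum
        exps = [e * scale for e in exps]      #    (and previously seen numerators)
        new = [math.exp(x - m_new) for x in block]
        s += sum(new)                         # 3. add new contributions
        exps += new
        m = m_new
    return [e / s for e in exps]              # 4. final normalization

scores = [1.0, 3.0, 0.5, 2.0, 4.0, 1.5]
a = softmax(scores)
b = online_softmax([scores[:2], scores[2:4], scores[4:]])
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-12)  # True
```

In FlashAttention the same rescaling is applied to the accumulated output rows, not just the sum, which is step 4 in the list above.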

FlashAttention-2 Improvements

  • Better parallelism: Parallelize over sequence length, not just batch
  • Reduced non-matmul FLOPs: Fewer warp synchronizations
  • Better work partitioning: Improved occupancy
  • Result: 2x speedup over FlashAttention-1

FlashAttention-3 (Hopper)

  • Exploits H100 Tensor Memory Accelerator (TMA)
  • Asynchronous data movement overlapped with compute
  • FP8 support for further speedup
  • Result: 1.5-2x speedup over FA2 on H100

Skip Softmax (January 2026)

Many attention scores are near-zero and don't contribute meaningfully. Skip Softmax detects and skips these blocks:

Standard FlashAttention:
  For each KV block: compute attention (even if scores ≈ 0)

Skip Softmax:
  For each KV block:
    1. Quick check: are all scores below threshold?
    2. If yes: SKIP this block entirely
    3. If no: compute normally

Results: Up to 1.4x faster TTFT and TPOT
No retraining needed - drop-in replacement.

FlashInfer (MLSys 2025 Best Paper)

NVIDIA's open-source library for high-performance attention:

  • JIT-compiled kernels adapt to runtime settings
  • Block-sparse KV cache formats
  • Composable attention for complex scenarios
  • Direct integration with vLLM, SGLang

MODULE 6

Quantization for Inference

Quantization reduces precision of weights and activations, decreasing memory footprint and increasing arithmetic throughput.

Precision Formats Overview

Format      Bits  Exponent/Mantissa  Range
FP32        32    8/23               ±3.4×10³⁸
FP16        16    5/10               ±65504
BF16        16    8/7                ±3.4×10³⁸ (FP32 range!)
FP8 E4M3    8     4/3                ±448 (weights)
FP8 E5M2    8     5/2                ±57344 (gradients)
INT8        8     -                  -128 to 127
INT4        4     -                  -8 to 7
NVFP4 E2M1  4     2/1                ±6

Why BF16 Over FP16

BF16 (Brain Float) uses same exponent bits as FP32, maintaining dynamic range while reducing precision:

FP32:  1 sign + 8 exponent + 23 mantissa = 32 bits
BF16:  1 sign + 8 exponent + 7 mantissa  = 16 bits
FP16:  1 sign + 5 exponent + 10 mantissa = 16 bits

BF16 advantage: Same range as FP32, simpler conversion
FP16 advantage: More precision within smaller range
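That "simpler conversion" is concrete: BF16 is essentially the top 16 bits of FP32. A sketch of round-to-nearest-even truncation (NaN handling omitted):

```python
# Convert FP32 <-> BF16 by manipulating the raw bit pattern. Because
# BF16 shares FP32's 8 exponent bits, conversion is just dropping (or
# re-appending) the low 16 mantissa bits.

import struct

def fp32_to_bf16_bits(x):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest even, then truncate
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_fp32(b):
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

print(bf16_bits_to_fp32(fp32_to_bf16_bits(1.0)))      # 1.0 (exactly representable)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625 (only 7 mantissa bits)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(1e38)))     # stays finite - FP32's range survives
```

The same value in FP16 would be more precise near 1.0 but would overflow to infinity at 1e38.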

Weight-Only Quantization

Quantize only weights, compute in higher precision:

  • INT4 AWQ (Activation-aware Weight Quantization): Preserves salient weights that handle important activations
  • GPTQ: Layer-by-layer quantization with error compensation
  • Benefit: 4x memory reduction for weights, minimal accuracy loss

W4A16: 4-bit weights, 16-bit activations
  - Weights stored in INT4 (4x compression)
  - Dequantized to FP16 before compute
  - Memory-bound scenarios benefit most
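The storage scheme can be sketched with per-group scales (a toy version of the idea; real AWQ/GPTQ kernels pack two 4-bit values per byte and choose scales far more carefully):

```python
# Symmetric INT4 quantization with one FP scale per group of weights.
# Quantize: q = round(w / scale), clamped to the INT4 range [-8, 7].
# Dequantize: w ≈ q * scale.

def quantize_int4(weights, group_size=4):
    packed = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # map the group into ±7
        q = [max(-8, min(7, round(w / scale))) for w in group]
        packed.append((scale, q))
    return packed

def dequantize_int4(packed):
    out = []
    for scale, qs in packed:
        out.extend(q * scale for q in qs)
    return out

w = [0.12, -0.35, 0.07, 0.28, 1.4, -0.8, 0.05, 0.6]
restored = dequantize_int4(quantize_int4(w, group_size=4))
print([round(x, 3) for x in restored])
# [0.1, -0.35, 0.05, 0.3, 1.4, -0.8, 0.0, 0.6] - close, but small values lose precision
```

Smaller groups track local weight magnitudes better (less error) at the cost of more scale-factor overhead, which is the same trade-off NVFP4's 16-element micro-blocks make.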

Full Quantization (Weights + Activations)

Quantize both for maximum throughput:

  • INT8 SmoothQuant: Mathematically smooths outliers between weights and activations
  • FP8: Native hardware support on H100/MI300X/Blackwell
  • W4A8: 4-bit weights, 8-bit activations - good balance

NVFP4: NVIDIA's 4-bit Float (Blackwell)

NVFP4 uses micro-block scaling for accuracy:

  • Values grouped into blocks of 16
  • Each block shares an FP8 (E4M3) scale factor
  • Additional per-tensor FP32 scale

Results: 3.5x memory reduction vs FP16, <1% accuracy loss on most tasks.

KV Cache Quantization

Quantizing KV cache separately provides additional savings:

KV Precision  Memory vs FP16  Accuracy Impact
FP16          1x (baseline)   None
FP8           0.5x            Minimal
NVFP4         0.25x           Small (<1%)

Quantization Best Practices

  1. Calibration data matters: Use representative samples from production workload
  2. Last layers are sensitive: Keep final 1-2 layers in higher precision
  3. Monitor per-task accuracy: Some tasks degrade more than others
  4. Layer-wise mixed precision: Quantize insensitive layers more aggressively

MODULE 7

NVIDIA Inference Stack

NVIDIA provides a comprehensive stack optimized for LLM inference: TensorRT-LLM, FlashInfer, and Triton Inference Server.

TensorRT-LLM Architecture

┌─────────────────────────────────────────────────────────┐
│                    User Application                      │
├─────────────────────────────────────────────────────────┤
│                  TensorRT-LLM Runtime                    │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐│
│  │ In-flight   │ │   Paged     │ │    Speculative      ││
│  │ Batching    │ │ KV Caching  │ │    Decoding         ││
│  └─────────────┘ └─────────────┘ └─────────────────────┘│
├─────────────────────────────────────────────────────────┤
│                   TensorRT Compiler                      │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐│
│  │   Graph     │ │   Kernel    │ │     Quantization    ││
│  │ Optimization│ │   Fusion    │ │     (FP8/FP4/INT)   ││
│  └─────────────┘ └─────────────┘ └─────────────────────┘│
├─────────────────────────────────────────────────────────┤
│                   Custom CUDA Kernels                    │
│  FlashAttention │ PagedAttention │ FP8 GEMM │ Fused Ops │
├─────────────────────────────────────────────────────────┤
│                      CUDA / cuBLAS                       │
└─────────────────────────────────────────────────────────┘

Key Optimizations

1. Kernel Fusion

TensorRT identifies and fuses adjacent operations:

Before fusion (3 kernel launches, 3 HBM round-trips):
  LayerNorm → HBM → FP8 Quantize → HBM → Linear

After fusion (1 kernel launch):
  LayerNorm + FP8 Quantize + Linear (fused)

Benefit: ~3x kernel time reduction for fused all-reduce + layernorm

2. In-flight Batching

Continuous batching implementation with configurable policies:

  • Max tokens in batch
  • Max sequences in batch
  • Scheduling policy (capacity, guaranteed_no_evict)

3. Speculative Decoding Support

  • Draft model: Smaller model proposes tokens
  • EAGLE-3: Learned draft head on target model
  • Multi-token prediction: Model predicts multiple future tokens
  • Medusa: Multiple parallel heads for speculation

Hardware: Blackwell Architecture

Feature             Specification
Tensor Cores        5th generation, native FP4/FP8
Precision Support   FP64, FP32, TF32, FP16, BF16, FP8, FP6, FP4
Memory              HBM3e, higher bandwidth than H100
Interconnect        NVLink 5.0 (1.8 TB/s per GPU)
Transformer Engine  Automatic mixed precision management

NIM (NVIDIA Inference Microservices)

Pre-optimized containers for common models:

  • Optimized TensorRT-LLM engines pre-built
  • REST/gRPC APIs compatible with OpenAI format
  • Automatic batching and scaling
  • Multi-GPU support out of the box

Triton Inference Server

Production serving layer:

  • Model ensemble and pipeline support
  • Dynamic batching
  • Multi-framework support (TensorRT, PyTorch, ONNX)
  • Metrics and monitoring (Prometheus)
  • Model versioning and A/B testing

MODULE 8

AMD ROCm & MI300X

AMD's ROCm platform has matured significantly, achieving first-class support in vLLM as of January 2026.

MI300X Architecture

Hardware Specifications

  • Compute Dies: 8 XCDs (Accelerator Complex Dies)
  • Compute Units: 304 total (38 per XCD)
  • Memory: 192 GB HBM3 unified
  • Memory Bandwidth: 5.3 TB/s (higher than H100's 3.35 TB/s)
  • FP8 Support: Native via CDNA 3

Why MI300X Excels at Decode

Decode phase is memory-bandwidth-bound. MI300X's higher HBM bandwidth provides advantage:

GPU       HBM Bandwidth  Decode Advantage
H100 SXM  3.35 TB/s      Baseline
MI300X    5.3 TB/s       +58% theoretical
H200      4.8 TB/s       +43%

Real-world results show MI300X matching or exceeding H100 for batch sizes 1-64 in decode-heavy workloads.

ROCm Software Stack

┌─────────────────────────────────────────────────────────┐
│                    vLLM / SGLang                         │
├─────────────────────────────────────────────────────────┤
│                   AITER Kernels                          │
│  (Attention, FP8 GEMM, Fused Ops)                       │
├─────────────────────────────────────────────────────────┤
│                    Triton / hipBLAS                      │
├─────────────────────────────────────────────────────────┤
│                       ROCm / HIP                         │
└─────────────────────────────────────────────────────────┘

vLLM on ROCm (January 2026 Status)

  • First-class platform: Pre-built Docker images on Docker Hub
  • Test coverage: 93% of test groups passing (up from 37% in Nov 2025)
  • No source builds needed: docker pull rocm/vllm-omni

Key Optimizations

Optimization                  Impact
AITER FP8 kernels             Native FP8 compute
Fused LayerNorm/SiLU FP8      Reduced memory traffic
Optimized PagedAttention      Assembly-level tuning
fastsafetensors               Faster model loading
Multimodal batch parallelism  45% throughput gain on vision

GPU Partitioning

MI300X supports hardware-enforced partitioning for multi-tenant deployments:

# Run multiple vLLM instances on partitioned MI300X
# Each instance gets isolated GPU resources

Instance 1: 1/2 of MI300X (96GB, 152 CUs)
Instance 2: 1/2 of MI300X (96GB, 152 CUs)

# Or 4-way partition for smaller models
Instance 1-4: 1/4 of MI300X each (48GB, 76 CUs)

Known Limitations

MoE Regression

AITER has known issues with MoE models (Mixtral, DeepSeek-R1). For MoE workloads, use older image: rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103

RCCL vs NCCL

RCCL is AMD's collective communication library (NCCL equivalent):

  • Meta's DDA solutions show 10-50% speedup over baseline RCCL
  • Active development to close gap with NCCL
  • InfiniBand and RoCE support

MODULE 9

Parallelism Strategies

Scaling inference across multiple GPUs requires understanding when to use each parallelism strategy.

Tensor Parallelism (TP)

Splits individual weight matrices across GPUs. Each GPU holds a slice of every layer.

Single GPU:
  Linear layer: W (4096 × 4096) = 64MB in FP32

TP=4 (4 GPUs):
  GPU 0: W[:, 0:1024]    (4096 × 1024) = 16MB
  GPU 1: W[:, 1024:2048] (4096 × 1024) = 16MB
  GPU 2: W[:, 2048:3072] (4096 × 1024) = 16MB
  GPU 3: W[:, 3072:4096] (4096 × 1024) = 16MB

  Forward pass: Each GPU computes partial result
  All-reduce: Synchronize results across GPUs

Communication: All-reduce after every transformer block

Best for: Single node, high-bandwidth NVLink interconnect

Scaling limit: TP=8 typical max (communication overhead)
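The partial-result-plus-all-reduce pattern can be simulated in plain Python, with the element-wise sum of partials playing the role of the all-reduce. This sketch shows TP=2 with the weight split over the input dimension, the case that actually requires the all-reduce:

```python
# Tensor parallelism in miniature: each "GPU" holds a slice of W and a
# slice of x, computes a partial product, and the all-reduce is the
# element-wise sum of the partials.

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Full layer: y = W @ x with W of shape (2, 4)
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 2, 2]

# Split over the input dimension: "GPU 0" gets columns 0-1 of W and
# x[0:2]; "GPU 1" gets columns 2-3 and x[2:4].
W0 = [row[:2] for row in W]; x0 = x[:2]
W1 = [row[2:] for row in W]; x1 = x[2:]

partial0 = matvec(W0, x0)                        # GPU 0's partial result
partial1 = matvec(W1, x1)                        # GPU 1's partial result
y = [a + b for a, b in zip(partial0, partial1)]  # the "all-reduce" (sum)

print(y)                  # [17, 41]
print(matvec(W, x) == y)  # True - matches the single-GPU result
```

An output-dimension split needs no reduction (each GPU owns a slice of y), which is why Megatron-style MLPs pair one split of each kind and pay for a single all-reduce per block.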

Pipeline Parallelism (PP)

Splits model vertically by layers. Each GPU holds consecutive layers.

PP=4 for 32-layer model:
  GPU 0: Layers 0-7   (embedding + first 8 transformer blocks)
  GPU 1: Layers 8-15
  GPU 2: Layers 16-23
  GPU 3: Layers 24-31 (+ output projection)

  Forward pass: Activations flow GPU0 → GPU1 → GPU2 → GPU3
  (Training would also flow gradients GPU3 → GPU0; inference needs only the forward path.)

Communication: Point-to-point between adjacent stages

Challenge: Pipeline bubbles reduce utilization

Best for: Multi-node, limited inter-node bandwidth

Expert Parallelism (EP) for MoE

Distributes experts across GPUs instead of replicating all experts everywhere.

MoE layer with 8 experts, EP=4:
  GPU 0: Expert 0, Expert 1
  GPU 1: Expert 2, Expert 3
  GPU 2: Expert 4, Expert 5
  GPU 3: Expert 6, Expert 7

  Forward pass:
  1. Router determines which experts each token needs
  2. All-to-all: Send tokens to GPUs with their experts
  3. Each GPU processes tokens for its experts
  4. All-to-all: Return results to original GPUs

Communication: All-to-all for token routing

Benefit: Memory efficient - each GPU stores subset of experts

Hybrid Parallelism

Production deployments combine strategies:

DeepSeek-R1 on 8 GPUs:
  TP=2: Split attention/FFN within each expert
  EP=4: Distribute 256 experts across 4 groups

  TensorRT-LLM config:
  convert_checkpoint.py --moe_tp_size 2 --moe_ep_size 4

Configuration  Use Case
TP only        Dense models, single node
PP only        Very deep models, multi-node
EP only        MoE models, memory-constrained
TP + PP        Large dense models, multi-node
TP + EP        Large MoE models
TP + EP + PP   Extremely large MoE (100B+)

Context Parallelism (CP)

For very long contexts, splits sequence across GPUs:

128K context with CP=4:
  GPU 0: Tokens 0-32K
  GPU 1: Tokens 32K-64K
  GPU 2: Tokens 64K-96K
  GPU 3: Tokens 96K-128K

  Attention uses ring-attention pattern for cross-chunk attention.

MODULE 10

Speculative Decoding

Speculative decoding accelerates generation by drafting multiple tokens in parallel, then verifying them efficiently.

Core Insight

Decode is memory-bandwidth-bound: verifying K drafted tokens costs about the same as generating one token, because the same weights must be read either way.

If the per-token acceptance rate is α, a draft of length K yields about (1 − α^(K+1)) / (1 − α) expected tokens per verification pass; the net speedup is that figure discounted by the draft model's own runtime.

Algorithm

1. DRAFT: Small model generates K candidate tokens
   draft_tokens = [t₁, t₂, t₃, t₄, t₅]  (K=5)

2. VERIFY: Target model evaluates all K tokens in ONE forward pass
   target_logits = target_model([context + draft_tokens])

3. ACCEPT/REJECT: Compare distributions
   For i in 0..K:
     if sample(target_logits[i]) matches draft_tokens[i]:
       ACCEPT token i
     else:
       REJECT, resample from target, stop

4. BONUS: Always get at least 1 token (resampled if all rejected)

Example:
  Draft: [the, quick, brown, fox, jumped]
  Target accepts: [the, quick, brown] ✓
  Target rejects: [fox] ✗ (target wanted "red")
  Output: [the, quick, brown, red] = 4 tokens from 1 verify pass!
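The greedy-decoding version of this verification can be sketched with stand-in models (token lists replace real logits here; real systems use rejection sampling against the full distributions):

```python
# Speculative decoding, greedy variant: accept the longest prefix where
# draft and target agree, then take the target's token at the first
# disagreement - or a bonus token if everything was accepted.

def verify(draft_tokens, target_preds):
    """target_preds[i] is the target's greedy choice at each drafted
    position, plus one extra prediction for the bonus token."""
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_preds):
        if drafted == wanted:
            accepted.append(drafted)       # ACCEPT
        else:
            accepted.append(wanted)        # REJECT: take target's token, stop
            return accepted
    accepted.append(target_preds[len(draft_tokens)])  # all accepted: bonus token
    return accepted

draft  = ["the", "quick", "brown", "fox", "jumped"]
target = ["the", "quick", "brown", "red", "fox", "ran"]  # from ONE verify pass
print(verify(draft, target))  # ['the', 'quick', 'brown', 'red'] - 4 tokens
```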

Draft Model Options

Approach          Description                    Pros/Cons
Smaller model     Llama-7B drafts for Llama-70B  Simple, requires separate model
EAGLE             Learned draft head on target   High acceptance, trained on target
Medusa            Multiple parallel heads        No separate model needed
Self-speculation  Early exit from target layers  Single model, lower acceptance
Lookahead         N-gram based prediction        Training-free, works on any model

EAGLE-3 (TensorRT-LLM)

State-of-the-art learned speculation:

  • Lightweight draft head trained on target model outputs
  • Tree-structured speculation (multiple branches)
  • Acceptance rates 70-90% for typical queries
  • Integrated in TensorRT-LLM runtime

When Speculative Decoding Helps

γ-Tolerance (ICLR 2026)

Research shows memory bandwidth remains the bottleneck even at large batch sizes. The "γ-tolerance" criterion characterizes when speculation helps:

  • Low batch size: Almost always helps
  • High batch size: Still helps if draft is fast enough
  • Key metric: draft_latency / verify_latency ratio

Multi-Token Prediction

Some models trained to predict multiple future tokens natively:

Standard LM head: P(t_{n+1} | t_1..t_n)

Multi-token head:
  P(t_{n+1} | t_1..t_n)
  P(t_{n+2} | t_1..t_n)  # Direct prediction, not autoregressive!
  P(t_{n+3} | t_1..t_n)
  ...

Benefit: Native speculation without separate draft model.

MODULE 11

Disaggregated Prefill/Decode

Disaggregated serving separates prefill and decode onto different hardware pools for independent optimization and scaling.

Why Disaggregate?

Prefill and decode have fundamentally different resource needs:

Characteristic       Prefill                 Decode
Bound by             Compute (FLOPS)         Memory bandwidth
Batch efficiency     High (parallel tokens)  Low (sequential)
Latency sensitivity  TTFT                    TPOT, ITL
Ideal hardware       High FLOPS (H100)       High bandwidth (MI300X)
Scaling dimension    Prompt length           Concurrent users

Architecture

                    ┌─────────────────────┐
                    │    Load Balancer    │
                    └──────────┬──────────┘
                               │
              ┌────────────────┴────────────────┐
              ▼                                 ▼
    ┌─────────────────────┐          ┌─────────────────────┐
    │   Prefill Cluster   │          │   Decode Cluster    │
    │   (H100 / H200)     │          │   (MI300X / A100)   │
    │                     │          │                     │
    │  - Process prompts  │ KV Cache │  - Generate tokens  │
    │  - High FLOPS       │────────▶│  - High bandwidth   │
    │  - Compute-bound    │ Transfer │  - Memory-bound     │
    └─────────────────────┘          └─────────────────────┘

KV Cache Transfer

The critical challenge: moving KV cache from prefill to decode cluster quickly.

Transfer Overhead

If KV transfer takes longer than compute savings, disaggregation hurts performance. Solutions must support fast, low-latency protocols.

vLLM KV transfer backends:

  • NIXLConnector: NVIDIA Inference Xfer Library
  • UCX: Unified Communication X
  • libfabric: OpenFabrics interfaces
  • EFA: AWS Elastic Fabric Adapter

Industry Adoption (2026)

Disaggregation is now standard in production:

  • Frameworks: NVIDIA Dynamo, llm-d, Ray Serve LLM, SGLang, vLLM, LMCache
  • Companies: Fireworks AI, Perplexity, Meta, Amazon, Modular, DeepInfra

Performance Results

  • Up to 6.4x throughput improvement
  • 20x reduction in latency variance
  • 15-40% infrastructure cost reduction

When to Disaggregate

  1. High scale: Thousands of concurrent users
  2. Variable workloads: Mix of long prompts and many short outputs
  3. Strict SLOs: Need to optimize TTFT and TPOT independently
  4. Heterogeneous hardware: Mix of GPU types available

MODULE 12

GPU Communication & Collective Operations

Inter-GPU communication often determines scaling efficiency. Understanding NCCL and alternatives is critical.

Collective Operations

Operation       Description                                     Use Case
All-Reduce      Sum/average across all GPUs, result everywhere  Tensor parallelism sync
All-Gather      Gather data from all GPUs to all GPUs           Sequence parallelism
Reduce-Scatter  Reduce then scatter result                      ZeRO optimizer
All-to-All      Each GPU sends different data to each GPU       Expert parallelism routing
Broadcast       One GPU sends to all others                     Weight synchronization

All-Reduce: The TP Bottleneck

In tensor parallel inference, all-reduce synchronizes after every transformer block:

8-way TP forward pass:
  Linear 1 (parallel) → All-Reduce → Activation → Linear 2 (parallel) → All-Reduce

Each all-reduce: ~23% of decode latency on 8×H100!

NCCL (NVIDIA Collective Communications Library)

  • Optimized for NVLink intra-node communication
  • Ring and tree algorithms for different message sizes
  • Scales poorly for small messages across nodes
  • Standard choice for NVIDIA GPUs

Optimization Strategies

1. Custom Single-Shot All-Reduce

Standard ring all-reduce: 2×(N-1) communication steps
Single-shot: Each rank aggregates from all peers in one stage

Result: ~3x kernel time speedup when fused with LayerNorm
       ~27% end-to-end decode improvement

2. Kernel Fusion

Before: All-Reduce → LayerNorm → Add (3 kernels)
After:  Fused All-Reduce + LayerNorm + Add (1 kernel)

Eliminates intermediate HBM round-trips.

3. Communication Compression

Flash All-Reduce with INT4 quantization:
  - Quantize tensors before communication
  - Transfer 4x less data
  - Dequantize after receive

Result: 3.18x latency reduction for >64MB messages

NVRAR (Research Alternative)

Hierarchical all-reduce based on recursive doubling with NVSHMEM:

  • 1.9-3.6x lower latency than NCCL for 128KB-2MB messages
  • 1.72x reduction in end-to-end batch latency for Llama 3.1 405B

GPUDirect RDMA

Bypassing CPU

GPUDirect RDMA allows network adapters (InfiniBand, RoCE) to directly access GPU memory without CPU involvement:

  • Eliminates CPU memory copy
  • Reduces latency by microseconds
  • Critical for multi-node tensor parallelism

Unified Memory (Grace Hopper/Blackwell)

NVLink-C2C creates coherent CPU-GPU memory:

  • Bandwidth: 900 GB/s between CPU and GPU
  • Capacity: GH200 = 96GB HBM + 480GB LPDDR unified
  • Benefit: Models larger than GPU memory "just work"

NCCLX (Meta, 100K+ GPUs)

Extended NCCL for extreme scale:

  • Supports 100,000+ GPU collective operations
  • Developed for Llama 4 training/inference
  • Optimizations for both throughput and latency
CHEAT SHEET

Key Topics for System Design

Distilled essentials for LLM inference & training system design interviews. These are the concepts that matter most.

1. Prefill vs Decode: The Two Phases

Prefill (prompt processing): Compute-bound, processes all input tokens in parallel, generates KV cache. Decode (generation): Memory-bandwidth-bound, generates one token at a time, reads entire KV cache per token. Most optimization effort targets decode phase.

2. Memory Hierarchy: HBM → SRAM → Registers

HBM: 80-192GB, 2-5 TB/s bandwidth (main memory). SRAM: ~50MB per GPU, 100+ TB/s (L2 cache + SM shared memory). The goal is to keep data in SRAM as long as possible. FlashAttention exists because HBM bandwidth is the bottleneck.

3. KV Cache: The Memory Dominator

Formula: 2 × layers × kv_heads × head_dim × seq_len × precision_bytes. For Llama-70B with 8K context: ~2.6GB per sequence. KV cache often exceeds model weights in memory usage for long contexts. This is why PagedAttention and context length matter so much.
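The formula above can be checked directly. Plugging in Llama-70B-style settings (80 layers, 8 KV heads with GQA, head dim 128, FP16) reproduces the ~2.5GB figure:

```python
# KV cache size per sequence, per the formula above.
# Factor of 2: one K and one V entry per layer/head/position.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, precision_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * precision_bytes

gib = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192) / 2**30
# ≈ 2.5 GiB per 8K-token sequence
```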

4. Continuous Batching: No Head-of-Line Blocking

Static batching wastes GPU cycles waiting for longest sequence. Continuous batching inserts new requests as others complete. Key insight: Iteration-level scheduling, not request-level. vLLM, TensorRT-LLM, and all modern serving systems use this.
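Iteration-level scheduling can be simulated with a toy loop: a batch slot is refilled the moment its sequence finishes, rather than waiting for the whole batch to drain. Request lengths and batch size here are made up for illustration:

```python
# Toy continuous-batching scheduler: counts decode iterations needed to
# serve all requests when new requests are admitted per-iteration.

from collections import deque

def continuous_batching_iterations(request_lengths, batch_size):
    queue = deque(request_lengths)
    active = []          # remaining tokens for each in-flight request
    iterations = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())         # admit at iteration granularity
        active = [r - 1 for r in active if r > 1]  # one decode step per active request
        iterations += 1
    return iterations

print(continuous_batching_iterations([10, 2, 2, 2], batch_size=2))
```

For these lengths and batch size 2, continuous batching finishes in 10 iterations, while static batching ([10,2] then [2,2], each gated by its longest member) would take 12.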

5. FlashAttention: Tiling for Memory Efficiency

Standard attention materializes O(N²) attention matrix in HBM. FlashAttention tiles Q, K, V into blocks that fit in SRAM, computes partial softmax with online correction. Result: No O(N²) memory, 2-4x speedup. Now the default in all inference engines.
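The online softmax correction that makes this tiling possible can be sketched in pure Python: scores arrive block by block, and whenever a new block raises the running maximum, previously accumulated sums are rescaled. This shows the numerics only, not the tiled matmuls:

```python
# Online (streaming) softmax over blocks of scores, as used inside
# FlashAttention. Pure-Python numerics sketch.

import math

def online_softmax(score_blocks):
    m, d = float("-inf"), 0.0      # running max and running normalizer
    weighted = []                  # unnormalized exp(score - m) values so far
    for block in score_blocks:
        m_new = max(m, max(block))
        correction = math.exp(m - m_new) if m != float("-inf") else 0.0
        d = d * correction + sum(math.exp(s - m_new) for s in block)
        weighted = [w * correction for w in weighted] + \
                   [math.exp(s - m_new) for s in block]
        m = m_new
    return [w / d for w in weighted]

# Block-wise result matches the full softmax computed in one shot
scores = [[1.0, 2.0], [3.0, 0.5]]
flat = [s for b in scores for s in b]
full = [math.exp(s - max(flat)) for s in flat]
full = [f / sum(full) for f in full]
assert all(abs(a - b) < 1e-9 for a, b in zip(online_softmax(scores), full))
```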

6. PagedAttention: Virtual Memory for KV Cache

Problem: Variable sequence lengths cause fragmentation. Solution: Allocate KV cache in fixed blocks (like OS pages), use block table for indirection. Achieves near-zero fragmentation, enables memory sharing for beam search. Core innovation behind vLLM.
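The block-table indirection can be sketched as a tiny allocator: logical token positions map to fixed-size physical blocks drawn from a shared pool, so freed memory never fragments into unusable variable-size holes. Class name and block size below are illustrative, not vLLM's actual API:

```python
# Toy PagedAttention-style block manager: sequences grow one token at a
# time and grab a new fixed-size physical block only when the current
# one fills up.

BLOCK_SIZE = 4

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # physical block pool
        self.block_tables = {}                        # seq_id -> [physical block ids]
        self.lengths = {}                             # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:          # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        self.free.extend(self.block_tables.pop(seq_id, []))  # blocks return to pool
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_physical_blocks=8)
for _ in range(6):               # a 6-token sequence uses ceil(6/4) = 2 blocks
    cache.append_token("seq0")
```

Because every allocation is exactly one block, the only waste is the tail of a sequence's last block.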

7. Quantization: Trading Precision for Speed

Weight-only quantization (INT4/INT8 weights, FP16 activations): Reduces memory, speeds up memory-bound decode. Full quantization (FP8/INT8 everything): Faster compute on Tensor Cores. Key trade-off: accuracy vs throughput. AWQ, GPTQ for weights; FP8 on Hopper/Blackwell for full.

8. Tensor Parallelism: Splitting Within Layers

Splits attention heads and FFN columns across GPUs. Each GPU holds 1/N of weights, computes partial results, all-reduce to combine. Low latency (single forward pass), but requires fast interconnect (NVLink). Use for latency-sensitive serving, typically 2-8 GPUs.

9. Pipeline Parallelism: Splitting Across Layers

Different GPUs hold different layers. Data flows through pipeline, micro-batching hides bubble overhead. Lower communication than TP, but higher latency. Use for training or when model doesn't fit with TP alone. Combine with TP for very large models.

10. Speculative Decoding: Guess and Verify

Draft model generates K candidate tokens cheaply, target model verifies in single forward pass (parallel!). If correct, accept all K; if wrong, accept up to first mismatch. 2-3x speedup for latency without quality loss. Works because verification is parallel, generation is sequential.
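The greedy accept rule can be sketched directly: keep the longest prefix where the draft agrees with the target's (parallel) verification, plus the target's corrected token at the first mismatch. Token values are made up:

```python
# Greedy speculative-decoding accept rule. One parallel verify pass over
# K draft tokens yields between 1 and K accepted tokens.

def accept_draft(draft_tokens, target_tokens):
    """target_tokens[i] is the target model's choice at each draft position."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)   # draft guessed right: keep it
        else:
            accepted.append(t)   # first mismatch: take target's token, stop
            break
    return accepted

# Draft got 2 of 4 right; one verify pass still yields 3 usable tokens
assert accept_draft([5, 9, 3, 7], [5, 9, 4, 7]) == [5, 9, 4]
```

Sampling-based variants accept probabilistically rather than by exact match, but the structure is the same.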

11. Disaggregated Serving: Prefill ≠ Decode

Prefill is compute-bound, decode is memory-bound — why run both on same hardware? Disaggregated architecture: Prefill nodes with high compute, decode nodes with high memory bandwidth. Transfer KV cache between them. Emerging pattern for large-scale serving (Mooncake, DistServe).

12. Arithmetic Intensity: The Core Metric

AI = FLOPs / Bytes moved. Compare to hardware's ops:byte ratio (H100: ~300 for dense FP16). If AI < ratio → memory-bound. Decode AI ≈ 1-2 (one token, read all weights), prefill AI ≈ batch_size × seq_len. This explains why decode is always memory-bound.
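A back-of-envelope version of this comparison, assuming ~2 FLOPs per FP16 weight per token and H100 SXM specs of roughly 989 TFLOPS dense FP16 and 3.35 TB/s HBM bandwidth (approximate figures):

```python
# Decode arithmetic intensity vs the hardware ops:byte ratio.
# Rough model: each parameter contributes one multiply-accumulate
# (2 FLOPs) per token, and FP16 weights are read once per step.

def decode_arithmetic_intensity(batch_size, bytes_per_param=2):
    flops_per_param = 2 * batch_size        # MAC per token in the batch
    return flops_per_param / bytes_per_param  # weight reads amortized over batch

ai_single = decode_arithmetic_intensity(batch_size=1)  # 1 FLOP/byte
h100_ops_per_byte = 989e12 / 3.35e12                   # ≈ 295 FLOPs/byte (dense FP16)
# ai_single << h100_ops_per_byte, so batch-1 decode is deeply memory-bound;
# batching raises AI roughly linearly until compute becomes the limit.
```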

13. Hardware: NVIDIA vs AMD vs Custom

H100: 80GB HBM3, 3.35 TB/s, NVLink 900 GB/s, dominant ecosystem. MI300X: 192GB HBM3, 5.3 TB/s, better memory, weaker software. Blackwell (B200): FP4 support, 2x H100 performance. Know the specs, but software maturity often matters more.

14. Serving Frameworks: vLLM vs TensorRT-LLM

vLLM: PagedAttention, Python, flexible, great for research/startups. TensorRT-LLM: NVIDIA optimized, compiled graphs, best raw performance. SGLang: RadixAttention for prefix sharing. Choice depends on: flexibility vs performance, NVIDIA-only vs multi-hardware.

15. Training at Scale: Data, Tensor, Pipeline, Expert

Data Parallel (DP/FSDP): Replicate model, split data, gradient sync. Tensor Parallel: Split ops. Pipeline Parallel: Split layers. Expert Parallel: MoE routing. Modern training uses 3D/4D parallelism combining all. FSDP shards optimizer states across DP ranks for memory efficiency.

Interview Quick Hits

  • Why is decode slow? Memory-bandwidth-bound: read all weights for just 1 token.
  • Why KV cache? Avoid recomputing attention for all previous tokens each step.
  • Why batching helps? Amortize weight loading across multiple sequences.
  • Why TP over DP for inference? Lower latency (single request), not throughput-focused.
  • Why FP8 over INT8? No calibration needed, direct training support on Hopper+.
  • Why speculative works? Verification is parallel (batch of K), generation is serial.
  • Why disaggregate? Match hardware to workload characteristics (compute vs memory).

Knowledge Check

Test your understanding of LLM inference concepts. Select the best answer for each question.

Question 1: Inference Phases

During LLM inference, which phase is typically memory-bandwidth-bound?

  • A) Prefill phase
  • B) Decode phase
  • C) Both phases equally
  • D) Neither - LLM inference is compute-bound

Question 2: KV Cache Memory

For a 70B parameter model with GQA (8 KV heads), 80 layers, 128 head dimension, and FP16 precision, approximately how much KV cache memory is needed per 8K token sequence?

  • A) ~500 MB
  • B) ~2.5-3 GB
  • C) ~10 GB
  • D) ~20 GB

Question 3: PagedAttention

What is the primary benefit of PagedAttention's block-based memory management?

  • A) Faster attention computation
  • B) Reduced model size
  • C) Near-elimination of memory fragmentation
  • D) Lower precision requirements

Question 4: Continuous Batching

What problem does continuous batching solve that static batching cannot?

  • A) Reduces model memory footprint
  • B) Eliminates head-of-line blocking when requests finish at different times
  • C) Enables larger batch sizes
  • D) Improves attention accuracy

Question 5: FlashAttention

FlashAttention achieves its speedup primarily by:

  • A) Using lower precision arithmetic
  • B) Approximating attention scores
  • C) Tiling computation to minimize HBM reads/writes
  • D) Skipping unimportant tokens

Question 6: Quantization

NVIDIA's NVFP4 format on Blackwell achieves accuracy preservation through:

  • A) Per-tensor scaling only
  • B) Micro-block scaling with 16-element groups sharing FP8 scale factors
  • C) Training-aware quantization
  • D) Mixed INT4/INT8 representation

Question 7: AMD MI300X

Why does AMD MI300X often match or exceed H100 performance in the decode phase?

  • A) Higher compute FLOPS
  • B) Higher HBM memory bandwidth (5.3 TB/s vs 3.35 TB/s)
  • C) More CUDA cores
  • D) Better software optimization

Question 8: Tensor Parallelism

In tensor parallelism, what collective operation is performed after each transformer block?

  • A) All-Reduce
  • B) All-Gather
  • C) All-to-All
  • D) Broadcast

Question 9: Expert Parallelism

Expert parallelism for MoE models uses which collective operation to route tokens to the correct GPUs?

  • A) All-Reduce
  • B) Broadcast
  • C) All-to-All
  • D) Reduce-Scatter

Question 10: Speculative Decoding

Why does speculative decoding work even though it requires running two models?

  • A) The draft model is much more accurate
  • B) Verifying K tokens costs similar to generating 1 token (memory-bound)
  • C) Modern GPUs have spare compute capacity
  • D) The models share weights

Question 11: Disaggregated Serving

What is the main challenge in disaggregated prefill/decode architecture?

  • A) Prefill cluster running out of memory
  • B) Transferring KV cache from prefill to decode cluster efficiently
  • C) Load balancing between clusters
  • D) Different model versions on each cluster

Question 12: Communication Optimization

GPUDirect RDMA improves multi-node inference by:

  • A) Compressing data before transfer
  • B) Using faster network protocols
  • C) Allowing network adapters to access GPU memory directly without CPU involvement
  • D) Caching frequently transferred data

Question 13: GQA vs MHA

Grouped-Query Attention (GQA) reduces KV cache size by:

  • A) Using lower precision for K and V
  • B) Having groups of query heads share the same K and V heads
  • C) Compressing K and V with learned projections
  • D) Caching only the most important tokens

Question 14: TTFT vs TPOT

Which optimization primarily improves Time to First Token (TTFT)?

  • A) Chunked prefill
  • B) Speculative decoding
  • C) Continuous batching
  • D) KV cache quantization

Question 15: Memory Hierarchy

The primary bottleneck in LLM decode is the bandwidth gap between:

  • A) Registers and shared memory
  • B) Shared memory and L2 cache
  • C) HBM and compute capability (arithmetic intensity)
  • D) CPU RAM and GPU HBM

Addendum: VP of Engineering Interview Guide

Microsoft executive interviews focus on growth mindset, leadership at scale, and technical depth. Here's what to expect for a VP of Engineering role.

Microsoft Interview Process

  • HR Screen — 30-45 min call focusing on background, motivation, and culture fit
  • Hiring Manager Screen — Technical depth + leadership philosophy
  • Onsite Loop (4-6 rounds) — Mix of technical, behavioral, and leadership scenarios
  • Executive Interview — CVP or higher assessing strategic alignment

Each interviewer grades independently: Strong Hire / Hire / No Hire / Strong No Hire. You typically need "Hire" from all interviewers.

🧠 Growth Mindset (Critical)

Under Satya Nadella, Microsoft's culture centers on growth mindset—the belief that abilities can be developed through dedication and hard work. Demonstrate this throughout:

  • Share examples of learning from failures
  • Show curiosity about new technologies and approaches
  • Discuss how you've evolved your leadership style
  • Be open to feedback during the interview itself
  • Avoid "know-it-all" energy—emphasize continuous learning

Interview Question Categories

Strategic Thinking

  • "Let's pretend we're 3 years into the future and our company has gone out of business. What are some possible causes?"
  • "What does your current business do? How does engineering contribute to revenue?"
  • "We want to hire N engineers in the next 12 months. Walk me through your plan."
  • "How do you balance innovation with operational excellence?"

Leadership & Culture

  • "How do you build an engineering culture? What do you value?"
  • "Tell me about a time you disagreed with the CEO/CTO. What was the outcome?"
  • "What do your direct reports like about you? What don't they like?"
  • "Give me an example of tough feedback you received. What did you do about it?"
  • "What's the hardest part of your job today? What do you like least about it?"
  • "Tell me about a time when someone changed your mind. What was your thought process?"

Engineering Organization

  • "How do you organize teams (cross-functional vs. functional)? What's your rationale?"
  • "What would you do differently at 50 people vs. 200 people?"
  • "What does your ideal career ladder look like?"
  • "How do you ensure best practices are shared across teams?"
  • "When does it make sense to build an international presence?"

Technical Depth

  • "What's the most challenging technical project you ran? Describe in detail."
  • "Tell me about a time you had to go down several layers to understand a problem."
  • "What do you think about CI/CD? What's the right way to do it?"
  • "How have you evolved releases for an engineering org?"
  • "What key metrics do you use to measure org health?"

Execution & Operations

  • "How do you track and communicate progress of projects?"
  • "Do you measure engineering velocity? How do you communicate it externally?"
  • "What's your ideal process for managing and communicating incidents?"
  • "Your team is experiencing a lot of incidents and burnout. How do you approach this?"
  • "How do you track and forecast infrastructure costs?"

Scenario-Based Questions

  • "Your CEO wants to sign a customer who needs a feature by a certain date. Your team is complaining about priority changes and heavy backlog. What's your process?"
  • "Engineers are debating architecture: one moves faster with tech debt, one is right long-term but slower. How do you decide?"
  • "What would your first 90 days look like if you joined?"
  • "You have reservations about a hiring decision your manager is making. What do you do?"
  • "The CEO wants to prioritize a new product you have reservations about. How do you convince the team? How do you fail fast?"

💡 Microsoft-Specific Tips

Use STAR Method

Structure answers as Situation, Task, Action, Result. Microsoft interviewers are trained to look for specific, measurable outcomes.

Know "Why Microsoft?"

This will be asked. Focus on what you'll achieve AND what Microsoft gains. Reference Azure, AI leadership, or developer ecosystem.

Show Inclusive Leadership

Microsoft values diverse perspectives. Share examples of building inclusive teams and creating psychological safety.

Customer Obsession

Connect your work to customer impact. How did your technical decisions improve user experience or business outcomes?

Own Your Gaps

If you don't know something, say so—then explain how you'd learn it. This demonstrates growth mindset better than bluffing.

Prepare Questions

Have thoughtful questions about the team's challenges, roadmap, and culture. "What keeps you up at night?" shows strategic thinking.

⚠️ Common Pitfalls

  • Fixed mindset signals — Blaming others, not acknowledging mistakes, dismissing new ideas
  • Vague answers — Always give specific examples with measurable outcomes
  • Badmouthing past employers — Even valid criticism should be constructive
  • Not asking questions — Shows lack of genuine interest or strategic thinking
  • Over-indexing on IC work — At VP level, focus on leverage through others

Sources: Microsoft Careers, Glassdoor, Prepfully, IGotAnOffer, ksindi/interview-questions