1. Requirements Gathering
Before diving into architecture, clarify these with the interviewer:
Functional Requirements
- Model size? (7B, 70B, 405B parameters)
- Context length? (4K, 32K, 128K+ tokens)
- Chat completion or batch inference?
- Streaming responses required?
Non-Functional Requirements
- Latency SLO: Time to First Token (TTFT), Inter-Token Latency (ITL)
- Throughput: Tokens per second, concurrent users
- Availability: 99.9%? 99.99%?
- Cost constraints: $/1M tokens target
Example Clarification
"Let's design for a 70B parameter model, 32K context, chat completion with streaming. Target: TTFT < 500ms, ITL < 50ms, 1000 concurrent users, 99.9% availability."
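A quick back-of-envelope check (illustrative arithmetic, not a sizing tool) turns these targets into the aggregate decode throughput the system must sustain:

```python
# Back-of-envelope check on the example targets above.
concurrent_users = 1000
itl_s = 0.050                               # inter-token latency target: 50 ms
tokens_per_user_per_s = 1 / itl_s           # 20 tok/s per active stream
aggregate_decode_tps = concurrent_users * tokens_per_user_per_s

print(f"Per-stream decode rate: {tokens_per_user_per_s:.0f} tok/s")
print(f"Aggregate decode throughput needed: {aggregate_decode_tps:,.0f} tok/s")
```

Whatever decode fleet we design must sustain roughly this many tokens per second while holding the ITL target.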
2. High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOAD BALANCER │
│ (Route by model, session affinity) │
└─────────────────────────────────┬───────────────────────────────────────────┘
│
┌─────────────────────────────────▼───────────────────────────────────────────┐
│ API GATEWAY │
│ • Authentication • Rate limiting • Request validation │
│ • Token counting • Cost tracking • Request queuing │
└─────────────────────────────────┬───────────────────────────────────────────┘
│
┌─────────────────────────────────▼───────────────────────────────────────────┐
│ REQUEST SCHEDULER │
│ • Continuous batching • Priority queues (SLO-aware) │
│ • Prefix cache lookup • Load balancing across workers │
└──────────────┬──────────────────────────────────────────┬───────────────────┘
│ │
▼ ▼
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ PREFILL WORKERS │ │ DECODE WORKERS │
│ (Compute-optimized GPUs) │ │ (Memory-optimized GPUs) │
│ │ │ │
│ ┌────────┐ ┌────────┐ │ │ ┌────────┐ ┌────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │ │ │ GPU 0 │ │ GPU 1 │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ TP = 2 │ │ TP = 2 │ │ │ │ TP = 2 │ │ TP = 2 │ │
│ └────┬───┘ └───┬────┘ │ │ └────┬───┘ └───┬────┘ │
│ └────┬────┘ │ │ └────┬────┘ │
│ │ NVLink │ │ │ NVLink │
│ ┌────────┐ ┌────────┐ │ │ ┌────────┐ ┌────────┐ │
│ │ GPU 2 │ │ GPU 3 │ │ │ │ GPU 2 │ │ GPU 3 │ │
│  └────────┘  └────────┘    │    │  └────────┘  └────────┘    │
└──────────────┬───────────────┘ └──────────────┬───────────────┘
│ │
└────────────────┬─────────────────────────┘
│
┌───────────────────────────────▼─────────────────────────────────────────────┐
│ KV CACHE TRANSFER LAYER │
│ • RDMA / NVLink for low-latency transfer │
│ • Distributed KV store for large-scale │
└─────────────────────────────────────────────────────────────────────────────┘
Component Responsibilities
| Component | Responsibility | Key Metrics |
|---|---|---|
| Load Balancer | Route requests, health checks, session affinity for streaming | Request latency, error rate |
| API Gateway | Auth, rate limiting, request validation, token counting | Requests/sec, rejection rate |
| Scheduler | Batch formation, prefix cache lookup, worker assignment | Queue depth, batch efficiency |
| Prefill Workers | Process input prompt, generate KV cache | TTFT, prefill tokens/sec |
| Decode Workers | Autoregressive token generation | ITL, decode tokens/sec |
3. Parallelism Strategies
Tensor Parallelism (TP)
Split individual layers across GPUs. Each GPU holds a slice of every layer's weights.
Single Layer with TP=4
┌─────────────────────────────────────────────────┐
│ Input Tensor │
└─────────────────────────┬───────────────────────┘
│ Scatter
┌────────────────────┼────────────────────┐
│ │ │ │ │
┌────▼────┐ ┌───▼────┐ ┌──▼───┐ ┌───▼────┐
│ GPU 0 │ │ GPU 1 │ │ GPU 2│ │ GPU 3 │
│ W[0:25%]│ │W[25:50%]│ │W[50:75%]│ │W[75:100%]│
└────┬────┘ └───┬────┘ └──┬───┘ └───┬────┘
│ │ │ │
└──────────┴────┬────┴─────────┘
│ All-Reduce (NVLink)
▼
┌─────────────────────┐
│ Output Tensor │
└─────────────────────┘
When to Use TP
- Model doesn't fit in single GPU memory
- Low latency requirements (all GPUs work on same token)
- Within NVLink domain (high bandwidth all-reduce needed)
- Typical: TP=2, 4, or 8 within a single node
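The scatter/all-reduce pattern in the diagram can be sketched with NumPy standing in for the GPUs (a row-parallel linear layer; summing the partial products plays the role of the all-reduce):

```python
import numpy as np

# Row-parallel linear layer with TP=4, simulated on one machine.
# W is split along its input dimension; each "GPU" computes a partial
# product, and summing the partials is the all-reduce step.
rng = np.random.default_rng(0)
tp = 4
x = rng.standard_normal((1, 512))       # input activation
w = rng.standard_normal((512, 256))     # full weight matrix

x_shards = np.split(x, tp, axis=1)      # scatter input features
w_shards = np.split(w, tp, axis=0)      # each GPU holds 128 rows of W

partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
y_tp = sum(partials)                    # all-reduce (sum) across GPUs

assert np.allclose(y_tp, x @ w)         # matches the single-GPU result
```

Note the key property: this all-reduce happens at every layer, which is why TP needs NVLink-class bandwidth.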
Pipeline Parallelism (PP)
Assign different layer groups to different GPUs/nodes. Data flows through stages.
70B Model with PP=4
Node 0 Node 1 Node 2 Node 3
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Layers │ ──────► │Layers │ ──────► │Layers │ ──────► │Layers │
│ 0-19 │ │ 20-39 │ │ 40-59 │ │ 60-79 │
│ │ │ │ │ │ │ │
│ Stage 0 │ │ Stage 1 │ │ Stage 2 │ │ Stage 3 │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
────────────────────────────────────────────────────────────────────────►
Time / Token Flow
Pipeline Bubbles
PP has inherent inefficiency: when Stage 0 processes request B, Stages 1-3 are still processing request A. This creates "bubbles" of GPU idle time. Mitigate with micro-batching.
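The bubble cost is easy to quantify: with p stages and m micro-batches, the idle fraction is (p − 1)/(m + p − 1) under the standard pipeline model:

```python
# Pipeline bubble fraction for p stages and m micro-batches.
# More micro-batches shrink the bubble, which is why micro-batching
# is the standard mitigation.
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # no micro-batching: 75% idle
print(f"{bubble_fraction(4, 16):.0%}")  # 16 micro-batches: ~16% idle
```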
Hybrid Parallelism (TP + PP)
405B Model: TP=8 within node, PP=4 across nodes
┌─────────────────────────────────────────────────────────────────────┐
│ Node 0 (Layers 0-25) │
│ ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐ │
│ │GPU 0 │GPU 1 │GPU 2 │GPU 3 │GPU 4 │GPU 5 │GPU 6 │GPU 7 │ │
│ │ │ │ │ │ │ │ │ │ │
│ └───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┘ │
│ └───────┴───────┴───────┴───┬───┴───────┴───────┴───────┘ │
│ NVLink │ (900 GB/s) │
└─────────────────────────────────┼───────────────────────────────────┘
│ InfiniBand (400 Gbps)
┌─────────────────────────────────▼───────────────────────────────────┐
│ Node 1 (Layers 26-50) │
│ ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐ │
│ │GPU 0 │GPU 1 │GPU 2 │GPU 3 │GPU 4 │GPU 5 │GPU 6 │GPU 7 │ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
(Nodes 2, 3...)
| Strategy | Best For | Communication | Limitation |
|---|---|---|---|
| Tensor Parallelism | Low latency, within node | All-reduce every layer | Requires high bandwidth (NVLink) |
| Pipeline Parallelism | Cross-node, memory-limited | Point-to-point between stages | Pipeline bubbles |
| Data Parallelism | High throughput | None (inference) | Model must fit per replica |
| Expert Parallelism | MoE models (Mixtral) | All-to-all routing | Load imbalance |
4. Memory Management (KV Cache)
The Problem
During autoregressive generation, we cache the Key and Value tensors of every processed token, across all layers and attention heads, so each new token can attend to the full history without recomputation. At high batch sizes this cache, not the model weights, dominates GPU memory.
KV Cache Size Calculation
For a 70B-class model with 80 layers, 64 attention heads, head_dim 128, at FP16:
2 (K+V) × 80 layers × 64 heads × 128 dim × 2 bytes ≈ 2.6 MB per token — about 5 GB for a 2K-token sequence, and ~86 GB at the full 32K context. (Production Llama 70B uses grouped-query attention with only 8 KV heads, cutting this by 8×, but the KV cache still dominates memory at high batch sizes.)
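A small helper (illustrative; MHA dimensions as above) makes the per-token cost explicit:

```python
# KV cache sizing helper (sketch; dimensions from the example above).
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2: both K and V are stored at every layer/head/position.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

per_token = kv_cache_bytes(80, 64, 128, tokens=1)
print(f"{per_token / 1e6:.2f} MB per token")                        # 2.62 MB
print(f"{kv_cache_bytes(80, 64, 128, 32768) / 1e9:.0f} GB at 32K")  # 86 GB
```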
PagedAttention (vLLM Innovation)
Traditional KV Cache PagedAttention
───────────────────── ─────────────────────
┌─────────────────────┐ ┌───┬───┬───┬───┬───┐
│ Seq 1: 1500 tokens │ │ P1│ P2│ P3│ P4│ P5│ ← Seq 1 (pages)
│ [allocated: 2048] │ └───┴───┴───┴───┴───┘
│ [wasted: 548] │
└─────────────────────┘ ┌───┬───┬───┐
┌─────────────────────┐ │ P6│ P7│ P8│ ← Seq 2
│ Seq 2: 800 tokens │ └───┴───┴───┘
│ [allocated: 2048] │
│ [wasted: 1248] │ ┌───┬───┬───┬───┬───┬───┐
└─────────────────────┘ │ P9│P10│P11│P12│P13│P14│ ← Seq 3
┌─────────────────────┐ └───┴───┴───┴───┴───┴───┘
│ Seq 3: 2000 tokens │
│ [allocated: 2048] │ Pages allocated on demand
│ [wasted: 48] │ No external fragmentation
└─────────────────────┘ Can share prefix pages!
Total waste: ~45% Total waste: <5%
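A minimal sketch of the idea (a hypothetical allocator, not vLLM's actual API): a page table maps each sequence to fixed-size physical pages, allocated only when a token position crosses a page boundary:

```python
# Toy paged KV allocator: pages are allocated on demand, freed as a unit.
class PagedKVAllocator:
    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables: dict[str, list[int]] = {}   # seq_id -> page ids

    def append_token(self, seq_id: str, pos: int) -> int:
        """Return the physical page holding token `pos`, allocating on demand."""
        table = self.page_tables.setdefault(seq_id, [])
        if pos // self.page_size >= len(table):
            table.append(self.free_pages.pop())       # grab a free page
        return table[pos // self.page_size]

    def free(self, seq_id: str) -> None:
        self.free_pages.extend(self.page_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_pages=64)
for pos in range(40):                     # a 40-token sequence
    alloc.append_token("seq-1", pos)
print(len(alloc.page_tables["seq-1"]))    # → 3 (ceil(40 / 16) pages)
```

The waste is bounded by less than one page per sequence, which is where the "<5%" figure comes from.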
Prefix Caching
When multiple requests share a common prefix (system prompt, few-shot examples), cache and reuse the KV for that prefix.
System Prompt: "You are a helpful assistant..." (500 tokens)
Request 1: [System Prompt] + "What is 2+2?"
Request 2: [System Prompt] + "Explain quantum computing"
Request 3: [System Prompt] + "Write a poem"
┌─────────────────────────────┐
│ Cached Prefix KV │ ← Computed once
│ (500 tokens) │
└──────────────┬──────────────┘
│ Shared
┌─────────┼─────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Req 1 │ │ Req 2 │ │ Req 3 │
│ unique │ │ unique │ │ unique │
└─────────┘ └─────────┘ └─────────┘
Savings: 3× reduction in prefill compute for shared prefix
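One way to sketch the lookup (hypothetical `prefix_key`/`prefill_plan` helpers; real systems such as vLLM's automatic prefix caching hash per block rather than per whole prefix):

```python
# Prefix cache sketch: reuse cached KV for a shared prompt prefix and
# prefill only the unique suffix of each request.
from hashlib import sha256

prefix_cache: dict[str, str] = {}         # prefix hash -> KV cache handle

def prefix_key(token_ids: list[int]) -> str:
    return sha256(str(token_ids).encode()).hexdigest()

system_prompt = list(range(500))          # stand-in for 500 shared tokens
prefix_cache[prefix_key(system_prompt)] = "kv-pages-for-system-prompt"

def prefill_plan(token_ids: list[int], shared_len: int):
    """Return (cache hit or None, number of tokens still to prefill)."""
    hit = prefix_cache.get(prefix_key(token_ids[:shared_len]))
    to_prefill = token_ids[shared_len:] if hit else token_ids
    return hit, len(to_prefill)

hit, n = prefill_plan(system_prompt + [901, 902, 903], shared_len=500)
print(hit is not None, n)                 # → True 3 (only the suffix)
```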
5. Batching Strategies
Static vs Continuous Batching
STATIC BATCHING
═══════════════════════════════════════════════════════════════════
Time → ─────────────────────────────────────────────────────────►
Batch 1 ████████████████████████████████████████████████
│ Req A (100 tokens) │ Req B (50 tokens) │ Req C (200 tokens)
│ │ ░░░░░░░░░░░░░░░░░░░ │ (waiting)
│ │ (padding) │
└────────────────────┴──────────────────────┘
▲
│
All must wait for longest
CONTINUOUS BATCHING
═══════════════════════════════════════════════════════════════════
Time → ─────────────────────────────────────────────────────────►
████████████████████████████████████████████████████████
Req A │ ████████████████████ │ (done, exits batch)
Req B │ ██████████ │ (done) │
Req C │ │ █████████████████████████████████████████
Req D │ │ │ ████████████████████████████████
Req E │ │ │ │ ████████████████████████
└────────────┴─────────┴─────────┴────────────────────────
▲ ▲ ▲
│ │ │
New requests join as slots free up
| Aspect | Static Batching | Continuous Batching |
|---|---|---|
| Latency | Higher (wait for batch + longest seq) | Lower (immediate insertion) |
| Throughput | Lower (padding waste) | Higher (no padding) |
| Memory | Fixed, predictable | Dynamic, needs careful management |
| Complexity | Simple | Complex scheduling |
Industry Standard (2025)
Continuous batching is now default in all major serving frameworks (vLLM, TensorRT-LLM, TGI). Static batching is only used for specific batch inference workloads.
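The continuous-batching timeline above can be captured in a toy scheduler (FIFO admission only; real schedulers also weigh priorities and KV cache headroom):

```python
# Toy continuous-batching loop: each step decodes one token for every
# active request, retires finished ones, and admits waiting requests
# into the freed slots.
from collections import deque

def continuous_batch(requests: dict[str, int], max_batch: int) -> dict[str, int]:
    """requests: id -> tokens to generate. Returns id -> finishing step."""
    waiting = deque(requests)
    active: dict[str, int] = {}
    finished: dict[str, int] = {}
    step = 0
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit into free slots
            rid = waiting.popleft()
            active[rid] = requests[rid]
        step += 1
        for rid in list(active):                     # one decode step for all
            active[rid] -= 1
            if active[rid] == 0:
                finished[rid] = step
                del active[rid]                      # slot frees immediately
    return finished

print(continuous_batch({"A": 3, "B": 1, "C": 5, "D": 2}, max_batch=2))
# → {'B': 1, 'A': 3, 'D': 5, 'C': 6}
```

Note how B's early exit lets C start at step 2 instead of waiting for the whole batch to drain, which is exactly the latency win over static batching.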
6. Prefill/Decode Disaggregation
Why Disaggregate?
| Phase | Characteristic | Bottleneck | Optimal Hardware |
|---|---|---|---|
| Prefill | Process all input tokens in parallel | Compute (matrix multiply) | High FLOPS (H100, TPU) |
| Decode | Generate one token at a time | Memory bandwidth | High HBM bandwidth (H200, MI300X) |
DISAGGREGATED ARCHITECTURE
═══════════════════════════════════════════════════════════════════
┌─────────────────────┐
│ Request Router │
└──────────┬──────────┘
│
┌─────────────────┴─────────────────┐
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ PREFILL CLUSTER │ │ DECODE CLUSTER │
│ │ │ │
│ • H100 GPUs │ │ • H200 / MI300X │
│ • High FLOPS │ KV Cache │ • High HBM BW │
│ • Fewer GPUs │ ──────────► │ • More GPUs │
│ • Process prompts │ Transfer │ • Token generation │
│ │ │ │
└─────────────────────┘ └─────────────────────┘
Benefits:
• Scale prefill and decode independently
• Match hardware to workload characteristics
• Better overall resource utilization
Challenges:
• KV cache transfer latency (need RDMA/NVLink)
• Orchestration complexity
• 20-30% overhead for small requests
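A rough transfer-time estimate shows why RDMA/NVLink is required (assumed link speeds; MHA-sized KV cache from Section 4):

```python
# Estimate the prefill→decode KV handoff time for a 2K-token prompt.
kv_bytes_per_token = 2 * 80 * 64 * 128 * 2        # ~2.6 MB/token (FP16, MHA)
prompt_tokens = 2048
kv_total = kv_bytes_per_token * prompt_tokens     # ~5.4 GB to move

links_gbit = {"InfiniBand (400 Gbps)": 400,
              "NVLink (900 GB/s)": 900 * 8}       # GB/s -> Gbit/s
for name, gbit in links_gbit.items():
    ms = kv_total * 8 / (gbit * 1e9) * 1e3
    print(f"{name}: {ms:.0f} ms")                 # ~107 ms vs ~6 ms
```

A ~100 ms handoff over commodity networking would blow a 500 ms TTFT budget for long prompts, which is why disaggregated designs lean on RDMA or NVLink.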
7. Speculative Decoding
Use a small "draft" model to predict multiple tokens, then verify with the target model in parallel.
SPECULATIVE DECODING FLOW
═══════════════════════════════════════════════════════════════════
Draft Model (7B) Target Model (70B)
──────────────── ─────────────────
Step 1: Draft predicts K=4 tokens
"The capital of France" → ["is", "Paris", ",", "which"]
t1 t2 t3 t4
Step 2: Target verifies ALL 4 in parallel (single forward pass)
Target checks:
• P(is | context) > threshold? ✓ Accept
• P(Paris | context + is) > thresh? ✓ Accept
• P(, | context + is + Paris)? ✓ Accept
• P(which | ... )? ✗ Reject → Generate "a"
Step 3: Accept longest matching prefix + generate next token
Output: "is", "Paris", ",", "a" (4 tokens from 2 forward passes!)
─────────────────────────────────────────────────────────────────
Traditional: 4 tokens = 4 forward passes
Speculative: 4 tokens = 2 forward passes (draft + verify)
Speedup: Up to 2-3x when draft model matches well
When Speculative Decoding Helps
- Target model is memory-bandwidth bound (decode phase)
- Draft model has high acceptance rate (>70%)
- Draft model is 5-10x smaller than target
Trade-off: Increases compute (running two models) to reduce latency. Not helpful if already compute-bound.
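Under the simplified model from the speculative decoding papers, where each draft token is accepted independently with probability α, drafting K tokens yields (1 − α^(K+1))/(1 − α) tokens per target pass:

```python
# Expected tokens per target-model forward pass: K draft tokens plus the
# one token the verify pass always produces (simplified independence model).
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens(alpha, 4):.2f} tokens/pass")
```

At α ≈ 0.7 with K = 4 this gives ~2.8 tokens per pass, consistent with the ">70% acceptance rate" rule of thumb and the 2-3x speedup claim above.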
8. Scaling Considerations
| Scale | Primary Challenge | Solution Pattern |
|---|---|---|
| Single GPU | Model fit | Quantization (FP8/INT4), KV cache compression |
| Single Node (8 GPU) | Memory bandwidth | TP, continuous batching, PagedAttention |
| Multi-Node (10-100) | Network bandwidth | PP, EP, prefill/decode disaggregation |
| Large Scale (100+) | Coordination, tail latency | Hierarchical routing, redundancy, caching |
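Quantization's effect on fit is simple arithmetic (weights only — KV cache and activations need headroom on top; an 80 GB HBM GPU is assumed):

```python
# Minimum GPUs to hold the weights alone, at different precisions.
def gpus_needed(params_b: float, bytes_per_param: float,
                gpu_hbm_gb: float = 80) -> int:
    weight_gb = params_b * bytes_per_param
    return max(1, int(-(-weight_gb // gpu_hbm_gb)))   # ceiling division

for name, bpp in [("FP16", 2), ("FP8", 1), ("INT4", 0.5)]:
    print(f"{name}: {70 * bpp:.0f} GB weights, "
          f"{gpus_needed(70, bpp)}+ GPU(s) for a 70B model")
```

Dropping from FP16 to FP8 halves the weight footprint (140 GB → 70 GB), turning a 70B model from a 2-GPU into a single-GPU deployment before KV cache is counted.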
Cost Optimization Strategies
- Right-size hardware: L40S/A10G for smaller models, H100/H200 for large
- Quantization: FP8 standard (1.2-1.5x throughput), INT4 for edge
- Batching efficiency: Maximize batch size within latency SLO
- Caching: Prefix caching, semantic caching for common queries
- Spot instances: For batch inference and other non-latency-sensitive workloads
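The $/1M-token target from Section 1 falls out of GPU price and sustained throughput (example numbers, not vendor quotes):

```python
# Serving cost per million tokens from hourly GPU cost and throughput.
def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_sec: float) -> float:
    hourly_cost = gpu_hourly_usd * num_gpus
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1e6

# e.g. 4 GPUs at $3/hr sustaining 5,000 tok/s aggregate
print(f"${cost_per_million_tokens(3.0, 4, 5000):.2f} per 1M tokens")
```

This makes the batching trade-off concrete: doubling sustained throughput at the same fleet size halves the per-token cost, as long as the latency SLO still holds.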
9. Key Trade-offs to Discuss
"Why disaggregate prefill/decode?"
- Pro: Different resource profiles (compute vs memory BW), independent scaling, hardware specialization
- Con: KV cache transfer overhead, orchestration complexity, 20-30% overhead for small workloads
- Decision: Worth it at scale (>100 GPUs) or when workload is heavily skewed to long outputs
"When would you use speculative decoding?"
- When target model is memory-bandwidth bound (decode phase)
- When you have a good draft model with >70% acceptance rate
- When latency matters more than throughput
- Skip if: Already compute-bound, or draft model acceptance is low
"TP vs PP decision?"
- TP: Within NVLink domain (900 GB/s), minimizes latency, requires all-reduce every layer
- PP: Across nodes (InfiniBand 400 Gbps), tolerates lower bandwidth, has pipeline bubbles
- Hybrid: TP=8 within node + PP across nodes for very large models
"How to handle KV cache at scale?"
- PagedAttention for memory efficiency
- Prefix caching for common system prompts
- FP8 quantization for 2x capacity
- Distributed KV stores for disaggregated architectures
- Offloading to CPU/NVMe for long context (latency trade-off)