
Multi-GPU LLM Serving Architecture

System design for serving Large Language Models across multiple GPUs, with parallelism strategies, memory optimization, and production patterns.

1. Requirements Gathering

Before diving into architecture, clarify these with the interviewer:

Functional Requirements

  • Which models and sizes? (e.g., 7B vs 70B vs 405B)
  • Chat completion, text completion, or embeddings?
  • Streaming responses required?
  • Maximum context length?

Non-Functional Requirements

  • Latency targets: time-to-first-token (TTFT) and inter-token latency (ITL)
  • Throughput: concurrent users, requests/sec
  • Availability target and cost constraints

Example Clarification

"Let's design for a 70B parameter model, 32K context, chat completion with streaming. Target: TTFT < 500ms, ITL < 50ms, 1000 concurrent users, 99.9% availability."

2. High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              LOAD BALANCER                                   │
│                    (Route by model, session affinity)                        │
└─────────────────────────────────┬───────────────────────────────────────────┘
                                  │
┌─────────────────────────────────▼───────────────────────────────────────────┐
│                            API GATEWAY                                       │
│         • Authentication    • Rate limiting    • Request validation          │
│         • Token counting    • Cost tracking    • Request queuing             │
└─────────────────────────────────┬───────────────────────────────────────────┘
                                  │
┌─────────────────────────────────▼───────────────────────────────────────────┐
│                         REQUEST SCHEDULER                                    │
│    • Continuous batching         • Priority queues (SLO-aware)              │
│    • Prefix cache lookup         • Load balancing across workers            │
└──────────────┬──────────────────────────────────────────┬───────────────────┘
               │                                          │
               ▼                                          ▼
┌──────────────────────────────┐          ┌──────────────────────────────┐
│       PREFILL WORKERS        │          │       DECODE WORKERS         │
│  (Compute-optimized GPUs)    │          │  (Memory-optimized GPUs)     │
│                              │          │                              │
│  ┌────────┐ ┌────────┐       │          │  ┌────────┐ ┌────────┐       │
│  │ GPU 0  │ │ GPU 1  │       │          │  │ GPU 0  │ │ GPU 1  │       │
│  │        │ │        │       │          │  │        │ │        │       │
│  │ TP = 2 │ │ TP = 2 │       │          │  │ TP = 2 │ │ TP = 2 │       │
│  └────┬───┘ └───┬────┘       │          │  └────┬───┘ └───┬────┘       │
│       └────┬────┘            │          │       └────┬────┘            │
│            │ NVLink          │          │            │ NVLink          │
│  ┌────────┐ ┌────────┐       │          │  ┌────────┐ ┌────────┐       │
│  │ GPU 2  │ │ GPU 3  │       │          │  │ GPU 2  │ │ GPU 3  │       │
│  └────────┘ └────────┘       │          │  └────────┘ └────────┘       │
└──────────────┬───────────────┘          └──────────────┬───────────────┘
               │                                          │
               └────────────────┬─────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────────────────┐
│                        KV CACHE TRANSFER LAYER                               │
│              • RDMA / NVLink for low-latency transfer                        │
│              • Distributed KV store for large-scale                          │
└─────────────────────────────────────────────────────────────────────────────┘
                

Component Responsibilities

| Component | Responsibility | Key Metrics |
|---|---|---|
| Load Balancer | Route requests, health checks, session affinity for streaming | Request latency, error rate |
| API Gateway | Auth, rate limiting, request validation, token counting | Requests/sec, rejection rate |
| Scheduler | Batch formation, prefix cache lookup, worker assignment | Queue depth, batch efficiency |
| Prefill Workers | Process input prompt, generate KV cache | TTFT, prefill tokens/sec |
| Decode Workers | Autoregressive token generation | ITL, decode tokens/sec |

3. Parallelism Strategies

Tensor Parallelism (TP)

Split individual layers across GPUs. Each GPU holds a slice of every layer's weights.

                    Single Layer with TP=4
    ┌─────────────────────────────────────────────────┐
    │                   Input Tensor                   │
    └─────────────────────────┬───────────────────────┘
                              │ Scatter
         ┌────────────┬───────┴──────┬─────────────┐
         │            │            │            │
    ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
    │  GPU 0   │ │  GPU 1   │ │  GPU 2   │ │  GPU 3   │
    │ W[0:25%] │ │W[25:50%] │ │W[50:75%] │ │W[75:100%]│
    └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
         │            │            │            │
         └────────────┴───┬────────┴────────────┘
                         │ All-Reduce (NVLink)
                         ▼
              ┌─────────────────────┐
              │    Output Tensor    │
              └─────────────────────┘
                
When to Use TP
  • Model doesn't fit in single GPU memory
  • Low latency requirements (all GPUs work on same token)
  • Within NVLink domain (high bandwidth all-reduce needed)
  • Typical: TP=2, 4, or 8 within a single node
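The all-reduce at the heart of TP can be made concrete with a minimal pure-Python sketch of a row-parallel linear layer (illustrative names, no real GPUs; real systems use torch.distributed/NCCL). The weight is split along the input dimension; each "rank" computes a partial matmul, and summing the partials is exactly what the all-reduce does:

```python
def matmul(a, b):
    # Naive dense matmul over lists of lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def row_parallel_linear(x, w, tp=2):
    """Split w along its input dim across `tp` ranks, sum partial outputs."""
    shard = len(w) // tp
    partials = []
    for rank in range(tp):
        x_shard = [row[rank * shard:(rank + 1) * shard] for row in x]  # scatter input
        w_shard = w[rank * shard:(rank + 1) * shard]                   # local weight slice
        partials.append(matmul(x_shard, w_shard))                      # local compute
    # "All-reduce": element-wise sum of the partial outputs
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]

x = [[1.0, 2.0, 3.0, 4.0]]
w = [[1.0], [1.0], [1.0], [1.0]]
assert row_parallel_linear(x, w, tp=2) == matmul(x, w)  # same result as unsharded
```

Because this sum happens once per layer, TP throughput is gated by interconnect bandwidth, which is why it stays inside the NVLink domain.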

Pipeline Parallelism (PP)

Assign different layer groups to different GPUs/nodes. Data flows through stages.

                    70B Model with PP=4

    Node 0              Node 1              Node 2              Node 3
    ┌─────────┐         ┌─────────┐         ┌─────────┐         ┌─────────┐
    │Layers   │ ──────► │Layers   │ ──────► │Layers   │ ──────► │Layers   │
    │ 0-19    │         │ 20-39   │         │ 40-59   │         │ 60-79   │
    │         │         │         │         │         │         │         │
    │ Stage 0 │         │ Stage 1 │         │ Stage 2 │         │ Stage 3 │
    └─────────┘         └─────────┘         └─────────┘         └─────────┘

    ────────────────────────────────────────────────────────────────────────►
                              Time / Token Flow
                
Pipeline Bubbles

PP has inherent inefficiency: when Stage 0 processes request B, Stages 1-3 are still processing request A. This creates "bubbles" of GPU idle time. Mitigate with micro-batching.
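The bubble overhead has a simple closed form for a GPipe-style schedule; a quick sketch (standard formula, not tied to any particular framework):

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    # In a GPipe-style schedule each stage is busy for `microbatches` slots
    # out of `microbatches + stages - 1` total slots, so the idle fraction is:
    return (stages - 1) / (microbatches + stages - 1)

# PP=4 with no micro-batching: 75% of stage-time is idle.
# PP=4 with 16 micro-batches: ~16% idle.
print(bubble_fraction(4, 1), bubble_fraction(4, 16))
```

This is why micro-batching is the standard mitigation: driving the micro-batch count well above the stage count shrinks the bubble toward zero.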

Hybrid Parallelism (TP + PP)

    405B Model: TP=8 within node, PP=4 across nodes

    ┌─────────────────────────────────────────────────────────────────────┐
    │ Node 0 (Layers 0-25)                                                │
    │ ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐  │
    │ │GPU 0  │GPU 1  │GPU 2  │GPU 3  │GPU 4  │GPU 5  │GPU 6  │GPU 7  │  │
    │ │       │       │       │       │       │       │       │       │  │
    │ └───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┘  │
    │     └───────┴───────┴───────┴───┬───┴───────┴───────┴───────┘      │
    │                           NVLink │ (900 GB/s)                       │
    └─────────────────────────────────┼───────────────────────────────────┘
                                      │ InfiniBand (400 Gbps)
    ┌─────────────────────────────────▼───────────────────────────────────┐
    │ Node 1 (Layers 26-50)                                               │
    │ ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐  │
    │ │GPU 0  │GPU 1  │GPU 2  │GPU 3  │GPU 4  │GPU 5  │GPU 6  │GPU 7  │  │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
                              (Nodes 2, 3...)
                
| Strategy | Best For | Communication | Limitation |
|---|---|---|---|
| Tensor Parallelism | Low latency, within node | All-reduce every layer | Requires high bandwidth (NVLink) |
| Pipeline Parallelism | Cross-node, memory-limited | Point-to-point between stages | Pipeline bubbles |
| Data Parallelism | High throughput | None (inference) | Model must fit per replica |
| Expert Parallelism | MoE models (Mixtral) | All-to-all routing | Load imbalance |

4. Memory Management (KV Cache)

The Problem

For each token generated, we must store Key and Value tensors for all previous tokens, across all layers and attention heads. This dominates GPU memory.

KV Cache Size Calculation

For Llama 70B with 80 layers, 64 heads, and 128 head_dim at FP16:

2 (K+V) × 80 layers × 64 heads × 128 dim × 2 bytes ≈ 2.6 MB per token

At the 32K context from our requirements, that is ~80 GB for a single full-length sequence. (In practice Llama 70B uses grouped-query attention with only 8 KV heads, cutting this by 8×, but the cache still dominates GPU memory at high concurrency.)
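The arithmetic generalizes to a one-line helper (parameter names are ours; sizes in bytes):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_el: int = 2) -> int:
    # 2 tensors (K and V), per layer, per KV head, per head_dim element,
    # per token, at bytes_per_el precision (2 for FP16, 1 for FP8).
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_el

mha = kv_cache_bytes(80, 64, 128, 32_768)  # 64 KV heads, full 32K context: ~80 GiB
gqa = kv_cache_bytes(80, 8, 128, 32_768)   # GQA with 8 KV heads: ~10 GiB
print(mha / 2**30, gqa / 2**30)
```

Plugging in FP8 (`bytes_per_el=1`) halves these figures again, which is the "2x capacity" lever mentioned later.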

PagedAttention (vLLM Innovation)

    Traditional KV Cache                    PagedAttention
    ─────────────────────                   ─────────────────────

    ┌─────────────────────┐                 ┌───┬───┬───┬───┬───┐
    │ Seq 1: 1500 tokens  │                 │ P1│ P2│ P3│ P4│ P5│ ← Seq 1 (pages)
    │ [allocated: 2048]   │                 └───┴───┴───┴───┴───┘
    │ [wasted: 548]       │
    └─────────────────────┘                 ┌───┬───┬───┐
    ┌─────────────────────┐                 │ P6│ P7│ P8│ ← Seq 2
    │ Seq 2: 800 tokens   │                 └───┴───┴───┘
    │ [allocated: 2048]   │
    │ [wasted: 1248]      │                 ┌───┬───┬───┬───┬───┬───┐
    └─────────────────────┘                 │ P9│P10│P11│P12│P13│P14│ ← Seq 3
    ┌─────────────────────┐                 └───┴───┴───┴───┴───┴───┘
    │ Seq 3: 2000 tokens  │
    │ [allocated: 2048]   │                 Pages allocated on demand
    │ [wasted: 48]        │                 No external fragmentation
    └─────────────────────┘                 Can share prefix pages!

    Total waste: ~45%                       Total waste: <5%
                
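A minimal sketch of the block-table bookkeeping behind this picture (hypothetical structure with 16-token pages; vLLM's real allocator additionally handles copy-on-write and prefix sharing):

```python
class BlockAllocator:
    """Allocate fixed-size KV pages on demand instead of max-length slabs."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}                       # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, pos: int) -> None:
        if pos % self.block_size == 0:         # current page is full: grab a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id: str) -> None:    # sequence finished: recycle its pages
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                          # a 40-token sequence...
    alloc.append_token("seq1", pos)
assert len(alloc.tables["seq1"]) == 3          # ...occupies 3 pages, not a 2048-slot slab
```

The attention kernel then gathers K/V through the per-sequence block table, so pages need not be contiguous in GPU memory.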

Prefix Caching

When multiple requests share a common prefix (system prompt, few-shot examples), cache and reuse the KV for that prefix.

    System Prompt: "You are a helpful assistant..." (500 tokens)

    Request 1: [System Prompt] + "What is 2+2?"
    Request 2: [System Prompt] + "Explain quantum computing"
    Request 3: [System Prompt] + "Write a poem"

    ┌─────────────────────────────┐
    │    Cached Prefix KV         │ ← Computed once
    │    (500 tokens)             │
    └──────────────┬──────────────┘
                   │ Shared
         ┌─────────┼─────────┐
         ▼         ▼         ▼
    ┌─────────┐ ┌─────────┐ ┌─────────┐
    │ Req 1   │ │ Req 2   │ │ Req 3   │
    │ unique  │ │ unique  │ │ unique  │
    └─────────┘ └─────────┘ └─────────┘

    Savings: 3× reduction in prefill compute for shared prefix
                
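The lookup can be sketched as a hash-keyed store (illustrative structure; vLLM's automatic prefix caching actually hashes per 16-token block rather than whole prefixes):

```python
class PrefixCache:
    def __init__(self):
        self.store = {}                       # prefix key -> cached KV (opaque here)
        self.computes = 0                     # how many times prefill actually ran

    def get_or_compute(self, prefix_tokens):
        key = tuple(prefix_tokens)            # the token ids themselves form the key
        if key not in self.store:
            self.computes += 1                # cache miss: run prefill for the prefix
            self.store[key] = f"KV({len(prefix_tokens)} tokens)"
        return self.store[key]

cache = PrefixCache()
system_prompt = list(range(500))              # stand-in for a 500-token system prompt
for _ in range(3):                            # three requests share the prefix...
    cache.get_or_compute(system_prompt)
assert cache.computes == 1                    # ...but prefill runs only once
```

Combined with PagedAttention, the cached pages can be physically shared across sequences rather than copied.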

5. Batching Strategies

Static vs Continuous Batching

    STATIC BATCHING
    ═══════════════════════════════════════════════════════════════════

    Time →  ─────────────────────────────────────────────────────────►

    Batch 1 ████████████████████████████████████████████████
            │ Req A (100 tokens) │ Req B (50 tokens)   │ Req C (200 tokens)
            │                    │ ░░░░░░░░░░░░░░░░░░░ │  (waiting)
            │                    │     (padding)        │
            └────────────────────┴──────────────────────┘
                                                        ▲
                                                        │
                                            All must wait for longest


    CONTINUOUS BATCHING
    ═══════════════════════════════════════════════════════════════════

    Time →  ─────────────────────────────────────────────────────────►

            ████████████████████████████████████████████████████████
    Req A   │ ████████████████████ │ (done, exits batch)
    Req B   │ ██████████ │ (done)  │
    Req C   │            │ █████████████████████████████████████████
    Req D   │            │         │ ████████████████████████████████
    Req E   │            │         │         │ ████████████████████████
            └────────────┴─────────┴─────────┴────────────────────────
                         ▲         ▲         ▲
                         │         │         │
                    New requests join as slots free up
                
| Aspect | Static Batching | Continuous Batching |
|---|---|---|
| Latency | Higher (wait for batch + longest seq) | Lower (immediate insertion) |
| Throughput | Lower (padding waste) | Higher (no padding) |
| Memory | Fixed, predictable | Dynamic, needs careful management |
| Complexity | Simple | Complex scheduling |
Industry Standard (2025)

Continuous batching is now default in all major serving frameworks (vLLM, TensorRT-LLM, TGI). Static batching is only used for specific batch inference workloads.
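A toy scheduler loop makes the slot-reuse behavior concrete (request "lengths" are decode-step counts; names are ours, not any framework's API):

```python
from collections import deque

def decode_steps(lengths, max_batch):
    """Run a continuous-batching loop; return total decode steps needed."""
    queue = deque(enumerate(lengths))
    batch = {}                                      # req id -> tokens remaining
    steps = 0
    while queue or batch:
        while queue and len(batch) < max_batch:     # fill freed slots immediately
            rid, n = queue.popleft()
            batch[rid] = n
        # Each active request emits one token; finished requests exit the batch.
        batch = {r: n - 1 for r, n in batch.items() if n > 1}
        steps += 1
    return steps

# Three requests of 3, 1, and 2 output tokens on 2 slots finish in 3 steps;
# static batching ([3,1] padded to 3 steps, then [2]) would take 5.
assert decode_steps([3, 1, 2], max_batch=2) == 3
```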

6. Prefill/Decode Disaggregation

Why Disaggregate?

| Phase | Characteristic | Bottleneck | Optimal Hardware |
|---|---|---|---|
| Prefill | Process all input tokens in parallel | Compute (matrix multiply) | High FLOPS (H100, TPU) |
| Decode | Generate one token at a time | Memory bandwidth | High HBM bandwidth (H200, MI300X) |

    DISAGGREGATED ARCHITECTURE
    ═══════════════════════════════════════════════════════════════════

                     ┌─────────────────────┐
                     │   Request Router    │
                     └──────────┬──────────┘
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
    ┌─────────────────────┐             ┌─────────────────────┐
    │   PREFILL CLUSTER   │             │   DECODE CLUSTER    │
    │                     │             │                     │
    │  • H100 GPUs        │             │  • H200 / MI300X    │
    │  • High FLOPS       │   KV Cache  │  • High HBM BW      │
    │  • Fewer GPUs       │ ──────────► │  • More GPUs        │
    │  • Process prompts  │   Transfer  │  • Token generation │
    │                     │             │                     │
    └─────────────────────┘             └─────────────────────┘

    Benefits:
    • Scale prefill and decode independently
    • Match hardware to workload characteristics
    • Better overall resource utilization

    Challenges:
    • KV cache transfer latency (need RDMA/NVLink)
    • Orchestration complexity
    • 20-30% overhead for small requests
                
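A back-of-envelope estimate of the KV hand-off latency shows why the transfer link matters (assumed link speeds; real systems also overlap transfer with decode):

```python
def transfer_ms(kv_gib: float, link_gb_per_s: float) -> float:
    """Milliseconds to move kv_gib GiB over a link of link_gb_per_s GB/s."""
    return kv_gib * 1024**3 / (link_gb_per_s * 1e9) * 1e3

# 5 GiB of KV cache (e.g., a long prompt at FP16):
nvlink = transfer_ms(5, 900)      # NVLink ~900 GB/s:  ~6 ms
infiniband = transfer_ms(5, 50)   # 400 Gbps ≈ 50 GB/s: ~107 ms
print(nvlink, infiniband)
```

At NVLink speeds the transfer is negligible next to a 500 ms TTFT budget; over commodity networking it is not, which is where the 20-30% overhead for small requests comes from.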

7. Speculative Decoding

Use a small "draft" model to predict multiple tokens, then verify with the target model in parallel.

    SPECULATIVE DECODING FLOW
    ═══════════════════════════════════════════════════════════════════

    Draft Model (7B)              Target Model (70B)
    ────────────────              ─────────────────

    Step 1: Draft predicts K=4 tokens

    "The capital of France"  →  ["is", "Paris", ",", "which"]
                                   t1    t2     t3    t4

    Step 2: Target verifies ALL 4 in parallel (single forward pass)

    Target checks (simplified; real implementations accept/reject each
    draft token by comparing target and draft probabilities):
    • P(is | context)?                  ✓ Accept
    • P(Paris | context + is)?          ✓ Accept
    • P(, | context + is + Paris)?      ✓ Accept
    • P(which | ...)?                   ✗ Reject → target samples "a" instead

    Step 3: Accept longest matching prefix + generate next token

    Output: "is", "Paris", ",", "a"  (4 tokens from a single target pass!)

    ─────────────────────────────────────────────────────────────────

    Traditional: 4 tokens = 4 target-model forward passes
    Speculative: 4 tokens = 1 target-model pass (+ 4 cheap draft passes)

    Speedup: Up to 2-3x when draft model matches well
                
When Speculative Decoding Helps
  • Target model is memory-bandwidth bound (decode phase)
  • Draft model has high acceptance rate (>70%)
  • Draft model is 5-10x smaller than target

Trade-off: Increases compute (running two models) to reduce latency. Not helpful if already compute-bound.
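The trade-off can be quantified: with K draft tokens and a per-token acceptance rate α, the expected output per target forward pass is 1 + α + α² + … + α^K (the standard i.i.d.-acceptance simplification from the speculative sampling literature):

```python
def tokens_per_target_pass(alpha: float, k: int) -> float:
    # Expected tokens emitted per target forward pass, assuming each of the
    # k draft tokens is independently accepted with probability alpha:
    # sum_{i=0..k} alpha^i  =  (1 - alpha^(k+1)) / (1 - alpha)
    return sum(alpha ** i for i in range(k + 1))

# alpha = 0.8, K = 4: ~3.36 tokens per target pass (~3.4x fewer target passes).
# alpha = 0.3, K = 4: ~1.43 — barely worth running the draft model at all.
print(tokens_per_target_pass(0.8, 4), tokens_per_target_pass(0.3, 4))
```

This is why the >70% acceptance-rate rule of thumb above matters: the speedup collapses quickly as α drops.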

8. Scaling Considerations

| Scale | Primary Challenge | Solution Pattern |
|---|---|---|
| Single GPU | Model fit | Quantization (FP8/INT4), KV cache compression |
| Single Node (8 GPU) | Memory bandwidth | TP, continuous batching, PagedAttention |
| Multi-Node (10-100) | Network bandwidth | PP, EP, prefill/decode disaggregation |
| Large Scale (100+) | Coordination, tail latency | Hierarchical routing, redundancy, caching |

Cost Optimization Strategies

  1. Right-size hardware: L40S/A10G for smaller models, H100/H200 for large
  2. Quantization: FP8 standard (1.2-1.5x throughput), INT4 for edge
  3. Batching efficiency: Maximize batch size within latency SLO
  4. Caching: Prefix caching, semantic caching for common queries
  5. Spot instances: For batch inference and non-latency-sensitive
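The quantization math for right-sizing is worth having at your fingertips (rule of thumb: 1B params ≈ 1 GB per byte per parameter; this ignores activation and KV cache memory):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    # 1e9 params × (bits / 8) bytes each ≈ GB of weight memory
    return params_billion * bits / 8

# 70B model: FP16 → 140 GB (2× H100-80GB just for weights),
#            FP8  →  70 GB, INT4 → 35 GB (one GPU, with headroom for KV cache)
print(weight_gb(70, 16), weight_gb(70, 8), weight_gb(70, 4))
```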

9. Key Trade-offs to Discuss

"Why disaggregate prefill/decode?"
  • Pro: Different resource profiles (compute vs memory BW), independent scaling, hardware specialization
  • Con: KV cache transfer overhead, orchestration complexity, 20-30% overhead for small workloads
  • Decision: Worth it at scale (>100 GPUs) or when workload is heavily skewed to long outputs
"When would you use speculative decoding?"
  • When target model is memory-bandwidth bound (decode phase)
  • When you have a good draft model with >70% acceptance rate
  • When latency matters more than throughput
  • Skip if: Already compute-bound, or draft model acceptance is low
"TP vs PP decision?"
  • TP: Within NVLink domain (900 GB/s), minimizes latency, requires all-reduce every layer
  • PP: Across nodes (InfiniBand 400 Gbps), tolerates lower bandwidth, has pipeline bubbles
  • Hybrid: TP=8 within node + PP across nodes for very large models
"How to handle KV cache at scale?"
  • PagedAttention for memory efficiency
  • Prefix caching for common system prompts
  • FP8 quantization for 2x capacity
  • Distributed KV stores for disaggregated architectures
  • Offloading to CPU/NVMe for long context (latency trade-off)