
Multi-GPU LLM Serving Architecture

System design for serving Large Language Models across multiple GPUs, with parallelism strategies, memory optimization, and production patterns.

1. Requirements Gathering

Before diving into architecture, clarify these with the interviewer:

Functional Requirements

  • Which models and sizes? (e.g., 7B vs 70B vs 405B)
  • Chat completion, text completion, or embeddings?
  • Streaming responses required?
  • Maximum context length?

Non-Functional Requirements

  • Latency targets: time-to-first-token (TTFT) and inter-token latency (ITL)
  • Throughput: concurrent users, requests/sec
  • Availability target and cost constraints

Example Clarification

"Let's design for a 70B parameter model, 32K context, chat completion with streaming. Target: TTFT < 500ms, ITL < 50ms, 1000 concurrent users, 99.9% availability."

2. High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              LOAD BALANCER                                   │
│                    (Route by model, session affinity)                        │
└─────────────────────────────────┬───────────────────────────────────────────┘
                                  │
┌─────────────────────────────────▼───────────────────────────────────────────┐
│                            API GATEWAY                                       │
│         • Authentication    • Rate limiting    • Request validation          │
│         • Token counting    • Cost tracking    • Request queuing             │
└─────────────────────────────────┬───────────────────────────────────────────┘
                                  │
┌─────────────────────────────────▼───────────────────────────────────────────┐
│                         REQUEST SCHEDULER                                    │
│    • Continuous batching         • Priority queues (SLO-aware)              │
│    • Prefix cache lookup         • Load balancing across workers            │
└──────────────┬──────────────────────────────────────────┬───────────────────┘
               │                                          │
               ▼                                          ▼
┌──────────────────────────────┐          ┌──────────────────────────────┐
│       PREFILL WORKERS        │          │       DECODE WORKERS         │
│  (Compute-optimized GPUs)    │          │  (Memory-optimized GPUs)     │
│                              │          │                              │
│  ┌────────┐ ┌────────┐       │          │  ┌────────┐ ┌────────┐       │
│  │ GPU 0  │ │ GPU 1  │       │          │  │ GPU 0  │ │ GPU 1  │       │
│  │        │ │        │       │          │  │        │ │        │       │
│  │ TP = 2 │ │ TP = 2 │       │          │  │ TP = 2 │ │ TP = 2 │       │
│  └────┬───┘ └───┬────┘       │          │  └────┬───┘ └───┬────┘       │
│       └────┬────┘            │          │       └────┬────┘            │
│            │ NVLink          │          │            │ NVLink          │
│  ┌────────┐ ┌────────┐       │          │  ┌────────┐ ┌────────┐       │
│  │ GPU 2  │ │ GPU 3  │       │          │  │ GPU 2  │ │ GPU 3  │       │
│  └────────┘ └────────┘       │          │  └────────┘ └────────┘       │
└──────────────┬───────────────┘          └──────────────┬───────────────┘
               │                                          │
               └────────────────┬─────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────────────────┐
│                        KV CACHE TRANSFER LAYER                               │
│              • RDMA / NVLink for low-latency transfer                        │
│              • Distributed KV store for large-scale                          │
└─────────────────────────────────────────────────────────────────────────────┘
                

Component Responsibilities

| Component | Responsibility | Key Metrics |
|---|---|---|
| Load Balancer | Route requests, health checks, session affinity for streaming | Request latency, error rate |
| API Gateway | Auth, rate limiting, request validation, token counting | Requests/sec, rejection rate |
| Scheduler | Batch formation, prefix cache lookup, worker assignment | Queue depth, batch efficiency |
| Prefill Workers | Process input prompt, generate KV cache | TTFT, prefill tokens/sec |
| Decode Workers | Autoregressive token generation | ITL, decode tokens/sec |

3. Parallelism Strategies

Tensor Parallelism (TP)

Split individual layers across GPUs. Each GPU holds a slice of every layer's weights.

                    Single Layer with TP=4
    ┌─────────────────────────────────────────────────┐
    │                   Input Tensor                   │
    └─────────────────────────┬───────────────────────┘
                              │ Scatter
         ┌────────────┬───────┴──────┬─────────────┐
         │            │            │            │
    ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
    │  GPU 0   │ │  GPU 1   │ │  GPU 2   │ │  GPU 3   │
    │ W[0:25%] │ │W[25:50%] │ │W[50:75%] │ │W[75:100%]│
    └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
         │            │            │            │
         └────────────┴───┬────────┴────────────┘
                         │ All-Reduce (NVLink)
                         ▼
              ┌─────────────────────┐
              │    Output Tensor    │
              └─────────────────────┘
                
When to Use TP
  • Model doesn't fit in single GPU memory
  • Low latency requirements (all GPUs work on same token)
  • Within NVLink domain (high bandwidth all-reduce needed)
  • Typical: TP=2, 4, or 8 within a single node
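The all-reduce at the heart of TP can be made concrete with a minimal pure-Python sketch of a row-parallel linear layer (illustrative names, no real GPUs; real systems use torch.distributed/NCCL). The weight is split along the input dimension; each "rank" computes a partial matmul, and summing the partials is exactly what the all-reduce does:

```python
def matmul(a, b):
    # Naive dense matmul over lists of lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def row_parallel_linear(x, w, tp=2):
    """Split w along its input dim across `tp` ranks, sum partial outputs."""
    shard = len(w) // tp
    partials = []
    for rank in range(tp):
        x_shard = [row[rank * shard:(rank + 1) * shard] for row in x]  # scatter input
        w_shard = w[rank * shard:(rank + 1) * shard]                   # local weight slice
        partials.append(matmul(x_shard, w_shard))                      # local compute
    # "All-reduce": element-wise sum of the partial outputs
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]

x = [[1.0, 2.0, 3.0, 4.0]]
w = [[1.0], [1.0], [1.0], [1.0]]
assert row_parallel_linear(x, w, tp=2) == matmul(x, w)  # same result as unsharded
```

Because this sum happens once per layer, TP throughput is gated by interconnect bandwidth, which is why it stays inside the NVLink domain.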

Pipeline Parallelism (PP)

Assign different layer groups to different GPUs/nodes. Data flows through stages.

                    70B Model with PP=4

    Node 0              Node 1              Node 2              Node 3
    ┌─────────┐         ┌─────────┐         ┌─────────┐         ┌─────────┐
    │Layers   │ ──────► │Layers   │ ──────► │Layers   │ ──────► │Layers   │
    │ 0-19    │         │ 20-39   │         │ 40-59   │         │ 60-79   │
    │         │         │         │         │         │         │         │
    │ Stage 0 │         │ Stage 1 │         │ Stage 2 │         │ Stage 3 │
    └─────────┘         └─────────┘         └─────────┘         └─────────┘

    ────────────────────────────────────────────────────────────────────────►
                              Time / Token Flow
                
Pipeline Bubbles

PP has inherent inefficiency: when Stage 0 processes request B, Stages 1-3 are still processing request A. This creates "bubbles" of GPU idle time. Mitigate with micro-batching.
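The bubble overhead has a simple closed form for a GPipe-style schedule; a quick sketch (standard formula, not tied to any particular framework):

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    # In a GPipe-style schedule each stage is busy for `microbatches` slots
    # out of `microbatches + stages - 1` total slots, so the idle fraction is:
    return (stages - 1) / (microbatches + stages - 1)

# PP=4 with no micro-batching: 75% of stage-time is idle.
# PP=4 with 16 micro-batches: ~16% idle.
print(bubble_fraction(4, 1), bubble_fraction(4, 16))
```

This is why micro-batching is the standard mitigation: driving the micro-batch count well above the stage count shrinks the bubble toward zero.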

Hybrid Parallelism (TP + PP)

    405B Model: TP=8 within node, PP=4 across nodes

    ┌─────────────────────────────────────────────────────────────────────┐
    │ Node 0 (Layers 0-25)                                                │
    │ ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐  │
    │ │GPU 0  │GPU 1  │GPU 2  │GPU 3  │GPU 4  │GPU 5  │GPU 6  │GPU 7  │  │
    │ │       │       │       │       │       │       │       │       │  │
    │ └───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┴───┬───┘  │
    │     └───────┴───────┴───────┴───┬───┴───────┴───────┴───────┘      │
    │                           NVLink │ (900 GB/s)                       │
    └─────────────────────────────────┼───────────────────────────────────┘
                                      │ InfiniBand (400 Gbps)
    ┌─────────────────────────────────▼───────────────────────────────────┐
    │ Node 1 (Layers 26-50)                                               │
    │ ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐  │
    │ │GPU 0  │GPU 1  │GPU 2  │GPU 3  │GPU 4  │GPU 5  │GPU 6  │GPU 7  │  │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
                              (Nodes 2, 3...)
                
| Strategy | Best For | Communication | Limitation |
|---|---|---|---|
| Tensor Parallelism | Low latency, within node | All-reduce every layer | Requires high bandwidth (NVLink) |
| Pipeline Parallelism | Cross-node, memory-limited | Point-to-point between stages | Pipeline bubbles |
| Data Parallelism | High throughput | None (inference) | Model must fit per replica |
| Expert Parallelism | MoE models (Mixtral) | All-to-all routing | Load imbalance |

4. Memory Management (KV Cache)

The Problem

For each token generated, we must store Key and Value tensors for all previous tokens, across all layers and attention heads. This dominates GPU memory.

KV Cache Size Calculation

For Llama 70B with 80 layers, 64 heads, and 128 head_dim at FP16:

2 (K+V) × 80 layers × 64 heads × 128 dim × 2 bytes ≈ 2.6 MB per token

At the 32K context from our requirements, that is ~80 GB for a single full-length sequence. (In practice Llama 70B uses grouped-query attention with only 8 KV heads, cutting this by 8×, but the cache still dominates GPU memory at high concurrency.)
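The arithmetic generalizes to a one-line helper (parameter names are ours; sizes in bytes):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_el: int = 2) -> int:
    # 2 tensors (K and V), per layer, per KV head, per head_dim element,
    # per token, at bytes_per_el precision (2 for FP16, 1 for FP8).
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_el

mha = kv_cache_bytes(80, 64, 128, 32_768)  # 64 KV heads, full 32K context: ~80 GiB
gqa = kv_cache_bytes(80, 8, 128, 32_768)   # GQA with 8 KV heads: ~10 GiB
print(mha / 2**30, gqa / 2**30)
```

Plugging in FP8 (`bytes_per_el=1`) halves these figures again, which is the "2x capacity" lever mentioned later.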

PagedAttention (vLLM Innovation)

    Traditional KV Cache                    PagedAttention
    ─────────────────────                   ─────────────────────

    ┌─────────────────────┐                 ┌───┬───┬───┬───┬───┐
    │ Seq 1: 1500 tokens  │                 │ P1│ P2│ P3│ P4│ P5│ ← Seq 1 (pages)
    │ [allocated: 2048]   │                 └───┴───┴───┴───┴───┘
    │ [wasted: 548]       │
    └─────────────────────┘                 ┌───┬───┬───┐
    ┌─────────────────────┐                 │ P6│ P7│ P8│ ← Seq 2
    │ Seq 2: 800 tokens   │                 └───┴───┴───┘
    │ [allocated: 2048]   │
    │ [wasted: 1248]      │                 ┌───┬───┬───┬───┬───┬───┐
    └─────────────────────┘                 │ P9│P10│P11│P12│P13│P14│ ← Seq 3
    ┌─────────────────────┐                 └───┴───┴───┴───┴───┴───┘
    │ Seq 3: 2000 tokens  │
    │ [allocated: 2048]   │                 Pages allocated on demand
    │ [wasted: 48]        │                 No external fragmentation
    └─────────────────────┘                 Can share prefix pages!

    Total waste: ~45%                       Total waste: <5%
                
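A minimal sketch of the block-table bookkeeping behind this picture (hypothetical structure with 16-token pages; vLLM's real allocator additionally handles copy-on-write and prefix sharing):

```python
class BlockAllocator:
    """Allocate fixed-size KV pages on demand instead of max-length slabs."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}                       # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, pos: int) -> None:
        if pos % self.block_size == 0:         # current page is full: grab a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id: str) -> None:    # sequence finished: recycle its pages
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                          # a 40-token sequence...
    alloc.append_token("seq1", pos)
assert len(alloc.tables["seq1"]) == 3          # ...occupies 3 pages, not a 2048-slot slab
```

The attention kernel then gathers K/V through the per-sequence block table, so pages need not be contiguous in GPU memory.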

Prefix Caching

When multiple requests share a common prefix (system prompt, few-shot examples), cache and reuse the KV for that prefix.

    System Prompt: "You are a helpful assistant..." (500 tokens)

    Request 1: [System Prompt] + "What is 2+2?"
    Request 2: [System Prompt] + "Explain quantum computing"
    Request 3: [System Prompt] + "Write a poem"

    ┌─────────────────────────────┐
    │    Cached Prefix KV         │ ← Computed once
    │    (500 tokens)             │
    └──────────────┬──────────────┘
                   │ Shared
         ┌─────────┼─────────┐
         ▼         ▼         ▼
    ┌─────────┐ ┌─────────┐ ┌─────────┐
    │ Req 1   │ │ Req 2   │ │ Req 3   │
    │ unique  │ │ unique  │ │ unique  │
    └─────────┘ └─────────┘ └─────────┘

    Savings: 3× reduction in prefill compute for shared prefix
                
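The lookup can be sketched as a hash-keyed store (illustrative structure; vLLM's automatic prefix caching actually hashes per 16-token block rather than whole prefixes):

```python
class PrefixCache:
    def __init__(self):
        self.store = {}                       # prefix key -> cached KV (opaque here)
        self.computes = 0                     # how many times prefill actually ran

    def get_or_compute(self, prefix_tokens):
        key = tuple(prefix_tokens)            # the token ids themselves form the key
        if key not in self.store:
            self.computes += 1                # cache miss: run prefill for the prefix
            self.store[key] = f"KV({len(prefix_tokens)} tokens)"
        return self.store[key]

cache = PrefixCache()
system_prompt = list(range(500))              # stand-in for a 500-token system prompt
for _ in range(3):                            # three requests share the prefix...
    cache.get_or_compute(system_prompt)
assert cache.computes == 1                    # ...but prefill runs only once
```

Combined with PagedAttention, the cached pages can be physically shared across sequences rather than copied.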

5. Batching Strategies

Static vs Continuous Batching

    STATIC BATCHING
    ═══════════════════════════════════════════════════════════════════

    Time →  ─────────────────────────────────────────────────────────►

    Batch 1 ████████████████████████████████████████████████
            │ Req A (100 tokens) │ Req B (50 tokens)   │ Req C (200 tokens)
            │                    │ ░░░░░░░░░░░░░░░░░░░ │  (waiting)
            │                    │     (padding)        │
            └────────────────────┴──────────────────────┘
                                                        ▲
                                                        │
                                            All must wait for longest


    CONTINUOUS BATCHING
    ═══════════════════════════════════════════════════════════════════

    Time →  ─────────────────────────────────────────────────────────►

            ████████████████████████████████████████████████████████
    Req A   │ ████████████████████ │ (done, exits batch)
    Req B   │ ██████████ │ (done)  │
    Req C   │            │ █████████████████████████████████████████
    Req D   │            │         │ ████████████████████████████████
    Req E   │            │         │         │ ████████████████████████
            └────────────┴─────────┴─────────┴────────────────────────
                         ▲         ▲         ▲
                         │         │         │
                    New requests join as slots free up
                
| Aspect | Static Batching | Continuous Batching |
|---|---|---|
| Latency | Higher (wait for batch + longest seq) | Lower (immediate insertion) |
| Throughput | Lower (padding waste) | Higher (no padding) |
| Memory | Fixed, predictable | Dynamic, needs careful management |
| Complexity | Simple | Complex scheduling |
Industry Standard (2025)

Continuous batching is now default in all major serving frameworks (vLLM, TensorRT-LLM, TGI). Static batching is only used for specific batch inference workloads.
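A toy scheduler loop makes the slot-reuse behavior concrete (request "lengths" are decode-step counts; names are ours, not any framework's API):

```python
from collections import deque

def decode_steps(lengths, max_batch):
    """Run a continuous-batching loop; return total decode steps needed."""
    queue = deque(enumerate(lengths))
    batch = {}                                      # req id -> tokens remaining
    steps = 0
    while queue or batch:
        while queue and len(batch) < max_batch:     # fill freed slots immediately
            rid, n = queue.popleft()
            batch[rid] = n
        # Each active request emits one token; finished requests exit the batch.
        batch = {r: n - 1 for r, n in batch.items() if n > 1}
        steps += 1
    return steps

# Three requests of 3, 1, and 2 output tokens on 2 slots finish in 3 steps;
# static batching ([3,1] padded to 3 steps, then [2]) would take 5.
assert decode_steps([3, 1, 2], max_batch=2) == 3
```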

6. Prefill/Decode Disaggregation

Why Disaggregate?

| Phase | Characteristic | Bottleneck | Optimal Hardware |
|---|---|---|---|
| Prefill | Process all input tokens in parallel | Compute (matrix multiply) | High FLOPS (H100, TPU) |
| Decode | Generate one token at a time | Memory bandwidth | High HBM bandwidth (H200, MI300X) |

    DISAGGREGATED ARCHITECTURE
    ═══════════════════════════════════════════════════════════════════

                     ┌─────────────────────┐
                     │   Request Router    │
                     └──────────┬──────────┘
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
    ┌─────────────────────┐             ┌─────────────────────┐
    │   PREFILL CLUSTER   │             │   DECODE CLUSTER    │
    │                     │             │                     │
    │  • H100 GPUs        │             │  • H200 / MI300X    │
    │  • High FLOPS       │   KV Cache  │  • High HBM BW      │
    │  • Fewer GPUs       │ ──────────► │  • More GPUs        │
    │  • Process prompts  │   Transfer  │  • Token generation │
    │                     │             │                     │
    └─────────────────────┘             └─────────────────────┘

    Benefits:
    • Scale prefill and decode independently
    • Match hardware to workload characteristics
    • Better overall resource utilization

    Challenges:
    • KV cache transfer latency (need RDMA/NVLink)
    • Orchestration complexity
    • 20-30% overhead for small requests
                
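A back-of-envelope estimate of the KV hand-off latency shows why the transfer link matters (assumed link speeds; real systems also overlap transfer with decode):

```python
def transfer_ms(kv_gib: float, link_gb_per_s: float) -> float:
    """Milliseconds to move kv_gib GiB over a link of link_gb_per_s GB/s."""
    return kv_gib * 1024**3 / (link_gb_per_s * 1e9) * 1e3

# 5 GiB of KV cache (e.g., a long prompt at FP16):
nvlink = transfer_ms(5, 900)      # NVLink ~900 GB/s:  ~6 ms
infiniband = transfer_ms(5, 50)   # 400 Gbps ≈ 50 GB/s: ~107 ms
print(nvlink, infiniband)
```

At NVLink speeds the transfer is negligible next to a 500 ms TTFT budget; over commodity networking it is not, which is where the 20-30% overhead for small requests comes from.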

7. Speculative Decoding

Use a small "draft" model to predict multiple tokens, then verify with the target model in parallel.

    SPECULATIVE DECODING FLOW
    ═══════════════════════════════════════════════════════════════════

    Draft Model (7B)              Target Model (70B)
    ────────────────              ─────────────────

    Step 1: Draft predicts K=4 tokens

    "The capital of France"  →  ["is", "Paris", ",", "which"]
                                   t1    t2     t3    t4

    Step 2: Target verifies ALL 4 in parallel (single forward pass)

    Target checks (simplified; real implementations accept/reject each
    draft token by comparing target and draft probabilities):
    • P(is | context)?                  ✓ Accept
    • P(Paris | context + is)?          ✓ Accept
    • P(, | context + is + Paris)?      ✓ Accept
    • P(which | ...)?                   ✗ Reject → target samples "a" instead

    Step 3: Accept longest matching prefix + generate next token

    Output: "is", "Paris", ",", "a"  (4 tokens from a single target pass!)

    ─────────────────────────────────────────────────────────────────

    Traditional: 4 tokens = 4 target-model forward passes
    Speculative: 4 tokens = 1 target-model pass (+ 4 cheap draft passes)

    Speedup: Up to 2-3x when draft model matches well
                
When Speculative Decoding Helps
  • Target model is memory-bandwidth bound (decode phase)
  • Draft model has high acceptance rate (>70%)
  • Draft model is 5-10x smaller than target

Trade-off: Increases compute (running two models) to reduce latency. Not helpful if already compute-bound.
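The trade-off can be quantified: with K draft tokens and a per-token acceptance rate α, the expected output per target forward pass is 1 + α + α² + … + α^K (the standard i.i.d.-acceptance simplification from the speculative sampling literature):

```python
def tokens_per_target_pass(alpha: float, k: int) -> float:
    # Expected tokens emitted per target forward pass, assuming each of the
    # k draft tokens is independently accepted with probability alpha:
    # sum_{i=0..k} alpha^i  =  (1 - alpha^(k+1)) / (1 - alpha)
    return sum(alpha ** i for i in range(k + 1))

# alpha = 0.8, K = 4: ~3.36 tokens per target pass (~3.4x fewer target passes).
# alpha = 0.3, K = 4: ~1.43 — barely worth running the draft model at all.
print(tokens_per_target_pass(0.8, 4), tokens_per_target_pass(0.3, 4))
```

This is why the >70% acceptance-rate rule of thumb above matters: the speedup collapses quickly as α drops.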

8. Scaling Considerations

| Scale | Primary Challenge | Solution Pattern |
|---|---|---|
| Single GPU | Model fit | Quantization (FP8/INT4), KV cache compression |
| Single Node (8 GPU) | Memory bandwidth | TP, continuous batching, PagedAttention |
| Multi-Node (10-100) | Network bandwidth | PP, EP, prefill/decode disaggregation |
| Large Scale (100+) | Coordination, tail latency | Hierarchical routing, redundancy, caching |

Cost Optimization Strategies

  1. Right-size hardware: L40S/A10G for smaller models, H100/H200 for large
  2. Quantization: FP8 standard (1.2-1.5x throughput), INT4 for edge
  3. Batching efficiency: Maximize batch size within latency SLO
  4. Caching: Prefix caching, semantic caching for common queries
  5. Spot instances: For batch inference and non-latency-sensitive
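The quantization math for right-sizing is worth having at your fingertips (rule of thumb: 1B params ≈ 1 GB per byte per parameter; this ignores activation and KV cache memory):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    # 1e9 params × (bits / 8) bytes each ≈ GB of weight memory
    return params_billion * bits / 8

# 70B model: FP16 → 140 GB (2× H100-80GB just for weights),
#            FP8  →  70 GB, INT4 → 35 GB (one GPU, with headroom for KV cache)
print(weight_gb(70, 16), weight_gb(70, 8), weight_gb(70, 4))
```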

9. Key Trade-offs to Discuss

"Why disaggregate prefill/decode?"
  • Pro: Different resource profiles (compute vs memory BW), independent scaling, hardware specialization
  • Con: KV cache transfer overhead, orchestration complexity, 20-30% overhead for small workloads
  • Decision: Worth it at scale (>100 GPUs) or when workload is heavily skewed to long outputs
"When would you use speculative decoding?"
  • When target model is memory-bandwidth bound (decode phase)
  • When you have a good draft model with >70% acceptance rate
  • When latency matters more than throughput
  • Skip if: Already compute-bound, or draft model acceptance is low
"TP vs PP decision?"
  • TP: Within NVLink domain (900 GB/s), minimizes latency, requires all-reduce every layer
  • PP: Across nodes (InfiniBand 400 Gbps), tolerates lower bandwidth, has pipeline bubbles
  • Hybrid: TP=8 within node + PP across nodes for very large models
"How to handle KV cache at scale?"
  • PagedAttention for memory efficiency
  • Prefix caching for common system prompts
  • FP8 quantization for 2x capacity
  • Distributed KV stores for disaggregated architectures
  • Offloading to CPU/NVMe for long context (latency trade-off)