
Multi-Architecture LLM Serving

Designing systems that serve LLMs across Google TPU, NVIDIA GPU, and AMD GPU with unified APIs and optimal resource utilization.

The Challenge

Modern cloud providers and large enterprises operate heterogeneous AI infrastructure: Google TPU pods, NVIDIA GPU clusters, and AMD GPU clusters, each with its own software stack, performance profile, and cost structure.

How do you design a serving system that abstracts hardware differences while maximizing efficiency?

Reference Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           UNIFIED API GATEWAY                                │
│                                                                              │
│    • OpenAI-compatible API        • Model routing                           │
│    • Token counting               • Request validation                      │
│    • Authentication               • Rate limiting                           │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
┌─────────────────────────────────────▼───────────────────────────────────────┐
│                         INTELLIGENT ROUTER                                   │
│                                                                              │
│    Routing Policies:                                                         │
│    ┌─────────────────────────────────────────────────────────────────────┐  │
│    │ • SLO-based: Route to fastest hardware meeting latency target       │  │
│    │ • Cost-based: Route to cheapest option meeting SLO                  │  │
│    │ • Affinity-based: Specific models → specific hardware               │  │
│    │ • Load-based: Balance across available capacity                     │  │
│    └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│    Model Registry:                                                           │
│    ┌───────────────┬─────────────────┬───────────────────────────────────┐  │
│    │ Model         │ Formats         │ Hardware Affinity                 │  │
│    ├───────────────┼─────────────────┼───────────────────────────────────┤  │
│    │ Llama-70B     │ SafeTensors,    │ NVIDIA (TensorRT), AMD (vLLM),    │  │
│    │               │ TensorRT, JAX   │ TPU (JAX)                         │  │
│    └───────────────┴─────────────────┴───────────────────────────────────┘  │
└───────┬─────────────────────────┬─────────────────────────┬─────────────────┘
        │                         │                         │
        ▼                         ▼                         ▼
┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
│   GOOGLE TPU      │   │   NVIDIA GPU      │   │   AMD GPU         │
│   POOL            │   │   POOL            │   │   POOL            │
│                   │   │                   │   │                   │
│ ┌───────────────┐ │   │ ┌───────────────┐ │   │ ┌───────────────┐ │
│ │ TPU v5e Pod   │ │   │ │ H100 Cluster  │ │   │ │ MI300X Cluster│ │
│ │ (Inference)   │ │   │ │ (General)     │ │   │ │ (High Memory) │ │
│ └───────────────┘ │   │ └───────────────┘ │   │ └───────────────┘ │
│                   │   │                   │   │                   │
│ ┌───────────────┐ │   │ ┌───────────────┐ │   │ ┌───────────────┐ │
│ │ TPU v5p Pod   │ │   │ │ H200 Cluster  │ │   │ │ MI325X Cluster│ │
│ │ (Training)    │ │   │ │ (Large Models)│ │   │ │ (Future)      │ │
│ └───────────────┘ │   │ └───────────────┘ │   │ └───────────────┘ │
│                   │   │                   │   │                   │
│ Framework: JAX    │   │ Framework:        │   │ Framework:        │
│ Runtime: XLA      │   │ TensorRT-LLM      │   │ vLLM + ROCm       │
└───────────────────┘   └───────────────────┘   └───────────────────┘
        │                         │                         │
        └─────────────────────────┼─────────────────────────┘
                                  │
┌─────────────────────────────────▼───────────────────────────────────────────┐
│                      UNIFIED OBSERVABILITY                                   │
│                                                                              │
│    • Tokens/sec per hardware type    • Cost per 1M tokens                   │
│    • TTFT / ITL distributions        • Utilization metrics                  │
│    • Error rates by pool             • Capacity planning                    │
└─────────────────────────────────────────────────────────────────────────────┘
                

Key Design Decisions

1. Model Format Strategy

┌─────────────────┬─────────────────────────────┬────────────────────┬───────────────────────────────────┐
│ Format          │ Use Case                    │ Hardware           │ Notes                             │
├─────────────────┼─────────────────────────────┼────────────────────┼───────────────────────────────────┤
│ SafeTensors     │ Canonical storage, training │ Universal          │ Safe, fast loading, mmap-friendly │
│ TensorRT Engine │ NVIDIA production inference │ NVIDIA only        │ Requires per-GPU compilation      │
│ JAX Checkpoints │ TPU inference               │ TPU / GPU          │ XLA compilation                   │
│ GGUF            │ CPU/edge inference          │ CPU, Apple Silicon │ llama.cpp ecosystem               │
└─────────────────┴─────────────────────────────┴────────────────────┴───────────────────────────────────┘
Recommended Pattern
  • Store canonical weights in SafeTensors
  • CI/CD pipeline builds platform-specific artifacts:
    • NVIDIA: TensorRT engine per GPU type (H100, H200)
    • AMD: vLLM-compatible SafeTensors (no compilation needed)
    • TPU: JAX checkpoint with XLA optimizations
  • Version artifacts with model version + hardware target
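The pattern above amounts to expanding one canonical model into a build matrix. A minimal sketch, assuming an illustrative artifact-ID scheme (the target names and ID format are not from any real tool):

```python
# Hardware target -> (serving format, whether a compile step is needed).
# AMD's vLLM path loads the canonical SafeTensors directly, so no build step.
TARGETS = {
    "h100":    ("tensorrt_engine", True),
    "h200":    ("tensorrt_engine", True),
    "mi300x":  ("safetensors",     False),
    "tpu-v5e": ("jax_checkpoint",  True),   # XLA ahead-of-time compilation
}

def artifact_id(model: str, version: str, target: str) -> str:
    """Version artifacts with model version + hardware target."""
    fmt, _ = TARGETS[target]
    return f"{model}:{version}:{target}:{fmt}"

def build_plan(model: str, version: str) -> list[str]:
    """One artifact per hardware target for a given model release."""
    return [artifact_id(model, version, t) for t in TARGETS]

print(build_plan("llama-70b", "v3"))
```

Keying artifacts on both model version and hardware target means a new GPU type (say, adding H200) only appends to the matrix rather than invalidating existing artifacts.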

2. Routing Strategies

SLO-Based Routing

Route to the fastest available hardware that meets the latency target.

Request: Llama-70B, SLO: TTFT < 500ms

Router checks:
┌─────────────────────────────────────────────────────────────────┐
│ Pool          │ P50 TTFT  │ Capacity │ Decision                │
├───────────────┼───────────┼──────────┼─────────────────────────┤
│ H100 Cluster  │ 180ms     │ 85%      │ ✓ Best option           │
│ MI300X        │ 220ms     │ 60%      │ ✓ Backup                │
│ TPU v5e       │ 450ms     │ 40%      │ ✓ Acceptable            │
│ H200 (busy)   │ 150ms     │ 98%      │ ✗ Over capacity         │
└─────────────────────────────────────────────────────────────────┘

Route to: H100 Cluster (fastest with capacity)
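The decision above can be sketched as a small selection function. The pool figures mirror the table; the 95% utilization ceiling is an assumption, not a fixed rule:

```python
POOLS = [
    # (name, p50_ttft_ms, utilization)
    ("h100",    180, 0.85),
    ("mi300x",  220, 0.60),
    ("tpu-v5e", 450, 0.40),
    ("h200",    150, 0.98),
]

def route_slo(pools, ttft_slo_ms, max_util=0.95):
    """Fastest pool that meets the TTFT target and has spare capacity."""
    candidates = [
        (ttft, name)
        for name, ttft, util in pools
        if ttft < ttft_slo_ms and util < max_util
    ]
    if not candidates:
        return None  # shed load or queue the request
    return min(candidates)[1]

print(route_slo(POOLS, ttft_slo_ms=500))  # h100: fastest with capacity
```

Note that H200 is excluded despite the best latency: a router that ignores capacity headroom turns the fastest pool into the first one to saturate.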
                

Cost-Based Routing

Route to the cheapest option that meets the SLO.

Request: Batch inference, SLO: Complete within 1 hour

Router checks:
┌─────────────────────────────────────────────────────────────────┐
│ Pool          │ $/1M tokens │ Meets SLO │ Decision              │
├───────────────┼─────────────┼───────────┼───────────────────────┤
│ TPU v5e       │ $0.80       │ ✓         │ ✓ Cheapest            │
│ MI300X        │ $1.20       │ ✓         │   Backup              │
│ H100          │ $2.50       │ ✓         │   More expensive      │
│ H200          │ $3.00       │ ✓         │   Premium             │
└─────────────────────────────────────────────────────────────────┘

Route to: TPU v5e (cheapest meeting SLO)
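The cost policy inverts the objective: filter on SLO feasibility, then minimize price. A minimal sketch using the illustrative per-token prices from the table (not real list prices):

```python
PRICES = {
    # name -> ($ per 1M tokens, meets_slo)
    "tpu-v5e": (0.80, True),
    "mi300x":  (1.20, True),
    "h100":    (2.50, True),
    "h200":    (3.00, True),
}

def route_cost(prices):
    """Cheapest pool among those that can finish within the SLO."""
    eligible = [(cost, name) for name, (cost, ok) in prices.items() if ok]
    return min(eligible)[1] if eligible else None

print(route_cost(PRICES))  # tpu-v5e: cheapest meeting SLO
```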
                

Affinity-Based Routing

Some models perform better on specific hardware due to optimization or memory requirements.

┌───────────────────┬───────────────┬───────────────────────────────────────┐
│ Model             │ Best Hardware │ Reason                                │
├───────────────────┼───────────────┼───────────────────────────────────────┤
│ Llama 405B        │ MI300X        │ 256GB HBM fits model + large KV cache │
│ Gemini (Google)   │ TPU v5p       │ Native JAX, optimized for TPU         │
│ GPT-4 (inference) │ H200          │ TensorRT-LLM optimizations            │
│ Mixtral 8x22B     │ H100 (TP=8)   │ Expert parallelism, NVLink bandwidth  │
└───────────────────┴───────────────┴───────────────────────────────────────┘

3. Hardware-Specific Optimizations

NVIDIA Optimizations
  • TensorRT-LLM: Custom attention kernels, FP8 quantization, inflight batching
  • NVLink: Use TP within NVLink domain (900 GB/s on H100)
  • Multi-Instance GPU (MIG): Partition H100 for smaller models
  • Flash Attention: Memory-efficient attention (standard in vLLM, TensorRT-LLM)
AMD Optimizations
  • ROCm 7: 3.5x inference uplift vs ROCm 6
  • vLLM: Full support with optimized Docker images
  • Memory advantage: 256GB HBM3e enables larger batches, longer context
  • Infinity Fabric: AMD's interconnect for multi-GPU
TPU Optimizations
  • JAX/XLA: Compiler optimizations, automatic sharding
  • ICI (Inter-Core Interconnect): 4800 Gbps on v5p for pod-scale
  • SparseCore: 5-7x speedup for embedding operations
  • MegaCore: Two cores share memory, operate as one

KV Cache Across Architectures

In a disaggregated or multi-architecture setup, KV cache transfer is critical.

Option 1: Stateless (Re-compute)

Each request is self-contained. If routed to different hardware, re-compute prefill.

Option 2: Sticky Sessions

Pin conversation to same hardware pool for duration.

Option 3: Distributed KV Store

Externalize KV cache to Redis/custom store, transfer on routing decision.

┌─────────────────────────────────────────────────────────────────────────────┐
│                          DISTRIBUTED KV CACHE                                │
│                                                                              │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐                  │
│    │  Session A  │     │  Session B  │     │  Session C  │                  │
│    │  KV Cache   │     │  KV Cache   │     │  KV Cache   │                  │
│    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘                  │
│           │                   │                   │                          │
│    ┌──────▼───────────────────▼───────────────────▼──────┐                  │
│    │              RDMA-Enabled KV Store                   │                  │
│    │         (Redis Cluster / Custom Solution)            │                  │
│    └──────┬───────────────────┬───────────────────┬──────┘                  │
│           │                   │                   │                          │
└───────────┼───────────────────┼───────────────────┼──────────────────────────┘
            │                   │                   │
            ▼                   ▼                   ▼
    ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
    │ NVIDIA Pool   │   │  AMD Pool     │   │  TPU Pool     │
    │ (Pull on need)│   │ (Pull on need)│   │ (Pull on need)│
    └───────────────┘   └───────────────┘   └───────────────┘
                
KV Transfer Latency

For a 70B model with 32K context, the KV cache is roughly 40GB (the exact size depends on the attention layout and KV precision; grouped-query attention shrinks it substantially). Transfer times:

  • NVLink (900 GB/s): ~45ms
  • InfiniBand (400 Gbps): ~800ms
  • Ethernet (100 Gbps): ~3.2s

This is why disaggregation typically stays within high-bandwidth domains.
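The latency figures above follow directly from size / bandwidth; a quick check (note the Gbps to GB/s conversion for the network links):

```python
KV_CACHE_GB = 40  # ~70B model at 32K context, per the estimate above

links = {
    "NVLink":     900,        # GB/s
    "InfiniBand": 400 / 8,    # 400 Gbps -> 50 GB/s
    "Ethernet":   100 / 8,    # 100 Gbps -> 12.5 GB/s
}

for name, bw_gb_s in links.items():
    print(f"{name}: {KV_CACHE_GB / bw_gb_s * 1000:.0f} ms")
# NVLink ~44 ms, InfiniBand ~800 ms, Ethernet ~3200 ms
```

The spread is nearly two orders of magnitude, which is the quantitative reason KV transfer stays inside high-bandwidth domains.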

Real-World Examples

Microsoft Azure

Google Cloud

Meta

Interview Discussion Points

"How would you handle model versioning across architectures?"
  • Canonical weights in SafeTensors (source of truth)
  • CI/CD builds hardware-specific artifacts on model update
  • Blue-green deployments per hardware pool
  • Gradual rollout with traffic splitting
"What if NVIDIA has a supply shortage?"
  • This is exactly why multi-architecture matters
  • Automatic failover to AMD/TPU pools
  • Cost model adjusts (may pay premium for available capacity)
  • Long-term: Diversified purchasing strategy
"How do you ensure consistent output across hardware?"
  • Same model weights, but floating-point differences exist
  • Evaluation suite runs on all hardware targets
  • Accept small numerical differences (FP16/FP8 variance)
  • Critical: even greedy decoding (temperature=0) may produce slightly different outputs across hardware