The Challenge
Modern cloud providers and large enterprises operate heterogeneous AI infrastructure:
- Microsoft Azure: NVIDIA H100/H200 + AMD MI300X for Azure OpenAI
- Google Cloud: TPU v5e/v5p + NVIDIA GPUs
- Meta: Custom MTIA + NVIDIA GPUs + exploring TPU partnership
How do you design a serving system that abstracts hardware differences while maximizing efficiency?
Reference Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ UNIFIED API GATEWAY │
│ │
│ • OpenAI-compatible API • Model routing │
│ • Token counting • Request validation │
│ • Authentication • Rate limiting │
└─────────────────────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────────────────────▼───────────────────────────────────────┐
│ INTELLIGENT ROUTER │
│ │
│ Routing Policies: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ • SLO-based: Route to fastest hardware meeting latency target │ │
│ │ • Cost-based: Route to cheapest option meeting SLO │ │
│ │ • Affinity-based: Specific models → specific hardware │ │
│ │ • Load-based: Balance across available capacity │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Model Registry: │
│ ┌───────────────┬─────────────────┬───────────────────────────────────┐ │
│ │ Model │ Formats │ Hardware Affinity │ │
│ ├───────────────┼─────────────────┼───────────────────────────────────┤ │
│ │ Llama-70B │ SafeTensors, │ NVIDIA (TensorRT), AMD (vLLM), │ │
│ │ │ TensorRT, JAX │ TPU (JAX) │ │
│ └───────────────┴─────────────────┴───────────────────────────────────┘ │
└───────┬─────────────────────────┬─────────────────────────┬─────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ GOOGLE TPU │ │ NVIDIA GPU │ │ AMD GPU │
│ POOL │ │ POOL │ │ POOL │
│ │ │ │ │ │
│ ┌───────────────┐ │ │ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ TPU v5e Pod │ │ │ │ H100 Cluster │ │ │ │ MI300X Cluster│ │
│ │ (Inference) │ │ │ │ (General) │ │ │ │ (High Memory) │ │
│ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │
│ │ │ │ │ │
│ ┌───────────────┐ │ │ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ TPU v5p Pod │ │ │ │ H200 Cluster │ │ │ │ MI325X Cluster│ │
│ │ (Training) │ │ │ │ (Large Models)│ │ │ │ (Future) │ │
│ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │
│ │ │ │ │ │
│ Framework: JAX │ │ Framework: │ │ Framework: │
│ Runtime: XLA │ │ TensorRT-LLM │ │ vLLM + ROCm │
└───────────────────┘ └───────────────────┘ └───────────────────┘
│ │ │
└─────────────────────────┼─────────────────────────┘
│
┌─────────────────────────────────▼───────────────────────────────────────────┐
│ UNIFIED OBSERVABILITY │
│ │
│ • Tokens/sec per hardware type • Cost per 1M tokens │
│ • TTFT / ITL distributions • Utilization metrics │
│ • Error rates by pool • Capacity planning │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions
1. Model Format Strategy
| Format | Use Case | Hardware | Notes |
|---|---|---|---|
| SafeTensors | Canonical storage, training | Universal | Safe, fast loading, mmap-friendly |
| TensorRT Engine | NVIDIA production inference | NVIDIA only | Requires per-GPU compilation |
| JAX Checkpoints | TPU inference | TPU / GPU | XLA compilation |
| GGUF | CPU/edge inference | CPU, Apple Silicon | llama.cpp ecosystem |
Recommended Pattern
- Store canonical weights in SafeTensors
- CI/CD pipeline builds platform-specific artifacts:
- NVIDIA: TensorRT engine per GPU type (H100, H200)
- AMD: vLLM-compatible SafeTensors (no compilation needed)
- TPU: JAX checkpoint with XLA optimizations
- Version artifacts with model version + hardware target
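The versioning step above can be sketched as a small artifact record; the names (`BuildArtifact`, `artifact_id`) are illustrative, not a real registry API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BuildArtifact:
    model: str            # e.g. "llama-70b"
    model_version: str    # e.g. "v3.1"
    hardware: str         # e.g. "h100", "h200", "mi300x", "tpu-v5e"
    fmt: str              # "tensorrt", "safetensors", "jax"

    def artifact_id(self) -> str:
        # Version artifacts by model version + hardware target so the
        # router can pin a request to an exact, reproducible build.
        return f"{self.model}-{self.model_version}-{self.hardware}-{self.fmt}"

art = BuildArtifact("llama-70b", "v3.1", "h100", "tensorrt")
print(art.artifact_id())  # llama-70b-v3.1-h100-tensorrt
```

Keying artifacts on both model version and hardware target lets the CI/CD pipeline rebuild only the targets affected by a change (e.g. a new TensorRT version touches NVIDIA artifacts but not the TPU checkpoint).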
2. Routing Strategies
SLO-Based Routing
Route to the fastest available hardware that meets the latency target.
Request: Llama-70B, SLO: TTFT < 500ms
Router checks:
┌─────────────────────────────────────────────────────────────────┐
│ Pool │ P50 TTFT │ Capacity │ Decision │
├───────────────┼───────────┼──────────┼─────────────────────────┤
│ H100 Cluster │ 180ms │ 85% │ ✓ Best option │
│ MI300X │ 220ms │ 60% │ ✓ Backup │
│ TPU v5e │ 450ms │ 40% │ ✓ Acceptable │
│ H200 (busy) │ 150ms │ 98% │ ✗ Over capacity │
└─────────────────────────────────────────────────────────────────┘
Route to: H100 Cluster (fastest with capacity)
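A minimal sketch of this policy: pick the fastest pool that meets the TTFT target and still has headroom. The pool stats are hypothetical placeholders for whatever the metrics system reports, and the 95% utilization cutoff is an assumed threshold:

```python
pools = [
    {"name": "H100",    "p50_ttft_ms": 180, "utilization": 0.85},
    {"name": "MI300X",  "p50_ttft_ms": 220, "utilization": 0.60},
    {"name": "TPU-v5e", "p50_ttft_ms": 450, "utilization": 0.40},
    {"name": "H200",    "p50_ttft_ms": 150, "utilization": 0.98},
]

def route_slo(pools, ttft_slo_ms, max_util=0.95):
    # Filter to pools meeting the latency SLO with spare capacity,
    # then prefer the fastest of those.
    eligible = [p for p in pools
                if p["p50_ttft_ms"] <= ttft_slo_ms and p["utilization"] < max_util]
    if not eligible:
        raise RuntimeError("no pool meets SLO with available capacity")
    return min(eligible, key=lambda p: p["p50_ttft_ms"])["name"]

print(route_slo(pools, ttft_slo_ms=500))  # H100 (H200 is faster but over capacity)
```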
Cost-Based Routing
Route to the cheapest option that meets the SLO.
Request: Batch inference, SLO: Complete within 1 hour
Router checks:
┌─────────────────────────────────────────────────────────────────┐
│ Pool │ $/1M tokens │ Meets SLO │ Decision │
├───────────────┼─────────────┼───────────┼───────────────────────┤
│ TPU v5e │ $0.80 │ ✓ │ ✓ Cheapest │
│ MI300X │ $1.20 │ ✓ │ Backup │
│ H100 │ $2.50 │ ✓ │ More expensive │
│ H200 │ $3.00 │ ✓ │ Premium │
└─────────────────────────────────────────────────────────────────┘
Route to: TPU v5e (cheapest meeting SLO)
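The cost-based variant inverts the objective: filter by SLO feasibility first, then minimize price. A sketch mirroring the table above, with assumed per-pool price data:

```python
pools = [
    {"name": "TPU-v5e", "usd_per_mtok": 0.80, "meets_slo": True},
    {"name": "MI300X",  "usd_per_mtok": 1.20, "meets_slo": True},
    {"name": "H100",    "usd_per_mtok": 2.50, "meets_slo": True},
    {"name": "H200",    "usd_per_mtok": 3.00, "meets_slo": True},
]

def route_cost(pools):
    # Cheapest pool among those projected to finish within the deadline.
    eligible = [p for p in pools if p["meets_slo"]]
    if not eligible:
        raise RuntimeError("no pool can meet the batch deadline")
    return min(eligible, key=lambda p: p["usd_per_mtok"])["name"]

print(route_cost(pools))  # TPU-v5e
```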
Affinity-Based Routing
Some models perform better on specific hardware due to optimization or memory requirements.
| Model | Best Hardware | Reason |
|---|---|---|
| Llama 405B | MI300X | 192GB HBM3 per GPU: fewer GPUs per replica + room for large KV cache |
| Gemini (Google) | TPU v5p | Native JAX, optimized for TPU |
| GPT-4 (inference) | H200 | TensorRT-LLM optimizations |
| Mixtral 8x22B | H100 (TP=8) | Expert parallelism, NVLink bandwidth |
3. Hardware-Specific Optimizations
NVIDIA Optimizations
- TensorRT-LLM: Custom attention kernels, FP8 quantization, inflight batching
- NVLink: Use TP within NVLink domain (900 GB/s on H100)
- Multi-Instance GPU (MIG): Partition H100 for smaller models
- Flash Attention: Memory-efficient attention (standard in vLLM, TensorRT-LLM)
AMD Optimizations
- ROCm 7: AMD reports up to ~3.5x inference uplift vs ROCm 6
- vLLM: Full support with optimized Docker images
- Memory advantage: 192GB HBM3 (MI300X) / 256GB HBM3e (MI325X) enables larger batches, longer context
- Infinity Fabric: AMD's interconnect for multi-GPU
TPU Optimizations
- JAX/XLA: Compiler optimizations, automatic sharding
- ICI (Inter-Chip Interconnect): 4800 Gbps on v5p for pod-scale
- SparseCore: 5-7x speedup for embedding operations
- MegaCore: Two cores share memory, operate as one
KV Cache Across Architectures
In a disaggregated or multi-architecture setup, KV cache transfer is critical.
Option 1: Stateless (Re-compute)
Each request is self-contained. If routed to different hardware, re-compute prefill.
- Pro: Simple, no cross-hardware dependencies
- Con: Wasted compute for multi-turn conversations
Option 2: Sticky Sessions
Pin conversation to same hardware pool for duration.
- Pro: KV cache stays local
- Con: Reduces routing flexibility, potential hotspots
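Sticky sessions can be sketched as a stable hash from session ID to pool; a real system would use consistent hashing or an explicit session-to-pool table to survive pool churn, and the pool names here are illustrative:

```python
import hashlib

POOLS = ["nvidia-pool", "amd-pool", "tpu-pool"]

def sticky_pool(session_id: str, pools=POOLS) -> str:
    # A deterministic hash pins each session to one pool, so its KV
    # cache stays local for the whole conversation.
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return pools[h % len(pools)]

# The same session always routes to the same pool:
assert sticky_pool("session-42") == sticky_pool("session-42")
```

The hotspot risk noted above follows directly from this design: a burst of long conversations hashing to one pool cannot be rebalanced without invalidating their caches.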
Option 3: Distributed KV Store
Externalize KV cache to Redis/custom store, transfer on routing decision.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED KV CACHE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Session A │ │ Session B │ │ Session C │ │
│ │ KV Cache │ │ KV Cache │ │ KV Cache │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────▼───────────────────▼───────────────────▼──────┐ │
│ │ RDMA-Enabled KV Store │ │
│ │ (Redis Cluster / Custom Solution) │ │
│ └──────┬───────────────────┬───────────────────┬──────┘ │
│ │ │ │ │
└───────────┼───────────────────┼───────────────────┼──────────────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ NVIDIA Pool │ │ AMD Pool │ │ TPU Pool │
│ (Pull on need)│ │ (Pull on need)│ │ (Pull on need)│
└───────────────┘ └───────────────┘ └───────────────┘
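The "pull on need" flow can be sketched as a get-or-recompute pattern: before prefill, a worker checks the external store; on a hit it loads the cache, on a miss it recomputes prefill and writes the cache back. `kv_store` is an in-memory stand-in for a Redis cluster or custom RDMA-backed store, and the API is illustrative:

```python
kv_store = {}  # stand-in for the external KV store

def prefill(prompt_tokens):
    # Placeholder for real prefill compute producing KV tensors.
    return {"num_tokens": len(prompt_tokens)}

def get_kv(session_id, prompt_tokens):
    cached = kv_store.get(session_id)
    if cached is not None:
        return cached, "hit"          # pulled from the shared store
    kv = prefill(prompt_tokens)       # cache miss: recompute prefill
    kv_store[session_id] = kv         # write back so other pools can pull it
    return kv, "miss"

_, first = get_kv("sess-a", [1, 2, 3])
_, second = get_kv("sess-a", [1, 2, 3])
print(first, second)  # miss hit
```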
KV Transfer Latency
For a 70B model with 32K context, KV cache is ~40GB. Transfer times:
- NVLink (900 GB/s): ~45ms
- InfiniBand (400 Gbps): ~800ms
- Ethernet (100 Gbps): ~3.2s
This is why disaggregation typically stays within high-bandwidth domains.
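The transfer times above are straightforward size-over-bandwidth arithmetic; note that link speeds quoted in Gbps must be divided by 8 to get GB/s:

```python
KV_BYTES = 40e9  # ~40 GB KV cache (70B model, 32K context, per the text)

links_gb_per_s = {
    "NVLink":     900.0,    # quoted directly in GB/s
    "InfiniBand": 400 / 8,  # 400 Gbps -> 50 GB/s
    "Ethernet":   100 / 8,  # 100 Gbps -> 12.5 GB/s
}

for name, gbps in links_gb_per_s.items():
    print(f"{name}: {KV_BYTES / (gbps * 1e9):.2f} s")
# NVLink: 0.04 s, InfiniBand: 0.80 s, Ethernet: 3.20 s
```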
Real-World Examples
Microsoft Azure
- Azure OpenAI runs on both NVIDIA (H100/H200) and AMD (MI300X)
- Copilot workloads distributed across both
- Routing based on capacity and model requirements
Google Cloud
- TPU pods with ICI fabric for internal models (Gemini)
- NVIDIA GPUs available for customer workloads
- JAX/XLA provides hardware abstraction layer
Meta
- Custom MTIA (Meta Training and Inference Accelerator)
- NVIDIA GPUs for general compute
- Exploring Google TPU partnership (2026-2027)
- Target: 600K chip infrastructure
Interview Discussion Points
"How would you handle model versioning across architectures?"
- Canonical weights in SafeTensors (source of truth)
- CI/CD builds hardware-specific artifacts on model update
- Blue-green deployments per hardware pool
- Gradual rollout with traffic splitting
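The traffic-splitting step can be sketched as a weighted choice between the current and candidate artifact for a given pool; the artifact IDs and weights are hypothetical:

```python
import random

def pick_artifact(weights, rng=random.random):
    # weights: artifact id -> fraction of traffic, summing to 1.0.
    # Walk the cumulative distribution and return the matching artifact.
    r, acc = rng(), 0.0
    for artifact, w in weights.items():
        acc += w
        if r < acc:
            return artifact
    return artifact  # fallback for floating-point edge cases

rollout = {"llama-70b-v3.0-h100": 0.9,   # stable version keeps 90%
           "llama-70b-v3.1-h100": 0.1}   # candidate gets 10% canary traffic
print(pick_artifact(rollout))
```

Running the split per hardware pool (rather than globally) lets a regression on one target, say a bad TensorRT build, be rolled back without touching the AMD or TPU rollouts.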
"What if NVIDIA has a supply shortage?"
- This is exactly why multi-architecture matters
- Automatic failover to AMD/TPU pools
- Cost model adjusts (may pay premium for available capacity)
- Long-term: Diversified purchasing strategy
"How do you ensure consistent output across hardware?"
- Same model weights, but floating-point differences exist
- Evaluation suite runs on all hardware targets
- Accept small numerical differences (FP16/FP8 variance)
- Critical: even greedy decoding (temperature=0) may differ slightly across hardware, since kernel implementations and floating-point reduction order vary