The Challenge
Modern cloud providers and large enterprises operate heterogeneous AI infrastructure:
- Microsoft Azure: NVIDIA H100/H200 + AMD MI300X for Azure OpenAI
- Google Cloud: TPU v5e/v5p + NVIDIA GPUs
- Meta: Custom MTIA + NVIDIA GPUs + exploring TPU partnership
How do you design a serving system that abstracts hardware differences while maximizing efficiency?
Reference Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ UNIFIED API GATEWAY │
│ │
│ • OpenAI-compatible API • Model routing │
│ • Token counting • Request validation │
│ • Authentication • Rate limiting │
└─────────────────────────────────────┬───────────────────────────────────────┘
│
┌─────────────────────────────────────▼───────────────────────────────────────┐
│ INTELLIGENT ROUTER │
│ │
│ Routing Policies: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ • SLO-based: Route to fastest hardware meeting latency target │ │
│ │ • Cost-based: Route to cheapest option meeting SLO │ │
│ │ • Affinity-based: Specific models → specific hardware │ │
│ │ • Load-based: Balance across available capacity │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Model Registry: │
│ ┌───────────────┬─────────────────┬───────────────────────────────────┐ │
│ │ Model │ Formats │ Hardware Affinity │ │
│ ├───────────────┼─────────────────┼───────────────────────────────────┤ │
│ │ Llama-70B │ SafeTensors, │ NVIDIA (TensorRT), AMD (vLLM), │ │
│ │ │ TensorRT, JAX │ TPU (JAX) │ │
│ └───────────────┴─────────────────┴───────────────────────────────────┘ │
└───────┬─────────────────────────┬─────────────────────────┬─────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ GOOGLE TPU │ │ NVIDIA GPU │ │ AMD GPU │
│ POOL │ │ POOL │ │ POOL │
│ │ │ │ │ │
│ ┌───────────────┐ │ │ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ TPU v5e Pod │ │ │ │ H100 Cluster │ │ │ │ MI300X Cluster│ │
│ │ (Inference) │ │ │ │ (General) │ │ │ │ (High Memory) │ │
│ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │
│ │ │ │ │ │
│ ┌───────────────┐ │ │ ┌───────────────┐ │ │ ┌───────────────┐ │
│ │ TPU v5p Pod │ │ │ │ H200 Cluster │ │ │ │ MI325X Cluster│ │
│ │ (Training) │ │ │ │ (Large Models)│ │ │ │ (Future) │ │
│ └───────────────┘ │ │ └───────────────┘ │ │ └───────────────┘ │
│ │ │ │ │ │
│ Framework: JAX │ │ Framework: │ │ Framework: │
│ Runtime: XLA │ │ TensorRT-LLM │ │ vLLM + ROCm │
└───────────────────┘ └───────────────────┘ └───────────────────┘
│ │ │
└─────────────────────────┼─────────────────────────┘
│
┌─────────────────────────────────▼───────────────────────────────────────────┐
│ UNIFIED OBSERVABILITY │
│ │
│ • Tokens/sec per hardware type • Cost per 1M tokens │
│ • TTFT / ITL distributions • Utilization metrics │
│ • Error rates by pool • Capacity planning │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Decisions
1. Model Format Strategy
| Format | Use Case | Hardware | Notes |
|---|---|---|---|
| SafeTensors | Canonical storage, training | Universal | Safe, fast loading, mmap-friendly |
| TensorRT Engine | NVIDIA production inference | NVIDIA only | Requires per-GPU compilation |
| JAX Checkpoints | TPU inference | TPU / GPU | XLA compilation |
| GGUF | CPU/edge inference | CPU, Apple Silicon | llama.cpp ecosystem |
Recommended Pattern
- Store canonical weights in SafeTensors
- CI/CD pipeline builds platform-specific artifacts:
- NVIDIA: TensorRT engine per GPU type (H100, H200)
- AMD: vLLM-compatible SafeTensors (no compilation needed)
- TPU: JAX checkpoint with XLA optimizations
- Version artifacts with model version + hardware target
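The versioning step above can be sketched as a small artifact record; the names (`BuildArtifact`, `artifact_id`) are illustrative, not a real registry API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BuildArtifact:
    model: str            # e.g. "llama-70b"
    model_version: str    # e.g. "v3.1"
    hardware: str         # e.g. "h100", "h200", "mi300x", "tpu-v5e"
    fmt: str              # "tensorrt", "safetensors", "jax"

    def artifact_id(self) -> str:
        # Version artifacts by model version + hardware target so the
        # router can pin a request to an exact, reproducible build.
        return f"{self.model}-{self.model_version}-{self.hardware}-{self.fmt}"

art = BuildArtifact("llama-70b", "v3.1", "h100", "tensorrt")
print(art.artifact_id())  # llama-70b-v3.1-h100-tensorrt
```

Keying artifacts on both model version and hardware target lets the CI/CD pipeline rebuild only the targets affected by a change (e.g. a new TensorRT version touches NVIDIA artifacts but not the TPU checkpoint).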
2. Routing Strategies
SLO-Based Routing
Route to the fastest available hardware that meets the latency target.
Request: Llama-70B, SLO: TTFT < 500ms
Router checks:
┌─────────────────────────────────────────────────────────────────┐
│ Pool │ P50 TTFT │ Capacity │ Decision │
├───────────────┼───────────┼──────────┼─────────────────────────┤
│ H100 Cluster │ 180ms │ 85% │ ✓ Best option │
│ MI300X │ 220ms │ 60% │ ✓ Backup │
│ TPU v5e │ 450ms │ 40% │ ✓ Acceptable │
│ H200 (busy) │ 150ms │ 98% │ ✗ Over capacity │
└─────────────────────────────────────────────────────────────────┘
Route to: H100 Cluster (fastest with capacity)
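A minimal sketch of this policy: pick the fastest pool that meets the TTFT target and still has headroom. The pool stats are hypothetical placeholders for whatever the metrics system reports, and the 95% utilization cutoff is an assumed threshold:

```python
pools = [
    {"name": "H100",    "p50_ttft_ms": 180, "utilization": 0.85},
    {"name": "MI300X",  "p50_ttft_ms": 220, "utilization": 0.60},
    {"name": "TPU-v5e", "p50_ttft_ms": 450, "utilization": 0.40},
    {"name": "H200",    "p50_ttft_ms": 150, "utilization": 0.98},
]

def route_slo(pools, ttft_slo_ms, max_util=0.95):
    # Filter to pools meeting the latency SLO with spare capacity,
    # then prefer the fastest of those.
    eligible = [p for p in pools
                if p["p50_ttft_ms"] <= ttft_slo_ms and p["utilization"] < max_util]
    if not eligible:
        raise RuntimeError("no pool meets SLO with available capacity")
    return min(eligible, key=lambda p: p["p50_ttft_ms"])["name"]

print(route_slo(pools, ttft_slo_ms=500))  # H100 (H200 is faster but over capacity)
```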
Cost-Based Routing
Route to the cheapest option that meets the SLO.
Request: Batch inference, SLO: Complete within 1 hour
Router checks:
┌─────────────────────────────────────────────────────────────────┐
│ Pool │ $/1M tokens │ Meets SLO │ Decision │
├───────────────┼─────────────┼───────────┼───────────────────────┤
│ TPU v5e │ $0.80 │ ✓ │ ✓ Cheapest │
│ MI300X │ $1.20 │ ✓ │ Backup │
│ H100 │ $2.50 │ ✓ │ More expensive │
│ H200 │ $3.00 │ ✓ │ Premium │
└─────────────────────────────────────────────────────────────────┘
Route to: TPU v5e (cheapest meeting SLO)
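The cost-based variant inverts the objective: filter by SLO feasibility first, then minimize price. A sketch mirroring the table above, with assumed per-pool price data:

```python
pools = [
    {"name": "TPU-v5e", "usd_per_mtok": 0.80, "meets_slo": True},
    {"name": "MI300X",  "usd_per_mtok": 1.20, "meets_slo": True},
    {"name": "H100",    "usd_per_mtok": 2.50, "meets_slo": True},
    {"name": "H200",    "usd_per_mtok": 3.00, "meets_slo": True},
]

def route_cost(pools):
    # Cheapest pool among those projected to finish within the deadline.
    eligible = [p for p in pools if p["meets_slo"]]
    if not eligible:
        raise RuntimeError("no pool can meet the batch deadline")
    return min(eligible, key=lambda p: p["usd_per_mtok"])["name"]

print(route_cost(pools))  # TPU-v5e
```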
Affinity-Based Routing
Some models perform better on specific hardware due to optimization or memory requirements.
| Model | Best Hardware | Reason |
|---|---|---|
| Llama 405B | MI300X | 192GB HBM3 per GPU: fewer GPUs per replica + room for large KV cache |
| Gemini (Google) | TPU v5p | Native JAX, optimized for TPU |
| GPT-4 (inference) | H200 | TensorRT-LLM optimizations |
| Mixtral 8x22B | H100 (TP=8) | Expert parallelism, NVLink bandwidth |
3. Hardware-Specific Optimizations
NVIDIA Optimizations
- TensorRT-LLM: Custom attention kernels, FP8 quantization, inflight batching
- NVLink: Use TP within NVLink domain (900 GB/s on H100)
- Multi-Instance GPU (MIG): Partition H100 for smaller models
- Flash Attention: Memory-efficient attention (standard in vLLM, TensorRT-LLM)
AMD Optimizations
- ROCm 7: AMD reports up to ~3.5x inference uplift vs ROCm 6
- vLLM: Full support with optimized Docker images
- Memory advantage: 192GB HBM3 (MI300X) / 256GB HBM3e (MI325X) enables larger batches, longer context
- Infinity Fabric: AMD's interconnect for multi-GPU
TPU Optimizations
- JAX/XLA: Compiler optimizations, automatic sharding
- ICI (Inter-Chip Interconnect): 4800 Gbps on v5p for pod-scale
- SparseCore: 5-7x speedup for embedding operations
- MegaCore: Two cores share memory, operate as one
KV Cache Across Architectures
In a disaggregated or multi-architecture setup, KV cache transfer is critical.
Option 1: Stateless (Re-compute)
Each request is self-contained. If routed to different hardware, re-compute prefill.
- Pro: Simple, no cross-hardware dependencies
- Con: Wasted compute for multi-turn conversations
Option 2: Sticky Sessions
Pin conversation to same hardware pool for duration.
- Pro: KV cache stays local
- Con: Reduces routing flexibility, potential hotspots
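Sticky sessions can be sketched as a stable hash from session ID to pool; a real system would use consistent hashing or an explicit session-to-pool table to survive pool churn, and the pool names here are illustrative:

```python
import hashlib

POOLS = ["nvidia-pool", "amd-pool", "tpu-pool"]

def sticky_pool(session_id: str, pools=POOLS) -> str:
    # A deterministic hash pins each session to one pool, so its KV
    # cache stays local for the whole conversation.
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return pools[h % len(pools)]

# The same session always routes to the same pool:
assert sticky_pool("session-42") == sticky_pool("session-42")
```

The hotspot risk noted above follows directly from this design: a burst of long conversations hashing to one pool cannot be rebalanced without invalidating their caches.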
Option 3: Distributed KV Store
Externalize KV cache to Redis/custom store, transfer on routing decision.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED KV CACHE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Session A │ │ Session B │ │ Session C │ │
│ │ KV Cache │ │ KV Cache │ │ KV Cache │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ┌──────▼───────────────────▼───────────────────▼──────┐ │
│ │ RDMA-Enabled KV Store │ │
│ │ (Redis Cluster / Custom Solution) │ │
│ └──────┬───────────────────┬───────────────────┬──────┘ │
│ │ │ │ │
└───────────┼───────────────────┼───────────────────┼──────────────────────────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ NVIDIA Pool │ │ AMD Pool │ │ TPU Pool │
│ (Pull on need)│ │ (Pull on need)│ │ (Pull on need)│
└───────────────┘ └───────────────┘ └───────────────┘
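The "pull on need" flow can be sketched as a get-or-recompute pattern: before prefill, a worker checks the external store; on a hit it loads the cache, on a miss it recomputes prefill and writes the cache back. `kv_store` is an in-memory stand-in for a Redis cluster or custom RDMA-backed store, and the API is illustrative:

```python
kv_store = {}  # stand-in for the external KV store

def prefill(prompt_tokens):
    # Placeholder for real prefill compute producing KV tensors.
    return {"num_tokens": len(prompt_tokens)}

def get_kv(session_id, prompt_tokens):
    cached = kv_store.get(session_id)
    if cached is not None:
        return cached, "hit"          # pulled from the shared store
    kv = prefill(prompt_tokens)       # cache miss: recompute prefill
    kv_store[session_id] = kv         # write back so other pools can pull it
    return kv, "miss"

_, first = get_kv("sess-a", [1, 2, 3])
_, second = get_kv("sess-a", [1, 2, 3])
print(first, second)  # miss hit
```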
KV Transfer Latency
For a 70B model with 32K context, KV cache is ~40GB. Transfer times:
- NVLink (900 GB/s): ~45ms
- InfiniBand (400 Gbps): ~800ms
- Ethernet (100 Gbps): ~3.2s
This is why disaggregation typically stays within high-bandwidth domains.
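The transfer times above are straightforward size-over-bandwidth arithmetic; note that link speeds quoted in Gbps must be divided by 8 to get GB/s:

```python
KV_BYTES = 40e9  # ~40 GB KV cache (70B model, 32K context, per the text)

links_gb_per_s = {
    "NVLink":     900.0,    # quoted directly in GB/s
    "InfiniBand": 400 / 8,  # 400 Gbps -> 50 GB/s
    "Ethernet":   100 / 8,  # 100 Gbps -> 12.5 GB/s
}

for name, gbps in links_gb_per_s.items():
    print(f"{name}: {KV_BYTES / (gbps * 1e9):.2f} s")
# NVLink: 0.04 s, InfiniBand: 0.80 s, Ethernet: 3.20 s
```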
Real-World Examples
Microsoft Azure
- Azure OpenAI runs on both NVIDIA (H100/H200) and AMD (MI300X)
- Copilot workloads distributed across both
- Routing based on capacity and model requirements
Google Cloud
- TPU pods with ICI fabric for internal models (Gemini)
- NVIDIA GPUs available for customer workloads
- JAX/XLA provides hardware abstraction layer
Meta
- Custom MTIA (Meta Training and Inference Accelerator)
- NVIDIA GPUs for general compute
- Exploring Google TPU partnership (2026-2027)
- Target: 600K chip infrastructure
Interview Discussion Points
"How would you handle model versioning across architectures?"
- Canonical weights in SafeTensors (source of truth)
- CI/CD builds hardware-specific artifacts on model update
- Blue-green deployments per hardware pool
- Gradual rollout with traffic splitting
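The traffic-splitting step can be sketched as a weighted choice between the current and candidate artifact for a given pool; the artifact IDs and weights are hypothetical:

```python
import random

def pick_artifact(weights, rng=random.random):
    # weights: artifact id -> fraction of traffic, summing to 1.0.
    # Walk the cumulative distribution and return the matching artifact.
    r, acc = rng(), 0.0
    for artifact, w in weights.items():
        acc += w
        if r < acc:
            return artifact
    return artifact  # fallback for floating-point edge cases

rollout = {"llama-70b-v3.0-h100": 0.9,   # stable version keeps 90%
           "llama-70b-v3.1-h100": 0.1}   # candidate gets 10% canary traffic
print(pick_artifact(rollout))
```

Running the split per hardware pool (rather than globally) lets a regression on one target, say a bad TensorRT build, be rolled back without touching the AMD or TPU rollouts.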
"What if NVIDIA has a supply shortage?"
- This is exactly why multi-architecture matters
- Automatic failover to AMD/TPU pools
- Cost model adjusts (may pay premium for available capacity)
- Long-term: Diversified purchasing strategy
"How do you ensure consistent output across hardware?"
- Same model weights, but floating-point differences exist
- Evaluation suite runs on all hardware targets
- Accept small numerical differences (FP16/FP8 variance)
- Critical: even greedy decoding (temperature=0) may differ slightly across hardware, since kernel implementations and floating-point reduction order vary