AI Accelerator Hardware Comparison

Comprehensive specs for Google TPU, NVIDIA GPU, and AMD GPU families. Memory, compute, interconnects, and when to use each.

Quick Comparison: Current Generation

Spec | NVIDIA H100 | NVIDIA H200 | AMD MI300X | Google TPU v5p
--- | --- | --- | --- | ---
HBM Capacity | 80 GB | 141 GB | 192 GB | 95 GB
Memory Bandwidth | 3.35 TB/s | 4.89 TB/s | 5.3 TB/s | 2.8 TB/s
FP16 TFLOPS | 989 | ~990 | 1,300 | 459
FP8 TFLOPS | 1,979 | ~1,980 | 2,600 | N/A
BF16 TFLOPS | 989 | ~990 | 1,300 | 459 (optimized)
Interconnect | NVLink 900 GB/s | NVLink 900 GB/s | Infinity Fabric | ICI 4,800 Gbps
TDP | 700W | 700W | 750W | ~450W
Best For | General LLM | Large context | Memory-bound work | Scale-out training

NVIDIA GPU Family

Hopper Architecture (H100, H200)

Spec | H100 SXM | H100 PCIe | H200 SXM
--- | --- | --- | ---
HBM Type | HBM3 | HBM2e | HBM3e
Memory | 80 GB | 80 GB | 141 GB
Bandwidth | 3.35 TB/s | 2.0 TB/s | 4.89 TB/s
FP16 Tensor | 989 TFLOPS | 756 TFLOPS | ~990 TFLOPS
FP8 Tensor | 1,979 TFLOPS | 1,513 TFLOPS | ~1,980 TFLOPS
NVLink | 900 GB/s (18 links) | N/A | 900 GB/s
Transistors | 80B | 80B | 80B

Blackwell Architecture (B100, B200, GB200) 2025

Spec | B100 | B200 | GB200 NVL72
--- | --- | --- | ---
Transistors | 208B | 208B | 208B per GPU
Memory | 192 GB HBM3e | 192 GB HBM3e | 192 GB × 72 = 13.8 TB
Bandwidth | 6.0 TB/s | 8.0 TB/s | 8.0 TB/s per GPU
FP8 Tensor | 3,500 TFLOPS | 4,500 TFLOPS | 4,500 × 72 TFLOPS
FP4 Tensor | 7,000 TFLOPS | 9,000 TFLOPS | Native support
NVLink | 1.8 TB/s | 1.8 TB/s | NVLink 5: 130 TB/s per rack

Blackwell is a dual-die design joined by a 10 TB/s chip-to-chip link; NVIDIA claims 11-15x higher LLM throughput than Hopper (H100).
NVIDIA Ecosystem Advantages
  • CUDA: Mature ecosystem, extensive libraries, developer tools
  • TensorRT-LLM: Optimized inference with FP8/FP4, inflight batching
  • Triton Inference Server: Multi-model serving, dynamic batching (minimal client sketch below)
  • NVSwitch: 3.6 TB/s across 8 GPUs in HGX systems
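As a concrete illustration of the Triton point above, here is a minimal client call using NVIDIA's tritonclient Python package; the model name ("llama_trtllm"), tensor names, shapes, and datatype are placeholder assumptions for whatever the deployed model's config actually declares.

```python
# Hypothetical Triton HTTP client call; model and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build one request; with dynamic batching enabled, Triton groups concurrent
# requests like this one into larger batches on the GPU.
input_ids = httpclient.InferInput("input_ids", [1, 16], "INT32")
input_ids.set_data_from_numpy(np.ones((1, 16), dtype=np.int32))

result = client.infer(model_name="llama_trtllm", inputs=[input_ids])
print(result.as_numpy("output_ids"))
```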

AMD Instinct Family

Spec | MI250X | MI300X | MI325X | MI350 (Jun 2025)
--- | --- | --- | --- | ---
Architecture | CDNA 2 | CDNA 3 | CDNA 3 | CDNA 4
Memory | 128 GB HBM2e | 192 GB HBM3 | 256 GB HBM3e | 288 GB HBM3e
Bandwidth | 3.2 TB/s | 5.3 TB/s | 6.0 TB/s | TBD
FP16 TFLOPS | 383 | 1,300 | 1,300 | TBD
FP32 TFLOPS | 47.9 | 163.4 | 163.4 | TBD
TDP | 560W | 750W | 1,000W | TBD
Chiplets | 2 GCDs | 8 XCDs + 4 IODs | 8 XCDs + 4 IODs | TBD
AMD Advantages & Production Use
  • Memory Leadership: 192 GB of HBM3 is 2.4x H100 capacity, fitting larger batches and longer contexts (MI325X pushes this to 256 GB HBM3e)
  • Microsoft Azure: Azure OpenAI (GPT-3.5/4) and Copilot workloads run on MI300X clusters
  • Meta: runs Llama inference on MI300X as part of a fleet planned at roughly 600K H100-equivalents of compute
  • ROCm 7: AMD reports a ~3.5x inference uplift over ROCm 6
  • vLLM: Full support with optimized Docker images (minimal usage sketch below)
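A minimal vLLM sketch using its public Python API (LLM, SamplingParams, generate); the model ID and sampling settings are illustrative, not a recommendation. The same script runs on CUDA or on a ROCm build of vLLM for MI300X, provided the weights fit in HBM at the chosen precision and parallelism.

```python
# Minimal vLLM example; model id and settings are illustrative assumptions.
from vllm import LLM, SamplingParams

# FP16 weights for a 70B model (~140 GB) fit in MI300X's 192 GB of HBM with
# tensor_parallel_size=1; on 80 GB H100s you would need tensor_parallel_size >= 2.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain why LLM decode is memory-bandwidth bound."], params)
print(outputs[0].outputs[0].text)
```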

MI300X vs H100/H200 for LLM Inference

Metric | MI300X | H100 | H200 | Winner
--- | --- | --- | --- | ---
HBM Capacity (weights + KV cache) | 192 GB | 80 GB | 141 GB | MI300X (2.4x H100)
Memory Bandwidth | 5.3 TB/s | 3.35 TB/s | 4.89 TB/s | MI300X
Raw Compute (FP16) | 1,300 TFLOPS | 989 TFLOPS | ~990 TFLOPS | MI300X
Software Ecosystem | ROCm | CUDA | CUDA | NVIDIA
Optimized Kernels | Growing | Extensive | Extensive | NVIDIA

Google TPU Family

Spec | TPU v4 | TPU v5e | TPU v5p | Trillium (v6) | Ironwood (v7, Q4 2025)
--- | --- | --- | --- | --- | ---
Use Case | General | Inference | Training | General | Inference
Chips per Pod | 4,096 | 256 | 8,960 | TBD | TBD
Peak TFLOPS (BF16 unless noted) | 275 | 197 | 459 | ~900 | 4,614 (FP8)
INT8 TOPS | 275 | 393 | 918 | TBD | TBD
HBM per Chip | 32 GB | 16 GB | 95 GB | 32 GB | 192 GB
HBM Bandwidth | 1.2 TB/s | 0.82 TB/s | 2.8 TB/s | TBD | TBD
ICI Bandwidth | 400 Gbps | 400 Gbps | 4,800 Gbps | Improved | TBD
Topology | 3D Torus | 2D Torus | 3D Torus | 3D Torus | TBD
Improvement | Baseline | 2.5x throughput/$ vs v4 | 2.8x training perf vs v4 | 4x training perf vs v5e | ~2x perf/watt vs Trillium
TPU Unique Features
  • MegaCore: Two TPU cores share memory, operate as one large accelerator (2x FLOPs)
  • SparseCore: 4 dedicated processors per chip for embeddings (5-7x speedup, uses only 5% die area)
  • JAX/XLA: Tight integration, automatic sharding, compiler optimizations (small sharding sketch below)
  • ICI Fabric: Direct chip-to-chip connectivity enables massive pod scale
  • Cost: 65% lower inference costs reported (Midjourney case study)
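To make the JAX/XLA bullet concrete, here is a small explicit-sharding sketch. It assumes 8 visible TPU chips arranged as a 4x2 mesh; the array shape and axis names are arbitrary. XLA inserts the collectives implied by the sharding, and the same program structure scales to full pods over ICI.

```python
# Explicit sharding with JAX; assumes 8 TPU devices are visible.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()).reshape(4, 2)      # 8 chips as a 4x2 mesh
mesh = Mesh(devices, axis_names=("data", "model"))

# Split the batch dimension across "data" and the hidden dimension across "model".
sharding = NamedSharding(mesh, P("data", "model"))
x = jax.device_put(jnp.ones((256, 8192), dtype=jnp.bfloat16), sharding)

@jax.jit
def double(v):
    return v * 2.0                                   # compiled once, runs on every chip

print(double(x).sharding)                            # output keeps the input's sharding
```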

TPU v5e vs v5p: Choosing the Right One

Factor | TPU v5e | TPU v5p
--- | --- | ---
Optimized For | Inference (cost-efficient) | Training (max performance)
Cores per Chip | 1 | 2 (MegaCore)
Memory per Chip | 16 GB | 95 GB
Cost Efficiency | Best $/token | Premium
Scale | 256 chips max | 8,960 chips max
Use Case | Production inference | Large model training, research

Decision Framework: Which Hardware?

Scenario | Best Choice | Reason
--- | --- | ---
70B model, 128K context | MI300X | 192 GB HBM fits model weights plus a large KV cache
Production inference, cost-sensitive | TPU v5e | Best $/token for inference
Low latency, general LLM | H100 | Mature ecosystem, TensorRT-LLM optimizations
Very large model training | TPU v5p | 8,960-chip pods, ICI fabric for scale
MoE model (Mixtral) | H100 (TP=8) | NVLink bandwidth for expert routing
Maximum throughput 2025+ | Blackwell B200 | 11-15x improvement over Hopper
Azure OpenAI workloads | MI300X | Already deployed for GPT-3.5/4
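The table can be read as a rough first-pass heuristic: check whether the memory footprint (weights plus KV cache) fits a single GPU, then whether the job is training or cost-sensitive inference. The sketch below encodes that logic; the thresholds, constants, and return strings are illustrative assumptions, not vendor guidance.

```python
# Toy encoding of the decision table above; thresholds are rough assumptions.
KV_GB_PER_TOKEN_70B = 0.0026   # ~2.6 MB/token for a 70B model (FP16, full MHA)

def pick_accelerator(weights_gb: float, context_tokens: int,
                     training: bool, cost_sensitive: bool) -> str:
    if training and weights_gb > 300:
        return "TPU v5p pod"            # scale-out training over the ICI fabric
    if cost_sensitive and not training:
        return "TPU v5e"                # best $/token for production inference
    footprint_gb = weights_gb + KV_GB_PER_TOKEN_70B * context_tokens
    if footprint_gb > 141:              # exceeds even H200's HBM on a single GPU
        return "MI300X"
    return "H100/H200"                  # mature CUDA / TensorRT-LLM stack

print(pick_accelerator(weights_gb=140, context_tokens=32_768,
                       training=False, cost_sensitive=False))   # -> MI300X
```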

Key Metrics for LLM Serving

Why Memory Bandwidth Matters More Than Compute

LLM inference (especially decode phase) is memory-bandwidth bound, not compute-bound.

  • Each token generation reads entire model weights from HBM
  • 70B model at FP16 = 140 GB read per token
  • At 3.35 TB/s (H100): ~24 tokens/sec theoretical max
  • At 5.3 TB/s (MI300X): ~38 tokens/sec theoretical max

Rule of thumb: For decode-heavy workloads, prioritize memory bandwidth over raw FLOPS.
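The bullet arithmetic above generalizes to a one-line bound: tokens/sec ≤ HBM bandwidth / bytes of weights read per token. A quick sketch of that calculation, ignoring KV-cache reads, batching, and compute overlap (so real throughput is lower):

```python
# Upper bound on single-stream decode rate from memory bandwidth alone.
def decode_tokens_per_sec(bandwidth_tb_s: float, params_billions: float,
                          bytes_per_param: float = 2.0) -> float:
    weight_bytes = params_billions * 1e9 * bytes_per_param   # e.g. 70B x 2 B = 140 GB
    return bandwidth_tb_s * 1e12 / weight_bytes

for name, bw in [("H100", 3.35), ("H200", 4.89), ("MI300X", 5.3)]:
    print(f"{name}: ~{decode_tokens_per_sec(bw, 70):.0f} tokens/s ceiling, 70B FP16")
```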

Memory Capacity for KV Cache

For 70B Llama with 80 layers, 64 heads, 128 head_dim, FP16:

  • Per-token KV: 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB (full multi-head attention; Llama 70B's grouped-query attention with 8 KV heads cuts this by 8x)
  • 32K context: ~85 GB per sequence
  • H100 (80 GB): the FP16 weights alone (140 GB) exceed a single GPU, so at least 2-way tensor parallelism is needed before any KV cache
  • MI300X (192 GB): weights fit on one GPU with ~50 GB left for KV cache, roughly 20K tokens at these settings (far more with GQA or a quantized KV cache); see the sketch below
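The same arithmetic as a short script, so the numbers can be re-derived for other context lengths, precisions, or head counts (80 layers / 64 KV heads / 128 head_dim and full MHA are the assumptions stated above; set kv_heads=8 to model GQA):

```python
# KV-cache sizing for the 70B configuration described above (FP16, full MHA).
def kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # factor of 2 = K and V

per_token_gb = kv_bytes_per_token() / 1e9                   # ~0.0026 GB (~2.6 MB)
per_seq_gb = per_token_gb * 32_768                          # ~86 GB at 32K context
weights_gb = 70e9 * 2 / 1e9                                 # 140 GB of FP16 weights
print(f"One 32K-token sequence: ~{per_seq_gb:.0f} GB of KV cache")

for name, hbm_gb in [("H100", 80), ("H200", 141), ("MI300X", 192)]:
    free_gb = hbm_gb - weights_gb
    if free_gb <= 0:
        print(f"{name}: weights alone do not fit ({-free_gb:.0f} GB short)")
    else:
        print(f"{name}: {free_gb:.0f} GB left for KV -> ~{int(free_gb / per_token_gb):,} tokens")
```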