Quick Comparison: Current Generation
| Spec | NVIDIA H100 | NVIDIA H200 | AMD MI300X | Google TPU v5p |
|---|---|---|---|---|
| HBM Capacity | 80 GB | 141 GB | 192 GB | 95 GB |
| Memory Bandwidth | 3.35 TB/s | 4.89 TB/s | 5.3 TB/s | 4.8 TB/s |
| FP16 TFLOPS | 989 | ~990 | 1,300 | 459 |
| FP8 TFLOPS | 1,979 | ~1,980 | 2,600 | N/A |
| BF16 TFLOPS | 989 | ~990 | 1,300 | 459 (optimized) |
| Interconnect | NVLink 900 GB/s | NVLink 900 GB/s | Infinity Fabric | ICI 4,800 Gbps |
| TDP | 700W | 700W | 750W | ~450W |
| Best For | General LLM | Large context | Memory-bound | Scale-out training |
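A quick way to read the compute-vs-bandwidth columns together: dividing peak FLOPS by memory bandwidth gives each chip's "ridge point", the arithmetic intensity (FLOPs per byte moved) below which a kernel is bandwidth-bound rather than compute-bound. A minimal sketch using the table's FP16 figures (peak numbers; real kernels land below these):

```python
# Ridge point: peak FLOPs / memory bandwidth. Kernels whose arithmetic
# intensity (FLOPs per byte of HBM traffic) falls below this value are
# bandwidth-bound; above it, compute-bound.
specs = {
    # name: (peak FP16 TFLOPS, memory bandwidth in TB/s), from the table above
    "H100": (989, 3.35),
    "H200": (990, 4.89),
    "TPU v5p": (459, 4.8),
}

def ridge_point(tflops: float, tb_per_s: float) -> float:
    """FLOPs per byte required to saturate the compute units."""
    return tflops * 1e12 / (tb_per_s * 1e12)

for name, (tflops, bw) in specs.items():
    print(f"{name}: ~{ridge_point(tflops, bw):.0f} FLOPs/byte")
```

An FP16 matrix-vector product in decode does roughly 1 FLOP per byte of weights read, two orders of magnitude below these ridge points, which is why single-stream decode never comes close to peak TFLOPS.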
NVIDIA GPU Family
Hopper Architecture (H100, H200)
| Spec | H100 SXM | H100 PCIe | H200 SXM |
|---|---|---|---|
| HBM Type | HBM3 | HBM3 | HBM3e |
| Memory | 80 GB | 80 GB | 141 GB |
| Bandwidth | 3.35 TB/s | 2.0 TB/s | 4.89 TB/s |
| FP16 Tensor | 989 TFLOPS | 756 TFLOPS | ~990 TFLOPS |
| FP8 Tensor | 1,979 TFLOPS | 1,513 TFLOPS | ~1,980 TFLOPS |
| NVLink | 900 GB/s (18 links) | N/A | 900 GB/s |
| Transistors | 80B | 80B | 80B |
Blackwell Architecture (B100, B200, GB200) 2025
| Spec | B100 | B200 | GB200 NVL72 (flagship) |
|---|---|---|---|
| Architecture | Blackwell dual-die (10 TB/s die-to-die) | Blackwell dual-die (10 TB/s die-to-die) | Blackwell dual-die (10 TB/s die-to-die) |
| Transistors | 208B | 208B | 208B per GPU |
| Memory | 192 GB HBM3e | 192 GB HBM3e | 192 GB × 72 = 13.8 TB |
| Bandwidth | 6.0 TB/s | 8.0 TB/s | 8.0 TB/s per GPU |
| FP8 Tensor | 3,500 TFLOPS | 4,500 TFLOPS | ~324 PFLOPS per rack (4,500 × 72) |
| FP4 Tensor | 7,000 TFLOPS | 9,000 TFLOPS | Native support |
| NVLink | 1.8 TB/s | 1.8 TB/s | NVLink 5: 130 TB/s rack |
| LLM Throughput | 11-15x vs Hopper (H100) | 11-15x vs Hopper (H100) | 11-15x vs Hopper (H100) |
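Blackwell's native FP4 matters as much for memory as for compute: halving the bytes per weight both shrinks the model's footprint and roughly doubles the decode ceiling of a bandwidth-bound workload. A rough footprint sketch (weights only; ignores KV cache, activations, and quantization scale metadata):

```python
def weights_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits, fmt in [(16, "FP16"), (8, "FP8"), (4, "FP4")]:
    print(f"70B @ {fmt}: {weights_gb(70, bits):.0f} GB")
# A 70B model drops from 140 GB at FP16 to 35 GB at FP4,
# fitting comfortably in a single 192 GB B100/B200.
```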
NVIDIA Ecosystem Advantages
- CUDA: Mature ecosystem, extensive libraries, developer tools
- TensorRT-LLM: Optimized inference with FP8/FP4, in-flight batching
- Triton Inference Server: Multi-model serving, dynamic batching
- NVSwitch: 3.6 TB/s across 8 GPUs in HGX systems
AMD Instinct Family
| Spec | MI250X | MI300X | MI325X | MI350 (Jun 2025) |
|---|---|---|---|---|
| Architecture | CDNA 2 | CDNA 3 | CDNA 3 | CDNA 4 |
| Memory | 128 GB HBM2e | 192 GB HBM3 | 256 GB HBM3e | 288 GB HBM3e |
| Bandwidth | 3.2 TB/s | 5.3 TB/s | 6.0 TB/s | TBD |
| FP16 TFLOPS | 383 | 1,300 | 1,300 | TBD |
| FP32 TFLOPS | 47.9 | 163.4 | 163.4 | TBD |
| TDP | 560W | 750W | 750W | TBD |
| Chiplets | 2 GCDs | 8 XCDs + 4 IODs | 8 XCDs + 4 IODs | TBD |
AMD Advantages & Production Use
- Memory Leadership: 192 GB HBM3 on MI300X (2.4x H100 capacity) and 256 GB HBM3e on MI325X, fitting larger batches and longer contexts
- Microsoft Azure: Azure OpenAI (GPT-3.5/4), Copilot running on MI300X clusters
- Meta: Planning a ~600K-accelerator deployment that includes MI300X
- ROCm 7: 3.5x inference uplift vs ROCm 6
- vLLM: Full support with optimized Docker images
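The practical payoff of the larger HBM is batch and context headroom. A back-of-envelope sketch, assuming FP16 weights and treating all HBM left after the weights as KV-cache space (ignores activations and framework overhead, so real headroom is smaller):

```python
def kv_headroom_gb(hbm_gb: float, model_params_b: float,
                   bytes_per_param: int = 2) -> float:
    """HBM remaining for KV cache after FP16 weights (rough estimate)."""
    return hbm_gb - model_params_b * bytes_per_param

# A 13B model served on a single GPU:
for name, hbm in [("H100", 80), ("H200", 141), ("MI300X", 192)]:
    print(f"{name}: {kv_headroom_gb(hbm, 13):.0f} GB free for KV cache")
```

More free HBM translates directly into more concurrent sequences or longer contexts per GPU, which is the batching headroom continuous-batching servers like vLLM exploit.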
MI300X vs H100/H200 for LLM Inference
| Metric | MI300X | H100 | H200 | Winner |
|---|---|---|---|---|
| Total HBM (weights + KV cache) | 192 GB | 80 GB | 141 GB | MI300X (2.4x H100) |
| Memory Bandwidth | 5.3 TB/s | 3.35 TB/s | 4.89 TB/s | MI300X |
| Raw Compute (FP16) | 1,300 TFLOPS | 989 TFLOPS | ~990 TFLOPS | MI300X |
| Software Ecosystem | ROCm | CUDA | CUDA | NVIDIA |
| Optimized Kernels | Growing | Extensive | Extensive | NVIDIA |
Google TPU Family
| Spec | TPU v4 | TPU v5e | TPU v5p | Trillium (v6) | Ironwood (v7, Q4 2025) |
|---|---|---|---|---|---|
| Use Case | General | Inference | Training | General | Inference |
| Chips per Pod | 4,096 | 256 | 8,960 | TBD | TBD |
| BF16 TFLOPS | 275 | 197 | 459 | ~900 | 4,614 |
| INT8 TOPS | 275 | 393 | 918 | TBD | TBD |
| HBM per Chip | 32 GB | 16 GB | 95 GB | 32 GB | 192 GB |
| HBM Bandwidth | 1.2 TB/s | 1.6 TB/s | 4.8 TB/s | TBD | TBD |
| ICI Bandwidth | 400 Gbps | 400 Gbps | 4,800 Gbps | Improved | TBD |
| Topology | 3D Torus | 2D Torus | 3D Torus | 3D Torus | TBD |
| Improvement | Baseline | 2.5x throughput/$ vs v4 | 2.8x vs v4 | 4x training vs v5e | 2x perf/watt vs Trillium |
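Pod-level peak compute is simply chips × per-chip TFLOPS, and for a full v5p pod that lands in exaFLOP territory. A minimal sketch from the table's numbers (peak figures; sustained utilization will be well below this):

```python
def pod_peak_eflops(chips: int, tflops_per_chip: float) -> float:
    """Aggregate peak compute in exaFLOPS (1 EFLOPS = 1e6 TFLOPS)."""
    return chips * tflops_per_chip / 1e6

print(f"v5p pod (8,960 x 459 TFLOPS): {pod_peak_eflops(8960, 459):.2f} EFLOPS BF16")
print(f"v5e pod (256 x 197 TFLOPS):   {pod_peak_eflops(256, 197):.3f} EFLOPS BF16")
```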
TPU Unique Features
- MegaCore: Two TPU cores share memory, operate as one large accelerator (2x FLOPs)
- SparseCore: 4 dedicated processors per chip for embeddings (5-7x speedup, uses only 5% die area)
- JAX/XLA: Tight integration, automatic sharding, compiler optimizations
- ICI Fabric: Direct chip-to-chip connectivity enables massive pod scale
- Cost: 65% lower inference costs reported (Midjourney case study)
TPU v5e vs v5p: Choosing the Right One
| Factor | TPU v5e | TPU v5p |
|---|---|---|
| Optimized For | Inference (cost-efficient) | Training (max performance) |
| Cores per Chip | 1 | 2 (MegaCore) |
| Memory per Chip | 16 GB | 95 GB |
| Cost Efficiency | Best $/token | Premium |
| Scale | 256 chips max | 8,960 chips max |
| Use Case | Production inference | Large model training, research |
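One concrete consequence of the 16 GB vs 95 GB gap is the minimum slice size needed just to hold the weights. A rough sketch, counting BF16 weights only (no KV cache, activations, or optimizer state, all of which raise the real chip count):

```python
import math

def min_chips(model_params_b: float, hbm_per_chip_gb: float,
              bytes_per_param: int = 2) -> int:
    """Smallest chip count whose combined HBM holds the BF16 weights."""
    weights_gb = model_params_b * bytes_per_param
    return math.ceil(weights_gb / hbm_per_chip_gb)

for name, hbm in [("v5e", 16), ("v5p", 95)]:
    print(f"70B model on {name}: >= {min_chips(70, hbm)} chips")
```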
Decision Framework: Which Hardware?
| Scenario | Best Choice | Reason |
|---|---|---|
| 70B model, 128K context | MI300X | 192 GB HBM fits FP16 weights plus a large (GQA) KV cache |
| Production inference, cost-sensitive | TPU v5e | Best $/token for inference |
| Low latency, general LLM | H100 | Mature ecosystem, TensorRT-LLM optimizations |
| Very large model training | TPU v5p | 8,960 chip pods, ICI fabric for scale |
| MoE model (Mixtral) | H100 (TP=8) | NVLink bandwidth for expert routing |
| Maximum throughput 2025+ | Blackwell B200 | 11-15x improvement over Hopper |
| Azure OpenAI workloads | MI300X | Already deployed for GPT-3.5/4 |
Key Metrics for LLM Serving
Why Memory Bandwidth Matters More Than Compute
LLM inference (especially decode phase) is memory-bandwidth bound, not compute-bound.
- Each token generation reads entire model weights from HBM
- 70B model at FP16 = 140 GB read per token
- At 3.35 TB/s (H100): ~24 tokens/sec theoretical max
- At 5.3 TB/s (MI300X): ~38 tokens/sec theoretical max
Rule of thumb: For decode-heavy workloads, prioritize memory bandwidth over raw FLOPS.
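The ceilings above come straight from the roofline bound: tokens/sec ≤ bandwidth ÷ bytes of weights read per token. A minimal sketch (single sequence, weights fully re-read per token, KV-cache reads ignored):

```python
def max_tokens_per_sec(bandwidth_tb_s: float, model_params_b: float,
                       bytes_per_param: int = 2) -> float:
    """Theoretical decode ceiling: each token re-reads all weights from HBM."""
    bytes_per_token = model_params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(f"H100 (3.35 TB/s):  {max_tokens_per_sec(3.35, 70):.0f} tok/s")
print(f"MI300X (5.3 TB/s): {max_tokens_per_sec(5.3, 70):.0f} tok/s")
# Batching amortizes the weight read across sequences, which is why
# high-throughput serving raises batch size until compute- or KV-bound.
```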
Memory Capacity for KV Cache
For 70B Llama with 80 layers, 64 heads, 128 head_dim, FP16:
- Per-token KV: 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB
- 32K context: ~84 GB per sequence
- H100 (80 GB): Model + 1 sequence barely fits
- MI300X (256 GB): Model + 3-4 sequences comfortably
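The per-token arithmetic generalizes to any model shape. A small calculator; pass `n_kv_heads` equal to the attention head count for full MHA, or to the grouped-KV head count (8 for production Llama 70B) to model GQA:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size for one sequence: 2 (K and V) x layers x kv_heads x head_dim."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len / 1e9

# Full multi-head attention (64 KV heads), as in the example above:
print(f"MHA, 32K context: {kv_cache_gb(80, 64, 128, 32768):.1f} GB")
# Grouped-query attention with 8 KV heads:
print(f"GQA, 32K context: {kv_cache_gb(80, 8, 128, 32768):.1f} GB")
```

The same function shows why long contexts dominate memory planning: KV-cache size grows linearly with context length while the weights stay fixed.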