Quick Comparison: Current Generation
| Spec | NVIDIA H100 | NVIDIA H200 | AMD MI300X | Google TPU v5p |
|---|---|---|---|---|
| HBM Capacity | 80 GB | 141 GB | 192 GB | 95 GB |
| Memory Bandwidth | 3.35 TB/s | 4.89 TB/s | 5.3 TB/s | 4.8 TB/s |
| FP16 TFLOPS | 989 | ~990 | 1,300 | 459 |
| FP8 TFLOPS | 1,979 | ~1,980 | 2,600 | N/A |
| BF16 TFLOPS | 989 | ~990 | 1,300 | 459 (optimized) |
| Interconnect | NVLink 900 GB/s | NVLink 900 GB/s | Infinity Fabric | ICI 4,800 Gbps |
| TDP | 700W | 700W | 750W | ~450W |
| Best For | General LLM | Large context | Memory-bound | Scale-out training |
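A quick way to read the compute-vs-bandwidth columns together: dividing peak FLOPS by memory bandwidth gives each chip's "ridge point", the arithmetic intensity (FLOPs per byte moved) below which a kernel is bandwidth-bound rather than compute-bound. A minimal sketch using the table's FP16 figures (peak numbers; real kernels land below these):

```python
# Ridge point: peak FLOPs / memory bandwidth. Kernels whose arithmetic
# intensity (FLOPs per byte of HBM traffic) falls below this value are
# bandwidth-bound; above it, compute-bound.
specs = {
    # name: (peak FP16 TFLOPS, memory bandwidth in TB/s), from the table above
    "H100": (989, 3.35),
    "H200": (990, 4.89),
    "TPU v5p": (459, 4.8),
}

def ridge_point(tflops: float, tb_per_s: float) -> float:
    """FLOPs per byte required to saturate the compute units."""
    return tflops * 1e12 / (tb_per_s * 1e12)

for name, (tflops, bw) in specs.items():
    print(f"{name}: ~{ridge_point(tflops, bw):.0f} FLOPs/byte")
```

An FP16 matrix-vector product in decode does roughly 1 FLOP per byte of weights read, two orders of magnitude below these ridge points, which is why single-stream decode never comes close to peak TFLOPS.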
NVIDIA GPU Family
Hopper Architecture (H100, H200)
| Spec | H100 SXM | H100 PCIe | H200 SXM |
|---|---|---|---|
| HBM Type | HBM3 | HBM3 | HBM3e |
| Memory | 80 GB | 80 GB | 141 GB |
| Bandwidth | 3.35 TB/s | 2.0 TB/s | 4.89 TB/s |
| FP16 Tensor | 989 TFLOPS | 756 TFLOPS | ~990 TFLOPS |
| FP8 Tensor | 1,979 TFLOPS | 1,513 TFLOPS | ~1,980 TFLOPS |
| NVLink | 900 GB/s (18 links) | N/A | 900 GB/s |
| Transistors | 80B | 80B | 80B |
Blackwell Architecture (B100, B200, GB200) 2025
| Spec | B100 | B200 | GB200 NVL72 (flagship) |
|---|---|---|---|
| Architecture | Blackwell dual-die (10 TB/s die-to-die) | Blackwell dual-die (10 TB/s die-to-die) | Blackwell dual-die (10 TB/s die-to-die) |
| Transistors | 208B | 208B | 208B per GPU |
| Memory | 192 GB HBM3e | 192 GB HBM3e | 192 GB × 72 = 13.8 TB |
| Bandwidth | 6.0 TB/s | 8.0 TB/s | 8.0 TB/s per GPU |
| FP8 Tensor | 3,500 TFLOPS | 4,500 TFLOPS | ~324 PFLOPS per rack (4,500 × 72) |
| FP4 Tensor | 7,000 TFLOPS | 9,000 TFLOPS | Native support |
| NVLink | 1.8 TB/s | 1.8 TB/s | NVLink 5: 130 TB/s rack |
| LLM Throughput | 11-15x vs Hopper (H100) | 11-15x vs Hopper (H100) | 11-15x vs Hopper (H100) |
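Blackwell's native FP4 matters as much for memory as for compute: halving the bytes per weight both shrinks the model's footprint and roughly doubles the decode ceiling of a bandwidth-bound workload. A rough footprint sketch (weights only; ignores KV cache, activations, and quantization scale metadata):

```python
def weights_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits, fmt in [(16, "FP16"), (8, "FP8"), (4, "FP4")]:
    print(f"70B @ {fmt}: {weights_gb(70, bits):.0f} GB")
# A 70B model drops from 140 GB at FP16 to 35 GB at FP4,
# fitting comfortably in a single 192 GB B100/B200.
```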
NVIDIA Ecosystem Advantages
- CUDA: Mature ecosystem, extensive libraries, developer tools
- TensorRT-LLM: Optimized inference with FP8/FP4, in-flight batching
- Triton Inference Server: Multi-model serving, dynamic batching
- NVSwitch: 3.6 TB/s across 8 GPUs in HGX systems
AMD Instinct Family
| Spec | MI250X | MI300X | MI325X | MI350 (Jun 2025) |
|---|---|---|---|---|
| Architecture | CDNA 2 | CDNA 3 | CDNA 3 | CDNA 4 |
| Memory | 128 GB HBM2e | 192 GB HBM3 | 256 GB HBM3e | 288 GB HBM3e |
| Bandwidth | 3.2 TB/s | 5.3 TB/s | 6.0 TB/s | TBD |
| FP16 TFLOPS | 383 | 1,300 | 1,300 | TBD |
| FP32 TFLOPS | 47.9 | 163.4 | 163.4 | TBD |
| TDP | 560W | 750W | 750W | TBD |
| Chiplets | 2 GCDs | 8 XCDs + 4 IODs | 8 XCDs + 4 IODs | TBD |
AMD Advantages & Production Use
- Memory Leadership: 192 GB HBM3 on MI300X (2.4x H100 capacity) and 256 GB HBM3e on MI325X, fitting larger batches and longer contexts
- Microsoft Azure: Azure OpenAI (GPT-3.5/4), Copilot running on MI300X clusters
- Meta: Planning a ~600K-accelerator deployment that includes MI300X
- ROCm 7: 3.5x inference uplift vs ROCm 6
- vLLM: Full support with optimized Docker images
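The practical payoff of the larger HBM is batch and context headroom. A back-of-envelope sketch, assuming FP16 weights and treating all HBM left after the weights as KV-cache space (ignores activations and framework overhead, so real headroom is smaller):

```python
def kv_headroom_gb(hbm_gb: float, model_params_b: float,
                   bytes_per_param: int = 2) -> float:
    """HBM remaining for KV cache after FP16 weights (rough estimate)."""
    return hbm_gb - model_params_b * bytes_per_param

# A 13B model served on a single GPU:
for name, hbm in [("H100", 80), ("H200", 141), ("MI300X", 192)]:
    print(f"{name}: {kv_headroom_gb(hbm, 13):.0f} GB free for KV cache")
```

More free HBM translates directly into more concurrent sequences or longer contexts per GPU, which is the batching headroom continuous-batching servers like vLLM exploit.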
MI300X vs H100/H200 for LLM Inference
| Metric | MI300X | H100 | H200 | Winner |
|---|---|---|---|---|
| Total HBM (weights + KV cache) | 192 GB | 80 GB | 141 GB | MI300X (2.4x H100) |
| Memory Bandwidth | 5.3 TB/s | 3.35 TB/s | 4.89 TB/s | MI300X |
| Raw Compute (FP16) | 1,300 TFLOPS | 989 TFLOPS | ~990 TFLOPS | MI300X |
| Software Ecosystem | ROCm | CUDA | CUDA | NVIDIA |
| Optimized Kernels | Growing | Extensive | Extensive | NVIDIA |
Google TPU Family
| Spec | TPU v4 | TPU v5e | TPU v5p | Trillium (v6) | Ironwood (v7, Q4 2025) |
|---|---|---|---|---|---|
| Use Case | General | Inference | Training | General | Inference |
| Chips per Pod | 4,096 | 256 | 8,960 | TBD | TBD |
| BF16 TFLOPS | 275 | 197 | 459 | ~900 | 4,614 |
| INT8 TOPS | 275 | 393 | 918 | TBD | TBD |
| HBM per Chip | 32 GB | 16 GB | 95 GB | 32 GB | 192 GB |
| HBM Bandwidth | 1.2 TB/s | 1.6 TB/s | 4.8 TB/s | TBD | TBD |
| ICI Bandwidth | 400 Gbps | 400 Gbps | 4,800 Gbps | Improved | TBD |
| Topology | 3D Torus | 2D Torus | 3D Torus | 3D Torus | TBD |
| Improvement | Baseline | 2.5x throughput/$ vs v4 | 2.8x vs v4 | 4x training vs v5e | 2x perf/watt vs Trillium |
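Pod-level peak compute is simply chips × per-chip TFLOPS, and for a full v5p pod that lands in exaFLOP territory. A minimal sketch from the table's numbers (peak figures; sustained utilization will be well below this):

```python
def pod_peak_eflops(chips: int, tflops_per_chip: float) -> float:
    """Aggregate peak compute in exaFLOPS (1 EFLOPS = 1e6 TFLOPS)."""
    return chips * tflops_per_chip / 1e6

print(f"v5p pod (8,960 x 459 TFLOPS): {pod_peak_eflops(8960, 459):.2f} EFLOPS BF16")
print(f"v5e pod (256 x 197 TFLOPS):   {pod_peak_eflops(256, 197):.3f} EFLOPS BF16")
```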
TPU Unique Features
- MegaCore: Two TPU cores share memory, operate as one large accelerator (2x FLOPs)
- SparseCore: 4 dedicated processors per chip for embeddings (5-7x speedup, uses only 5% die area)
- JAX/XLA: Tight integration, automatic sharding, compiler optimizations
- ICI Fabric: Direct chip-to-chip connectivity enables massive pod scale
- Cost: 65% lower inference costs reported (Midjourney case study)
TPU v5e vs v5p: Choosing the Right One
| Factor | TPU v5e | TPU v5p |
|---|---|---|
| Optimized For | Inference (cost-efficient) | Training (max performance) |
| Cores per Chip | 1 | 2 (MegaCore) |
| Memory per Chip | 16 GB | 95 GB |
| Cost Efficiency | Best $/token | Premium |
| Scale | 256 chips max | 8,960 chips max |
| Use Case | Production inference | Large model training, research |
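One concrete consequence of the 16 GB vs 95 GB gap is the minimum slice size needed just to hold the weights. A rough sketch, counting BF16 weights only (no KV cache, activations, or optimizer state, all of which raise the real chip count):

```python
import math

def min_chips(model_params_b: float, hbm_per_chip_gb: float,
              bytes_per_param: int = 2) -> int:
    """Smallest chip count whose combined HBM holds the BF16 weights."""
    weights_gb = model_params_b * bytes_per_param
    return math.ceil(weights_gb / hbm_per_chip_gb)

for name, hbm in [("v5e", 16), ("v5p", 95)]:
    print(f"70B model on {name}: >= {min_chips(70, hbm)} chips")
```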
Decision Framework: Which Hardware?
| Scenario | Best Choice | Reason |
|---|---|---|
| 70B model, 128K context | MI300X | 192 GB HBM fits FP16 weights plus a large (GQA) KV cache |
| Production inference, cost-sensitive | TPU v5e | Best $/token for inference |
| Low latency, general LLM | H100 | Mature ecosystem, TensorRT-LLM optimizations |
| Very large model training | TPU v5p | 8,960 chip pods, ICI fabric for scale |
| MoE model (Mixtral) | H100 (TP=8) | NVLink bandwidth for expert routing |
| Maximum throughput 2025+ | Blackwell B200 | 11-15x improvement over Hopper |
| Azure OpenAI workloads | MI300X | Already deployed for GPT-3.5/4 |
Key Metrics for LLM Serving
Why Memory Bandwidth Matters More Than Compute
LLM inference (especially decode phase) is memory-bandwidth bound, not compute-bound.
- Each token generation reads entire model weights from HBM
- 70B model at FP16 = 140 GB read per token
- At 3.35 TB/s (H100): ~24 tokens/sec theoretical max
- At 5.3 TB/s (MI300X): ~38 tokens/sec theoretical max
Rule of thumb: For decode-heavy workloads, prioritize memory bandwidth over raw FLOPS.
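The ceilings above come straight from the roofline bound: tokens/sec ≤ bandwidth ÷ bytes of weights read per token. A minimal sketch (single sequence, weights fully re-read per token, KV-cache reads ignored):

```python
def max_tokens_per_sec(bandwidth_tb_s: float, model_params_b: float,
                       bytes_per_param: int = 2) -> float:
    """Theoretical decode ceiling: each token re-reads all weights from HBM."""
    bytes_per_token = model_params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(f"H100 (3.35 TB/s):  {max_tokens_per_sec(3.35, 70):.0f} tok/s")
print(f"MI300X (5.3 TB/s): {max_tokens_per_sec(5.3, 70):.0f} tok/s")
# Batching amortizes the weight read across sequences, which is why
# high-throughput serving raises batch size until compute- or KV-bound.
```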
Memory Capacity for KV Cache
For 70B Llama with 80 layers, 64 heads, 128 head_dim, FP16:
- Per-token KV: 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB
- 32K context: ~84 GB per sequence
- H100 (80 GB): Model + 1 sequence barely fits
- MI300X (256 GB): Model + 3-4 sequences comfortably
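The per-token arithmetic generalizes to any model shape. A small calculator; pass `n_kv_heads` equal to the attention head count for full MHA, or to the grouped-KV head count (8 for production Llama 70B) to model GQA:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size for one sequence: 2 (K and V) x layers x kv_heads x head_dim."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len / 1e9

# Full multi-head attention (64 KV heads), as in the example above:
print(f"MHA, 32K context: {kv_cache_gb(80, 64, 128, 32768):.1f} GB")
# Grouped-query attention with 8 KV heads:
print(f"GQA, 32K context: {kv_cache_gb(80, 8, 128, 32768):.1f} GB")
```

The same function shows why long contexts dominate memory planning: KV-cache size grows linearly with context length while the weights stay fixed.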