LLM Inference Systems

Systems Design Cheat Sheet

Architecture Overview

[Architecture diagram]
  • CLIENTS: 1000s concurrent, SSE streaming
  • API GATEWAY: auth / rate limit · model routing · load balancing · circuit breaker
  • QUEUE LAYER: per-model queues (Model-A … Model-N), priority + FIFO
  • SCHEDULER: continuous batching · chunked prefill · prefill/decode split · KV-cache preemption · adaptive batch size
  • GPU INFERENCE CLUSTER: Triton Inference Server (orchestration) over vLLM (PagedAttention + scheduling) over TensorRT-LLM (optimized CUDA kernels), spanning GPU 0 … GPU N
  • OBSERVABILITY: Prometheus + DCGM + OpenTelemetry + burn-rate alerts; TTFT P99 · tokens/sec · GPU util · queue depth · error budget · synthetic probes
  • AUTOSCALER: per-model GPU pool scaling triggered by queue depth and latency P99; canary deploys · multi-AZ · spot/preemptible overflow · model weight caching (LRU)
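The autoscaler's triggers (queue depth plus latency P99) can be sketched as a per-pool decision function. A minimal sketch with illustrative thresholds; the names (`PoolMetrics`, `scale_decision`) are hypothetical, not from any real autoscaler API:

```python
from dataclasses import dataclass

@dataclass
class PoolMetrics:
    queue_depth: int      # requests waiting in this model's queue
    ttft_p99_ms: float    # observed P99 time-to-first-token
    gpu_util: float       # mean GPU utilization, 0.0-1.0

def scale_decision(m: PoolMetrics,
                   ttft_slo_ms: float = 200.0,
                   max_queue_depth: int = 50) -> int:
    """Return the GPU delta for one model's pool: +1 scale out, -1 scale in, 0 hold."""
    # Scale out when either trigger fires: queue backing up or TTFT SLO at risk.
    if m.queue_depth > max_queue_depth or m.ttft_p99_ms > ttft_slo_ms:
        return +1
    # Scale in only when the pool is clearly idle and latency is healthy.
    if m.gpu_util < 0.30 and m.queue_depth == 0 and m.ttft_p99_ms < 0.5 * ttft_slo_ms:
        return -1
    return 0
```

Keeping the decision per-model matches the per-model queues and GPU pools in the diagram: one hot model can scale out without disturbing the others.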
Q1 — System Design: Multi-Model LLM Serving

⬡ API Gateway

⬡ Queue & Scheduling

⬡ GPU Inference

Inference Stack Layers
Layer | Tool | Role | Key Feature
Orchestration | Triton | Fleet management: model versioning, A/B routing, multi-model mux, health checks | Model-agnostic; serves LLMs + vision + ensembles
Serving | vLLM | Request scheduling: continuous batching, KV-cache management | PagedAttention: virtual memory for the KV-cache
Engine | TensorRT-LLM | Optimized execution: kernel fusion, quantization, custom CUDA | FP8/INT4 quantization; NVIDIA-specific optimization
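The PagedAttention idea in the serving row can be illustrated without vLLM: a block table maps logical token positions to non-contiguous physical KV-cache blocks, just as page tables map virtual to physical memory. A toy sketch (`BlockTable` and its methods are hypothetical, not vLLM's API; 16 is vLLM's default block size):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockTable:
    """Toy sketch of PagedAttention's indirection: logical token positions
    map to non-contiguous physical KV-cache blocks, like virtual-memory pages."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up,
        # so the only waste is the tail of the last partial block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position into (physical block id, offset).
        return self.table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

Because blocks are allocated on demand and need not be contiguous, sequences of very different lengths can share one GPU memory pool without reserving worst-case space up front.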
Q2 — Scheduling & Batching for 200ms TTFT

⬡ Static vs Dynamic Batching

Criterion | Static | Dynamic / Continuous
Mechanism | Wait for N requests, pad to max length | Requests join the batch at each decode step
GPU Waste | High (padding tokens) | Minimal (no padding)
TTFT | Bad (head-of-line blocking) | Good (immediate admission)
Throughput | Moderate | High (slots freed immediately)

Verdict: continuous batching is required for heterogeneous lengths + a strict TTFT target.
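The continuous-batching mechanism can be simulated in a few lines. A toy sketch (request ids and step counts are illustrative) showing how a finished request's slot is reused on the very next decode step, rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batch_steps(requests, max_batch=4):
    """Toy continuous-batching loop. `requests` are (id, decode_steps_remaining)
    tuples; returns which ids ran at each decode step."""
    waiting = deque(requests)
    running = {}   # id -> remaining decode steps
    trace = []
    while waiting or running:
        # Admit waiting requests into freed slots before every step:
        # no waiting for a full batch, so admission is immediate.
        while waiting and len(running) < max_batch:
            rid, steps = waiting.popleft()
            running[rid] = steps
        trace.append(sorted(running))
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot freed; reusable on the next step
    return trace
```

With `max_batch=2` and requests A (1 step), B (3), C (2), the slot A frees after step 1 is occupied by C at step 2, which is exactly the "slots freed immediately" property in the table.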

⬡ TTFT Budget Breakdown

TTFT ≈ queue_wait + prefill_time
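Under the common assumption that prefill is compute-bound and roughly linear in prompt length, the budget can be worked through numerically (the 20,000 tok/s prefill throughput below is illustrative, not a measured figure):

```python
def ttft_ms(queue_wait_ms, prompt_tokens, prefill_tok_per_s):
    """TTFT ~= queue_wait + prefill_time, with
    prefill_time ~= prompt_tokens / prefill_throughput (compute-bound prefill)."""
    prefill_ms = prompt_tokens / prefill_tok_per_s * 1000.0
    return queue_wait_ms + prefill_ms

# Illustrative budget for a 200 ms TTFT target:
# a 2,000-token prompt at 20,000 prefill tok/s costs 100 ms of prefill,
# leaving at most 100 ms of queue wait before the SLO is blown.
```

This is why the scheduler techniques above (chunked prefill, prefill/decode split) matter: long prompts can consume the entire TTFT budget on prefill alone unless their cost is bounded per step.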
Q3 — Monitoring, Alerting & High Availability

⬡ Key Metrics

All metrics are segmented per-model and per-GPU

⬡ Instrumentation Stack

⬡ High Availability

⬡ Alert Strategy — SLO Burn-Rate Based

🔴 Page (Immediate)

  • GPU Xid errors (hardware fault)
  • OOM kills on inference nodes
  • Serving process crash
  • TTFT P99 > 2× SLO target

🟡 Warning (Ticket)

  • Sustained high queue depth
  • GPU util < 30% (waste) or > 95% (saturation)
  • Memory fragmentation trending up
  • Error budget burning at 10× the sustainable rate
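Burn rate here is the standard SRE definition: observed error rate divided by the error budget the SLO allows. A minimal sketch (function names are illustrative):

```python
def burn_rate(error_rate, slo_target=0.999):
    """SLO burn rate: 1.0 means the budget lasts exactly the SLO window;
    10x means the window's budget is gone in one tenth of the window."""
    budget = 1.0 - slo_target   # allowed error fraction, e.g. 0.1% for 99.9%
    return error_rate / budget

def budget_exhausted_hours(rate, window_hours=720.0):
    """Hours until a 30-day error budget is exhausted at the current burn rate."""
    return window_hours / rate
```

For example, a 1% error rate against a 99.9% SLO is a 10× burn: the 30-day budget disappears in about 72 hours, which is why 10× is a ticket-level threshold rather than an immediate page.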

🔵 Anomaly Detection

  • Statistical deviation in tokens/sec per model
  • Catches silent model corruption
  • Degraded GPU without hard faults
  • Continuous synthetic probe regression
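A simple z-score check captures this kind of detection. A sketch assuming per-model tokens/sec samples are available (the names and the 3σ threshold are illustrative):

```python
from statistics import mean, stdev

def tokens_per_sec_anomaly(history, current, z_threshold=3.0):
    """Flag a statistically unusual tokens/sec reading for one model.
    Catches silent regressions (degraded GPU, corrupted weights) that
    raise no hard fault and trip no static threshold."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    z = abs(current - mu) / sigma
    return z > z_threshold
```

Keeping a separate baseline per model matters: each model has its own normal throughput, so a shared static threshold would either miss regressions on fast models or page constantly on slow ones.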