LLM Inference Systems

Systems Design Cheat Sheet

Architecture Overview

[Architecture diagram]
  • CLIENTS: 1000s concurrent, SSE streaming
  • API GATEWAY: auth / rate limit · model routing · load balancing · circuit breaker
  • QUEUE LAYER: per-model queues (Model-A … Model-N), priority + FIFO
  • SCHEDULER: continuous batching · chunked prefill · prefill/decode split · KV-cache preemption · adaptive batch size
  • GPU INFERENCE CLUSTER: Triton Inference Server (orchestration) over vLLM (PagedAttention + scheduling) over TensorRT-LLM (optimized CUDA kernels), spanning GPU 0 … GPU N
  • OBSERVABILITY: Prometheus + DCGM + OpenTelemetry + burn-rate alerts; TTFT P99 · tokens/sec · GPU util · queue depth · error budget · synthetic probes
  • AUTOSCALER: per-model GPU pool scaling triggered by queue depth and latency P99; canary deploys · multi-AZ · spot/preemptible overflow · model weight caching (LRU)
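The autoscaler's triggers (queue depth plus latency P99) can be sketched as a per-pool decision function. A minimal sketch with illustrative thresholds; the names (`PoolMetrics`, `scale_decision`) are hypothetical, not from any real autoscaler API:

```python
from dataclasses import dataclass

@dataclass
class PoolMetrics:
    queue_depth: int      # requests waiting in this model's queue
    ttft_p99_ms: float    # observed P99 time-to-first-token
    gpu_util: float       # mean GPU utilization, 0.0-1.0

def scale_decision(m: PoolMetrics,
                   ttft_slo_ms: float = 200.0,
                   max_queue_depth: int = 50) -> int:
    """Return the GPU delta for one model's pool: +1 scale out, -1 scale in, 0 hold."""
    # Scale out when either trigger fires: queue backing up or TTFT SLO at risk.
    if m.queue_depth > max_queue_depth or m.ttft_p99_ms > ttft_slo_ms:
        return +1
    # Scale in only when the pool is clearly idle and latency is healthy.
    if m.gpu_util < 0.30 and m.queue_depth == 0 and m.ttft_p99_ms < 0.5 * ttft_slo_ms:
        return -1
    return 0
```

Keeping the decision per-model matches the per-model queues and GPU pools in the diagram: one hot model can scale out without disturbing the others.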
Q1 — System Design: Multi-Model LLM Serving

⬡ API Gateway

⬡ Queue & Scheduling

⬡ GPU Inference

Inference Stack Layers
Layer | Tool | Role | Key Feature
Orchestration | Triton | Fleet management: model versioning, A/B routing, multi-model mux, health checks | Model-agnostic; serves LLMs + vision + ensembles
Serving | vLLM | Request scheduling: continuous batching, KV-cache management | PagedAttention: virtual memory for the KV-cache
Engine | TensorRT-LLM | Optimized execution: kernel fusion, quantization, custom CUDA | FP8/INT4 quantization; NVIDIA-specific optimization
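The PagedAttention idea in the serving row can be illustrated without vLLM: a block table maps logical token positions to non-contiguous physical KV-cache blocks, just as page tables map virtual to physical memory. A toy sketch (`BlockTable` and its methods are hypothetical, not vLLM's API; 16 is vLLM's default block size):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockTable:
    """Toy sketch of PagedAttention's indirection: logical token positions
    map to non-contiguous physical KV-cache blocks, like virtual-memory pages."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up,
        # so the only waste is the tail of the last partial block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position into (physical block id, offset).
        return self.table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
```

Because blocks are allocated on demand and need not be contiguous, sequences of very different lengths can share one GPU memory pool without reserving worst-case space up front.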
Q2 — Scheduling & Batching for 200ms TTFT

⬡ Static vs Dynamic Batching

Criterion | Static | Dynamic / Continuous
Mechanism | Wait for N requests, pad to max length | Requests join the batch at each decode step
GPU Waste | High (padding tokens) | Minimal (no padding)
TTFT | Bad (head-of-line blocking) | Good (immediate admission)
Throughput | Moderate | High (slots freed immediately)

Verdict: continuous batching is required for heterogeneous lengths + a strict TTFT target.
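The continuous-batching mechanism can be simulated in a few lines. A toy sketch (request ids and step counts are illustrative) showing how a finished request's slot is reused on the very next decode step, rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batch_steps(requests, max_batch=4):
    """Toy continuous-batching loop. `requests` are (id, decode_steps_remaining)
    tuples; returns which ids ran at each decode step."""
    waiting = deque(requests)
    running = {}   # id -> remaining decode steps
    trace = []
    while waiting or running:
        # Admit waiting requests into freed slots before every step:
        # no waiting for a full batch, so admission is immediate.
        while waiting and len(running) < max_batch:
            rid, steps = waiting.popleft()
            running[rid] = steps
        trace.append(sorted(running))
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot freed; reusable on the next step
    return trace
```

With `max_batch=2` and requests A (1 step), B (3), C (2), the slot A frees after step 1 is occupied by C at step 2, which is exactly the "slots freed immediately" property in the table.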

⬡ TTFT Budget Breakdown

TTFT ≈ queue_wait + prefill_time
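Under the common assumption that prefill is compute-bound and roughly linear in prompt length, the budget can be worked through numerically (the 20,000 tok/s prefill throughput below is illustrative, not a measured figure):

```python
def ttft_ms(queue_wait_ms, prompt_tokens, prefill_tok_per_s):
    """TTFT ~= queue_wait + prefill_time, with
    prefill_time ~= prompt_tokens / prefill_throughput (compute-bound prefill)."""
    prefill_ms = prompt_tokens / prefill_tok_per_s * 1000.0
    return queue_wait_ms + prefill_ms

# Illustrative budget for a 200 ms TTFT target:
# a 2,000-token prompt at 20,000 prefill tok/s costs 100 ms of prefill,
# leaving at most 100 ms of queue wait before the SLO is blown.
```

This is why the scheduler techniques above (chunked prefill, prefill/decode split) matter: long prompts can consume the entire TTFT budget on prefill alone unless their cost is bounded per step.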
Q3 — Monitoring, Alerting & High Availability

⬡ Key Metrics

All metrics are segmented per-model and per-GPU

⬡ Instrumentation Stack

⬡ High Availability

⬡ Alert Strategy — SLO Burn-Rate Based

🔴 Page (Immediate)

  • GPU Xid errors (hardware fault)
  • OOM kills on inference nodes
  • Serving process crash
  • TTFT P99 > 2× SLO target

🟡 Warning (Ticket)

  • Sustained high queue depth
  • GPU util < 30% (waste) or > 95% (saturation)
  • Memory fragmentation trending up
  • Error budget burning at 10× the sustainable rate
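Burn rate here is the standard SRE definition: observed error rate divided by the error budget the SLO allows. A minimal sketch (function names are illustrative):

```python
def burn_rate(error_rate, slo_target=0.999):
    """SLO burn rate: 1.0 means the budget lasts exactly the SLO window;
    10x means the window's budget is gone in one tenth of the window."""
    budget = 1.0 - slo_target   # allowed error fraction, e.g. 0.1% for 99.9%
    return error_rate / budget

def budget_exhausted_hours(rate, window_hours=720.0):
    """Hours until a 30-day error budget is exhausted at the current burn rate."""
    return window_hours / rate
```

For example, a 1% error rate against a 99.9% SLO is a 10× burn: the 30-day budget disappears in about 72 hours, which is why 10× is a ticket-level threshold rather than an immediate page.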

🔵 Anomaly Detection

  • Statistical deviation in tokens/sec per model
  • Catches silent model corruption
  • Degraded GPU without hard faults
  • Continuous synthetic probe regression
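A simple z-score check captures this kind of detection. A sketch assuming per-model tokens/sec samples are available (the names and the 3σ threshold are illustrative):

```python
from statistics import mean, stdev

def tokens_per_sec_anomaly(history, current, z_threshold=3.0):
    """Flag a statistically unusual tokens/sec reading for one model.
    Catches silent regressions (degraded GPU, corrupted weights) that
    raise no hard fault and trip no static threshold."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    z = abs(current - mu) / sigma
    return z > z_threshold
```

Keeping a separate baseline per model matters: each model has its own normal throughput, so a shared static threshold would either miss regressions on fast models or page constantly on slow ones.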