| Layer | Tool | Role | Key Feature |
|---|---|---|---|
| Orchestration | Triton | Fleet management — model versioning, A/B routing, multi-model mux, health checks | Model-agnostic, serves LLMs + vision + ensembles |
| Serving | vLLM | Request scheduling — continuous batching, KV-cache management | PagedAttention — virtual memory for KV-cache |
| Engine | TensorRT-LLM | Optimized execution — kernel fusion, quantization, custom CUDA | FP8/INT4 quantization, NVIDIA-specific optimization |
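The key feature in the serving row, PagedAttention, can be illustrated with a toy allocator. This is a minimal sketch of the idea only, not vLLM's actual API: the KV-cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so no padding is reserved and a finished sequence's memory is reclaimed at once. All names (`PagedKVCache`, `append_token`, `BLOCK_SIZE`) are invented for this sketch.

```python
BLOCK_SIZE = 16  # tokens per physical KV-cache block (illustrative)

class PagedKVCache:
    """Toy block allocator mimicking paged KV-cache bookkeeping."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        # seq_id -> list of physical block ids (the sequence's block table)
        self.block_tables: dict[int, list[int]] = {}

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        """Allocate a new block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:  # current blocks are full
            if not self.free_blocks:
                raise MemoryError("KV-cache exhausted; scheduler must preempt")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for t in range(20):  # a 20-token sequence needs ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0, num_tokens_so_far=t)
print(len(cache.block_tables[0]))  # 2 blocks, ~12 unused slots max
cache.free(0)
print(len(cache.free_blocks))      # all 8 blocks free again
```

The design point is that internal fragmentation is bounded by one partially filled block per sequence, instead of padding every sequence to the batch maximum.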

| Aspect | Static | Dynamic / Continuous |
|---|---|---|
| Mechanism | Wait for N reqs, pad to max len | Join batch each decode step |
| GPU Waste | High — padding tokens | Minimal — no padding |
| TTFT | Bad — head-of-line blocking | Good — immediate admission |
| Throughput | Moderate | High — slots freed immediately |
| Verdict | Continuous batching is required once request lengths are heterogeneous and TTFT targets are strict | |
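The "join batch each decode step" mechanism above can be sketched as a scheduling loop. This is an illustrative simulation, not any engine's real scheduler; `Request`, `run_continuous`, and `max_batch` are invented names. It shows the two properties from the table: waiting requests are admitted the moment a slot exists (good TTFT), and finished requests free their slot at that same step (high throughput).

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps until this request finishes

def run_continuous(requests, max_batch: int):
    """Return {rid: step at which the first token was emitted} (a TTFT proxy)."""
    waiting = deque(requests)
    running: list[Request] = []
    first_token_step: dict[int, int] = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots *before* this decode step.
        while waiting and len(running) < max_batch:
            req = waiting.popleft()
            running.append(req)
            first_token_step[req.rid] = step
        # One decode step for the whole batch; finished requests leave at once.
        for req in running:
            req.tokens_left -= 1
        running = [r for r in running if r.tokens_left > 0]
        step += 1
    return first_token_step

# Two long requests and one short one, batch size 2: the short request's slot
# frees up at step 2, so request 2 is admitted without waiting for the long one.
reqs = [Request(0, tokens_left=8), Request(1, tokens_left=2), Request(2, tokens_left=4)]
ttft = run_continuous(reqs, max_batch=2)
print(ttft)  # {0: 0, 1: 0, 2: 2}
```

Under static batching the same request 2 would wait for the whole first batch to drain (8 steps, the longest member), which is exactly the head-of-line blocking the table flags.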