Client ──> API Gateway ──> Auth + Rate Limiter ──> Model Router
                                                        │
                           ┌────────────────────────────┤
                           │                            │
                    ┌──────▼──────┐              ┌──────▼──────┐
                    │ Prefill Pool│              │ Decode Pool │
                    │  (Compute)  │──KV Cache──> │ (Memory BW) │
                    └─────────────┘              └──────┬──────┘
                                                        │
                                     Response Streamer ──> Client
| Dimension   | Target                             |
|-------------|------------------------------------|
| Concurrency | 10K concurrent requests            |
| Throughput  | ~50K tokens/sec per GPU (decode)   |
| Latency SLA | TTFT < 200ms, TPS > 30 tok/s       |
| GPU Fleet   | ~200 H100s for multi-model serving |
Token bucket per user/org, with separate limits for requests/min and tokens/min. Priority queues: paid > free, short > long. Backpressure: return 429 with a Retry-After header. Circuit breaker: if a pool is unhealthy, fail fast instead of queueing.
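A minimal sketch of the dual-budget admission check above: one bucket for requests/min and one for tokens/min, both refilled continuously. The class name, rates, and return shape are illustrative, not from any particular serving stack.

```python
import time

class DualTokenBucket:
    """Admission-control sketch: a request is admitted only if BOTH the
    request budget and the token budget have room. Per-user/org state;
    in production this would live in a shared store, not process memory."""

    def __init__(self, req_per_min: float, tok_per_min: float):
        self.buckets = {
            "requests": {"rate": req_per_min / 60, "cap": req_per_min, "level": req_per_min},
            "tokens":   {"rate": tok_per_min / 60, "cap": tok_per_min, "level": tok_per_min},
        }
        self.last = time.monotonic()

    def _refill(self):
        # Top up both buckets proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        elapsed, self.last = now - self.last, now
        for b in self.buckets.values():
            b["level"] = min(b["cap"], b["level"] + b["rate"] * elapsed)

    def try_admit(self, est_tokens: int):
        """Returns (admitted, limit_hit, retry_after_seconds)."""
        self._refill()
        req, tok = self.buckets["requests"], self.buckets["tokens"]
        if req["level"] < 1:
            return False, "requests", (1 - req["level"]) / req["rate"]
        if tok["level"] < est_tokens:
            return False, "tokens", (est_tokens - tok["level"]) / tok["rate"]
        req["level"] -= 1
        tok["level"] -= est_tokens
        return True, None, 0.0
```

The `retry_after` value maps directly onto the Retry-After header mentioned above, so rejected clients know how long to back off.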
Developer ──> Agent Definition (YAML/Code) ──> Agent Registry
                                                    │
User Request ──> Agent Runtime ──────────────> Orchestrator
                       │                            │
               ┌───────┼───────┐            ┌───────┼───────┐
               │       │       │            │       │       │
           Tool Exec  LLM    Memory      Agent A Agent B Agent C
          (sandbox) Gateway  Store       (parallel / sequential)
               │       │       │
          Guardrails   │    Vector DB
               │       │    + Redis
               └───────┼───────┘
                       │
                Trace Collector ──> Observability Dashboard
Job Submission API ──> Priority Queue ──> Scheduler
                                              │
                     ┌────────────────────────┤
                     │                        │
             Resource Manager         Topology Manager
             (GPU inventory,          (NVLink domains,
              utilization)             rack layout)
                     │                        │
                     └──────────┬─────────────┘
                                │
                      Node Agents (per host)
                        ┌───────┼───────┐
                        │       │       │
                     GPU 0-7 GPU 0-7 GPU 0-7
                                │
                 Health Monitor ──> Checkpoint Store
                 (heartbeats,       (distributed FS)
                  failure detect)
GPU stranding problem: 7 of 8 GPUs free on a node but can't schedule an 8-GPU job. Solutions: defragmentation (migrate small jobs), reservation (hold full nodes for large jobs), backfill scheduling (fill gaps with small jobs that fit). Track fragmentation ratio as a key metric.
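The backfill idea can be sketched as a greedy best-fit pass: place queued single-node jobs into the free-GPU gaps on partially used nodes, preferring the smallest gap that fits so full nodes stay available for 8-GPU jobs. Data structures and the fragmentation metric are illustrative; a real scheduler would also respect head-of-queue reservations.

```python
def backfill(free_gpus_per_node, queue):
    """Greedy backfill sketch. `queue` is a list of (job_id, gpus_needed);
    returns the placements plus the remaining free GPUs per node."""
    placements = {}
    free = list(free_gpus_per_node)
    for job_id, need in queue:
        # Best-fit: the smallest gap that still fits, so large
        # contiguous gaps are preserved for big jobs.
        candidates = [(gpus, i) for i, gpus in enumerate(free) if gpus >= need]
        if not candidates:
            continue                      # job stays queued
        _, node = min(candidates)
        free[node] -= need
        placements[job_id] = node
    return placements, free

def fragmentation_ratio(free_gpus_per_node):
    """Fraction of free GPUs stranded on nodes that cannot host a full
    8-GPU job — the key metric called out above."""
    total = sum(free_gpus_per_node)
    stranded = sum(g for g in free_gpus_per_node if g < 8)
    return stranded / total if total else 0.0
```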
Documents ──> Ingestion ──> Chunking ──> Embedding Model ──> Vector DB
                  │                                              │
                Parser                                     Hybrid Index
             (PDF, HTML,                              (Dense + BM25 Sparse)
               images)                                           │
                                                                 │
User Query ──> Query Encoder ──> Retriever ──> Reranker ──> Context
                                                               │
                                                       LLM + Citations
                                                               │
                                                       Grounded Answer
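One common way to combine the dense and sparse rankings before the reranker is Reciprocal Rank Fusion; a minimal sketch, assuming each ranking is a best-first list of doc ids and using the conventional constant k=60:

```python
def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """Hybrid-retrieval sketch: merge a dense (embedding) ranking and a
    sparse (BM25) ranking with Reciprocal Rank Fusion.  Each document
    scores sum(1 / (k + rank)) across the rankings it appears in, so
    agreement between retrievers is rewarded without comparing raw
    scores on incompatible scales."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused top-N would then go to the reranker, which re-scores each (query, chunk) pair with a cross-encoder before assembling the context.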
Raw Data ──> Data Pipeline ──> Tokenizer ──> Sharded Dataset (obj store)
                                                          │
                                                 Distributed Trainer
                                             ┌────────────┼────────────┐
                                      Data Parallel  Tensor Par.  Pipeline Par.
                                         (FSDP)     (within node) (across nodes)
                                             │            │            │
                                             └────────────┼────────────┘
                                                          │
                                               Mixed Precision (BF16)
                                               Gradient Accumulation
                                                          │
                                         ┌────────────────┼────────────────┐
                                  Checkpoint Mgr                  Experiment Tracker
                                (async, distributed)              (loss, LR, metrics)
                                         │                                 │
                                  Model Registry ◄──────────────── Evaluation Suite
Full checkpoint: weights + optimizer states + LR scheduler. ~3x model size (Adam stores m, v per param). Frequency tradeoff: every 1000 steps = ~30 min of I/O overhead per day vs. risk of losing work. Async checkpointing: write to distributed storage in background, don't block training. Elastic training: PyTorch Elastic / DeepSpeed can shrink/grow worker count without full restart.
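The async pattern reduces to: snapshot the state on the training thread (so later optimizer steps can't corrupt it), then hand the write to a background thread. A minimal sketch, where `save_fn` stands in for a distributed-FS writer; real frameworks (e.g. PyTorch's distributed checkpointing) additionally handle sharding and cross-rank consistency:

```python
import copy
import threading

def async_checkpoint(model_state, save_fn):
    """Async-checkpointing sketch: copy the state first, then write it
    out in a background thread so the training loop isn't blocked on
    I/O.  Caller should join the returned thread before taking the
    next snapshot to bound in-flight writes."""
    snapshot = copy.deepcopy(model_state)   # must happen before weights mutate
    t = threading.Thread(target=save_fn, args=(snapshot,), daemon=True)
    t.start()
    return t
```

The key correctness property: mutations to `model_state` after the call must not leak into the saved snapshot, which is why the copy happens synchronously and only the I/O is deferred.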
Content Upload ──> Transcode DAG ──> Blob Storage (S3)
               (multiple resolutions,         │
                codecs: H.264, H.265, AV1)    │
                                              │
                                      CDN Edge Servers
                                    (geographic distrib.)
                                              │
Client ──> API Gateway ──> Content Service    │
                                  │           │
                        User Profile DB   Adaptive Bitrate (HLS/DASH)
                        Watch History     Client picks quality segment-by-segment
                                  │
             Recommendation Engine (collaborative + content-based ML)
| Dimension  | Estimate                                   |
|------------|--------------------------------------------|
| Read/Write | 1000:1 ratio (reads dominate)              |
| Bandwidth  | ~100 Tbps at peak globally                 |
| Storage    | ~100 PB (millions of titles x resolutions) |
| CDN        | Thousands of edge POPs worldwide           |
Create:    Client ──> API ──> ID Generator ──> DB (Cassandra/DynamoDB)
                                   │
                            Base62(hash) or
                           pre-generated pool

Redirect:  Client ──> LB ──> Cache (Redis) ──hit──> 302 Redirect
                                   │
                                 miss ──> DB lookup ──> cache + redirect

Analytics: Redirect ──> Kafka ──> Aggregation ──> Analytics DB
           (async, don't block redirect)
| Dimension | Estimate                                   |
|-----------|--------------------------------------------|
| Writes    | 100M/month = ~40/sec                       |
| Reads     | 100:1 = ~4,000 redirects/sec               |
| Storage   | 100M x 1KB = 100GB/month, 6TB over 5 years |
| ID space  | 7 chars Base62 = 3.5 trillion unique URLs  |
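The Base62 step is a straightforward change of base; a sketch, assuming the input is a counter or truncated hash already mapped to a non-negative integer:

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters, no padding needed.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62(n: int) -> str:
    """Encode a non-negative integer as a short Base62 code.  7 chars
    cover 62**7 ≈ 3.5 trillion ids, matching the estimate above."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))
```

With a monotonically increasing counter the codes grow predictably (a privacy trade-off), which is why the notes also mention a pre-generated pool of randomized codes as an alternative.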
Client ──WebSocket──> WS Gateway ──> Message Service ──> User Inbox (DB)
                          │                 │                   │
                   Connection Mgr      Sequence Gen      Push Notification
                    (which server   (per-conversation      (for offline
                     has user X?)       ordering)             users)
                          │
    Presence Service (online/offline, last seen)

Media: Client ──> Upload ──> Blob Store ──> CDN
       (E2E encrypted on client before upload)
Group: Fan-out on write: message copied to each member's inbox
Post Tweet ──> Write Service ──> Post Storage (DB)
                     │
              Fan-out Service
          ┌──────────┴──────────┐
    Normal users:          Celebrities:
    PUSH to                PULL on read
    follower               (don't fan-out
    timelines              100M writes)
                     │
      Timeline Cache (Redis sorted sets, per-user)
                     │
Client ──> Feed Service ──> Merge (push + pull) ──> ML Ranker ──> Feed
                                                        │
                                              Recency + engagement
                                              + user affinity scores
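The merge step above is a k-way merge of already-sorted streams: the pushed timeline from the cache plus one pulled list per followed celebrity, all newest-first. A sketch with illustrative (timestamp, post_id) pairs; the ML ranker would re-order the merged result afterwards:

```python
import heapq
import itertools

def merge_timeline(pushed, pulled_lists, limit=50):
    """Feed-merge sketch for the hybrid fan-out: `pushed` is the
    precomputed per-user timeline, `pulled_lists` are celebrity
    timelines fetched at read time.  All inputs must be sorted
    newest-first; heapq.merge then yields a single newest-first
    stream without materializing everything."""
    merged = heapq.merge(pushed, *pulled_lists,
                         key=lambda post: post[0], reverse=True)
    return list(itertools.islice(merged, limit))
```

Using a lazy merge plus `islice` matters here: a user may follow many celebrities, but only the top of the feed is needed per request.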
Request ──> API Gateway ──> Rate Limiter Middleware
                                       │
                                 Rules Engine
                              (per user, per IP,
                               per endpoint, global)
                                       │
                             Counter Store (Redis)
                               ┌───────┼───────┐
                             Token  Sliding  Fixed
                            Bucket  Window  Window
                                       │
                                ┌──────┴──────┐
                              ALLOW        REJECT
                                │             │
                           API Service   429 + Retry-After
                                │        + X-RateLimit-Remaining
                            Response
Always return X-RateLimit-Limit (max requests), X-RateLimit-Remaining (requests left), and X-RateLimit-Reset (when the window resets); on a 429, also include a Retry-After header. These headers let clients implement backoff without guessing.
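A sliding-window-log sketch for a single client that also builds those headers. In production the timestamp log would live in Redis (e.g. one sorted set per client) rather than process memory; names and the `now` parameter (injected for testability) are illustrative:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window-log sketch: keep timestamps of accepted requests
    in the last `window` seconds and emit the rate-limit headers
    described above on every decision."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit, self.window = limit, window
        self.log = deque()                    # accepted-request timestamps

    def check(self, now=None):
        now = time.time() if now is None else now
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()                # drop entries outside the window
        allowed = len(self.log) < self.limit
        if allowed:
            self.log.append(now)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - len(self.log)),
            "X-RateLimit-Reset": str(int(self.log[0] + self.window)) if self.log else str(int(now)),
        }
        if not allowed:
            # Oldest entry expiring frees one slot; round up to be safe.
            headers["Retry-After"] = str(int(self.log[0] + self.window - now) + 1)
        return allowed, headers
```

The log variant is exact but costs memory per request; the sliding-window *counter* approximation (two fixed-window counts, weighted) is the usual production compromise.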