System Design Interview Prep

10 Design Questions

AI Infrastructure + Classic Systems
01
AI Infrastructure
Design an LLM Inference Serving System

Requirements

  • Serve multiple LLMs (GPT-4, Llama, Mistral) to thousands of concurrent users
  • Optimize for latency (TTFT < 200ms) and throughput
  • Handle variable-length inputs/outputs with streaming responses
  • Cost-efficient GPU utilization at scale
Architecture
Client ──> API Gateway ──> Auth + Rate Limiter ──> Model Router
                                                       │
                          ┌────────────────────────────┤
                          │                            │
                    ┌──────▼──────┐             ┌──────▼──────┐
                    │ Prefill Pool│             │ Decode Pool │
                    │ (Compute)   │──KV Cache──>│ (Memory BW) │
                    └─────────────┘             └──────┬──────┘
                                                      │
                                               Response Streamer ──> Client
Capacity Estimation
Concurrency: 10K concurrent requests
Throughput: ~50K tokens/sec per GPU (decode)
Latency SLA: TTFT < 200ms, TPS > 30 tok/s
GPU fleet: ~200 H100s for multi-model serving
Key Design Decisions
  • Continuous batching: iteration-level scheduling eliminates head-of-line blocking
  • PagedAttention: KV cache in fixed blocks like OS virtual memory, near-zero fragmentation
  • Disaggregated serving: separate prefill (compute-bound) from decode (memory-bound) nodes
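The PagedAttention idea above can be sketched as a toy block allocator. This is illustrative only (class name, block size, and eviction policy are assumptions, not vLLM's actual API): the KV cache is carved into fixed-size blocks, and each sequence holds a list of non-contiguous block IDs, so freed blocks are reusable with near-zero fragmentation.

```python
class KVBlockAllocator:
    """Toy PagedAttention-style allocator: fixed-size KV cache blocks,
    a free list, and a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of block IDs
        self.tables = {}                      # seq_id -> [block IDs]

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceil division

    def allocate(self, seq_id, num_tokens):
        n = self.blocks_needed(num_tokens)
        if n > len(self.free):
            return False                      # caller must queue or preempt
        self.tables[seq_id] = [self.free.pop() for _ in range(n)]
        return True

    def free_seq(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))  # blocks instantly reusable

alloc = KVBlockAllocator(num_blocks=8, block_size=16)
alloc.allocate("req-1", 40)    # 40 tokens -> 3 blocks
alloc.allocate("req-2", 100)   # needs 7 blocks, only 5 free -> rejected
```

A real server would grow a sequence's table block-by-block during decode; the all-at-once allocation here keeps the sketch short.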
Routing & Scheduling
  • Prefix caching: route same system prompts to same GPU (reuse KV cache)
  • Session affinity: multi-turn conversations to same server
  • Load balancing: least-connections weighted by queue depth
  • Autoscaling: scale on queue depth, NOT CPU utilization
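The least-connections-weighted-by-queue-depth rule above can be sketched in a few lines. The 2x weight on queue depth is an assumed tuning knob, not a standard constant:

```python
def pick_server(servers):
    """Least-connections weighted by queue depth: score each server as
    active connections plus weighted pending queue depth, pick the lowest.
    `servers` maps server name -> (active_conns, queue_depth)."""
    return min(servers, key=lambda s: servers[s][0] + 2 * servers[s][1])

servers = {"gpu-a": (10, 0), "gpu-b": (4, 5), "gpu-c": (6, 1)}
print(pick_server(servers))  # gpu-c: score 8 beats 10 and 14
```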
Observability
  • TTFT (Time to First Token) — measures prefill speed
  • TPS (Tokens Per Second) — decode throughput
  • Queue depth per model pool
  • GPU utilization, HBM usage, KV cache hit rate
  • P99 end-to-end latency, error rate by model
Rate Limiting

Token bucket per user/org — separate limits for requests/min AND tokens/min. Priority queues: paid > free, short > long. Backpressure: 429 with Retry-After header. Circuit breaker: if pool unhealthy, fail fast.

Red Flags to Avoid

  • No warm pool strategy for cold start GPU spin-up
  • Static batching instead of continuous batching
  • Ignoring the memory-bound nature of decode phase
  • No prefix caching or session affinity
02
AI Infrastructure
Design a Real-Time AI Agent Platform

Requirements

  • Platform for building, deploying, and managing AI agents
  • Agents use tools, call APIs, maintain memory across turns
  • Support agent-to-agent orchestration (sequential, parallel, supervisor)
  • Enterprise-grade reliability, safety guardrails, and observability
Architecture
Developer ──> Agent Definition (YAML/Code) ──> Agent Registry
                                                    │
User Request ──> Agent Runtime ──────────────> Orchestrator
                      │                            │
              ┌───────┼───────┐            ┌───────┼───────┐
              │       │       │            │       │       │
          Tool Exec  LLM   Memory     Agent A  Agent B  Agent C
          (sandbox) Gateway  Store     (parallel / sequential)
              │       │       │
          Guardrails  │    Vector DB
              │       │    + Redis
              └───────┼───────┘
                      │
              Trace Collector ──> Observability Dashboard
Agent Runtime
  • Event loop: receive input → plan → execute tool → observe → repeat
  • Tool execution: sandboxed containers, timeouts (30s default), retries with backoff
  • Idempotency: tool calls tagged with unique IDs to prevent duplicate side effects
  • Budget enforcement: token limits per execution, max tool calls per turn
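The runtime loop above can be sketched as follows, with a minimal plan/tool interface. Everything here is an assumption for illustration: `plan_fn` stands in for an LLM planning call, the flat 50-token charge stands in for real token accounting, and the UUID is the idempotency key tagged onto each tool call.

```python
import uuid

def run_agent(plan_fn, tools, max_tool_calls=5, token_budget=4000):
    """Toy agent event loop: plan -> execute tool -> observe -> repeat,
    with budget enforcement and idempotency keys on tool calls.
    plan_fn(history) returns ("tool", name, args) or ("answer", text)."""
    history, tokens_used, calls = [], 0, 0
    while True:
        action = plan_fn(history)
        tokens_used += 50                    # stand-in for real token accounting
        if action[0] == "answer":
            return action[1]
        if calls >= max_tool_calls or tokens_used > token_budget:
            return "ERROR: budget exceeded"  # hard stop, never run unbounded
        _, name, args = action
        call_id = str(uuid.uuid4())          # idempotency key for side effects
        result = tools[name](call_id, args)
        history.append((name, args, result))
        calls += 1
```

A planner that loops forever hits the `max_tool_calls` guard instead of burning tokens indefinitely, which is exactly the runaway case the Red Flags below warn about.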
Memory Architecture
  • Short-term: conversation history in Redis, sliding window with summarization
  • Long-term: vector DB (embeddings of past interactions, user preferences)
  • Working memory: scratchpad for multi-step reasoning, cleared per task
  • Shared memory: cross-agent state for orchestrated workflows
Orchestration Patterns
  • Sequential: Agent A → Agent B → Agent C (pipeline)
  • Parallel: fan-out to multiple agents, merge results
  • Supervisor: meta-agent delegates and evaluates sub-agent outputs
  • Failure handling: fallback agents, human-in-the-loop escalation
Safety & Guardrails
  • Input/output content filtering (PII, harmful content)
  • Tool permission model: agents can only call whitelisted tools
  • Loop detection: kill agents stuck in infinite tool-call cycles
  • Full execution trace: every LLM call, tool call, decision logged

Red Flags to Avoid

  • No cost control — agents can burn unlimited tokens
  • No tracing — impossible to debug agent behavior
  • Trusting LLM output without validation before tool execution
  • No loop/runaway detection for autonomous agents
03
AI Infrastructure
Design a GPU Cluster Scheduler

Requirements

  • Schedule training + inference workloads across 10K+ GPUs
  • Maximize utilization while meeting latency SLAs
  • Handle heterogeneous hardware and topology-aware placement
  • Support preemption, priority queues, and fault tolerance
Architecture
Job Submission API ──> Priority Queue ──> Scheduler
                                            │
                   ┌────────────────────────┤
                   │                        │
           Resource Manager          Topology Manager
           (GPU inventory,           (NVLink domains,
            utilization)              rack layout)
                   │                        │
                   └──────────┬─────────────┘
                              │
                       Node Agents (per host)
                       ┌──────┼──────┐
                       │      │      │
                    GPU 0-7  GPU 0-7  GPU 0-7
                              │
                    Health Monitor ──> Checkpoint Store
                    (heartbeats,       (distributed FS)
                     failure detect)
Scheduling Strategy
  • Gang scheduling: all-or-nothing for distributed training (need all 64 GPUs or none)
  • Bin-packing: for inference, maximize GPU utilization per node
  • Topology-aware: TP within NVLink domain (8 GPUs), PP across InfiniBand
  • Preemption: checkpoint the low-priority job, evict it, schedule the high-priority one
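The all-or-nothing rule for gang scheduling can be sketched as a greedy placement that commits only if the full GPU requirement fits (a simplification, assuming whole-node greediness keeps the job topology-compact):

```python
def gang_schedule(job_gpus, nodes):
    """Gang scheduling sketch: place a distributed-training job only if
    its entire GPU requirement fits; otherwise place nothing.
    `nodes` maps node name -> free GPU count."""
    picked, remaining = {}, job_gpus
    for node, free in sorted(nodes.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        picked[node] = take
        remaining -= take
    if remaining > 0:
        return None              # not enough capacity: schedule nothing
    for node, take in picked.items():
        nodes[node] -= take      # commit the full allocation atomically
    return picked

nodes = {"a": 8, "b": 8, "c": 4}
print(gang_schedule(16, nodes))  # fits: {'a': 8, 'b': 8}
print(gang_schedule(10, {"x": 4, "y": 4}))  # 10 > 8 free: None, nothing held
```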
Fault Tolerance
  • MTBF at scale: with 10K GPUs, component MTBF implies node failures every day
  • Async checkpointing: save to distributed FS without blocking training
  • Elastic training: shrink/expand GPU count without restart
  • Health monitoring: heartbeats, GPU temp, ECC errors, NVLink status
Anti-Fragmentation

GPU stranding problem: 7 of 8 GPUs free on a node but can't schedule an 8-GPU job. Solutions: defragmentation (migrate small jobs), reservation (hold full nodes for large jobs), backfill scheduling (fill gaps with small jobs that fit). Track fragmentation ratio as a key metric.
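One way to define the fragmentation ratio mentioned above (this particular formula is an assumption, not a standard): the fraction of free GPUs stranded on nodes too fragmented to host a gang-scheduled job.

```python
def fragmentation_ratio(free_per_node, gang_size=8):
    """Fraction of free GPUs that cannot serve a gang_size job because
    they sit on partially occupied nodes."""
    total_free = sum(free_per_node)
    usable = sum(f for f in free_per_node if f >= gang_size)
    return 0.0 if total_free == 0 else 1 - usable / total_free

# 7 GPUs free on each of two nodes: 14 free GPUs, none usable for 8-GPU gangs
print(fragmentation_ratio([7, 7, 0, 0]))  # 1.0 (fully stranded)
print(fragmentation_ratio([8, 7, 0, 0]))  # 8 of 15 usable -> ~0.47
```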

Red Flags to Avoid

  • Treating all GPUs as identical — topology and interconnect matter
  • No preemption strategy — high-priority jobs wait forever
  • Queue starvation — small jobs blocked by large gang-scheduled jobs
04
AI Infrastructure
Design a RAG Pipeline at Scale

Requirements

  • Enterprise knowledge base: millions of documents, multi-modal
  • Low-latency retrieval (< 100ms) with high relevance
  • Real-time indexing, incremental updates, access control
  • Grounded answers with citations and confidence scores
Architecture
Documents ──> Ingestion ──> Chunking ──> Embedding Model ──> Vector DB
    │             │                                             │
    │         Parser                                     Hybrid Index
    │     (PDF, HTML,                                (Dense + BM25 Sparse)
    │      images)                                          │
    │                                                       │
User Query ──> Query Encoder ──> Retriever ──> Reranker ──> Context
                                                              │
                                                        LLM + Citations
                                                              │
                                                        Grounded Answer
Chunking Strategies
  • Fixed-size: 512 tokens with 50-token overlap. Simple, fast.
  • Semantic: split on topic boundaries using embeddings. Better retrieval.
  • Structure-aware: respect document hierarchy (headers, sections, tables).
  • Parent-child: retrieve small chunks, return parent context to LLM.
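The first strategy (fixed-size with overlap) is simple enough to sketch directly; it operates on a pre-tokenized list, and the stride is chunk size minus overlap:

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Fixed-size chunking with overlap: consecutive chunks share
    `overlap` tokens so no boundary-spanning context is lost entirely."""
    stride = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), stride)]

chunks = chunk_fixed(list(range(1200)))
print(len(chunks))             # 3 chunks (starts at 0, 462, 924)
print(chunks[1][:3])           # [462, 463, 464] -- overlaps tail of chunk 0
```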
Hybrid Search
  • Dense: embedding similarity (semantic meaning)
  • Sparse: BM25/keyword matching (exact terms, names, codes)
  • Fusion: Reciprocal Rank Fusion (RRF) to merge result lists
  • Reranker: cross-encoder scores query-doc pairs for final precision
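Reciprocal Rank Fusion is small enough to show in full: each document scores 1/(k + rank) per result list, summed across lists, with k=60 as the commonly used constant from the original RRF paper.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
    then sort documents by fused score descending."""
    scores = {}
    for ranking in rankings:                      # e.g. [dense_ids, bm25_ids]
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]     # embedding-similarity order
sparse = ["d1", "d9", "d3"]    # BM25 order
print(rrf([dense, sparse]))    # ['d1', 'd3', 'd9', 'd7']
```

Documents that appear high in both lists (d1, d3) dominate documents that appear in only one, which is exactly why RRF is a robust default before the cross-encoder rerank.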
Access Control
  • Per-document ACLs stored alongside embeddings
  • Filter applied at retrieval time (pre-filter in vector DB)
  • Tenant isolation: separate index partitions per org
  • Audit trail: log every query + retrieved documents
Evaluation
  • Retrieval: Recall@k, MRR (Mean Reciprocal Rank)
  • Answer quality: faithfulness (is it grounded?), relevance
  • Groundedness: can every claim be traced to a source chunk?
  • Freshness: avg time from document update to index update

Red Flags to Avoid

  • No access control — the #1 enterprise concern
  • Dense-only search — misses exact keyword matches
  • No hallucination mitigation (citations, confidence scores)
  • Ignoring cold start for newly ingested documents
05
AI Infrastructure
Design a Model Training Pipeline

Requirements

  • Train 70B+ parameter LLMs across GPU clusters
  • 3D parallelism: data, tensor, pipeline
  • Survive node failures without losing days of progress
  • TB-scale data pipeline, experiment tracking, model versioning
Architecture
Raw Data ──> Data Pipeline ──> Tokenizer ──> Sharded Dataset (obj store)
                                                     │
                                              Distributed Trainer
                                        ┌────────────┼────────────┐
                                   Data Parallel  Tensor Par.  Pipeline Par.
                                     (FSDP)     (within node) (across nodes)
                                        │            │            │
                                        └────────────┼────────────┘
                                                     │
                                              Mixed Precision (BF16)
                                              Gradient Accumulation
                                                     │
                                    ┌────────────────┼────────────────┐
                                Checkpoint Mgr          Experiment Tracker
                                (async, distributed)    (loss, LR, metrics)
                                    │                         │
                                Model Registry ◄──────── Evaluation Suite
3D Parallelism
  • Data Parallel (FSDP): shard optimizer states across ranks, all-reduce gradients
  • Tensor Parallel: split attention heads + FFN within NVLink domain (8 GPUs)
  • Pipeline Parallel: layers across nodes via InfiniBand, micro-batching hides bubbles
  • Expert Parallel: MoE routing, each expert on different GPUs
Memory Optimization
  • Mixed precision: BF16 for compute (no loss scaling), FP32 master weights
  • Activation checkpointing: recompute activations in backward pass. 50% less memory, 30% more compute
  • Gradient accumulation: K micro-batches before optimizer step. Larger effective batch without more memory
  • Offloading: optimizer states to CPU RAM when GPU memory is tight
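The gradient-accumulation pattern can be sketched framework-free; `grad_fn` and `step_fn` are stand-ins for `loss.backward()` and `optimizer.step()` in a real trainer:

```python
def train_with_accumulation(micro_batches, accum_steps, grad_fn, step_fn):
    """Accumulate gradients over K micro-batches, then take one optimizer
    step with the averaged gradient: a larger effective batch with the
    memory footprint of a single micro-batch."""
    acc, steps = 0.0, 0
    for i, mb in enumerate(micro_batches, start=1):
        acc += grad_fn(mb)                 # accumulate; no optimizer step yet
        if i % accum_steps == 0:
            step_fn(acc / accum_steps)     # one step per effective batch
            acc, steps = 0.0, steps + 1
    return steps

# 4 micro-batches, accumulate over 2 -> 2 optimizer steps
taken = []
train_with_accumulation([1.0, 3.0, 2.0, 2.0], 2, lambda g: g, taken.append)
print(taken)  # [2.0, 2.0]
```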
Checkpointing & Fault Tolerance

Full checkpoint: weights + optimizer states + LR scheduler. ~3x model size (Adam stores m, v per param). Frequency tradeoff: every 1000 steps = ~30 min of I/O overhead per day vs. risk of losing work. Async checkpointing: write to distributed storage in background, don't block training. Elastic training: PyTorch Elastic / DeepSpeed can shrink/grow worker count without full restart.
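The ~3x figure above comes straight from Adam's bookkeeping: FP32 weights plus the m and v moment buffers is 12 bytes per parameter. A back-of-envelope calculator (ignoring the negligible LR-scheduler state, and assuming plain FP32 Adam without sharded or quantized optimizer states):

```python
def checkpoint_size_gb(params_billions, dtype_bytes=4, optimizer="adam"):
    """Rough full-checkpoint size: weights + optimizer states.
    Adam stores m and v per param, so ~3x the FP32 model size."""
    mult = 3 if optimizer == "adam" else 1
    return params_billions * dtype_bytes * mult  # B params * bytes -> GB

print(checkpoint_size_gb(70))  # 70B params -> 840 GB per full checkpoint
```

At ~840 GB per checkpoint, the frequency/IO tradeoff above becomes concrete: this is why async writes to distributed storage matter.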

Red Flags to Avoid

  • Forgetting the data pipeline — often the real bottleneck
  • No topology-aware placement — InfiniBand for all-reduce is critical
  • Synchronous checkpointing blocking training for minutes
  • No reproducibility: random seeds, non-deterministic ops
06
Classic
Design Netflix (Video Streaming)

Requirements

  • 200M+ users globally, 10M concurrent streams at peak
  • Adaptive bitrate for varying network conditions
  • Upload/transcode pipeline + content recommendation
  • Video start < 2s, rebuffer ratio < 1%
Architecture
Content Upload ──> Transcode DAG ──> Blob Storage (S3)
                  (multiple resolutions,         │
                   codecs: H.264, H.265, AV1)    │
                                           CDN Edge Servers
                                          (geographic distrib.)
                                                 │
Client ──> API Gateway ──> Content Service       │
               │                                 │
        User Profile DB              Adaptive Bitrate (HLS/DASH)
        Watch History         Client picks quality segment-by-segment
               │
        Recommendation Engine (collaborative + content-based ML)
Capacity
Read/Write: 1000:1 ratio (reads dominate)
Bandwidth: ~100 Tbps at peak globally
Storage: ~100 PB (millions of titles x resolutions)
CDN: Thousands of edge POPs worldwide
Key Components
  • CDN: edge caching, cache warming for new releases, geographic routing
  • Adaptive bitrate: HLS/DASH segments, client picks quality per segment
  • Transcode: DAG of jobs, each title encoded in ~10 resolutions x ~3 codecs
  • DRM: Widevine (Android/Chrome), FairPlay (Apple), PlayReady (Windows)

Red Flags to Avoid

  • Serving video from origin — CDN is non-negotiable
  • No DRM/content protection mentioned
  • Thundering herd on popular new releases — need cache warming
07
Classic
Design a URL Shortener (bit.ly)

Requirements

  • 100M+ URLs/month, redirect latency < 10ms
  • Custom or auto-generated short aliases
  • Click analytics (count, geo, referrer)
  • High availability and durability
Architecture
Create: Client ──> API ──> ID Generator ──> DB (Cassandra/DynamoDB)
                              │
                       Base62(hash) or
                       pre-generated pool

Redirect: Client ──> LB ──> Cache (Redis) ──hit──> 302 Redirect
                                │
                              miss ──> DB lookup ──> cache + redirect

Analytics: Redirect ──> Kafka ──> Aggregation ──> Analytics DB
                  (async, don't block redirect)
Capacity
Writes: 100M/month = ~40/sec
Reads: 100:1 = ~4,000 redirects/sec
Storage: 100M x 1KB = 100GB/month, 6TB over 5 years
ID space: 7 chars Base62 = 3.5 trillion unique URLs
Key Decisions
  • ID generation: Base62 encoding of counter or hash. Avoid sequential (insecure).
  • Caching: Redis, 80/20 rule. LRU eviction.
  • 301 vs 302: 302 (temporary) so redirects always hit server (analytics). 301 = browser caches.
  • DB: NoSQL — simple key-value, no joins, partition by short_url hash.
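Base62 encoding is the standard repeated-divmod conversion; the alphabet order below is a convention (any fixed ordering of 0-9a-zA-Z works):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n):
    """Encode a numeric ID as a Base62 short alias."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

print(base62(125))              # "21" (2*62 + 1)
print(len(base62(62**7 - 1)))   # 7 -- largest 7-char alias, ~3.5 trillion IDs
```

Note this encodes a counter; per the red flag on sequential IDs, production systems randomize the ID (hash, or a shuffled pre-generated pool) before encoding.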

Red Flags to Avoid

  • Sequential IDs — predictable, enumerable, insecure
  • 301 redirects — browser caches bypass your analytics
  • No abuse prevention: rate limiting, spam URL detection
08
Classic
Design a Chat System (WhatsApp)

Requirements

  • 2B users, 100B messages/day, real-time delivery
  • 1:1 and group messaging (up to 1024 members)
  • Delivery guarantees, read receipts, online/offline status
  • End-to-end encryption (Signal Protocol)
Architecture
Client ──WebSocket──> WS Gateway ──> Message Service ──> User Inbox (DB)
                          │                │                    │
                    Connection Mgr    Sequence Gen        Push Notification
                    (which server     (per-conversation     (for offline
                     has user X?)      ordering)             users)
                          │
                    Presence Service (online/offline, last seen)

Media: Client ──> Upload ──> Blob Store ──> CDN
                  (E2E encrypted on client before upload)

Group: Fan-out on write: message copied to each member's inbox
Connection & Delivery
  • WebSocket: persistent connections for real-time, long-poll fallback
  • Connection registry: which gateway server holds user X's connection?
  • Offline queue: per-user inbox, drain on reconnect
  • Idempotency: client-generated message IDs prevent duplicates
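The idempotency bullet reduces to a dedupe check on the client-generated message ID; a minimal in-memory sketch (a real system would keep `seen` in the inbox store with a TTL, not a Python set):

```python
def deliver(inbox, seen, msg_id, payload):
    """Idempotent delivery: a message resent after a flaky connection is
    acked again but stored only once, keyed by the client-generated ID."""
    if msg_id in seen:
        return False           # duplicate retry: ack, but don't re-store
    seen.add(msg_id)
    inbox.append((msg_id, payload))
    return True

inbox, seen = [], set()
deliver(inbox, seen, "client1-42", "hi")
deliver(inbox, seen, "client1-42", "hi")   # client retry: ignored
print(len(inbox))  # 1
```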
Ordering & Groups
  • Ordering: per-conversation sequence numbers (not global)
  • Group fan-out on write: copy message to each member's inbox
  • E2E encryption: Signal Protocol, sender encrypts per-recipient
  • Server never sees plaintext — stores encrypted blobs only

Red Flags to Avoid

  • Polling for messages instead of WebSockets/push
  • Storing decrypted messages server-side
  • Global message ordering (only per-conversation needed)
09
Classic
Design a News Feed (Twitter/X)

Requirements

  • 500M DAU, personalized feed, real-time updates
  • Text, images, video, polls — rich media support
  • Like, retweet, reply interactions
  • Celebrity problem: users with 100M+ followers
Architecture
Post Tweet ──> Write Service ──> Post Storage (DB)
                    │
               Fan-out Service
               ┌────┴────┐
          Normal users:   Celebrities:
          PUSH to         PULL on read
          follower        (don't fan-out
          timelines       100M writes)
               │
          Timeline Cache (Redis sorted sets, per-user)
               │
Client ──> Feed Service ──> Merge (push + pull) ──> ML Ranker ──> Feed
                                                       │
                                                  Recency + engagement
                                                  + user affinity scores
Fan-out Strategy
  • Hybrid approach: push for users with < 100K followers, pull for celebrities
  • Push (fan-out on write): pre-compute timelines at write time
  • Pull (fan-out on read): compute at read time for celebrities
  • Merge: feed service merges pre-computed + on-demand + ranked
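The merge step above can be sketched as follows, assuming each entry is a `(timestamp, post_id)` pair and ranking by recency only (the ML ranker would re-score afterward):

```python
def build_feed(pushed, followed_celebs, celeb_posts, limit=10):
    """Hybrid timeline merge: pre-computed pushed entries plus celebrity
    posts pulled at read time, sorted newest-first."""
    pulled = [p for c in followed_celebs for p in celeb_posts.get(c, [])]
    return sorted(pushed + pulled, reverse=True)[:limit]

pushed = [(100, "p1"), (95, "p2")]             # fan-out-on-write timeline
celeb_posts = {"@star": [(98, "c1"), (90, "c2")]}  # pulled at read time
print(build_feed(pushed, ["@star"], celeb_posts, limit=3))
# [(100, 'p1'), (98, 'c1'), (95, 'p2')]
```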
Ranking & Cache
  • Timeline cache: Redis sorted sets (score = timestamp or rank score)
  • ML ranking: features = recency, engagement, user affinity, content type
  • Cache invalidation: delete/edit = invalidate across all follower caches
  • Media pipeline: async upload → transcode → CDN

Red Flags to Avoid

  • Fan-out for celebrities — 100M writes per tweet is unscalable
  • Forgetting cache invalidation for deleted/edited posts
  • Hot partitions from celebrity tweets creating uneven load
10
Classic
Design a Rate Limiter

Requirements

  • Protect APIs from abuse, per-user/IP/global limits
  • Distributed: work across multiple API servers
  • Low latency overhead (< 1ms per check)
  • Configurable rules and graceful degradation
Architecture
Request ──> API Gateway ──> Rate Limiter Middleware
                                  │
                           Rules Engine
                           (per user, per IP,
                            per endpoint, global)
                                  │
                           Counter Store (Redis)
                           ┌──────┼──────┐
                      Token    Sliding    Fixed
                      Bucket   Window     Window
                                  │
                           ┌──────┴──────┐
                       ALLOW              REJECT
                         │                  │
                    API Service         429 + Retry-After
                         │              + X-RateLimit-Remaining
                      Response
Algorithms
  • Token Bucket: refill N tokens/sec, each request costs 1+ tokens. Allows bursts. Best for APIs.
  • Sliding Window Log: store timestamps of each request, count in window. Precise but memory-heavy.
  • Sliding Window Counter: hybrid — fixed window counts + weighted interpolation. Good balance.
  • Fixed Window: simple counter per time window. Edge case: 2x burst at window boundary.
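The sliding window counter's weighted interpolation is the least obvious of the four, so here it is in isolation: weight the previous fixed window's count by how much of it still overlaps the sliding window, add the current count, and compare to the limit.

```python
def sliding_window_allow(prev_count, curr_count, elapsed_frac, limit):
    """Sliding window counter check. elapsed_frac is how far we are into
    the current fixed window (0.0-1.0); the previous window's count is
    weighted by the remaining overlap (1 - elapsed_frac)."""
    estimated = prev_count * (1 - elapsed_frac) + curr_count
    return estimated < limit

# 40% into the current window, 80 requests last window, 50 so far:
# 80 * 0.6 + 50 = 98 < 100 -> allow
print(sliding_window_allow(80, 50, 0.4, 100))  # True
print(sliding_window_allow(80, 60, 0.4, 100))  # 108 -> False
```

This is the "good balance" above: two counters per key instead of a full timestamp log, at the cost of assuming requests were spread evenly across the previous window.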
Distributed Counting
  • Redis INCR + EXPIRE: atomic increment with TTL. Fast, simple.
  • Race conditions: use MULTI/EXEC or Lua scripts for atomic check-and-increment
  • Clock skew: use Redis server time, not client time
  • Failover: if Redis is down, fail open (allow) or fail closed (reject)?
Response Headers

Always return: X-RateLimit-Limit (max requests), X-RateLimit-Remaining (requests left), X-RateLimit-Reset (when window resets), and on 429: Retry-After header. This lets clients implement backoff without guessing.

Red Flags to Avoid

  • Single-node solution — must work across API fleet
  • No race condition handling in distributed counter
  • Ignoring clock skew between servers
  • Missing rate limit response headers