System Design Interview Prep

10 Design Questions

AI Infrastructure + Classic Systems
01
AI Infrastructure
Design an LLM Inference Serving System

Requirements

  • Serve multiple LLMs (GPT-4, Llama, Mistral) to thousands of concurrent users
  • Optimize for latency (TTFT < 200ms) and throughput
  • Handle variable-length inputs/outputs with streaming responses
  • Cost-efficient GPU utilization at scale
Architecture
Client ──> API Gateway ──> Auth + Rate Limiter ──> Model Router
                                                       │
                          ┌────────────────────────────┤
                          │                            │
                    ┌──────▼──────┐             ┌──────▼──────┐
                    │ Prefill Pool│             │ Decode Pool │
                    │ (Compute)   │──KV Cache──>│ (Memory BW) │
                    └─────────────┘             └──────┬──────┘
                                                      │
                                               Response Streamer ──> Client
Capacity Estimation
Concurrency: 10K concurrent requests
Throughput: ~50K tokens/sec per GPU (decode)
Latency SLA: TTFT < 200ms, TPS > 30 tok/s
GPU fleet: ~200 H100s for multi-model serving
Key Design Decisions
  • Continuous batching: iteration-level scheduling eliminates head-of-line blocking
  • PagedAttention: KV cache in fixed blocks like OS virtual memory, near-zero fragmentation
  • Disaggregated serving: separate prefill (compute-bound) from decode (memory-bound) nodes
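The PagedAttention idea above can be sketched as a toy block allocator. This is illustrative only (class name, block size, and eviction policy are assumptions, not vLLM's actual API): the KV cache is carved into fixed-size blocks, and each sequence holds a list of non-contiguous block IDs, so freed blocks are reusable with near-zero fragmentation.

```python
class KVBlockAllocator:
    """Toy PagedAttention-style allocator: fixed-size KV cache blocks,
    a free list, and a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of block IDs
        self.tables = {}                      # seq_id -> [block IDs]

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceil division

    def allocate(self, seq_id, num_tokens):
        n = self.blocks_needed(num_tokens)
        if n > len(self.free):
            return False                      # caller must queue or preempt
        self.tables[seq_id] = [self.free.pop() for _ in range(n)]
        return True

    def free_seq(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))  # blocks instantly reusable

alloc = KVBlockAllocator(num_blocks=8, block_size=16)
alloc.allocate("req-1", 40)    # 40 tokens -> 3 blocks
alloc.allocate("req-2", 100)   # needs 7 blocks, only 5 free -> rejected
```

A real server would grow a sequence's table block-by-block during decode; the all-at-once allocation here keeps the sketch short.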
Routing & Scheduling
  • Prefix caching: route same system prompts to same GPU (reuse KV cache)
  • Session affinity: multi-turn conversations to same server
  • Load balancing: least-connections weighted by queue depth
  • Autoscaling: scale on queue depth, NOT CPU utilization
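The least-connections-weighted-by-queue-depth rule above can be sketched in a few lines. The 2x weight on queue depth is an assumed tuning knob, not a standard constant:

```python
def pick_server(servers):
    """Least-connections weighted by queue depth: score each server as
    active connections plus weighted pending queue depth, pick the lowest.
    `servers` maps server name -> (active_conns, queue_depth)."""
    return min(servers, key=lambda s: servers[s][0] + 2 * servers[s][1])

servers = {"gpu-a": (10, 0), "gpu-b": (4, 5), "gpu-c": (6, 1)}
print(pick_server(servers))  # gpu-c: score 8 beats 10 and 14
```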
Observability
  • TTFT (Time to First Token) — measures prefill speed
  • TPS (Tokens Per Second) — decode throughput
  • Queue depth per model pool
  • GPU utilization, HBM usage, KV cache hit rate
  • P99 end-to-end latency, error rate by model
Rate Limiting

Token bucket per user/org — separate limits for requests/min AND tokens/min. Priority queues: paid > free, short > long. Backpressure: 429 with Retry-After header. Circuit breaker: if pool unhealthy, fail fast.

Red Flags to Avoid

  • No warm pool strategy for cold start GPU spin-up
  • Static batching instead of continuous batching
  • Ignoring the memory-bound nature of decode phase
  • No prefix caching or session affinity
02
AI Infrastructure
Design a Real-Time AI Agent Platform

Requirements

  • Platform for building, deploying, and managing AI agents
  • Agents use tools, call APIs, maintain memory across turns
  • Support agent-to-agent orchestration (sequential, parallel, supervisor)
  • Enterprise-grade reliability, safety guardrails, and observability
Architecture
Developer ──> Agent Definition (YAML/Code) ──> Agent Registry
                                                    │
User Request ──> Agent Runtime ──────────────> Orchestrator
                      │                            │
              ┌───────┼───────┐            ┌───────┼───────┐
              │       │       │            │       │       │
          Tool Exec  LLM   Memory     Agent A  Agent B  Agent C
          (sandbox) Gateway  Store     (parallel / sequential)
              │       │       │
          Guardrails  │    Vector DB
              │       │    + Redis
              └───────┼───────┘
                      │
              Trace Collector ──> Observability Dashboard
Agent Runtime
  • Event loop: receive input → plan → execute tool → observe → repeat
  • Tool execution: sandboxed containers, timeouts (30s default), retries with backoff
  • Idempotency: tool calls tagged with unique IDs to prevent duplicate side effects
  • Budget enforcement: token limits per execution, max tool calls per turn
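The runtime loop above can be sketched as follows, with a minimal plan/tool interface. Everything here is an assumption for illustration: `plan_fn` stands in for an LLM planning call, the flat 50-token charge stands in for real token accounting, and the UUID is the idempotency key tagged onto each tool call.

```python
import uuid

def run_agent(plan_fn, tools, max_tool_calls=5, token_budget=4000):
    """Toy agent event loop: plan -> execute tool -> observe -> repeat,
    with budget enforcement and idempotency keys on tool calls.
    plan_fn(history) returns ("tool", name, args) or ("answer", text)."""
    history, tokens_used, calls = [], 0, 0
    while True:
        action = plan_fn(history)
        tokens_used += 50                    # stand-in for real token accounting
        if action[0] == "answer":
            return action[1]
        if calls >= max_tool_calls or tokens_used > token_budget:
            return "ERROR: budget exceeded"  # hard stop, never run unbounded
        _, name, args = action
        call_id = str(uuid.uuid4())          # idempotency key for side effects
        result = tools[name](call_id, args)
        history.append((name, args, result))
        calls += 1
```

A planner that loops forever hits the `max_tool_calls` guard instead of burning tokens indefinitely, which is exactly the runaway case the Red Flags below warn about.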
Memory Architecture
  • Short-term: conversation history in Redis, sliding window with summarization
  • Long-term: vector DB (embeddings of past interactions, user preferences)
  • Working memory: scratchpad for multi-step reasoning, cleared per task
  • Shared memory: cross-agent state for orchestrated workflows
Orchestration Patterns
  • Sequential: Agent A → Agent B → Agent C (pipeline)
  • Parallel: fan-out to multiple agents, merge results
  • Supervisor: meta-agent delegates and evaluates sub-agent outputs
  • Failure handling: fallback agents, human-in-the-loop escalation
Safety & Guardrails
  • Input/output content filtering (PII, harmful content)
  • Tool permission model: agents can only call whitelisted tools
  • Loop detection: kill agents stuck in infinite tool-call cycles
  • Full execution trace: every LLM call, tool call, decision logged

Red Flags to Avoid

  • No cost control — agents can burn unlimited tokens
  • No tracing — impossible to debug agent behavior
  • Trusting LLM output without validation before tool execution
  • No loop/runaway detection for autonomous agents
03
AI Infrastructure
Design a GPU Cluster Scheduler

Requirements

  • Schedule training + inference workloads across 10K+ GPUs
  • Maximize utilization while meeting latency SLAs
  • Handle heterogeneous hardware and topology-aware placement
  • Support preemption, priority queues, and fault tolerance
Architecture
Job Submission API ──> Priority Queue ──> Scheduler
                                            │
                   ┌────────────────────────┤
                   │                        │
           Resource Manager          Topology Manager
           (GPU inventory,           (NVLink domains,
            utilization)              rack layout)
                   │                        │
                   └──────────┬─────────────┘
                              │
                       Node Agents (per host)
                       ┌──────┼──────┐
                       │      │      │
                    GPU 0-7  GPU 0-7  GPU 0-7
                              │
                    Health Monitor ──> Checkpoint Store
                    (heartbeats,       (distributed FS)
                     failure detect)
Scheduling Strategy
  • Gang scheduling: all-or-nothing for distributed training (need all 64 GPUs or none)
  • Bin-packing: for inference, maximize GPU utilization per node
  • Topology-aware: TP within NVLink domain (8 GPUs), PP across InfiniBand
  • Preemption: checkpoint the low-priority job, evict it, schedule the high-priority one
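The all-or-nothing rule for gang scheduling can be sketched as a greedy placement that commits only if the full GPU requirement fits (a simplification, assuming whole-node greediness keeps the job topology-compact):

```python
def gang_schedule(job_gpus, nodes):
    """Gang scheduling sketch: place a distributed-training job only if
    its entire GPU requirement fits; otherwise place nothing.
    `nodes` maps node name -> free GPU count."""
    picked, remaining = {}, job_gpus
    for node, free in sorted(nodes.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        picked[node] = take
        remaining -= take
    if remaining > 0:
        return None              # not enough capacity: schedule nothing
    for node, take in picked.items():
        nodes[node] -= take      # commit the full allocation atomically
    return picked

nodes = {"a": 8, "b": 8, "c": 4}
print(gang_schedule(16, nodes))  # fits: {'a': 8, 'b': 8}
print(gang_schedule(10, {"x": 4, "y": 4}))  # 10 > 8 free: None, nothing held
```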
Fault Tolerance
  • MTBF at scale: with 10K GPUs, component MTBF implies node failures every day
  • Async checkpointing: save to distributed FS without blocking training
  • Elastic training: shrink/expand GPU count without restart
  • Health monitoring: heartbeats, GPU temp, ECC errors, NVLink status
Anti-Fragmentation

GPU stranding problem: 7 of 8 GPUs free on a node but can't schedule an 8-GPU job. Solutions: defragmentation (migrate small jobs), reservation (hold full nodes for large jobs), backfill scheduling (fill gaps with small jobs that fit). Track fragmentation ratio as a key metric.
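One way to define the fragmentation ratio mentioned above (this particular formula is an assumption, not a standard): the fraction of free GPUs stranded on nodes too fragmented to host a gang-scheduled job.

```python
def fragmentation_ratio(free_per_node, gang_size=8):
    """Fraction of free GPUs that cannot serve a gang_size job because
    they sit on partially occupied nodes."""
    total_free = sum(free_per_node)
    usable = sum(f for f in free_per_node if f >= gang_size)
    return 0.0 if total_free == 0 else 1 - usable / total_free

# 7 GPUs free on each of two nodes: 14 free GPUs, none usable for 8-GPU gangs
print(fragmentation_ratio([7, 7, 0, 0]))  # 1.0 (fully stranded)
print(fragmentation_ratio([8, 7, 0, 0]))  # 8 of 15 usable -> ~0.47
```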

Red Flags to Avoid

  • Treating all GPUs as identical — topology and interconnect matter
  • No preemption strategy — high-priority jobs wait forever
  • Queue starvation — small jobs blocked by large gang-scheduled jobs
04
AI Infrastructure
Design a RAG Pipeline at Scale

Requirements

  • Enterprise knowledge base: millions of documents, multi-modal
  • Low-latency retrieval (< 100ms) with high relevance
  • Real-time indexing, incremental updates, access control
  • Grounded answers with citations and confidence scores
Architecture
Documents ──> Ingestion ──> Chunking ──> Embedding Model ──> Vector DB
    │             │                                             │
    │         Parser                                     Hybrid Index
    │     (PDF, HTML,                                (Dense + BM25 Sparse)
    │      images)                                          │
    │                                                       │
User Query ──> Query Encoder ──> Retriever ──> Reranker ──> Context
                                                              │
                                                        LLM + Citations
                                                              │
                                                        Grounded Answer
Chunking Strategies
  • Fixed-size: 512 tokens with 50-token overlap. Simple, fast.
  • Semantic: split on topic boundaries using embeddings. Better retrieval.
  • Structure-aware: respect document hierarchy (headers, sections, tables).
  • Parent-child: retrieve small chunks, return parent context to LLM.
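The first strategy (fixed-size with overlap) is simple enough to sketch directly; it operates on a pre-tokenized list, and the stride is chunk size minus overlap:

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Fixed-size chunking with overlap: consecutive chunks share
    `overlap` tokens so no boundary-spanning context is lost entirely."""
    stride = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), stride)]

chunks = chunk_fixed(list(range(1200)))
print(len(chunks))             # 3 chunks (starts at 0, 462, 924)
print(chunks[1][:3])           # [462, 463, 464] -- overlaps tail of chunk 0
```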
Hybrid Search
  • Dense: embedding similarity (semantic meaning)
  • Sparse: BM25/keyword matching (exact terms, names, codes)
  • Fusion: Reciprocal Rank Fusion (RRF) to merge result lists
  • Reranker: cross-encoder scores query-doc pairs for final precision
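Reciprocal Rank Fusion is small enough to show in full: each document scores 1/(k + rank) per result list, summed across lists, with k=60 as the commonly used constant from the original RRF paper.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
    then sort documents by fused score descending."""
    scores = {}
    for ranking in rankings:                      # e.g. [dense_ids, bm25_ids]
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]     # embedding-similarity order
sparse = ["d1", "d9", "d3"]    # BM25 order
print(rrf([dense, sparse]))    # ['d1', 'd3', 'd9', 'd7']
```

Documents that appear high in both lists (d1, d3) dominate documents that appear in only one, which is exactly why RRF is a robust default before the cross-encoder rerank.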
Access Control
  • Per-document ACLs stored alongside embeddings
  • Filter applied at retrieval time (pre-filter in vector DB)
  • Tenant isolation: separate index partitions per org
  • Audit trail: log every query + retrieved documents
Evaluation
  • Retrieval: Recall@k, MRR (Mean Reciprocal Rank)
  • Answer quality: faithfulness (is it grounded?), relevance
  • Groundedness: can every claim be traced to a source chunk?
  • Freshness: avg time from document update to index update

Red Flags to Avoid

  • No access control — the #1 enterprise concern
  • Dense-only search — misses exact keyword matches
  • No hallucination mitigation (citations, confidence scores)
  • Ignoring cold start for newly ingested documents
05
AI Infrastructure
Design a Model Training Pipeline

Requirements

  • Train 70B+ parameter LLMs across GPU clusters
  • 3D parallelism: data, tensor, pipeline
  • Survive node failures without losing days of progress
  • TB-scale data pipeline, experiment tracking, model versioning
Architecture
Raw Data ──> Data Pipeline ──> Tokenizer ──> Sharded Dataset (obj store)
                                                     │
                                              Distributed Trainer
                                        ┌────────────┼────────────┐
                                   Data Parallel  Tensor Par.  Pipeline Par.
                                     (FSDP)     (within node) (across nodes)
                                        │            │            │
                                        └────────────┼────────────┘
                                                     │
                                              Mixed Precision (BF16)
                                              Gradient Accumulation
                                                     │
                                    ┌────────────────┼────────────────┐
                                Checkpoint Mgr          Experiment Tracker
                                (async, distributed)    (loss, LR, metrics)
                                    │                         │
                                Model Registry ◄──────── Evaluation Suite
3D Parallelism
  • Data Parallel (FSDP): shard optimizer states across ranks, all-reduce gradients
  • Tensor Parallel: split attention heads + FFN within NVLink domain (8 GPUs)
  • Pipeline Parallel: layers across nodes via InfiniBand, micro-batching hides bubbles
  • Expert Parallel: MoE routing, each expert on different GPUs
Memory Optimization
  • Mixed precision: BF16 for compute (no loss scaling), FP32 master weights
  • Activation checkpointing: recompute activations in backward pass. 50% less memory, 30% more compute
  • Gradient accumulation: K micro-batches before optimizer step. Larger effective batch without more memory
  • Offloading: optimizer states to CPU RAM when GPU memory is tight
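The gradient-accumulation pattern can be sketched framework-free; `grad_fn` and `step_fn` are stand-ins for `loss.backward()` and `optimizer.step()` in a real trainer:

```python
def train_with_accumulation(micro_batches, accum_steps, grad_fn, step_fn):
    """Accumulate gradients over K micro-batches, then take one optimizer
    step with the averaged gradient: a larger effective batch with the
    memory footprint of a single micro-batch."""
    acc, steps = 0.0, 0
    for i, mb in enumerate(micro_batches, start=1):
        acc += grad_fn(mb)                 # accumulate; no optimizer step yet
        if i % accum_steps == 0:
            step_fn(acc / accum_steps)     # one step per effective batch
            acc, steps = 0.0, steps + 1
    return steps

# 4 micro-batches, accumulate over 2 -> 2 optimizer steps
taken = []
train_with_accumulation([1.0, 3.0, 2.0, 2.0], 2, lambda g: g, taken.append)
print(taken)  # [2.0, 2.0]
```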
Checkpointing & Fault Tolerance

Full checkpoint: weights + optimizer states + LR scheduler. ~3x model size (Adam stores m, v per param). Frequency tradeoff: every 1000 steps = ~30 min of I/O overhead per day vs. risk of losing work. Async checkpointing: write to distributed storage in background, don't block training. Elastic training: PyTorch Elastic / DeepSpeed can shrink/grow worker count without full restart.
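The ~3x figure above comes straight from Adam's bookkeeping: FP32 weights plus the m and v moment buffers is 12 bytes per parameter. A back-of-envelope calculator (ignoring the negligible LR-scheduler state, and assuming plain FP32 Adam without sharded or quantized optimizer states):

```python
def checkpoint_size_gb(params_billions, dtype_bytes=4, optimizer="adam"):
    """Rough full-checkpoint size: weights + optimizer states.
    Adam stores m and v per param, so ~3x the FP32 model size."""
    mult = 3 if optimizer == "adam" else 1
    return params_billions * dtype_bytes * mult  # B params * bytes -> GB

print(checkpoint_size_gb(70))  # 70B params -> 840 GB per full checkpoint
```

At ~840 GB per checkpoint, the frequency/IO tradeoff above becomes concrete: this is why async writes to distributed storage matter.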

Red Flags to Avoid

  • Forgetting the data pipeline — often the real bottleneck
  • No topology-aware placement — InfiniBand for all-reduce is critical
  • Synchronous checkpointing blocking training for minutes
  • No reproducibility: random seeds, non-deterministic ops
06
Classic
Design Netflix (Video Streaming)

Requirements

  • 200M+ users globally, 10M concurrent streams at peak
  • Adaptive bitrate for varying network conditions
  • Upload/transcode pipeline + content recommendation
  • Video start < 2s, rebuffer ratio < 1%
Architecture
Content Upload ──> Transcode DAG ──> Blob Storage (S3)
                  (multiple resolutions,         │
                   codecs: H.264, H.265, AV1)    │
                                           CDN Edge Servers
                                          (geographic distrib.)
                                                 │
Client ──> API Gateway ──> Content Service       │
               │                                 │
        User Profile DB              Adaptive Bitrate (HLS/DASH)
        Watch History         Client picks quality segment-by-segment
               │
        Recommendation Engine (collaborative + content-based ML)
Capacity
Read/Write: 1000:1 ratio (reads dominate)
Bandwidth: ~100 Tbps at peak globally
Storage: ~100 PB (millions of titles x resolutions)
CDN: Thousands of edge POPs worldwide
Key Components
  • CDN: edge caching, cache warming for new releases, geographic routing
  • Adaptive bitrate: HLS/DASH segments, client picks quality per segment
  • Transcode: DAG of jobs, each title encoded in ~10 resolutions x ~3 codecs
  • DRM: Widevine (Android/Chrome), FairPlay (Apple), PlayReady (Windows)

Red Flags to Avoid

  • Serving video from origin — CDN is non-negotiable
  • No DRM/content protection mentioned
  • Thundering herd on popular new releases — need cache warming
07
Classic
Design a URL Shortener (bit.ly)

Requirements

  • 100M+ URLs/month, redirect latency < 10ms
  • Custom or auto-generated short aliases
  • Click analytics (count, geo, referrer)
  • High availability and durability
Architecture
Create: Client ──> API ──> ID Generator ──> DB (Cassandra/DynamoDB)
                              │
                       Base62(hash) or
                       pre-generated pool

Redirect: Client ──> LB ──> Cache (Redis) ──hit──> 302 Redirect
                                │
                              miss ──> DB lookup ──> cache + redirect

Analytics: Redirect ──> Kafka ──> Aggregation ──> Analytics DB
                  (async, don't block redirect)
Capacity
Writes: 100M/month = ~40/sec
Reads: 100:1 = ~4,000 redirects/sec
Storage: 100M x 1KB = 100GB/month, 6TB over 5 years
ID space: 7 chars Base62 = 3.5 trillion unique URLs
Key Decisions
  • ID generation: Base62 encoding of counter or hash. Avoid sequential (insecure).
  • Caching: Redis, 80/20 rule. LRU eviction.
  • 301 vs 302: 302 (temporary) so redirects always hit server (analytics). 301 = browser caches.
  • DB: NoSQL — simple key-value, no joins, partition by short_url hash.
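Base62 encoding is the standard repeated-divmod conversion; the alphabet order below is a convention (any fixed ordering of 0-9a-zA-Z works):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n):
    """Encode a numeric ID as a Base62 short alias."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

print(base62(125))              # "21" (2*62 + 1)
print(len(base62(62**7 - 1)))   # 7 -- largest 7-char alias, ~3.5 trillion IDs
```

Note this encodes a counter; per the red flag on sequential IDs, production systems randomize the ID (hash, or a shuffled pre-generated pool) before encoding.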

Red Flags to Avoid

  • Sequential IDs — predictable, enumerable, insecure
  • 301 redirects — browser caches bypass your analytics
  • No abuse prevention: rate limiting, spam URL detection
08
Classic
Design a Chat System (WhatsApp)

Requirements

  • 2B users, 100B messages/day, real-time delivery
  • 1:1 and group messaging (up to 1024 members)
  • Delivery guarantees, read receipts, online/offline status
  • End-to-end encryption (Signal Protocol)
Architecture
Client ──WebSocket──> WS Gateway ──> Message Service ──> User Inbox (DB)
                          │                │                    │
                    Connection Mgr    Sequence Gen        Push Notification
                    (which server     (per-conversation     (for offline
                     has user X?)      ordering)             users)
                          │
                    Presence Service (online/offline, last seen)

Media: Client ──> Upload ──> Blob Store ──> CDN
                  (E2E encrypted on client before upload)

Group: Fan-out on write: message copied to each member's inbox
Connection & Delivery
  • WebSocket: persistent connections for real-time, long-poll fallback
  • Connection registry: which gateway server holds user X's connection?
  • Offline queue: per-user inbox, drain on reconnect
  • Idempotency: client-generated message IDs prevent duplicates
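The idempotency bullet reduces to a dedupe check on the client-generated message ID; a minimal in-memory sketch (a real system would keep `seen` in the inbox store with a TTL, not a Python set):

```python
def deliver(inbox, seen, msg_id, payload):
    """Idempotent delivery: a message resent after a flaky connection is
    acked again but stored only once, keyed by the client-generated ID."""
    if msg_id in seen:
        return False           # duplicate retry: ack, but don't re-store
    seen.add(msg_id)
    inbox.append((msg_id, payload))
    return True

inbox, seen = [], set()
deliver(inbox, seen, "client1-42", "hi")
deliver(inbox, seen, "client1-42", "hi")   # client retry: ignored
print(len(inbox))  # 1
```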
Ordering & Groups
  • Ordering: per-conversation sequence numbers (not global)
  • Group fan-out on write: copy message to each member's inbox
  • E2E encryption: Signal Protocol, sender encrypts per-recipient
  • Server never sees plaintext — stores encrypted blobs only

Red Flags to Avoid

  • Polling for messages instead of WebSockets/push
  • Storing decrypted messages server-side
  • Global message ordering (only per-conversation needed)
09
Classic
Design a News Feed (Twitter/X)

Requirements

  • 500M DAU, personalized feed, real-time updates
  • Text, images, video, polls — rich media support
  • Like, retweet, reply interactions
  • Celebrity problem: users with 100M+ followers
Architecture
Post Tweet ──> Write Service ──> Post Storage (DB)
                    │
               Fan-out Service
               ┌────┴────┐
          Normal users:   Celebrities:
          PUSH to         PULL on read
          follower        (don't fan-out
          timelines       100M writes)
               │
          Timeline Cache (Redis sorted sets, per-user)
               │
Client ──> Feed Service ──> Merge (push + pull) ──> ML Ranker ──> Feed
                                                       │
                                                  Recency + engagement
                                                  + user affinity scores
Fan-out Strategy
  • Hybrid approach: push for users with < 100K followers, pull for celebrities
  • Push (fan-out on write): pre-compute timelines at write time
  • Pull (fan-out on read): compute at read time for celebrities
  • Merge: feed service merges pre-computed + on-demand + ranked
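The merge step above can be sketched as follows, assuming each entry is a `(timestamp, post_id)` pair and ranking by recency only (the ML ranker would re-score afterward):

```python
def build_feed(pushed, followed_celebs, celeb_posts, limit=10):
    """Hybrid timeline merge: pre-computed pushed entries plus celebrity
    posts pulled at read time, sorted newest-first."""
    pulled = [p for c in followed_celebs for p in celeb_posts.get(c, [])]
    return sorted(pushed + pulled, reverse=True)[:limit]

pushed = [(100, "p1"), (95, "p2")]             # fan-out-on-write timeline
celeb_posts = {"@star": [(98, "c1"), (90, "c2")]}  # pulled at read time
print(build_feed(pushed, ["@star"], celeb_posts, limit=3))
# [(100, 'p1'), (98, 'c1'), (95, 'p2')]
```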
Ranking & Cache
  • Timeline cache: Redis sorted sets (score = timestamp or rank score)
  • ML ranking: features = recency, engagement, user affinity, content type
  • Cache invalidation: delete/edit = invalidate across all follower caches
  • Media pipeline: async upload → transcode → CDN

Red Flags to Avoid

  • Fan-out for celebrities — 100M writes per tweet is unscalable
  • Forgetting cache invalidation for deleted/edited posts
  • Hot partitions from celebrity tweets creating uneven load
10
Classic
Design a Rate Limiter

Requirements

  • Protect APIs from abuse, per-user/IP/global limits
  • Distributed: work across multiple API servers
  • Low latency overhead (< 1ms per check)
  • Configurable rules and graceful degradation
Architecture
Request ──> API Gateway ──> Rate Limiter Middleware
                                  │
                           Rules Engine
                           (per user, per IP,
                            per endpoint, global)
                                  │
                           Counter Store (Redis)
                           ┌──────┼──────┐
                      Token    Sliding    Fixed
                      Bucket   Window     Window
                                  │
                           ┌──────┴──────┐
                       ALLOW              REJECT
                         │                  │
                    API Service         429 + Retry-After
                         │              + X-RateLimit-Remaining
                      Response
Algorithms
  • Token Bucket: refill N tokens/sec, each request costs 1+ tokens. Allows bursts. Best for APIs.
  • Sliding Window Log: store timestamps of each request, count in window. Precise but memory-heavy.
  • Sliding Window Counter: hybrid — fixed window counts + weighted interpolation. Good balance.
  • Fixed Window: simple counter per time window. Edge case: 2x burst at window boundary.
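The sliding window counter's weighted interpolation is the least obvious of the four, so here it is in isolation: weight the previous fixed window's count by how much of it still overlaps the sliding window, add the current count, and compare to the limit.

```python
def sliding_window_allow(prev_count, curr_count, elapsed_frac, limit):
    """Sliding window counter check. elapsed_frac is how far we are into
    the current fixed window (0.0-1.0); the previous window's count is
    weighted by the remaining overlap (1 - elapsed_frac)."""
    estimated = prev_count * (1 - elapsed_frac) + curr_count
    return estimated < limit

# 40% into the current window, 80 requests last window, 50 so far:
# 80 * 0.6 + 50 = 98 < 100 -> allow
print(sliding_window_allow(80, 50, 0.4, 100))  # True
print(sliding_window_allow(80, 60, 0.4, 100))  # 108 -> False
```

This is the "good balance" above: two counters per key instead of a full timestamp log, at the cost of assuming requests were spread evenly across the previous window.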
Distributed Counting
  • Redis INCR + EXPIRE: atomic increment with TTL. Fast, simple.
  • Race conditions: use MULTI/EXEC or Lua scripts for atomic check-and-increment
  • Clock skew: use Redis server time, not client time
  • Failover: if Redis is down, fail open (allow) or fail closed (reject)?
Response Headers

Always return: X-RateLimit-Limit (max requests), X-RateLimit-Remaining (requests left), X-RateLimit-Reset (when window resets), and on 429: Retry-After header. This lets clients implement backoff without guessing.

Red Flags to Avoid

  • Single-node solution — must work across API fleet
  • No race condition handling in distributed counter
  • Ignoring clock skew between servers
  • Missing rate limit response headers