Microsoft Core AI VP Interview

Feb 24 - Mar 13, 2026 // 6 Interviewers

Interview Loop

6 interviews · 3 completed
Mar 2 · 2:00-3:00 PM
Pablo Castro
pcastro@microsoft.com
Technical
Walk me through a major technical decision and its tradeoffs
What technical bet didn't work out? What did you learn?
How do you balance org politics with technical architecture?
Example of navigating competing technical stacks?
How do you stay close to technical details as a senior leader?
My Prepared Answer
EXAMPLE: Single Ads AI Agent Stack
Situation: My team built the first agent runtime, scaled to 60M+ sessions/year. Frontend Ads built their own stack—political/org reasons, wanted independence from consumer.
Challenge: Multi-year unification battle. Two stacks: "Service Agent" vs "Product Agent."
Resolution: SVP eventually directed a single stack. Worked with a senior director through the productivity council (all LA+ engineering leads).
Key decisions: Stack architecture, when to unify vs wait, organizational alignment.
Lesson: Org design and relationships matter as much as technical architecture.

--- ALTERNATE EXAMPLE ---
Email Automation Evolution:
• Elixir: Direct F1 calls, 2 CUJs
• OSA Studio: Democratized plan building (org challenge)
• Catalyst Autoplan: Routing challenge
• Catalyst: Current unified approach
Technical bet that didn't work: Routing complexity in OSA.
Lesson: Organizational readiness matters as much as technical capability.

📌 Question to Ask: "As a Distinguished Engineer leading a team, how do you stay close to the code? What's your workflow for balancing leadership with technical depth?"
Mar 13 · 11:00 AM-12:00 PM
Bilal Alam
bialam@microsoft.com
Systems Design
Design an LLM inference system at scale
How do you think about reliability vs velocity tradeoffs?
Walk me through a major systems migration you led
How do you approach capacity planning for AI workloads?
What's your framework for build vs buy vs wait?
My Prepared Answer
DESIGN: LLM INFERENCE SERVING AT SCALE

═══ REQUEST FLOW ═══
Client → API Gateway → Auth → Rate Limiter → Router → Serving Pool → Response

═══ 1. API GATEWAY & AUTH ═══
• TLS termination, request validation
• Auth: API keys, OAuth tokens, or managed identity
• Extract: model_id, max_tokens, temperature, user_id
• Quota lookup: which tier? (free/pro/enterprise)

═══ 2. THROTTLING & RATE LIMITING ═══
• Token bucket per user/org (e.g., 100K tokens/min)
• Separate limits: requests/min AND tokens/min
• Priority queues: paid > free, short > long
• Backpressure: 429 with Retry-After header
• Circuit breaker: if pool unhealthy, fail fast
(A minimal token-bucket sketch follows this card.)

═══ 3. ROUTING ═══
Model Router decides WHERE to send:
• Model affinity: GPT-4 → Pool A, Claude → Pool B
• Latency SLO: p50 < 200ms? Route to fastest pool
• Cost optimization: batch small requests together
• Geographic: route to nearest region
Context-Aware Routing:
• Prefix caching: same system prompt? Route to GPU with cached KV
• Session affinity: multi-turn → same server (KV cache reuse)
• Load balancing: least-connections or weighted round-robin

═══ 4. POOLS & POD MANAGEMENT ═══
• Pool = group of GPU nodes serving the same model
• Pod = K8s unit (1-8 GPUs with tensor parallelism)
• Autoscaling: scale on queue depth, not CPU
  - Scale-up trigger: queue > 100 requests for 30s
  - Scale-down: idle GPUs for 5 min (expensive!)
• Warm pools: keep N pods hot for burst traffic
• Preemption: spot/preemptible for batch, on-demand for realtime

═══ 5. CONTINUOUS BATCHING (ELI5) ═══
OLD WAY (Static Batching): Imagine a bus that waits for 8 passengers, drives to the destination, then comes back. If passenger 1 wants to go 1 mile and passenger 8 wants 10 miles, everyone waits for the 10-mile trip.
NEW WAY (Continuous Batching): Imagine a bus that picks up/drops off passengers at EVERY stop. Passenger 1 gets off at mile 1, and a NEW passenger gets on immediately. The bus is always full and always moving.
TECHNICAL:
• Iteration = 1 decode step (generate 1 token per sequence)
• After each iteration, check: any sequence done?
• If done → evict from batch, slot opens
• If slot open → insert waiting request immediately
• Result: no head-of-line blocking, GPU always saturated

📌 Question: "What's your framework for deciding when to build a new service versus extending an existing one?"
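Supplementary sketch for the throttling section above: a minimal per-user token bucket in Python. The class, names, and numbers are illustrative, not any production gateway's API.

```python
import time

class TokenBucket:
    """Per-user token bucket (e.g., 100K tokens/min) with burst capacity."""
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity              # max burst size
        self.refill_per_sec = refill_per_sec  # steady-state refill rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, n: float) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller responds 429 with a Retry-After header

# 100K tokens/min => ~1667 tokens/sec refill
bucket = TokenBucket(capacity=100_000, refill_per_sec=100_000 / 60)
print(bucket.try_consume(4096))  # True until the budget is exhausted
```

Tracking tokens rather than requests is what makes this fit LLM serving: a single long-context request can cost as much as hundreds of short ones.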
Mar 5 · 11:30 AM-12:30 PM
Rajneesh Singh
rajneeshs@microsoft.com
Culture
Tell me about a significant failure and what you learned
How do you create psychological safety on your teams?
Example of growth mindset in action (not just slogans)?
How do you scale culture through your managers?
How do you lead through organizational change or crisis?
My Prepared Answer
EXAMPLE 1: EMR (Engineering Metrics Review)
Manager wanted to replicate a sales-style metrics review. Support team stalled 6 months claiming difficulties. I led by example—built a dashboard from what we DID know rather than waiting for perfect. Started the process, fixed it monthly. Now a cultural norm. Team owns and improves it without me.
Lesson: Start imperfect, iterate openly, empower others.

EXAMPLE 2: Support Lead Post-Layoff
Led team through org change after layoffs. Prioritized emotional support first—acknowledged loss, created space to process. Maintained open dialogue. Built Marketing Advisor team during this period—crisis created opportunity to rebuild culture intentionally.

EXAMPLE 3: Swim Lane Tracker
Introduced to show work distribution across teams. Made hidden work visible, surfaced load imbalances. Data revealed patterns for rebalancing. Teams saw fairness because it was transparent.
Lesson: Build systems that make the right thing visible.

PHILOSOPHY: Growth mindset in action, not slogans. Crisis can accelerate culture change when you lead with empathy first, then systems.

📌 Question to Ask: "You spent 16 years at Amazon building teams. What surprised you most about Microsoft's culture when you joined, and what do you think Microsoft does better?"
COMPLETED
Feb 24 · 4:00-5:00 PM
Yina Arenas
yinaa@microsoft.com
XFN Collaboration
Tell me about aligning product, eng, and research on a contentious decision
How do you deliver outcomes across orgs while balancing speed and rigor?
Example of decision-making under ambiguity with multiple stakeholders?
How do you handle disagreements between partner teams?
When have you had to rebuild trust after a misstep?
My Prepared Answer
EXAMPLE 1: APaS Partnership (Proactive Success)
Context: Ads Platform as a Service (APaS) partnership on transparency for policy decisions in Google Ads.
Stakeholders: Marwan (PM), Jordan (Eng Lead) from APaS.
Approach: Aligned early on shared goals—transparency for advertisers. Built trust from the start, not in recovery.
Two workstreams:
1. Transparent answers for creative disapprovals
2. Streaming data from account suspension ML models
Result: Successful ongoing partnership. Example of XFN done right from day one.

EXAMPLE 2: TAI Relationship (Recovery)
Situation: Got VP greenlight directly but frustrated TAI/GBAI by bypassing normal channels.
Action: Rebuilt trust through regular syncs. Created shared roadmap visibility.
Result: Cross-org alignment with GBO, gTech, TAI, GBAI.
Lesson: Even with exec sponsorship, bring teams along. Speed without buy-in creates debt.

📌 Question to Ask: "With Foundry touching models, tooling, and runtime—how do you navigate when research wants to ship fast versus product wanting enterprise readiness?"
COMPLETED
Feb 27 · 1:00-2:00 PM
Scott Van Vliet
svanvliet@microsoft.com
AI First
What's your experience building AI/ML products?
How do you stay current with AI developments?
Example of applying AI to solve a real business problem?
How do you evaluate when to use AI vs traditional approaches?
What AI trends are you most excited about?
My Prepared Answer
WORK AI PROJECTS:
• Marketing Advisor: Chrome extension → VM-based computer control agent. Prototype to GML announcement in 5 months.
• UI Control for EMR: AI-powered engineering metrics review automation
• Eunice/Uniss: Generalist agent runtime I built for work—multi-provider, multi-agent orchestration
• Agentic email: 60M+ sessions/year, reduced case volume 264M→168M

PERSONAL AI PROJECTS:
• Eunice: My personal generalist agent CLI—runs Claude, GPT, Gemini with tool use
• MacroMunch: AI-powered calorie tracker I built for myself
• OpenClaw: My own implementation of computer control/UI automation
• BlueBubbles + Claude: Integrated iMessage with Claude for a personal assistant via text

HOW I STAY CURRENT:
• Daily hands-on: Build real projects, not just read about AI
• Claude Code for all my development work
• Local LLMs on my own hardware
• Latent Space podcast, arXiv, Twitter/X AI community

WHAT EXCITES ME ABOUT AI:
The ability to remove toil so work becomes energizing. I want to CREATE MORE, not less. AI should amplify human capability, not replace human judgment. Every project I build is about making myself or my team more effective—shipping faster, automating the boring parts, focusing on what matters.

📌 Question to Ask: "What does success look like for this VP role in year one? What's the hardest problem you're hoping this person helps you solve?"
COMPLETED
Feb 27 · 3:30-4:30 PM
Eric Boyd
emboyd@microsoft.com
Leadership
How do you set vision and translate strategy into execution?
Tell me about holding a leader accountable for underperformance
How do you develop leaders at the director+ level?
Example of a difficult personnel decision and how you handled it?
How do you create organizational impact at scale?
My Prepared Answer
EXAMPLE: Underperformance Spectrum (4 examples, 4 outcomes)

1. PEYMON (PIP → Exit)
Hired for AI experience, checked out from the start. Over his head on delivery. Put on a PIP, took the option to leave.
Lesson: Watch engagement early.

2. ERIC GAGNON (Managed Out)
Promoted IC→manager, wanted the role. Just wanted power—dictator style. Ran multi-model stack, tons of infighting. Team mutinied, escalated to me.
Lesson: Desire ≠ capability.

3. HOMAM (Right Person, Wrong Role)
Started strong, couldn't coordinate cross-functionally at LX level. Wanted to advance but not technical enough. When I raised issues, he wanted to switch teams. I said I'd be honest with hiring managers. Switched, stayed at Amazon, succeeded in less technical roles.
Lesson: Honesty enables better outcomes.

4. JYOTSNA (Turnaround Success)
Current direct. Touch and go for 6 months, now performing well.
Lesson: Patience + clear feedback works.

📌 Question to Ask: "How do you think about the competitive landscape—Azure AI vs. AWS Bedrock vs. GCP Vertex? Where do you see Microsoft's durable advantage?"

Interviewers

6 profiles
DONE
Yina Arenas
CVP, Product - Microsoft Foundry
yinaa@microsoft.com
XFN Collaboration
Feb 24, 4:00-5:00 PM PST
Leads product for Azure AI Foundry, empowering developers to build with generative AI. Portfolio includes Azure OpenAI in Foundry Models, AI Model Ecosystem, AI Agent Runtime, and end-to-end toolchain.
  • Not an AI engineer by trade - built career at intersection of data, BI, and product
  • 10+ years at Microsoft, pivotal role shaping Microsoft Graph
  • Strong background in platform thinking and developer ecosystems
  • Publicly speaks about AI and creative resistance
Cross-org influence, decision-making under ambiguity, delivering outcomes across product/eng/research while balancing speed and rigor. Strategic discussion with selective depth - crisp framing over exhaustive detail.
You built your career in data and BI before leading AI Foundry. What surprised you most about leading an AI platform team compared to your Microsoft Graph days?
With Foundry touching models, tooling, and runtime—how do you navigate when research wants to ship fast versus product wanting enterprise readiness?
Microsoft Graph became essential infrastructure for M365. What's your vision for Foundry becoming that same "invisible backbone" for AI apps?
You've spoken about AI and creative resistance. What patterns do you see in teams that successfully adopt AI tooling versus those that struggle?
DONE
Scott Van Vliet
CVP, Azure OpenAI & AI Core Infrastructure
svanvliet@microsoft.com
AI First
Feb 27, 1:00-2:00 PM PST
Leads Azure OpenAI and AI Core Infrastructure. This is your potential direct manager - the role reports to him.
  • 20+ years building tech products
  • Former CVP of Microsoft Teams and Azure Communication platforms
  • GM at Amazon for Alexa, Echo, and Appstore devices; led Amazon Irvine office
  • SVP of Software Engineering at Relativity Space (rocket company)
  • Executive at Mattel
  • Based in Los Angeles, CA
Experience and genuine interest in AI, how you stay current with rapid developments, and how you apply AI to solve real-world problems. Expects concrete examples of building AI products, evaluating AI approaches, and demonstrating technical depth alongside business impact.
You've led teams at Amazon, Relativity Space, and now back at Microsoft. What drew you back, and what's different about building AI infrastructure versus real-time communications?
What does success look like for this VP role in year one? What's the hardest problem you're hoping this person helps you solve?
Coming from Teams—which ships to hundreds of millions—how do you think about balancing "move fast" startup energy with enterprise reliability in Azure OpenAI?
Leading software at a rocket company must have been unique. What lessons from Relativity Space shape how you think about engineering rigor in AI infrastructure?
How do you personally approach coaching your direct reports? What's something you've learned about developing VPs specifically?
DONE
Eric Boyd
CVP/Managing Director, AI Platform
emboyd@microsoft.com
Leadership
Feb 27, 3:30-4:30 PM PST
Oversees AI tools, infrastructure, hardware, big data systems, and key datasets powering ML across Bing, Bing Ads, and Microsoft Office. Recently expanded role to CVP/Managing Director.
  • Nearly a decade at Microsoft
  • Led Bing Ads engineering team including relevance AI, high-performance serving systems, big-data analytics
  • VP Engineering at Mochi Media (gaming platform)
  • 9+ years at Yahoo! - rose to VP of Platform Engineering
  • BS in Computer Engineering and Mathematics from MIT
Executive discussion grounded in real examples. Show how you set vision, translate strategy into execution, hold teams accountable, and develop leaders. Anchor to measurable outcomes and organizational impact.
Bing Ads serves billions of predictions daily. What architectural principles from that experience do you see as most critical for Azure AI Platform?
Congratulations on the expanded CVP/Managing Director role. What new challenges come with that scope, and how are you thinking about the AI Platform's next chapter?
You've built platform teams at Yahoo, Mochi Media, and Microsoft. What distinguishes the best platform engineers you've hired?
How do you think about the competitive landscape—Azure AI vs. AWS Bedrock vs. GCP Vertex? Where do you see Microsoft's durable advantage?
With nearly a decade leading AI Platform, how do you balance shipping new capabilities versus paying down technical debt in systems that can't go down?
Pablo Castro
CVP & Distinguished Engineer, CoreAI
pcastro@microsoft.com
Technical Retrospective
Mar 2, 2:00-3:00 PM PST
Leads the AI Knowledge team in CoreAI division. Executive and hands-on engineer with track record of identifying industry trends.
  • Distinguished Engineer - one of the highest technical ranks at Microsoft
  • Recent work on memory features in Microsoft Foundry
  • Agentic retrieval approaches in Azure AI Search
  • Has GitHub presence - still codes
  • Based in Redmond
Deep dive on major technical decisions, tradeoffs, lessons learned, and how those shape CVP-level judgment. Expects thoughtful reflection on pivotal decisions, what worked/didn't, and how lessons inform future leadership.
As a Distinguished Engineer leading a team, how do you stay close to the code? What's your personal workflow for balancing leadership with technical depth?
RAG architectures are evolving fast—agentic retrieval, memory, hybrid search. What bets is your team making on where retrieval is heading?
You've been identifying industry trends for years. Looking back, what's a major technical bet you made that didn't pan out, and what did you learn?
Distinguished Engineer is rare at Microsoft. How do you think about your influence—through code, architecture reviews, mentorship, or something else?
The memory features in Foundry are interesting for agent continuity. What's the hardest problem in making AI systems that remember context well?
Bilal Alam
Technical Fellow, Developer Division
bialam@microsoft.com
Systems Design
Mar 13, 11:00 AM-12:00 PM PST
Technical Fellow in Developer Division focusing on Azure services. Manages large engineering team overseeing services with significant external revenue.
  • 25+ years at Microsoft - deep institutional knowledge
  • Founder of Azure App Service
  • Founder of Azure Functions
  • Founder of Azure Container Apps
  • Founder of Azure API Management
  • Founder of Azure Logic Apps
  • Founder of Azure Static Web Apps
Systems thinking, scalability, reliability, and long-term architectural tradeoffs. Technical discussion focused on conceptual depth over code-level detail. How you approach system design decisions, risk management, and evolution over time.
You've founded Azure Functions, App Service, Container Apps, and more. What's your framework for deciding when to build a new service versus extending an existing one?
Azure Functions pioneered serverless at Microsoft. How do you see serverless patterns evolving with AI workloads—especially for inference?
25 years at Microsoft is remarkable. How has your approach to systems design evolved? What did you believe early in your career that you now think differently about?
As a Technical Fellow, you're at the pinnacle of the technical ladder. How do you use that position to shape Azure's technical direction?
Your services have millions of users and significant revenue. What's the hardest part of maintaining and evolving systems at that scale without breaking customers?
Rajneesh Singh
VP Engineering, Agent Foundry - CoreAI
rajneeshs@microsoft.com
Growth Mindset & One Microsoft
Mar 5, 11:30 AM-12:30 PM PST
VP of Engineering for Agent Foundry in CoreAI. Leads the platform for building, deploying, and managing AI agents. Joined Microsoft in Oct 2025 from AWS.
  • 16 years at Amazon/AWS before Microsoft (2009-2025)
  • At AWS: Director of GenAI Platform (SageMaker HyperPod, training jobs & frameworks)
  • At AWS: GM & Director for SageMaker Canvas, Forecast, Data Wrangler (no-code ML)
  • At Amazon Retail: Led Product Detail Page and Variation Core Technology
  • B.Tech CS from IIT Delhi, PGSEM from IIM Bangalore
Reflective conversation valuing authenticity and self-awareness. Examples of learning loops, handling setbacks, coaching others, and scaling culture through managers. Focus on behaviors, not slogans.
You spent 16 years at Amazon building teams. What surprised you most about Microsoft's culture when you joined, and what do you think Microsoft does better?
You led SageMaker Canvas—democratizing ML for non-engineers. How does that "lower the barrier" philosophy carry into how you think about Agent Foundry?
Your colleagues describe you as someone who creates paths forward in complex problem spaces. Can you give an example of navigating ambiguity in your first months at Microsoft?
Growth mindset is a Microsoft value, but it can become a slogan. How do you make it real in your team's daily work and hiring?
You've built teams at massive scale at Amazon. What's your framework for recruiting and developing leaders—what do you look for?
Agent Foundry is the platform for AI agents at Microsoft. What's the hardest cultural challenge in getting a new platform adopted across a company this large?

Experience Stories

13 cards
Stats
Google Ads Support Scale
support.google.com is 18th largest website globally
2 billion visits per week (exceeds Netflix)
5,000 Ads Specialists worldwide
30,000 MAU (Monthly Active Users) on support tools
Details
Case volume reduced from 264M to 168M annually (22M to 14M/month). This represents massive automation success while maintaining quality. Marketing Advisor has 10,000 users in Beta.
XFN Collaboration
Cases AI Agent Cross-Org
Reimagining Support Agent roles with AI
Cross-org: GBO, gTech, TAI (Trust & AI), GBAI
VP-level greenlight, but bypassed channels
Rebuilt trust through regular syncs
Details
Next-gen "practitioner-in-the-loop" AI experience. Initially got VP greenlight directly but frustrated partner teams by bypassing normal channels. Learned to balance urgency with stakeholder engagement—even with exec sponsorship, need to bring teams along. Rebuilt trust by establishing regular syncs with TAI and GBAI, creating shared roadmap visibility.
Growth Mindset
Listening to Engineers → Strategy
Sr. Engineer raised VM auth concern
AI agent can't use OAuth for Ads accounts
Planted seed in multi-VP review
Now baked into contract
Details
A skip-skip level engineer raised a concern about VM authentication—our AI agent needs to login to Ads accounts but cannot use OAuth, requiring Ads Platform approval. I listened, took it seriously, then planted the seed by adding it as a requirement in a multi-VP review. Now when we need the negotiation, it is not a net-new ask. Strategic foresight came from listening to an IC engineer.
Growth Mindset
Lead by Example: EMR
Built first monthly Eng health review myself
27 unique assets, borrowed patterns from Sales
Used for 2 years across org
Transferred ownership when team member inspired
Details
Created EMR (Engineering Metrics Review)—a total hack that worked. When a team member wanted to "clean it up," I encouraged him and admitted openly it was a hack. He did not have access, so I used my own AI agent to grant permissions from a spreadsheet. Shared in Eng Managers chat with no expectations. Modeling humility, encouraging ownership, removing blockers quietly.
Growth Mindset
Post-Layoff Transformation
Led team through org change after layoffs
Prioritized emotional support first
Shifted from silos to collaboration
Swim Lane Tracker made work visible
Details
Prioritized comfort first—acknowledged the loss, created space for people to process, maintained open dialogue. Introduced Swim Lane Tracker showing work distribution across teams—made hidden work visible, surfaced load imbalances. Data revealed patterns that informed rebalancing decisions; teams saw fairness because it was transparent. Crisis can accelerate culture change.
Technical
Agentic Email: Phase 1 (Elixir)
Problem: 3.4M cases/yr need human help
Limited TPU capacity
Downselected to 2 CUJs (Critical User Journeys)
Direct F1 (Google DB) calls vs API layers
Details
8.4M cases/year, 5M automated with deterministic flows. The remaining 3.4M always needed humans. Limited TPU capacity meant downselecting to Cancel Account (3K/yr) and Account Suspension (5K/yr). Team was hesitant about direct DB calls—multiple API layers existed. I negotiated: we already had data access, and other teams used F1 directly. Accepted the schema-change risk because Ads DB changes are slow/rare.
Technical
Agentic Email: Phase 2 (OSA Studio)
Attempt to "democratize" agent plan building
Frontend tooling done in weeks
Getting plans made was org challenge
Real challenge became routing—picking best plan
Details
OSA (One-Shot Agent) Studio was an attempt to democratize the process of building the agent plans we had made with Elixir. The frontend tooling was easy—built in a few weeks. Organizationally, it was hard to get the plans made. But once we did, the challenge became routing—picking the best plan for each case.
Technical
Marketing Advisor: Prototype → Prod
Jan 2025: Chrome Extension prototype
May 2025: Announced at GML (Google Marketing Live)
July 2025: Alpha launch
Evolution: Extension → VM Control
Details
From Chrome Extension prototype to production in 6 months. Built initial prototype as Chrome Extension, announced at Google Marketing Live, went into Alpha in July. Technical evolution: moved from Chrome Extension architecture to VM-based computer control for more robust and secure operation.
Leadership
Managing Underperformance
Amazon: 6% unregretted attrition target
Mechanized talent reviews into manager culture
Shifted from outcomes to behaviors
Reviews feel natural, not artificial
Details
Amazon context: mechanized talent reviews to make performance focus part of manager culture—nothing artificial, just consistent accountability. Evolution of thinking: used to judge JUST on outcomes. Now focus on inputs/behaviors—more actionable for the person to improve. Specific example (Peymon): came from accredited org, signs were there but took time to act. Came down to judging behaviors.
Leadership
Developing Directors
Muhammad Yahia: Promoted to L7, key leader
Rajat Dewan: Director promotion in progress
Philosophy: Coaching over prescription
Push strengths, minimize bad style parts
Details
Evolution: used to do "here is how I do it" → now far more personalized. Top talent has diverse styles—that is OK. Push into their strengths, minimize the bad style parts. Do not force your style on them. Key insight: coaching and empowerment over prescription.
Coaching
The Four Stages Framework
Stage 1: "Watch me do it"
Stage 2: "Help me do it"
Stage 3: "I'll help you do it"
Stage 4: "I'll watch you do it"
Details
Match stage to the person AND the specific skill—same person may be Stage 4 on execution but Stage 2 on exec communication. Stage 1: model the behavior, let them observe. Stage 2: they participate while you lead. Stage 3: they lead, you support. Stage 4: full ownership, you observe and provide retrospective coaching.
Coaching
Coaching Through Resistance
Jyotsna: skeptical of bold moves
Her team tried multiple agent techniques
Added Jason to build parallel prototype
Reintegrated teams with same product name
Details
Context: her team tried multiple agent email techniques over time; she was skeptical of bold moves and preferred small, incremental bites. I pushed for bigger "boulder" moves. Action: added Jason to build prototype independently. Reintegration: folded her team back in, gave space to assess. Outcome: everyone aligned. Kept same product name—feels like evolution, not replacement. Coaching insight: sometimes show, do not tell.
Coaching
Good Person, Wrong Role
Steven Pesci: skeptical of LLMs
Physics background → uncomfortable with probabilistic AI
Moved to Foundations team (6-person core systems)
Built MCP layer for F1 database—crushed it
Details
Steven came to me saying he was skeptical of LLMs. His physics background made him uncomfortable with the probabilistic nature of building LLM applications. Rather than losing a talented engineer, I asked if he wanted to work on our Foundations team—a small 6-person team managing core systems including the central F1 database. There was a new project to expose an MCP layer for this database in a secure way with replayability. Steven crushed it. Irony: the lower-stakes, more structured environment actually helped him get comfortable with vibe coding over time. Key lesson: he wasn't failing—he was mismatched. Listen to people's concerns, find the right fit for their strengths.

Systems Knowledge

20 cards
Inference Phases
Prefill vs Decode
Prefill: Compute-bound, all tokens parallel
Decode: Memory-bandwidth-bound, 1 token at a time
Prefill generates KV cache
Most optimization targets decode phase
Details
Prefill (prompt processing) processes all input tokens in parallel, generating the KV cache. Decode (generation) is sequential—generates one token at a time, reading entire KV cache per token. Decode is the bottleneck because it is memory-bandwidth-bound: you read all weights just to produce 1 token.
Memory
Memory Hierarchy
SRAM: ~50MB, 100+ TB/s (KV cache hot path)
HBM: 80-192GB, 2-5 TB/s (model weights)
NVLink: 900 GB/s (multi-GPU, 8 GPUs/node)
InfiniBand: 400 Gb/s (multi-node, low latency)
Ethernet: 400 Gb/s (multi-node, higher latency)
Details
MEMORY HIERARCHY (fast/small → slow/large):
1. SRAM (~50MB per GPU, 100+ TB/s): On-chip L2 cache + SM shared memory. FlashAttention tiles the KV cache here. Goal: keep hot data in SRAM.
2. HBM (80-192GB, 2-5 TB/s): GPU main memory. Stores model weights and KV cache overflow. Memory bandwidth is THE bottleneck for decode.
3. NVLink (900 GB/s per GPU): GPU-to-GPU within a node. Enables tensor parallelism across 8 GPUs. All-reduce for attention/FFN splits.
4. InfiniBand vs Ethernet (both 400 Gb/s):
• InfiniBand: RDMA, kernel bypass, ~1μs latency. Built for HPC. Expensive.
• Ethernet: Standard networking, ~10-50μs latency. Cheaper, more flexible. RoCE bridges the gap.
Rule: InfiniBand for training (latency-sensitive all-reduce); Ethernet+RoCE is often fine for inference.
KV Cache
KV Cache Memory Formula
2 × layers × kv_heads × head_dim × seq_len × bytes
Llama-70B with 8K context: ~2.6GB per sequence
Often exceeds model weights for long contexts
This is why PagedAttention matters
Details
KV cache stores Key and Value projections for all previous tokens to avoid recomputation. For large models with long contexts, KV cache memory can exceed model weight memory. This is the fundamental memory challenge of LLM serving.
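A quick way to sanity-check the formula above in Python. The Llama-70B-style numbers (80 layers, 8 GQA KV heads, head_dim 128, FP16) are assumptions that reproduce the ~2.6GB figure.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x accounts for separate Key and Value tensors at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama-70B-style config with GQA, FP16, 8K context:
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 1e9:.2f} GB per sequence")  # ~2.68 GB -- the ~2.6GB above
```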
Batching
Continuous Batching
Static batching: wait for longest sequence
Continuous: insert new requests as others finish
Iteration-level scheduling, not request-level
Used by vLLM, TensorRT-LLM, all modern engines
Details
Static batching wastes GPU cycles waiting for the longest sequence. Continuous batching schedules at the iteration level—as soon as one request finishes a token, the slot can be used for a new request. Eliminates head-of-line blocking.
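A toy scheduler loop showing what iteration-level scheduling means. Seq is a stand-in for a real sequence with a KV cache; this is not vLLM's actual scheduler.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    remaining: int                  # tokens left to generate
    def decode_one_token(self):
        self.remaining -= 1         # stand-in for one model forward step
    def finished(self) -> bool:
        return self.remaining <= 0

def step(active: list, waiting: deque, max_batch: int):
    """One iteration: decode 1 token per active sequence, evict finished
    sequences, then backfill open slots from the waiting queue."""
    for seq in list(active):
        seq.decode_one_token()
        if seq.finished():
            active.remove(seq)              # slot opens immediately
    while waiting and len(active) < max_batch:
        active.append(waiting.popleft())    # no head-of-line blocking

waiting = deque(Seq(n) for n in (1, 5, 3, 8))
active: list = []
iters = 0
while waiting or active:
    step(active, waiting, max_batch=2)
    iters += 1
print(f"drained in {iters} iterations")  # short sequences never wait on long ones
```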
Attention
FlashAttention
Standard attention: O(N²) memory in HBM
FlashAttention: tiles Q,K,V into SRAM blocks
Online softmax with correction
No O(N²) memory, 2-4x speedup
Details
Standard attention materializes the full N×N attention matrix in HBM. FlashAttention tiles the computation into blocks that fit in SRAM, computes partial softmax with online correction. Result: linear memory, 2-4x speedup. Now the default in all inference engines.
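The "online softmax with correction" idea in a few lines of numpy: process scores block by block, keeping only a running max and normalizer. Real FlashAttention also rescales a running output accumulator; this sketch shows just the softmax bookkeeping.

```python
import numpy as np

def online_softmax(scores: np.ndarray, block: int = 4) -> np.ndarray:
    m, s = -np.inf, 0.0                 # running max, running normalizer
    for i in range(0, len(scores), block):
        chunk = scores[i:i + block]
        m_new = max(m, chunk.max())
        # Rescale the old sum when a new, larger max appears.
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s       # final normalization

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```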
Memory Management
PagedAttention
Problem: variable seq lengths cause fragmentation
Solution: allocate KV cache in fixed blocks
Like OS virtual memory pages
Near-zero fragmentation, enables sharing
Details
Variable sequence lengths cause memory fragmentation in naive allocation. PagedAttention allocates KV cache in fixed-size blocks with a block table for indirection. Achieves near-zero fragmentation and enables memory sharing for beam search and prefix caching. Core innovation behind vLLM.
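A minimal block-table allocator in the spirit of PagedAttention. The block_size=16 and dict-based tables are illustrative choices, not vLLM's implementation.

```python
class BlockAllocator:
    """KV cache carved into fixed-size physical blocks; each sequence keeps
    a block table (logical -> physical), like OS virtual memory pages."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables: dict = {}                # seq_id -> list of block ids
        self.lengths: dict = {}               # seq_id -> tokens stored

    def append_token(self, seq_id: str):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str):
        self.free.extend(self.tables.pop(seq_id, []))  # instantly reusable
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=64)
for _ in range(40):
    alloc.append_token("req-1")
print(alloc.tables["req-1"])   # ceil(40/16) = 3 blocks, no fragmentation
alloc.release("req-1")
```

Prefix caching falls out of the indirection: two sequences sharing a system prompt can point their tables at the same physical blocks.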
Quantization
Quantization Strategies
Weight-only: INT4/8 weights, FP16 activations
Full: FP8/INT8 for weights AND activations
Weight-only: reduces memory, speeds memory-bound decode
Full: faster compute on Tensor Cores
Details
Weight-only quantization (AWQ, GPTQ) reduces memory and speeds up memory-bound decode phase. Full quantization (FP8 on Hopper/Blackwell) enables faster compute. Trade-off: accuracy vs throughput. FP8 preferred over INT8 because no calibration needed.
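A toy symmetric per-row INT8 round trip to make "weight-only" concrete. Real AWQ/GPTQ use calibration data and smarter scaling; this shows only the mechanics.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row INT8 quantization: one FP scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Done on the fly during decode: 2-4x fewer bytes read from HBM,
    # which is exactly what a memory-bound phase needs.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", float(np.abs(dequantize(q, s) - w).max()))
```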
Quantization
Quantization Methods
EXL2: Fastest on NVIDIA, decimal bit-rates
AWQ: High-performance, protects key weights
K-Quants (GGUF): CPU/GPU split, AMD/Apple friendly
GPTQ: Old reliable, universal compatibility
HQQ: Fast to create, no calibration needed
Details
EXL2 (ExLlamaV2): NVIDIA only, fastest inference, decimal bit-rates (e.g., 4.65-bit).
AWQ: NVIDIA + newer AMD. High-performance, preserves "intelligence" by protecting important weights.
K-Quants (GGUF): AMD, Apple Silicon, CPU. Flexible GPU/RAM split for massive models.
GPTQ: Universal, "old reliable." Widely supported, stable.
HQQ: Universal. Quantize in minutes without a calibration dataset.
Parallelism
Tensor Parallelism (TP)
Split attention heads + FFN across GPUs
Each GPU holds 1/N of weights
All-reduce to combine partial results
Low latency, needs fast interconnect (NVLink)
Details
Tensor parallelism splits within layers—attention heads and FFN columns distributed across GPUs. Requires all-reduce communication after each layer. Low latency for single requests but needs high-bandwidth interconnect (NVLink). Use for latency-sensitive serving, typically 2-8 GPUs.
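Tensor parallelism simulated in numpy: a row-parallel matmul where each "GPU" holds one shard and the partial outputs are summed, standing in for the all-reduce. Shapes are arbitrary.

```python
import numpy as np

x = np.random.randn(2, 64)          # activations (batch=2, hidden=64)
W = np.random.randn(64, 256)        # full weight, kept only for the check

x_shards = np.split(x, 4, axis=1)   # each rank gets a slice of the hidden dim
w_shards = np.split(W, 4, axis=0)   # ...and the matching rows of W

partials = [xi @ wi for xi, wi in zip(x_shards, w_shards)]  # local compute
y = sum(partials)                   # the all-reduce step on a real cluster
assert np.allclose(y, x @ W)        # same result as the unsharded matmul
```

The all-reduce after every layer is why the card calls for NVLink-class interconnect: it sits on the critical path of each decode step.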
Parallelism
Pipeline Parallelism (PP)
Different GPUs hold different layers
Data flows through pipeline
Micro-batching hides bubble overhead
Lower communication than TP, higher latency
Details
Pipeline parallelism splits across layers—GPU 1 holds layers 1-20, GPU 2 holds layers 21-40, etc. Lower communication requirements than TP but higher latency due to pipeline bubbles. Use for training or when model does not fit with TP alone. Combine with TP for very large models.
Optimization
Speculative Decoding
Draft model generates K candidates cheaply
Target model verifies in single parallel pass
Accept up to first mismatch
2-3x latency improvement, no quality loss
Details
Small draft model generates K candidate tokens quickly. Large target model verifies all K in a single forward pass (parallel!). If correct, accept all K; if wrong, accept up to first mismatch. Works because verification is parallel while generation is sequential. 2-3x speedup for latency.
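A greedy-decoding sketch of the draft/verify loop. The lambda "models" are toys, the verify loop is sequential in Python but one batched forward pass on real hardware, and production implementations use rejection sampling to preserve the target distribution exactly.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Draft k tokens cheaply, then accept the longest prefix the target
    model agrees with, plus one corrected token at the first mismatch."""
    ctx, candidates = list(prefix), []
    for _ in range(k):                     # cheap sequential drafting
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)

    ctx, accepted = list(prefix), []
    for t in candidates:                   # one parallel pass on the GPU
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)          # take the target's token and stop
            break
    return accepted                        # up to k tokens per target pass

target = lambda ctx: (len(ctx) * 7) % 5                         # toy "model"
draft = lambda ctx: (len(ctx) * 7) % 5 if len(ctx) % 4 else 0   # sometimes wrong
print(speculative_step(draft, target, prefix=[1, 2]))           # partial accept
```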
Architecture
Disaggregated Serving
Prefill = compute-bound workload
Decode = memory-bound workload
Why run both on same hardware?
Separate prefill nodes from decode nodes
Details
Prefill is compute-bound, decode is memory-bandwidth-bound—different resource profiles. Disaggregated architecture uses prefill nodes with high compute and decode nodes optimized for memory bandwidth. Transfer KV cache between them. Emerging pattern: Mooncake, DistServe.
Fundamentals
Arithmetic Intensity
AI = FLOPs / Bytes moved
Compare to hardware ops:byte ratio (~500 for H100)
If AI < ratio → memory-bound
Decode AI ≈ 1-2 → always memory-bound
Details
Arithmetic intensity is FLOPs per byte moved. H100 has ~500 ops:byte ratio for FP16. Decode phase has AI of 1-2 (one token output, read all weights), so it is always memory-bound. This explains why batching helps: amortize weight loading across multiple sequences.
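The decode-is-memory-bound claim as arithmetic, using the card's ~500 ops:byte ballpark. The 2 FLOPs/param/token rule of thumb is an assumption.

```python
params = 70e9                      # 70B-parameter model
flops_per_token = 2 * params       # ~2 FLOPs per parameter per generated token
bytes_per_token = params * 2       # every FP16 weight read once per decode step

ai = flops_per_token / bytes_per_token   # = 2.0 at batch size 1
hw_ratio = 500                           # H100-class ops:byte (card's ballpark)
print(f"AI = {ai:.0f} vs ~{hw_ratio} ops:byte ->",
      "memory-bound" if ai < hw_ratio else "compute-bound")
# Batching B sequences reads the weights once for B tokens, so AI scales ~B:
# you'd need B in the hundreds before decode turns compute-bound.
```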
Hardware
GPU Comparison
H100: 80GB HBM3, 3.35 TB/s, NVLink 900 GB/s
H200: 141GB HBM3e, 4.89 TB/s
MI300X: 192GB HBM3, 5.3 TB/s (best memory)
Blackwell B200: FP4 support, 2x H100
Details
NVIDIA dominates with software ecosystem (CUDA, TensorRT). AMD MI300X has superior memory specs but weaker software. Blackwell introduces FP4 support for even more aggressive quantization. Know the specs, but software maturity often matters more.
Frameworks
vLLM vs TensorRT-LLM
vLLM: PagedAttention, Python, flexible
TensorRT-LLM: NVIDIA optimized, compiled graphs
SGLang: RadixAttention for prefix sharing
Choice: flexibility vs raw performance
Details
vLLM: great for research/startups, multi-hardware support, PagedAttention innovation. TensorRT-LLM: best raw performance on NVIDIA, compiled computation graphs, production-optimized. SGLang: RadixAttention for efficient prefix caching. Choice depends on flexibility vs performance, NVIDIA-only vs multi-hardware.
Training
Training Parallelism
Data Parallel (DP/FSDP): replicate model, split data
Tensor Parallel: split ops within layers
Pipeline Parallel: split layers across GPUs
Expert Parallel: MoE routing
Details
Modern training uses 3D/4D parallelism combining DP, TP, PP, and Expert Parallel. FSDP shards optimizer states across DP ranks for memory efficiency. Key insight: different parallelism strategies for different model sizes and hardware configurations.
Training
Checkpointing & Fault Tolerance
Checkpoint every N steps (model + optimizer state)
Activation checkpointing: recompute vs store
MTBF drops at scale
Elastic training: survive node failures gracefully
Details
CHECKPOINTING:
• Full checkpoint: weights + optimizer states + LR scheduler. Can be 3x model size (Adam stores m, v per param).
• Frequency trade-off: more checkpoints = more I/O overhead.
ACTIVATION CHECKPOINTING:
• Trade compute for memory. Recompute during the backward pass.
• Reduces memory ~50%, increases compute ~30%.
FAULT TOLERANCE:
• 1000+ GPU clusters have node failures daily.
• Elastic training frameworks (DeepSpeed, PyTorch Elastic) can recover without a full restart.
(A minimal checkpoint save/load sketch follows this card.)
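A minimal single-process PyTorch checkpoint round trip. Distributed jobs shard this (e.g., one file per rank), but the state_dict contents are the same idea; the file name and tiny model are placeholders.

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    # Weights + optimizer state; Adam's m/v tensors make this ~3x model size.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"]                 # resume from here after a node failure

model = torch.nn.Linear(32, 1)
opt = torch.optim.Adam(model.parameters())
save_checkpoint("ckpt.pt", model, opt, step=1000)
print(load_checkpoint("ckpt.pt", model, opt))  # -> 1000
```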
Training
Mixed Precision Training
FP32 master weights, FP16/BF16 for compute
Loss scaling to prevent gradient underflow
BF16 preferred: same range as FP32
2x memory savings, faster tensor core ops
Details
MIXED PRECISION:
• Store master weights in FP32 for stability.
• Compute forward/backward in FP16 or BF16 for speed.
• Accumulate gradients in FP32.
FP16 vs BF16:
• FP16: 5-bit exponent, 10-bit mantissa. Narrow range, needs loss scaling.
• BF16: 8-bit exponent, 7-bit mantissa. Same range as FP32, no loss scaling needed.
LOSS SCALING:
• FP16 gradients can underflow (become zero).
• Multiply the loss by a scale factor, divide gradients by it after backward.
• Dynamic loss scaling adjusts the factor automatically (see sketch below).
BENEFIT: 2x memory savings, 2-3x faster on tensor cores.
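The dynamic loss-scaling logic in isolation. The constants mirror common defaults, and torch.cuda.amp.GradScaler behaves similarly internally, but this is a sketch, not that API.

```python
class DynamicLossScaler:
    """Scale the loss so FP16 gradients don't underflow; back off on
    overflow, grow again after a long run of clean steps."""
    def __init__(self, scale=2.0**16, growth=2.0, backoff=0.5, interval=2000):
        self.scale = scale
        self.growth, self.backoff, self.interval = growth, backoff, interval
        self.good_steps = 0

    def update(self, found_inf: bool) -> bool:
        """Call after unscaling grads; returns whether to apply this step."""
        if found_inf:
            self.scale *= self.backoff   # overflow: shrink scale, skip step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.interval == 0:
            self.scale *= self.growth    # long clean run: try a larger scale
        return True

scaler = DynamicLossScaler()
# Training loop shape: backward on (loss * scaler.scale), divide grads by
# scaler.scale, then: if scaler.update(found_inf): optimizer.step()
```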
Training
Gradient Accumulation
Simulate larger batch sizes without more memory
Accumulate gradients over K micro-batches
Only update weights after K steps
Trade-off: same compute, longer wall-clock time
Details
WHY: Large batch sizes improve training stability but require more memory.
HOW:
• Forward + backward on micro-batch 1, store gradients
• Forward + backward on micro-batch 2, accumulate gradients
• ...repeat K times
• Apply optimizer step with accumulated gradients
EFFECT:
• Effective batch size = micro_batch × K × num_GPUs
• Memory = memory for 1 micro-batch
• Time = K× slower per optimizer step
USE WHEN:
• GPU memory limits batch size
• Need a large effective batch for stability (common in LLM training)
(A minimal PyTorch accumulation loop follows this card.)
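The recipe above as a runnable PyTorch loop. The tiny Linear model, random data, and K=4 are placeholders; the accumulation pattern is the point.

```python
import torch

model = torch.nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
K = 4                                     # micro-batches per optimizer step

opt.zero_grad()
for step in range(16):
    x, y = torch.randn(8, 32), torch.randn(8, 1)       # micro-batch of 8
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / K).backward()                 # grads accumulate in .grad; dividing
                                          # by K turns the sum into an average
    if (step + 1) % K == 0:               # effective batch = 8 * K = 32
        opt.step()
        opt.zero_grad()
```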
Hardware
TPU vs GPU
TPU: Google custom ASIC, optimized for matrix ops
GPU: General-purpose, NVIDIA ecosystem dominates
TPU: Cheaper at scale, but Google Cloud only
GPU: CUDA ecosystem, more flexible
Details
GOOGLE TPU:
• Custom ASIC for neural network workloads
• Systolic array architecture, optimized for matrix multiply
• ICI (Inter-Chip Interconnect): 4800 Gbps
• Cheaper per FLOP at Google scale
• Limitation: Google Cloud only; JAX/TensorFlow preferred
NVIDIA GPU:
• General-purpose, excels at parallel compute
• CUDA ecosystem: mature, broad library support
• NVLink for multi-GPU, InfiniBand for multi-node
• Works everywhere: cloud, on-prem, consumer
KEY DIFFERENCES:
• TPU: better price/perf at Google scale, less flexible
• GPU: ecosystem dominance, runs anywhere
• TPU pods scale better for large clusters
• GPU wins for inference diversity; TPU wins for large-scale Google training