Microsoft Core AI VP Interview

Feb 24 - Mar 13, 2026 // 6 Interviewers

Interview Loop

6 interviews · 3 completed
Mar 2 · 2:00-3:00 PM
Pablo Castro
pcastro@microsoft.com
Technical
Walk me through a major technical decision and its tradeoffs
What technical bet didn't work out? What did you learn?
How do you balance org politics with technical architecture?
Example of navigating competing technical stacks?
How do you stay close to technical details as a senior leader?
My Prepared Answer
EXAMPLE: Single Ads AI Agent Stack
Situation: My team built the first agent runtime, scaled to 60M+ sessions/year. Frontend Ads built their own stack—political/org reasons, wanted independence from consumer.
Challenge: Multi-year unification battle. Two stacks: "Service Agent" vs "Product Agent."
Resolution: SVP eventually directed a single stack. Worked with a senior director through the productivity council (all LA+ engineering leads).
Key decisions: Stack architecture, when to unify vs wait, organizational alignment.
Lesson: Org design and relationships matter as much as technical architecture.

--- ALTERNATE EXAMPLE ---
Email Automation Evolution:
• Elixir: Direct F1 calls, 2 CUJs
• OSA Studio: Democratized plan building (org challenge)
• Catalyst Autoplan: Routing challenge
• Catalyst: Current unified approach
Technical bet that didn't work: Routing complexity in OSA.
Lesson: Organizational readiness matters as much as technical capability.

📌 Question to Ask: "As a Distinguished Engineer leading a team, how do you stay close to the code? What's your workflow for balancing leadership with technical depth?"
Mar 13 · 11:00 AM-12:00 PM
Bilal Alam
bialam@microsoft.com
Systems Design
Design an LLM inference system at scale
How do you think about reliability vs velocity tradeoffs?
Walk me through a major systems migration you led
How do you approach capacity planning for AI workloads?
What's your framework for build vs buy vs wait?
My Prepared Answer
DESIGN: LLM INFERENCE SERVING AT SCALE

═══ REQUEST FLOW ═══
Client → API Gateway → Auth → Rate Limiter → Router → Serving Pool → Response

═══ 1. API GATEWAY & AUTH ═══
• TLS termination, request validation
• Auth: API keys, OAuth tokens, or managed identity
• Extract: model_id, max_tokens, temperature, user_id
• Quota lookup: which tier? (free/pro/enterprise)

═══ 2. THROTTLING & RATE LIMITING ═══
• Token bucket per user/org (e.g., 100K tokens/min)
• Separate limits: requests/min AND tokens/min
• Priority queues: paid > free, short > long
• Backpressure: 429 with Retry-After header
• Circuit breaker: if pool unhealthy, fail fast
(A minimal token-bucket sketch follows this card.)

═══ 3. ROUTING ═══
Model Router decides WHERE to send:
• Model affinity: GPT-4 → Pool A, Claude → Pool B
• Latency SLO: p50 < 200ms? Route to fastest pool
• Cost optimization: batch small requests together
• Geographic: route to nearest region
Context-Aware Routing:
• Prefix caching: same system prompt? Route to GPU with cached KV
• Session affinity: multi-turn → same server (KV cache reuse)
• Load balancing: least-connections or weighted round-robin

═══ 4. POOLS & POD MANAGEMENT ═══
• Pool = group of GPU nodes serving the same model
• Pod = K8s unit (1-8 GPUs with tensor parallelism)
• Autoscaling: scale on queue depth, not CPU
  - Scale-up trigger: queue > 100 requests for 30s
  - Scale-down: idle GPUs for 5 min (expensive!)
• Warm pools: keep N pods hot for burst traffic
• Preemption: spot/preemptible for batch, on-demand for realtime

═══ 5. CONTINUOUS BATCHING (ELI5) ═══
OLD WAY (Static Batching): Imagine a bus that waits for 8 passengers, drives to the destination, then comes back. If passenger 1 wants to go 1 mile and passenger 8 wants 10 miles, everyone waits for the 10-mile trip.
NEW WAY (Continuous Batching): Imagine a bus that picks up/drops off passengers at EVERY stop. Passenger 1 gets off at mile 1, and a NEW passenger gets on immediately. The bus is always full and always moving.
TECHNICAL:
• Iteration = 1 decode step (generate 1 token per sequence)
• After each iteration, check: any sequence done?
• If done → evict from batch, slot opens
• If slot open → insert waiting request immediately
• Result: no head-of-line blocking, GPU always saturated

📌 Question: "What's your framework for deciding when to build a new service versus extending an existing one?"
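Supplementary sketch for the throttling section above: a minimal per-user token bucket in Python. The class, names, and numbers are illustrative, not any production gateway's API.

```python
import time

class TokenBucket:
    """Per-user token bucket (e.g., 100K tokens/min) with burst capacity."""
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity              # max burst size
        self.refill_per_sec = refill_per_sec  # steady-state refill rate
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self, n: float) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller responds 429 with a Retry-After header

# 100K tokens/min => ~1667 tokens/sec refill
bucket = TokenBucket(capacity=100_000, refill_per_sec=100_000 / 60)
print(bucket.try_consume(4096))  # True until the budget is exhausted
```

Tracking tokens rather than requests is what makes this fit LLM serving: a single long-context request can cost as much as hundreds of short ones.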
Mar 5 · 11:30 AM-12:30 PM
Rajneesh Singh
rajneeshs@microsoft.com
Culture
Tell me about a significant failure and what you learned
How do you create psychological safety on your teams?
Example of growth mindset in action (not just slogans)?
How do you scale culture through your managers?
How do you lead through organizational change or crisis?
My Prepared Answer
EXAMPLE 1: EMR (Engineering Metrics Review)
Manager wanted to replicate a sales-style metrics review. Support team stalled 6 months claiming difficulties. I led by example—built a dashboard from what we DID know rather than waiting for perfect. Started the process, fixed it monthly. Now a cultural norm. Team owns and improves it without me.
Lesson: Start imperfect, iterate openly, empower others.

EXAMPLE 2: Support Lead Post-Layoff
Led team through org change after layoffs. Prioritized emotional support first—acknowledged loss, created space to process. Maintained open dialogue. Built Marketing Advisor team during this period—crisis created opportunity to rebuild culture intentionally.

EXAMPLE 3: Swim Lane Tracker
Introduced to show work distribution across teams. Made hidden work visible, surfaced load imbalances. Data revealed patterns for rebalancing. Teams saw fairness because it was transparent.
Lesson: Build systems that make the right thing visible.

PHILOSOPHY: Growth mindset in action, not slogans. Crisis can accelerate culture change when you lead with empathy first, then systems.

📌 Question to Ask: "You spent 16 years at Amazon building teams. What surprised you most about Microsoft's culture when you joined, and what do you think Microsoft does better?"
COMPLETED
Feb 24 · 4:00-5:00 PM
Yina Arenas
yinaa@microsoft.com
XFN Collaboration
Tell me about aligning product, eng, and research on a contentious decision
How do you deliver outcomes across orgs while balancing speed and rigor?
Example of decision-making under ambiguity with multiple stakeholders?
How do you handle disagreements between partner teams?
When have you had to rebuild trust after a misstep?
My Prepared Answer
EXAMPLE 1: APaS Partnership (Proactive Success)
Context: Ads Platform as a Service (APaS) partnership on transparency for policy decisions in Google Ads.
Stakeholders: Marwan (PM), Jordan (Eng Lead) from APaS.
Approach: Aligned early on shared goals—transparency for advertisers. Built trust from the start, not in recovery.
Two workstreams:
1. Transparent answers for creative disapprovals
2. Streaming data from account suspension ML models
Result: Successful ongoing partnership. Example of XFN done right from day one.

EXAMPLE 2: TAI Relationship (Recovery)
Situation: Got VP greenlight directly but frustrated TAI/GBAI by bypassing normal channels.
Action: Rebuilt trust through regular syncs. Created shared roadmap visibility.
Result: Cross-org alignment with GBO, gTech, TAI, GBAI.
Lesson: Even with exec sponsorship, bring teams along. Speed without buy-in creates debt.

📌 Question to Ask: "With Foundry touching models, tooling, and runtime—how do you navigate when research wants to ship fast versus product wanting enterprise readiness?"
COMPLETED
Feb 27 · 1:00-2:00 PM
Scott Van Vliet
svanvliet@microsoft.com
AI First
What's your experience building AI/ML products?
How do you stay current with AI developments?
Example of applying AI to solve a real business problem?
How do you evaluate when to use AI vs traditional approaches?
What AI trends are you most excited about?
My Prepared Answer
WORK AI PROJECTS:
• Marketing Advisor: Chrome extension → VM-based computer control agent. Prototype to GML announcement in 5 months.
• UI Control for EMR: AI-powered engineering metrics review automation
• Eunice/Uniss: Generalist agent runtime I built for work—multi-provider, multi-agent orchestration
• Agentic email: 60M+ sessions/year, reduced case volume 264M→168M

PERSONAL AI PROJECTS:
• Eunice: My personal generalist agent CLI—runs Claude, GPT, Gemini with tool use
• MacroMunch: AI-powered calorie tracker I built for myself
• OpenClaw: My own implementation of computer control/UI automation
• BlueBubbles + Claude: Integrated iMessage with Claude for a personal assistant via text

HOW I STAY CURRENT:
• Daily hands-on: Build real projects, not just read about AI
• Claude Code for all my development work
• Local LLMs on my own hardware
• Latent Space podcast, arXiv, Twitter/X AI community

WHAT EXCITES ME ABOUT AI:
The ability to remove toil so work becomes energizing. I want to CREATE MORE, not less. AI should amplify human capability, not replace human judgment. Every project I build is about making myself or my team more effective—shipping faster, automating the boring parts, focusing on what matters.

📌 Question to Ask: "What does success look like for this VP role in year one? What's the hardest problem you're hoping this person helps you solve?"
COMPLETED
Feb 27 · 3:30-4:30 PM
Eric Boyd
emboyd@microsoft.com
Leadership
How do you set vision and translate strategy into execution?
Tell me about holding a leader accountable for underperformance
How do you develop leaders at the director+ level?
Example of a difficult personnel decision and how you handled it?
How do you create organizational impact at scale?
My Prepared Answer
EXAMPLE: Underperformance Spectrum (4 examples, 4 outcomes)

1. PEYMON (PIP → Exit)
Hired for AI experience, checked out from the start. Over his head on delivery. Put on a PIP, took the option to leave.
Lesson: Watch engagement early.

2. ERIC GAGNON (Managed Out)
Promoted IC→manager, wanted the role. Just wanted power—dictator style. Ran multi-model stack, tons of infighting. Team mutinied, escalated to me.
Lesson: Desire ≠ capability.

3. HOMAM (Right Person, Wrong Role)
Started strong, couldn't coordinate cross-functionally at LX level. Wanted to advance but not technical enough. When I raised issues, he wanted to switch teams. I said I'd be honest with hiring managers. Switched, stayed at Amazon, succeeded in less technical roles.
Lesson: Honesty enables better outcomes.

4. JYOTSNA (Turnaround Success)
Current direct. Touch and go for 6 months, now performing well.
Lesson: Patience + clear feedback works.

📌 Question to Ask: "How do you think about the competitive landscape—Azure AI vs. AWS Bedrock vs. GCP Vertex? Where do you see Microsoft's durable advantage?"

Interviewers

6 profiles
DONE
Yina Arenas
CVP, Product - Microsoft Foundry
yinaa@microsoft.com
XFN Collaboration
Feb 24, 4:00-5:00 PM PST
Leads product for Azure AI Foundry, empowering developers to build with generative AI. Portfolio includes Azure OpenAI in Foundry Models, AI Model Ecosystem, AI Agent Runtime, and end-to-end toolchain.
  • Not an AI engineer by trade - built career at intersection of data, BI, and product
  • 10+ years at Microsoft, pivotal role shaping Microsoft Graph
  • Strong background in platform thinking and developer ecosystems
  • Publicly speaks about AI and creative resistance
Cross-org influence, decision-making under ambiguity, delivering outcomes across product/eng/research while balancing speed and rigor. Strategic discussion with selective depth - crisp framing over exhaustive detail.
You built your career in data and BI before leading AI Foundry. What surprised you most about leading an AI platform team compared to your Microsoft Graph days?
With Foundry touching models, tooling, and runtime—how do you navigate when research wants to ship fast versus product wanting enterprise readiness?
Microsoft Graph became essential infrastructure for M365. What's your vision for Foundry becoming that same "invisible backbone" for AI apps?
You've spoken about AI and creative resistance. What patterns do you see in teams that successfully adopt AI tooling versus those that struggle?
DONE
Scott Van Vliet
CVP, Azure OpenAI & AI Core Infrastructure
svanvliet@microsoft.com
AI First
Feb 27, 1:00-2:00 PM PST
Leads Azure OpenAI and AI Core Infrastructure. This is your potential direct manager - the role reports to him.
  • 20+ years building tech products
  • Former CVP of Microsoft Teams and Azure Communication platforms
  • GM at Amazon for Alexa, Echo, and Appstore devices; led Amazon Irvine office
  • SVP of Software Engineering at Relativity Space (rocket company)
  • Executive at Mattel
  • Based in Los Angeles, CA
Experience and genuine interest in AI, how you stay current with rapid developments, and how you apply AI to solve real-world problems. Expects concrete examples of building AI products, evaluating AI approaches, and demonstrating technical depth alongside business impact.
You've led teams at Amazon, Relativity Space, and now back at Microsoft. What drew you back, and what's different about building AI infrastructure versus real-time communications?
What does success look like for this VP role in year one? What's the hardest problem you're hoping this person helps you solve?
Coming from Teams—which ships to hundreds of millions—how do you think about balancing "move fast" startup energy with enterprise reliability in Azure OpenAI?
Leading software at a rocket company must have been unique. What lessons from Relativity Space shape how you think about engineering rigor in AI infrastructure?
How do you personally approach coaching your direct reports? What's something you've learned about developing VPs specifically?
DONE
Eric Boyd
CVP/Managing Director, AI Platform
emboyd@microsoft.com
Leadership
Feb 27, 3:30-4:30 PM PST
Oversees AI tools, infrastructure, hardware, big data systems, and key datasets powering ML across Bing, Bing Ads, and Microsoft Office. Recently expanded role to CVP/Managing Director.
  • Nearly a decade at Microsoft
  • Led Bing Ads engineering team including relevance AI, high-performance serving systems, big-data analytics
  • VP Engineering at Mochi Media (gaming platform)
  • 9+ years at Yahoo! - rose to VP of Platform Engineering
  • BS in Computer Engineering and Mathematics from MIT
Executive discussion grounded in real examples. Show how you set vision, translate strategy into execution, hold teams accountable, and develop leaders. Anchor to measurable outcomes and organizational impact.
Bing Ads serves billions of predictions daily. What architectural principles from that experience do you see as most critical for Azure AI Platform?
Congratulations on the expanded CVP/Managing Director role. What new challenges come with that scope, and how are you thinking about the AI Platform's next chapter?
You've built platform teams at Yahoo, Mochi Media, and Microsoft. What distinguishes the best platform engineers you've hired?
How do you think about the competitive landscape—Azure AI vs. AWS Bedrock vs. GCP Vertex? Where do you see Microsoft's durable advantage?
With nearly a decade leading AI Platform, how do you balance shipping new capabilities versus paying down technical debt in systems that can't go down?
Pablo Castro
CVP & Distinguished Engineer, CoreAI
pcastro@microsoft.com
Technical Retrospective
Mar 2, 2:00-3:00 PM PST
Leads the AI Knowledge team in CoreAI division. Executive and hands-on engineer with track record of identifying industry trends.
  • Distinguished Engineer - one of the highest technical ranks at Microsoft
  • Recent work on memory features in Microsoft Foundry
  • Agentic retrieval approaches in Azure AI Search
  • Has GitHub presence - still codes
  • Based in Redmond
Deep dive on major technical decisions, tradeoffs, lessons learned, and how those shape CVP-level judgment. Expects thoughtful reflection on pivotal decisions, what worked/didn't, and how lessons inform future leadership.
As a Distinguished Engineer leading a team, how do you stay close to the code? What's your personal workflow for balancing leadership with technical depth?
RAG architectures are evolving fast—agentic retrieval, memory, hybrid search. What bets is your team making on where retrieval is heading?
You've been identifying industry trends for years. Looking back, what's a major technical bet you made that didn't pan out, and what did you learn?
Distinguished Engineer is rare at Microsoft. How do you think about your influence—through code, architecture reviews, mentorship, or something else?
The memory features in Foundry are interesting for agent continuity. What's the hardest problem in making AI systems that remember context well?
Bilal Alam
Technical Fellow, Developer Division
bialam@microsoft.com
Systems Design
Mar 13, 11:00 AM-12:00 PM PST
Technical Fellow in Developer Division focusing on Azure services. Manages large engineering team overseeing services with significant external revenue.
  • 25+ years at Microsoft - deep institutional knowledge
  • Founder of Azure App Service
  • Founder of Azure Functions
  • Founder of Azure Container Apps
  • Founder of Azure API Management
  • Founder of Azure Logic Apps
  • Founder of Azure Static Web Apps
Systems thinking, scalability, reliability, and long-term architectural tradeoffs. Technical discussion focused on conceptual depth over code-level detail. How you approach system design decisions, risk management, and evolution over time.
You've founded Azure Functions, App Service, Container Apps, and more. What's your framework for deciding when to build a new service versus extending an existing one?
Azure Functions pioneered serverless at Microsoft. How do you see serverless patterns evolving with AI workloads—especially for inference?
25 years at Microsoft is remarkable. How has your approach to systems design evolved? What did you believe early in your career that you now think differently about?
As a Technical Fellow, you're at the pinnacle of the technical ladder. How do you use that position to shape Azure's technical direction?
Your services have millions of users and significant revenue. What's the hardest part of maintaining and evolving systems at that scale without breaking customers?
Rajneesh Singh
VP Engineering, Agent Foundry - CoreAI
rajneeshs@microsoft.com
Growth Mindset & One Microsoft
Mar 5, 11:30 AM-12:30 PM PST
VP of Engineering for Agent Foundry in CoreAI. Leads the platform for building, deploying, and managing AI agents. Joined Microsoft in Oct 2025 from AWS.
  • 16 years at Amazon/AWS before Microsoft (2009-2025)
  • At AWS: Director of GenAI Platform (SageMaker HyperPod, training jobs & frameworks)
  • At AWS: GM & Director for SageMaker Canvas, Forecast, Data Wrangler (no-code ML)
  • At Amazon Retail: Led Product Detail Page and Variation Core Technology
  • B.Tech CS from IIT Delhi, PGSEM from IIM Bangalore
Reflective conversation valuing authenticity and self-awareness. Examples of learning loops, handling setbacks, coaching others, and scaling culture through managers. Focus on behaviors, not slogans.
You spent 16 years at Amazon building teams. What surprised you most about Microsoft's culture when you joined, and what do you think Microsoft does better?
You led SageMaker Canvas—democratizing ML for non-engineers. How does that "lower the barrier" philosophy carry into how you think about Agent Foundry?
Your colleagues describe you as someone who creates paths forward in complex problem spaces. Can you give an example of navigating ambiguity in your first months at Microsoft?
Growth mindset is a Microsoft value, but it can become a slogan. How do you make it real in your team's daily work and hiring?
You've built teams at massive scale at Amazon. What's your framework for recruiting and developing leaders—what do you look for?
Agent Foundry is the platform for AI agents at Microsoft. What's the hardest cultural challenge in getting a new platform adopted across a company this large?

Experience Stories

13 cards
Stats
Google Ads Support Scale
support.google.com is 18th largest website globally
2 billion visits per week (exceeds Netflix)
5,000 Ads Specialists worldwide
30,000 MAU (Monthly Active Users) on support tools
Details
Case volume reduced from 264M to 168M annually (22M to 14M/month). This represents massive automation success while maintaining quality. Marketing Advisor has 10,000 users in Beta.
XFN Collaboration
Cases AI Agent Cross-Org
Reimagining Support Agent roles with AI
Cross-org: GBO, gTech, TAI (Trust & AI), GBAI
VP-level greenlight, but bypassed channels
Rebuilt trust through regular syncs
Details
Next-gen "practitioner-in-the-loop" AI experience. Initially got VP greenlight directly but frustrated partner teams by bypassing normal channels. Learned to balance urgency with stakeholder engagement—even with exec sponsorship, need to bring teams along. Rebuilt trust by establishing regular syncs with TAI and GBAI, creating shared roadmap visibility.
Growth Mindset
Listening to Engineers → Strategy
Sr. Engineer raised VM auth concern
AI agent can't use OAuth for Ads accounts
Planted seed in multi-VP review
Now baked into contract
Details
A skip-skip level engineer raised a concern about VM authentication—our AI agent needs to login to Ads accounts but cannot use OAuth, requiring Ads Platform approval. I listened, took it seriously, then planted the seed by adding it as a requirement in a multi-VP review. Now when we need the negotiation, it is not a net-new ask. Strategic foresight came from listening to an IC engineer.
Growth Mindset
Lead by Example: EMR
Built first monthly Eng health review myself
27 unique assets, borrowed patterns from Sales
Used for 2 years across org
Transferred ownership when team member inspired
Details
Created EMR (Engineering Metrics Review)—a total hack that worked. When a team member wanted to "clean it up," I encouraged him and admitted openly it was a hack. He did not have access, so I used my own AI agent to grant permissions from a spreadsheet. Shared in Eng Managers chat with no expectations. Modeling humility, encouraging ownership, removing blockers quietly.
Growth Mindset
Post-Layoff Transformation
Led team through org change after layoffs
Prioritized emotional support first
Shifted from silos to collaboration
Swim Lane Tracker made work visible
Details
Prioritized comfort first—acknowledged the loss, created space for people to process, maintained open dialogue. Introduced Swim Lane Tracker showing work distribution across teams—made hidden work visible, surfaced load imbalances. Data revealed patterns that informed rebalancing decisions; teams saw fairness because it was transparent. Crisis can accelerate culture change.
Technical
Agentic Email: Phase 1 (Elixir)
Problem: 3.4M cases/yr need human help
Limited TPU capacity
Downselected to 2 CUJs (Critical User Journeys)
Direct F1 (Google DB) calls vs API layers
Details
8.4M cases/year, 5M automated with deterministic flows. The remaining 3.4M always needed humans. Limited TPU capacity meant downselecting to Cancel Account (3K/yr) and Account Suspension (5K/yr). Team was hesitant about direct DB calls—multiple API layers existed. I negotiated: we already had data access, and other teams used F1 directly. Accepted the schema-change risk because Ads DB changes are slow/rare.
Technical
Agentic Email: Phase 2 (OSA Studio)
Attempt to "democratize" agent plan building
Frontend tooling done in weeks
Getting plans made was org challenge
Real challenge became routing—picking best plan
Details
OSA (One-Shot Agent) Studio was an attempt to democratize the process of building the agent plans we had made with Elixir. The frontend tooling was easy—built in a few weeks. Organizationally, it was hard to get the plans made. But once we did, the challenge became routing—picking the best plan for each case.
Technical
Marketing Advisor: Prototype → Prod
Jan 2025: Chrome Extension prototype
May 2025: Announced at GML (Google Marketing Live)
July 2025: Alpha launch
Evolution: Extension → VM Control
Details
From Chrome Extension prototype to production in 6 months. Built initial prototype as Chrome Extension, announced at Google Marketing Live, went into Alpha in July. Technical evolution: moved from Chrome Extension architecture to VM-based computer control for more robust and secure operation.
Leadership
Managing Underperformance
Amazon: 6% unregretted attrition target
Mechanized talent reviews into manager culture
Shifted from outcomes to behaviors
Reviews feel natural, not artificial
Details
Amazon context: mechanized talent reviews to make performance focus part of manager culture—nothing artificial, just consistent accountability. Evolution of thinking: used to judge JUST on outcomes. Now focus on inputs/behaviors—more actionable for the person to improve. Specific example (Peymon): came from accredited org, signs were there but took time to act. Came down to judging behaviors.
Leadership
Developing Directors
Muhammad Yahia: Promoted to L7, key leader
Rajat Dewan: Director promotion in progress
Philosophy: Coaching over prescription
Push strengths, minimize bad style parts
Details
Evolution: used to do "here is how I do it" → now far more personalized. Top talent has diverse styles—that is OK. Push into their strengths, minimize the bad style parts. Do not force your style on them. Key insight: coaching and empowerment over prescription.
Coaching
The Four Stages Framework
Stage 1: "Watch me do it"
Stage 2: "Help me do it"
Stage 3: "I'll help you do it"
Stage 4: "I'll watch you do it"
Details
Match stage to the person AND the specific skill—same person may be Stage 4 on execution but Stage 2 on exec communication. Stage 1: model the behavior, let them observe. Stage 2: they participate while you lead. Stage 3: they lead, you support. Stage 4: full ownership, you observe and provide retrospective coaching.
Coaching
Coaching Through Resistance
Jyotsna: skeptical of bold moves
Her team tried multiple agent techniques
Added Jason to build parallel prototype
Reintegrated teams with same product name
Details
Context: her team tried multiple agent email techniques over time; she was skeptical of bold moves and preferred small, incremental bites. I pushed for bigger "boulder" moves. Action: added Jason to build prototype independently. Reintegration: folded her team back in, gave space to assess. Outcome: everyone aligned. Kept same product name—feels like evolution, not replacement. Coaching insight: sometimes show, do not tell.
Coaching
Good Person, Wrong Role
Steven Pesci: skeptical of LLMs
Physics background → uncomfortable with probabilistic AI
Moved to Foundations team (6-person core systems)
Built MCP layer for F1 database—crushed it
Details
Steven came to me saying he was skeptical of LLMs. His physics background made him uncomfortable with the probabilistic nature of building LLM applications. Rather than losing a talented engineer, I asked if he wanted to work on our Foundations team—a small 6-person team managing core systems including the central F1 database. There was a new project to expose an MCP layer for this database in a secure way with replayability. Steven crushed it. Irony: the lower-stakes, more structured environment actually helped him get comfortable with vibe coding over time. Key lesson: he wasn't failing—he was mismatched. Listen to people's concerns, find the right fit for their strengths.

Systems Knowledge

20 cards
Inference Phases
Prefill vs Decode
Prefill: Compute-bound, all tokens parallel
Decode: Memory-bandwidth-bound, 1 token at a time
Prefill generates KV cache
Most optimization targets decode phase
Details
Prefill (prompt processing) processes all input tokens in parallel, generating the KV cache. Decode (generation) is sequential—generates one token at a time, reading entire KV cache per token. Decode is the bottleneck because it is memory-bandwidth-bound: you read all weights just to produce 1 token.
Memory
Memory Hierarchy
SRAM: ~50MB, 100+ TB/s (KV cache hot path)
HBM: 80-192GB, 2-5 TB/s (model weights)
NVLink: 900 GB/s (multi-GPU, 8 GPUs/node)
InfiniBand: 400 Gb/s (multi-node, low latency)
Ethernet: 400 Gb/s (multi-node, higher latency)
Details
MEMORY HIERARCHY (fast/small → slow/large):
1. SRAM (~50MB per GPU, 100+ TB/s): On-chip L2 cache + SM shared memory. FlashAttention tiles the KV cache here. Goal: keep hot data in SRAM.
2. HBM (80-192GB, 2-5 TB/s): GPU main memory. Stores model weights and KV cache overflow. Memory bandwidth is THE bottleneck for decode.
3. NVLink (900 GB/s per GPU): GPU-to-GPU within a node. Enables tensor parallelism across 8 GPUs. All-reduce for attention/FFN splits.
4. InfiniBand vs Ethernet (both 400 Gb/s):
• InfiniBand: RDMA, kernel bypass, ~1μs latency. Built for HPC. Expensive.
• Ethernet: Standard networking, ~10-50μs latency. Cheaper, more flexible. RoCE bridges the gap.
Rule: InfiniBand for training (latency-sensitive all-reduce); Ethernet+RoCE is often fine for inference.
KV Cache
KV Cache Memory Formula
2 × layers × kv_heads × head_dim × seq_len × bytes
Llama-70B with 8K context: ~2.6GB per sequence
Often exceeds model weights for long contexts
This is why PagedAttention matters
Details
KV cache stores Key and Value projections for all previous tokens to avoid recomputation. For large models with long contexts, KV cache memory can exceed model weight memory. This is the fundamental memory challenge of LLM serving.
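A quick way to sanity-check the formula above in Python. The Llama-70B-style numbers (80 layers, 8 GQA KV heads, head_dim 128, FP16) are assumptions that reproduce the ~2.6GB figure.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # 2x accounts for separate Key and Value tensors at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama-70B-style config with GQA, FP16, 8K context:
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 1e9:.2f} GB per sequence")  # ~2.68 GB -- the ~2.6GB above
```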
Batching
Continuous Batching
Static batching: wait for longest sequence
Continuous: insert new requests as others finish
Iteration-level scheduling, not request-level
Used by vLLM, TensorRT-LLM, all modern engines
Details
Static batching wastes GPU cycles waiting for the longest sequence. Continuous batching schedules at the iteration level—as soon as one request finishes a token, the slot can be used for a new request. Eliminates head-of-line blocking.
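A toy scheduler loop showing what iteration-level scheduling means. Seq is a stand-in for a real sequence with a KV cache; this is not vLLM's actual scheduler.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    remaining: int                  # tokens left to generate
    def decode_one_token(self):
        self.remaining -= 1         # stand-in for one model forward step
    def finished(self) -> bool:
        return self.remaining <= 0

def step(active: list, waiting: deque, max_batch: int):
    """One iteration: decode 1 token per active sequence, evict finished
    sequences, then backfill open slots from the waiting queue."""
    for seq in list(active):
        seq.decode_one_token()
        if seq.finished():
            active.remove(seq)              # slot opens immediately
    while waiting and len(active) < max_batch:
        active.append(waiting.popleft())    # no head-of-line blocking

waiting = deque(Seq(n) for n in (1, 5, 3, 8))
active: list = []
iters = 0
while waiting or active:
    step(active, waiting, max_batch=2)
    iters += 1
print(f"drained in {iters} iterations")  # short sequences never wait on long ones
```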
Attention
FlashAttention
Standard attention: O(N²) memory in HBM
FlashAttention: tiles Q,K,V into SRAM blocks
Online softmax with correction
No O(N²) memory, 2-4x speedup
Details
Standard attention materializes the full N×N attention matrix in HBM. FlashAttention tiles the computation into blocks that fit in SRAM, computes partial softmax with online correction. Result: linear memory, 2-4x speedup. Now the default in all inference engines.
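The "online softmax with correction" idea in a few lines of numpy: process scores block by block, keeping only a running max and normalizer. Real FlashAttention also rescales a running output accumulator; this sketch shows just the softmax bookkeeping.

```python
import numpy as np

def online_softmax(scores: np.ndarray, block: int = 4) -> np.ndarray:
    m, s = -np.inf, 0.0                 # running max, running normalizer
    for i in range(0, len(scores), block):
        chunk = scores[i:i + block]
        m_new = max(m, chunk.max())
        # Rescale the old sum when a new, larger max appears.
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / s       # final normalization

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```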
Memory Management
PagedAttention
Problem: variable seq lengths cause fragmentation
Solution: allocate KV cache in fixed blocks
Like OS virtual memory pages
Near-zero fragmentation, enables sharing
Details
Variable sequence lengths cause memory fragmentation in naive allocation. PagedAttention allocates KV cache in fixed-size blocks with a block table for indirection. Achieves near-zero fragmentation and enables memory sharing for beam search and prefix caching. Core innovation behind vLLM.
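A minimal block-table allocator in the spirit of PagedAttention. The block_size=16 and dict-based tables are illustrative choices, not vLLM's implementation.

```python
class BlockAllocator:
    """KV cache carved into fixed-size physical blocks; each sequence keeps
    a block table (logical -> physical), like OS virtual memory pages."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables: dict = {}                # seq_id -> list of block ids
        self.lengths: dict = {}               # seq_id -> tokens stored

    def append_token(self, seq_id: str):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str):
        self.free.extend(self.tables.pop(seq_id, []))  # instantly reusable
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=64)
for _ in range(40):
    alloc.append_token("req-1")
print(alloc.tables["req-1"])   # ceil(40/16) = 3 blocks, no fragmentation
alloc.release("req-1")
```

Prefix caching falls out of the indirection: two sequences sharing a system prompt can point their tables at the same physical blocks.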
Quantization
Quantization Strategies
Weight-only: INT4/8 weights, FP16 activations
Full: FP8/INT8 for weights AND activations
Weight-only: reduces memory, speeds memory-bound decode
Full: faster compute on Tensor Cores
Details
Weight-only quantization (AWQ, GPTQ) reduces memory and speeds up memory-bound decode phase. Full quantization (FP8 on Hopper/Blackwell) enables faster compute. Trade-off: accuracy vs throughput. FP8 preferred over INT8 because no calibration needed.
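A toy symmetric per-row INT8 round trip to make "weight-only" concrete. Real AWQ/GPTQ use calibration data and smarter scaling; this shows only the mechanics.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row INT8 quantization: one FP scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Done on the fly during decode: 2-4x fewer bytes read from HBM,
    # which is exactly what a memory-bound phase needs.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", float(np.abs(dequantize(q, s) - w).max()))
```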
Quantization
Quantization Methods
EXL2: Fastest on NVIDIA, decimal bit-rates
AWQ: High-performance, protects key weights
K-Quants (GGUF): CPU/GPU split, AMD/Apple friendly
GPTQ: Old reliable, universal compatibility
HQQ: Fast to create, no calibration needed
Details
EXL2 (ExLlamaV2): NVIDIA only, fastest inference, decimal bit-rates (e.g., 4.65-bit).
AWQ: NVIDIA + newer AMD. High-performance, preserves "intelligence" by protecting important weights.
K-Quants (GGUF): AMD, Apple Silicon, CPU. Flexible GPU/RAM split for massive models.
GPTQ: Universal, "old reliable." Widely supported, stable.
HQQ: Universal. Quantize in minutes without a calibration dataset.
Parallelism
Tensor Parallelism (TP)
Split attention heads + FFN across GPUs
Each GPU holds 1/N of weights
All-reduce to combine partial results
Low latency, needs fast interconnect (NVLink)
Details
Tensor parallelism splits within layers—attention heads and FFN columns distributed across GPUs. Requires all-reduce communication after each layer. Low latency for single requests but needs high-bandwidth interconnect (NVLink). Use for latency-sensitive serving, typically 2-8 GPUs.
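Tensor parallelism simulated in numpy: a row-parallel matmul where each "GPU" holds one shard and the partial outputs are summed, standing in for the all-reduce. Shapes are arbitrary.

```python
import numpy as np

x = np.random.randn(2, 64)          # activations (batch=2, hidden=64)
W = np.random.randn(64, 256)        # full weight, kept only for the check

x_shards = np.split(x, 4, axis=1)   # each rank gets a slice of the hidden dim
w_shards = np.split(W, 4, axis=0)   # ...and the matching rows of W

partials = [xi @ wi for xi, wi in zip(x_shards, w_shards)]  # local compute
y = sum(partials)                   # the all-reduce step on a real cluster
assert np.allclose(y, x @ W)        # same result as the unsharded matmul
```

The all-reduce after every layer is why the card calls for NVLink-class interconnect: it sits on the critical path of each decode step.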
Parallelism
Pipeline Parallelism (PP)
Different GPUs hold different layers
Data flows through pipeline
Micro-batching hides bubble overhead
Lower communication than TP, higher latency
Details
Pipeline parallelism splits across layers—GPU 1 holds layers 1-20, GPU 2 holds layers 21-40, etc. Lower communication requirements than TP but higher latency due to pipeline bubbles. Use for training or when model does not fit with TP alone. Combine with TP for very large models.
Optimization
Speculative Decoding
Draft model generates K candidates cheaply
Target model verifies in single parallel pass
Accept up to first mismatch
2-3x latency improvement, no quality loss
Details
Small draft model generates K candidate tokens quickly. Large target model verifies all K in a single forward pass (parallel!). If correct, accept all K; if wrong, accept up to first mismatch. Works because verification is parallel while generation is sequential. 2-3x speedup for latency.
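A greedy-decoding sketch of the draft/verify loop. The lambda "models" are toys, the verify loop is sequential in Python but one batched forward pass on real hardware, and production implementations use rejection sampling to preserve the target distribution exactly.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Draft k tokens cheaply, then accept the longest prefix the target
    model agrees with, plus one corrected token at the first mismatch."""
    ctx, candidates = list(prefix), []
    for _ in range(k):                     # cheap sequential drafting
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)

    ctx, accepted = list(prefix), []
    for t in candidates:                   # one parallel pass on the GPU
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)          # take the target's token and stop
            break
    return accepted                        # up to k tokens per target pass

target = lambda ctx: (len(ctx) * 7) % 5                         # toy "model"
draft = lambda ctx: (len(ctx) * 7) % 5 if len(ctx) % 4 else 0   # sometimes wrong
print(speculative_step(draft, target, prefix=[1, 2]))           # partial accept
```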
Architecture
Disaggregated Serving
Prefill = compute-bound workload
Decode = memory-bound workload
Why run both on same hardware?
Separate prefill nodes from decode nodes
Details
Prefill is compute-bound, decode is memory-bandwidth-bound—different resource profiles. Disaggregated architecture uses prefill nodes with high compute and decode nodes optimized for memory bandwidth. Transfer KV cache between them. Emerging pattern: Mooncake, DistServe.
Fundamentals
Arithmetic Intensity
AI = FLOPs / Bytes moved
Compare to hardware ops:byte ratio (~500 for H100)
If AI < ratio → memory-bound
Decode AI ≈ 1-2 → always memory-bound
Details
Arithmetic intensity is FLOPs per byte moved. H100 has ~500 ops:byte ratio for FP16. Decode phase has AI of 1-2 (one token output, read all weights), so it is always memory-bound. This explains why batching helps: amortize weight loading across multiple sequences.
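The decode-is-memory-bound claim as arithmetic, using the card's ~500 ops:byte ballpark. The 2 FLOPs/param/token rule of thumb is an assumption.

```python
params = 70e9                      # 70B-parameter model
flops_per_token = 2 * params       # ~2 FLOPs per parameter per generated token
bytes_per_token = params * 2       # every FP16 weight read once per decode step

ai = flops_per_token / bytes_per_token   # = 2.0 at batch size 1
hw_ratio = 500                           # H100-class ops:byte (card's ballpark)
print(f"AI = {ai:.0f} vs ~{hw_ratio} ops:byte ->",
      "memory-bound" if ai < hw_ratio else "compute-bound")
# Batching B sequences reads the weights once for B tokens, so AI scales ~B:
# you'd need B in the hundreds before decode turns compute-bound.
```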
Hardware
GPU Comparison
H100: 80GB HBM3, 3.35 TB/s, NVLink 900 GB/s
H200: 141GB HBM3e, 4.89 TB/s
MI300X: 192GB HBM3, 5.3 TB/s (best memory)
Blackwell B200: FP4 support, 2x H100
Details
NVIDIA dominates with software ecosystem (CUDA, TensorRT). AMD MI300X has superior memory specs but weaker software. Blackwell introduces FP4 support for even more aggressive quantization. Know the specs, but software maturity often matters more.
Frameworks
vLLM vs TensorRT-LLM
vLLM: PagedAttention, Python, flexible
TensorRT-LLM: NVIDIA optimized, compiled graphs
SGLang: RadixAttention for prefix sharing
Choice: flexibility vs raw performance
Details
vLLM: great for research/startups, multi-hardware support, PagedAttention innovation. TensorRT-LLM: best raw performance on NVIDIA, compiled computation graphs, production-optimized. SGLang: RadixAttention for efficient prefix caching. Choice depends on flexibility vs performance, NVIDIA-only vs multi-hardware.
Training
Training Parallelism
Data Parallel (DP/FSDP): replicate model, split data
Tensor Parallel: split ops within layers
Pipeline Parallel: split layers across GPUs
Expert Parallel: MoE routing
Details
Modern training uses 3D/4D parallelism combining DP, TP, PP, and Expert Parallel. FSDP shards optimizer states across DP ranks for memory efficiency. Key insight: different parallelism strategies for different model sizes and hardware configurations.
Training
Checkpointing & Fault Tolerance
Checkpoint every N steps (model + optimizer state)
Activation checkpointing: recompute vs store
MTBF drops at scale
Elastic training: survive node failures gracefully
Details
CHECKPOINTING:
• Full checkpoint: weights + optimizer states + LR scheduler. Can be 3x model size (Adam stores m, v per param).
• Frequency trade-off: more checkpoints = more I/O overhead.
ACTIVATION CHECKPOINTING:
• Trade compute for memory. Recompute during the backward pass.
• Reduces memory ~50%, increases compute ~30%.
FAULT TOLERANCE:
• 1000+ GPU clusters have node failures daily.
• Elastic training frameworks (DeepSpeed, PyTorch Elastic) can recover without a full restart.
(A minimal checkpoint save/load sketch follows this card.)
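A minimal single-process PyTorch checkpoint round trip. Distributed jobs shard this (e.g., one file per rank), but the state_dict contents are the same idea; the file name and tiny model are placeholders.

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    # Weights + optimizer state; Adam's m/v tensors make this ~3x model size.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"]                 # resume from here after a node failure

model = torch.nn.Linear(32, 1)
opt = torch.optim.Adam(model.parameters())
save_checkpoint("ckpt.pt", model, opt, step=1000)
print(load_checkpoint("ckpt.pt", model, opt))  # -> 1000
```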
Training
Mixed Precision Training
FP32 master weights, FP16/BF16 for compute
Loss scaling to prevent gradient underflow
BF16 preferred: same range as FP32
2x memory savings, faster tensor core ops
Details
MIXED PRECISION:
• Store master weights in FP32 for stability.
• Compute forward/backward in FP16 or BF16 for speed.
• Accumulate gradients in FP32.
FP16 vs BF16:
• FP16: 5-bit exponent, 10-bit mantissa. Narrow range, needs loss scaling.
• BF16: 8-bit exponent, 7-bit mantissa. Same range as FP32, no loss scaling needed.
LOSS SCALING:
• FP16 gradients can underflow (become zero).
• Multiply the loss by a scale factor, divide gradients by it after backward.
• Dynamic loss scaling adjusts the factor automatically (see sketch below).
BENEFIT: 2x memory savings, 2-3x faster on tensor cores.
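The dynamic loss-scaling logic in isolation. The constants mirror common defaults, and torch.cuda.amp.GradScaler behaves similarly internally, but this is a sketch, not that API.

```python
class DynamicLossScaler:
    """Scale the loss so FP16 gradients don't underflow; back off on
    overflow, grow again after a long run of clean steps."""
    def __init__(self, scale=2.0**16, growth=2.0, backoff=0.5, interval=2000):
        self.scale = scale
        self.growth, self.backoff, self.interval = growth, backoff, interval
        self.good_steps = 0

    def update(self, found_inf: bool) -> bool:
        """Call after unscaling grads; returns whether to apply this step."""
        if found_inf:
            self.scale *= self.backoff   # overflow: shrink scale, skip step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.interval == 0:
            self.scale *= self.growth    # long clean run: try a larger scale
        return True

scaler = DynamicLossScaler()
# Training loop shape: backward on (loss * scaler.scale), divide grads by
# scaler.scale, then: if scaler.update(found_inf): optimizer.step()
```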
Training
Gradient Accumulation
Simulate larger batch sizes without more memory
Accumulate gradients over K micro-batches
Only update weights after K steps
Trade-off: same compute, longer wall-clock time
Details
WHY: Large batch sizes improve training stability but require more memory.
HOW:
• Forward + backward on micro-batch 1, store gradients
• Forward + backward on micro-batch 2, accumulate gradients
• ...repeat K times
• Apply optimizer step with accumulated gradients
EFFECT:
• Effective batch size = micro_batch × K × num_GPUs
• Memory = memory for 1 micro-batch
• Time = K× slower per optimizer step
USE WHEN:
• GPU memory limits batch size
• Need a large effective batch for stability (common in LLM training)
(A minimal PyTorch accumulation loop follows this card.)
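The recipe above as a runnable PyTorch loop. The tiny Linear model, random data, and K=4 are placeholders; the accumulation pattern is the point.

```python
import torch

model = torch.nn.Linear(32, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
K = 4                                     # micro-batches per optimizer step

opt.zero_grad()
for step in range(16):
    x, y = torch.randn(8, 32), torch.randn(8, 1)       # micro-batch of 8
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / K).backward()                 # grads accumulate in .grad; dividing
                                          # by K turns the sum into an average
    if (step + 1) % K == 0:               # effective batch = 8 * K = 32
        opt.step()
        opt.zero_grad()
```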
Hardware
TPU vs GPU
TPU: Google custom ASIC, optimized for matrix ops
GPU: General-purpose, NVIDIA ecosystem dominates
TPU: Cheaper at scale, but Google Cloud only
GPU: CUDA ecosystem, more flexible
Details
GOOGLE TPU:
• Custom ASIC for neural network workloads
• Systolic array architecture, optimized for matrix multiply
• ICI (Inter-Chip Interconnect): 4800 Gbps
• Cheaper per FLOP at Google scale
• Limitation: Google Cloud only; JAX/TensorFlow preferred
NVIDIA GPU:
• General-purpose, excels at parallel compute
• CUDA ecosystem: mature, broad library support
• NVLink for multi-GPU, InfiniBand for multi-node
• Works everywhere: cloud, on-prem, consumer
KEY DIFFERENCES:
• TPU: better price/perf at Google scale, less flexible
• GPU: ecosystem dominance, runs anywhere
• TPU pods scale better for large clusters
• GPU wins for inference diversity; TPU wins for large-scale Google training