Inference

Operating Distributed Inference Systems at Scale

Talk · Intermediate

Inference has rapidly become one of the most important infrastructure problems in modern computing. As AI systems evolve into autonomous agents with persistent memory, tool usage, and multi-step reasoning, traditional inference architectures struggle under growing demands for latency, throughput, cost efficiency, and reliability. In this talk, I’ll share lessons from building large-scale elastic compute and AI infrastructure systems powering production workloads. We’ll explore the modern inference stack and the architectural patterns emerging to support next-generation agentic AI systems.

Topics include:

- Distributed inference architectures for large-scale AI systems
- GPU scheduling and elastic compute for inference workloads
- Multi-tenant inference infrastructure
- Caching, batching, and latency optimization strategies
- Reliability and fault isolation for inference systems
- Observability and control loops for AI serving platforms
- Balancing cost, throughput, and user experience
- Why inference is becoming an infrastructure orchestration problem

Attendees will gain practical insights into designing scalable, resilient, and cost-efficient inference platforms for modern AI workloads.

Speakers: Nishant Gupta (Meta); Naman Ahuja (Meta).

About the Inference Track

Inference sessions at AI Engineer World's Fair 2026 in San Francisco.

When

Thursday, July 2, 2026

10:45 AM - 11:05 AM · 20m

Where

Track 9 · Room 2016

Capacity: 250 attendees

Speakers (2)

Nishant Gupta

Tech Lead, Software Engineering @ Meta SuperIntelligence Lab (MSL) • AI Infrastructure • Distributed Systems • Researcher • Speaker • Startup Advisor

Meta

# Introduction

I am a Staff Software Engineer and Researcher at Meta, specializing in large-scale distributed systems and applied AI. I am passionate about building reliable, scalable, and intelligent infrastructure that powers the next generation of agentic workflows. With deep expertise spanning large-scale distributed systems, agentic infrastructure, systems architecture, and operational resilience, I focus on solving the hardest problems at the intersection of systems, AI, and real-world execution, where theory meets engineering tradeoffs.

Within Meta SuperIntelligence Lab (MSL), I have contributed to building agentic infrastructure: systems where AI agents operate within structured distributed environments, interacting with monitoring, scheduling, and feedback loops. My work in this space focuses on:

- Evaluation and auditing of AI-driven decision systems in high-stakes production environments
- Reliability, safety, and human oversight in autonomous and semi-autonomous systems
- Designing feedback mechanisms to align system behavior with user and operational goals
- Measuring real-world impact beyond offline metrics

# Building Elastic Compute Infrastructure at Meta

I also built the next generation of elastic compute infrastructure to increase overall fleet utilization. It is responsible for managing ~30% of Meta’s capacity (tens of millions of servers) across ~20 geo-distributed datacenters, saving billions of dollars in CapEx. This also involved partnering with VPs across Ads/WhatsApp/IG/Finance/Infra to set a multi-year roadmap and strategy for increasing fleet-wide efficiency.

# Research

At Meta, my recent research includes Dynamic Idle Resource Leasing to Safely Oversubscribe Capacity at Scale, where I designed and deployed a production system that improves datacenter utilization by leasing idle capacity while preserving reliability and strict SLO guarantees. This work required building rigorous evaluation frameworks spanning simulation, controlled experimentation, and real-world safety validation, balancing algorithmic optimization with operational risk. The system has delivered measurable infrastructure-efficiency gains at production scale. I have also authored papers with 90+ citations.

# What I care about

I think deeply about how distributed services communicate, self-coordinate, and act with reliability under ambiguity. My work is rooted in understanding latency, correctness, failure modes, and semantic interoperability: not just performance on benchmarks, but real-world outcomes that matter in production. I’ve led teams and initiatives that:

- Architect complex distributed platforms that serve high-availability workloads at scale
- Design agentic systems and frameworks that enable coordinated autonomous behavior across services and models
- Build operationally robust infrastructure with strong observability, fault tolerance, and graceful degradation
- Translate cutting-edge research into developer-ready systems and patterns

# Education

I graduated from University of California, Los Angeles (UCLA) with a Master's in Computer Science in December 2019. At UCLA, my focus area was building scalable distributed systems leveraging Machine Learning.

# Ways to collaborate

• Keynotes, conference talks, and technical workshops
• Partnerships with AI platforms, developer tools, and education organizations
• Advisory and consulting on AI infrastructure and large-scale systems

For speaking, partnerships, or advisory inquiries: nishantgupta@g.ucla.edu
