Salman Munaf

Lead Site Reliability Engineer

TikTok

Salman Munaf is a Lead Site Reliability Engineer at TikTok, where he builds and operates large-scale video infrastructure serving millions of users. He specializes in distributed systems, observability, and reliability at scale, with prior experience as a Software Engineer at Meta. Salman is passionate about helping developers embed reliability into their workflows from day one, making complex systems more resilient and easier to operate.

Sessions (1)

AI Agents Are Just Distributed Systems Now

2:50 PM · Leadership 1 · Room 3016

AI agents are often described as a new kind of software, but once they move beyond chat and start calling tools, reading data, making decisions, retrying tasks, and coordinating workflows, they begin to look a lot like distributed systems. They have state. They call external services. They depend on APIs. They fail partially. They retry. They time out. They can loop. They can act on stale context. They can produce inconsistent results. And when something goes wrong, teams need logs, traces, permissions, ownership, and rollback paths just like they do with any other production system.

This session will give engineers a practical way to reason about AI agents using familiar distributed systems concepts. We will break down the agent loop: planning, tool use, observation, memory, and retries. Then we will map common agent failure modes to engineering patterns teams already know, including timeouts, circuit breakers, idempotency, rate limits, least privilege, observability, and human approval.

The goal is to move past the hype and treat agents like real production systems. Attendees will leave with a clear mental model for designing, debugging, and operating agents safely, especially as they become part of customer-facing products, internal developer tools, and business workflows.
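As a rough illustration of the mapping described above (not material from the session itself), the sketch below wraps a toy agent loop in three of the named patterns: a step budget so the agent cannot loop forever, retries guarded by a circuit breaker, and an idempotency key so a repeated action is not executed twice. All names here (`CircuitBreaker`, `ToolRunner`, `run_agent`, the planner and tool callables) are hypothetical, and timeouts, rate limits, and permissions are left out to keep it short.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a tool call is refused because its breaker is open."""


@dataclass
class CircuitBreaker:
    max_failures: int = 3
    reset_after_s: float = 30.0
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            # Half-open: permit a trial call; one more failure reopens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


@dataclass
class ToolRunner:
    tool: Callable[[dict], str]  # hypothetical tool: takes args, returns an observation
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)
    max_attempts: int = 3
    results: Dict[str, str] = field(default_factory=dict)  # idempotency cache

    def call(self, idempotency_key: str, args: dict) -> str:
        # A repeated action with the same key returns the stored result
        # instead of running the side effect a second time.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        for _ in range(self.max_attempts):
            if not self.breaker.allow():
                raise CircuitOpenError("tool temporarily disabled")
            try:
                result = self.tool(args)
            except Exception:
                self.breaker.record(ok=False)
                continue  # retry (a real system would back off here)
            self.breaker.record(ok=True)
            self.results[idempotency_key] = result
            return result
        raise RuntimeError("tool failed after retries")


def run_agent(plan: Callable[[list], Optional[dict]],
              runner: ToolRunner, max_steps: int = 5) -> list:
    """Plan -> tool call -> observe, bounded by a step budget."""
    observations: list = []
    for _ in range(max_steps):
        action = plan(observations)  # planner returns tool args, or None when done
        if action is None:
            return observations
        key = json.dumps(action, sort_keys=True)  # same action, same key
        observations.append(runner.call(key, action))
    raise RuntimeError("step budget exhausted")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_tool(args: dict) -> str:
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise TimeoutError("simulated timeout")
        return f"looked up {args['query']}"

    def planner(observations: list) -> Optional[dict]:
        return None if observations else {"query": "order 42"}

    print(run_agent(planner, ToolRunner(flaky_tool)))
```

In this sketch the planner stands in for the model, and the step budget plays the role a request deadline would play in a conventional service: it turns a potential infinite loop into a bounded, visible failure.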

AI-Native Enterprises · intermediate · talk