We ran 1,000 automated tasks through a production code interpreter sandbox — file I/O, package installs, data analysis, ML training, binary downloads, multi-language execution — and tracked every failure. 88% passed. The other 12% revealed 18 distinct failure modes that no unit test would catch: binary encoding corruption in the transport layer, null bytes silently truncating file downloads, pip blocked by network isolation with no useful error, and path traversal inputs accepted without validation. This talk walks through the experiment design, the findings ranked by severity, and what we changed. If you are building or operating sandboxed execution for AI agents, these are the bugs waiting for your customers to find first.
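As a rough illustration of three of the failure modes named above (path traversal, binary corruption in transport, and null-byte truncation), here is a minimal Python sketch of the kind of check a sandbox test harness might run. It is not taken from the talk. Every function name here is hypothetical, and the use of base64 as the transport encoding is an assumption made for the example.

```python
import base64
import os


def resolve_inside(base_dir: str, user_path: str) -> str:
    """Return the absolute path for user_path, or raise if it escapes base_dir.

    Hypothetical helper: resolves symlinks and ".." segments first, then
    checks that the result still sits under the sandbox root.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"path escapes sandbox root: {user_path!r}")
    return target


def encode_for_transport(payload: bytes) -> str:
    """Encode raw bytes as ASCII text so a text-only channel cannot mangle them."""
    return base64.b64encode(payload).decode("ascii")


def decode_from_transport(text: str) -> bytes:
    """Reverse encode_for_transport, rejecting characters outside the alphabet."""
    return base64.b64decode(text.encode("ascii"), validate=True)


def c_string_truncate(payload: bytes) -> bytes:
    """Simulate a layer that treats data as a NUL-terminated string."""
    return payload.split(b"\x00", 1)[0]
```

A harness could feed each helper a payload containing null bytes and non-UTF-8 bytes, then compare the downloaded bytes against the original byte for byte. A length or hash mismatch would flag the silent truncation that a check on status codes alone would miss.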
Sandbox & Platform Engineering sessions at AI Engineer World's Fair 2026 in San Francisco.
Wednesday, July 1, 2026
2:25 PM - 2:45 PM · 20 min
Track 1 · Room 2010
Capacity: 250 attendees

Kevin Orellana
Software Engineer
Amazon Web Services
@KevssOrellana
Software engineer at Amazon Web Services focused on runtime tools that let AI agents execute code.