AI Engineer World's Fair 2026

1,000 Agent Tasks in a Sandbox: What Breaks When LLMs Write and Run Code

Talk · Intermediate

We ran 1,000 automated tasks through a production code interpreter sandbox — file I/O, package installs, data analysis, ML training, binary downloads, multi-language execution — and tracked every failure. 88% passed. The other 12% revealed 18 distinct failure modes that no unit test would catch: binary encoding corruption in the transport layer, null bytes silently truncating file downloads, pip blocked by network isolation with no useful error, and path traversal inputs accepted without validation. This talk walks through the experiment design, the findings ranked by severity, and what we changed. If you are building or operating sandboxed execution for AI agents, these are the bugs waiting for your customers to find first.
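Two of the failure modes named above (binary payloads corrupted or truncated in transit, and path traversal inputs accepted without validation) can be illustrated with a minimal sketch. This is not the speakers' implementation. The envelope format and the function names `encode_for_transport`, `decode_from_transport`, and `is_safe_relative_path` are assumptions introduced here for illustration only.

```python
import base64
import hashlib
from pathlib import PurePosixPath


def encode_for_transport(payload: bytes) -> dict:
    """Wrap raw bytes for a text-only transport (hypothetical envelope format)."""
    return {
        "data_b64": base64.b64encode(payload).decode("ascii"),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size": len(payload),
    }


def decode_from_transport(envelope: dict) -> bytes:
    """Decode the envelope and fail loudly on truncation or corruption."""
    payload = base64.b64decode(envelope["data_b64"], validate=True)
    if len(payload) != envelope["size"]:
        raise ValueError("size mismatch: payload was truncated or padded")
    if hashlib.sha256(payload).hexdigest() != envelope["sha256"]:
        raise ValueError("checksum mismatch: payload was corrupted")
    return payload


def is_safe_relative_path(user_path: str) -> bool:
    """Reject empty, absolute, null-byte, and parent-directory paths."""
    if not user_path or "\x00" in user_path:
        return False
    path = PurePosixPath(user_path)
    if path.is_absolute():
        return False
    # PurePosixPath does not collapse "..", so it stays visible in parts.
    return ".." not in path.parts
```

Encoding bytes as base64 keeps null bytes and non-UTF-8 sequences out of the text channel, and the size and checksum fields turn a silent truncation into an explicit error. The path check is deliberately strict; a real sandbox would likely also resolve the path against its working directory and handle symlinks.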

About the Sandbox & Platform Engineering Track

Sandbox & Platform Engineering sessions at AI Engineer World's Fair 2026 in San Francisco.