AI Engineer World's Fair 2026

Loophole - Adversarial Agents To Stress Test Your Morality

Talk · Intermediate

Most natural language specifications have holes their authors didn't notice, and writing more rules tends to create more holes. I built Loophole to try a different approach: point adversarial agents at a spec until it stops breaking.

You give the system a set of natural language principles, and an AI drafts a formal, codified version. Two adversarial agents then go to work: one finds cases the code permits but the principles forbid, the other finds cases the code forbids but the principles allow. A judge agent patches the code when it can, but only if the fix doesn't contradict any prior ruling. When a contradiction can't be resolved, it escalates to you. Every decision becomes binding precedent, so the constraint space tightens round after round.

I started with moral and legal reasoning as the demo, and on its own that's already interesting: it turns into a kind of game where you discover contradictions in your own beliefs that you didn't know were there. But the pattern generalizes well past that. The same loop works for company policies that need to survive contact with edge cases, for making chatbot system prompts adversarially robust, and for stress-testing eval rubrics. And, taking the long view, it applies to something like a smarter legislative process, where proposed laws get checked against the public's stated values before they pass, so the contradictions surface before they hit a courtroom.

The talk walks through how the harness works, the design choices that matter (especially why precedent is the load-bearing piece), what kinds of specs it handles well, where it breaks, and what it would take to push it further. All code is open source.
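
A minimal Python sketch of the precedent-constrained loop described above, assuming the codified spec is a yes/no rule over a dict of facts. Everything here is hypothetical and not taken from the Loophole codebase: the names Harness, Ruling, find_loophole and adjudicate, the fixed list of candidate patches standing in for the judge, and the consent/minor/emergency toy domain. The LLM agents are replaced by plain functions, and the principles by an explicit oracle, only so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Case = dict                     # facts describing one scenario (hypothetical encoding)
Rule = Callable[[Case], bool]   # a codified spec: True means "permitted"


@dataclass
class Ruling:
    case: Case
    permitted: bool


@dataclass
class Harness:
    rule: Rule
    precedents: list = field(default_factory=list)  # binding Rulings, in order
    log: list = field(default_factory=list)

    def respects_precedent(self, candidate: Rule, new: Ruling) -> bool:
        # A patch is admissible only if it reproduces every prior ruling and the new one.
        return all(candidate(r.case) == r.permitted for r in [*self.precedents, new])

    def adjudicate(self, new: Ruling, candidates: list,
                   escalate: Callable[[Ruling], bool]) -> None:
        # Stand-in for the judge: take the first candidate patch that fits all precedent.
        for candidate in candidates:
            if self.respects_precedent(candidate, new):
                self.rule = candidate
                self.precedents.append(new)
                self.log.append("patched")
                return
        # No admissible patch: a human decides, and that decision is binding too.
        # (The rule is left unchanged here; later patches must respect this ruling.)
        self.precedents.append(Ruling(new.case, escalate(new)))
        self.log.append("escalated")


def find_loophole(rule: Rule, principles: Rule, pool: list, want: bool) -> Optional[Ruling]:
    # Stand-in for an adversarial agent. want=False hunts for cases the code permits
    # but the principles forbid; want=True hunts for cases the code forbids but the
    # principles allow.
    for case in pool:
        if principles(case) == want and rule(case) != want:
            return Ruling(case, want)
    return None


def principles(c: Case) -> bool:
    # Hypothetical oracle for the natural language principles (toy example only).
    return c["emergency"] or (c["consent"] and not c["minor"])


pool = [
    {"consent": True, "minor": True, "emergency": False},
    {"consent": False, "minor": False, "emergency": True},
]
candidates = [
    lambda c: c["consent"] and not c["minor"],
    lambda c: c["emergency"] or (c["consent"] and not c["minor"]),
]

h = Harness(rule=lambda c: c["consent"])    # the first draft of the codified spec
for want in (False, True):                  # permissive attacker, then restrictive attacker
    found = find_loophole(h.rule, principles, pool, want)
    if found is not None:
        h.adjudicate(found, candidates, escalate=lambda r: r.permitted)

# A ruling that contradicts existing precedent cannot be patched, so it escalates.
h.adjudicate(Ruling(pool[0], True), candidates, escalate=lambda r: False)
print(h.log)  # ['patched', 'patched', 'escalated']
```

The point the abstract calls load-bearing shows up in respects_precedent: each patch is checked against every earlier ruling, not just the case that triggered it, which is what makes the constraint space tighten round after round.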

About the Harness Engineering Track

Harness Engineering sessions at AI Engineer World's Fair 2026 in San Francisco.