AI Engineer World's Fair 2026

How Reducto parsed the Epstein Files for the Viral JMail Project: The Secret Complexities of Documents

Talk · Intermediate

When the Epstein files dropped, a team of indie hackers built JMail: a duplicate of Gmail that was logged in as Jeffrey himself. It went viral. But the parsing problem underneath it was brutal. Court documents are some of the nastiest inputs a parser can face. Scanned exhibits with varying resolution, redactions sitting directly over key text, inconsistent formatting across decades of filings, handwritten annotations mixed into typed pages, documents photocopied from a photocopy of a photocopy. But legal is just one flavor of hard. In finance, you're dealing with tables nested inside tables, footnotes that span pages, and numbers that mean different things depending on which section of the filing you're in. In healthcare, it's mixed handwritten and typed content, inconsistent date formats, and forms that were designed in 1987 and never updated. In government records, it's degraded scans, stamps overlapping text, and documents where a key field is missing on half the corpus. Every industry has its own specific ways documents break parsers. This session walks through the failure modes we've actually hit across these corpora, what causes them, and how to build pipelines that hold up when the documents stop cooperating.

About the Expo Stage 2 Track

Expo Stage 2 sessions at AI Engineer World's Fair 2026 in San Francisco.
