When the Epstein files dropped, a team of indie hackers built JMail: a Gmail clone that looks as if you were logged in as Jeffrey Epstein himself. It went viral, but the parsing problem underneath it was brutal. Court documents are some of the nastiest inputs a parser can face: scanned exhibits with varying resolution, redactions sitting directly over key text, inconsistent formatting across decades of filings, handwritten annotations mixed into typed pages, and documents photocopied from a photocopy of a photocopy.

Legal is just one flavor of hard. In finance, you're dealing with tables nested inside tables, footnotes that span pages, and numbers that mean different things depending on which section of the filing you're in. In healthcare, it's mixed handwritten and typed content, inconsistent date formats, and forms that were designed in 1987 and never updated. In government records, it's degraded scans, stamps overlapping text, and corpora where a key field is missing from half the documents. Every industry's documents break parsers in their own specific ways.

This session walks through the failure modes we've actually hit across these corpora, what causes them, and how to build pipelines that hold up when the documents stop cooperating.
Expo Stage 2 sessions at AI Engineer World's Fair 2026 in San Francisco.
Thursday, July 2, 2026
2:25 PM - 2:45 PM · 20 min
Expo Stage 2
Capacity: 250 attendees
Speaker: To be announced.