What do you do when the data you most need to train and evaluate on is the data you're least allowed to keep? It's a bind for anyone building AI in a high-stakes vertical: the cases that would teach your model the most (the rare, the messy, the sensitive) tend to be the ones wrapped in the tightest constraints. In healthcare the constraint is near-absolute. PHI can't be retained, reused, or transformed, so your long-lived datasets can't contain real patient data at all. Synthetic data is the obvious escape hatch, but it has its own trap: synthetic records tend to look synthetic, and a model that passes on fake-looking data tells you nothing about the real thing. So the bar isn't generating data; it's generating data faithful enough to trust. This talk is about how we got there. Ask an LLM for a full case in one shot and you get something generic and averaged-out: models are worse at inventing convincing, specific detail than you'd expect. We present our synthetic generation pipeline, and the process around it, that enabled us to create golden datasets at scale. The pipeline uses a coarse-to-fine process that enriches a patient's medical history layer by layer, with human-in-the-loop hooks to steer the narrative at each step. You'll leave with ideas on how to build your own synthetic data generation capabilities and how to build a data pipeline your domain experts actually enjoy owning.
AI in Healthcare sessions at AI Engineer World's Fair 2026 in San Francisco.
Thursday, July 2, 2026
3:20 PM - 3:40 PM · 20m
Track 7 · Room 2024
Capacity: 250 attendees

Anuj Iravane
AI Lead
Anterior
Anuj leads AI at Anterior, building production AI agents for high-stakes healthcare workflows. Before Anterior, he worked on recommender systems at Amazon. Beyond AI, he's a producer and director in India's independent film scene.