Wolfram Ravenwolf

AI Evangelist

Weights & Biases by CoreWeave

Wolfram Ravenwolf is an AI Evangelist at CoreWeave / Weights & Biases, where he helps builders evaluate, debug, and ship useful AI systems. He works across model evaluation, agent tooling, inference infrastructure, and developer education, translating hands-on engineering work into practical guidance for teams adopting frontier AI. Wolfram is the creator of WolfBench, a five-metric framework for evaluating agent performance based on Terminal-Bench 2.0, and regularly tests new models, coding agents, and evaluation workflows in real-world conditions. He is also a ThursdAI co-host, speaker, writer, and longtime AI community builder. Before joining CoreWeave/W&B, he worked as an engineer, researcher, and consultant focused on making complex technology usable. His talks are practical, opinionated, and grounded in live experimentation: fewer buzzwords, more working systems.

Sessions (1)

From Zero to Leaderboard: Building an End-to-End AI Agent Evaluation Pipeline

12:10 PM · Track 5 · Room 2005

Running one agent eval is easy. Running hundreds, with controlled timeouts, replicated configs, and automated collection across distributed VMs, requires infrastructure that most teams end up building from scratch. In this workshop, we shortcut that process and build a rigorous evaluation pipeline end-to-end. Participants will set up and connect the full evaluation stack:

**Layer 1: The Benchmark Runner.** Configure Harbor to orchestrate parallel agent evaluations on Terminal-Bench 2.0, with W&B Sandboxes providing isolated environments for each task.

**Layer 2: The Collection Pipeline.** Use WolfBench to scan distributed VMs for results, deduplicate across runs, download trajectories, and build a local results archive that survives VM teardown.

**Layer 3: The Analysis Framework.** Compute the five-metric framework (Ceiling / Best / Average / Worst / Solid) across replicated runs. Learn to read the spread: when is a model "better"? When is a score difference just noise?

**Layer 4: The Observability Layer.** Upload full agent conversation traces to W&B Weave for per-turn inspection. See exactly where an agent goes wrong: the command it ran, the output it misread, the moment it started looping.

**Layer 5: The Leaderboard.** Generate interactive HTML charts that show the full performance distribution, not a single bar.

We'll work with real data from hundreds of production runs, and participants will leave with a working pipeline they can adapt to their own agents and benchmarks. Laptops required; all tools are open-source.
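To make the five metrics named in Layer 3 concrete, here is a minimal Python sketch of how they could be computed from replicated runs. The abstract does not define the metrics, so the definitions below are assumptions for illustration only: Ceiling is the share of tasks solved in at least one run, Solid is the share solved in every run, and Best / Average / Worst are the maximum, mean, and minimum per-run pass rate. The function name `five_metrics` and the input shape are hypothetical and are not taken from WolfBench.

```python
from statistics import mean


def five_metrics(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Summarize replicated runs of one agent on one benchmark.

    `runs` holds one dict per run, mapping task id -> whether it was solved.
    All metric definitions here are assumptions, not the official ones.
    """
    if not runs:
        raise ValueError("need at least one run")

    # Union of task ids across runs; a task missing from a run counts as failed.
    tasks = sorted(set().union(*(run.keys() for run in runs)))
    if not tasks:
        raise ValueError("runs contain no tasks")
    n = len(tasks)

    # Pass rate of each individual run.
    per_run = [sum(run.get(t, False) for t in tasks) / n for run in runs]

    # Assumed: Ceiling = solved at least once, Solid = solved every time.
    ceiling = sum(any(run.get(t, False) for run in runs) for t in tasks) / n
    solid = sum(all(run.get(t, False) for run in runs) for t in tasks) / n

    return {
        "ceiling": ceiling,
        "best": max(per_run),
        "average": mean(per_run),
        "worst": min(per_run),
        "solid": solid,
    }


# Three replicated runs over four made-up tasks.
example_runs = [
    {"a": True, "b": False, "c": True, "d": False},
    {"a": True, "b": True, "c": False, "d": False},
    {"a": True, "b": False, "c": False, "d": False},
]
summary = five_metrics(example_runs)
```

Under these assumed definitions, the gap between Ceiling and Solid is one way to read the spread: a wide gap means many tasks are solved only some of the time, so a small difference in Average between two models may well be noise.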

Workshops Day 1 · advanced