Inference

Weight Folding, CUDA Streams, and the Bug That Made My Model Speak Backwards

Talk · Intermediate

A talk about contributing GPU benchmarks to an open-source research paper (FlashNorm). I'll walk through the engineering journey: folding norm weights into projections, writing Triton kernels, accidentally making attention bidirectional (oops), and ultimately proving a 33-35% speedup on the norm+project operation. Practical lessons for anyone trying to optimize transformer inference.

About the Inference Track

Inference sessions at AI Engineer World's Fair 2026 in San Francisco.

When

Thursday, July 2, 2026

3:20 PM - 3:40 PM · 20m

Where

Track 9 · Room 2016

Capacity: 250 attendees

Speaker

Filip Makraduli

Machine Learning Engineer

Superlinked

@f_makraduli

Filip Makraduli is an applied AI researcher and founding ML Developer Relations engineer at Superlinked, where he designs and ships small‑LLM inference systems for search, retrieval, and agents in production. He holds a master’s degree in Biomedical Data Science from Imperial College London. Before Superlinked, Filip worked in machine learning, data science, and developer relations roles across early‑stage AI startups and larger enterprises, building language understanding, retrieval‑augmented generation (RAG), and LLM pipeline tooling while partnering closely with product and platform teams. He is a frequent open‑source contributor, with contributions to kernel libraries, model‑inference providers, and hands‑on demos used by practitioners. Filip is a co‑author of several publications on efficient transformer architectures and inference, including work on faster normalization for LLMs. He is an experienced speaker at meetups and conferences such as AI Engineer Europe and Berlin Buzzwords, sharing practical lessons on efficient transformers, retrieval systems, and embedding inference for production AI teams.
