— Live room · llm-engineering
LLM Engineering
Prompting, eval, fine-tuning, deployment.
Structured outputs wherever the provider supports it. Hand-rolled parsers are fine until the model ships a quirk and you spend a week chasing a trailing comma.
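The trailing-comma quirk mentioned above can be sketched with plain `json.loads`, which rejects it outright. This is an illustrative stand-in for a hand-rolled parser, not code from the thread; the helper name `parse_strict` and the schema dict are made up for the example.

```python
import json

def parse_strict(raw: str, required: dict) -> dict:
    """Parse model output and check required keys and types.

    Raises ValueError on malformed JSON or a schema mismatch -
    the kind of failure a provider-side structured-output mode
    would prevent at generation time instead.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"malformed JSON: {e}") from e
    for key, typ in required.items():
        if key not in data or not isinstance(data[key], typ):
            raise ValueError(f"missing or mistyped field: {key}")
    return data

schema = {"label": str, "confidence": float}

# Well-formed output parses cleanly.
ok = parse_strict('{"label": "spam", "confidence": 0.93}', schema)

# The classic quirk: one trailing comma and json.loads refuses it.
try:
    parse_strict('{"label": "spam", "confidence": 0.93,}', schema)
    quirk_caught = False
except ValueError:
    quirk_caught = True
```

With structured outputs, the schema check moves from your parser into the decoding step, which is exactly why the migration touches so much code.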
Any strong opinions on structured outputs vs prompt-and-parse for production? We are migrating and it is surprisingly invasive.
That is the whole game. Pin your judge or version your eval.
Judge drift is real. We caught it last quarter by accident when a pass rate jumped 12% overnight - turned out the judge had been updated, not our model.
We run a small golden set per product, plus a judge model to catch drift. Judge model itself gets re-evaluated monthly.
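The golden-set-plus-judge setup described above can be sketched as a drift check over a frozen set: re-run the judge, compare its pass rate to a recorded baseline, and flag any jump like the 12%-overnight one. Everything here is an assumption for illustration - the `judge` callable stands in for a real LLM judge, and the exact-match stub is just a placeholder.

```python
def pass_rate(judge, golden_set):
    """Fraction of golden examples the judge marks as passing."""
    return sum(judge(ex["output"], ex["reference"]) for ex in golden_set) / len(golden_set)

def check_drift(judge, golden_set, baseline_rate, tolerance=0.05):
    """Flag a judge whose pass rate on a frozen golden set moved
    more than `tolerance` from the recorded baseline."""
    rate = pass_rate(judge, golden_set)
    return abs(rate - baseline_rate) <= tolerance, rate

# Stub judge: exact match stands in for a real judge-model call.
exact = lambda out, ref: out.strip() == ref.strip()

golden = [
    {"output": "4", "reference": "4"},
    {"output": "Paris", "reference": "Paris"},
    {"output": "blue", "reference": "green"},
]

within_tolerance, rate = check_drift(exact, golden, baseline_rate=2 / 3)
```

Because the golden set is frozen, any pass-rate movement on it implicates the judge, not the product model - which is the signal the monthly re-evaluation is after.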
Opening question: what eval harness are you actually using, not just the one you say you use in your deck?
Space for anyone shipping LLMs in production. Evals, prompts, fine-tunes, cost, latency - bring your scars.
About this room
Engineering-focused conversations about shipping LLM systems in production.