Build an LLM Evals Harness

Eight lessons. Ship a reusable evals harness — golden sets, structural + LLM judges, regression detection, CI integration.

8 lessons · 2 phases · intermediate

why this workshop

who it's for.

Engineers shipping LLM features who want evidence that a prompt change made things better, not worse.

what you'll build

what you walk away with.

01 A reusable `evals` CLI with golden-set runner, judge pipeline, and cost tracking.
02 Both judge styles on one task — structural (JSON-shape, exact-match) and LLM-as-judge with rubrics.
03 Regression detection that fails the build when scores drop on baselined items.
04 GitHub Actions integration that runs evals on every PR and posts a summary comment.
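
To make items 02 and 03 concrete, here is a minimal sketch of two structural judges and a regression check. Every name in it (`GoldenItem`, `Scores`, `exactMatch`, `jsonShape`, `findRegressions`) is hypothetical and chosen for illustration; the harness you build in the workshop may be shaped differently.

```typescript
// Hypothetical types for illustration; not the workshop's actual API.
type GoldenItem = { id: string; input: string; expected: string };
type Scores = Record<string, number>; // item id -> score in [0, 1]

// Structural judge: exact match after trimming surrounding whitespace.
function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

// Structural judge: output parses as a JSON object containing every required key.
function jsonShape(output: string, requiredKeys: string[]): number {
  try {
    const parsed: unknown = JSON.parse(output);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return 0;
    }
    return requiredKeys.every((key) => key in parsed) ? 1 : 0;
  } catch {
    return 0; // not valid JSON at all
  }
}

// Regression check: ids whose score fell below the baseline by more than `tolerance`.
// An item missing from `current` is treated as a score of 0 (an assumption of this sketch).
function findRegressions(baseline: Scores, current: Scores, tolerance = 0): string[] {
  return Object.entries(baseline)
    .filter(([id, base]) => (current[id] ?? 0) < base - tolerance)
    .map(([id]) => id);
}

// Tiny made-up golden set and model outputs, for demonstration only.
const golden: GoldenItem[] = [
  { id: "greet", input: "Say hi", expected: "hi" },
  { id: "json", input: "Return a user", expected: '{"name":"Ada"}' },
];
const outputs: Record<string, string> = { greet: " hi ", json: "not json" };
const baseline: Scores = { greet: 1, json: 1 };

const current: Scores = {};
for (const item of golden) {
  const out = outputs[item.id] ?? "";
  current[item.id] =
    item.id === "json" ? jsonShape(out, ["name"]) : exactMatch(out, item.expected);
}

const regressions = findRegressions(baseline, current);
console.log(regressions.length > 0 ? `regressed: ${regressions.join(", ")}` : "no regressions");
```

In a CI job, a non-empty `regressions` list is the signal that would fail the build, for example by exiting with a non-zero status. The `tolerance` parameter is one possible way to absorb small score fluctuations from non-deterministic judges.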

curriculum

8 lessons across 2 phases.

phase A · foundations · 4 lessons

01 Setup
02 Define your task
03 Golden set
04 Structural judges

phase B · harness + automation · 4 lessons

05 LLM-as-judge
06 Harness CLI
07 Regression detection
08 Wiring evals into your workflow

prerequisites

what you need before you start.

Comfort with TypeScript: You should be able to read a function signature without running away. Strict mode is on.
Node.js 22+: The project is a pnpm-workspace monorepo. If you've shipped one Node project, you're set.

start

how to start.

Once Claude Code is set up, tell it “set up the evals workshop” and it clones the project and queues the first lesson.

New here? The getting-started guide walks you through installing the lwc CLI and the Claude Code plugin once — everything you need before that line works.