Build an LLM Evals Harness
Eight lessons. Ship a reusable evals harness — golden sets, structural + LLM judges, regression detection, CI integration.
What you'll build
- 01 A reusable `evals` CLI with golden-set runner, judge pipeline, and cost tracking.
- 02 Both judge styles on one task — structural (JSON-shape, exact-match) and LLM-as-judge with rubrics.
- 03 Regression detection that fails the build when scores drop on baselined items.
- 04 GitHub Actions integration that runs evals on every PR and posts a summary comment.
8 lessons across 2 phases.
- 01 Setup lesson_setup
Install the Anthropic SDK and run your first eval-shaped call — input email in, parsed JSON out.
- 02 Define your task lesson_task
Lock down what "correct output" means before you measure anything. Zod schema, system prompt, per-field grading rubric.
- 03 Golden set lesson_golden-set
Source 30 realistic support emails, label them, commit them. Without good golden data, every later judge measures the wrong thing.
- 04 Structural judges lesson_structural-judges
Build the cheap deterministic judge tier — exact, shape, regex. See exactly which fields they grade well and which they can't touch.
- 05 LLM-as-judge lesson_llm-judge
Score subjective fields (sentiment, urgency-with-rationale) with an LLM rubric judge — and calibrate it against human labels before you trust it.
- 06 Harness CLI lesson_harness
Wrap structural + LLM judges in a real CLI — `evals run`, `evals report` — with response caching and cost tracking so iteration doesn't drain your API credits.
- 07 Regression detection lesson_regression-detection
Snapshot today's run as a baseline. Tomorrow's run gets diffed against it. Score drops above threshold fail the build.
- 08 Wiring evals into your workflow lesson_ci-integration
Four places eval discipline can live — local script, pre-commit hook, CI, or as a tool your coding agent calls. Survey the tradeoffs; pick what fits.
What you need before you start.
- Comfort with TypeScript
- You should be able to read a function signature without running away. Strict mode is on.
- Node.js 22+
- pnpm-workspace monorepo. If you've shipped one Node project, you're set up.
- An Anthropic API key
- Verify scripts make real API calls (~$0.001 each on Haiku 4.5; ~$0.004 on Sonnet 4.6 for the LLM judge lessons). Default runs are capped at 10 items; the full ~30-item golden set is opt-in via `--full`.
Two lines to start.
Run the first in your shell. Run the second from inside Claude Code, in plain English.
First time here?
You'll need the lwc CLI and the Claude Code plugin
installed once before the two lines below will work.
Full walkthrough →