workshop.institute
8 lessons · 2 phases

Build an LLM Evals Harness

Eight lessons. Ship a reusable evals harness — golden sets, structural + LLM judges, regression detection, CI integration.

One install command.

What you'll build

curriculum

8 lessons across 2 phases.

phase A · foundations · 4 lessons
  01. Setup · lesson_setup

    Install the Anthropic SDK and run your first eval-shaped call: input email in, parsed JSON out.

  02. Define your task · lesson_task

    Lock down what "correct output" means before you measure anything. Zod schema, system prompt, per-field grading rubric.

  03. Golden set · lesson_golden-set

    Source 30 realistic support emails, label them, commit them. Without good golden data, every later judge measures the wrong thing.

  04. Structural judges · lesson_structural-judges

    Build the cheap deterministic judge tier: exact, shape, regex. See exactly which fields they grade well and which they can't touch.
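
To make the structural judge tier from lesson 04 concrete, here is a minimal sketch of the three judge kinds it names (exact, shape, regex). It is not the workshop's code: the `Ticket` shape, the field names, and the `ORD-` order-ID pattern are assumptions made up for illustration, and the shape check is hand-rolled rather than Zod-based so the block has no dependencies.

```typescript
// Hypothetical output shape for a parsed support email; the real schema comes from lesson 02.
type Ticket = { category: string; orderId: string | null; urgency: string };

type JudgeResult = { judge: string; pass: boolean; reason: string };

// Exact judge: the predicted field must equal the golden label.
function exactJudge(field: keyof Ticket, predicted: Ticket, golden: Ticket): JudgeResult {
  const pass = predicted[field] === golden[field];
  const reason = pass
    ? "match"
    : `expected ${String(golden[field])}, got ${String(predicted[field])}`;
  return { judge: `exact:${field}`, pass, reason };
}

// Shape judge: the raw model output must be an object with correctly typed fields.
function shapeJudge(raw: unknown): JudgeResult {
  const obj = typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>) : null;
  const pass =
    obj !== null &&
    typeof obj["category"] === "string" &&
    (typeof obj["orderId"] === "string" || obj["orderId"] === null) &&
    typeof obj["urgency"] === "string";
  return { judge: "shape", pass, reason: pass ? "valid shape" : "missing or mistyped field" };
}

// Regex judge: a string field must match a format pattern (use a non-global RegExp,
// since global patterns keep state between test() calls).
function regexJudge(field: keyof Ticket, predicted: Ticket, pattern: RegExp): JudgeResult {
  const value = predicted[field];
  const pass = typeof value === "string" && pattern.test(value);
  const reason = pass ? "pattern matched" : `${String(value)} does not match ${pattern.source}`;
  return { judge: `regex:${field}`, pass, reason };
}
```

All three are deterministic and free to run, which is why they fit closed-set fields such as a category or an ID format. None of them can say whether a free-text rationale is reasonable; that gap is what the LLM judge in phase B is for.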

phase B · harness + automation · 4 lessons
  05. LLM-as-judge · lesson_llm-judge

    Score subjective fields (sentiment, urgency-with-rationale) with an LLM rubric judge, and calibrate it against human labels before you trust it.

  06. Harness CLI · lesson_harness

    Wrap structural + LLM judges in a real CLI (`evals run`, `evals report`) with response caching and cost tracking so iteration doesn't drain your API credits.

  07. Regression detection · lesson_regression-detection

    Snapshot today's run as a baseline. Tomorrow's run gets diffed against it. Score drops above threshold fail the build.

  08. Wiring evals into your workflow · lesson_ci-integration

    Four places eval discipline can live: local script, pre-commit hook, CI, or as a tool your coding agent calls. Survey the tradeoffs; pick what fits.
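
The baseline-and-diff idea from lesson 07 can be sketched in a few lines. This is an illustration under assumptions, not the workshop's implementation: `Scores`, `findRegressions`, the per-field score layout, and the rule that a field missing from the current run counts as 0 are all hypothetical.

```typescript
// Per-field mean scores in [0, 1]; the layout is an assumption for this sketch.
type Scores = Record<string, number>;

type Regression = { field: string; baseline: number; current: number; drop: number };

// Compare the current run against a saved baseline and collect every field
// whose score dropped by more than `threshold`.
function findRegressions(baseline: Scores, current: Scores, threshold: number): Regression[] {
  const regressions: Regression[] = [];
  for (const [field, base] of Object.entries(baseline)) {
    // Assumption: a field absent from the current run is scored as 0.
    const now = current[field] ?? 0;
    const drop = base - now;
    if (drop > threshold) {
      regressions.push({ field, baseline: base, current: now, drop });
    }
  }
  return regressions;
}

// The build fails when at least one field regressed past the threshold.
function shouldFailBuild(baseline: Scores, current: Scores, threshold: number): boolean {
  return findRegressions(baseline, current, threshold).length > 0;
}
```

In a CI job, a caller would typically load both score files from disk and exit non-zero when `shouldFailBuild` returns true. Improvements and small drops inside the threshold pass, which keeps run-to-run noise from blocking merges.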

prerequisites

What you need before you start.

Comfort with TypeScript
You should be able to read a function signature without running away. Strict mode is on.
Node.js 22+
The workshop uses a pnpm-workspace monorepo. If you've shipped one Node project, you're set.
An Anthropic API key
The verify scripts make real API calls (about $0.001 each on Haiku 4.5; about $0.004 each on Sonnet 4.6 for the LLM judge lessons). Default runs are capped at 10 items; the full ~30-item golden set is opt-in via `--full`.
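
For a rough sense of what a run costs, the per-call figures above can be multiplied out. The sketch below assumes exactly one API call per golden-set item, which may undercount lessons that call both the task model and a judge; `COST_PER_CALL` and `estimateRunCost` are hypothetical names, and the prices are the approximate ones quoted above.

```typescript
// Approximate per-call prices in USD, as quoted in the prerequisites.
const COST_PER_CALL = { haiku: 0.001, sonnet: 0.004 } as const;

// Estimate the cost of one run, assuming one API call per item.
function estimateRunCost(items: number, model: keyof typeof COST_PER_CALL): number {
  return items * COST_PER_CALL[model];
}

// Default run, 10 items on Haiku:  about $0.01
// Full run, 30 items on Haiku:     about $0.03
// Default run, 10 items on Sonnet: about $0.04
// Full run, 30 items on Sonnet:    about $0.12
```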

install

Two lines to start.

Run the first in your shell. Run the second from inside Claude Code, in plain English.

First time here? You'll need to install the lwc CLI and the Claude Code plugin once before the two lines below will work.

~/lwc claude
> set up the evals workshop
