05 — Writing an eval¶
Evals turn "the output looked good" into "the output passed N criteria across M cases." This tutorial covers building a rubric for my_explainer (the skill from Tutorial 03) and running it.
The shape of an eval¶
An eval spec is a markdown file with frontmatter and one or more cases:
---
skill_dsl_path: my_explainer
model: standard
---
# Case: short_topic
input: photosynthesis
## Phase: outline
- Each bullet is a complete sentence.
- Bullets cover distinct angles (no overlap).
## Phase: expand
- The paragraph mentions all three bullet points.
- The tone is friendly, not academic.
Each ## Phase: <name> block lists rubric criteria. The eval skill judges each criterion phase-by-phase using judge_phase.
Step 1: generate a draft¶
eval_builder reads the skill, drafts cases and criteria, and writes reyn/local/my_explainer/eval.md.
Step 2: review it¶
Open the file. Adjust:
- Cases — add edge cases (empty topic, very long topic, ambiguous topic).
- Criteria — sharpen vague ones into testable statements. "The paragraph is well-written" doesn't grade reliably; "the paragraph is 2-4 sentences" does.
Step 3: run it¶
Output:
=== Eval: my_explainer [3 case(s)] ===
model=standard
━━━ case: short_topic ━━━
input: photosynthesis
✓ score=0.95 (4/4 required)
━━━ case: long_topic ━━━
...
═══════════════════════════════════════════════════
✓ 3/3 cases passed
Results → .reyn/eval-results/my_explainer/<timestamp>.json
═══════════════════════════════════════════════════
reyn eval exits with status 0 (all passed), 1 (spec failed to load), or 2 (cases failed).
Step 4: tighten the rubric¶
If a criterion is "passing on bad output," it's not specific enough. Look at the failing case:
For each failed criterion, the report includes the judge's reasoning. Use it to rewrite the criterion to be more specific, then re-run.
Eval is non-interactive¶
reyn eval doesn't prompt. Any permission the target skill needs must be pre-approved (in reyn.yaml's permissions: or saved to .reyn/approvals.yaml from a prior reyn run). Without it, the case is reported as not-finished. See manage-permissions.
What you learned¶
- An eval spec is a phase-keyed rubric in markdown.
eval_buildergenerates a draft; you review and iterate.reyn evalruns every case non-interactively and writes a report.
Next¶
- Tutorial 02 — Chat mode — if you skipped it earlier.
- Reference: stdlib/eval
- Reference: stdlib/eval_builder