eval_builder — rubric construction rules¶
Use this when you (the eval_builder skill) generate an eval spec for a target skill. A good rubric is specific, testable, and phase-keyed — and it focuses on the things that actually matter to the skill's purpose.
Spec shape¶
---
skill_dsl_path: <target_skill_name>
model: standard
---
# Case: <case_name>
input: <input string or JSON artifact>
## Phase: <phase_name>
- <criterion 1>
- <criterion 2>
## Phase: <other_phase_name>
- <criterion>
One file = one or more cases. Each case has phase-keyed criteria.
What makes a good criterion¶
A criterion must be:
- Specific. "The summary is good" doesn't grade reliably. "The summary is 2-4 sentences" does.
- Phase-aligned. Criteria target the phase that produces the relevant output. Don't put summary-quality criteria on the outline phase.
- Verifiable from output alone. A judge with no access to the input shouldn't be able to give a wildly wrong score. If a criterion needs the input to evaluate, include enough of the input in the criterion text.
- Testable in both directions. Criteria should fail on bad output. If you can't imagine an output that fails the criterion, it's not a useful test.
- Evidence-bound. Criteria should require the output to contain something specific derived from the input, not just to look right. Shape-only criteria are gameable — an output that is structurally correct but content-empty can satisfy them.
| Bad (shape only) | Good (evidence-bound) |
|---|---|
| "The paragraph is 2-4 sentences." | "The paragraph mentions all three angles from the bullet list." |
| "The summary is well-structured." | "The summary names the specific topic from the input within its first sentence." |
| "The response has at least two examples." | "Each example is drawn from the domains listed in the input constraints." |
If a criterion can be satisfied by lorem ipsum that happens to have the right shape, rewrite it.
Adversarial self-check¶
Before finalising a rubric, apply this check:
Every eval spec SHOULD include at least one case whose criteria fail on a deliberately weak, empty, or off-topic output. If you cannot imagine an output that fails your criteria, your criteria are gameable.
Concrete procedure:
- Pick the most permissive criterion in each phase section.
- Imagine submitting an output that is structurally correct but contains no real content (e.g. placeholder text, a bare
{}, filler sentences). - If that imaginary output passes, the criterion is shape-only — add an evidence-bound clause or rewrite.
An eval spec with no failure-capable criteria offers no signal. A rubric that passes on empty input is indistinguishable from no rubric at all.
Tags¶
| Tag | Effect |
|---|---|
| (default, no tag) | Required — counted in the pass threshold |
aspirational |
Optional — graded but doesn't fail the case |
Use aspirational for nice-to-haves that are genuinely subjective (tone, polish). Use the default for hard requirements (length, structure, factual presence).
Cases¶
Cover at least:
- Happy path — typical input, all features exercised.
- Edge case — short/empty input, very long input, ambiguous input.
- Anti-pattern — input that should trigger a refusal / fallback (if the skill has one).
Three cases is a reasonable starting rubric. More if the skill has many code paths.
Don't¶
- Don't write criteria that only the eval skill knows. A judge sees only the artifact and the criterion text — write criteria that make sense without extra context.
- Don't restate phase logic. The criteria test outcomes, not implementation. "The phase reads from memory" isn't a criterion; "the answer reflects the user's stated preference" is.
- Don't grade things the runtime already validates. Schema conformance is checked by the OS — don't add "the output has all required fields" as a criterion.
- Don't use vague modifiers. "Reasonable", "adequate", "good enough" — replace with concrete thresholds.
Output expectations¶
When you finish:
- Write
eval.mdto the target skill's directory. - Read it back, mentally run each criterion against an imaginary good output and an imaginary bad output. If both pass or both fail, rewrite.
- Tell the user how to run it:
reyn eval <path/to/eval.md>. - Note any pre-approvals the target skill needs (Python preprocessor steps, file write paths, etc.) — eval is non-interactive.
Attribution¶
Adversarial-rubric discipline draws on Berkeley RDI's Exploiting the most prominent AI agent benchmarks (2026-04, https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) which showed major benchmarks could be gamed with structurally-correct but content-empty submissions.