reyn dogfood
Scenario-based regression testing for the chat router (FP-0036). Declare expected behaviour in YAML scenario sets, run them against the real system, and track pass/fail across releases.
Synopsis
reyn dogfood run <SET_YAML> [OPTIONS]
reyn dogfood coverage [--feature-map FILE] [--json] [<SET_YAML>...]
reyn dogfood report <RUN_ID> [--json]
reyn dogfood compare <BASELINE> <CANDIDATE> [--threshold FLOAT] [--json]
reyn dogfood baseline <RUN_ID> [--label NAME]
reyn dogfood publish <RUN_ID> [--repo OWNER/REPO] [--category SLUG] [--dry-run] [--template PATH] [--batch-id N] [--topic TOPIC]
Description
reyn dogfood drives the chat router with structured scenario sets — each scenario declares an input prompt and expected behaviour across three observation surfaces:
- reply — natural language output (judge / substring / regex)
- events — P6 event log (must_emit / must_not_emit)
- artifacts — workspace artifacts produced by the run
Each scenario returns a 4-band outcome: verified | inconclusive | refuted | blocked. Outcomes are tracked across runs so regressions are surfaced automatically with reyn dogfood compare.
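As an illustration of the events surface, the sketch below checks a captured event list against must_emit / must_not_emit rules. It is a minimal sketch under stated assumptions, not the framework's verifier: the function names are hypothetical, events are assumed to be plain dictionaries, and the count syntax is assumed to be an optional comparison operator followed by an integer (as in ">=1").

```python
import re
from typing import Any

# Assumed count syntax: optional operator followed by an integer, e.g. ">=1" or "2".
_COUNT_RE = re.compile(r"^(>=|<=|==|>|<)?\s*(\d+)$")


def count_matches(spec: str, actual: int) -> bool:
    """Return True if `actual` satisfies a count spec such as ">=1"."""
    m = _COUNT_RE.match(spec.strip())
    if m is None:
        raise ValueError(f"unsupported count spec: {spec!r}")
    op = m.group(1) or "=="
    n = int(m.group(2))
    return {
        ">=": actual >= n,
        "<=": actual <= n,
        "==": actual == n,
        ">": actual > n,
        "<": actual < n,
    }[op]


def matching_events(events: list[dict[str, Any]], rule: dict[str, Any]) -> list[dict[str, Any]]:
    """Events whose fields equal every field of the rule (ignoring `count`)."""
    fields = {k: v for k, v in rule.items() if k != "count"}
    return [e for e in events if all(e.get(k) == v for k, v in fields.items())]


def check_events(events, must_emit=(), must_not_emit=()) -> list[str]:
    """Return a list of failure messages; an empty list means the rules hold."""
    failures = []
    for rule in must_emit:
        found = len(matching_events(events, rule))
        # Assumption: a rule without `count` requires at least one match.
        if not count_matches(str(rule.get("count", ">=1")), found):
            failures.append(f"must_emit {rule}: matched {found} event(s)")
    for rule in must_not_emit:
        found = len(matching_events(events, rule))
        if found:
            failures.append(f"must_not_emit {rule}: matched {found} event(s)")
    return failures
```

An empty result would correspond to the events verifier passing; any entry would count as a verifier failure.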
Storage layout
.reyn/dogfood/
  runs/<run_id>/
    scenarios/<scenario_id>/
      output.json     # reply + verifier verdicts
      events.jsonl    # captured P6 event tail
      artifacts/      # workspace snapshot
    summary.json      # 4-band aggregate + Brier score
  baselines/<label>/  # symlink to a named baseline run
Outcome scale
| Outcome | Meaning |
|---|---|
| verified | All verifiers passed. |
| inconclusive | Verifiers could not determine pass or fail (e.g. judge uncertainty). |
| refuted | At least one verifier failed. |
| blocked | The scenario could not run (e.g. permission denied, agent error). |
Outcome ordering (worst to best): blocked < refuted < inconclusive < verified.
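The outcome scale and its ordering can be modelled as an ordered enum. The following is a minimal sketch, assuming hypothetical verdict labels ("pass", "fail", "uncertain") and hypothetical function names; it also shows how a worst-case rule across repetitions could fall out of the ordering.

```python
from enum import IntEnum
from typing import Iterable


class Outcome(IntEnum):
    """4-band outcome, ordered worst to best as in the scale above."""
    BLOCKED = 0
    REFUTED = 1
    INCONCLUSIVE = 2
    VERIFIED = 3


def outcome_from_verdicts(verdicts: Iterable[str], could_run: bool = True) -> Outcome:
    """Map verifier verdicts to an outcome band.

    Verdict labels are assumptions made for this sketch.
    """
    verdicts = list(verdicts)
    if not could_run:
        return Outcome.BLOCKED       # the scenario could not run at all
    if any(v == "fail" for v in verdicts):
        return Outcome.REFUTED       # at least one verifier failed
    if any(v == "uncertain" for v in verdicts):
        return Outcome.INCONCLUSIVE  # no failure, but no clear pass either
    return Outcome.VERIFIED          # all verifiers passed


def worst_outcome(outcomes: Iterable[Outcome]) -> Outcome:
    """Worst-case outcome across repetitions; raises ValueError if empty."""
    return min(outcomes)
```

Because the enum values follow the documented ordering, taking the minimum over repetitions yields the worst band.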
reyn dogfood run — execute a scenario set
Run every scenario in the YAML file through the chat router and record results.
Synopsis
reyn dogfood run <SET_YAML> [OPTIONS]
Positional arguments
| Name | Description |
|---|---|
| SET_YAML | Path to a scenario set YAML file (e.g. dogfood/scenarios/chat_router_smoke.yaml). |
Options
| Flag | Default | Description |
|---|---|---|
| --n N | 1 | Number of repetitions. Use N ≥ 3 for stability bands; worst-case outcome wins across repetitions. |
| --replay FIXTURE_DIR | — | Run in replay mode using recorded LLM fixtures. No live LLM calls are made. |
| --agent NAME | default | Chat-router agent name. |
| --storage DIR | .reyn/dogfood/runs/<run_id> | Override the run output directory. |
| --run-id RUN_ID | (auto UUID) | Explicit run ID. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Run completed (any outcome distribution). |
| 2 | Error: scenario file not found, dependency not available. |
Output
dogfood run: chat_router_smoke (3 scenarios, n=1)
run_id : a1b2c3d4-...
verified : 2
inconclusive: 1
refuted : 0
blocked : 0
total : 3
verified % : 66.7%
Brier : 0.1200
results → .reyn/dogfood/runs/a1b2c3d4-.../summary.json
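The 4-band aggregate in this output can be derived from the per-scenario outcomes. The sketch below is illustrative only: the function name and the returned keys are hypothetical and are not claimed to match the actual summary.json schema.

```python
from collections import Counter

BANDS = ("verified", "inconclusive", "refuted", "blocked")


def aggregate(outcomes: list[str]) -> dict:
    """Count outcomes per band and compute the verified percentage."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(BANDS)
    if unknown:
        raise ValueError(f"unknown outcome band(s): {sorted(unknown)}")
    total = len(outcomes)
    summary = {band: counts.get(band, 0) for band in BANDS}
    summary["total"] = total
    # Guard against an empty scenario set to avoid division by zero.
    summary["verified_pct"] = 100.0 * summary["verified"] / total if total else 0.0
    return summary


summary = aggregate(["verified", "inconclusive", "verified"])
print(f"verified % : {summary['verified_pct']:.1f}%")
```

With two verified scenarios out of three, the verified percentage formats to 66.7%, matching the sample output.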
Examples
# Single run
reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml
# 5 repetitions for stability
reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml --n 5
# Deterministic replay (no LLM cost)
reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml --replay dogfood/fixtures/chat_router_smoke/
# Custom storage
reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml --storage /tmp/my_run
reyn dogfood coverage — feature-map coverage
Show which feature-map features are covered by scenario sets.
Synopsis
reyn dogfood coverage [--feature-map FILE] [--json] [<SET_YAML>...]
Positional arguments
| Name | Description |
|---|---|
| SET_YAML... | Zero or more scenario set YAML files. Defaults to dogfood/scenarios/*.yaml if omitted. |
Options
| Flag | Default | Description |
|---|---|---|
| --feature-map FILE | docs/feature-map.md | Path to the feature map Markdown file. |
| --json | — | Output coverage as JSON. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Coverage computed successfully. |
| 2 | Error: missing files, F4 module not available. |
Examples
# Default: all scenario sets, default feature map
reyn dogfood coverage
# Specific file + JSON output
reyn dogfood coverage dogfood/scenarios/chat_router_smoke.yaml --json
# Custom feature map
reyn dogfood coverage --feature-map docs/my-feature-map.md
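Conceptually, coverage is a set comparison between the features listed in the feature map and the covers entries declared in scenario sets. The sketch below assumes the feature IDs have already been extracted from the feature map (its Markdown format is not described here) and that scenario sets are already parsed into dictionaries; the function name and result keys are hypothetical.

```python
def coverage(feature_ids: set[str], scenario_sets: list[dict]) -> dict:
    """Compare declared `covers` entries against a set of known feature IDs."""
    covered: set[str] = set()
    for scenario_set in scenario_sets:
        # Set-level `covers` entries.
        covered.update(scenario_set.get("covers", []))
        # Scenario-level `covers` entries.
        for scenario in scenario_set.get("scenarios", []):
            covered.update(scenario.get("covers", []))
    known = covered & feature_ids
    return {
        "covered": sorted(known),
        "uncovered": sorted(feature_ids - covered),
        # Entries in `covers` that do not appear in the feature map.
        "unknown": sorted(covered - feature_ids),
        "ratio": len(known) / len(feature_ids) if feature_ids else 0.0,
    }
```

Reporting the unknown entries separately is a design choice of this sketch: it makes typos in covers lists visible instead of silently inflating coverage.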
reyn dogfood report — print stored run results
Print the 4-band breakdown and Brier score from a previous run.
Synopsis
reyn dogfood report <RUN_ID> [--json]
Positional arguments
| Name | Description |
|---|---|
| RUN_ID | Run ID (UUID) or path to the run directory. |
Options
| Flag | Description |
|---|---|
| --json | Emit the report as JSON. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Report printed. |
| 2 | Run directory not found or summary.json missing. |
Output
Run: a1b2c3d4-...
Set: chat_router_smoke
Started: 2026-05-16T10:00:00+00:00
Completed: 2026-05-16T10:02:15+00:00
verified : 2
inconclusive: 1
refuted : 0
blocked : 0
total : 3
verified % : 66.7%
Brier : 0.1200
Scenarios:
✓ simple_greeting verified
? complex_multi_turn inconclusive
✓ skill_dispatch_smoke verified
Examples
reyn dogfood report a1b2c3d4-1234-...
reyn dogfood report a1b2c3d4-1234-... --json
reyn dogfood report .reyn/dogfood/runs/a1b2c3d4-1234-...
reyn dogfood compare — regression diff
Compare a candidate run against a baseline. Exits 1 if the verified-rate drop exceeds --threshold.
Synopsis
reyn dogfood compare <BASELINE> <CANDIDATE> [--threshold FLOAT] [--json]
Positional arguments
| Name | Description |
|---|---|
| BASELINE | Baseline run ID (or path). |
| CANDIDATE | Candidate run ID (or path). |
Options
| Flag | Default | Description |
|---|---|---|
| --threshold FLOAT | 0.05 | Verified-rate drop that triggers a regression alert. Default: 0.05 = 5 percentage points. |
| --json | — | Emit comparison as JSON. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | No regression detected (or delta within threshold). |
| 1 | Regression alert: verified-rate dropped by more than --threshold. |
| 2 | Error: run directories not found. |
Output
Baseline: a1b2c3d4-... (66.7% verified)
Candidate: b2c3d4e5-... (33.3% verified)
Delta: -33.4pp / threshold=-5.0pp
Result: REGRESSION ALERT
Regressed scenarios (1):
- complex_multi_turn: verified → refuted
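The regression decision can be sketched as a simple comparison of verified rates. This is a minimal sketch, not the command's implementation: the function names are hypothetical, rates are fractions in [0, 1], and "regressed" is assumed to mean any scenario whose outcome moved to a worse band.

```python
RANK = {"blocked": 0, "refuted": 1, "inconclusive": 2, "verified": 3}


def compare(baseline_rate: float, candidate_rate: float, threshold: float = 0.05) -> dict:
    """Flag a regression when the verified rate drops by more than `threshold`."""
    drop = baseline_rate - candidate_rate
    regression = drop > threshold
    return {
        "delta_pp": (candidate_rate - baseline_rate) * 100,  # negative means worse
        "regression": regression,
        "exit_code": 1 if regression else 0,
    }


def regressed_scenarios(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """IDs present in both runs whose outcome moved to a worse band (assumed definition)."""
    shared = baseline.keys() & candidate.keys()
    return sorted(sid for sid in shared if RANK[candidate[sid]] < RANK[baseline[sid]])


result = compare(2 / 3, 1 / 3)
print("REGRESSION ALERT" if result["regression"] else "OK")
```

A larger threshold tolerates a larger drop, so raising --threshold makes the check more lenient, not stricter.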
Examples
# Compare two runs
reyn dogfood compare a1b2c3d4-... b2c3d4e5-...
# Looser threshold (10pp): tolerate a larger drop before alerting
reyn dogfood compare a1b2c3d4-... b2c3d4e5-... --threshold 0.10
# CI: JSON output + exit code 1 on regression
reyn dogfood compare baseline_run candidate_run --json; echo "exit: $?"
# Using a named baseline
reyn dogfood compare .reyn/dogfood/baselines/v1.2-stable b2c3d4e5-...
reyn dogfood baseline — tag a run as a named baseline
Create a symlink under .reyn/dogfood/baselines/<label>/ pointing at a stored run.
Synopsis
reyn dogfood baseline <RUN_ID> [--label NAME]
Positional arguments
| Name | Description |
|---|---|
| RUN_ID | Run ID (or path) to mark as a baseline. |
Options
| Flag | Default | Description |
|---|---|---|
| --label NAME | (run_id) | Human-readable label for the baseline (e.g. v1.2-stable). |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Baseline created (or overwritten). |
| 2 | Run directory not found. |
Examples
# Tag a run with the default label (= run_id)
reyn dogfood baseline a1b2c3d4-...
# Tag a run with a release label
reyn dogfood baseline a1b2c3d4-... --label v1.2-stable
# Use in compare
reyn dogfood compare v1.2-stable b2c3d4e5-...
A scenario set is a YAML file that declares a group of chat-router scenarios:
type: dogfood_scenario_set
name: chat_router_smoke
description: Chat router intent dispatch smoke test
covers:
  - chat-router/intent-routing
  - stdlib-skill/direct_llm
scenarios:
  - id: simple_greeting
    covers: [chat-router/intent-routing, stdlib-skill/direct_llm]
    input: "こんにちは、何ができますか?"  # Japanese: "Hello, what can you do?"
    expected:
      reply:
        kind: judge
        rubric:
          - explains capabilities at high level
          - mentions chat / skills / agents
      events:
        must_emit:
          - { type: skill_run_spawned, count: ">=1" }
          - { type: skill_run_completed, status: success }
        must_not_emit:
          - { type: permission_denied }
      artifacts:
        - { skill: direct_llm, present: true }
    outcome_prediction:
      verified: 0.7
      inconclusive: 0.2
      refuted: 0.05
      blocked: 0.05
outcome_prediction enables Brier score tracking — declare your confidence in each band and the framework measures calibration over time.
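To make the calibration measure concrete, the sketch below computes a multi-class Brier score for one scenario: the sum of squared differences between the predicted probability of each band and the one-hot actual outcome. The exact normalisation the framework uses is not stated here, so treat the unnormalised sum (and the hypothetical function names) as an assumption of this sketch.

```python
import math

BANDS = ("verified", "inconclusive", "refuted", "blocked")


def brier_score(prediction: dict[str, float], actual: str) -> float:
    """Multi-class Brier score for a single scenario (0 is a perfect prediction)."""
    if actual not in BANDS:
        raise ValueError(f"unknown outcome band: {actual!r}")
    total = sum(prediction.get(band, 0.0) for band in BANDS)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"probabilities must sum to 1, got {total}")
    return sum(
        (prediction.get(band, 0.0) - (1.0 if band == actual else 0.0)) ** 2
        for band in BANDS
    )


def mean_brier(scores: list[float]) -> float:
    """Average per-scenario scores into a run-level figure (assumed aggregation)."""
    return sum(scores) / len(scores) if scores else 0.0


prediction = {"verified": 0.7, "inconclusive": 0.2, "refuted": 0.05, "blocked": 0.05}
score = brier_score(prediction, "verified")  # 0.3**2 + 0.2**2 + 0.05**2 + 0.05**2
```

For the prediction above, a verified outcome scores about 0.135, while a refuted outcome scores about 1.435; lower is better, so confident predictions that turn out wrong are penalised heavily.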
reyn dogfood publish — publish a batch Discussion to GitHub
Read a stored run's summary.json, render the Discussion body from the Markdown template, and create a thread in the configured GitHub Discussions category.
Authentication: set GH_TOKEN or GITHUB_TOKEN env var (same convention as the gh CLI). The command exits with an error if neither is set and --dry-run is not passed.
Synopsis
reyn dogfood publish <RUN_ID> [--repo OWNER/REPO] [--category SLUG] \
[--dry-run] [--template PATH] \
[--batch-id N] [--topic TOPIC]
Positional arguments
| Name | Description |
|---|---|
| RUN_ID | Run ID (UUID) or path to the run directory. |
Options
| Flag | Default | Description |
|---|---|---|
| --repo OWNER/REPO | tya5/reyn (or detected from git remote) | GitHub repository to post the Discussion in. |
| --category SLUG | dogfood-batches | Discussion category slug. |
| --dry-run | — | Render the title and body to stdout without posting to GitHub. |
| --template PATH | docs/deep-dives/contributing/templates/dogfood-discussion-template.md | Override the template file. |
| --batch-id N | (from summary.json) | Batch number override (required if summary.json lacks batch_id). |
| --topic TOPIC | (from summary.json) | Short topic string override (required if summary.json lacks topic). |
Authentication
GH_TOKEN takes precedence over GITHUB_TOKEN. The token must have the write:discussion scope (or repo scope for private repositories).
export GH_TOKEN="ghp_..."
reyn dogfood publish <RUN_ID>
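The token lookup described above can be sketched as follows. This is an illustrative sketch: the function name is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_token(env: Mapping[str, str], dry_run: bool = False) -> Optional[str]:
    """Pick the GitHub token: GH_TOKEN first, then GITHUB_TOKEN.

    Empty values are treated as unset (an assumption of this sketch).
    """
    token = env.get("GH_TOKEN") or env.get("GITHUB_TOKEN")
    if token:
        return token
    if dry_run:
        return None  # a dry run does not post, so no token is needed
    raise RuntimeError("set GH_TOKEN or GITHUB_TOKEN, or pass --dry-run")


# Typical call site: read from the process environment.
token = resolve_token(os.environ, dry_run=True)
```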
Exit codes
| Code | Meaning |
|---|---|
| 0 | Discussion created (or dry-run completed). |
| 1 | GitHub API error (network, auth, GraphQL error). |
| 2 | Error: run directory not found, summary.json missing, template not found. |
Discussion title format:
Batch <N> (YYYY-MM-DD): <topic> — <verified_pct>% verified, <regressed_count> regressed
Example:
Batch 27 (2026-05-17): chat router smoke + stdlib core — 75% verified, 1 regressed
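The title pattern above can be rendered with a single format string. The sketch below uses a hypothetical function name and assumes the verified percentage is rounded to a whole number, as the example title suggests.

```python
def render_title(batch_id: int, date: str, topic: str,
                 verified_pct: float, regressed_count: int) -> str:
    """Render a Discussion title following the documented pattern.

    "\u2014" is the dash that separates the topic from the statistics.
    """
    return (
        f"Batch {batch_id} ({date}): {topic} \u2014 "
        f"{verified_pct:.0f}% verified, {regressed_count} regressed"
    )


title = render_title(27, "2026-05-17", "chat router smoke + stdlib core", 75.0, 1)
print(title)
```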
Output
Discussion created: https://github.com/tya5/reyn/discussions/42
Title : Batch 27 (2026-05-17): chat router smoke — 75% verified, 1 regressed
Number : #42
Examples
# Dry-run: see the rendered body without posting
reyn dogfood publish a1b2c3d4-... --dry-run
# Post to default repo (tya5/reyn) with batch-id + topic overrides
reyn dogfood publish a1b2c3d4-... --batch-id 27 --topic "chat router smoke"
# Post to a fork
reyn dogfood publish a1b2c3d4-... --repo acme/reyn-fork
# Use a custom template
reyn dogfood publish a1b2c3d4-... --template path/to/my-template.md
# Dry-run from a full path
reyn dogfood publish .reyn/dogfood/runs/a1b2c3d4-... --dry-run