Dogfood Scenario Framework

A regression-grade scenario suite that asserts on three observation surfaces (reply text, emitted P6 events, and workspace artifacts) and tracks coverage against docs/feature-map.md. Scenarios are authored as YAML and reused across Reyn releases as a continuous regression suite, not as one-shot batch prep.

Why

Three mechanisms exist today and were not unified before FP-0036:

| | reyn eval | Dogfood scenario framework |
|---|---|---|
| Entry point | reyn run <skill> | reyn chat (router decides skill) |
| Verification | Per-phase rubric (LLM judge) | reply + events + artifacts |
| Scope | One skill at a time | Feature-map coverage |
| Outcome scale | Binary pass/fail | 4-band (verified / inconclusive / refuted / blocked) |
| Use case | Per-skill regression | System-wide e2e regression |

The framework is orthogonal to reyn eval. It reuses the judge_output op backend and the baseline comparison pattern, but the CLI surface and YAML schema are distinct. It is also orthogonal to one-shot batch preludes — those are Markdown prose, not machine-readable, not reusable across batches.

LLM stochasticity, replay cost, feature drift, and coverage gaps are the four driving constraints:

  • Stochasticity: assertions use stability bands, not binary pass/fail
  • Cost: full-suite re-runs use LLMReplay fixtures (zero LLM cost)
  • Drift: reyn dogfood compare <baseline> <candidate> surfaces regressions vs noise
  • Coverage: reyn dogfood coverage lists uncovered feature-map entries

Schema

Each scenario set is a YAML file under dogfood/scenarios/. The top-level covers: lists features covered by the whole set; each scenario has its own covers: that feeds the coverage matrix.

Single-turn scenario

type: dogfood_scenario_set
name: chat_router_smoke
description: Chat router intent dispatch + stdlib catalog dispatch smoke
covers:
  - chat-router/intent-routing
  - stdlib-skills/direct-llm

scenarios:
  - id: simple_greeting
    covers: [chat-router/intent-routing, stdlib-skills/direct-llm]
    input: "こんにちは、何ができますか?"  # "Hello, what can you do?"
    expected:
      reply:
        kind: judge
        rubric:
          - explains capabilities at high level
          - mentions chat / skills / agents
      events:
        must_emit:
          - { type: skill_run_spawned, count: ">=1" }
          - { type: skill_run_completed, status: success }
        must_not_emit:
          - { type: permission_denied }
      artifacts:
        - { skill: direct_llm, present: true }
    outcome_prediction:
      verified: 0.7
      inconclusive: 0.2
      refuted: 0.05
      blocked: 0.05

Multi-turn scenario

Multi-turn scenarios use prompts: [...] instead of input:. The expected block applies to the final turn unless per_turn_expected is specified.

  - id: multi_turn_plan
    covers: [os-core/phase-engine/act-decide-loop]
    prompts:
      - "コードを改善してください"  # "Please improve the code"
      - "変更を適用してください"  # "Please apply the changes"
    expected:
      events:
        sequence:
          - skill_run_spawned
          - skill_run_completed

input and prompts are mutually exclusive; the loader raises ScenarioLoadError if both are present.
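
The exclusivity rule can be sketched as a small normalisation step. This is a minimal illustration, not the loader's actual code: `normalise_turns` is a hypothetical helper, the `ScenarioLoadError` class below is a stand-in for the real one, and rejecting a scenario with neither key is an assumption.

```python
class ScenarioLoadError(ValueError):
    """Stand-in for the loader's error type; the real class may differ."""


def normalise_turns(scenario: dict) -> list[str]:
    """Return the scenario's turns as a list, enforcing input/prompts exclusivity."""
    has_input = "input" in scenario
    has_prompts = "prompts" in scenario
    sid = scenario.get("id")
    if has_input and has_prompts:
        raise ScenarioLoadError(
            f"scenario {sid!r}: 'input' and 'prompts' are mutually exclusive"
        )
    if has_input:
        # A single-turn scenario is treated as a one-element turn list.
        return [scenario["input"]]
    if has_prompts:
        return list(scenario["prompts"])
    # Assumption: a scenario with neither key is also rejected.
    raise ScenarioLoadError(f"scenario {sid!r}: needs 'input' or 'prompts'")
```

Treating `input` as a one-element list lets downstream code handle single-turn and multi-turn scenarios through the same path.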

Verification surfaces

Three verifiers run independently and their outcomes compose as worst-case: one refuted verdict refutes the whole scenario.

Reply

kind controls matching:

| kind | assertion |
|---|---|
| judge | rubric is a list of natural-language criteria; judge_output op scores each |
| substring | value string must appear anywhere in the reply |
| exact | value string matches the reply (trimmed) |
| regex | value pattern matches via re.search |
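
The three deterministic kinds can be sketched as follows. This is an illustrative helper, not the framework's verifier: `match_reply` is a hypothetical name, and trimming only the reply (not the expected value) for `exact` is an assumption. The `judge` kind is left out because it depends on the judge_output backend.

```python
import re


def match_reply(kind: str, value: str, reply: str) -> bool:
    """Evaluate a non-judge reply assertion against the reply text."""
    if kind == "substring":
        return value in reply
    if kind == "exact":
        # Assumption: only the reply is trimmed before comparison.
        return reply.strip() == value
    if kind == "regex":
        # re.search matches anywhere in the reply, not just at the start.
        return re.search(value, reply) is not None
    raise ValueError(f"unsupported kind for deterministic matching: {kind!r}")
```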

Events

must_emit asserts event presence with optional count comparator (>=1, ==2, <5, …) and payload subset match. must_not_emit asserts absence. sequence asserts an ordered subsequence of event types across the run.
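
A rough sketch of these three checks, under stated assumptions: events are modelled as flat dicts with a `type` key, an omitted `count` defaults to `>=1`, and all function names are hypothetical rather than the framework's API.

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}
_COUNT_RE = re.compile(r"^\s*(>=|<=|==|>|<)\s*(\d+)\s*$")


def count_ok(spec: str, actual: int) -> bool:
    """Evaluate a comparator string such as '>=1' or '<5' against a count."""
    m = _COUNT_RE.match(spec)
    if m is None:
        raise ValueError(f"bad count comparator: {spec!r}")
    op, n = m.groups()
    return _OPS[op](actual, int(n))


def payload_matches(expected: dict, event: dict) -> bool:
    """Subset match: every expected key must be present with an equal value."""
    return all(k in event and event[k] == v for k, v in expected.items())


def must_emit_ok(assertion: dict, events: list[dict]) -> bool:
    """Count matching events and apply the comparator (assumed default '>=1')."""
    expected = {k: v for k, v in assertion.items() if k != "count"}
    hits = sum(1 for e in events if payload_matches(expected, e))
    return count_ok(assertion.get("count", ">=1"), hits)


def must_not_emit_ok(assertion: dict, events: list[dict]) -> bool:
    """Absence check: no event may match the given fields."""
    return not any(payload_matches(assertion, e) for e in events)


def is_subsequence(wanted: list[str], events: list[dict]) -> bool:
    """True if the wanted types occur in order; other events may interleave."""
    types = iter(e.get("type") for e in events)
    # 'w in types' consumes the iterator up to the match, preserving order.
    return all(w in types for w in wanted)
```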

Artifacts

Each ArtifactAssertion tests workspace state: presence/absence by skill and/or type, with an optional fingerprint (SHA256 of normalised content) for pinned regression.
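
As an illustration of a content fingerprint, the sketch below hashes a normalised form of the artifact text. The normalisation rules here (unified line endings, trailing whitespace and trailing blank lines stripped) are assumptions; the framework's actual normalisation is not described in this section, and `fingerprint` is a hypothetical name.

```python
import hashlib


def fingerprint(content: str) -> str:
    """SHA256 hex digest of the content after an assumed normalisation."""
    # Assumed normalisation: unify line endings, strip trailing whitespace
    # on each line, and drop leading/trailing blank lines.
    lines = content.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    normalised = "\n".join(line.rstrip() for line in lines).strip("\n")
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

Normalising before hashing keeps a pinned fingerprint stable across cosmetic differences such as platform line endings.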

4-band outcomes

Each verifier returns one of four bands:

  • verified — assertion clearly passed
  • inconclusive — insufficient signal to decide
  • refuted — assertion clearly failed
  • blocked — infrastructure failure (timeout, missing fixture, …)

Event and artifact verifier results dominate; judge is a tiebreaker on inconclusive outcomes. See Dogfood discipline for band semantics and Brier scoring methodology.
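
The worst-case composition described under Verification surfaces can be sketched as a maximum over a severity ranking. The source only fixes that a single refuted verdict refutes the scenario; the relative rank of blocked and inconclusive below is an assumption, and the judge tiebreak rule is not modelled. `compose` and `SEVERITY` are hypothetical names.

```python
# Assumed severity order, least to most severe. Only "refuted dominates"
# is stated in the text; the position of "blocked" is a guess.
SEVERITY = {"verified": 0, "inconclusive": 1, "blocked": 2, "refuted": 3}


def compose(bands: list[str]) -> str:
    """Return the most severe band among the verifier results."""
    if not bands:
        raise ValueError("at least one verifier result is required")
    return max(bands, key=SEVERITY.__getitem__)
```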

outcome_prediction declares an expected 4-band probability distribution (must sum to 1.0 ± 0.001). Brier score measures calibration quality across runs.
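
A minimal sketch of the sum check and a calibration score, assuming the standard multi-class Brier score (squared error summed over the four bands, averaged over runs). The framework's exact scoring formula lives in the Dogfood discipline document and may differ; the function names and the requirement that all four bands be present are assumptions.

```python
BANDS = ("verified", "inconclusive", "refuted", "blocked")


def validate_prediction(pred: dict[str, float], tol: float = 0.001) -> None:
    """Check that the distribution covers all bands and sums to 1.0 within tol."""
    missing = [b for b in BANDS if b not in pred]
    if missing:  # assumption: all four bands must be given explicitly
        raise ValueError(f"missing bands: {missing}")
    total = sum(pred[b] for b in BANDS)
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, expected 1.0 +/- {tol}")


def brier_score(pred: dict[str, float], observed: list[str]) -> float:
    """Mean multi-class Brier score over runs; 0.0 is perfect, 2.0 is worst."""
    if not observed:
        raise ValueError("need at least one observed outcome")
    total = 0.0
    for outcome in observed:
        # One-hot encode the observed band and sum squared errors over bands.
        total += sum((pred[b] - (1.0 if b == outcome else 0.0)) ** 2 for b in BANDS)
    return total / len(observed)
```

With the `simple_greeting` prediction above, a single verified run scores 0.3² + 0.2² + 0.05² + 0.05² = 0.135 under this formula.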

Coverage

Each scenario's covers: tags map to feature paths in docs/feature-map.md. The path scheme is lowercase kebab-case:

### OS Core          -> os-core
#### Phase Engine    -> os-core/phase-engine
| Act/Decide loop |  -> os-core/phase-engine/act-decide-loop
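
The mapping above can be approximated with a small slug function. This is a sketch inferred from the three examples, not the framework's parser: `slugify` and `feature_path` are hypothetical names, and collapsing every run of non-alphanumeric characters into one hyphen is an assumption.

```python
import re


def slugify(label: str) -> str:
    """Lowercase the label and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def feature_path(*labels: str) -> str:
    """Join heading and table-row labels into a slash-separated feature path."""
    return "/".join(slugify(label) for label in labels)
```

For example, `feature_path("OS Core", "Phase Engine", "Act/Decide loop")` yields `os-core/phase-engine/act-decide-loop`.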

reyn dogfood coverage (or --json for machine consumption) reads all scenario sets and reports:

Total features:   187
Covered:           42  (22%)
Uncovered:        145

Uncovered (sample):
  os-core/llm-validation/artifact-schema-validation
  control-ir-ops/sandboxed-exec
  stdlib-skills/skill-builder
  ...

Unknown tags (tags that match no feature path) are reported as warnings and do not fail the run.
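
The report boils down to set arithmetic over feature paths and scenario tags. The sketch below assumes a tag covers only the exact feature path it names (no prefix matching), which the text does not confirm; `coverage_report` and its return shape are hypothetical.

```python
def coverage_report(features: set[str], scenario_tags: list[list[str]]) -> dict:
    """Summarise covered, uncovered, and unknown tags across all scenarios."""
    tagged = {tag for tags in scenario_tags for tag in tags}
    covered = features & tagged
    percent = round(100 * len(covered) / len(features)) if features else 0
    return {
        "total": len(features),
        "covered": sorted(covered),
        "uncovered": sorted(features - tagged),
        # Tags that match no feature path: warned about, never fatal.
        "unknown": sorted(tagged - features),
        "percent": percent,
    }
```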

Regression workflow

# Record a baseline
reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml --n 5 --baseline smoke-v1

# After a Reyn change, run candidate and compare
reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml --n 5
reyn dogfood compare smoke-v1 <run_id>

compare reports regressions, newly passing scenarios, and Brier drift. It exits with code 1 if any scenario regresses beyond --threshold (default 0.1).
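
One way to picture the threshold check is shown below. The text does not say which metric the threshold applies to, so this sketch assumes it is the drop in per-scenario verified rate between baseline and candidate; the function names and data shapes are hypothetical.

```python
def verified_rate(bands: list[str]) -> float:
    """Fraction of runs that ended in the 'verified' band."""
    return sum(b == "verified" for b in bands) / len(bands)


def regressions(baseline: dict[str, list[str]],
                candidate: dict[str, list[str]],
                threshold: float = 0.1) -> list[str]:
    """Scenario ids whose verified rate dropped by more than the threshold."""
    regressed = []
    for scenario_id, base_bands in baseline.items():
        cand_bands = candidate.get(scenario_id)
        if not base_bands or not cand_bands:
            continue  # assumption: scenarios missing on either side are skipped
        if verified_rate(base_bands) - verified_rate(cand_bands) > threshold:
            regressed.append(scenario_id)
    return sorted(regressed)


def exit_code(regressed: list[str]) -> int:
    """Mirror the documented behaviour: 1 if anything regressed, else 0."""
    return 1 if regressed else 0
```

Comparing rates over `--n` runs rather than single outcomes is what lets a threshold separate a real regression from run-to-run noise.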

Replay mode

First-run fixtures are recorded to dogfood/fixtures/<scenario_id>/ by the LLMReplay integration. Subsequent runs with --replay <fixture_dir> use recorded LLM responses — zero LLM cost, fully deterministic:

reyn dogfood run dogfood/scenarios/chat_router_smoke.yaml --replay dogfood/fixtures/

Replay-mode runs are tagged in run_id so reports distinguish them from live runs. Fixtures are re-recorded on every release tag; a schema mismatch forces re-recording automatically.

Replay is built on src/reyn/testing/replay.py (LLMReplay); see Replay tests guide for fixture recording mechanics.

Cross-references