
Evaluation

reyn ships a structured evaluation infrastructure built on top of the P6 event log. Rather than bolting on a separate observability layer, reyn treats every skill execution as a potential evaluation artifact: the same append-only event stream that powers live debug output and crash recovery is the substrate for golden-dataset testing, CI gating, and external trace export.

The key insight: If every state change is already an event (P6), evaluation is a query over that log — not a new recording system.

Why evaluation infrastructure matters

Goodhart's Law and reward hacking

UC Berkeley (2026-04) demonstrated reward hacking across eight major benchmarks: once a benchmark becomes a target, optimizing for the metric drifts away from the underlying capability. The industry-standard response is traceability — not just "what score did this run get?" but "which skill version, which model, which dataset case, and which phase produced it?"

reyn's evaluation infrastructure is designed to answer that question without additional executions. The skill_version_hash (FP-0006) is recorded in every run_skill_started event, which means retrospective version-to-version comparison is P6 log aggregation — not re-running anything.
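As a rough illustration of "comparison as log aggregation", the sketch below groups per-run outcomes by skill_version_hash using only files under an events directory. It assumes each event is a JSON object with type, timestamp, and data keys and that skill_version_hash sits inside data. The eval_case_scored event type and its passed field are hypothetical names chosen for this example, not part of the documented event set.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_events(path):
    """Yield one event dict per non-empty line of a JSONL log."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def pass_rate_by_version(events_dir):
    """Group per-run outcomes by skill_version_hash and return pass rates."""
    totals = defaultdict(lambda: [0, 0])  # hash -> [passed, total]
    for log in sorted(Path(events_dir).glob("*.jsonl")):
        version, outcomes = None, []
        for event in load_events(log):
            data = event.get("data") or {}
            if event.get("type") == "run_skill_started":
                # Assumption: the hash is stored inside the event's data payload.
                version = data.get("skill_version_hash")
            elif event.get("type") == "eval_case_scored":  # hypothetical event type
                outcomes.append(bool(data.get("passed")))
        if version is None:
            continue  # a log without a start event cannot be attributed to a version
        totals[version][0] += sum(outcomes)
        totals[version][1] += len(outcomes)
    # Versions with no scored cases are omitted rather than reported as 0%.
    return {v: p / t for v, (p, t) in totals.items() if t}
```

Nothing here executes a skill: the function only reads existing logs, which is the property the paragraph above relies on.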

CI/CD gates are now the industry baseline

Braintrust established CI/CD gates as a de facto standard: a PR that regresses a skill's pass rate blocks merge. The reyn eval run command exits with code 1 when the pass rate falls below the threshold, so it works as a drop-in CI step.

Data sovereignty requirements

Japanese enterprises frequently cannot send trace data to US-hosted SaaS. reyn's export adapter supports Langfuse self-hosted, OTLP (local Jaeger, Grafana), IETF Agent Audit Trail (file output), and a local file backend — no mandatory external dependency.

Three-layer architecture

┌──────────────────────────────────────────────────┐
│  P6 event log  (append-only JSONL per run)       │  ← foundation
│  .reyn/events/<run_id>.jsonl                      │
└──────────────────────────────────────────────────┘
             ↓ Component A: export adapter
┌──────────────────────────────────────────────────┐
│  Export adapter                                  │  ← forwarding layer
│  Langfuse / OTLP / IETF Audit Trail / file       │
└──────────────────────────────────────────────────┘
             ↓ Component B: reyn eval
┌──────────────────────────────────────────────────┐
│  reyn eval run / report / compare                │  ← operator surface
│  Golden dataset runner + CI threshold gate       │
└──────────────────────────────────────────────────┘

Layer 1: P6 event log (foundation)

Every run already produces a JSONL event log at .reyn/events/<run_id>.jsonl. This is the P6 guarantee (see concepts/events.md): every state change emits an event; the log is append-only and replayable. Evaluation infrastructure reads from this log — it does not add a new recording path.

The IETF Agent Audit Trail draft fields map naturally onto P6 event types:

| IETF field | P6 mapping |
|---|---|
| identity | chain_id / skill_name |
| timing | timestamp (present on all events) |
| routing | run_skill_started.state_dir |
| parameters | tool_executed.op + tool_executed.args |
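The mapping table can be read as a projection from one P6 event to one audit record. The following is a minimal sketch under stated assumptions: chain_id, skill_name, state_dir, op, and args are assumed to live inside the event's data payload, and the output keys simply reuse the table's left-hand labels rather than the draft's actual wire-format field names.

```python
def to_audit_record(event):
    """Project one P6 event onto the four field groups from the mapping table."""
    data = event.get("data") or {}
    record = {
        # identity: chain_id / skill_name (assumed to be carried in data)
        "identity": {
            "chain_id": data.get("chain_id"),
            "skill_name": data.get("skill_name"),
        },
        # timing: the timestamp is present on every event
        "timing": event.get("timestamp"),
    }
    if event.get("type") == "run_skill_started":
        # routing: only the start event carries state_dir
        record["routing"] = data.get("state_dir")
    if event.get("type") == "tool_executed":
        # parameters: op plus args of the executed tool
        record["parameters"] = {"op": data.get("op"), "args": data.get("args")}
    return record
```

Identity and timing are filled for every event, while routing and parameters appear only for the event types that carry them.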

Layer 2: export adapter (Component A)

An async adapter that forwards P6 events to external evaluation platforms after skill execution completes. Export failures emit a warning only — the P6 core write is independent and unaffected.

P7 compliance: the adapter reads only type / timestamp / data and has no knowledge of skill-specific field names. The generic event schema is forwarded as-is; skill-domain knowledge lives in the external tool's rubrics, not in the adapter code.
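The two properties described above (generic forwarding and warning-only failure) can be sketched as follows. This is an illustration, not reyn's adapter code: FileBackend, export_run, and the send method are hypothetical names, and the file write is done synchronously inside the coroutine for brevity.

```python
import asyncio
import json
import warnings
from pathlib import Path

# The adapter touches only these generic keys, never skill-specific fields.
GENERIC_KEYS = ("type", "timestamp", "data")


class FileBackend:
    """Hypothetical backend that appends forwarded events to a local JSONL file."""

    def __init__(self, path):
        self.path = Path(path)

    async def send(self, events):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            for event in events:
                fh.write(json.dumps(event) + "\n")


async def export_run(events, backend):
    """Forward generic event fields; downgrade any backend failure to a warning."""
    payload = [{key: event.get(key) for key in GENERIC_KEYS} for event in events]
    try:
        await backend.send(payload)
        return True
    except Exception as exc:  # an export failure must never fail the run itself
        warnings.warn(f"trace export failed: {exc}")
        return False
```

A caller would invoke it after the run finishes, for example `asyncio.run(export_run(events, FileBackend(".reyn/traces/run.jsonl")))`, where the file name is only an example. The data payload is passed through untouched, so the adapter needs no knowledge of what a skill put inside it.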

Supported backends: Langfuse (self-hosted or cloud), OTLP (OpenTelemetry — local Jaeger, Grafana, Honeycomb), IETF Agent Audit Trail (file output, draft spec), file (local .reyn/traces/, default).

Layer 3: reyn eval (Component B)

The operator-facing surface. reyn eval run executes a skill against a golden JSONL dataset, compares final_output against expected, and exits with a non-zero code when the pass rate is below --threshold. reyn eval report summarises past results. reyn eval compare compares two skill versions using P6 log data — no additional executions required.
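The gate logic behind reyn eval run can be approximated in a few lines. This sketch assumes each dataset line is a JSON object with an input key (a hypothetical name) and the expected key mentioned above, uses exact equality as the comparison, and treats an empty dataset as a failure. run_skill stands in for whatever executes the skill and returns its final_output.

```python
import json


def evaluate(cases, run_skill, threshold):
    """Run each golden case and report whether the pass rate meets the threshold."""
    results = [run_skill(case["input"]) == case["expected"] for case in cases]
    # Assumption: an empty dataset counts as a 0% pass rate rather than a pass.
    pass_rate = sum(results) / len(results) if results else 0.0
    return pass_rate, pass_rate >= threshold


def main(dataset_path, run_skill, threshold):
    """Load a golden JSONL dataset and return a process exit code (0 or 1)."""
    with open(dataset_path, encoding="utf-8") as fh:
        cases = [json.loads(line) for line in fh if line.strip()]
    pass_rate, ok = evaluate(cases, run_skill, threshold)
    print(f"pass rate {pass_rate:.2%} (threshold {threshold:.2%})")
    return 0 if ok else 1
```

Passing the return value of main to sys.exit gives a CI runner the non-zero exit code it needs to block a merge.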

Four-component map

| Component | Description | Dependency |
|---|---|---|
| A — Export adapter | P6 → Langfuse / OTLP / IETF / file | none |
| B — reyn eval command | Golden dataset runner + CI gate + report | Component A (optional) |
| C — Regression compare | Version-to-version diff from P6 logs | FP-0006 skill_version_hash |
| D — judge_output op | LLM scorer callable from any phase | none |

Components A, B, and D are independent of FP-0006 and can be used without Component C.

Positioning

P7 compliance. The OS has no knowledge of skill-specific rubric content. The judge_output op (Component D) receives a target path and a rubric string supplied by the calling skill — the OS-side implementation knows only the score and whether the threshold passed. Skill-domain evaluation criteria never appear in OS code.
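The separation described above can be illustrated with a small sketch in which the rubric is an opaque string passed straight through to a scorer. The names JudgeResult and scorer, and the exact parameter list, are assumptions made for this example. In practice scorer would wrap an LLM call, which is replaced here by any callable that returns a number.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class JudgeResult:
    """All the OS side learns: a score and whether it cleared the threshold."""
    score: float
    passed: bool


def judge_output(target_path, rubric, scorer, threshold):
    """Score the file at target_path against a skill-supplied rubric.

    The rubric is never inspected here; it is handed to the scorer unchanged.
    """
    text = Path(target_path).read_text(encoding="utf-8")
    score = float(scorer(text, rubric))
    return JudgeResult(score=score, passed=score >= threshold)
```

Because the function body never branches on the rubric's content, swapping in a different skill's criteria requires no change on the OS side.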

OSS self-host support. Langfuse and Grafana/Tempo are both OSS and self-hostable. The local file backend requires no external service. reyn does not mandate any SaaS dependency for evaluation.

IETF Agent Audit Trail alignment. The IETF draft (draft-sharif-agent-audit-trail) is under active development. reyn's export adapter produces output aligned with the draft's field requirements, using the P6 event mapping above. The exporter configuration marks the spec as draft status.

Competitive comparison

Feature Braintrust Langfuse Reyn
CI/CD eval gate ✓ (reyn eval run --threshold)
Version regression compare partial ✓ (P6 log aggregation, no re-runs)
External export Braintrust SaaS only Langfuse only Langfuse / OTLP / IETF / file
Self-host support ✓ (all backends)
IETF Agent Audit Trail ✓ (draft compliance, Component A)
LLM scorer in skill phases ✓ (judge_output op, Component D)
P7 OS/skill separation N/A N/A ✓ (rubric is always skill-supplied)

Phase 1 scope

Included (Components A, B, D):

  • File export backend (default, no configuration required)
  • Langfuse, OTLP, and IETF export backends (configured in reyn.yaml)
  • reyn eval run — golden dataset runner with CI threshold gate
  • reyn eval report — past results summary
  • judge_output op — LLM scorer callable from any phase
  • Workspace isolation — eval runs do not contaminate the production workspace

Deferred (Component C — requires FP-0006):

  • reyn eval compare — version-to-version regression comparison using skill_version_hash

See also