Skill resume¶

How Reyn restores in-flight skill execution after a process crash.

What gets restored¶

When a skill is mid-execution and the Reyn process dies (kill -9, OOM, machine reboot, etc.), the next reyn chat startup automatically:

Loads the per-agent snapshot (AgentSnapshot.load)
Replays the WAL forward to the latest known state
Per skill that was in-flight:
Loads the per-skill snapshot
Builds a ResumePlan from snapshot + WAL events
Decides an action via SkillResumeCoordinator (default: retry ambiguous side-effect ops, resume from the in-flight phase)
Resumes each active skill from its current_phase (fast-forward)

The completed phases are NOT re-executed. Within the in-flight phase, already-committed side-effect ops are memoized (results loaded from WAL, not re-invoked), and LLM calls within that phase are also memoized so resume does not re-pay LLM cost.

World-purity ops re-execute on resume. Read-only network calls (web_fetch, web_search, etc.) are classified as world purity — their result depends on external state that may have shifted, so on resume the call is re-issued rather than replayed from the recorded result. This prevents transient API blips (e.g. a flaky search returning "0 results") from being permanently locked into the skill's state. Side-effect ops (file/write, mcp/call_tool writes) and LLM calls still memoize for cost / duplicate-write avoidance.

What's preserved across crashes¶

State	Where	Survives crash
Workspace artifacts	`.reyn/agents/<name>/workspace/`	yes
Per-agent state (inbox, chains, interventions)	`agents/<name>/state/snapshot.json`	yes
Per-skill state (current phase, visit counts)	`agents/<name>/state/skills/<run_id>.snapshot.json`	yes
WAL (committed ops + LLM responses)	`.reyn/state/wal.jsonl`	yes
Active asyncio.Tasks	in-memory only	no — resumes via fresh tasks on restart

Ambiguous steps and resume policy¶

A "side-effect" op (e.g. mcp/call_tool writing externally, file/write, shell) emits step_started to the WAL before invoking the underlying call. If the process crashes after step_started but before step_completed, the resume system can't tell whether the side effect actually happened — the op is ambiguous.

The Coordinator applies the resume policy from reyn.yaml:

skill_resume:
  default: retry             # default: re-invoke the ambiguous op
  per_skill:
    blog_publisher: discard_skill  # external publish: don't risk duplicate
    eval_runner: skip              # idempotent reads — skip is safe

Policy values:

retry — default. Re-invoke the op. Safe for read-only and idempotent ops (and the natural choice for the auto-resume design, since read-API memos are invalidated on resume by the world-purity rule). Risk: duplicate side effect for non-idempotent writes.
skip — treat ambiguous step as completed with empty result. Prevents duplicate side effects but the skill continues as if the op succeeded; risk: missing data downstream.
discard_skill — abort the whole skill_run.
prompt — legacy / no-op. Auto-resume never blocks on interactive prompt; specifying prompt is treated equivalently to retry.

Manual control¶

If you need to manage individual runs:

/skill list                  # show active skill runs
/skill discard <run_id>      # abort one specific run

For starting fresh:

reyn chat --no-restore       # skip restore this run (state stays on disk)
reyn chat --reset            # wipe in-flight skill state (with confirm)

Skill resume¶

What gets restored¶

What's preserved across crashes¶

Ambiguous steps and resume policy¶

Manual control¶

See also¶