
Preprocessor

A preprocessor is a chain of deterministic steps that runs before the LLM is called in a phase. Each step enriches the input artifact — its result is placed at a key named by into. By the time the LLM receives the context frame, the input artifact already contains computed facts the LLM can cite rather than guess.

It mirrors the postprocessor structurally: same step types, same on_error semantics, same permission gate. The main difference is where each one fires: the preprocessor fires at phase entry, the postprocessor at skill finish.

Why

Deterministic work does not need an LLM

LLMs are unreliable at counting characters, summing rows, or measuring token length, tasks that Python handles exactly and almost instantly. Pre-computing facts the LLM would otherwise estimate (or hallucinate) is a foundational agent-engineering pattern. In Reyn, the preprocessor is where that computation lives.

This is the deterministic split principle: if an output is a pure function of the input, derive it deterministically rather than asking the LLM to reproduce it. See system-design.md for the broader framing. The principle aligns with P3 (the OS controls execution) and P5 (all data flows through the workspace, not through the LLM's reasoning chain).

Reduced token cost and improved correctness

Every fact the preprocessor computes is one fewer inference the LLM needs to make. This narrows the LLM's responsibility to judgment, synthesis, and generation, the things LLMs are good at. A phase that arrives at the LLM with stats.word_count = 847 already in the artifact tends to produce more accurate output than one that asks the LLM to estimate word count inline.

Phase composability

A phase that reads data.stats doesn't care how stats got there. The same phase can be re-targeted at different skills by swapping what the preprocessor feeds into it, without changing the phase's instructions or schema. This is the phase reuse guarantee from P1.

Step kinds at a glance

| Step type | What it is for |
| --- | --- |
| run_skill | Invoke a sub-skill, store its final output at into |
| iterate | Fan a sub-step out over a list, collect results into into |
| validate | Run a JSON-Schema check; surface findings so the LLM can judge them |
| lint_plan | Run deterministic structural checks (cycles, coverage) on a plan artifact |
| python | Call a user-supplied Python function in sandboxed safe or unsafe mode |

All steps share two invariants: the result is placed at into, and steps run in declaration order — each step can read what earlier steps produced.
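
The two invariants can be illustrated with a minimal sketch. This is not Reyn's runner: `set_at_path`, `run_steps`, and the `(into, fn)` step shape are hypothetical names chosen for illustration, and the dotted-path handling is an assumption based on keys such as data.stats.

```python
from typing import Any, Callable


def set_at_path(artifact: dict, dotted_path: str, value: Any) -> None:
    """Place value at a dotted key such as 'data.stats', creating dicts as needed."""
    *parents, leaf = dotted_path.split(".")
    node = artifact
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def run_steps(artifact: dict, steps: list[tuple[str, Callable[[dict], Any]]]) -> dict:
    """Run hypothetical (into, fn) steps in declaration order.

    Each fn receives the artifact as enriched by all earlier steps.
    """
    for into, fn in steps:
        set_at_path(artifact, into, fn(artifact))
    return artifact


artifact = {"data": {"text": "one two three"}}
steps = [
    ("data.words", lambda a: a["data"]["text"].split()),
    # The second step reads what the first step produced.
    ("data.word_count", lambda a: len(a["data"]["words"])),
]
result = run_steps(artifact, steps)
```

Reordering the two steps in this sketch would raise a KeyError, which is why declaration order matters.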

Full syntax for each step type is in reference/dsl/preprocessor.md.

When to use a preprocessor vs letting the LLM handle it

| Situation | Right home |
| --- | --- |
| Counting, measuring, summing | Preprocessor (python) |
| Calling a known sub-skill before the main phase | Preprocessor (run_skill) |
| Fan-out over a list | Preprocessor (iterate) |
| "Check this before deciding" | Preprocessor (validate): gives the LLM the findings, then it judges |
| Structural sanity-check on a plan | Preprocessor (lint_plan) |
| Open-ended judgment, synthesis, or generation | LLM |

The guiding question: "Is the output of this step a pure function of the input?" If yes, it belongs in the preprocessor.

Symmetry with postprocessor

| | Preprocessor | Postprocessor |
| --- | --- | --- |
| Fires at | Phase entry | Skill finish |
| Input source | Upstream phase's output | LLM's finish artifact |
| Output target | Phase's input_schema (enriched) | Postprocessor's output_schema |
| Step types | run_skill / iterate / validate / lint_plan / python | Identical |
| on_error policy | fail / skip / empty per step | Identical |
| Permission gate | skill.permissions | Identical |

The runner shares its logic between the two. The differences are which artifact flows in, which schema validates the output, and where the chain fires.

Worked example: word_stats_demo

The word_stats_demo stdlib skill is the simplest canonical example. Its review phase declares a single python preprocessor step:

preprocessor:
  - type: python
    module: ./stats.py
    function: compute_text_stats
    into: data.stats
    output_schema:
      type: object
      properties:
        char_count:        {type: integer, minimum: 0}
        word_count:        {type: integer, minimum: 0}
        line_count:        {type: integer, minimum: 0}
        longest_line_chars: {type: integer, minimum: 0}
        estimated_tokens:  {type: integer, minimum: 1}
      required: [char_count, word_count, line_count, longest_line_chars, estimated_tokens]

What it gives the LLM: input_artifact.data.stats is already populated with exact counts before the LLM call. The phase instructions tell the LLM to "cite at least one stat verbatim" — which is reliable precisely because the numbers come from Python, not from the LLM's own estimation.
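
For orientation, here is a rough sketch of what a function like compute_text_stats might look like. The actual contents of stats.py are not shown here, so the signature (a plain string in, a dict out) and the four-characters-per-token heuristic are assumptions. Only the output field names and their minimums come from the output_schema above.

```python
import math


def compute_text_stats(text: str) -> dict:
    """Return exact text counts matching the fields in the output_schema.

    Assumption: the step receives the raw text as a string. The real
    stats.py may receive the whole input artifact instead.
    """
    lines = text.splitlines()
    char_count = len(text)
    return {
        "char_count": char_count,
        "word_count": len(text.split()),
        "line_count": len(lines),
        "longest_line_chars": max((len(line) for line in lines), default=0),
        # Assumed heuristic: roughly 4 characters per token, with a floor
        # of 1 to satisfy the schema's `minimum: 1`.
        "estimated_tokens": max(1, math.ceil(char_count / 4)),
    }


stats = compute_text_stats("hello world\nfoo")
empty_stats = compute_text_stats("")
```

Every value is a pure function of the input text, so the same input always yields the same stats, which is what makes the "cite at least one stat verbatim" instruction safe.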

Failure semantics

A preprocessor step can declare on_error: fail | skip | empty:

  • fail (default): step failure raises and aborts the phase.
  • skip: failure is logged; subsequent steps continue.
  • empty: failure produces an empty value at into; subsequent steps continue.

Default to fail for steps whose output is load-bearing. Use skip or empty for enrichment that is useful but not required for the LLM to proceed.
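
The three policies can be sketched as follows. This is an illustrative model, not Reyn's implementation: `run_step` and `failing_step` are hypothetical, the flat `into` key is a simplification, and treating the "empty value" as an empty dict is an assumption (the actual empty value may differ by step type).

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("preprocessor_sketch")


def run_step(artifact: dict, into: str, fn: Callable[[dict], Any],
             on_error: str = "fail") -> None:
    """Run one hypothetical step and apply its on_error policy."""
    if on_error not in ("fail", "skip", "empty"):
        raise ValueError(f"unknown on_error policy: {on_error!r}")
    try:
        value = fn(artifact)
    except Exception:
        if on_error == "fail":
            raise  # aborts the phase
        logger.warning("step for %r failed (on_error=%s)", into, on_error)
        if on_error == "skip":
            return  # nothing is written at `into`
        value = {}  # "empty": assumed here to mean an empty dict
    artifact[into] = value


def failing_step(_artifact: dict) -> Any:
    """A step that always fails, for demonstration."""
    return 1 / 0


skipped: dict = {}
run_step(skipped, "stats", failing_step, on_error="skip")

emptied: dict = {}
run_step(emptied, "stats", failing_step, on_error="empty")

try:
    run_step({}, "stats", failing_step)  # default policy is "fail"
    aborted = False
except ZeroDivisionError:
    aborted = True
```

In this sketch, a phase that reads the key after a skip must tolerate its absence, whereas empty guarantees the key exists.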

What phases must not do

Per P8, phase instructions must not describe how the preprocessor works or enumerate the mechanics of Control IR. The instructions should refer to enriched fields by name (data.stats) and explain what to do with them — not explain where they came from.

See also