Run SWE-bench
This how-to covers running Reyn against SWE-bench (the standard coding-agent benchmark) to measure how the general agent solves real GitHub issues. There are two surfaces, and they are independent:
- Solve a single instance: `scripts/swe_bench_runner.py` runs the general agent on one SWE-bench instance and emits a prediction (a git patch). It does not score.
- Run a batch: `reyn eval benchmark` runs a workflow across a JSONL task file with concurrent dispatch and built-in faithful scoring.
Scoring in both cases is delegated to the official swebench harness, which is
an optional dependency — see Faithful scoring.
Prerequisites
| Requirement | Why |
|---|---|
| Reyn installed (`reyn --version`) | the solver |
| Docker running | faithful runs use the official pre-built SWE-bench instance images |
| `pip install swebench` | optional; required only for authoritative Tier-1 scoring; see below |
A SWE-bench instance is a JSON object in the standard dataset format:
{
  "instance_id": "django__django-1234",
  "repo": "django/django",
  "base_commit": "abc123...",
  "problem_statement": "...",
  "hints_text": "...",
  "test_patch": "..."
}
hints_text and test_patch are optional. The agent solves from the issue and
the repository only — the held-out test_patch is not put in the prompt;
the harness applies it at scoring time.
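As a minimal sketch of the format above, the following checks that a parsed instance carries the documented fields before it is handed to a solver. Treating the first four keys as required is an inference from the statement that only hints_text and test_patch are optional; the function name `validate_instance` is hypothetical and not part of Reyn.

```python
import json

# Assumed required: the text only says hints_text and test_patch are optional.
REQUIRED_KEYS = ("instance_id", "repo", "base_commit", "problem_statement")
OPTIONAL_KEYS = ("hints_text", "test_patch")


def validate_instance(raw: str) -> dict:
    """Parse one instance JSON string and check the documented keys."""
    instance = json.loads(raw)
    if not isinstance(instance, dict):
        raise ValueError("instance must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in instance]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")
    # Fill optional keys with empty strings so later code can read them safely.
    for key in OPTIONAL_KEYS:
        instance.setdefault(key, "")
    return instance
```

Running a check like this before a long solve surfaces a malformed instance early, instead of partway through a container run.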
Solve a single instance
scripts/swe_bench_runner.py solves one instance with the general agent and
prints one prediction JSON object on stdout:
# Faithful run: solve inside the official pre-built instance image
python scripts/swe_bench_runner.py \
  --input instance.json \
  --env-backend docker \
  --image swebench/sweb.eval.x86_64.django__django-1234:latest \
  --repo-dir /testbed

# Or read the instance JSON from stdin
cat instance.json | python scripts/swe_bench_runner.py --stdin --env-backend docker --image <IMAGE>
| Flag | Default | Notes |
|---|---|---|
| `--input PATH` / `--stdin` | — | one is required; the instance JSON source |
| `--env-backend host\|docker` | `host` | `docker` = faithful per-instance container run (recommended) |
| `--image IMAGE` | — | required with `--env-backend docker`; the pre-built SWE-bench instance image |
| `--repo-dir PATH` | `/testbed` | in-container repo working tree |
| `--model-name NAME` | `reyn` | value for the harness `model_name_or_path` field |
| `--timeout SECONDS` | `600` | max wall-clock for the solve |
Under --env-backend docker the runner owns the container lifecycle: it starts
the instance image, provisions a Python 3.11 venv with Reyn inside the
container, then drives the general agent (via reyn run-once) to explore →
edit → verify against the repo. Web tools are disabled so the agent cannot look
up the upstream fix. The repo's git diff HEAD becomes the prediction.
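For scripted use, the invocation above can be assembled programmatically. This is a sketch only: it uses the flags and defaults from the table, and the helper name `build_runner_argv` is hypothetical rather than anything Reyn ships.

```python
import sys


def build_runner_argv(instance_path, image, repo_dir="/testbed",
                      model_name="reyn", timeout=600):
    """Assemble the argv for a faithful (docker) single-instance solve."""
    if not image:
        # The flag table states --image is required with --env-backend docker.
        raise ValueError("--image is required with --env-backend docker")
    return [
        sys.executable, "scripts/swe_bench_runner.py",
        "--input", str(instance_path),
        "--env-backend", "docker",
        "--image", image,
        "--repo-dir", repo_dir,
        "--model-name", model_name,
        "--timeout", str(timeout),
    ]
```

The returned list can be passed to `subprocess.run(argv, capture_output=True, text=True)`, which keeps the prediction on stdout separate from the diagnostics on stderr.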
Output is a single JSON object on stdout. On failure (non-zero solve, timeout,
or unparseable output) the runner emits
{"instance_id": ..., "model_name_or_path": ..., "error": "..."} and still
exits 0, so a batch loop over many instances is never aborted by one bad
instance. All progress and diagnostics go to stderr.
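Because the runner exits 0 even on failure, a caller has to inspect the JSON itself to tell a prediction from an error. The sketch below does that. It assumes the successful object stores the patch under `model_patch`, which is the swebench prediction convention but is not confirmed by this page; `parse_runner_output` is a hypothetical helper.

```python
import json


def parse_runner_output(stdout: str) -> dict:
    """Classify the runner's stdout as a prediction or an error record."""
    try:
        record = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"unparseable runner output: {exc}"}
    if not isinstance(record, dict):
        return {"ok": False, "error": "runner output is not a JSON object"}
    if "error" in record:
        # Failure still exits 0, so the "error" key is the only failure signal.
        return {"ok": False, "instance_id": record.get("instance_id"),
                "error": record["error"]}
    # Assumption: the patch lives under "model_patch" (swebench convention).
    return {"ok": True, "instance_id": record.get("instance_id"),
            "patch": record.get("model_patch", "")}
```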
To score the prediction, feed it to the official swebench harness
(python -m swebench.harness.run_evaluation), or use the batch driver below,
which scores inline.
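One way to hand predictions to the harness is sketched below: collect runner outputs into a JSONL predictions file, dropping error records, then build the harness command. The `--dataset_name`, `--predictions_path`, `--max_workers`, and `--run_id` flags are those of the swebench evaluation entry point as commonly documented; verify them against your installed swebench version. The helper names and the default dataset name are assumptions.

```python
import json
import sys
from pathlib import Path


def write_predictions(records, path):
    """Write successful predictions as JSONL; return how many were written."""
    # Error records carry no patch, so they are left out of the file.
    kept = [r for r in records if "error" not in r]
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in kept:
            handle.write(json.dumps(record) + "\n")
    return len(kept)


def harness_argv(predictions_path, run_id,
                 dataset_name="princeton-nlp/SWE-bench_Verified", max_workers=4):
    """Assemble the argv for the swebench evaluation entry point."""
    return [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
```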
Run a batch
reyn eval benchmark runs a workflow across a JSONL task file (one task per line)
with concurrent dispatch, and scores each result inline:
There is no bundled `swe_bench` workflow. The batch driver runs whatever workflow you name, and the previously bundled `swe_bench` workflow was retired, so `reyn eval benchmark swe_bench …` will dead-end with "workflow not found". The batch path requires a caller-supplied workflow. The agent-routed (no-workflow) SWE-bench solve is the single-instance runner above; to cover a full dataset that way, loop the runner over instances and score the predictions with the `swebench` harness yourself. There is no single bundled command that runs the agent-routed solve over all of SWE-bench Verified and scores it.
reyn eval benchmark <WORKFLOW> \
  --tasks swe_bench_verified.jsonl \
  --output results/ \
  --clone-task-repo \
  --concurrency 4 \
  [--limit 50] \
  [--resume]
| Flag | Default | Notes |
|---|---|---|
| `<WORKFLOW>` | — | the workflow to run on each task (resolved via reyn/project → local → stdlib) |
| `--tasks PATH` | — | required; JSONL, each line is one task input |
| `--output DIR` | — | required; results land under `<DIR>/run_<timestamp>/` |
| `--clone-task-repo` | off | needed for SWE-bench; clones `<repo>` and checks out `<base_commit>` into each task's workspace |
| `--concurrency N` | `4` | max concurrent runs |
| `--limit N` | all | stop after the first N tasks (after `--resume` filtering) |
| `--resume` | off | skip instances already completed in the latest prior run under `--output` |
| `--model MODEL` | from `reyn.yaml` | model override |
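The interaction between `--resume` and `--limit` is easy to misread, so here is a small illustration of the ordering the table describes: completed instances are filtered out first, and the limit is applied to what remains. This is not Reyn's implementation; `select_tasks` is a hypothetical function, and keying tasks by `instance_id` is an assumption based on the SWE-bench instance format.

```python
import json


def select_tasks(task_lines, completed_ids=frozenset(), limit=None):
    """Pick tasks to run: drop completed ones first, then apply the limit."""
    tasks = [json.loads(line) for line in task_lines if line.strip()]
    # Resume filtering happens before the limit, as the flag table states.
    pending = [t for t in tasks if t.get("instance_id") not in completed_ids]
    return pending if limit is None else pending[:limit]
```

Under this ordering, a resumed run with a limit of N processes the next N unfinished tasks rather than re-counting the ones already done.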
The batch auto-detects the verification environment once at start: with Docker available it runs Tier-1 faithful scoring (the official SWE-bench image applies your patch + the held-out test patch and reports the authoritative verdict); otherwise every result is honest-skipped (see below).
Results land under results/run_<timestamp>/:
results/run_<timestamp>/
  summary.json        # faithful accounting + per-instance verdicts
  patches/<id>.diff   # extracted model patch (when the output carries one)
  logs/<id>.jsonl     # per-instance event log
The final stdout line and summary.json always show the faithful
accounting: how many results were faithfully verified, how many passed, and
how many were skipped. The authoritative harness verdict — not the agent's own
self-check — drives the pass count; a result is never counted as pass or fail
unless it was faithfully verified.
Non-interactive permissions
reyn eval benchmark never prompts. Every permission the workflow needs must be
pre-approved before the run, either by running the workflow once interactively and
accepting the prompts, or by granting in reyn.yaml:
permissions:
  python.safe: allow
  python.unsafe: allow  # also requires --allow-unsafe-python at runtime
Without prior approval a task fails and is reported as not-finished. See the reyn eval reference and Manage permissions.
Faithful scoring and the honest-skip
Authoritative scoring is delegated to the official swebench harness, which is
an optional dependency Reyn does not install by default. The behaviour when
it is absent is the key gotcha to understand:
- With `pip install swebench` + Docker: each patch is scored by the official harness (applies the model patch + the held-out test patch in the pre-built image, runs the tests, reports `resolved`).
- Without `swebench` installed: scoring honest-skips. The result is marked `verify_skipped` with a clear reason; Reyn never emits a fake PASS/FAIL. In the batch, skipped results are excluded from the pass rate (which becomes `null` if nothing was faithfully verified), and the prominent accounting line reports the skip count.
So a batch can complete "successfully" with a null pass rate purely because
swebench was not installed — the patches were produced, but nothing was
authoritatively scored. Always check the faithful-verified / skipped counts in
the summary, not just that the run finished.
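The accounting rule can be illustrated with a short sketch: skipped results never enter the denominator, and the pass rate is null when nothing was verified. The record shape here is hypothetical (a `status` key with the values `resolved`, `unresolved`, or `verify_skipped`); only `resolved` and `verify_skipped` appear in the text above, and the real `summary.json` schema may differ.

```python
def faithful_accounting(results):
    """Summarise verdicts; skipped results never count as pass or fail."""
    verified = [r for r in results if r["status"] in ("resolved", "unresolved")]
    skipped = [r for r in results if r["status"] == "verify_skipped"]
    passed = [r for r in verified if r["status"] == "resolved"]
    # With nothing verified there is no denominator, so the rate is None (null).
    pass_rate = len(passed) / len(verified) if verified else None
    return {"verified": len(verified), "passed": len(passed),
            "skipped": len(skipped), "pass_rate": pass_rate}
```

A run in which every result is `verify_skipped` therefore reports zero verified and a `None` pass rate, which is the signal to check whether swebench is installed.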
See also
- Configure the sandbox — how container/host execution is bounded
- Manage permissions — pre-approval mechanics
- SWE-bench integration proposal — the original design record