Scaffold: An External DSL for Autonomous Harness Optimization
Typed graph mutations beat free-form code generation for LLM pipeline search
An evolutionary optimizer, constrained to typed graph mutations, started with a single LLM call and autonomously discovered a 5-step classification pipeline — combining shortlist-select with critique-repair without being told these patterns work together. On a 247-class benchmark across 3 domains, it scores 77.1% where Meta-Harness (Lee et al., 2026), a coding-agent approach searching over 40 candidates, scores 48.6%.
| | Meta-Harness | Scaffold |
|---|---|---|
| Avg accuracy | 48.6% | 77.1% |
| Candidates searched | 40 | 9 |
| LLM calls/case (inference) | 1 | 5 |
Scaffold loses on one of the three datasets (79.3 vs 86.8 on Symptom2Disease). This post covers what the optimizer found, how it searches, and where the approach breaks down.
The Setup
Three classification datasets, merged into one:
| Dataset | Classes | Domain |
|---|---|---|
| LawBench | 215 | Legal charge classification |
| Symptom2Disease (S2D) | 22 | Medical symptom → disease |
| USPTO-50k | 10 | Chemical reaction class |
247 total classes. No domain indicator in the input — the model sees raw text and a flat list of all 247 labels. The text could be a legal statute, a patient symptom description, or a chemical reaction.
The evaluation is exact match: the model's output must be character-for-character identical to the ground-truth label. Partial credit doesn't exist.
The seed graph
This is what the optimizer started with — a single LLM call:
graph classify {
in: ClassifyInput
out: string
step result = classify_text(
task_text: input.task_text,
label_guide: input.label_guide
)
emit result
}
The classify_text node uses a straightforward prompt:
Classify the following text into exactly one of the categories listed below.
## Text
{{ task_text }}
## Categories
{{ label_guide }}
## Instructions
Output ONLY the exact category label from the list above.
Do not include any explanation, punctuation, or formatting.
One node, one call, one output. The optimizer's job: mutate this graph to improve accuracy on training cases.
A note on data leakage: The graph input is { task_text, label_guide }. The ground truth label expected.label appears only in the checker expression (output == expected.label), never in the graph's input fields. The optimizer sees pass/fail signals but the pipeline itself never sees the answer.
┌─────────┐ ┌───────────────┐ ┌────────┐
│ input │─────▶│ classify_text │─────▶│ output │
└─────────┘ └───────────────┘ └────────┘
(1 LLM call)
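The train/test separation described in the leakage note can be sketched in a few lines. This is an illustrative harness, not Scaffold's actual evaluation code; `run_graph` and `Case` are hypothetical names. The point is structural: the checker closes over the expected label, but the graph only ever receives the two input fields.

```python
# Sketch of the train-time evaluation loop (illustrative names only).
from dataclasses import dataclass

@dataclass
class Case:
    task_text: str
    label_guide: str
    expected_label: str  # ground truth: visible to the checker, never to the graph

def evaluate(run_graph, cases):
    """Return a pass/fail list; the pipeline never sees expected_label."""
    results = []
    for case in cases:
        graph_input = {"task_text": case.task_text,
                       "label_guide": case.label_guide}
        output = run_graph(graph_input)                 # pipeline under test
        results.append(output == case.expected_label)   # checker: exact match
    return results
```

Note that the comparison is raw string equality, which is why a normalize step later matters: a correct label with wrong casing still scores zero.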
What the Optimizer Discovered
After 9 candidates (3 seed evaluations, 2 tunable sweeps, 4 evolutionary mutations), the optimizer produced this graph:
graph classify {
in: ClassifyInput
out: string
step result_proposed_shortlist = _motif.result_proposed.shortlist(
input: input, task_text: input.task_text, label_guide: input.label_guide)
step result_proposed_choose = _motif.result_proposed.choose(
input: input, candidates: result_proposed_shortlist,
task_text: input.task_text, label_guide: input.label_guide)
step result_proposed = _motif.result_proposed.normalize(
input: input, raw: result_proposed_choose, label_guide: input.label_guide)
step result_critique = _motif.result.critique(
input: input, proposed: result_proposed,
task_text: input.task_text, label_guide: input.label_guide)
step result = _motif.result.repair(
input: input, proposed: result_proposed, critique: result_critique,
task_text: input.task_text, label_guide: input.label_guide)
emit result
}
Five steps, assembled from two motifs:
ShortlistSelect motif VoteCritiqueRepair motif
┌─────────────────────────────────┐ ┌──────────────────────────┐
│ │ │ │
┌───────┐ │ ┌───────────┐ ┌────────┐ ┌─────────┐ │ ┌──────────┐ ┌────────┐ │ ┌────────┐
│ input │──▶│ │ shortlist │─▶│ choose │─▶│normalize│──▶│ │ critique │─▶│ repair │─▶──▶│ output │
└───────┘ │ │ (LLM) │ │ (LLM) │ │ (LLM) │ │ │ (LLM) │ │ (Python)│ │ └────────┘
│ └───────────┘ └────────┘ └─────────┘ │ └──────────┘ └────────┘ │
└─────────────────────────────────┘ └──────────────────────────┘
3 LLM calls 1 LLM + 1 tool call
Step-by-step
- Shortlist (LLM): Read the text, pick the top-K most likely labels from the full 247. Narrows the search space.
- Choose (LLM): Given only the shortlisted candidates, pick the best one. Deeper analysis on a smaller set.
- Normalize (LLM): Clean up the output — fix spelling, casing, truncation. Exact match is unforgiving.
- Critique (LLM): A "skeptical classification auditor" that independently re-reads the text and challenges the proposed label. Outputs structured JSON:
{"verdict": "keep" | "change", "better_label": "...", "confidence": 0.0-1.0}. - Repair (Python): A deterministic gate. If the critique says "change," use the new label. If "keep," pass through. No LLM involved — just JSON parsing and a conditional.
The critique prompt is worth reading in full, because the meta-agent wrote it:
You are a skeptical classification auditor. Your job is to CHALLENGE the
proposed classification and determine if a better option exists.
## Audit Protocol
Do NOT assume the proposed label is correct. Follow these steps:
1. IGNORE the proposed label initially. Read the text fresh and independently
identify the 3 most likely categories from the list.
2. For EACH of your top 3 candidates, list the specific evidence from the
text that supports it.
3. Now compare: does the proposed label match your top choice? If not, which
category has the STRONGEST specific evidence?
4. Watch for these common errors:
- Generic/broad category chosen when a more SPECIFIC subcategory exists
- Superficially similar categories confused
- Truncated or incomplete label names that miss qualifying details
5. Check that the proposed label is written EXACTLY as it appears in the
category list — character for character, word for word.
After your analysis, output ONLY this JSON on the final line:
{"verdict": "keep" or "change", "better_label": "the exact correct label
copied character-by-character from the category list", "confidence": 0.0-1.0}
This is not a generic "think harder" prompt. It has a specific audit protocol targeting the failure modes this task actually exhibits: label confusion, label truncation, and overly broad classification. The meta-agent wrote it after seeing which cases were failing and how.
"Isn't this just chain-of-thought?" No. Each step has a different functional role. The shortlist node reduces the label space. The normalize node fixes formatting. The repair node is deterministic Python — it doesn't hallucinate, it parses JSON and branches. Chain-of-thought is a single model reasoning aloud; this is a pipeline with typed interfaces between steps.
The key insight — narrowing 247 labels to a shortlist before making a decision — is not novel. It's standard information retrieval: retrieve, then re-rank. But it was discovered from failure analysis, not prescribed. The optimizer saw label confusion in the training set, the motif library offered ShortlistSelect as a response to that failure pattern, and the meta-agent composed it with VoteCritiqueRepair.
How Scaffold Searches
Scaffold's optimizer is a meta-agent (an LLM) that proposes typed mutations to a directed graph. It is not a coding agent. It doesn't write Python files or bash scripts. It operates within a DSL that constrains every candidate to be a valid, type-checked graph.
The search space
Every Scaffold pipeline is a typed directed graph: nodes have declared input/output types, steps connect them, and an emit statement produces the final output. The optimizer mutates these graphs using 18 mutation types:
| Category | Mutations |
|---|---|
| Content | rewrite_prompt, rewrite_system, rewrite_shell, rewrite_tool_spec, attach_example_policy, add_local_checker |
| Structural | insert_step, remove_step, replace_component, insert_verify, wrap_retry, fan_out_parallel, replace_with_subgraph, add_prompt_step, set_config |
| Decomposition | propose_decomposition (applies a motif from the library) |
| Recombination | Cross-candidate merging of topology + overrides |
The decomposition mutations draw from 6 motifs: NormalizeVerify, ShortlistSelect, RouterExpert, RetrieveDecide, VoteCritiqueRepair, and GenerateValidateRefine. These are graph transformations — not prompt templates. ShortlistSelect replaces one step with three (shortlist → choose → normalize). VoteCritiqueRepair appends two steps after the target (critique → repair). The meta-agent chooses which motif to apply and to which step, but the transformation itself is deterministic.
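A deterministic motif rewrite can be sketched against a toy graph representation. Scaffold's real IR is richer and type-checked; the `Step` data model here is illustrative. What the sketch shows is the division of labor: the meta-agent only picks the target step, and the transformation itself is fixed code.

```python
# Sketch of ShortlistSelect as a deterministic graph rewrite (toy IR).
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    node: str        # which component the step invokes
    inputs: tuple    # names of upstream values it consumes

def apply_shortlist_select(steps, target: str):
    """Replace `target` with shortlist -> choose -> normalize."""
    out = []
    for s in steps:
        if s.name != target:
            out.append(s)
            continue
        out += [
            Step(f"{target}_shortlist", "shortlist", s.inputs),
            Step(f"{target}_choose", "choose",
                 (f"{target}_shortlist",) + s.inputs),
            # Final step keeps the old name so downstream references resolve.
            Step(target, "normalize", (f"{target}_choose",)),
        ]
    return out
```

Applying this to the one-step seed graph yields exactly the `result_proposed_shortlist` / `choose` / `normalize` chain seen in the optimized graph (modulo naming).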
What the meta-agent sees
This is where Scaffold differs most from a coding agent. Instead of a filesystem full of logs, the meta-agent receives a structured context of roughly 16k tokens containing:
- Pass/fail matrix: Which cases passed and failed, grouped by failure cluster
- Failure clusters: The most common wrong answers and which inputs produced them
- Step-transition attribution: Which pipeline step introduced the error (correct before step N, wrong after)
- Motif suggestions: Pre-computed recommendations mapping failure patterns to motifs
- Prior attempts: Which mutations were already tried, their scores, and whether they used custom templates
- Saturated families: Which mutation categories have been exhausted (blocked from re-proposal)
The meta-agent reads this and proposes a single JSON mutation. The optimizer validates it against the type system, applies it to the graph IR, re-verifies the resulting graph, and evaluates it on training cases. If the candidate is semantically identical to one already evaluated (via content hash), it's deduplicated without re-evaluation.
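Content-hash deduplication can be sketched as a memo table keyed on a canonical serialization of the candidate. This is an assumption-laden illustration: Scaffold's actual hash covers its full graph IR (topology, prompt content, runtime overrides), while this toy version hashes a plain dict.

```python
# Sketch of semantic deduplication by content hash (illustrative).
import hashlib
import json

def semantic_hash(candidate: dict) -> str:
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(candidate, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

evaluated: dict[str, float] = {}

def evaluate_once(candidate: dict, run_eval) -> float:
    key = semantic_hash(candidate)
    if key in evaluated:
        return evaluated[key]   # duplicate: reuse the score, skip evaluation
    score = run_eval(candidate)
    evaluated[key] = score
    return score
```

Two mutations that produce the same graph content cost one evaluation, not two, which is part of why 9 candidates can cover as much useful ground as a larger, noisier pool.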
Meta-Harness: A different approach
Meta-Harness (Lee et al., 2026) takes the opposite approach. It gives a coding agent access to a workspace with Python files, evaluation scripts, and a conversation history that includes full execution logs. The agent writes and edits code freely — no DSL, no type constraints. Its context per step is roughly 10M tokens (the full filesystem state plus conversation history).
┌─── Meta-Harness ──────────────────┐ ┌─── Scaffold ─────────────────────┐
│ │ │ │
│ Coding Agent │ │ Meta-Agent │
│ ↕ │ │ ↕ │
│ Python files + logs + history │ │ ~16k tokens structured context │
│ (~10M tokens context) │ │ (pass/fail, clusters, motifs) │
│ ↕ │ │ ↕ │
│ Edit code → run → read logs │ │ Propose typed mutation → validate│
│ ↕ │ │ ↕ │
│ 40 candidates, any code │ │ 9 candidates, valid graphs only │
│ │ │ │
└────────────────────────────────────┘ └───────────────────────────────────┘
The tradeoff is clear: Meta-Harness has a larger search space (arbitrary code), but Scaffold has a more efficient one (every candidate is guaranteed valid and semantically deduplicated).
"Why not DSPy?" DSPy optimizes prompts within a fixed topology. If your pipeline is retrieve → generate, DSPy will find better prompts for those two steps. But it won't propose adding a normalize step, or replacing the pipeline with a shortlist-select motif. Scaffold searches topology and content jointly. The ShortlistSelect motif was a structural change — a mutation DSPy can't express.
"Is the DSL necessary?" The DSL buys two things. First, validity: every candidate is a well-typed graph that passes IR verification before evaluation. Zero budget is wasted on candidates that crash, produce type errors, or have broken imports. Second, deduplication: candidates with identical semantic hashes (same graph structure + same prompt content + same runtime overrides) are detected and skipped. Of Scaffold's 9 candidates, all 9 were unique and valid. Of Meta-Harness's 40 candidates, it's unknown how many failed at runtime.
The Results, Honestly
Accuracy
| Harness | USPTO (10 cls) | S2D (22 cls) | LawBench (215 cls) | Avg |
|---|---|---|---|---|
| Zero-shot | 12.0 | 63.2 | 7.0 | 27.4 |
| Few-shot (all) | 15.0 | 78.3 | 29.0 | 40.8 |
| ACE | 16.0 | 77.8 | 29.0 | 40.9 |
| Meta-Harness | 14.0 | 86.8 | 45.0 | 48.6 |
| Scaffold | 83.5 | 79.3 | 68.9 | 77.1 |
Cost
| | Meta-Harness | Scaffold |
|---|---|---|
| Candidates searched | 40 | 9 |
| LLM calls per case (inference) | 1 | 5 (4 prompt + 1 tool gate) |
| Context per optimizer step | ~10M tokens | ~16k tokens |
There are things to celebrate here and things to scrutinize.
Scaffold wins on USPTO and LawBench by wide margins. USPTO: 83.5 vs 14.0. LawBench: 68.9 vs 45.0. On these tasks, the multi-step pipeline — particularly the shortlist narrowing — makes a large difference. With 215 classes in LawBench, asking a model to pick one from a flat list in a single call is asking a lot. Narrowing to a shortlist first helps.
Meta-Harness wins on S2D: 86.8 vs 79.3. This is a 7.5-point loss for Scaffold. The tempting explanation is "small label space" — but Scaffold wins massively on USPTO (83.5 vs 14.0) which has only 10 classes. Label count doesn't explain it. The honest answer is that Meta-Harness does better on S2D and there isn't a clean explanation for why.
The USPTO anomaly. Meta-Harness scores 14.0% on USPTO — close to random for 10 classes. Zero-shot gets 12.0% and few-shot gets 15.0%, so the baselines are also poor, but Meta-Harness doesn't improve over them. The cause is unknown. It could be a bug in the generated code, a data format issue, or a genuine failure of the approach on this domain. This single result drives a large portion of the aggregate gap, so it deserves scrutiny rather than explanation.
9 vs 40 candidates. Scaffold evaluated fewer candidates because every candidate was valid and semantically unique. There was no budget spent on broken graphs or duplicate configurations. Whether this is "fairer" depends on what you're measuring — if you're measuring optimizer efficiency, fewer-is-better. If you're measuring wall-clock cost, Scaffold's candidates are more expensive to evaluate (5 calls each).
5x inference cost. Scaffold's pipeline makes 5 calls per classification (4 LLM + 1 local Python tool). Meta-Harness makes 1. At scale, this is a 5x multiplier on inference cost. If you're classifying millions of documents and optimizing for cost, the 5-step pipeline is not the right choice. If you're classifying thousands where accuracy matters more than cost, it might be.
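The cost tradeoff is simple arithmetic. The price per call below is hypothetical (real costs depend on token counts per step), and note that only 4 of Scaffold's 5 steps are LLM calls; the repair gate is local Python and effectively free.

```python
# Back-of-envelope inference cost under a flat (hypothetical) price per call.
def inference_cost(docs: int, llm_calls_per_doc: int, price_per_call: float) -> float:
    return docs * llm_calls_per_doc * price_per_call

PRICE = 0.002  # hypothetical $/call, for illustration only

single = inference_cost(1_000_000, 1, PRICE)    # one call per doc
pipeline = inference_cost(1_000_000, 4, PRICE)  # 4 LLM calls; tool gate is free
```

At a million documents, the pipeline's LLM bill is 4x the single-call bill regardless of the price you plug in, which is why the accuracy gain has to justify itself per use case.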
Content mutations overfit; topology mutations generalize. A second run on the same benchmark reached 0.80 on training — higher than the 0.77 run — but scored only 0.66 on the held-out test set. The difference: this run discovered VoteCritiqueRepair but not ShortlistSelect, relying on prompt rewrites to the critique template to push training accuracy. Those rewrites didn't transfer. The run that generalized was the one that made a structural change (narrowing 247 labels to a shortlist), not the one that wrote better prompts for a fixed topology.
Why Constrained Search Finds Different Things
Three mechanisms explain why Scaffold and Meta-Harness converge on different solutions:
1. Motifs as inductive bias
Scaffold's motif library contains 6 patterns drawn from information retrieval and prompting literature: shortlist-select, normalize-verify, router-expert, retrieve-decide, vote-critique-repair, and generate-validate-refine. When the optimizer sees a failure cluster (e.g., many cases confused between similar labels), the motif suggestion system maps it to a specific transformation (ShortlistSelect). This is not arbitrary code generation — it's pattern matching from established techniques.
The meta-agent isn't inventing new algorithms. It's selecting from a curated set and composing them. The composition (ShortlistSelect + VoteCritiqueRepair applied to the same pipeline) is what produces the novel 5-step structure.
2. Failure attribution drives mutations
Scaffold pre-computes structured diagnostics before each optimizer step. Step-transition attribution identifies which step corrupted a previously correct answer (correct → wrong) and which step repaired a broken one (wrong → correct). Failure clusters group wrong answers by their actual output, revealing systematic errors.
This means the meta-agent knows not just that the pipeline fails on 30% of cases, but that 12 cases produce the label "Fraud" when they should produce "Securities Fraud" (a specificity error), and that the corruption happens at the choose step. The mutation it proposes — rewriting the choose prompt to emphasize specific subcategories — is targeted, not random.
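Both diagnostics can be sketched over simple trace records. The data shapes here are illustrative, not Scaffold's real trace format: failure clusters group wrong answers by what the pipeline actually emitted, and step-transition attribution walks the per-step values in pipeline order to find the first step whose output diverges from the expected label.

```python
# Sketch of failure clustering and step-transition attribution (toy traces).
from collections import Counter

def failure_clusters(records):
    """records: [(expected, actual)] -> wrong outputs, most common first."""
    wrong = [actual for expected, actual in records if actual != expected]
    return Counter(wrong).most_common()

def corrupting_step(step_outputs, expected):
    """step_outputs: [(step_name, value)] in pipeline order.
    Return the first step whose output stops matching `expected`."""
    for name, value in step_outputs:
        if value != expected:
            return name          # correct before this step, wrong after it
    return None                  # never corrupted
```

In the real pipeline intermediate values are richer than bare labels (the shortlist step emits a candidate list, for instance), so real attribution compares each step's best candidate rather than raw equality; the control flow is the same.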
3. Every candidate is valid
The type system guarantees that every proposed mutation produces a well-formed graph. Insert a step with mismatched types? Rejected at IR verification, before any LLM call. Reference a nonexistent variable? Caught by the scope checker. This means the optimizer never wastes an evaluation on a candidate that would crash at runtime.
In contrast, a coding agent can produce syntactically valid Python that fails at runtime — an import error, a type mismatch, a missing file. Each failed candidate costs an evaluation cycle.
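The verification pass can be sketched as a single walk over the graph. Scaffold's real verifier checks its full IR; this toy version checks only the two properties named above: every referenced value is in scope, and its type matches what the consuming step requires.

```python
# Sketch of pre-evaluation graph verification (toy type/scope checker).
def verify(steps, env):
    """steps: [(name, out_type, [(input_ref, required_type)])], in order.
    env: graph input name -> type. Raises on the first violation."""
    scope = dict(env)
    for name, out_type, inputs in steps:
        for ref, required in inputs:
            if ref not in scope:
                raise ValueError(f"{name}: '{ref}' not in scope")    # scope check
            if scope[ref] != required:
                raise ValueError(f"{name}: '{ref}' is {scope[ref]}, "
                                 f"expected {required}")             # type check
        scope[name] = out_type   # step output becomes available downstream
    return True
```

A mutation that references a nonexistent variable or wires a mismatched type fails here, before any LLM call is spent on it; a coding agent's equivalent bug surfaces only at runtime, after the evaluation budget is already committed.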
The counterargument
The DSL trades ceiling for floor. A coding agent can, in principle, write any code: a custom embedding-based retrieval system, a statistical calibration layer, a domain-specific regex preprocessor. Scaffold can only express what its motif library and mutation types support. If the optimal solution requires a technique outside the 6 motifs and 18 mutations, Scaffold can't find it.
This is a real limitation, not a theoretical one. Meta-Harness's 86.8% on S2D may reflect exactly this: the coding agent found a solution that doesn't decompose into Scaffold's motif vocabulary. This can't be verified without seeing Meta-Harness's generated code, but it's a plausible explanation.
The bet Scaffold makes is that for most tasks, composing well-established patterns is more reliable than inventing new ones from scratch. This bet pays off when the search space is large (247 classes, multiple failure modes) and fails when the task is simple enough that established patterns add overhead without benefit.
Limitations
To be explicit about what this evaluation does not show:
- 5x inference cost. The pipeline makes 5 calls per input. For high-volume classification, this is likely prohibitive. A production deployment would need to weigh accuracy gains against cost — and for many use cases, a single well-prompted call is the right answer.
- S2D regression. Scaffold loses to Meta-Harness by 7.5 points on S2D. Multi-step pipelines are not universally better. The cause isn't label count (Scaffold wins on 10-class USPTO), and we don't have a satisfying explanation.
- Fixed motif library. The 6 motifs are hand-designed. If the optimal solution requires a pattern not in the library (e.g., self-consistency voting, ensembling across model providers, RAG with external knowledge), Scaffold can't express it. Extending the library is future work.
- Single model. All results use GPT-OSS-120B via OpenRouter. There is no evidence these results transfer to other models. The pipeline structure and prompt content were optimized for this specific model's behavior.
- Small evaluation set. The merged dataset has 514 cases (split 28/16/56 for train/val/test). This is enough to see trends but not enough to make strong statistical claims. Confidence intervals would be wide.
- No regression gate. The optimizer lacks an automated gate that detects per-domain regressions. It could improve LawBench while regressing S2D and select that candidate anyway. We've implemented this feature but haven't tested it on this benchmark yet.
- Protocol difference. Meta-Harness and Scaffold operate under different optimization protocols. Meta-Harness is an online coding agent with filesystem access. Scaffold is an offline evolutionary optimizer with structured context. Direct comparison has inherent limitations — we're comparing the outcomes of two different paradigms, not running a controlled ablation.
Takeaways
Search space design matters as much as search algorithm. The difference between Scaffold and Meta-Harness isn't primarily about which LLM is smarter at optimization. It's about the space of candidates each system can express and how efficiently it navigates that space. A constrained, well-structured space (typed graphs, motif library, semantic dedup) produces valid candidates more reliably than an unconstrained one (arbitrary code).
Multi-step pipelines help on some tasks, not all. Scaffold wins on USPTO (10 classes) and LawBench (215 classes) but loses on S2D (22 classes). Label count alone doesn't predict when the pipeline helps. The optimizer's job is to find the right structure for the task — and sometimes the right structure is a single call.
Mixing LLM and deterministic steps is valuable. The repair gate is 15 lines of Python. It doesn't hallucinate, doesn't need prompt engineering, and adds negligible latency. The critique node produces structured JSON; the repair gate parses it and branches. This pattern — LLM for judgment, code for action — recurs across effective pipelines.
Structured feedback beats raw logs. A ~16k token context with pre-computed failure clusters and step-transition attribution gives the meta-agent more actionable signal than full filesystem context. Compression is not loss when it's the right compression.
Scaffold is open source. The seed graph, optimized output, and evaluation datasets from this comparison are in the repository. Reproduction is encouraged — scrutiny of the results is welcome, particularly the S2D regression and the USPTO anomaly.
github.com/andthattoo/scaffold
References
- Lee et al., 2026. Meta-Harness: LLM-Driven Harness Optimization. yoonholee.com/meta-harness