IronText: A Long-Horizon Test for LLMs with an Exact Oracle

I built small text worlds where the optimal move is exactly computable, and ran claude-opus-4.8 against them with an exact Bayesian filter grading every turn.

It lost 63% of the episodes.

Classifying every loss by the model's stated belief at the fatal commitment gives one fingerprint in 20 of 22 cases: the model attributes a delayed observation to the reference frame at arrival time instead of issue time, so the value it commits is off by exactly −1. One observation-format change, provably adding zero information to the task, fixes every paired seed, 10/10.

The apparatus is called IronText. The end-to-end chain on one task family:

| step | finding | evidence |
|---|---|---|
| capture | 37% win rate vs oracle 1.00 | 30 paired rollouts, 3 vocabularies |
| mechanism | one bug: observations read in the arrival-time frame instead of issue-time, value off by exactly −1 | 20/22 losses |
| hazard | ~10% slip per irreversible commitment, compounding geometrically | depth sweep, 1 to 4 commitments |
| moderator | theming the numeric channel doubles the slip rate | masquerade 1/10 vs abstract 5/10 |
| ablation | stamp observations with their issue-time frame: 10/10 wins, all at certainty | paired A/B, zero information added |

The campaign is the spine of this post. The payoff comes after it: the four carry regimes are four different verbs, frontier models are missing exactly one of them, and the general skill is triage, knowing which verb a variable deserves.

What makes a task long-horizon?

Not reasoning depth. I measured symbolic deduction to depth 15 and found no gap. Not search either.

What breaks frontier models is carrying state. A task is long-horizon when the world requires information it will not re-transport, so the agent's own memory is the only carrier, and one carrying error is never corrected.

Why exact worlds

The instrument is a small POMDP rendered as text, hundreds to thousands of joint states, fully known to the harness. Three properties do all the work.

The optimal belief is exactly computable. An exact Bayesian filter, the oracle, runs beside every episode, so I can see turn by turn where the model's belief departs from the best achievable one.
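A minimal sketch of what one exact filter step looks like on a finite hypothesis set, assuming a dict-based belief and a caller-supplied likelihood. The names `bayes_update` and `support` and the one-digit toy world are illustrative assumptions, not the IronText harness.

```python
from fractions import Fraction

def bayes_update(belief, likelihood):
    """One exact filter step: reweight each hypothesis by P(obs | state), renormalize."""
    posterior = {s: p * likelihood(s) for s, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    # Drop dead hypotheses so the dict size is the support.
    return {s: p / total for s, p in posterior.items() if p > 0}

def support(belief):
    """Number of hypotheses still alive."""
    return len(belief)

# Toy world (hypothetical): one hidden digit in 0..3, uniform prior.
prior = {b: Fraction(1, 4) for b in range(4)}

# A noiseless observation "digit < 2" came back True.
post = bayes_update(prior, lambda b: 1 if b < 2 else 0)
print(support(prior), "->", support(post))
```

Exact rationals keep the posterior free of floating-point drift, which is affordable only because the hidden joint is small.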

The rules are told to the agent. The task is tracking-given-known-dynamics, never rule discovery. Failure is attributable to execution, not ignorance.

Worlds are cheap and infinite. A world factors into a certified mechanism, audited before any model token is spent, and an interchangeable skin: a vocabulary applied at the interface only, brief, menu strings, reading rendering. A skin never touches the mechanism, so the world a model plays is byte-identical to the world the audit certified. The certificate transfers to every skin for free, fresh items are unlimited, and there is nothing to memorize.

Why small?

Because exactness buys more than scale. The filter enumerates only the hidden joint, publicly known dimensions are pinned, and every claim downstream inherits the exactness. The design rule throughout: everything that needs trust is computed; everything an LLM touches is checked.

Hard-looking is not hard

The first lesson cost a graveyard of worlds. Hidden, moving state is necessary but nowhere near sufficient. I learned this by building hard-looking tasks and watching frontier models win them.

There are three escape hatches. Re-anchoring: if any observation can correct a wrong belief, errors wash out. Re-derivability: if state is overwritten rather than accumulated, it can be recomputed on demand; build state in a bugfix world is an overwriting latch, so bugfix tasks test planning, not tracking. Reconstructability: if the dynamics are deterministic and the history is in context, a chain-of-thought model does not track at all, it computes the state from the transcript. Opus solved a version-pinning world by writing the linear system over ℤ₄ in one turn. DeepSeek tracked 50 shell-game swaps in its scratchpad.

This is the architectural distinction Mozer, Siddiqui, and Liu draw: retrieving old context is not the same as maintaining an updated state, and a model that can reread everything recomputes rather than tracks whenever the transcript lets it.

This yields a four-way taxonomy of carries, decided by the oracle's support trajectory, the number of hypotheses still alive over time:

| regime | support trajectory | gaps a frontier model? |
|---|---|---|
| RECONSTRUCTABLE | stays at 1 (state is a function of history) | no, CoT computes it |
| DEFUSED | spreads, re-collapses when probed | no, a smart agent probes |
| IRREDUCIBLE | never collapses | yes, scored by belief quality |
| EDGE | balloons, collapses only under skilled probing | yes, clean ceiling, legible failures |

EDGE is the regime worth building: uncertainty loaded up front, killed slowly and only by well-chosen probes, collapsing just before the deadline.
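One way to read the taxonomy as a decision rule, sketched under assumptions: the post does not give the certifier's actual logic, so `classify_regime` and its use of two support trajectories (one under an unskilled policy, one under a skilled one) are a hypothetical rule of thumb.

```python
def classify_regime(naive_support, skilled_support):
    """Label a world from two oracle support trajectories (ints per turn).

    naive_support   : support under an unskilled policy, e.g. random probing
    skilled_support : support under a skilled policy, e.g. greedy info-gain
    Hypothetical heuristic, not the post's certifier.
    """
    if max(naive_support) == 1 and max(skilled_support) == 1:
        return "RECONSTRUCTABLE"   # state is a function of history
    if skilled_support[-1] > 1:
        return "IRREDUCIBLE"       # even skilled play never collapses
    if naive_support[-1] == 1:
        return "DEFUSED"           # any probing washes the uncertainty out
    return "EDGE"                  # collapses only under skilled probing

print(classify_regime([64, 64, 48, 40, 33], [64, 32, 16, 8, 4, 2, 1]))
print(classify_regime([8, 4, 2, 1], [8, 2, 1]))
```

The trajectories passed in are made-up illustrations, not measured runs.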

A task can only detect agency to the extent that probe value depends on current belief.

The graveyard's exploits are closed by construction. Irreversible commitments between probe phases stop a model from deferring all inference to the end of the transcript, and they price obliviousness: a blind policy survives $k$ $q$-way commitments with probability $q^{-k}$, no matter how good its probes. Counting feedback with no algebraic closure stops chain-of-thought from carrying a tidy summary, a coset or a permutation, instead of the belief itself.

One more dial. Observations pay down uncertainty at most $\log_2|\mathcal{Z}|$ bits each, which gives a budget tightness

$$\beta = \frac{\text{bits owed}}{\text{bits obtainable}} = \frac{n \log K}{(H - L)\,\log|\mathcal{Z}|}$$

for $n$ hidden variables of $K$ values, horizon $H$, and $L$ turns reserved for acting. $\beta > 1$ is unsolvable, $\beta \ll 1$ means sloppy play suffices, $\beta \lesssim 1$ is the edge: just-in-time resolution. Empirically exact on the flagship task: a 6-charge probe budget is winnable, 5 is not.
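A worked instance of the tightness formula, under two assumptions that are mine rather than the post's: every probe charge is one turn with a binary panel reading (so the observation alphabet has size 2), and the number of charges stands in for H − L. Surveys, which return counts, are ignored here.

```python
import math

def budget_tightness(n, K, probe_turns, obs_alphabet):
    """beta = n*log(K) / (probe_turns * log|Z|); probe_turns plays the role of H - L."""
    return (n * math.log2(K)) / (probe_turns * math.log2(obs_alphabet))

# Three hidden digits with K = 4 owe 3 * 2 = 6 bits.
for charges in (5, 6):
    print(charges, budget_tightness(n=3, K=4, probe_turns=charges, obs_alphabet=2))
```

Under those assumptions six charges give a tightness of exactly 1 and five give 1.2, which lines up with the winnable/unwinnable boundary reported for the flagship task.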

The corridor cascade

The flagship EDGE task. Three hidden digits, static, never emitted. An observed dial rotates every turn. Probes answer through the dial, doors consume what you know.

hidden:  b0  b1  b2              digits in 0..K-1, static, never emitted
dial:    0 -> 1 -> 2 -> 3 -> 0   observed, +1 mod K every turn

[ probe block 0 ]--door c=b0 mod q-->[ probe block 1 ]--door c=b1 mod q-->[ probe block 2 ]--open v

scan_i :  panel lights iff (b_i + dial) mod K < K/2    same probe, different phase, different bit
survey :  counting feedback over a pair                no algebraic closure, carry the set
door   :  one-way, permanent; beyond a wrong door every panel reads blank, and blank is ambiguous
open v :  wins iff v = b2 and every door was correct

The dial is the trap. The same scan is informative at one phase and redundant at another, so when you probe matters, and every reading must be remembered together with the frame it used. That pairing is the transport carry, and it is exactly what the campaign below catches a frontier model dropping.
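The phase dependence can be checked directly from the scan rule in the diagram. This is a small sketch for K = 4 with one hidden digit; `panel_lit` and `consistent` are illustrative helper names, and the rest of the world (surveys, doors) is left out.

```python
K = 4

def panel_lit(b, dial):
    """Scan rule from the diagram: lit iff (b + dial) mod K < K/2."""
    return (b + dial) % K < K / 2

def consistent(readings):
    """Digits consistent with a list of (dial_at_issue, lit) readings."""
    return [b for b in range(K) if all(panel_lit(b, d) == lit for d, lit in readings)]

truth = 3
# Scans issued one phase apart carry different bits and pin the digit.
adjacent = [(d, panel_lit(truth, d)) for d in (0, 1)]
# Scans issued two phases apart are complements: the second one is redundant.
opposite = [(d, panel_lit(truth, d)) for d in (0, 2)]

print(consistent(adjacent))   # [3]
print(consistent(opposite))   # [2, 3]
```

Note that each reading is stored as a pair with its issue-time dial; dropping the first element of the pair is what makes the reading uninterpretable later.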

Three reference policies bracket what the task can measure: random (does time alone resolve it?), greedy (cautious adaptive info-gain with certainty-gated commitments, sound by adaptive submodularity), and openloop (the same planner never conditioned on observations, the probe-script exploit as a policy). Acceptance at 3 blocks, K=4, q=2: greedy collapses support 64→1 in 8 steps and commits at certainty, openloop never dares a door, random never resolves. Exploration leverage ~15x, adaptivity leverage ~16x. That certificate is stamped on the world before any model plays it.

Capture: the gap

Everything below is the corridor cascade at 3 blocks, K=4, q=2, an 18-turn budget against the oracle's 8-step line, versus anthropic/claude-opus-4.8. The exact filter runs as a sidecar; the model self-reports beliefs and they are graded against it. The design is paired: the same seed produces the same hidden codes under every vocabulary.

| vocabulary | wins |
|---|---|
| abstract | 5/10 |
| masquerade (hand skin, themed values) | 1/10 |
| reactor (LLM-authored skin, raw values) | 5/10 |
| total | 11/30 (37%) |

The oracle wins 100% of these. The oblivious floor is $q^{-2} \cdot 1/4 \approx 6\%$: probe however you like, but without belief-dependent decisions you are guessing two binary doors and a final digit. 37% sits well above blind and nowhere near the ceiling. Interface noise is zero (invalid actions ≈ 0 after the parser fix described below), so the gap is cognitive, not mechanical.
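The floor is plain arithmetic; a short sketch, with `oblivious_floor` as a hypothetical helper name, makes the terms explicit for q = 2, two doors, and K = 4.

```python
from fractions import Fraction

def oblivious_floor(q, doors, K):
    """Win probability when every door and the final open are uniform guesses."""
    return Fraction(1, q) ** doors * Fraction(1, K)

floor = oblivious_floor(q=2, doors=2, K=4)
print(floor, float(floor))  # 1/16 0.0625
```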

Mechanism: one bug

Classifying every loss by the model's stated belief at the moment of the fatal commitment:

20 of 22 loss events share one fingerprint: the stated or committed value is exactly one dial-phase behind the truth (δ = −1 mod K).

The two exceptions are rational deadline guesses at residual uncertainty, random commits, not bookkeeping outputs. The bug:

turn t   : scan_1 issued          dial = 2
turn t+1 : panel result arrives   dial = 3

panel answers the frame at issue  : (b1 + 2) mod 4
model reads the frame at arrival  : (b1 + 3) mod 4
inferred b1 = truth - 1           : every time

The brief states the issue-time convention explicitly. The model reads the panel against the dial phase shown when the result arrives anyway, and shifting the frame by one shifts the inferred digit by exactly −1.
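The −1 shift follows from the scan rule alone and can be reproduced in a few lines. This is a toy sketch, assuming K = 4, two scans issued at dials 2 and 3, and results that arrive one turn later; `decode` and `infer` are illustrative names, not part of the harness.

```python
K = 4

def panel_lit(b, dial):
    """Scan rule: lit iff (b + dial) mod K < K/2."""
    return (b + dial) % K < K / 2

def decode(readings):
    """Unique digit consistent with (frame, lit) pairs, or None if ambiguous."""
    fits = [b for b in range(K) if all(panel_lit(b, f) == lit for f, lit in readings)]
    return fits[0] if len(fits) == 1 else None

def infer(truth, issue_dials, frame_shift):
    """Scan at each issue dial, then decode each reading against dial + frame_shift."""
    return decode([((d + frame_shift) % K, panel_lit(truth, d)) for d in issue_dials])

for truth in range(K):
    correct = infer(truth, (2, 3), frame_shift=0)   # issue-time frame
    slipped = infer(truth, (2, 3), frame_shift=1)   # arrival-time frame, one turn late
    print(truth, correct, slipped, (slipped - truth) % K)
```

For every hidden digit the issue-time decode recovers the truth, and the arrival-time decode lands on truth − 1 mod K (printed as 3 in the last column).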

Three corroborations pin it as a mechanism rather than noise:

  • Same seed, same wrong value, across vocabularies. Seed 2 states $b_1 = 3$ against a truth of 0 under both abstract and masquerade; seed 7 states $b_1 = 2$ against a truth of 3 under masquerade and reactor. The wrong posterior is a deterministic function of the observation sequence.
  • Full information does not protect. Two losses occurred with the oracle at support 1, $P(\text{true}) = 1.00$. The history fully determined the answer and the model opened truth − 1. The slip is in reading the evidence, not lacking it.
  • The convention degrades within an episode. The model used issue-time correctly at block 0 and silently switched to arrival-time at block 1. Stability falls with context depth, which is also where the slips cluster.

Hazard: geometric in commitments

Sweep the number of irreversible doors, 3 seeds per rung:

| blocks (commitments) | 2 (1) | 3 (2) | 4 (3) | 5 (4) |
|---|---|---|---|---|
| wins | 3/3 | 2/3 | 2/3 | 2/3 |
| $(1-p)^k$ at $p = 0.10$ | 0.90 | 0.81 | 0.73 | 0.66 |

Pooled over all non-masquerade runs, the slip rate is roughly 0.10 to 0.15 per irreversible commitment, flat across depth. Success composes as

$$P(\text{win}) \approx (1 - p)^k,$$

the long-horizon law in commitment units rather than turn units. The fingerprint holds at depth: the two new bookkeeping losses, doors 1 and 3, fifteen turns in, are both δ = −1. And the model's decision layer stayed sound throughout. It derived the foreclosure hypothesis from blank readings and resigned rationally when doomed.
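The geometric law is easy to tabulate. The sketch below evaluates it at both ends of the pooled slip range quoted above; `win_probability` is an illustrative name.

```python
def win_probability(p, k):
    """P(win) ~= (1 - p)^k for k irreversible commitments at per-commitment slip rate p."""
    return (1 - p) ** k

for k in range(1, 5):
    print(k, round(win_probability(0.10, k), 2), round(win_probability(0.15, k), 2))
```

The p = 0.10 column reproduces the 0.90, 0.81, 0.73, 0.66 row of the sweep table.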

The decision layer is sound. The convention layer slips.

Moderator: the vocabulary effect

Masquerade lost 9 of 10; abstract and reactor lost 10 of 20 combined (Fisher exact p ≈ 0.04). The discriminating feature is not theming per se. The reactor skin renames actions and keys but keeps raw value labels, and it matches abstract exactly, 5/10 both. Masquerade themes the numeric value channel itself: "round 2", "1 nods", "chosen 1". It is also the only condition producing door-0 slips, four of them. Value-channel translation both raises the slip rate and pushes slips earlier.

Theming actions and keys is free. Theming the numeric channel roughly doubles the hazard.

That is a conjecture from 30 runs, now testable at scale, and the experiment exists at all because skins provably cannot touch the mechanism: same world bytes, same certificate, same seeds.
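The Fisher figure can be recomputed from the 2x2 table with the standard library. The post does not say whether its test was one-sided or two-sided; the sketch below assumes the one-sided tail (masquerade winning 1 or fewer of 10 with margins fixed), and `fisher_one_sided` is a hypothetical helper name.

```python
from math import comb

def fisher_one_sided(wins_a, n_a, wins_b, n_b):
    """P(group A gets wins_a or fewer wins) with all margins fixed (hypergeometric tail)."""
    total_wins = wins_a + wins_b
    n = n_a + n_b
    denom = comb(n, n_a)
    tail = sum(
        comb(total_wins, k) * comb(n - total_wins, n_a - k)
        for k in range(0, wins_a + 1)
    )
    return tail / denom

# Masquerade: 1 win in 10. Abstract and reactor pooled: 10 wins in 20.
p = fisher_one_sided(wins_a=1, n_a=10, wins_b=10, n_b=20)
print(round(p, 3))  # 0.037
```

Under that assumption the tail probability rounds to the quoted 0.04.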

Ablation: the stamp

One spec knob, stamped=true. The panel state carries its frame-at-issue and the reading renders as 1@phase2 instead of 1. This deletes exactly the identified carry, remembering when each scan was issued, and provably adds no information to the task: the exact filter already knows the dial history, and the acceptance certificate is byte-identical.

unstamped: greedy 1.00 (t*=8)  openloop 16.00  random 15.12  leverage 15.1x / 16.0x
stamped:   greedy 1.00 (t*=8)  openloop 16.00  random 15.12  leverage 15.1x / 16.0x

The first knob that moves the agent without moving the oracle. A pure cognitive-load dial. Paired A/B on the same ten seeds:

| seed | unstamped | stamped |
|---|---|---|
| 0, 1, 3, 5, 6 | WIN | WIN |
| 2, 4, 7, 8, 9 | LOSS | WIN |

10/10 wins, every commitment at oracle $P(\text{true}) = 1.00$, no gambles, no resignations, mean 10.3 steps against the 8-step oracle line. Five losses fixed, zero regressions (paired exact p ≈ 0.03). Even the one loss that never fit the fingerprint, a blind open after 4 turns, disappeared.
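The paired p-value matches an exact one-sided sign test on the discordant seeds, which is my assumption about the test used; the post only says "paired exact". `sign_test_one_sided` is an illustrative name.

```python
from math import comb

def sign_test_one_sided(fixed, regressed):
    """Exact one-sided sign test on discordant pairs:
    P(at least `fixed` of them improve) if each pair were a fair coin."""
    n = fixed + regressed
    return sum(comb(n, k) for k in range(fixed, n + 1)) / 2 ** n

print(sign_test_one_sided(fixed=5, regressed=0))  # 0.03125
```

Seeds that won under both conditions carry no information in this test, so only the five LOSS-to-WIN pairs enter.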

The claim

Claude Opus 4.8 has a specific, reproducible convention instability. It attributes delayed observations to arrival-time rather than issue-time reference frames, firing at roughly 10% per irreversible commitment under load, roughly doubled by numeric-channel theming, invariant in mode across vocabularies, and removed by a single observation-format change that adds no information to the task.

Not "model X scores Y%". A captured gap with an identified mechanism, a measured hazard, a moderator, and a confirming ablation.

The practical rule survives outside the lab. Any agent harness with delayed tool results and a changing context has this exact shape: an async tool call issued under one context, resolving under another, with the model left to remember which was which.

Stamp observations with their issue-time context.
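A minimal sketch of that rule in a generic agent harness, assuming a monotonically increasing context version as the "frame". The `Harness` and `StampedResult` classes and the `@ctx` rendering are hypothetical, modeled on the `1@phase2` format from the ablation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StampedResult:
    call_id: str
    value: str
    issued_at: int     # context version when the call was issued
    arrived_at: int    # context version when the result came back

    def render(self):
        # Value plus issue-time frame, so the model never has to remember it.
        return f"{self.value}@ctx{self.issued_at}"

class Harness:
    """Hypothetical harness that remembers the context version at issue time."""
    def __init__(self):
        self.version = 0
        self._pending = {}

    def issue(self, call_id):
        self._pending[call_id] = self.version

    def tick(self):
        self.version += 1

    def resolve(self, call_id, value):
        return StampedResult(call_id, value, self._pending.pop(call_id), self.version)

h = Harness()
h.tick(); h.tick()          # context is at version 2
h.issue("scan_1")
h.tick()                    # context moves on before the result lands
result = h.resolve("scan_1", "1")
print(result.render())      # 1@ctx2
```

The stamp costs a few tokens per observation and moves the bookkeeping from the model's memory into the transcript.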

Four regimes, four verbs

The taxonomy from earlier is not just a filter for deciding which worlds are worth running. Read it again as a job description. Each regime demands a different verb from the agent, and the calibration anchors, the hand-built worlds that stay in the suite forever, measure one verb each.

| regime | the verb | anchor worlds | frontier models |
|---|---|---|---|
| RECONSTRUCTABLE | compute it from the transcript | room navigation, shell game, version pinning | have it |
| DEFUSED | probe it, let errors wash out | the vault, the gcc regression hunt | have it |
| EDGE | grind it to certainty before committing | the corridor cascade | almost: the ~10% slip above |
| IRREDUCIBLE | hold a distribution, act under it | the war campaign | missing: 0/3 |

The first two verbs are solved, and the section on hard-looking worlds already showed it: a scratchpad computes reconstructable state (the ℤ₄ repo solve, the 50 tracked swaps), and any world that lets probes correct a wrong belief gets defused (the vault's moon leaking through a retryable spell, gcc's regressions resurfacing in CI). These anchors gap nothing, and that is their job: they pin the known verdicts the certifier must reproduce.

The third verb is the campaign above: a player that probes adaptively, commits at certainty, and slips on one bookkeeping convention at roughly 10% of irreversible commitments.

The fourth verb is missing outright. The war campaign is the IRREDUCIBLE anchor: a coupled stochastic war economy under fog whose hidden front never resolves. Even the oracle commits under uncertainty, so episodes are scored by belief divergence and decision regret rather than win/lose. The task is solvable blind: a QMDP tracker using only the observable history wins 0.55 to 0.62 against a clairvoyant ceiling of 0.89. Three frontier models scored 0/3. Each maintained a monotonic point estimate where the task demanded a distribution; at one point Opus's front belief was 85% wrong while the exact filter's sat at 5%.
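To make "scored by belief divergence" concrete, here is a sketch that uses KL divergence in bits between the exact posterior and the agent's stated belief. The choice of KL, the smoothing constant, and the three-way front are all my assumptions; the post does not name the divergence it uses.

```python
import math

def kl_divergence(exact, stated, eps=1e-9):
    """KL(exact || stated) in bits over a shared finite hypothesis set.
    Missing or zero stated mass is floored at eps so the score stays finite."""
    return sum(
        p * math.log2(p / max(stated.get(h, 0.0), eps))
        for h, p in exact.items() if p > 0
    )

exact = {"north": 0.5, "south": 0.3, "east": 0.2}    # hypothetical filter posterior
held = {"north": 0.45, "south": 0.35, "east": 0.2}   # agent that carries a distribution
point = {"north": 1.0}                               # agent collapsed to a point estimate

print(kl_divergence(exact, held))
print(kl_divergence(exact, point))
```

A point estimate is punished heavily even when it names the most likely hypothesis, because it assigns no mass to the alternatives the filter still considers live.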

The corridor failure is a slip in an otherwise sound player. The war failure is a missing verb: no amount of stamping repairs an agent that will not carry a posterior.

General long-horizon competence is triage: compute what is reconstructable, probe what is defused, grind edge carries to certainty, and hold distributions over what never resolves.

Real tasks mix the regimes inside one world, and the mixture is where triage becomes measurable. The repo world is the cleanest miniature: build state is an overwriting latch, re-derivable on demand, so bugfix tasks test planning, not tracking; the genuine carry in a coding agent's life is version and lockfile drift. An agent that burns scratchpad on build state is spending its carry budget on a variable the world will hand back for free. The certifier already classifies per variable and scores a world by its worst one, so the next instrument is direct: worlds with one variable per regime, graded on where the model spends its memory. The same design is the curriculum endgame if these worlds train models instead of measuring them. A model rewarded across mixed-regime worlds is rewarded for triage itself, not for any one trick.

What else fell out

The oracle audits the harness too. The first reactor run "lost" with the model's prose claiming certainty while the oracle said P = 0.50. The disagreement exposed a parser bug: the action-JSON extractor choked on set-notation braces like b0∈{2,3}, which is exactly the explicit belief reasoning the task elicits. 8 of 9 invalid turns recovered after the fix, including two attempts to commit the correct answer. Every number above postdates the fix.

The authoring pipeline survived first contact. Worlds are generated by an LLM doing only constrained choice: a spec, schema-validated with error-feedback retries, and a skin, serialized as literal maps over finite label sets with bijectivity and coverage checked mechanically. The first LLM-authored world shipped a malformed skin field; the validators caught the shape error at the boundary, the mechanism and certificate were never at risk, and the artifact was repaired without re-certification.
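The mechanical skin checks can be sketched as a pure function over a literal map. This is an illustration under assumptions: `validate_skin`, the label set, and the display strings are hypothetical, and the real validators also check schema shape, which is omitted here.

```python
def validate_skin(skin, labels):
    """Check a skin (mechanism label -> display string) for coverage and bijectivity."""
    errors = []
    missing = set(labels) - set(skin)
    extra = set(skin) - set(labels)
    if missing:
        errors.append(f"uncovered labels: {sorted(missing)}")
    if extra:
        errors.append(f"unknown labels: {sorted(extra)}")
    # Two labels sharing one display string would make the rendering non-invertible.
    if len(set(skin.values())) != len(skin):
        errors.append("display strings collide, map is not injective")
    return errors

labels = ["scan_0", "scan_1", "door", "open"]
good = {"scan_0": "ping A", "scan_1": "ping B", "door": "seal hatch", "open": "vent"}
bad = {"scan_0": "ping", "scan_1": "ping", "door": "seal hatch"}

print(validate_skin(good, labels))  # []
print(validate_skin(bad, labels))
```

Because the check never reads the mechanism, a failing skin can be repaired and revalidated without touching the certificate.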

Limitations and open questions

  1. One model, one family, end to end. The full capture-to-ablation chain exists for Opus 4.8 on the corridor cascade. Other models may slip at a different rate, in a different place, or not at all; the war result suggests distribution maintenance is shared, but the δ = −1 fingerprint has not been hunted elsewhere yet.
  2. Small samples. 30 paired rollouts, 3 seeds per depth rung, 10 ablation seeds. The p-values are real but modest (0.04, 0.03). The numeric-channel conjecture in particular needs the generation pipeline running at scale.
  3. Self-reported beliefs. Loss classification reads the model's stated belief at the fatal commitment. The committed values are graded against ground truth, which anchors the fingerprint, but a model that tracks correctly and verbalizes sloppily would be misread.
  4. Hazard flatness is measured to 4 commitments. Whether $p$ stays flat at depth 10, where context length becomes its own variable, is open. The within-episode convention drift says it gets worse.
  5. Is the bug Opus-specific or architectural? A decoder LLM reading a delayed observation against the most recent context token is a plausible general failure mode. The stamp helps either way, but the science differs.

Cite this

@misc{omers2026irontext,
  title        = {IronText: A Long-Horizon Test for LLMs with an Exact Oracle},
  author       = {Kaya Omer},
  year         = {2026},
  month        = {June},
  url          = {https://andthattoo.dev/blog/irontext}
}

References