Structured CoT: Shorter Reasoning with a Grammar File

Qwen3.6-27B made local models feel frontier-ish again on consumer-grade hardware. The model is crazy good for its size. The big problem I have with Qwen reasoning models in general is overthinking.

Yes, it thinks a lot.

Longer thinking helps reasoning. Chain-of-thought prompting showed that explicitly written intermediate reasoning can improve large model performance (Wei et al., 2022).

But for local inference, the blocker is the real-time feel. When the model overthinks, latency goes up, tokens-per-joule gets worse, and the session feels heavy. no_think does not work reliably for me, and it can hurt performance.

I tried a dumb inference-time trick: force a reasoning model's think block to follow a tiny grammar.

It started as a training idea, but I kept telling myself to explore the path of least resistance first.

The mechanism is standard guided generation: constrain decoding with a grammar / finite-state machine rather than asking the model nicely. Outlines frames neural text generation as transitions through an FSM or parser state and precomputes valid token transitions over the vocabulary (Willard & Louf, 2023). I am using that old idea in a narrower place: not to make JSON valid, but to make the model's private scratchpad short.
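Mechanically, the constraint is a per-step logit mask: the grammar (or FSM) state says which tokens are legal next, everything else is set to negative infinity before sampling, and the state advances with the sampled token. A toy sketch, with a hypothetical `allowed_token_ids` set standing in for the precomputed FSM/parser lookup:

```python
import numpy as np

def constrained_step(logits: np.ndarray, allowed_token_ids: set[int]) -> np.ndarray:
    """Mask every token the current grammar/FSM state does not allow."""
    masked = np.full_like(logits, -np.inf)
    idx = list(allowed_token_ids)
    masked[idx] = logits[idx]
    return masked

# Per decoding step: ask the grammar which tokens are legal in its current
# state, mask the rest, sample, then advance the state with the sampled token.
# The prompt never changes; the model simply cannot leave the allowed language.
logits = np.random.randn(151_936).astype(np.float32)  # vocab-sized logits (size illustrative)
allowed = {11, 1906, 27}                               # whatever the grammar state permits here
masked = constrained_step(logits, allowed)
```

For the scratchpad, the grammar itself is tiny: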

```
root  ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n\n"
line  ::= [^\n]+ "\n"
code  ::= [\x09\x0A\x0D\x20-\x7E]+
```
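Wiring it up through llama-cpp-python takes a few lines. A minimal sketch; the model path, prompt, and sampling settings are illustrative, not the exact harness:

```python
from llama_cpp import Llama, LlamaGrammar

GBNF = r"""
root  ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n\n"
line  ::= [^\n]+ "\n"
code  ::= [\x09\x0A\x0D\x20-\x7E]+
"""

# Illustrative settings; the runs below used the Q4_K_M GGUF on a single RTX 6000.
llm = Llama(model_path="Qwen3.6-35B-A3B-Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=8192)
grammar = LlamaGrammar.from_string(GBNF)

prompt = "Write a Python function two_sum(nums, target) returning indices of the two numbers that add to target."
out = llm(prompt, grammar=grammar, max_tokens=2048, temperature=0.6)
print(out["choices"][0]["text"])  # starts with "<think>\nGOAL: ...", then the code channel
```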

On HumanEval+ (164 problems), using unsloth/Qwen3.6-35B-A3B-GGUF at Q4_K_M on a single RTX 6000 through llama-cpp-python, this reduced explicit thinking tokens by 22x on average with no performance loss in this run.

| Mode | pass@1 | mean thinking tokens | mean total tokens |
|---|---|---|---|
| Free-form think | 92.1% | 3087 | 3410 |
| Grammar-constrained GOAL/APPROACH/EDGE | 92.7% | 138 | 408 |
| Prompt-only terse GOAL/APPROACH/EDGE | 93.3% | 2298 | 2764 |
| FSM vs FREE delta | +0.6 pp | 22.4x shorter | 8.4x shorter |

Format restrictions usually make models worse. Tam et al. find a significant reasoning decline under constrained output formats (Tam et al., 2024). This result is a narrow counterexample: the answer remains permissive, but the reasoning scratchpad is compressed.

[Figure: free-form vs grammar-constrained generation, side by side, on the same problem]

No distillation. No fine-tuning. No reward model. A ~20-line GBNF grammar applied only to the reasoning block matches full-thinking accuracy while using an order of magnitude fewer tokens.

The prompt-only terse control slightly improves pass@1 on this run, but still spends 2298 thinking tokens on average. The grammar is what reliably enforces the compact regime.

I call this Structured Chain-of-Thought.

Naming note: Li et al. also use the name "Structured Chain-of-Thought" for code generation, where the model writes CoT in programming structures like sequence, branch, and loop (Li et al., 2025). This post uses the same plain-language phrase differently: grammar/FSM-constrained decoding for the scratchpad at inference time. The grammar can be swapped by domain; it is not code-specific.

A harder check

I was not fully convinced because HumanEval+ is contaminated and simple. The next check was a recent LiveCodeBench v6 LeetCode slice: 50 problems, contest_date >= 2025-01-01, public functional tests.

I used a richer grammar:

```
root  ::= think code
think ::= "<think>\n" "GOAL: " line "STATE: " line "ALGO: " line "EDGE: " line "VERIFY: " line "</think>\n\n"
line  ::= [^\n]+ "\n"
code  ::= [\x09\x0A\x0D\x20-\x7E]+
```

This is not "reasoning inside the circuitry." It is simpler than that: an inference-time harness that masks tokens at the sampler, sitting between the model's logits and the answer it emits.

LiveCodeBench matters here because HumanEval is old and heavily contaminated. I filtered to contest_date >= 2025-01-01 so the slice is more recent than most benchmark memorization pressure and closer to the model's release window. This does not prove zero contamination, but it is a better stress test than HumanEval+.
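For reference, a sketch of pulling that slice with the datasets library; the dataset id, version tag, and field names here are assumptions based on the public LiveCodeBench release, not a copy of my harness:

```python
from datetime import datetime
from datasets import load_dataset

# Dataset id / version tag / field names are assumed from the public release;
# adjust to whatever snapshot you actually have.
ds = load_dataset("livecodebench/code_generation_lite",
                  version_tag="release_v6",
                  split="test",
                  trust_remote_code=True)

cutoff = datetime(2025, 1, 1)
leetcode_2025 = [
    row for row in ds
    if str(row.get("platform", "")).lower() == "leetcode"
    and datetime.fromisoformat(str(row["contest_date"])[:19]) >= cutoff
]
print(len(leetcode_2025))  # 50 problems in the run reported here
```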

Budget: 16384 tokens, which is quite large for local compute.

| LiveCodeBench v6 mode | pass@1 | mean thinking tokens | mean total tokens | extraction issues |
|---|---|---|---|---|
| Free-form think | 50.0% | 11553 | 13632 | empty_code=18 |
| FSM_PLAN GOAL/STATE/ALGO/EDGE/VERIFY | 64.0% | 267 | 2743 | none |
| FSM_PLAN vs FREE delta | +14.0 pp | 43.3x shorter | 5.0x shorter | - |

This is a LiveCodeBench public-test result, not an official private leaderboard score. The improvement is partly mechanical, but not only empty-code recovery. FREE had 25 failures: empty_code=18, syntax_error=4, missing_entry_point=1, timeout=1, wrong_answer=1. The grammar often forces the model to leave the scratchpad and get to code, but it also reduces answer-channel drift where the model finishes thinking, starts writing, and still emits malformed or misaligned code.
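The extraction labels come from splitting each completion into a scratchpad channel and an answer channel. A minimal sketch of that kind of splitter (the real harness differs in details):

```python
import re

def split_think_and_code(completion: str) -> tuple[str, str]:
    """Separate the <think> scratchpad from the answer channel."""
    m = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if m is None:
        # Think block never closed (e.g. the 16384-token budget ran out
        # mid-thought): everything is scratchpad, the code channel is empty,
        # which is scored as empty_code above.
        return completion, ""
    return m.group(1), completion[m.end():].strip()
```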

Some rollouts show the pattern:

| Task | FREE | FSM_PLAN |
|---|---|---|
| 3702 | fail, 16384 think, empty code | pass, 114 think / 454 total |
| 3677 | fail, 6269 think / 16384 total, only 2 code lines, indentation error | pass, 216 think / 1654 total |
| 3771 | fail, 15820 think / 16384 total, code emitted, timeout | pass, 145 think / 771 total |
| 3794 | fail, 12914 think / 16384 total, 28 code lines, wrong answer | fail, 166 think / 807 total |
| 3750 | pass, 15241 think / 15717 total | pass, 182 think / 724 total |

A few raw output fragments make this less abstract:

```
3702 FREE: think_tokens=16384, total_tokens=16384, extracted_code=""
3702 FSM:  GOAL/STATE/ALGO/EDGE/VERIFY in 114 think tokens, then runnable code

3677 FREE:
class Solution:
    def maximumAmount(self, coins: List[List[int]]) -> int:
# evaluator: IndentationError: expected an indented block

3771 FREE: emitted 27 code lines after 15820 think tokens, then TIMEOUT
3794 FREE: emitted 28 code lines, then wrong_answer on public test 0
```

The caveat: on harder tasks, the model can move deliberation into comments or post-think answer text. So `<think>` tokens are not enough. Total tokens, post-think tokens, answer-channel bloat, and comment bloat all need to be tracked.
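A sketch of that accounting, assuming the splitter from the previous section and a hypothetical `encode` callable (the model's own tokenizer in practice):

```python
def reasoning_footprint(completion: str, encode) -> dict[str, int]:
    """Budget every channel, not just the <think> block."""
    think, answer = split_think_and_code(completion)  # splitter sketched earlier
    comments = "\n".join(
        line for line in answer.splitlines() if line.lstrip().startswith("#")
    )
    return {
        "think_tokens": len(encode(think)),
        "post_think_tokens": len(encode(answer)),   # answer-channel bloat
        "comment_tokens": len(encode(comments)),    # deliberation hiding in comments
        "total_tokens": len(encode(completion)),
    }
```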

Limitations and open questions

  1. Contamination. HumanEval has been in training corpora for years. All modes may be recalling solutions, not reasoning. The "FSM matches FREE" result could mean "grammar extracts the same memorized solution in fewer tokens" rather than "grammar preserves reasoning capability." LiveCodeBench v6 reaches April 2025, so it is a useful lower-contamination check, but not a strict post-Qwen3.6 cutoff.

  2. Grammar specificity. GOAL/APPROACH/EDGE was tuned for short coding tasks. The LiveCodeBench grammar adds GOAL/STATE/ALGO/EDGE/VERIFY while leaving the answer permissive. Math, logic, planning, and agentic tasks may need different compressed formats. I do not know whether a universal compressed-thinking grammar exists.

  3. Reasoning depth. These are one-shot coding problems. SWE-Bench, long-horizon planning, and multi-file agentic tasks may break this. Compression can skip the edge case that verbose reasoning would have found.

  4. Model dependency. These runs use one local Qwen3.6 MoE quant. I do not know whether smaller models can follow the same grammar without capability loss, or whether this requires scale.

  5. Prompt optimization. Better task-specific grammars may be searchable. GEPA-style prompt optimization could explore the grammar format itself, not only the prompt around it.

The clean takeaway is not that terse prompting is enough. Prompt-only can change outcomes, but it does not reliably control reasoning length. A grammar constraint gives you a hard budget on the shape of thought while leaving the answer channel open.

Cite this

```bibtex
@misc{omers2026structuredcot,
  title        = {Structured CoT: Shorter Reasoning with a Grammar File},
  author       = {Kaya Omer},
  year         = {2026},
  month        = {April},
  url          = {https://andthattoo.dev/blog/structured_cot}
}
```

References