Structured CoT: Shorter Reasoning with a Grammar File
Qwen3.6-27B made local models feel frontier-ish again on consumer-grade hardware. The model is crazy good for its size. The big problem I have with Qwen reasoning models in general is overthinking.
Yes, it thinks a lot.
Longer thinking helps reasoning. Chain-of-thought prompting showed that explicitly written intermediate reasoning can improve large model performance (Wei et al., 2022).
But for local inference, the blocker is the real-time feel. When the model overthinks, latency goes up, tokens-per-joule gets worse, and the session feels heavy. `no_think` does not work reliably for me, and it can hurt performance.
I tried a dumb inference-time trick: force a reasoning model's think block to follow a tiny grammar.
It started as a training idea, but I kept telling myself to explore the path of least resistance first.
The mechanism is standard guided generation: constrain decoding with a grammar / finite-state machine rather than asking the model nicely. Outlines frames neural text generation as transitions through an FSM or parser state and precomputes valid token transitions over the vocabulary (Willard & Louf, 2023). I am using that old idea in a narrower place: not to make JSON valid, but to make the model's private scratchpad short.
```
root  ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n\n"
line  ::= [^\n]+ "\n"
code  ::= [\x09\x0A\x0D\x20-\x7E]+
```
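The enforcement mechanism is easy to picture with a toy. The sketch below is hand-rolled, not how Outlines or llama.cpp actually implement it: for a single `"GOAL: " line` fragment it computes which tokens of a tiny made-up vocabulary may extend the current prefix. Real systems compile the full grammar into an FSM over the tokenizer vocabulary and precompute this lookup.

```python
# Toy version of guided decoding for one grammar fragment:
# valid strings look like "GOAL: " + <non-empty line> + "\n".
# At each decoding step, only tokens that keep the prefix extendable
# to a valid string would be left unmasked in the logits.
VOCAB = ["GOAL", ": ", "sort the list", "\n", "oops"]
HEAD = "GOAL: "

def viable(s: str) -> bool:
    """Can s still be extended to (or already is) a valid string?"""
    if len(s) <= len(HEAD):
        return HEAD.startswith(s)
    if not s.startswith(HEAD):
        return False
    body = s[len(HEAD):]
    # line body must be non-empty; newline allowed only as the final char
    return "\n" not in body[:-1] and body != "\n"

def allowed(prefix: str) -> list[str]:
    return [t for t in VOCAB if viable(prefix + t)]

print(allowed(""))        # → ['GOAL']  (everything else is masked)
print(allowed("GOAL: "))  # → ['GOAL', ': ', 'sort the list', 'oops']
```

Note the newline token is masked right after `"GOAL: "` because the `line` rule requires at least one character; the model cannot emit an empty field.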
On HumanEval+ (164 problems), using unsloth/Qwen3.6-35B-A3B-GGUF at Q4_K_M on one RTX6000 through llama-cpp-python, this reduced explicit thinking tokens by 22x on average with no performance loss in this run.
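The harness itself is only a few lines. Here is a minimal sketch assuming llama-cpp-python; the model path, context size, and sampling settings are placeholders, not the exact run config:

```python
# Sketch: grammar-constrained decoding with llama-cpp-python.
# MODEL_PATH and the settings below are placeholders; the GBNF string
# mirrors the GOAL/APPROACH/EDGE grammar shown above.
THINK_GBNF = r'''
root  ::= think code
think ::= "<think>\n" "GOAL: " line "APPROACH: " line "EDGE: " line "</think>\n\n"
line  ::= [^\n]+ "\n"
code  ::= [\x09\x0A\x0D\x20-\x7E]+
'''

MODEL_PATH = "Qwen3.6-35B-A3B-Q4_K_M.gguf"  # placeholder path

def generate_structured(prompt: str) -> str:
    """One grammar-constrained completion; requires llama-cpp-python."""
    from llama_cpp import Llama, LlamaGrammar  # heavy import kept local
    llm = Llama(model_path=MODEL_PATH, n_ctx=8192)
    grammar = LlamaGrammar.from_string(THINK_GBNF)
    out = llm(prompt, max_tokens=2048, grammar=grammar)
    return out["choices"][0]["text"]
```

The grammar is passed per call, so the same loaded model can serve free-form and constrained runs side by side.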
| Mode | pass@1 | mean thinking tokens | mean total tokens |
|---|---|---|---|
| Free-form think | 92.1% | 3087 | 3410 |
| Grammar-constrained GOAL/APPROACH/EDGE | 92.7% | 138 | 408 |
| Prompt-only terse GOAL/APPROACH/EDGE | 93.3% | 2298 | 2764 |
| FSM vs FREE delta | +0.6 pp | 22.4x shorter | 8.4x shorter |
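The delta row is just the ratio of the per-mode means; re-deriving it:

```python
# Re-deriving the FSM-vs-FREE delta row from the table's per-mode means.
free_think, fsm_think = 3087, 138
free_total, fsm_total = 3410, 408

print(f"{free_think / fsm_think:.1f}x")  # → 22.4x
print(f"{free_total / fsm_total:.1f}x")  # → 8.4x
print(f"{92.7 - 92.1:+.1f} pp")          # → +0.6 pp
```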
Format restrictions usually make models worse. Tam et al. find a significant reasoning decline under constrained output formats (Tam et al., 2024). This result is a narrow counterexample: the answer remains permissive, but the reasoning scratchpad is compressed.
[Figure: the same problem generated side by side, free-form vs grammar-constrained.]
No distillation. No fine-tuning. No reward model. A ~20-line GBNF grammar applied only to the reasoning block matches full-thinking accuracy while using an order of magnitude fewer tokens.
The prompt-only terse control slightly improves pass@1 on this run, but still spends 2298 thinking tokens on average. The grammar is what reliably enforces the compact regime.
I call this Structured Chain-of-Thought.
Naming note: Li et al. also use the name "Structured Chain-of-Thought" for code generation, where the model writes CoT in programming structures like sequence, branch, and loop (Li et al., 2025). This post uses the same plain-language phrase differently: grammar/FSM-constrained decoding for the scratchpad at inference time. The grammar can be swapped by domain; it is not code-specific.
A harder check
I was not fully convinced because HumanEval+ is contaminated and simple. The next check was a recent LiveCodeBench v6 LeetCode slice: 50 problems, `contest_date >= 2025-01-01`, public functional tests.
I used a richer grammar:
```
root  ::= think code
think ::= "<think>\n" "GOAL: " line "STATE: " line "ALGO: " line "EDGE: " line "VERIFY: " line "</think>\n\n"
line  ::= [^\n]+ "\n"
code  ::= [\x09\x0A\x0D\x20-\x7E]+
```
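A quick way to sanity-check run logs is to verify that each scratchpad really matched the grammar. Here is a regex mirror of the five-field `think` rule (the `code` rule stays permissive, as in the grammar); a small helper for log inspection, not part of the decoding path:

```python
import re

# Regex mirror of the five-field think rule, for checking logged outputs.
THINK_RE = re.compile(
    r"<think>\n"
    r"GOAL: [^\n]+\n"
    r"STATE: [^\n]+\n"
    r"ALGO: [^\n]+\n"
    r"EDGE: [^\n]+\n"
    r"VERIFY: [^\n]+\n"
    r"</think>\n\n"
)

def matches_plan(output: str) -> bool:
    """True if the output opens with a well-formed five-field scratchpad."""
    return THINK_RE.match(output) is not None

ok = "<think>\nGOAL: a\nSTATE: b\nALGO: c\nEDGE: d\nVERIFY: e\n</think>\n\nprint(1)\n"
print(matches_plan(ok))                              # → True
print(matches_plan("<think>\nGOAL: a\n</think>\n"))  # → False
```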
This is not "reasoning inside the circuitry." It is simpler and cleaner than that: an inference harness sitting between the sampler and the model's answer.
LiveCodeBench matters here because HumanEval is old and heavily contaminated. I filtered to `contest_date >= 2025-01-01` so the slice is more recent than most benchmark memorization pressure and closer to the model's release window. This does not prove zero contamination, but it is a better stress test than HumanEval+.
Budget: 16384 tokens, which is quite large for local compute.
| LiveCodeBench v6 mode | pass@1 | mean thinking tokens | mean total tokens | extraction issues |
|---|---|---|---|---|
| Free-form think | 50.0% | 11553 | 13632 | empty_code=18 |
| FSM_PLAN GOAL/STATE/ALGO/EDGE/VERIFY | 64.0% | 267 | 2743 | none |
| FSM_PLAN vs FREE delta | +14.0 pp | 43.3x shorter | 5.0x shorter | - |
This is a LiveCodeBench public-test result, not an official private leaderboard score. The improvement is partly mechanical, but not only empty-code recovery. FREE had 25 failures: empty_code=18, syntax_error=4, missing_entry_point=1, timeout=1, wrong_answer=1. The grammar often forces the model to leave the scratchpad and get to code, but it also reduces answer-channel drift where the model finishes thinking, starts writing, and still emits malformed or misaligned code.
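Those failure categories come from triaging each extracted answer before the tests run. A sketch of that classification, assuming answers are Python source; the entry-point heuristic here is a simplified stand-in for the benchmark's real check:

```python
# Classify an extracted answer before executing tests. Categories mirror
# the FREE-mode breakdown; the entry-point check is a simplified stand-in.
def triage(code: str) -> str:
    if not code.strip():
        return "empty_code"
    try:
        compile(code, "<answer>", "exec")
    except SyntaxError:          # includes IndentationError
        return "syntax_error"
    if "def " not in code and "class " not in code:
        return "missing_entry_point"
    return "runnable"            # still subject to timeout / wrong_answer

print(triage(""))                                    # → empty_code
print(triage("class Solution:\ndef f(self): pass"))  # → syntax_error
print(triage("x = 1"))                               # → missing_entry_point
```

The second example is exactly the 3677 failure mode below: a `def` left flush with its `class`, which trips `IndentationError` before any test runs.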
Some rollouts show the pattern:
| Task | FREE | FSM_PLAN |
|---|---|---|
| 3702 | fail, 16384 think, empty code | pass, 114 think / 454 total |
| 3677 | fail, 6269 think / 16384 total, only 2 code lines, indentation error | pass, 216 think / 1654 total |
| 3771 | fail, 15820 think / 16384 total, code emitted, timeout | pass, 145 think / 771 total |
| 3794 | fail, 12914 think / 16384 total, 28 code lines, wrong answer | fail, 166 think / 807 total |
| 3750 | pass, 15241 think / 15717 total | pass, 182 think / 724 total |
A few raw output fragments make this less abstract:
```
3702 FREE: think_tokens=16384, total_tokens=16384, extracted_code=""
3702 FSM:  GOAL/STATE/ALGO/EDGE/VERIFY in 114 think tokens, then runnable code
3677 FREE:
    class Solution:
    def maximumAmount(self, coins: List[List[int]]) -> int:
        # evaluator: IndentationError: expected an indented block
3771 FREE: emitted 27 code lines after 15820 think tokens, then TIMEOUT
3794 FREE: emitted 28 code lines, then wrong_answer on public test 0
```
The caveat: on harder tasks, the model can move deliberation into comments or post-think answer text. So counting `<think>` tokens alone is not enough; total tokens, post-think tokens, answer-channel bloat, and comment bloat all need to be tracked.
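One rough way to track those channels in a harness (whitespace word counts stand in for real tokenizer counts; a sketch, not the exact instrumentation used here):

```python
import re

# Split an output into channels so compression inside <think> cannot
# hide as comment or post-think bloat. Word counts approximate tokens.
def channel_stats(output: str) -> dict[str, int]:
    m = re.search(r"<think>(.*?)</think>", output, flags=re.S)
    think = m.group(1) if m else ""
    answer = output[m.end():] if m else output
    comments = " ".join(
        line for line in answer.splitlines() if line.lstrip().startswith("#")
    )
    return {
        "think": len(think.split()),
        "answer": len(answer.split()),
        "comments": len(comments.split()),
    }

sample = "<think>\nGOAL: x\n</think>\n\ndef f():\n    # long musing here\n    return 1\n"
print(channel_stats(sample))  # → {'think': 2, 'answer': 8, 'comments': 4}
```

A grammar run with tiny `think` counts but ballooning `comments` counts would be moving its deliberation, not shrinking it.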
Limitations and open questions
- Contamination. HumanEval has been in training corpora for years. All modes may be recalling solutions, not reasoning. The "FSM matches FREE" result could mean "grammar extracts the same memorized solution in fewer tokens" rather than "grammar preserves reasoning capability." LiveCodeBench v6 reaches April 2025, so it is a useful lower-contamination check, but not a strict post-Qwen3.6 cutoff.
- Grammar specificity. `GOAL/APPROACH/EDGE` was tuned for short coding tasks. The LiveCodeBench grammar expands this to `GOAL/STATE/ALGO/EDGE/VERIFY` while leaving the answer permissive. Math, logic, planning, and agentic tasks may need different compressed formats. I do not know whether a universal compressed-thinking grammar exists.
- Reasoning depth. These are one-shot coding problems. SWE-Bench, long-horizon planning, and multi-file agentic tasks may break this. Compression can skip the edge case that verbose reasoning would have found.
- Model dependency. These runs use one local Qwen3.6 MoE quant. I do not know whether smaller models can follow the same grammar without capability loss, or whether this requires scale.
- Prompt optimization. Better task-specific grammars may be searchable. GEPA-style prompt optimization could explore the grammar format itself, not only the prompt around it.
The clean takeaway is not that terse prompting is enough. Prompt-only can change outcomes, but it does not reliably control reasoning length. A grammar constraint gives you a hard budget on the shape of thought while leaving the answer channel open.
Cite this
```bibtex
@misc{omers2026structuredcot,
  title  = {Structured CoT: Shorter Reasoning with a Grammar File},
  author = {Kaya Omer},
  year   = {2026},
  month  = {April},
  url    = {https://andthattoo.dev/blog/structured_cot}
}
```
References
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903, 2022.
- Brandon T. Willard, Rémi Louf. Efficient Guided Generation for Large Language Models. arXiv:2307.09702, 2023.
- Liangsheng Yin, Ying Sheng, Lianmin Zheng. Fast JSON Decoding for Local LLMs with Compressed Finite State Machine. LMSYS Blog, 2024.
- Jia Li, Ge Li, Yongmin Li, Zhi Jin. Structured Chain-of-Thought Prompting for Code Generation. ACM Transactions on Software Engineering and Methodology 34(2), Article 37, 2025.
- Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. arXiv:2408.02442, 2024.