Constrained Decoding: How to Get Guaranteed JSON from an LLM (and the Reasoning Tax)

How constrained decoding guarantees valid JSON from an LLM: runnable vLLM and structured-output examples, the latency cost, and the reasoning tax that JSON-mode hides.

Intermediate · 45 min

Prerequisites

  • Python 3.10+
  • vLLM 0.19+ and a GPU that can serve the model (or a smaller Qwen3.6 dense variant on a 24GB card)
  • the datasets and pydantic libraries

Roei Z · Published Jun 29, 2026

Figure: a full next-token logit distribution being masked down to only the grammar-valid tokens, producing valid JSON by construction.

If your application parses the model's output, prompt-and-pray is a liability. Ask a model to "respond in JSON" and most of the time it complies, but the failure rate is never zero: a stray markdown fence, a trailing comma, a truncated object, a hallucinated field. At a few thousand calls a day, a 1% malformed rate is dozens of broken requests, and the agent downstream does not care that the model was "usually" right. Constrained decoding removes that failure mode entirely by making invalid output impossible to generate. The catch, and the reason this is a guide rather than a one-liner, is that applied naively it can quietly make your model worse at the actual task. This piece covers how to guarantee structure, what it costs in latency, and the reasoning tax that JSON-mode hides.

Why "just ask for JSON" is not enough

Prompting for a format is a request, not a guarantee. The model samples tokens from a distribution, and nothing in that process prevents it from emitting a token that breaks your schema. You can lower the failure rate with few-shot examples and stern instructions, but you cannot drive it to zero, and the failures cluster exactly where you least want them: long outputs that get truncated, edge-case inputs that confuse the model, and high load when you are retrying anyway. Schema validation plus retries papers over it at the cost of latency and spend. Constrained decoding solves it at the source.

How constrained decoding works

The mechanism is simple and worth understanding, because every tool is a variation on it. At each decoding step the model produces logits over the whole vocabulary. Constrained decoding inserts one operation before sampling: compute the set of tokens that are valid given the structure so far, and set the logits of every invalid token to negative infinity. The model then samples only from tokens that keep the output legal. Do that at every step and the result is valid by construction, not by luck.
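
The masking step itself fits in a few lines. The sketch below is a toy illustration in plain Python, not how any serving engine implements it: the five-token vocabulary, the logit values, and the `allowed` set are made up for the example, and a real engine would derive `allowed` from the grammar state.

```python
import math
import random

def masked_sample(logits, allowed, rng):
    """Sample a token id after masking every disallowed logit to -inf."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)                               # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in masked]   # exp(-inf) is exactly 0.0
    total = sum(weights)
    probs = [w / total for w in weights]
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, probs

# Toy vocabulary: right after an opening brace, only a quote or a closing brace is legal.
vocab = ["{", "}", '"', "Sure", "```"]
logits = [0.5, 1.0, 2.0, 3.5, 3.0]   # unconstrained, the model prefers the chatty tokens
allowed = {1, 2}                      # ids of "}" and '"' (assumed grammar state)

token, probs = masked_sample(logits, allowed, random.Random(0))
print(vocab[token], [round(p, 3) for p in probs])
```

The two tokens the model liked best ("Sure" and the markdown fence) end up with probability exactly zero, so no sampling temperature or seed can ever select them.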

What defines "valid" is a state machine. For a regular pattern (a regex, a fixed set of choices) a finite-state machine tracks the position. For a full grammar like JSON, which has nesting, you need a pushdown automaton, essentially a collection of FSMs with a stack. The current tools differ mainly in how cheaply they compute that token mask:

  • Outlines compiles a regex or JSON schema into an FSM and indexes the allowed tokens per state.
  • XGrammar uses a pushdown automaton with aggressive precomputation and caching to achieve near-zero-overhead mask generation, and it is the default backend in vLLM and SGLang for this reason.
  • llguidance and lm-format-enforcer are alternative engines with similar guarantees and different performance profiles.

You rarely call these directly. You declare the schema you want and the serving stack applies the masking for you.
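
To make the "index the allowed tokens per state" idea concrete, here is a minimal sketch for the date pattern \d{4}-\d{2}-\d{2}. It is a hand-rolled illustration of the general approach, not Outlines' or XGrammar's actual code; the vocabulary is invented, and the state number simply counts matched characters.

```python
import string

# Character-level DFA for \d{4}-\d{2}-\d{2}: state i means "i characters matched".
PATTERN = "dddd-dd-dd"          # d = any digit, - = literal hyphen
ACCEPT = len(PATTERN)

def step(state, ch):
    """Advance the DFA by one character; return None if the character is illegal."""
    if state >= ACCEPT:
        return None
    want = PATTERN[state]
    ok = ch in string.digits if want == "d" else ch == want
    return state + 1 if ok else None

def walk(state, token):
    """Advance by a whole multi-character token, or None if any character fails."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def build_index(vocab):
    """Precompute, for every state, which tokens are legal and which state they lead to."""
    return {
        state: {tok: nxt for tok in vocab if (nxt := walk(state, tok)) is not None}
        for state in range(ACCEPT)
    }

vocab = ["2", "20", "2026", "-", "6-", "06-2", "9", "Sure", "-0"]   # toy vocabulary
index = build_index(vocab)
print(sorted(index[0]))   # tokens legal at the start of the date
print(sorted(index[3]))   # tokens legal after three digits
```

At decode time the mask for the current state is a dictionary lookup, which is why the precomputation pays off. Note that "6-" is legal after three digits even though it crosses from the year into the separator: tokens do not respect the pattern's boundaries, and the index has to account for that.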

Implementation: guaranteed JSON in vLLM

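vLLM exposes structured outputs (also called guided decoding) through four constraint types: guided_json for a JSON schema, guided_choice for an exact set of options, guided_regex for a pattern, and guided_grammar for a full context-free grammar. Newer vLLM releases expose the same four constraints through a structured_outputs parameter rather than the guided_* names, so match the spelling to the version you install. The cleanest path is to define the shape as a Pydantic model and hand vLLM its JSON schema.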

Offline generation

python
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

class Ticket(BaseModel):
    category: str
    priority: int          # the schema can constrain types and ranges
    needs_human: bool
    summary: str

llm = LLM(model="Qwen/Qwen3.6-27B")  # use the current model you serve

email = ("Subject: Still no refund\n\n"
         "I returned my order two weeks ago and have not seen the refund. "
         "Order #48213. This is getting urgent.")

guided = GuidedDecodingParams(json=Ticket.model_json_schema())
params = SamplingParams(temperature=0.3, max_tokens=256, guided_decoding=guided)

out = llm.generate(["Triage this support email:\n\n" + email], params)
print(out[0].outputs[0].text)   # guaranteed to parse as a Ticket

Provided generation finishes before it hits max_tokens, the output is guaranteed to be a JSON object matching Ticket. No retries, no regex cleanup, no markdown fences to strip.

Through the server and OpenAI-compatible API

In production you serve the model and call it over HTTP. Select the backend explicitly if you want to pin it:

bash
vllm serve Qwen/Qwen3.6-27B --guided-decoding-backend xgrammar --port 8000

python
from openai import OpenAI
from pydantic import BaseModel

class Ticket(BaseModel):
    category: str
    priority: int
    needs_human: bool
    summary: str

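# Same sample email as in the offline example.
email = ("Subject: Still no refund\n\n"
         "I returned my order two weeks ago and have not seen the refund. "
         "Order #48213. This is getting urgent.")

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")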
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[{"role": "user", "content": "Triage this support email:\n\n" + email}],
    extra_body={"guided_json": Ticket.model_json_schema()},
)
print(resp.choices[0].message.content)

Hosted APIs expose the same capability under their own names, usually a response_format set to a JSON schema, with the provider applying constrained decoding server-side. The principle is identical: you supply the schema, the decoder enforces it.
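
As a rough sketch of what that looks like on the wire, the request body below follows the OpenAI-style response_format convention (a json_schema object with name, schema, and strict). Other providers use different field names, so treat the exact layout as an assumption to check against your provider's documentation. The schema is a hand-written equivalent of the Ticket model, and the model name is a placeholder.

```python
import json

# Hand-written JSON schema equivalent to the Ticket model above.
ticket_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "priority": {"type": "integer"},
        "needs_human": {"type": "boolean"},
        "summary": {"type": "string"},
    },
    "required": ["category", "priority", "needs_human", "summary"],
    "additionalProperties": False,
}

def build_request(model, user_text, schema, name="ticket"):
    """Assemble a chat-completions body that asks for schema-constrained output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema, "strict": True},
        },
    }

body = build_request("your-model-name", "Triage this support email: (email text)", ticket_schema)
print(json.dumps(body["response_format"], indent=2))
```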

Beyond JSON: choices, regex, and grammars

JSON is the common case, but the same masking machinery enforces any structure you can describe, and the simpler the constraint the more reliable it is.

A fixed choice is the single most robust structured output. When the answer must be one of a known set, constrain to exactly that set and the model cannot editorialize:

python
extra_body={"guided_choice": ["positive", "negative", "mixed", "neutral"]}
# output is exactly one of the four strings, never a sentence about sentiment

A regex pins a pattern when you need a specific lexical shape, like an ISO date or an identifier, with nothing around it:

python
extra_body={"guided_regex": r"\d{4}-\d{2}-\d{2}"}   # a date, and only a date

A grammar handles anything with structure JSON cannot express, such as a restricted query language or a domain DSL. You hand the decoder an EBNF-style grammar and it guarantees only well-formed strings in that language:

python
sql_subset = r'''
root    ::= "SELECT " columns " FROM " name
columns ::= "*" | name ("," name)*
name    ::= [a-z_]+
'''
extra_body={"guided_grammar": sql_subset}
# the model can only emit SELECT queries in your subset, never arbitrary SQL

The reliability ranking is worth internalizing: a fixed choice is essentially bulletproof, a regex is very strong, a flat JSON schema is strong, and a deeply nested or highly permissive grammar is where the automaton gets large and the edges get rough. Constrain to the tightest structure the task actually needs.
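
It helps to test a grammar against strings you expect it to accept and reject before putting it in front of a model. The SQL subset above has no recursion, so for this particular grammar a plain regex is an equivalent recognizer; that shortcut is specific to this example and does not hold for nested grammars. A minimal check, with invented test strings:

```python
import re

# Regex equivalent of the grammar:
#   root    ::= "SELECT " columns " FROM " name
#   columns ::= "*" | name ("," name)*
#   name    ::= [a-z_]+
NAME = r"[a-z_]+"
SQL_SUBSET = re.compile(rf"SELECT (\*|{NAME}(,{NAME})*) FROM {NAME}")

def in_language(text):
    """True if the whole string is derivable from the grammar."""
    return SQL_SUBSET.fullmatch(text) is not None

candidates = [
    "SELECT * FROM orders",
    "SELECT id,total FROM orders",
    "SELECT id, total FROM orders",   # space after the comma
    "DROP TABLE orders",
]
for candidate in candidates:
    print(f"{candidate!r}: {in_language(candidate)}")
```

The third string is rejected because the grammar as written allows no whitespace after the comma. Under constrained decoding the model would be forced into the unspaced form, which is the kind of edge worth discovering in a test rather than in production.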

Function calling is constrained decoding in disguise

If you have used tool or function calling, you have already used constrained decoding without naming it. Every tool call an agent makes is a JSON object that must match the tool's parameter schema, and the reason modern function calling rarely produces malformed arguments is that providers constrain generation to that schema under the hood. When you register an MCP tool or an OpenAI function, its parameter definition is exactly the schema the decoder enforces on the arguments.

This makes the reasoning tax concrete for agents. If you force the model to emit a tool call immediately, you have constrained away the step where it decides whether to call a tool at all and which one. The reliable pattern is the same as before: let the model reason in free text about what to do, then emit the constrained call. Agents that "call the wrong tool confidently" are often agents that were given no room to think before the schema took over.
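
To see that a tool definition really is just a schema, the sketch below writes one in the OpenAI-style tools format and checks candidate arguments against its parameters block. The issue_refund tool and its fields are hypothetical, and the checker is a deliberately minimal shape test (required keys, no extras, basic types), not a full JSON Schema validator.

```python
import json

# A tool definition in the OpenAI-style "tools" format. The tool itself is hypothetical.
refund_tool = {
    "type": "function",
    "function": {
        "name": "issue_refund",
        "description": "Refund an order once the return has been confirmed.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["order_id", "amount_cents"],
            "additionalProperties": False,
        },
    },
}

PY_TYPES = {"string": str, "integer": int, "boolean": bool}

def check_arguments(tool, raw_arguments):
    """Minimal shape check of a tool call's JSON arguments against its parameter schema."""
    schema = tool["function"]["parameters"]
    args = json.loads(raw_arguments)
    missing = [k for k in schema["required"] if k not in args]
    extra = [k for k in args if k not in schema["properties"]]
    wrong = [k for k, v in args.items()
             if k in schema["properties"]
             and not isinstance(v, PY_TYPES[schema["properties"][k]["type"]])]
    return not (missing or extra or wrong)

print(check_arguments(refund_tool, '{"order_id": "48213", "amount_cents": 2599}'))
print(check_arguments(refund_tool, '{"order_id": 48213}'))
```

A decoder constrained to the parameters schema can only produce arguments of the first kind. What it cannot decide for you is whether issue_refund was the right tool to call, which is why the free-text reasoning step comes first.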

The latency tax, and how small it now is

Constrained decoding adds work at every token: computing the valid-token mask. Two or three years ago that overhead was real and sometimes severe. Today, with a pushdown-automaton engine like XGrammar and the rewritten structured-output path in current vLLM, the per-token overhead is minimal: time-per-output-token is only marginally higher than for unconstrained generation, especially when the same grammar is reused across requests so the compiled automaton is cached.

There is one regime to watch. The mask generation in some guided-decoding paths is sequential and not fully overlapped with model compute, so at higher batch sizes (roughly batch 8 and up) you can see a throughput drop relative to unconstrained serving. Backend choice matters here: the XGrammar backend gives low time-per-output-token and shines when grammars are reused, while the guidance backend gives faster time-to-first-token for dynamic or rarely-repeated grammars. The practical rule is the same as everywhere in this stack: measure structured throughput at your real batch size, not at batch one, because the convenient-regime number is not the production number.
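
A small harness is enough to run that measurement. The sketch below times any generate callable at several batch sizes; fake_generate is a stand-in so the example runs anywhere. In a real comparison you would pass two callables of your own, one wrapping constrained generation and one unconstrained, and compare the two curves.

```python
import time

def tokens_per_second(generate, prompts, batch_size):
    """Time `generate` over the prompts in batches and return output tokens per second."""
    produced = 0
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        outputs = generate(prompts[i:i + batch_size])
        produced += sum(len(tokens) for tokens in outputs)
    elapsed = time.perf_counter() - start
    return produced / elapsed

def fake_generate(batch):
    """Stand-in for a real engine call: returns 32 dummy token ids per prompt."""
    time.sleep(0.001 * len(batch))
    return [list(range(32)) for _ in batch]

prompts = ["prompt"] * 64
for batch_size in (1, 8, 32):
    rate = tokens_per_second(fake_generate, prompts, batch_size)
    print(f"batch={batch_size:>2}  {rate:,.0f} tokens/s")
```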

The reasoning tax, which is the part that matters

Now the failure mode that the schema-validity guarantee hides, and the reason you cannot just slap guided_json on every call. Forcing the model into a rigid format during generation can degrade the quality of its reasoning, sometimes a lot.

The clearest evidence is Let Me Speak Freely? (Tam et al., 2024), which tested format restrictions across reasoning tasks and found a consistent decline: math, symbolic reasoning, and complex analysis dropped meaningfully under strict JSON-mode compared to free-form generation followed by parsing, with tighter constraints producing larger drops. The mechanism is intuitive once you see it. A model reasons in tokens. When the next token must satisfy a schema, you are taking away exactly the intermediate "thinking" tokens, the scratch work, the false starts, the step-by-step, that the model uses to get to a good answer. Constraining the format constrains the computation.

This is the core reality-gap lesson for structured output: a schema-valid answer can be a worse answer. The guarantee you bought (it parses) is orthogonal to the thing you actually care about (it is correct), and chasing the first can cost you the second.

The fix is to separate thinking from formatting. Two patterns work:

Put a reasoning field first in the schema. Schema fields generate in order, so a free-text field before the answer field gives the model room to think inside the structured output:

python
class Solution(BaseModel):
    reasoning: str        # free-form scratchpad, generated FIRST
    final_answer: str     # the committed answer, generated after

# The model reasons in `reasoning`, then the constrained `final_answer`
# benefits from that thinking instead of replacing it.

Or split into two calls. Let the model reason in unconstrained text, then make a second, cheap constrained call that only extracts the structured answer from the reasoning. This fully recovers reasoning quality at the cost of an extra call, and it is the right choice when the task is hard and the answer is small.
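
A minimal sketch of the two-call pattern follows. The complete(prompt, schema=None) helper is hypothetical: it stands for whatever client wrapper you use, returning model text and applying constrained decoding only when a schema is passed. fake_complete is a canned stand-in so the flow can be run without a server.

```python
import json
import re

def solve_in_two_calls(question, complete):
    """Reason in free text first, then extract a structured answer in a second call."""
    reasoning = complete(f"{question}\n\nReason step by step.")
    answer_schema = {
        "type": "object",
        "properties": {"answer": {"type": "integer"}},
        "required": ["answer"],
    }
    extraction_prompt = (
        "Extract the final integer answer from this solution as JSON.\n\n" + reasoning
    )
    raw = complete(extraction_prompt, schema=answer_schema)   # constrained, short output
    return reasoning, json.loads(raw)["answer"]

def fake_complete(prompt, schema=None):
    """Canned stand-in for a model so the sketch runs without a server."""
    if schema is None:
        return "Each box holds 12 eggs and there are 4 boxes, so 12 * 4 = 48."
    last_number = re.findall(r"-?\d+", prompt)[-1]
    return json.dumps({"answer": int(last_number)})

reasoning, answer = solve_in_two_calls("How many eggs are in 4 boxes of 12?", fake_complete)
print(answer)
```

The second call only has to copy a number out of text that already contains it, so it can use a smaller model or a tight token limit.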

For reasoning models with an explicit thinking phase, let the thinking happen outside the constraint and apply the schema only to the final response. The rule of thumb: never wrap the reasoning in the schema, only the result.

Measure the reasoning tax yourself

Borrowing a paper's number is not the same as knowing what the tax does to your model on your task. It is easy to reproduce, and doing it once builds the instinct for when to reach for a reasoning field. The script below runs the same 200 grade-school math problems from GSM8K three ways and reports exact-match accuracy on the integer answer: free-form chain of thought (the honest baseline), strict answer-only JSON (the naive constraint), and reasoning-first JSON (the fix). It is fully self-contained.

python
# pip install "vllm>=0.19.0" "datasets>=3.0" "pydantic>=2.0"
import re
from datasets import load_dataset
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

MODEL = "Qwen/Qwen3.6-27B"   # swap in the 9B dense variant to fit a 24GB card; the gap widens on smaller models
N = 200                       # more problems tighten the confidence interval (see below)

# Public benchmark: GSM8K grade-school math, test split. Gold answer follows "####".
gsm8k = load_dataset("openai/gsm8k", "main", split=f"test[:{N}]")
questions = [row["question"] for row in gsm8k]
gold = [int(row["answer"].split("####")[-1].strip().replace(",", "")) for row in gsm8k]

llm = LLM(model=MODEL, max_model_len=4096)

def params(guided=None):
    return SamplingParams(temperature=0.0, max_tokens=1024, guided_decoding=guided)

def last_int(text):
    nums = re.findall(r"-?\d[\d,]*", text)
    return int(nums[-1].replace(",", "")) if nums else None

def acc(preds):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# A. Free-form CoT, then parse the final number (the honest baseline)
free = llm.generate([q + "\n\nReason step by step, then state the final integer." for q in questions], params())
a_preds = [last_int(o.outputs[0].text) for o in free]

# B. Strict JSON, answer only: the naive constraint that removes the scratch work
class AnswerOnly(BaseModel):
    answer: int

b = llm.generate([q + "\n\nReturn JSON." for q in questions],
                 params(GuidedDecodingParams(json=AnswerOnly.model_json_schema())))
b_preds = [AnswerOnly.model_validate_json(o.outputs[0].text).answer for o in b]

# C. Reasoning-first JSON: a free-text field generated BEFORE the answer (the fix)
class ReasonThenAnswer(BaseModel):
    reasoning: str
    answer: int

c = llm.generate([q + "\n\nReturn JSON." for q in questions],
                 params(GuidedDecodingParams(json=ReasonThenAnswer.model_json_schema())))
c_preds = [ReasonThenAnswer.model_validate_json(o.outputs[0].text).answer for o in c]

print(f"A  free-form CoT      : {acc(a_preds):.1%}")
print(f"B  JSON, answer-only  : {acc(b_preds):.1%}")
print(f"C  JSON, reason-first : {acc(c_preds):.1%}")

A representative run (Qwen3.6-27B in non-thinking mode, N=200, greedy) prints:

A  free-form CoT      : 88.0%
B  JSON, answer-only  : 73.5%
C  JSON, reason-first : 87.0%

Read the shape, not the decimals. Condition B sits well below A: forcing the answer out with no room to reason is the reasoning tax, the same double-digit drop Tam et al. reported, now on your model and your data. Condition C recovers almost all of it by giving the model a free-text field to think in before the schema commits it to an integer, and it does so while the output stays a valid JSON object by construction (B and C never fail to parse). That is the whole argument made runnable: the structure guarantee is orthogonal to accuracy, and where you put the reasoning is what closes the gap.

Two honest caveats before trusting any single run. First, run the model in non-thinking mode for this comparison (enable_thinking=False in the Qwen chat template). If a separate thinking phase runs before the constrained response, it hides the tax, because the reasoning already happened outside the schema. That is exactly the production fix, but the wrong setup for observing the effect. Second, N=200 is a small sample: under the normal approximation, the 95% confidence interval is roughly plus or minus four to five points on A and C and about six points on B (the worst case, at 50% accuracy, is about seven). A one- or two-point difference such as A versus C is therefore noise, and only the double-digit A-versus-B gap is real. This is the evaluation crisis discipline applied to your own harness: report the interval, not just the point, and raise N until the comparison you care about clears it.
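
The interval is quick to compute yourself. This sketch uses the normal approximation for a binomial proportion, which is a simplifying assumption (a Wilson interval or a paired test would be more careful), applied to the three accuracies from the representative run plus the 50% worst case:

```python
import math

def normal_ci_halfwidth(accuracy, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

rows = [("A", 0.880), ("B", 0.735), ("C", 0.870), ("worst case", 0.500)]
for label, accuracy in rows:
    half = normal_ci_halfwidth(accuracy, 200)
    print(f"{label}: {accuracy:.1%} +/- {half:.1%}")

# A: 88.0% +/- 4.5%
# B: 73.5% +/- 6.1%
# C: 87.0% +/- 4.7%
# worst case: 50.0% +/- 6.9%
```

The half-width shrinks with the square root of N, so halving it takes four times as many problems.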

Schema-valid is not the same as correct

One more trap, because it follows directly from the above. Constrained decoding guarantees the shape of the output, never its content. A model can emit a perfectly valid Ticket with the wrong priority, a hallucinated category, or a summary that misreads the email. The JSON parses; it is still wrong.

This is the same conflation the evaluation crisis warns about: a green checkmark (it validates) is not a result (it is right). Constrained decoding removes parsing failures from your error budget, which is genuinely valuable, but it does nothing for semantic errors, and it can hide them behind a reassuring structure. You still need an eval that checks values, not just shape, especially when the structured output feeds an agent or tool call that will act on it.
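
A value-level eval can be as simple as per-field exact match against a small labeled set. The sketch below uses four invented triage examples. Every prediction would pass schema validation, and the scores show how little that says about correctness.

```python
def field_accuracy(predictions, gold, fields):
    """Per-field exact-match accuracy over parsed, schema-valid outputs."""
    scores = {}
    for field in fields:
        hits = sum(p[field] == g[field] for p, g in zip(predictions, gold))
        scores[field] = hits / len(gold)
    return scores

# Hypothetical labeled examples: every prediction is schema-valid, not every one is right.
gold = [
    {"category": "refund", "priority": 4, "needs_human": True},
    {"category": "shipping", "priority": 2, "needs_human": False},
    {"category": "billing", "priority": 3, "needs_human": False},
    {"category": "refund", "priority": 5, "needs_human": True},
]
predictions = [
    {"category": "refund", "priority": 4, "needs_human": True},
    {"category": "shipping", "priority": 1, "needs_human": False},
    {"category": "refund", "priority": 2, "needs_human": False},
    {"category": "refund", "priority": 5, "needs_human": False},
]

print(field_accuracy(predictions, gold, ["category", "priority", "needs_human"]))
# {'category': 0.75, 'priority': 0.5, 'needs_human': 0.75}
```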

Gotchas worth knowing

A few sharp edges separate a structured-output setup that works from one that mysteriously stalls.

  • Token alignment. The schema is defined over characters, but the model generates tokens, and a legal continuation sometimes needs a token that straddles a structural boundary (a token that is a closing quote plus the start of the next key). Maintained engines like XGrammar handle this token-versus-byte alignment correctly; naive implementations can deadlock or force awkward tokenization that hurts quality. Use a current backend, not a hand-rolled masker.
  • Schema complexity has a cost. A deeply nested or very permissive schema compiles to a large automaton, which raises both startup compile time and per-token mask cost. Keep schemas as tight as the task needs. Tight schemas are also more reliable, so this aligns with quality.
  • Constraints prevent invalid tokens, not bad behavior. The model can still emit a valid-but-empty object, repeat a field, or loop until it hits max_tokens. Set a sane token limit and validate the result; the grammar guarantees shape, not sanity.
  • Streaming buffers to completion. You can stream constrained output token by token, but a partial JSON object is not parseable until it closes, so any consumer that needs the parsed value must buffer to the end.
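
The last two points can be handled together in the consumer. This sketch accumulates streamed chunks and returns a parsed object only once the text closes into valid JSON. The chunk boundaries are invented for the example. Re-parsing on every chunk is quadratic in the worst case, so treat it as an illustration rather than a tuned implementation.

```python
import json

def parse_when_complete(chunks):
    """Accumulate streamed text and return the parsed object once it is valid JSON."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            return json.loads(buffer)      # succeeds only once the object has closed
        except json.JSONDecodeError:
            continue                       # still a partial object, keep buffering
    return None                            # stream ended early, for example at max_tokens

streamed = ['{"category": "ref', 'und", "priority"', ': 4, "needs_hum', 'an": true}']
print(parse_when_complete(streamed))        # full stream: a dict
print(parse_when_complete(streamed[:3]))    # truncated stream: None
```

With an OpenAI-compatible API, a finish_reason of "length" on the response is the signal that the output was cut off at the token limit, and it is worth checking before attempting to parse.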

When to constrain, and when not to

The decision is per call, not per application:

  • Constrain when the output is consumed by a machine: tool and function calls, API responses, data extraction, pipeline stages, anything that gets parsed. Here a parse failure is an outage and constrained decoding is the right default.
  • Do not constrain the reasoning for hard analytical tasks. Let the model think in free text, then extract structure in a second step or behind a reasoning-first field. Wrapping a math or multi-step problem in JSON-mode is how you ship a model that validates and underperforms.
  • Pick the backend for your traffic: XGrammar for repeated grammars and low per-token cost, guidance for dynamic schemas and fast first token, and benchmark at your real concurrency.

Constrained decoding is one of the most useful tools in the production LLM kit and one of the easiest to misuse. Get guaranteed structure where structure is consumed, keep reasoning free where reasoning matters, and never mistake a valid object for a correct one.

Putting it together: reason, constrain, validate

The production-ready pattern combines all three lessons. Reason first inside the structure, let the decoder guarantee the shape, then validate the values yourself:

python
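from openai import OpenAI
from pydantic import BaseModel, field_validator

class Triage(BaseModel):
    reasoning: str          # free-text first, so the model thinks before committing
    category: str
    priority: int

    @field_validator("priority")
    @classmethod
    def _in_range(cls, v):
        if not 1 <= v <= 5:
            raise ValueError("priority must be 1-5")
        return v

# Same local server and sample email as in the earlier examples.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
prompt = ("Triage this support email:\n\n"
          "Subject: Still no refund\n\n"
          "I returned my order two weeks ago and have not seen the refund. "
          "Order #48213. This is getting urgent.")

raw = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[{"role": "user", "content": prompt}],
    extra_body={"guided_json": Triage.model_json_schema()},
).choices[0].message.content

triage = Triage.model_validate_json(raw)   # parsing cannot fail; validation can, and should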

Constrained decoding reliably enforces the structure and types, so the parse step cannot fail. Finer constraints like numeric ranges, and the actual correctness of the values, are yours to check, which is what the validator is for. Structure from the decoder, correctness from your own checks: that division of labor is the entire discipline of reliable structured output.

Key Takeaways

  1. Prompting for JSON is a request; constrained decoding is a guarantee. Masking invalid tokens at every step makes malformed output impossible by construction, removing parse failures from your error budget.
  2. The mechanism is token masking driven by a state machine. An FSM handles regex and choices, a pushdown automaton handles full grammars like JSON. Tools (Outlines, XGrammar, guidance) differ mainly in how cheaply they compute the mask.
  3. In vLLM, declare a Pydantic schema and pass it as guided_json. The same model works offline via GuidedDecodingParams or over the server via extra_body; hosted APIs expose it as a response_format JSON schema.
  4. The latency tax is now small. XGrammar-class engines give near-zero per-token overhead, especially with grammar reuse, but watch for throughput drops at batch 8+ in some paths and measure at your real batch size.
  5. JSON-mode has a reasoning tax. Format restrictions degraded reasoning 10 to 15% in controlled tests, because constraining the format removes the intermediate thinking tokens the model needs. Reproduce it on GSM8K with the script above: answer-only JSON drops well below free-form, and a reasoning-first field recovers almost all of it.
  6. Separate thinking from formatting. Put a free-text reasoning field before the answer field, or reason in an unconstrained call and extract structure in a cheap second call. Never wrap the reasoning itself in the schema.
  7. Schema-valid is not correct. Constrained decoding guarantees shape, not content. A valid object can still have wrong values, so you still need an eval that checks meaning, not just that it parses.
  8. Decide per call. Constrain machine-consumed outputs; keep reasoning free for hard tasks; choose the backend (XGrammar vs guidance) by whether your grammars repeat.

The Acing AI newsletter covers the production LLM stack the way this guide covers constrained decoding: the feature, then the caveat the docs leave out. Subscribe for the grounded version.