Diffusion language models arrive with a tidy marketing claim: unlike the autoregressive models that write one token after another, left to right, these are parallel, non-autoregressive decoders that fill in a whole canvas of tokens at once. It is an appealing pitch, especially for speed. There is just one problem, and a new paper by Ali Asaria, Tony Salomone, and Deep Gandhi, posted to arXiv on June 12, 2026, names it directly: the order in which a shipped checkpoint actually commits its tokens is almost never measured. Everyone repeats the brochure. Almost nobody checks the receipts. So the authors checked.
Wiring up the model to confess
The object of study is DiffusionGemma 26B, a masked discrete-diffusion mixture-of-experts model built on Gemma 4. Masked discrete diffusion works, roughly, by starting from a canvas of masked positions and iteratively "accepting" tokens — filling in positions over a series of steps until the text is complete. The authors instrument this process at the source: they hook the sampler's accept step to record, for every generation, which canvas positions commit, when in the process they commit, and at what confidence. That gives them a frame-by-frame transcript of how the text actually crystallizes, rather than an inference from the final output.
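To make the idea of hooking the accept step concrete, here is a minimal sketch of what such a recorder might look like. It is not the authors' instrumentation: the `CommitTrace` class, the `on_accept` callback, and the toy numbers are all hypothetical, and a real sampler would need its own hook point.

```python
from dataclasses import dataclass, field


@dataclass
class CommitEvent:
    position: int      # canvas index that was accepted
    step: int          # sampler step at which it was accepted
    confidence: float  # probability the model assigned to the accepted token


@dataclass
class CommitTrace:
    events: list[CommitEvent] = field(default_factory=list)

    def on_accept(self, step, positions, confidences):
        """Hypothetical callback: call once per sampler step with what was accepted."""
        for pos, conf in zip(positions, confidences):
            self.events.append(CommitEvent(pos, step, conf))

    def commit_step_by_position(self):
        """Map each canvas position to the step of its most recent commit."""
        return {e.position: e.step for e in self.events}


# Toy generation: three positions commit together at step 0, two at step 1, one at step 2.
trace = CommitTrace()
trace.on_accept(0, [0, 1, 5], [0.99, 0.97, 0.91])
trace.on_accept(1, [2, 3], [0.88, 0.93])
trace.on_accept(2, [4], [0.75])
steps = trace.commit_step_by_position()
```

Recording at the accept step, rather than reconstructing order from the finished text, is what makes the later questions about ties and granularity answerable at all.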
They run this across a sizable probe suite — 686 prompts spanning six different regimes — so the conclusions are not pinned to one lucky prompt or one task type. And the conclusions are pointedly deflationary.
Neither of the two stories is true
The two clean narratives on offer are "parallel" (everything resolves at once, order doesn't exist) and "block-autoregressive" (the model decodes in left-to-right blocks of some fixed size). DiffusionGemma fits neither. What the authors find is a partial left-to-right commit bias — a lean toward finishing earlier positions first — but with a crucial twist: the apparent strength of that bias depends almost entirely on the granularity at which you look. Examine the order token by token and it is weak. Coarsen the analysis, grouping positions into bigger chunks, and the order strengthens smoothly, until it looks like the model is marching through tidy blocks.
The kicker is that this block structure is not in the model. As the authors put it, the model's "block size" turns out to be an artifact of the measuring ruler rather than the architecture. There is no built-in block; there is a smooth, weak directional tendency that you can make look like blocks of any size simply by choosing how coarsely to measure. That is precisely the kind of result a skeptic loves: a widely repeated structural claim dissolving into a measurement choice.
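The ruler effect can be reproduced on a toy trace. The sketch below is an illustration, not the paper's statistic: it assumes a simple pairwise concordance score (a Kendall-style count of position pairs whose commit order agrees with their left-to-right order) and averages commit steps within blocks of a chosen size. The function names and the example trace are made up.

```python
from itertools import combinations
from statistics import mean


def block_means(commit_steps, block_size):
    """Average commit step over consecutive blocks of `block_size` positions."""
    return [
        mean(commit_steps[i:i + block_size])
        for i in range(0, len(commit_steps), block_size)
    ]


def order_score(commit_steps, block_size=1):
    """Concordance in [-1, 1]: +1 means earlier blocks always commit earlier."""
    values = block_means(commit_steps, block_size)
    pairs = list(combinations(range(len(values)), 2))
    if not pairs:
        return 0.0
    signed = 0
    for i, j in pairs:  # i is always to the left of j
        if values[i] < values[j]:
            signed += 1
        elif values[i] > values[j]:
            signed -= 1
    return signed / len(pairs)


# Toy trace: a left-to-right drift with every neighbouring pair swapped.
steps = [1, 0, 3, 2, 5, 4, 7, 6]
token_level = order_score(steps, block_size=1)
block_of_two = order_score(steps, block_size=2)
block_of_four = order_score(steps, block_size=4)
```

On this trace the token-level score is 20/28, while blocks of two or four score a perfect 1.0. Nothing about the trace changed between the three calls; only the ruler did.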
Part of why order is so slippery here is that the model commits in large simultaneous batches. When many positions accept at the same step, the relative order within that batch is genuinely undefined — not merely unobserved, but nonexistent as a fact about the model. There is no hidden true ordering inside a batch waiting to be revealed; the question simply has no answer. Treating those ties as if they encoded an order is one of the ways an analyst can hallucinate structure.
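A small sketch shows how easily ties become invented order. It assumes two hypothetical scoring functions: one ranks positions with a stable sort, which silently breaks ties by position, and one counts only pairs that committed at different steps and reports how many pairs were comparable. Neither is taken from the paper.

```python
from itertools import combinations


def naive_order_score(commit_steps):
    """Rank positions by commit step; the stable sort breaks ties left to right."""
    ranked = sorted(range(len(commit_steps)), key=lambda p: commit_steps[p])
    rank_of = {pos: r for r, pos in enumerate(ranked)}
    pairs = list(combinations(range(len(commit_steps)), 2))
    signed = sum(1 if rank_of[i] < rank_of[j] else -1 for i, j in pairs)
    return signed / len(pairs)


def tie_aware_order_score(commit_steps):
    """Score only pairs with different commit steps; also report their share."""
    signed = comparable = total = 0
    for i, j in combinations(range(len(commit_steps)), 2):
        total += 1
        if commit_steps[i] == commit_steps[j]:
            continue  # same batch: no order exists to measure
        comparable += 1
        signed += 1 if commit_steps[i] < commit_steps[j] else -1
    score = signed / comparable if comparable else None
    coverage = comparable / total if total else 0.0
    return score, coverage


# Six positions that all commit in one batch.
one_batch = [0, 0, 0, 0, 0, 0]
naive = naive_order_score(one_batch)
honest, coverage = tie_aware_order_score(one_batch)
```

The naive score comes out as a perfect 1.0 for a generation that has no order at all, while the tie-aware version returns no score and a coverage of zero. Reporting coverage alongside the score keeps the reader aware of how much of the canvas the number actually describes.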
It depends on what you ask it to write
The behavior also shifts by task, which further undercuts any single tidy description. Structured JSON, the authors report, is committed in essentially arbitrary order; the rigid syntax apparently lets the model fill slots without much directional preference. Nor is the model's per-position commit confidence a uniform signal: on mathematical reasoning, a position's commit confidence tracks correctness, so confidence carries real information there, but on factual recall it carries no such signal. Confidence, in other words, means something on one kind of task and nothing on another. Meanwhile, commitment is aggressive: the model finishes in a short late burst well inside its step budget rather than easing to a stop. And, importantly, its task accuracy matches that of its autoregressive Gemma 4 sibling. The exotic decoding dynamics neither buy nor cost accuracy relative to the conventional model.
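One common way to ask whether confidence "tracks correctness" is a ranking statistic such as AUROC: the probability that a randomly chosen correct item has higher confidence than a randomly chosen incorrect one. The sketch below assumes that framing, which may differ from the paper's own analysis, and the numbers are invented purely to show the two extremes.

```python
def confidence_auroc(confidences, is_correct):
    """Probability that a correct item outranks an incorrect one (ties count half)."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        return None  # undefined without both correct and incorrect items
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic example where confidence separates right from wrong perfectly.
informative = confidence_auroc(
    [0.95, 0.90, 0.80, 0.60, 0.40],
    [True, True, True, False, False],
)

# Synthetic example where confidence is identical for right and wrong answers.
uninformative = confidence_auroc(
    [0.9, 0.5, 0.9, 0.5],
    [True, True, False, False],
)
```

A value near 1.0 means confidence is a usable signal; a value near 0.5 means it is no better than a coin flip. Computing the statistic per regime, rather than on pooled data, is what would expose a split like the one the authors describe between math and factual recall.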
The real contribution is the honesty
The findings are interesting, but the authors are explicit that their central contribution is methodological: measuring decoding order honestly is harder than it looks, and naive analysis manufactures results that are not really there. They enumerate the traps. Trailing end-of-sequence padding can masquerade as committed content. Different regimes confound each other if pooled carelessly. Commitment is not monotonic over the course of the process. Block-size sensitivity, as shown, lets you conjure any block structure you like. And large commit-batch ties, mishandled, invent an order out of simultaneity. Each of these, left unaddressed, can produce a confident, publishable-looking decoding-order claim that evaporates under scrutiny.
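The first trap, trailing padding, is the simplest to guard against in code. The sketch below assumes a fixed-length canvas whose unused tail is filled with a single end-of-sequence id; the helper names and the id value are hypothetical, and a real pipeline may need to decide whether to keep the first end-of-sequence token.

```python
def content_length(token_ids, eos_id):
    """Length of the canvas once the trailing run of EOS tokens is removed."""
    n = len(token_ids)
    while n > 0 and token_ids[n - 1] == eos_id:
        n -= 1
    return n


def drop_padding(commit_steps, token_ids, eos_id):
    """Keep commit steps only for positions that hold real content."""
    return commit_steps[:content_length(token_ids, eos_id)]


# Toy canvas: three content tokens followed by three EOS pads (eos_id=1 is made up).
tokens = [5, 9, 7, 1, 1, 1]
commit_steps = [3, 2, 4, 0, 0, 0]
trimmed = drop_padding(commit_steps, tokens, eos_id=1)
```

In this toy trace the padding commits first, so leaving it in would make the tail of the canvas look like the earliest-decided content.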
This is the part worth dwelling on. The paper is, at heart, a warning about a whole class of claims — about how diffusion language models decode — that have been made loosely and propagated widely. By building the careful instrument first and then reporting that the tidy stories do not survive contact with it, the authors are doing the thing the field underinvests in: checking whether a popular characterization is an observation or an artifact.
The caveats are the appropriate ones. This is one checkpoint — DiffusionGemma 26B — and the authors do not claim every diffusion model commits tokens the same way; other architectures and training recipes could behave differently. The six-regime suite is broad but not exhaustive. But the meta-lesson travels regardless of the specific model: before you repeat that a diffusion model is "parallel" or "decodes in blocks," ask who measured it, and whether they handled the ties. On this evidence, the honest answer about how DiffusionGemma commits tokens is that it is weakly biased, batch-heavy, regime-dependent, and far less tidy than the label on the box. Strip the marketing and that is what you get.