One of the most provocative ideas in mechanistic interpretability is superposition: the claim that neural networks pack more features, or even more computation, into their neurons than they have dimensions to store cleanly, by overlaying things in shared directions and tolerating a little interference. If true, it would explain why dissecting a network is so hard — there may simply be more going on inside than the neuron count suggests. A toy model from Braun et al. (2025), called Compressed Computation, became a striking exhibit for the stronger version of the claim. A new paper by Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, and Stefan Heimersheim, posted to arXiv on June 12, 2026, takes that exhibit apart and argues, in its own careful words, that compressed computation is (probably) not computation in superposition.
The puzzle that made the toy model famous
The setup is deliberately minimal so the surprise is unambiguous. The Compressed Computation model appears to compute 100 separate ReLU functions using just 50 neurons and, more pointedly, it achieves a lower loss than a model that computes only 50 of those functions and gives up on the rest. That gap is the whole story. If 50 neurons can only honestly do 50 ReLUs, where does the extra performance come from? The tempting answer is computation in superposition: the network is somehow squeezing real computation for all 100 functions into a 50-dimensional space, exactly the phenomenon interpretability researchers have been hunting for. As a clean, reproducible demonstration, it was an attractive piece of evidence.
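To make the "give up on half" comparison concrete, here is a minimal sketch of that naive baseline. It assumes inputs drawn uniformly from [-1, 1], an optional per-feature activation probability `p_active`, and mean squared error against plain ReLU targets. These are illustrative assumptions, not the exact setup of Braun et al. or of the new paper, and the function names are hypothetical.

```python
import numpy as np

def naive_baseline_loss(n_features=100, n_neurons=50, p_active=1.0,
                        n_samples=20_000, seed=0):
    """Monte Carlo MSE of a model that computes the first `n_neurons`
    ReLUs exactly and outputs zero for every other feature.
    The input distribution is an assumption made for illustration."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    # Each feature is independently switched on with probability p_active.
    x = x * (rng.random((n_samples, n_features)) < p_active)
    target = np.maximum(x, 0.0)
    pred = target.copy()
    pred[:, n_neurons:] = 0.0  # the features the baseline gives up on
    return float(np.mean((pred - target) ** 2))

def naive_baseline_closed_form(n_features=100, n_neurons=50, p_active=1.0):
    """Same quantity analytically: for x ~ Uniform[-1, 1],
    E[ReLU(x)^2] = 1/6, scaled by how often a feature is active
    and by the fraction of features that were dropped."""
    dropped_fraction = (n_features - n_neurons) / n_features
    return dropped_fraction * p_active / 6.0

mc_loss = naive_baseline_loss()
exact_loss = naive_baseline_closed_form()
print(f"Monte Carlo: {mc_loss:.4f}, closed form: {exact_loss:.4f}")
```

Under these assumptions the closed form comes to 1/12 per feature for dense inputs. A trained model is surprising only to the extent that its loss falls below a reference line of this kind.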
Following the extra performance to its source
The new paper's method is to stop admiring the gap and start auditing it. The authors find that the model mixes its inputs through its noisy residual stream — the running sum of vectors that flows through the network — and that this mixing corresponds to an unintended mixing matrix sitting in the labels the model is trained against. In plain terms: because the inputs bleed into each other via noise, the target the model is actually fitting is not 100 clean independent ReLUs but a version of the task that has accidental cross-talk baked in. The "extra" computation may not be extra computation at all; it may be the model exploiting structure that was unintentionally introduced into the problem itself.
To test this, the authors split the training objective into two pieces. One piece is the genuine ReLU term: the actual functions you wanted computed. The other is the mixing term: the contribution of that unintended cross-talk. With the objective decomposed, the diagnostic becomes clean. They find that the performance gains scale with the magnitude of the mixing matrix and, decisively, that the gains vanish when the matrix is removed. Take away the accidental mixing and the surprising efficiency goes with it. That is close to a smoking gun: if the effect that made the model interesting evaporates once you delete the mixing term, then the mixing term, not superposition, is what was doing the work.
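The two-piece objective can be sketched as labels built from a ReLU term plus a scaled linear mixing term. The additive form, the `mixing_scale` knob, and the random symmetric stand-in matrix below are assumptions chosen for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def decomposed_labels(x, mixing_matrix, mixing_scale=1.0):
    """Split hypothetical training labels into a ReLU term and a
    mixing term. `mixing_scale` = 0 removes the cross-talk entirely."""
    relu_term = np.maximum(x, 0.0)
    # Each output picks up a linear combination of the other inputs.
    mixing_term = mixing_scale * (x @ mixing_matrix.T)
    return relu_term, mixing_term, relu_term + mixing_term

rng = np.random.default_rng(0)
n_features = 100
x = rng.uniform(-1.0, 1.0, size=(8, n_features))

# Stand-in mixing matrix: small, symmetric, zero diagonal (an assumption).
A = rng.normal(scale=0.05, size=(n_features, n_features))
M = (A + A.T) / 2.0
np.fill_diagonal(M, 0.0)

relu_clean, mix_off, labels_clean = decomposed_labels(x, M, mixing_scale=0.0)
_, mix_1x, labels_1x = decomposed_labels(x, M, mixing_scale=1.0)
_, mix_2x, labels_2x = decomposed_labels(x, M, mixing_scale=2.0)
```

Sweeping `mixing_scale` in a setup like this is one way to run the diagnostic described above: train against each set of labels and check whether the advantage over the naive baseline grows with the scale and disappears at zero.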
Where the neurons actually point
The authors add a second strand of evidence about how the model organizes itself. They examine the directions the learned neurons settle into and find that those directions concentrate in the subspace associated with the top 50 eigenvalues of the mixing matrix. In other words, the model's internal geometry lines up with the structure of the cross-talk, not with an even spread across all 100 functions. The mixing term, they conclude, governs the solution: it is steering where the network puts its limited capacity. A network genuinely doing computation in superposition would be expected to spread its capacity more evenly across the 100 functions; this one tracks the mixing matrix.
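One way to quantify "concentrate in the subspace" is the fraction of each neuron's squared weight norm that lies in the span of the top-k eigenvectors. The sketch below is an illustrative measurement, not the authors' code. It assumes a symmetric mixing matrix, takes "top" to mean largest eigenvalue by value, and uses random stand-ins for both the matrix and the neuron weights.

```python
import numpy as np

def top_k_subspace_fraction(neuron_weights, mixing_matrix, k):
    """For each neuron (one row of `neuron_weights`), return the share
    of its squared norm inside the top-k eigenspace of a symmetric
    `mixing_matrix`. 1.0 means fully inside, k/n is the chance level."""
    eigvals, eigvecs = np.linalg.eigh(mixing_matrix)  # ascending order
    top_vecs = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    projected = neuron_weights @ top_vecs
    return np.sum(projected ** 2, axis=1) / np.sum(neuron_weights ** 2, axis=1)

rng = np.random.default_rng(0)
n_features, n_neurons, k = 100, 50, 50

# Random symmetric stand-in for the mixing matrix (an assumption).
A = rng.normal(size=(n_features, n_features))
M = (A + A.T) / 2.0

# Neurons pointing in random directions versus neurons built
# entirely from the top-k eigenvectors.
random_neurons = rng.normal(size=(n_neurons, n_features))
_, vecs = np.linalg.eigh(M)
aligned_neurons = rng.normal(size=(n_neurons, k)) @ vecs[:, -k:].T

random_frac = top_k_subspace_fraction(random_neurons, M, k).mean()
aligned_frac = top_k_subspace_fraction(aligned_neurons, M, k).mean()
print(f"random: {random_frac:.2f}, aligned: {aligned_frac:.2f}")
```

Randomly oriented neurons land near k/n, which is 0.5 here, while neurons confined to the top-50 eigenspace score 1. A trained model whose neurons score well above the chance level is tracking the mixing matrix in the sense the paragraph describes.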
Finally, the authors build a deliberately simple, non-neural baseline to see how much of the phenomenon a mechanical procedure can recover. Using a semi-non-negative matrix factorization (SNMF) derived solely from the mixing matrix, they reproduce the qualitative shape of the loss profile and even improve on prior baselines. They are candid about the limit: the SNMF baseline does not fully match the trained model, so the claim is not that a matrix factorization is the network. But the fact that a procedure built only from the mixing matrix captures the qualitative behavior is strong corroboration that the mixing matrix is the real protagonist.
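For readers unfamiliar with the technique, a semi-non-negative factorization approximates a matrix X as F G^T, where G is constrained to be non-negative and F is free. The sketch below uses the standard multiplicative updates of Ding, Li, and Jordan on a random stand-in matrix. It is a generic illustration of semi-NMF, not a reconstruction of how the authors turn the factors into a baseline, and every name in it is hypothetical.

```python
import numpy as np

def _pos(a):
    return (np.abs(a) + a) / 2.0  # positive part, elementwise

def _neg(a):
    return (np.abs(a) - a) / 2.0  # magnitude of negative part

def semi_nmf(X, rank, n_iter=100, seed=0, eps=1e-9):
    """Approximate X (n_rows x n_cols) as F @ G.T with G >= 0.
    F is solved by least squares, G by multiplicative updates."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], rank)) + 0.1  # strictly positive start
    errors = []
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)   # unconstrained factor
        XtF, FtF = X.T @ F, F.T @ F
        numer = _pos(XtF) + G @ _neg(FtF)
        denom = _neg(XtF) + G @ _pos(FtF)
        G = G * np.sqrt((numer + eps) / (denom + eps))  # keeps G >= 0
        errors.append(float(np.linalg.norm(X - F @ G.T)))
    return F, G, errors

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 100))
M = (A + A.T) / 2.0  # random symmetric stand-in for the mixing matrix

F, G, errors = semi_nmf(M, rank=50)
print(f"reconstruction error: {errors[0]:.2f} -> {errors[-1]:.2f}")
```

The non-negativity constraint on one factor loosely mirrors the fact that ReLU activations cannot be negative, which is one plausible reason this family of factorizations suits a 50-neuron ReLU layer. That reading is an interpretation, not a claim taken from the paper.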
Why this matters beyond one toy
It is tempting to shrug at a debate over a 50-neuron toy, but the stakes are larger than the model. Toy models are how a field calibrates its intuitions; an interpretability community that points to Compressed Computation as a canonical example of computation in superposition is, in part, deciding what superposition means and where to look for it in real networks. If the canonical example turns out to be an artifact of unintended input mixing rather than genuine overlaid computation, then conclusions and follow-on work resting on it inherit that flaw. This is the unglamorous, essential maintenance work of a young science: re-examining the load-bearing demonstrations before the field builds too much on top of them.
The paper is appropriately measured about its own reach. The hedge in the title — "(probably)" — is doing real work, and the authors flag that their SNMF baseline does not perfectly reproduce the trained model, leaving room for a residual that future analysis might explain differently. The result is best read not as "superposition isn't real" — it makes no such sweeping claim — but as "this particular toy is not good evidence for it," because its surprising efficiency traces cleanly to an accidental mixing term that, when removed, takes the surprise with it. For anyone trying to understand what a network is actually doing inside, that distinction — between a real phenomenon and an artifact of how the task was set up — is exactly the kind of thing worth getting right early.