Pose the question most people are too polite to ask: if a model has hundreds of billions of parameters, why doesn't it cost hundreds of billions of parameters' worth of math to answer a single question? For models built as a mixture of experts, the answer is that they don't run the whole network for every input. They run a small slice of it, chosen on the fly.
Forget the name for a second; here is the mechanism. The network is broken into many parallel sub-networks, each called an expert. Sitting in front of them is a small, trainable component called a gating network. For each piece of input (in a language model, roughly each token) the gating network looks at that input and picks a handful of experts to actually do the work. The rest of the experts sit idle for that token. The output is a weighted combination of just the experts that were selected.
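The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from any particular system: the names `moe_forward`, `make_expert`, and `gate_weights`, the tiny dimensions, the random weights, and the choice to renormalise with a softmax over only the selected experts are all assumptions made for the example.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_expert(rng, dim, hidden):
    """Hypothetical expert: a tiny two-layer feed-forward net with random weights."""
    w_in = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(w_in, x)]  # ReLU activation
        return matvec(w_out, h)

    return expert

def moe_forward(x, gate_weights, experts, k):
    """Route x to the k highest-scoring experts and mix their outputs."""
    logits = matvec(gate_weights, x)  # one gate score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])  # renormalise over the chosen k only
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)  # only the selected experts are ever evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top, weights

# Illustrative sizes only; real systems are far larger.
rng = random.Random(0)
DIM, HIDDEN, NUM_EXPERTS, K = 4, 8, 8, 2
experts = [make_expert(rng, DIM, HIDDEN) for _ in range(NUM_EXPERTS)]
gate_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen, mix = moe_forward(token, gate_weights, experts, K)
print("experts used:", chosen, "mixing weights:", [round(w, 3) for w in mix])
```

With 8 experts and `K = 2`, six of the eight expert functions are never called for this token, which is the whole source of the compute saving.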
The phrase for this is conditional computation: which parts of the network are active depends on the example being processed. The idea is old, but making it work at scale was not. The reference point is the 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer and colleagues, which introduced the sparsely-gated MoE layer that modern systems descend from. Its central claim is precisely the capacity-without-cost trade the technique promises.
"Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation."
Source: Outrageously Large Neural Networks (arXiv:1701.06538)
Three figures from that paper make the scale concrete. The authors describe a sparsely-gated MoE layer "consisting of up to thousands of feed-forward sub-networks," and report model architectures in which an MoE with up to 137 billion parameters is applied between stacked recurrent layers, while achieving, in their words, "greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters." The point is the gap between two quantities that people routinely conflate: total parameters (how much the model can hold) and active parameters per input (how much compute it spends to process one example). A dense network ties those together: every parameter participates in every forward pass. An MoE deliberately pries them apart.
The gating network is where the real work hides
The experts are the easy part to picture: ordinary feed-forward sub-networks, each free to specialize during training on whatever inputs the gate ends up routing to it. The subtle component is the gate. The paper describes a trainable gating network that determines a sparse combination of experts to use for each example, meaning it outputs, for every input, a short list of selected experts and the weights used to mix their outputs. "Sparse" is load-bearing here: the gate is trained to pick only a small number, so that the cost of a forward pass tracks the few experts chosen, not the thousands available. Because the gate is trained jointly with everything else, the routing is learned rather than hand-designed: the model discovers which kinds of input to send to which experts. Practical MoE systems add machinery to keep the gate from collapsing onto a few favorite experts and to balance load across them, but the core object is just this: a small learned router in front of a large pool of specialists.
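To make "collapsing onto a few favorite experts" concrete, here is a small, hypothetical diagnostic: count how many tokens each expert receives in a batch and compute the squared coefficient of variation of those counts. The function names and the toy assignments are assumptions for illustration, and real systems may measure or penalise imbalance differently. The sketch only shows that a collapsed gate is easy to detect numerically.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert.

    `assignments` holds one tuple of selected expert indices per token.
    """
    counts = Counter(e for chosen in assignments for e in chosen)
    return [counts.get(i, 0) for i in range(num_experts)]

def cv_squared(loads):
    """Squared coefficient of variation: variance / mean^2 (0.0 means perfectly even)."""
    n = len(loads)
    mean = sum(loads) / n
    if mean == 0:
        return 0.0
    variance = sum((load - mean) ** 2 for load in loads) / n
    return variance / mean ** 2

NUM_EXPERTS = 4
# Toy routing decisions for four tokens, two experts per token (made-up data).
balanced = [(0, 1), (2, 3), (0, 2), (1, 3)]
collapsed = [(0, 1), (0, 1), (0, 1), (0, 1)]

balanced_load = expert_load(balanced, NUM_EXPERTS)    # [2, 2, 2, 2]
collapsed_load = expert_load(collapsed, NUM_EXPERTS)  # [4, 4, 0, 0]

print(balanced_load, cv_squared(balanced_load))    # [2, 2, 2, 2] 0.0
print(collapsed_load, cv_squared(collapsed_load))  # [4, 4, 0, 0] 1.0
```

A score near zero means the work is spread evenly. A large score means a few experts are doing everything while the rest of the parameter pool sits unused.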
This is also why MoE invites a common misreading. A headline parameter count — "a trillion-parameter model" — describes total capacity, the size of the whole pool of experts. It does not tell you how many parameters actually fire per token, which is the number that drives the compute and the cost of serving one query. Two models with the same total parameter count can have very different active footprints depending on how many experts the gate selects. When you see a large model that runs more cheaply than its size implies, sparsity in the routing is usually part of the reason.
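The gap between total and active parameters is simple arithmetic. The sketch below uses entirely made-up sizes and a simplified accounting (one pool of equally sized experts plus a block of always-on shared parameters, with the gate's own small parameter count ignored). `moe_param_counts` is a hypothetical helper, not a function from any library.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active-per-token) parameter counts under a simplified model.

    shared_params: parameters every token uses (attention, embeddings, and so on).
    params_per_expert: size of one expert; all experts assumed equal.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

# Two hypothetical models with the same headline size but different routing.
# Model A: many small experts, 2 selected per token.
total_a, active_a = moe_param_counts(
    shared_params=10_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=64,
    experts_per_token=2,
)
# Model B: fewer, larger experts, also 2 selected per token.
total_b, active_b = moe_param_counts(
    shared_params=10_000_000_000,
    params_per_expert=4_000_000_000,
    num_experts=16,
    experts_per_token=2,
)

print(f"A: total={total_a / 1e9:.0f}B active={active_a / 1e9:.0f}B")  # A: total=74B active=12B
print(f"B: total={total_b / 1e9:.0f}B active={active_b / 1e9:.0f}B")  # B: total=74B active=18B
```

Both hypothetical models would be advertised as 74 billion parameters, yet model B touches half again as many parameters per token as model A.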
Why this shows up in business filings, not just research
The reason a mechanism paper from 2017 still matters commercially is that inference — serving a trained model to users — is a recurring operating cost that scales with usage, and it is paid in compute. A technique that lets total capacity grow while holding active-compute-per-token roughly flat is, read through an income statement, a technique that bends the cost curve of running AI. That is why expert-routing methods keep appearing in the patent estates and engineering disclosures of the companies that operate models at scale: the same hyperscalers whose capital-expenditure lines have climbed steeply to build AI capacity have a direct interest in spending that capacity efficiently per query. The research mechanism and the business pressure point at the same place.
One honest caveat closes the loop. "Mixture of experts" describes a family of designs, not a single fixed recipe; systems differ in how many experts they hold, how many the gate selects per token, and how they balance the load — and those choices change the capacity-versus-cost math in ways the headline parameter count never reveals. What is durable across all of them is the idea the 2017 paper realized at scale: route each input to a few specialists rather than running the whole network, and you can make a model far larger in what it knows without making it proportionally more expensive to ask. That decoupling — capacity up, compute-per-example held in check — is the entire point, and it is why the technique remains foundational to how the largest models are built and served.