What a Mixture-of-Experts Model Does | NeuralDocket

The architecture behind today's largest models is older and simpler than the hype suggests. Google's own grant explains the mechanism.

Forget the name for a second — here's the mechanism. A mixture-of-experts model is a single large network carved into many smaller sub-networks, each called an "expert." In front of them sits a tiny gating network whose only job is to look at an incoming token and decide which one or two experts should handle it. Everything else stays dormant. The model is enormous in total parameter count but cheap to run on any single token, because most of it is switched off at any given moment.

That trade — huge capacity, sparse activation — is not a 2024 invention. The foundational description sits in Google's granted patent US10719761B2, "Mixture of experts neural networks," issued in 2020 to inventors including Noam Shazeer and Azalia Mirhoseini. The grant places the MoE block precisely: between two ordinary layers of a larger network.

"A system includes a neural network that includes a Mixture of Experts (MoE) subnetwork between a first neural network layer and a second neural network layer. The MoE subnetwork includes multiple expert neural networks." (U.S. Patent No. 10,719,761)

How the gate actually picks experts

The claims spell out the routing step that the word "gating" hides, and it is more than a simple switch. For each input, the gating subsystem first "generate[s] an initial gating output by applying a set of gating parameters to the first layer output." It then does three things in sequence: applies a "sparsifying function" that keeps only the k highest values and pushes the rest to a value that maps to zero; applies a softmax to turn those survivors into weights; and selects only the experts with non-zero weight. The chosen experts each process the input, and their outputs are combined "in accordance with the weights" — weighted and summed — before being handed to the next layer. So routing is not "pick one"; it is "score every expert, keep the top few, blend their answers by confidence."
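The routing steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `top_k_gate`, `moe_forward`, and `make_expert` are hypothetical, the gate is a bare dot product, and the "experts" are toy functions that just scale their input.

```python
import math

def top_k_gate(scores, k):
    """Sparsify to the k highest scores, then softmax; all other weights are 0."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    # Scores outside the top k become -inf, which softmax maps to exactly 0.
    sparse = [s if i in keep else -math.inf for i, s in enumerate(scores)]
    peak = max(sparse)
    exps = [math.exp(s - peak) for s in sparse]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_params, experts, k=2):
    """Score every expert, run only the selected ones, blend by gate weight."""
    # Initial gating output: one score per expert (dot product of its row with x).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_params]
    weights = top_k_gate(scores, k)
    output = [0.0] * len(x)
    for weight, expert in zip(weights, experts):
        if weight == 0.0:
            continue  # experts with zero weight are never evaluated
        output = [o + weight * y for o, y in zip(output, expert(x))]
    return output, weights

calls = []  # records which toy experts actually ran

def make_expert(idx, scale):
    """Hypothetical toy expert: multiplies its input by a fixed scale."""
    def expert(x):
        calls.append(idx)
        return [scale * v for v in x]
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
gate_params = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
output, weights = moe_forward([1.0, 2.0], gate_params, experts, k=2)
print(weights)  # non-zero only at the two highest-scoring experts
print(calls)    # only those two experts were evaluated
```

With four experts and k=2, the gate still scores all four, but only two expert functions are ever called. That is the whole saving.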

Two details in the grant are easy to miss and matter in practice. First, the gate deliberately adds "tunable Gaussian noise" to its scores before sparsifying — trainable noise parameters multiplied by random samples — which nudges the model to spread load across experts during training instead of collapsing onto a favorite few. Second, the claims describe a hierarchical option: "a parent gating subnetwork and a plurality of child gating subnetworks," where each child "manages a disjoint subset of the plurality of expert neural networks." That is how you route among very large numbers of experts without one gate having to score all of them at once.
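Both details are easier to see in code. The sketch below is illustrative only: `add_tunable_noise` and `route_two_level` are hypothetical names, the noise scales are fixed numbers here rather than trained parameters, and the hierarchy is simplified to picking a single winner at each level instead of the top-k weighting the claims describe.

```python
import random

def add_tunable_noise(clean_scores, noise_scales, rng):
    """Return score + noise_scale * N(0, 1) for each expert."""
    return [s + scale * rng.gauss(0.0, 1.0)
            for s, scale in zip(clean_scores, noise_scales)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def route_two_level(parent_scores, child_gates, groups):
    """Parent gate picks a group; only that group's child gate scores its experts."""
    group = argmax(parent_scores)
    child_scores = child_gates[group]()
    expert_id = groups[group][argmax(child_scores)]
    scores_computed = len(parent_scores) + len(child_scores)
    return expert_id, scores_computed

# Noise: experts 0 and 1 are nearly tied, so noise lets both win sometimes.
rng = random.Random(0)
clean = [2.0, 1.9, 0.1, 0.0]
wins = [0, 0, 0, 0]
for _ in range(1000):
    wins[argmax(add_tunable_noise(clean, [0.5] * 4, rng))] += 1
print(wins)

# Hierarchy: 16 experts in 4 disjoint groups of 4 (assumed toy scores).
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
child_tables = [[0.2, 0.1, 0.4, 0.3], [0.1, 0.2, 0.9, 0.3],
                [0.5, 0.5, 0.1, 0.2], [0.3, 0.8, 0.1, 0.6]]
child_gates = [(lambda t=t: t) for t in child_tables]
expert_id, scores_computed = route_two_level([0.1, 0.7, 0.3, 0.2], child_gates, groups)
print(expert_id, scores_computed)
```

Without noise, expert 0 would win every one of the 1,000 draws. With it, the near-tied expert 1 gets a real share of the traffic. In the two-level example, the router produces 8 scores (4 parent, 4 child) instead of the 16 a flat gate would need, and the gap widens as the expert count grows.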

Two consequences follow from the design. First, you can grow total capacity almost for free in per-token compute: each added expert gives the gate one more score to produce, but the sparsifying step still only lights up the top k. Second, you inherit a load-balancing problem: if the gate keeps favoring the same experts, the rest never learn, which is exactly the failure the tunable noise is there to fight. Google's continuation grants, US12067476B2 (2024) and US12518135B2 (2026), the latter describing a sparse and differentiable variant, track the family's evolution toward routing that can be trained end to end.
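The capacity claim is simple arithmetic, and a rough count makes it concrete. The function below is a back-of-the-envelope sketch: `moe_layer_params` is a hypothetical helper, and the sizes (one million parameters per expert, a 1,024-feature gate input, k=2) are made-up numbers chosen for illustration, not figures from the patent or any real model.

```python
def moe_layer_params(num_experts, k, params_per_expert, gate_inputs):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    gate = gate_inputs * num_experts  # one gate weight per (input feature, expert)
    total = gate + num_experts * params_per_expert
    active = gate + k * params_per_expert
    return total, active

# Assumed sizes: 1M parameters per expert, 1,024 gate inputs, top-2 routing.
counts = {n: moe_layer_params(n, k=2, params_per_expert=1_000_000, gate_inputs=1024)
          for n in (8, 64, 512)}
for n, (total, active) in counts.items():
    print(f"{n:>4} experts: total={total:,} active={active:,} ({active / total:.2%})")
```

Going from 8 to 512 experts multiplies the layer's total parameters by about 64, while the parameters touched per token rise by roughly a quarter, all of it from the larger gate. The catch this count leaves out is memory: every expert still has to be stored somewhere, even if it rarely runs.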

The hospital, not the generalist

Here's the analogy, then I'll drop it: an MoE model is a hospital, not a single overworked generalist. The gating network is triage at the door, sending each patient to the cardiologist or the dermatologist rather than making one doctor learn everything. The "top-k" rule is triage referring you to at most a couple of specialists, not the whole staff. The hospital can hire more specialists without slowing down any single visit. But triage has to be good, and, per the load-balancing wrinkle, you can't let the cardiology wing sit idle while everyone queues for dermatology, which is why the gate is trained with noise to keep the whole staff busy.

Why does this matter beyond architecture trivia? Because MoE is one of the few levers that lets model builders chase capability without a linear increase in inference cost, and inference cost is the line item that shows up in hyperscaler filings as data-center capex. The technique on the patent page and the spending in the 10-K are two ends of the same story. Google's parent Alphabet, in its most recent annual report, describes ongoing investment in technical infrastructure, including servers and data centers, to support the growth of its business (Alphabet Form 10-K, filed 2026-02-05).

The grant is also explicit that the experts are interchangeable in form but not in content: the claims note the expert networks "have the same or similar architectures but different parameter values." That is the architectural reason MoE scales so cleanly — you are not designing a new kind of sub-network for each expert, you are stamping out copies of the same block and letting training drive them to specialize. The gate's job, then, is purely to learn which copy is best for which input, and the "sparse vector" of weights it produces — "non-zero weights for only a few of the expert neural networks" — is the literal mechanism by which most of the model stays dark on any given token.

So when a vendor announces a model with a trillion parameters but quotes a modest serving cost, the explanation is usually some flavor of the mechanism in these grants: most of those parameters are asleep on any given token, because a noisy top-k gate woke up only a handful of experts and blended their answers. The number that sounds impressive and the number that sounds affordable are reconciled by routing — and routing, in exactly this form, is what the patents actually claim.
