Forget the name for a second — here's the mechanism. A mixture-of-experts model is a single large network carved into many smaller sub-networks, each called an "expert." In front of them sits a tiny gating network whose only job is to look at an incoming token and decide which one or two experts should handle it. Everything else stays dormant. The model is enormous in total parameter count but cheap to run on any single token, because most of it is switched off at any given moment.
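To make that concrete, here is a minimal sketch of a sparse forward pass in plain Python. Everything in it is an illustrative assumption rather than anything taken from a real model: the tokens are single floats, `toy_gate` and `toy_experts` are made-up functions, and renormalizing the gate scores with a softmax over only the chosen experts is one common convention among several.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    # Assumption: weights are renormalized over the chosen experts only.
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(token, gate, experts, k=2):
    """Run only the k selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate(token), k)
    output = 0.0
    for idx, weight in routes:
        # Experts that were not selected are never called for this token.
        output += weight * experts[idx](token)
    return output, [idx for idx, _ in routes]

# Hypothetical toy setup: four scalar "experts" and a hand-written gate.
toy_experts = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: -x,
    lambda x: x * x,
]

def toy_gate(x):
    return [x, -x, 0.5 * x, 0.0]

output, used = moe_forward(2.0, toy_gate, toy_experts, k=2)
print("experts used:", used, "output:", output)
```

Only two of the four expert functions run for the token. In a real model the experts are feed-forward blocks and the gate is a small learned layer, but the control flow has the same shape.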

That trade — huge capacity, sparse activation — is not a 2024 invention. The foundational description sits in Google's granted patent US10719761B2, "Mixture of experts neural networks," issued in 2020 to inventors including Noam Shazeer and Azalia Mirhoseini. The grant describes a gating subsystem that selects a subset of expert networks for each input and combines their outputs — the precise pattern that later showed up, scaled, in frontier language models.

In practice, the mechanism comes down to a routing decision. For each token, the gating network produces a score per expert and keeps only the top few. Two consequences follow. First, you can grow total capacity at almost no extra per-token compute, because each token still visits only a few experts no matter how many exist; the added experts do cost memory, since every one of them has to be stored. Second, you inherit a load-balancing problem: if the gate keeps favoring the same experts, the rest see few tokens and never learn. Google's continuation grants, US12067476B2 (2024) and US12518135B2 (2026), the latter describing a sparse and differentiable variant, track the family's evolution toward routing that can be trained end to end.
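The load-balancing problem is easy to see with a little counting. The sketch below is a hedged illustration, not the method claimed in any of the grants: the helper names `expert_load` and `imbalance` are hypothetical, the routing assignments are invented, and the squared coefficient of variation is just one balance measure that appears in the MoE literature.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Fraction of routed token slots that each expert received.

    `assignments` holds one list of chosen expert indices per token.
    """
    counts = Counter(idx for chosen in assignments for idx in chosen)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(i, 0) / total for i in range(num_experts)]

def imbalance(load):
    """Squared coefficient of variation of the load; 0.0 means perfectly even."""
    mean = sum(load) / len(load)
    if mean == 0:
        return 0.0
    variance = sum((x - mean) ** 2 for x in load) / len(load)
    return variance / (mean ** 2)

# Invented routing decisions for four tokens, top-2 routing, four experts.
balanced = [[0, 1], [2, 3], [0, 2], [1, 3]]
skewed = [[0, 1], [0, 1], [0, 1], [0, 1]]

for name, assignments in (("balanced", balanced), ("skewed", skewed)):
    load = expert_load(assignments, num_experts=4)
    print(name, load, imbalance(load))
```

In the skewed case, experts 2 and 3 receive no tokens at all, so no gradient ever reaches them. Training setups commonly add a penalty of roughly this kind to the loss so the gate is nudged toward spreading tokens more evenly.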

Here's the analogy, then I'll drop it: an MoE model is a hospital, not a single overworked generalist. The gating network is triage at the door, sending each patient to the cardiologist or the dermatologist rather than making one doctor learn everything. The hospital can hire more specialists without slowing down any single visit. But triage has to be good, and you can't let the cardiology wing sit empty while everyone queues for dermatology.

Why does this matter beyond architecture trivia? Because MoE is one of the few levers that let model builders chase capability without a linear increase in inference cost, and inference cost is the line item that shows up in hyperscaler filings as data-center capex. The technique on the patent page and the spending in the 10-K are two ends of the same story. Google's parent Alphabet, in its most recent annual report, describes ongoing investment in technical infrastructure, including servers and data centers, to support growth (Alphabet Form 10-K, filed 2026-02-05).

So when a vendor announces a model with a trillion parameters but quotes a modest serving cost, the explanation is usually some flavor of the mechanism in these grants: most of those parameters are asleep on any given token. The number that sounds impressive and the number that sounds affordable are reconciled by routing — and routing is what the patents actually claim.
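The reconciliation is plain arithmetic. Here is a back-of-the-envelope sketch with invented numbers: the 40 billion shared parameters, 15 billion per expert, 64 experts, and top-2 routing are assumptions chosen to land on a round trillion, not figures from any vendor or patent, and the per-expert figure is treated as summed across all layers for simplicity.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, active_experts):
    """Return (total, active_per_token) parameter counts for a sparse model.

    Shared parameters (attention, embeddings, the gate) run on every token;
    expert parameters run only when the gate selects that expert.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return total, active

# Hypothetical model: all four numbers below are made up for illustration.
total, active = moe_param_counts(
    shared_params=40_000_000_000,
    params_per_expert=15_000_000_000,
    num_experts=64,
    active_experts=2,
)

print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"active fraction:   {active / total:.1%}")
```

Under these assumptions the model has one trillion parameters in total but touches only 70 billion of them per token, about 7 percent. The headline number tracks memory footprint; the serving cost tracks the active count.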