Connect three data points and you'll see why inference cost became the conversation. Point one: efficiency patents are clustering on inference. Microsoft's perplexity-routing grant US12547872B2 (2026) decides how much compute an input deserves; Google's mixture-of-experts family (US12518135B2, 2026) activates only part of the model per token. Both target the per-query cost.
Point two: the filings show capacity still scaling up. Microsoft ties its datacenter network to "cloud computing and AI infrastructure" demand (Microsoft Form 10-K, FY2024); Alphabet describes growing investment in servers and datacenters (Alphabet Form 10-K, FY2025, filed 2026-02-05). Most of that capacity is not for training the next model; it is for serving the current ones to users, over and over.
Point three: the supplier agrees. NVIDIA lists "training and inference" together as its workloads (NVIDIA Form 10-K, FY2026), and inference is the half that grows with adoption rather than topping out after a training run.
These three are the same story. Training is a capability bet you make once per model; inference is the meter that runs every time anyone uses it. As AI products get adopted, inference spend, not training spend, becomes the cost that dominates, and the cost that compounds. That's why a perplexity-routing patent and a datacenter-capex disclosure are about the same thing: the per-query cost of a workload that never stops.
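To make the "once versus every time" point concrete, here is a minimal sketch of the arithmetic. Every number in it is a made-up placeholder, not a figure from any filing, and the function names are hypothetical. It assumes a flat cost per query and steady daily traffic, which real workloads will not have.

```python
import math


def inference_spend(cost_per_million_queries, million_queries_per_day, days):
    """Cumulative serving cost after `days`, assuming flat traffic and flat unit cost."""
    return cost_per_million_queries * million_queries_per_day * days


def breakeven_day(training_cost, cost_per_million_queries, million_queries_per_day):
    """First whole day on which cumulative inference spend reaches the one-time training cost."""
    daily = cost_per_million_queries * million_queries_per_day
    if daily <= 0:
        raise ValueError("daily inference spend must be positive")
    return math.ceil(training_cost / daily)


# Hypothetical inputs, in dollars: a $50M training run, $10,000 per million
# queries (one cent per query), and 100 million queries per day.
TRAINING_COST = 50_000_000
COST_PER_MILLION = 10_000
MILLION_QUERIES_PER_DAY = 100

crossover = breakeven_day(TRAINING_COST, COST_PER_MILLION, MILLION_QUERIES_PER_DAY)
first_year = inference_spend(COST_PER_MILLION, MILLION_QUERIES_PER_DAY, 365)
print(f"Inference spend matches training cost on day {crossover}")
print(f"Inference spend after one year: ${first_year:,}")
```

Under these placeholder inputs the training bill is paid once, while the serving bill passes it in under two months and keeps growing. Change the traffic or the unit cost and the crossover moves, but the shape of the comparison stays the same.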
The reason this matters to anyone outside the engineering teams: a model can be a research triumph and a financial liability if each query costs more than it earns. The whole efficiency-patent wave is the industry trying to fix that, and the capex line in every hyperscaler filing is the bill for not having fixed it yet. Follow both the IP and the money, and inference is where they meet.
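The "costs more than it earns" test can be written down in a few lines. This is a sketch under stated assumptions: the revenue and cost figures are hypothetical, the function names are invented for illustration, and amounts are in micro-dollars per query so the arithmetic stays in whole numbers.

```python
def margin_per_query(revenue_per_query, cost_per_query):
    """Positive means each query earns money; negative means each query loses it."""
    return revenue_per_query - cost_per_query


def required_cost_cut(revenue_per_query, cost_per_query):
    """Fraction by which per-query cost must fall to break even (0.0 if already profitable)."""
    if cost_per_query <= 0:
        raise ValueError("cost_per_query must be positive")
    return max(0.0, 1 - revenue_per_query / cost_per_query)


# Hypothetical unit economics, in micro-dollars per query.
REVENUE = 6_000  # $0.006 earned per query
COST = 8_000     # $0.008 spent per query

print(f"Margin per query: {margin_per_query(REVENUE, COST)} micro-dollars")
print(f"Cost reduction needed to break even: {required_cost_cut(REVENUE, COST):.0%}")
```

With these placeholder numbers every query loses money, and more adoption makes the loss bigger, not smaller. The required cut is the size of the efficiency gain that would close the gap.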