Training and inference get lumped together as "AI compute," as if they were one thing, but they name jobs that behave nothing alike. The cleanest way to hold them apart is by when the work happens and how often. Training is the work of building a model: feeding it a large dataset and repeatedly adjusting its internal parameters so it gets better at the task. It happens once per model version, it is enormous, and when it finishes you have a fixed set of weights. Inference is the work of using that finished model: you hand it one input (a prompt, an image) and it produces one output. That job is far smaller than training, but it runs again for every single request the model serves. Forget the jargon for a second: training is writing the book; inference is printing and handing out each copy.
The training side has a vivid public data point in the 2017 transformer paper "Attention Is All You Need," which reported its result alongside the compute it took to get there — a reminder that training is measured in machine-time on many accelerators at once.
"On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature." (Attention Is All You Need, arXiv:1706.03762)
That was a research model in 2017, and "3.5 days on eight GPUs" was already worth bragging about as a reduction in training cost. Frontier training runs since then are vastly larger. But the shape of the cost is the point: training is a concentrated, up-front expenditure that produces a fixed asset — the trained weights. Once that run finishes, training that version is done. Nothing about serving the model to a million users adds to its training bill.
Why inference is the cost that never stops
Inference has the opposite profile. The compute to answer one query is modest compared with a full training run, but it is incurred every time the model is used, so total inference cost scales with usage rather than with a one-time event. A model that becomes popular costs more to run precisely because it is popular — each new user, each additional query, adds inference compute. This is why "make the model cheaper to run" is a distinct engineering discipline from "make the model better," and why techniques that reduce compute-per-query — sparse expert routing, smaller distilled models, quantization, caching — get so much attention from the companies operating models at scale. Training cost is mostly behind you the day the model ships; inference cost is in front of you for as long as anyone uses it.
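As one illustration of how a compute-per-query lever works, here is a minimal Python sketch of the simplest form of caching: an exact-match response cache. This is only one of several things "caching" can mean in model serving, and it only makes sense when repeated prompts should get identical answers. The names `CachedModel` and `fake_model` are hypothetical, and `fake_model` merely stands in for an expensive forward pass.

```python
from typing import Callable, Dict


class CachedModel:
    """Wraps a model function with an exact-match response cache (illustrative)."""

    def __init__(self, model_fn: Callable[[str], str]) -> None:
        self._model_fn = model_fn
        self._cache: Dict[str, str] = {}
        self.model_calls = 0  # how many times the underlying model actually ran

    def answer(self, prompt: str) -> str:
        # Only pay for inference when this exact prompt has not been seen before.
        if prompt not in self._cache:
            self.model_calls += 1
            self._cache[prompt] = self._model_fn(prompt)
        return self._cache[prompt]


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real, expensive forward pass.
    return prompt.upper()


served = CachedModel(fake_model)
prompts = ["hello", "what is inference?", "hello", "hello"]
answers = [served.answer(p) for p in prompts]
print(f"{len(prompts)} requests, {served.model_calls} model calls")
```

In this toy run, four requests trigger only two model calls. The saving depends entirely on how often identical requests recur, which is why this kind of cache is a complement to, not a substitute for, making each forward pass cheaper.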
There is a useful intuition for why a single training run can dwarf an enormous number of inference calls and yet inference can still cost more in total. Training touches every parameter repeatedly across an entire dataset, for many passes — it is the model being shaped, gradient step by gradient step. A single inference is one forward pass: the data flows through the fixed weights once and an answer comes out, with no learning and no backward pass. So per event, training is far heavier. But there is exactly one training event per model version and there are as many inference events as there are uses. Multiply a small per-query cost by billions of queries and the lifetime inference bill can exceed the one-time training bill — which is why companies serving popular models treat inference efficiency as a primary cost lever, not a rounding error.
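The break-even arithmetic can be sketched in a few lines of Python. The sketch leans on a common rule of thumb for dense transformers, which is an assumption here and not something the text above states: roughly 6 FLOPs per parameter per training token (forward plus backward pass) and roughly 2 FLOPs per parameter per token at inference (forward pass only). The model size, dataset size, and daily traffic are hypothetical numbers chosen for round results.

```python
# Rule-of-thumb FLOP estimates for a dense transformer (assumptions, not
# measurements): ~6 FLOPs per parameter per training token, ~2 FLOPs per
# parameter per inference token.
TRAIN_FLOPS_PER_PARAM_TOKEN = 6
INFER_FLOPS_PER_PARAM_TOKEN = 2


def training_flops(n_params: float, train_tokens: float) -> float:
    """One-time compute to train one model version."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN * n_params * train_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Cumulative compute to serve `served_tokens` tokens."""
    return INFER_FLOPS_PER_PARAM_TOKEN * n_params * served_tokens


def breakeven_tokens(n_params: float, train_tokens: float) -> float:
    """Served tokens at which lifetime inference compute equals training compute."""
    return training_flops(n_params, train_tokens) / (
        INFER_FLOPS_PER_PARAM_TOKEN * n_params
    )


# Hypothetical model and traffic.
N_PARAMS = 7e9          # parameters
TRAIN_TOKENS = 1e12     # tokens seen during training
TOKENS_PER_DAY = 1e10   # tokens served per day once deployed

tokens_needed = breakeven_tokens(N_PARAMS, TRAIN_TOKENS)
days_needed = tokens_needed / TOKENS_PER_DAY
print(f"training compute:  {training_flops(N_PARAMS, TRAIN_TOKENS):.2e} FLOPs")
print(f"break-even volume: {tokens_needed:.2e} served tokens")
print(f"break-even time:   {days_needed:.0f} days at the assumed traffic")
```

Under these assumptions the break-even point is three times the training token count, whatever the model size, because the parameter count cancels out. This compares raw compute only; real costs also depend on hardware utilization and pricing, which the sketch ignores.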
The two also stress different hardware characteristics. Training is throughput-hungry and tolerant of latency — you want to process as much data as possible, and it does not matter if any single batch takes a while. Inference, especially interactive inference, is latency-sensitive — a user is waiting — and often memory-bandwidth-bound, because generating each token requires moving the model's weights and intermediate state through the chip quickly. That is why discussions of inference keep returning to memory bandwidth, while discussions of training dwell on raw floating-point throughput and how many accelerators can be wired together. Same silicon vendor, two different pressure points.
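A back-of-the-envelope sketch shows why memory bandwidth sets a floor on interactive latency. It assumes, as a simplification, that generating one token for a single request requires reading every weight from memory once, and it ignores intermediate state, batching, and the arithmetic itself. The parameter count, weight precision, and bandwidth figure are hypothetical.

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Total size of the model weights in bytes."""
    return n_params * bytes_per_param


def min_seconds_per_token(
    n_params: float, bytes_per_param: float, mem_bandwidth_bytes_per_s: float
) -> float:
    """Lower bound on per-token time if every weight is read once per token."""
    return weight_bytes(n_params, bytes_per_param) / mem_bandwidth_bytes_per_s


# Hypothetical setup: 7e9 parameters on an accelerator with 1e12 bytes/s
# of memory bandwidth, serving one request at a time.
N_PARAMS = 7e9
BANDWIDTH = 1e12

for label, bytes_per_param in [("16-bit weights", 2), ("8-bit weights", 1)]:
    t = min_seconds_per_token(N_PARAMS, bytes_per_param, BANDWIDTH)
    print(f"{label}: at least {t * 1000:.1f} ms per token, "
          f"at most {1 / t:.0f} tokens/s")
```

In this simplified model, halving the bytes per weight halves the floor on per-token time, which is one reason quantization shows up as an inference technique. Adding more floating-point throughput would not move this bound at all.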
The distinction is real enough to appear in filings
This is not just a teaching abstraction. The companies that build the hardware name both workloads explicitly. NVIDIA's annual report describes its data-center platforms as serving "compute-intensive workloads such as artificial intelligence, or AI, model training and inference, data analytics, scientific computing, and 3D graphics." That sentence lists training and inference as separate workloads, because hardware is sold and optimized differently for each. When a company's disclosures and product lines distinguish two things, it is usually because the two things have different economics, and these do: one is the capital event of building the model, the other is the operating cost of running it.
Hold onto the one durable contrast. Training is a large, mostly one-time job that produces a fixed model and behaves like a capital cost. Inference is a small job repeated for every request, so it behaves like an operating cost that grows with adoption. A model can be staggeringly expensive to train and still live or die, commercially, on how cheaply it can be served — because training is paid once and inference is paid forever. Anyone trying to reason about AI economics, hardware choices, or why a given model is priced the way it is should sort the question first into training or inference, because almost everything downstream depends on which of the two they are actually talking about.