Start with the question every team serving a model hits: a GPU can do a lot of math in parallel, but a single user's request rarely uses all of it. So how do you avoid paying for an expensive chip that sits half-idle? You batch: bundle many requests together so the hardware works at full width.
The grant US11442775B1 (issued September 13, 2022, to FriendliAI; a companion grant, US11514370B1, issued that November) describes doing this dynamically for transformer generation. Its CPC classifications, G06F 9/4881 (scheduling) alongside G06N 3/0454 and G06N 3/08 (neural networks), capture the dual nature: it's a scheduling problem wearing a machine-learning hat.
“An inference system applies a machine-learning transformer model to a batch of requests with variable input length or variable target length or variable internal state length by selectively batching a subset of operations in the transformer model but processing requests in the batch individually for…” (U.S. Patent No. 11,442,775)
Under the hood, transformer generation makes naive batching hard because requests finish at different times: one answer is ten tokens, another is a thousand. A static batch stalls on its slowest member, so the slots freed by short requests sit idle until the longest one finishes. Dynamic, or continuous, batching slots new requests in as old ones complete, keeping the hardware busy. That's the core insight, and it's the difference between a cheap and an expensive serving stack.
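To make the stall concrete, here is a toy simulation that counts decode steps under each policy. It is a sketch under simplifying assumptions, not the patent's claimed method: every in-flight request produces one token per step, all requests are queued up front, and prefill cost and memory limits are ignored. The function names `static_batching_steps` and `continuous_batching_steps` are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends.

    Assumes every entry in `lengths` is a positive token count.
    """
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # the whole batch waits for its slowest member
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Refill free slots before running the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: two long answers mixed in with six short ones.
lengths = [10, 1000, 10, 10, 1000, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 2000
print(continuous_batching_steps(lengths, batch_size=4))  # 1010
```

With four slots, the static schedule spends 2000 steps because each batch contains one thousand-token request, while the continuous schedule finishes in 1010 steps because the second long request starts as soon as the first short ones free a slot. The gap depends entirely on how uneven the response lengths are; with identical lengths the two policies take the same number of steps.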
Why this is a sector-defining mechanism: inference cost is the number that decides whether an AI product has a viable margin. Dynamic batching can multiply throughput on the same hardware several-fold, especially when response lengths vary widely. When you read that a provider cut inference cost, batching strategy is very often the hidden lever.
The careful note: it's a granted patent, so it's an enforceable right with claims that set the real scope. But the broad idea of batching is old; the novelty is in the transformer-generation specifics. As a dated marker it's valuable: by late 2022, serving transformers economically was a distinct, patented engineering discipline, not an afterthought.