Here is the question that haunts every large model. Attention lets every token look at every other token, which is powerful but scales quadratically: double the input and you quadruple the cost. So how do you keep the global view without the global bill? NVIDIA's US20240185034A1 (published June 6, 2024) answers with hierarchy.

Here's the mechanism. Instead of every token attending to all others directly, the model attends finely within local neighborhoods and coarsely across summarized chunks. Think of reading a book: you read sentences word-by-word but track the plot chapter-by-chapter, not word-by-word across the whole book. Hierarchical attention gives the model that two-level view.
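To make the two-level view concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the method claimed in the filing: the function name `two_level_attention`, the fixed `block` size, mean-pooling as the summary, and simply adding the two levels together are all choices made for this example, and the learned query/key/value projections of a real transformer are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_attention(x, block):
    """Toy hierarchical attention (hypothetical helper, not from the filing).

    x: (n, d) array of token vectors; n must be divisible by `block`.
    Returns an (n, d) array.
    """
    n, d = x.shape
    assert n % block == 0, "this sketch assumes n is a multiple of block"
    num_blocks = n // block
    scale = 1.0 / np.sqrt(d)
    blocks = x.reshape(num_blocks, block, d)

    # Fine level: each token attends only to the tokens in its own block.
    local_scores = blocks @ blocks.transpose(0, 2, 1) * scale  # (num_blocks, block, block)
    local_out = softmax(local_scores) @ blocks                 # (num_blocks, block, d)

    # Coarse level: one summary per block (assumption: a plain mean).
    summaries = blocks.mean(axis=1)                            # (num_blocks, d)
    global_scores = x @ summaries.T * scale                    # (n, num_blocks)
    global_out = softmax(global_scores) @ summaries            # (n, d)

    # Assumption: combine the two levels by simple addition.
    return local_out.reshape(n, d) + global_out

x = np.random.default_rng(0).normal(size=(8, 4))
out = two_level_attention(x, block=4)
print(out.shape)  # (8, 4)
```

Each token here scores against `block` neighbors plus one summary per block, rather than against all `n` tokens. That is the book analogy in array form: word-by-word inside the chapter, chapter-by-chapter across the book.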

Under the hood, the global level operates on compressed representations (summaries of regions rather than every element), so the expensive long-range comparisons happen on far fewer items. The local level keeps the detail where it matters. The CPC tags G06N 3/0455 and 3/0464 place this in neural-network architecture territory.
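A quick count shows what "far fewer items" buys. The sketch below tallies query-key comparisons for full attention against a two-level scheme; the block size of 64 and the one-summary-per-block layout are assumptions for illustration, not figures from the filing.

```python
def full_attention_pairs(n):
    # Every token is compared with every token.
    return n * n

def hierarchical_pairs(n, block):
    # Hypothetical two-level layout: each token is compared with the
    # tokens in its own block, plus one summary per block.
    num_blocks = -(-n // block)  # ceiling division
    local = n * block
    global_ = n * num_blocks
    return local + global_

for n in (1024, 2048, 4096):
    print(n, full_attention_pairs(n), hierarchical_pairs(n, block=64))
# 1024 1048576 81920
# 2048 4194304 196608
# 4096 16777216 524288
```

One honest wrinkle: with a fixed block size the global term is still n²/block, so it grows quadratically, just with a much smaller constant. Letting the block size grow with the input (roughly √n) brings the total down to about n^1.5, and stacking more than two levels can push it lower still.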

Why a general reader should care: the length of context a model can handle (how much it can read at once) is gated by exactly this attention cost. Every trick that tames the quadratic blowup is a trick that lets models handle longer documents, bigger codebases, more history. NVIDIA's filing on it reflects that the company optimizes models, not just the chips they run on.

House caveat: a published application describes an architecture, not a granted patent or a shipped product, and hierarchical schemes trade some precision for scalability, because the summary can miss something that full attention would catch. As a dated marker, though, it's clean: by mid-2024, making attention scale via hierarchy was the subject of a named NVIDIA filing aimed squarely at the long-context problem.