Hierarchical Self-Attention Explained | NeuralDocket

A June 2024 NVIDIA-team publication describes global hierarchical self-attention. How a model attends to everything without paying for everything.

Start with the question that haunts every large model: attention lets every token look at every other token, which is powerful but scales quadratically. Double the input and the cost quadruples. So how do you keep the global view without the global bill? NVIDIA's patent application US20240185034A1 (published June 6, 2024) answers with hierarchy.

Here's the mechanism. Instead of every token attending to all others directly, the model attends finely within local neighborhoods and coarsely across summarized chunks. Think of reading a book: you read sentences word-by-word but track the plot chapter-by-chapter, not word-by-word across the whole book. Hierarchical attention gives the model that two-level view.

“Apparatuses, systems, and techniques of using one or more machine learning processes (e.g., neural network(s)) to process data (e.g., using hierarchical self-attention).” (U.S. Patent Application 2024/0185034 A1)

The abstract names the trick that makes the hierarchy work: “carrier tokens.” Image data is classified using hierarchical self-attention “generated using carrier tokens that are associated with windowed subregions of the image data, and local attention generated using local tokens within the windowed subregions and the carrier tokens.” A carrier token is a compact stand-in for a whole subregion, a summary that travels up to the global level on behalf of its neighborhood. Local tokens attend to each other and to their carrier token inside a window; the carrier tokens then attend to each other across windows. Long-range information moves between regions via these summaries instead of via every pixel talking to every other pixel.
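To make the carrier-token idea concrete, here is a minimal NumPy sketch. It is not the patent's implementation: the function names (`attention`, `hierarchical_attention`), the use of a plain mean to build each carrier token, the absence of learned query/key/value projections, and the choice to run the global step before the local step are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    """Plain self-attention with no learned projections (illustrative only)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def hierarchical_attention(feature_map, window):
    """feature_map: (H, W, C); window: side length of each square subregion.

    Assumes H and W are divisible by `window`.
    """
    H, W, C = feature_map.shape
    gh, gw = H // window, W // window
    # Split the map into windows: (num_windows, window * window, C).
    local = (feature_map
             .reshape(gh, window, gw, window, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(gh * gw, window * window, C))
    # One carrier token per window; here simply the mean of its local tokens.
    carriers = local.mean(axis=1)                    # (num_windows, C)
    # Global step: carrier tokens attend to each other across windows.
    carriers = attention(carriers)                   # (num_windows, C)
    # Local step: each window's tokens attend jointly with their own carrier.
    joint = np.concatenate([carriers[:, None, :], local], axis=1)
    joint = attention(joint)                         # (num_windows, 1 + window**2, C)
    return joint[:, 1:, :], joint[:, 0, :]           # updated local tokens, carriers
```

Calling `hierarchical_attention(feature_map, window=4)` on an 8×8 map with 4 channels returns local tokens of shape `(4, 16, 4)` and carriers of shape `(4, 4)`. No pixel ever attends directly to a pixel in another window; anything it learns about other regions arrives through its carrier.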

The claims operationalize this as relation values between summaries. The method claim generates first and second subregions of an image, computes a “first metric” from the values in the first subregion and a “second metric” representing the second subregion, then processes the image using those metrics, which play the role of the carrier-token summaries. Dependent claims specify that the metric is, at least in part, an average of the subregion's values, and that the system determines relation values both between a subregion's tokens and its own metric (local) and between the two subregions' metrics (global). Tellingly, several claims state the first relation value is computed “using at least one convolutional neural network and at least one transformer neural network,” with the data “downsampled” after the convolution and before it reaches the transformer. That is an explicit conv-then-transformer hybrid, with downsampling as the cost-control step.
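In code, the claim language maps onto something quite small. The sketch below is one reading, not the claimed method itself: it treats a “metric” as a plain average and a “relation value” as a softmax-normalized scaled dot product, which is how attention weights are usually computed. The names `region_metric` and `relation_values` and the toy 4×8 image are hypothetical.

```python
import numpy as np

def region_metric(subregion):
    """Average the values in a subregion: (h, w, C) -> (C,)."""
    return subregion.reshape(-1, subregion.shape[-1]).mean(axis=0)

def relation_values(queries, keys):
    """Softmax-normalized scaled dot products: (n, C), (m, C) -> (n, m)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image = rng.normal(size=(4, 8, 3))           # toy image: 4 x 8 pixels, 3 channels
first, second = image[:, :4], image[:, 4:]   # two 4 x 4 subregions

m1, m2 = region_metric(first), region_metric(second)

# Local: the first subregion's values against themselves plus their own metric.
first_tokens = first.reshape(-1, 3)
local_rel = relation_values(first_tokens, np.vstack([first_tokens, m1]))

# Global: the two subregion metrics against each other.
metrics = np.stack([m1, m2])
global_rel = relation_values(metrics, metrics)
```

The local table here is 16×17 and the global table is 2×2. The global comparison never sees an individual pixel, only the two averages.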

Under the hood, then, the global level operates on these averaged, downsampled summaries, so the expensive long-range comparisons happen on far fewer items: one carrier token per region rather than every element. The local level keeps the detail where it matters. A two-hidden-layer claim even describes stacking the idea: a first layer computes relation values among local values, the metric, and the neighbor's metric; a second layer computes relation values on top of those, building the hierarchy in depth as well as breadth. The CPC tags G06N 3/0455 and G06N 3/0464 place this in neural-network architecture territory.
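A back-of-the-envelope count shows how much the summaries save. The numbers below (a 56×56 feature map split into 7×7 windows, one carrier token per window) are illustrative assumptions, not figures from the filing, and the two helper functions are hypothetical.

```python
def full_attention_pairs(num_tokens):
    """Every token compared with every token."""
    return num_tokens ** 2

def hierarchical_pairs(num_tokens, window_tokens):
    """Local attention inside each window (tokens plus one carrier),
    plus global attention among the carriers."""
    num_windows = num_tokens // window_tokens
    local = num_windows * (window_tokens + 1) ** 2
    global_ = num_windows ** 2
    return local + global_

tokens = 56 * 56          # 3136 positions in the feature map
per_window = 7 * 7        # 49 positions per window, so 64 windows

full = full_attention_pairs(tokens)
hier = hierarchical_pairs(tokens, per_window)
print(full, hier, round(full / hier, 1))
```

Under these assumptions, all-pairs attention needs 9,834,496 comparisons while the two-level scheme needs 164,096, roughly 60 times fewer. The gap widens as the image grows, because the local term grows linearly with the number of windows and only the much smaller carrier set pays the quadratic price.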

Why a general reader should care: the length of context a model can handle, meaning how much it can read at once, is gated by exactly this attention cost. Every trick that tames the quadratic blowup is a trick that lets models handle longer documents, bigger codebases, more history, or in this image setting, higher-resolution pictures. NVIDIA's filing on it reflects that the company optimizes models, not just the chips they run on, and that the optimization here is concrete: replace all-pairs attention with carrier-token summaries plus a conv-transformer front end.

The conv-then-transformer detail in the dependent claims is more than an implementation note; it is the cost model made explicit. Several claims specify that the first relation value is computed “using at least one convolutional neural network and at least one transformer neural network,” and that the convolutional output is “downsampled” before it reaches the transformer. Convolutions are cheap and local, ideal for the fine-grained work inside a window; the transformer, which carries the expensive all-pairs comparisons, is then handed a smaller, downsampled set of features. By the time the quadratic-cost attention runs, it is running over carrier tokens, summaries, not raw pixels. The architecture front-loads convolution to shrink what attention must chew on.
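Here is a rough sketch of that front end, assuming a single-channel image, one 3×3 filter, and 2×2 average pooling as the downsampling step. None of those specifics come from the claims, which only say the data is convolved and then “downsampled”; `conv3x3_same` and `downsample` are hypothetical helper names.

```python
import numpy as np

def conv3x3_same(x, kernel):
    """Zero-padded 3x3 cross-correlation on a single-channel image (H, W)."""
    H, W = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def downsample(x, factor=2):
    """Average-pool over non-overlapping factor x factor blocks."""
    H, W = x.shape
    x = x[:H - H % factor, :W - W % factor]   # crop to a multiple of factor
    return x.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))

image = np.random.default_rng(0).normal(size=(32, 32))
features = conv3x3_same(image, np.full((3, 3), 1 / 9))   # cheap, local
reduced = downsample(features)                           # 32x32 -> 16x16
tokens = reduced.reshape(-1, 1)                          # what attention would see

pairs_before = image.size ** 2     # all-pairs cost on raw positions
pairs_after = tokens.shape[0] ** 2 # all-pairs cost after downsampling
```

Halving each spatial dimension cuts the token count by 4 and the all-pairs comparison count by 16, which is the whole point of running the convolution and downsampling first.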

The stacked, two-hidden-layer claim shows the hierarchy compounding in depth. A first hidden layer computes relation values among a region's local values, that region's metric, and a neighboring region's metric; a second hidden layer computes relation values on top of those. Each layer raises the level of abstraction, neighborhoods of neighborhoods, so a deep stack can relate very distant parts of an image through a short chain of summaries rather than a single enormous all-pairs comparison. For an image classifier that is elegant; for the long-context language problem it gestures at, it is the same trick that lets a model track a whole document through a hierarchy of compressed representations instead of attending to every token pair directly.
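The stacking can be sketched with two regions and two passes. As before, this is an interpretation under stated assumptions: relation values are modeled as softmax attention weights, each metric is a plain average recomputed between layers, and the names `layer` and `two_layer_stack` are invented for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer(local, own_metric, neighbor_metric):
    """Relate a region's local values to themselves, to the region's own
    metric, and to the neighboring region's metric."""
    keys = np.vstack([local, own_metric, neighbor_metric])    # (n + 2, C)
    weights = softmax(local @ keys.T / np.sqrt(local.shape[-1]))
    return weights @ keys                                     # (n, C)

def two_layer_stack(region_a, region_b):
    """region_a, region_b: (n, C) arrays of local values."""
    # First hidden layer: metrics are averages of the raw local values.
    ma, mb = region_a.mean(axis=0), region_b.mean(axis=0)
    a1, b1 = layer(region_a, ma, mb), layer(region_b, mb, ma)
    # Second hidden layer: built on the first layer's outputs, so its
    # metrics already carry information from the neighboring region.
    ma1, mb1 = a1.mean(axis=0), b1.mean(axis=0)
    return layer(a1, ma1, mb1), layer(b1, mb1, ma1)
```

Changing anything in `region_b` changes the output for `region_a`, even though no value in one region is ever compared directly with a value in the other. With more regions and more layers, the same chain of summaries is what lets distant parts of the image influence each other.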

The breadth of the independent claims is also worth reading as strategy, not just engineering. Claim 1 is written at the level of “a processor, comprising one or more circuits” that process a feature map using values derived from subregions, language broad enough to cover the mechanism wherever it runs, while the method and system claims restate the same carrier-token/region-metric idea for image classification, detection, and segmentation. That spread, hardware claim plus method claim plus system claim, all circling the same averaged-summary-plus-local-detail structure, is how a hardware company protects an architectural idea across the stack it sells: the silicon, the model, and the deployed system. For a reader tracking who owns what in the long-context race, the filing is a marker that NVIDIA was claiming not just faster chips but the attention structure that makes long inputs affordable on them.

House caveat: a publication describes an architecture, not a shipped product. Hierarchical schemes also trade some precision for scalability: the averaged carrier-token summary can miss something the full attention would catch. As a dated marker, though, it's clean: by mid-2024, making attention scale via hierarchy, with named mechanisms (carrier tokens, region-average metrics, downsampled conv-to-transformer hops), was core NVIDIA IP squarely on the long-context problem.

