Dataflow AI Accelerators Explained | NeuralDocket

A patent application published in September 2021 tackles performance scaling on dataflow deep-learning accelerators, the architecture under a lot of AI silicon.

The shape of a neural network and the shape of the chip that runs it are the same story. A dataflow accelerator tries to make the silicon mirror the computation: weights stream one way, activations another, and the multiply-accumulate operations happen in a grid built for exactly that traffic.

US20210271960A1, published September 2, 2021 and classified under G06N 3/063 (the class for neural-network hardware implementation), addresses how to scale performance across such a dataflow architecture. The question it tackles is the one every accelerator team faces: as you add more compute units, how do you keep them fed without the data movement becoming the bottleneck?

“Embodiments of the present disclosure are directed toward techniques and configurations enhancing the performance of hardware (HW) accelerators.” (U.S. Patent Application 2021/0271960 A1)

The disclosure splits into two named scaling arrangements, and they are worth separating. The first is static MAC scaling: architectures for raising performance-per-watt and performance-per-area. Claim 1 describes the array as an integrated circuit with a grid of processing elements (PEs). Each PE holds several multiply-and-accumulate units (MACs) and a register file (RF) that is split into multiple instances, one RF instance per MAC, each with its own read and write ports. That splitting is the static trick: by giving every MAC its own private, ported slice of register file, the design lets all the MACs in a PE operate independently and simultaneously without fighting over a single memory's ports. Dependent claims add column buffers sized to the number of RF instances and a “time-space multiplexing scheme” to deliver exactly that many data units down each load path, so the feeding of the array is matched to its width.
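To see why per-MAC register-file instances matter, here is a toy cycle count in Python. It is only a sketch under stated assumptions, not the patent's design: it assumes each MAC operation needs one register-file read port per cycle, and the function name `cycles_for_pe` and all parameter values are hypothetical.

```python
import math


def cycles_for_pe(num_macs: int, ops_per_mac: int,
                  rf_instances: int, ports_per_instance: int = 1) -> int:
    """Toy model: cycles for one PE to finish all of its MAC operations.

    Assumption (not from the filing): every MAC operation needs one
    register-file read port for one cycle, so the number of MACs that
    can issue per cycle is capped by the total number of read ports.
    """
    total_ports = rf_instances * ports_per_instance
    macs_served_per_cycle = min(num_macs, total_ports)
    total_ops = num_macs * ops_per_mac
    return math.ceil(total_ops / macs_served_per_cycle)


# One shared register file: the four MACs take turns on a single port.
shared_rf_cycles = cycles_for_pe(num_macs=4, ops_per_mac=16, rf_instances=1)

# One RF instance per MAC: all four MACs issue every cycle.
split_rf_cycles = cycles_for_pe(num_macs=4, ops_per_mac=16, rf_instances=4)

print(shared_rf_cycles, split_rf_cycles)
```

In this model the shared register file serializes the MACs, while one instance per MAC lets the cycle count fall in proportion to the number of MACs. Real hardware has more ports, banking, and pipelining than the model captures.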

The second is dynamic MAC scaling, and it is where the “dataflow” insight bites. The patent describes a “data sparsity level estimator” (DSLE) that activates or deactivates MACs based on how sparse the activations and weights are. Neural-network tensors are full of zeros; a zero times anything is zero, so computing it wastes energy. A claimed method receives sparsity information for input activations and weights, computes an “average combined sparsity,” and switches MACs on or off accordingly, activating them when combined sparsity is below a threshold and deactivating when it rises above. Further claims close a feedback loop: the estimator compares the expected number of cycles (from the combined density) against the actual cycles taken, and nudges the sparsity threshold up or down to track reality. The chip, in effect, tunes how many of its multipliers are running to match the emptiness of the data flowing through it. Zero-value compression on the inputs is claimed too.
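The gating decision can be sketched in a few lines of Python. The article does not give the filing's formula for “average combined sparsity,” so this sketch assumes one: zeros in activations and weights are treated as independent, a product counts as useful only when both operands are nonzero, and the per-tile values are averaged. All function names and numbers are hypothetical.

```python
def combined_sparsity(act_sparsity: float, wt_sparsity: float) -> float:
    """Fraction of multiplies with at least one zero operand.

    Assumption: zeros in activations and weights are independent, so the
    combined density is the product of the two individual densities.
    """
    combined_density = (1.0 - act_sparsity) * (1.0 - wt_sparsity)
    return 1.0 - combined_density


def average_combined_sparsity(tiles: list[tuple[float, float]]) -> float:
    """Mean combined sparsity over (activation, weight) sparsity pairs."""
    values = [combined_sparsity(a, w) for a, w in tiles]
    return sum(values) / len(values)


def macs_should_be_active(avg_sparsity: float, threshold: float) -> bool:
    """Gate as the text describes: active below the threshold, off above it."""
    return avg_sparsity < threshold


# Two hypothetical tiles: one half-sparse in both operands, one fully dense.
avg = average_combined_sparsity([(0.5, 0.5), (0.0, 0.0)])
print(avg, macs_should_be_active(avg, threshold=0.5))
```

The independence assumption is the weakest part of this sketch: real tensors often have correlated zero patterns, which is one reason a fixed threshold may not track actual cycle counts.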

Connect the dots to the sector. The reason 'data center' lines dominate NVIDIA's and AMD's filings is that the hard problem in AI is no longer just the math; it is moving the numbers and not wasting energy multiplying zeros. A dataflow design wins when it minimizes how far data travels and how much dead work it does, because moving bits and toggling idle multipliers can cost more energy than the useful compute. Performance scaling on these chips is mostly a memory-and-interconnect-and-sparsity story wearing a compute costume, and this filing puts all three levers (split register files, time-space multiplexed feeding, and sparsity-gated MAC activation) in one document.

Follow the IP and you see the whole industry converging on this insight around 2021: GPUs, custom ASICs, and startup accelerators all chasing higher utilization of their compute through smarter dataflow and aggressive sparsity exploitation. The publication is one data point in a dense cluster of hardware-architecture filings from that window.

The closed-loop threshold tuning is the most interesting disclosed mechanism, because it makes the chip self-correcting. The data sparsity level estimator does not just switch MACs on and off against a fixed cutoff; a chain of dependent claims has it measure the actual number of cycles a computation took, compute the expected number from the average combined density and the operation count, and compare the two. If the expected cycles fall short of the actual, the sparsity threshold is raised; if expected meets or exceeds actual, the threshold is lowered or held. The estimator even emits a “threshold adjustment feedback signal” carrying the computed amount to an operator that applies it. The accelerator is continuously calibrating how aggressively to gate its multipliers against how its predictions are panning out on the data actually flowing through.
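The feedback loop described above can be sketched as follows. This is an illustration under assumptions, not the claimed circuit: the expected-cycle formula (operation count times combined density, divided across the active MACs), the fixed step size, and the choice to lower the threshold when expected exceeds actual and hold it when they are equal are all hypothetical.

```python
def expected_cycles(op_count: int, avg_combined_density: float,
                    active_macs: int) -> float:
    """Assumed estimate: only nonzero products cost a cycle, shared across MACs."""
    return op_count * avg_combined_density / active_macs


def threshold_adjustment(expected: float, actual: float,
                         step: float = 0.05) -> float:
    """Signed amount a 'threshold adjustment feedback signal' might carry.

    Follows the direction given in the text: raise the threshold when the
    expected cycles fall short of the actual cycles. Lowering when expected
    exceeds actual, and holding on a tie, is this sketch's assumption.
    """
    if expected < actual:
        return step
    if expected > actual:
        return -step
    return 0.0


def apply_adjustment(threshold: float, adjustment: float) -> float:
    """Operator that applies the adjustment, clamped to the range [0, 1]."""
    return min(1.0, max(0.0, threshold + adjustment))


# One hypothetical iteration: the estimate was optimistic, so the threshold rises.
threshold = 0.5
estimate = expected_cycles(op_count=1000, avg_combined_density=0.25, active_macs=4)
threshold = apply_adjustment(threshold, threshold_adjustment(estimate, actual=80))
print(estimate, threshold)
```

A fixed step keeps the sketch short. A real controller would more plausibly scale the adjustment with the size of the prediction error and damp it to avoid oscillation.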

The static side is just as concrete about feeding the array. Splitting each processing element's register file into per-MAC instances, each with its own read and write ports, removes the port contention that would otherwise serialize the MACs; the column buffers are sized to match the number of instances; and a time-space multiplexing scheme delivers exactly that many data units down each load path per cycle. Add zero-value compression on the inputs, and the picture is coherent: a grid that can run all its multipliers in parallel, fed by a load network sized to keep them busy, while a feedback-tuned sparsity gate switches off the ones that would only be multiplying zeros. Static scaling widens the pipe; dynamic scaling stops wasting what flows through it.
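Zero-value compression is easiest to picture as a bitmap plus a packed list of nonzero values. The Python sketch below uses that common encoding as an assumption; the article does not say which format the filing uses, and the function names are hypothetical.

```python
def zvc_compress(values: list[int]) -> tuple[list[int], list[int]]:
    """Split a vector into a presence bitmap and its packed nonzero values."""
    bitmap = [1 if v != 0 else 0 for v in values]
    nonzero = [v for v in values if v != 0]
    return bitmap, nonzero


def zvc_decompress(bitmap: list[int], nonzero: list[int]) -> list[int]:
    """Rebuild the original vector by re-inserting zeros where the bitmap is 0."""
    packed = iter(nonzero)
    return [next(packed) if bit else 0 for bit in bitmap]


def sparsity_from_bitmap(bitmap: list[int]) -> float:
    """Fraction of zero entries, readable from the bitmap alone."""
    return 1.0 - sum(bitmap) / len(bitmap)


activations = [3, 0, 0, 7]
bitmap, packed = zvc_compress(activations)
assert zvc_decompress(bitmap, packed) == activations
print(bitmap, packed, sparsity_from_bitmap(bitmap))
```

One appeal of a bitmap encoding is that the sparsity level falls out of the bitmap without touching the data. That is a plausible source for the sparsity information a gating estimator consumes, though the connection is an assumption of this sketch.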

The caveat: a publication describes an architecture, not a shipping chip, and 'performance scaling' claims live or die on the specific workload. The sparsity gating only pays off if the model's tensors are actually sparse, and a closed-loop threshold can chase its own tail on irregular data. But the filing is a clean marker that by late 2021, dataflow scaling (splitting register files and dynamically matching active multipliers to data sparsity), not raw transistor count, was understood to be the lever for AI accelerator performance.

