How Lossless Tiling Works on Reconfigurable AI Chips

A published application describes how to carve a convolution's input tensor into memory-sized tiles and rebuild the result exactly, by computing the overlap each tile must carry from its neighbors. It sits inside a wider cluster of reconfigurable-dataflow filings published the same week.

Anyone who has run a large convolutional layer on an accelerator has hit the same wall: the input tensor is bigger than the fast on-chip memory you want to keep it in. The standard answer is to cut the tensor into tiles, process each one, and stitch the outputs back together. The standard problem is that a convolution does not respect tile boundaries. A kernel sitting near the edge of a tile needs to read pixels that fall into the next tile over. Split the tensor naively and the values along every seam come out wrong.
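To make the seam problem concrete, here is a minimal sketch in plain Python. It is an illustration only and is not taken from the application: the helper name conv1d_same, the 1D input, and the all-ones kernel are assumptions chosen to keep the arithmetic easy to follow. It runs a zero-padded convolution once over the whole input and once over two disjoint halves, then lists the output positions that disagree.

```python
def conv1d_same(x, w):
    """Stride-1 1D convolution with zero padding so the output has len(x) values.

    Hypothetical helper for illustration; assumes an odd kernel length.
    """
    k = len(w)
    p = k // 2
    padded = [0] * p + list(x) + [0] * p
    return [sum(padded[i + j] * w[j] for j in range(k)) for i in range(len(x))]


x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 1, 1]

# Reference: convolve the whole input at once.
full = conv1d_same(x, w)

# Naive tiling: two disjoint halves, each padded with zeros on its own.
naive = conv1d_same(x[:4], w) + conv1d_same(x[4:], w)

# Output positions whose value differs from the untiled result.
mismatched = [i for i, (a, b) in enumerate(zip(full, naive)) if a != b]
print(mismatched)  # [3, 4]
```

Only the two outputs on either side of the cut are wrong. Each of them needed one input value that lived in the other tile and got a zero instead.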

A patent application published on June 18, 2026 and assigned to SambaNova Systems, Inc. is directed at exactly this seam problem. Titled Tile Generation and Overlap Calculation for Lossless Tiling in Convolution Networks, the application describes runtime logic that takes an input tensor plus the convolution's parameters — kernel size, stride, and padding — and works out how the tensor can be divided so that the tiled computation reproduces the result of the untiled one. The operative word in the title is lossless: the goal is not an approximation that is close enough, but a tiling that is mathematically equivalent to never having tiled at all.

The mechanism: compute the overlap, then route it

Per the disclosure, the overlap is something you calculate rather than guess. The runtime logic first determines target tile dimensions from the input tensor's size and the available memory. It then calculates an overlap size between adjacent tiles directly from the kernel size, stride, and padding, the same three numbers that determine how far a kernel reaches beyond any given output position. That is the key move: the amount a tile has to borrow from its neighbor is not arbitrary. It is a deterministic function of the convolution's own geometry, so it can be derived up front instead of discovered at the boundary.
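The application's own formulas are not reproduced here, but the standard convolution arithmetic shows why the overlap is deterministic. The sketch below is an assumption-laden illustration: the function names are hypothetical, it works in one dimension, and it assumes no dilation. It computes which input indices a block of outputs needs, and from that the number of input elements two adjacent tiles share.

```python
def conv_output_length(n, kernel, stride, padding):
    """Number of outputs for an input of length n (standard formula, no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


def input_span(out_start, out_stop, kernel, stride, padding):
    """Half-open range of input indices needed to produce outputs [out_start, out_stop).

    Indices below 0 or past the end of the input refer to zero padding.
    """
    start = out_start * stride - padding
    stop = (out_stop - 1) * stride - padding + kernel
    return start, stop


def overlap_size(kernel, stride):
    """Input elements shared by two adjacent tiles that split the outputs between them."""
    return max(kernel - stride, 0)


# Example: 16 inputs, kernel 5, stride 2, padding 2, outputs split 4 and 4.
n_out = conv_output_length(16, kernel=5, stride=2, padding=2)   # 8
left = input_span(0, 4, kernel=5, stride=2, padding=2)          # (-2, 9)
right = input_span(4, 8, kernel=5, stride=2, padding=2)         # (6, 17)
shared = left[1] - right[0]                                     # 3
print(n_out, left, right, shared, overlap_size(5, 2))
```

In this sketch padding shifts where each span starts and matters at the outer edges of the tensor, while the overlap between two interior tiles reduces to kernel minus stride.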

From there the method generates a tiling configuration that specifies the boundaries of each tile, and for every tile it identifies the neighboring tiles and the overlapping memory regions shared with them. The application describes calculating memory addresses for both the overlapping regions and the remaining non-overlapping regions of each tile. In plain terms: each tile is told precisely which bytes it shares with its neighbors and where those bytes live, so the halo of data a kernel needs at the edge is already accounted for in the address map rather than fetched ad hoc.
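A tiling configuration of this kind can be sketched as a small data structure. Everything named below (Tile, build_tiling, base_addr, elem_bytes) is hypothetical and is not drawn from the application. The sketch also assumes a 1D tensor stored contiguously, and tiles large enough that each one overlaps only its immediate neighbors. For each tile it records the input range it reads, the byte range it shares with each neighbor, and the byte range no neighbor touches.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    index: int
    out_range: tuple                             # half-open range of outputs produced
    in_range: tuple                              # half-open range of inputs read
    shared: dict = field(default_factory=dict)   # neighbor index -> (byte_start, byte_stop)
    private: tuple = (0, 0)                      # byte range no neighbor reads


def build_tiling(n, kernel, stride, padding, outputs_per_tile,
                 base_addr=0, elem_bytes=4):
    """Build tile boundaries, neighbor overlaps, and byte addresses (illustrative)."""
    def addr(i):
        return base_addr + i * elem_bytes

    n_out = (n + 2 * padding - kernel) // stride + 1
    tiles = []
    for index, o0 in enumerate(range(0, n_out, outputs_per_tile)):
        o1 = min(o0 + outputs_per_tile, n_out)
        # Clamp to the stored tensor; padding positions have no address.
        lo = max(o0 * stride - padding, 0)
        hi = min((o1 - 1) * stride - padding + kernel, n)
        tiles.append(Tile(index, (o0, o1), (lo, hi)))

    for tile in tiles:
        lo, hi = tile.in_range
        for nb in (tile.index - 1, tile.index + 1):
            if not 0 <= nb < len(tiles):
                continue
            nb_lo, nb_hi = tiles[nb].in_range
            start = max(tile.in_range[0], nb_lo)
            stop = min(tile.in_range[1], nb_hi)
            if start < stop:
                tile.shared[nb] = (addr(start), addr(stop))
                # Shrink the private region from whichever side the neighbor is on.
                if nb < tile.index:
                    lo = stop
                else:
                    hi = start
        tile.private = (addr(lo), addr(hi))
    return tiles


tiles = build_tiling(8, kernel=3, stride=1, padding=1,
                     outputs_per_tile=4, base_addr=0x1000)
for t in tiles:
    print(t.index, t.in_range, t.shared, t.private)
```

With these example numbers both tiles report the same shared byte range, which is the point of the exercise: the halo is a single named region in the address map, not two separate fetches.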

Two consequences fall out of that design, both described in the record. First, because the overlap is computed from the parameters, the reassembled output matches the non-tiled convolution — the seams disappear. Second, because the overlapping and non-overlapping regions are mapped explicitly, the system can process the tiles "while minimizing redundant memory operations," in the application's phrasing. Tiles that share a halo can reference the same memory rather than each pulling its own copy.
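Both consequences can be checked on a toy case. The sketch below is an illustration under stated assumptions, not the application's method: it is one-dimensional, uses zero padding, and the names conv1d and conv1d_tiled are hypothetical. Each tile gathers its inputs plus the computed halo, the tile outputs are concatenated, and the result is compared with the untiled convolution. It also counts how many input elements are read when every tile copies its own halo.

```python
def conv1d(x, w, stride=1, padding=0):
    """Reference 1D convolution with zero padding (no dilation)."""
    k = len(w)
    padded = [0] * padding + list(x) + [0] * padding
    n_out = (len(padded) - k) // stride + 1
    return [sum(padded[i * stride + j] * w[j] for j in range(k))
            for i in range(n_out)]


def conv1d_tiled(x, w, stride, padding, outputs_per_tile):
    """Tile the outputs, give each tile its halo, and count input elements read."""
    k, n = len(w), len(x)
    n_out = (n + 2 * padding - k) // stride + 1
    out, elements_read = [], 0
    for o0 in range(0, n_out, outputs_per_tile):
        o1 = min(o0 + outputs_per_tile, n_out)
        start = o0 * stride - padding
        stop = (o1 - 1) * stride - padding + k
        # Positions outside the tensor are zero padding and cost no read.
        tile = [x[i] if 0 <= i < n else 0 for i in range(start, stop)]
        elements_read += sum(1 for i in range(start, stop) if 0 <= i < n)
        out += conv1d(tile, w, stride=stride, padding=0)
    return out, elements_read


x = list(range(1, 17))
w = [1, 2, 3, 2, 1]

reference = conv1d(x, w, stride=2, padding=2)
tiled, reads = conv1d_tiled(x, w, stride=2, padding=2, outputs_per_tile=3)

print(tiled == reference)  # True
print(reads, len(x))       # 22 16
```

The tiled output matches the reference exactly. The read count shows the cost of copying halos: 22 element reads for 16 distinct elements, with the extra 6 coming from two seams of three shared elements each. Those are the reads an explicit map of shared regions gives a system the chance to avoid.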

"For each tile, the runtime logic identifies neighboring tiles, determines overlapping memory regions with the neighboring tiles based on the overlap size and tile boundaries, and calculates memory addresses for the overlapping regions and remaining non-overlapping regions. The tiling configuration enables efficient processing of convolution operations with overlapping tiles while minimizing redundant memory operations."
Source: Tile Generation and Overlap Calculation for Lossless Tiling in Convolution Networks, US20260170308A1

Where it sits in the field — and in the assignee's other filings

Tiling for convolution is not a new idea in the abstract; partitioning large tensors to fit a memory budget is a long-standing technique in deep-learning compilers and accelerator runtimes. What the application is directed to is the specific bookkeeping of overlap-aware, lossless tiling: deriving the halo from kernel, stride, and padding, and then expressing the shared regions as concrete memory addresses so adjacent tiles can co-operate without redundant fetches. The record classifies the filing under CPC G06N 3/0464 (convolutional neural-network architectures) alongside G06F 12/023 (memory allocation) and G06N 3/084 (backpropagation), which captures the dual nature of the disclosure: it is a neural-network method that is fundamentally about how memory is laid out.

The application reads more clearly when placed next to the rest of SambaNova's filings that surfaced the same week. The company's architecture is a coarse-grained reconfigurable dataflow design — arrays of compute and memory units configured per workload rather than a fixed instruction pipeline — and the tiling method is the kind of runtime concern that architecture creates. Published the same day is Optimized Data Routing in Reconfigurable Processors Using Primary and Secondary Port Differentiation (US20260169954A1), which describes prioritizing low-latency vector ports for compute clusters while using staging ports for data movement — another take on moving tensor data efficiently across the grid.

Two more from the same drop concern multi-tenancy: Resource Allocation for Virtual Functions in Multi-Die Reconfigurable Processors (US20260169713A1) and Resource Requirement Analysis and Configuration Selection for Virtual Functions in Reconfigurable Processors (US20260169712A1), both describing how arrays of reconfigurable units are allocated across dies to virtual functions according to compute and memory requirements. A fifth, Array of Compute Units That Are Reconfigurable for Separation into Mutually Exclusive Groups (US20260169540A1), describes partitioning the compute array into synchronized, mutually exclusive groups. Read together, the cluster describes the same chip from several angles: how to slice the data (tiling), how to move it (port differentiation), and how to slice the silicon itself among workloads (virtual functions and group separation). The thread connecting them is partitioning under constraint.

Slightly earlier filings extend that thread. Multiple Segments for a Memory Unit in a Reconfigurable Data Processor (US20260161444A1) describes a multi-segment datapath between memory and compute units, and Systems and Methods for Area Efficient Multi-Precision Dot Product Determination (US20260161357A1) describes multiplier circuitry that handles both BF16 and FP8 operands in the same unit.

None of these is a granted patent; each is a published application, disclosed but not yet examined through to issue. Taken as a set, they describe a coherent engineering posture: when the tensor, the memory, and the silicon are all finite, the recurring task is to divide each one cleanly and keep the arithmetic exact across the cuts.
