Gaze Heads: How a VLM Looks at What It Describes

Researchers found a small set of attention heads that track which image region a model is describing, and showed that nudging those heads forces the model to describe a different region entirely — no retraining required.

When you ask a modern AI to describe a photo, it produces a confident paragraph: a dog on a porch, a red door behind it, a bicycle leaning against the wall. What it almost never tells you is how it kept track of which object it was talking about as the sentence unfolded. A new study from Rohit Gandikota and David Bau, posted to arXiv on June 12, 2026, opens that black box and finds something strikingly concrete: vision-language models develop a small, identifiable set of attention heads whose job is to point at whatever the model is currently describing. The authors call them gaze heads, and the most surprising result is that you can grab the steering wheel.

The question nobody had cleanly answered

A vision-language model, or VLM, glues a vision encoder to a language model. The image gets chopped into hundreds of patch tokens, the language backbone reads those tokens alongside the text prompt, and out comes a description. The mechanics of attention — how each output word decides which input tokens to weight — are well studied in pure text models. But in a VLM the open question is whether, as the model writes the word "door," anything inside it is actually "looking at" the door, or whether the grounding is smeared diffusely across the whole network in a way no single component owns.

“How a vision-language model internally solves the task of describing an image is far from obvious.” (arXiv:2606.14703)

The authors' answer is that grounding is, in fact, localized. They built a clever testbed out of comic strips, where the narrative order of events is laid out spatially across panels. Because a comic forces a description to march through space in a known sequence, the researchers could check, panel by panel, whether any attention heads tracked the region being narrated. Using nothing more than a correlation score computed from a handful of forward passes — no training, no gradient descent — they isolated the heads whose attention reliably followed the described region. Those are the gaze heads.
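To make the idea of a training-free correlation score concrete, here is a minimal sketch. It is not the paper's exact scoring rule, which this article does not reproduce. It assumes a hypothetical input, attn_mass, holding the share of each head's attention that lands on each comic panel at each decoding step, and it scores a head by the Pearson correlation between that share and an indicator of which panel is being narrated. The function names and the toy numbers are illustrative only.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def gaze_scores(attn_mass, described):
    """Score each head by how well its attention follows the narrated panel.

    attn_mass[head][step][panel]: attention mass on each panel (assumed input).
    described[step]: index of the panel being narrated at that step.
    """
    scores = {}
    for head, per_step in attn_mass.items():
        xs, ys = [], []
        for step, masses in enumerate(per_step):
            for panel, mass in enumerate(masses):
                xs.append(mass)
                ys.append(1.0 if panel == described[step] else 0.0)
        scores[head] = pearson(xs, ys)
    return scores

def top_k_heads(scores, k):
    """Return the k heads with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: three heads, three decoding steps, three panels.
# Heads are labeled by hypothetical (layer, index) pairs.
attn_mass = {
    (3, 5): [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # follows the narration
    (0, 1): [[1 / 3, 1 / 3, 1 / 3]] * 3,                          # looks everywhere equally
    (7, 2): [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],  # looks at the wrong panel
}
described = [0, 1, 2]

scores = gaze_scores(attn_mass, described)
print(top_k_heads(scores, 1))
```

In this toy setup only the first head ranks as a gaze head; the uniform head scores zero and the head that looks at the wrong panel scores negative. Nothing is trained, which matches the spirit of the method as described: forward passes in, a ranking of heads out.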

From observation to a lever

Finding a correlation is interesting; the paper's force comes from what happens next. The authors don't just observe the gaze heads — they intervene on them. By applying a single attention-mask intervention to the top-100 gaze heads, which is fewer than 9% of all heads in the model, they could redirect the model's description to any chosen comic panel at 83.1% accuracy. The same intervention applied to a random set of heads fails to move the answer at all, and applying it to every head simply destroys generation. That contrast is the key evidence: the steering effect is specific to the heads identified as gaze heads, not a generic side effect of masking attention.
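A rough sketch of what an attention-mask intervention on selected heads can look like follows. The article does not spell out the paper's exact mask, so this assumes one common form: for the chosen heads only, the pre-softmax scores of image tokens outside the target region are set to negative infinity, so those tokens get zero weight. All names and the five-token layout are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, is_image, in_target):
    """Block image tokens outside the target region; leave text tokens alone."""
    masked = [
        -math.inf if (img and not tgt) else s
        for s, img, tgt in zip(scores, is_image, in_target)
    ]
    return softmax(masked)

def steer(head_scores, gaze_heads, is_image, in_target):
    """Apply the mask to the selected heads only; other heads are untouched."""
    return {
        head: masked_attention(s, is_image, in_target) if head in gaze_heads else softmax(s)
        for head, s in head_scores.items()
    }

# Toy layout: one text token, then two patches from panel 0 and two from panel 1.
is_image = [False, True, True, True, True]
in_target = [False, False, False, True, True]  # steer toward panel 1

head_scores = {
    (3, 5): [0.0, 2.0, 2.0, 0.5, 0.5],  # currently favors panel 0
    (0, 1): [0.0, 1.0, 1.0, 1.0, 1.0],
}
steered = steer(head_scores, gaze_heads={(3, 5)}, is_image=is_image, in_target=in_target)
print(steered[(3, 5)])
```

After steering, the selected head places no weight on panel 0 and redistributes it over the text token and panel 1, while the unselected head is unchanged. The random-heads and all-heads controls described above amount to running the same function with a different gaze_heads set.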

The control also works partway through a description, not just at the start. The authors show that switching the gaze target in the middle of generation makes the model wrap up its description of the current panel and move to the new one within a few tokens, the textual equivalent of a person being told to look somewhere else and finishing their sentence before turning their head. This amounts to ongoing control over the model's focus at inference time, achieved by editing attention rather than weights.
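One way to picture a mid-generation switch is as a schedule that says which panel the gaze heads may see at each decoding step. The sketch below is an illustration under that assumption, not the authors' code: schedule, patch_panel, and both function names are hypothetical, and the resulting per-step mask would be fed to whatever mechanism restricts the gaze heads' attention.

```python
from bisect import bisect_right

def target_at(step, schedule):
    """Return the panel in force at a decoding step.

    schedule: list of (start_step, panel) pairs sorted by start_step.
    """
    starts = [start for start, _ in schedule]
    i = bisect_right(starts, step) - 1
    if i < 0:
        raise ValueError("schedule does not cover this step")
    return schedule[i][1]

def mask_for_step(step, schedule, patch_panel):
    """True for image patches the gaze heads may attend to at this step.

    patch_panel[i]: index of the panel that image patch i belongs to.
    """
    target = target_at(step, schedule)
    return [panel == target for panel in patch_panel]

# Toy example: describe panel 0 first, then switch to panel 2 at step 12.
schedule = [(0, 0), (12, 2)]
patch_panel = [0, 0, 1, 1, 2, 2]

print(mask_for_step(11, schedule, patch_panel))
print(mask_for_step(12, schedule, patch_panel))
```

The mask flips between step 11 and step 12. The model's reported behavior, finishing the current panel within a few tokens before moving on, is a property of the model and not of this schedule.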

Why comics, and does it generalize?

One reasonable worry is that comics are a toy domain with conveniently gridded panels, and that the whole effect is an artifact of that structure. The paper addresses this head-on. The same intervention, the authors report, redirects answers to chosen regions in natural images drawn from COCO, the standard photo dataset, not just in cartoons. So the gaze-head mechanism is not a quirk of paneled layouts; it operates on ordinary photographs too.
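For natural images the target region is a box and not a panel, so it has to be translated into image patches before any mask can be built. COCO annotations store boxes as [x, y, width, height] in pixels. The sketch below maps such a box onto a patch grid under a simplifying assumption that the image is split into a uniform grid_w by grid_h grid; real VLM preprocessing (resizing, tiling, token merging) varies by model and is not described here. The function name and parameters are hypothetical.

```python
import math

def patches_in_bbox(bbox, image_w, image_h, grid_w, grid_h):
    """Return the (row, col) patches that overlap a COCO-style box [x, y, w, h]."""
    x, y, w, h = bbox
    cell_w = image_w / grid_w
    cell_h = image_h / grid_h
    # First patch touched by the box, clamped to the grid.
    col_start = max(0, int(x // cell_w))
    row_start = max(0, int(y // cell_h))
    # One past the last patch touched by the box, clamped to the grid.
    col_end = min(grid_w, math.ceil((x + w) / cell_w))
    row_end = min(grid_h, math.ceil((y + h) / cell_h))
    return {
        (row, col)
        for row in range(row_start, row_end)
        for col in range(col_start, col_end)
    }

# Toy example: a 640x480 image on a 4x4 grid, so each patch covers 160x120 pixels.
aligned = patches_in_bbox([160, 120, 160, 120], 640, 480, 4, 4)
straddling = patches_in_bbox([100, 50, 200, 100], 640, 480, 4, 4)
print(sorted(aligned))
print(sorted(straddling))
```

A box aligned with the grid selects a single patch, while a box that straddles patch boundaries selects every patch it touches. Including partially covered patches is a design choice; a stricter rule could require a minimum overlap.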

The mechanism is also not confined to one model. The authors find that gaze heads recur across model sizes from 2 billion to 32 billion parameters, and across multiple VLM architectures. That breadth matters for the central claim: a phenomenon that shows up only in a single checkpoint might be an accident of that training run, whereas one that shows up across a 16x range of scale and across architectures suggests gaze heads are something VLMs tend to grow as a natural solution to the describe-an-image task. The authors do add a caveat: some families that use a frozen vision encoder show no comparable head set. In other words, the way the vision and language halves are wired together appears to shape whether this clean, steerable mechanism emerges at all, a useful clue for anyone trying to understand which design choices produce interpretable internals.

What this is good for

The practical pitch is interpretability you can act on. Most mechanistic-interpretability work produces a satisfying explanation and stops there. This paper argues that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal behavior without any retraining. If you can reliably point a model's attention at a region you care about, you get a cheap form of control: forcing a captioning system to attend to a specified bounding box, debugging why a model described the wrong object, or auditing whether a model is genuinely grounding its words in the image versus confabulating from language priors.

There is a flip side worth naming plainly. A lever that steers a model's described region with one attention mask is also, in principle, a way to make a model misreport what is in an image — to describe a region that suits an attacker rather than the one a user asked about. The paper frames the work as understanding and control, and the honest reading is that the same mechanism cuts both ways, which is precisely why having it mapped out in the open is valuable.

The bigger picture

Strip the framing and the contribution is this: a confusing capability — visual grounding in a multimodal model — turns out to route through a small, findable, editable set of components, and the finding holds across scale and architecture. That is the kind of result that makes a black box feel a little less black. It does not claim that gaze heads are the only place grounding lives, and it is careful about the frozen-encoder exception. But it converts a vague intuition that "the model attends to what it describes" into a measured, manipulable mechanism. For a field that often has to choose between explanations it cannot test and controls it cannot explain, getting both at once — from a few forward passes and an attention mask — is the part worth paying attention to.
