Ask a person what happened in a video and they do something effortless: they fuse what they saw with what they heard. The slamming door, the off-screen voice, the music swelling under a scene — sound and image are bound together, and the meaning lives in the binding. Teaching machines to do the same has been bottlenecked not by model size but by data, and a new paper by Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, and Caifeng Shan, posted to arXiv on June 12, 2026, goes after the data problem directly. Their contribution, OmniVideo-100K, is built to teach models the very thing the standard pipeline accidentally destroys: the connection between a sound and its source.

How the usual pipeline breaks the link

To see why this is needed, look at how audio-visual question-answering data is normally manufactured. The dominant recipe is a "video-caption-QA" paradigm: chop the video into short clips, generate captions, and synthesize questions from them. The authors point to two specific failures baked into that approach. First, the methods typically describe audio and visuals as separate modalities — one caption for what you see, another for what you hear. That decoupling severs the inherent association between a sound and the visual source that produced it. The model learns that a bark exists and that a dog exists, but not that the dog is barking.

Second, processing each short clip independently breeds inconsistency. The same person or object, described in isolation in each segment, ends up with different labels from one clip to the next, so the model never gets a stable, continuous account of an entity moving through a video. And because long-text comprehension and question synthesis are crammed into a single step, the generated questions tend to stay local, pinned to one moment, and lack both the long-term temporal connections and the deeper cross-modal reasoning that make a video question genuinely hard. The result is training data that quietly teaches models to be shallow.

Two mechanisms that rebuild the connection

OmniVideo-100K is produced by an automated data engine with two mechanisms aimed squarely at those failures. The first is Entity-Anchored Video Scripting. Instead of disconnected clip captions, the engine transforms each video into a structured script comprising a summary, a list of the main entities, and segment-wise descriptions that fuse audio and visual together. The entity list is the clever part: it serves as a global prior, a shared roster that every segment refers back to. That roster does two jobs at once — it enforces cross-segment referential consistency, so the same character is the same character throughout, and it reconstructs the audio-visual associations the old pipeline had severed, tying each sound back to the entity that made it.
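To make the script structure concrete, here is a minimal sketch of how a summary, an entity roster, and fused segment descriptions might fit together. Every class name, field name, and ID format below is a hypothetical chosen for illustration, and the sample video is invented; the paper's actual schema may look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    entity_id: str    # hypothetical stable ID shared by all segments, e.g. "E1"
    description: str  # short description of the entity


@dataclass
class Segment:
    start_s: float
    end_s: float
    description: str  # one fused audio-visual description, not two captions
    entity_ids: list[str] = field(default_factory=list)
    # (sound, entity_id) pairs that tie each sound to the entity that made it
    sound_sources: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class VideoScript:
    summary: str
    entities: list[Entity]   # the global roster every segment refers back to
    segments: list[Segment]

    def unknown_references(self) -> list[tuple[int, str]]:
        """Return (segment index, entity ID) for each reference missing from the roster."""
        roster = {e.entity_id for e in self.entities}
        missing = []
        for i, seg in enumerate(self.segments):
            refs = list(seg.entity_ids) + [eid for _, eid in seg.sound_sources]
            missing.extend((i, eid) for eid in refs if eid not in roster)
        return missing


# Invented example video, for illustration only.
script = VideoScript(
    summary="A dog waits at a door until its owner comes home.",
    entities=[Entity("E1", "brown dog"), Entity("E2", "woman in a grey coat")],
    segments=[
        Segment(0.0, 8.0, "E1 barks at the closed door.", ["E1"], [("bark", "E1")]),
        Segment(8.0, 15.0, "E2 opens the door and greets E1.", ["E1", "E2"],
                [("door creak", "E2"), ("greeting", "E2")]),
    ],
)
```

In this sketch, an empty result from script.unknown_references() means every segment names only entities from the shared roster, which is one simple way to check the cross-segment consistency described above. The sound_sources pairs play the role of the rebuilt sound-to-source link.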

The second mechanism is Clue-Guided QA Generation, and it inverts the usual order of operations. Rather than asking a model to read a script and immediately spit out questions, the engine first prompts the model to mine the script for high-value clues — connections that span multiple segments and multiple modalities. Only then, with those cross-segment multimodal clues in hand, does it generate the actual question-answer pairs. Because the questions are built on clues that already require linking distant moments and combining sound with sight, the resulting Q&A pushes toward real reasoning rather than single-moment lookup. This is the "evidence chains" idea in the dataset's framing: a question is grounded in a traceable chain of multimodal evidence, not a single caption.
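The two-stage order can be sketched as a small pipeline: mine clues first, then write one question-answer pair per clue. This is an illustration under stated assumptions, not the authors' implementation. The Clue type, the is_high_value filter, and the two stub functions standing in for model calls are all hypothetical, and in the paper the model itself is prompted to pick out high-value clues rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Clue:
    text: str
    segment_ids: list[int]  # segments the clue draws on
    modalities: set[str]    # e.g. {"audio", "visual"}


def is_high_value(clue: Clue) -> bool:
    # Assumed rule: keep clues that span several segments and several modalities.
    return len(set(clue.segment_ids)) >= 2 and len(clue.modalities) >= 2


def generate_qa(
    script_text: str,
    mine_clues: Callable[[str], list[Clue]],
    write_qa: Callable[[str, Clue], tuple[str, str]],
) -> list[dict]:
    # Stage 1: mine the script for clues and keep only the high-value ones.
    clues = [c for c in mine_clues(script_text) if is_high_value(c)]
    # Stage 2: write one QA pair per clue, keeping the clue as its evidence chain.
    return [{"clue": c, "qa": write_qa(script_text, c)} for c in clues]


# Stubs standing in for model calls (hypothetical, for illustration only).
def fake_mine_clues(script_text: str) -> list[Clue]:
    return [
        Clue("The dog barks until the woman enters, then stops.", [0, 1],
             {"audio", "visual"}),
        Clue("The door is closed.", [0], {"visual"}),
    ]


def fake_write_qa(script_text: str, clue: Clue) -> tuple[str, str]:
    return (f"Question grounded in: {clue.text}", "placeholder answer")


pairs = generate_qa("script text goes here", fake_mine_clues, fake_write_qa)
```

With these stubs, only the first clue survives the filter, since the second touches one segment and one modality. Keeping the clue next to each pair is what makes the evidence traceable in this sketch.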

Does it actually help?

Using this engine, the authors construct the instruction-tuning dataset OmniVideo-100K and pair it with a human-verified test set, OmniVideo-Test, so evaluation is not left to automated scoring alone. The training results are where the approach proves its worth. Fine-tuning three different models — VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B — on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test. That the gains hold across models of different sizes and lineages is a good sign that the dataset, not a quirk of one architecture, is doing the work.

Crucially, the benefit is not confined to the authors' own test. They report strong generalization, with improvements of up to 12.64% carrying over to established external benchmarks like Daily-Omni and JointAVBench. That cross-benchmark transfer is the result that matters most: a dataset can always be tuned to win its own test, but lifting performance on independent benchmarks is what distinguishes genuinely better training data from a benchmark-specific shortcut.

Why this is the kind of work that compounds

It is easy to overlook dataset papers in a field obsessed with model releases, but the omni-modal frontier — models meant to take in video, audio, and text together — is fundamentally data-starved in exactly the dimension OmniVideo-100K targets. Most available data teaches modalities in parallel; very little teaches the relationships between them across time. By engineering consistency and cross-modal clue mining directly into the generation pipeline, the authors are not just adding examples; they are encoding a better notion of what an audio-visual question should demand. Datasets like this tend to compound, because every model trained downstream inherits the reasoning structure baked into the data.

A few honest qualifiers keep expectations calibrated. The data engine is itself model-driven: scripts and clues are generated by models, so the human-verified OmniVideo-Test is doing load-bearing work as a check on quality, and any systematic blind spots in the generators could propagate. The headline 20.59% and 12.64% figures are upper-bound "up to" numbers across configurations rather than uniform gains everywhere. And cross-modal grounding remains genuinely hard; better data narrows the gap rather than closing it. Still, the core move is the right one. The old pipeline taught models that sound and sight are two separate captions; OmniVideo-100K teaches them that they are one story, and the benchmarks suggest the models are listening.