The way this actually works is hiding in a sentence from Alphabet’s FY2024 10-K: in 2024 the company “launched Gemini 2.0,” which “can generalize and seamlessly understand, operate across, and combine different types” of information. That phrase, “combine different types,” is the entire definition of a word you keep hearing: multimodal.

The earliest large models were unimodal: each handled one kind of input. A language model read and wrote text; an image model handled pictures. Multimodal means one model takes several input types (text, images, audio, video) and reasons across them together, so it can, say, look at a photo and answer a question about it in words.

The mechanism that makes this possible is representation. Each input type gets converted into the same internal numerical form, so once a picture and a sentence are both turned into the model’s shared “language,” the model can relate them. Set the name aside for a second: think of it as one brain that learned to read several kinds of input in a common code.
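
To make the "common code" idea concrete, here is a toy sketch in Python. It is an illustration under loud assumptions, not a description of how Gemini or any real model is built: the encoders are random matrices standing in for learned ones, the feature lists are made-up stand-ins for a sentence and a picture, and every name (`SHARED_DIM`, `make_projection`, `encode`) is hypothetical. The only point is that two inputs of different shapes end up as vectors of the same length, which is what lets them be compared at all.

```python
import math
import random

SHARED_DIM = 4  # hypothetical size of the shared representation


def make_projection(in_dim, out_dim, rng):
    """Random matrix standing in for a learned, modality-specific encoder."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def encode(features, projection):
    """Map raw features of any length to a vector with one entry per matrix row."""
    return [sum(w * x for w, x in zip(row, features)) for row in projection]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


rng = random.Random(0)  # fixed seed so the toy run is repeatable

# One encoder per input type; the input sizes differ, the output size does not.
text_encoder = make_projection(in_dim=6, out_dim=SHARED_DIM, rng=rng)
image_encoder = make_projection(in_dim=10, out_dim=SHARED_DIM, rng=rng)

text_features = [0.2, 0.0, 1.0, 0.5, 0.0, 0.3]  # toy stand-in for a sentence
image_features = [0.1 * i for i in range(10)]   # toy stand-in for pixel data

text_vec = encode(text_features, text_encoder)
image_vec = encode(image_features, image_encoder)

# Both vectors now live in the same space, so they can be compared directly.
similarity = cosine(text_vec, image_vec)
print(len(text_vec), len(image_vec), round(similarity, 3))
```

Because the weights here are random, the similarity number itself means nothing. In a trained model the encoders would be learned so that a photo and a sentence describing it land near each other in the shared space.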

Why is this in a 10-K? Because Alphabet is telling investors that its flagship model is built around this capability rather than having it bolted on. When the annual report leads with “understand, operate across, and combine different types,” the company is staking its model strategy on multimodality being the direction the field is heading.

As a plain-language marker for early 2025: “multimodal” is not jargon for jargon’s sake. It describes a model that stopped being a specialist and became a generalist across input types, and a company’s own filing is a clean place to see the claim stated plainly. Filing data and the evidence index via EdgarBeast.