Forget the name for a second. Old systems bolted a vision model and a language model together with tape. A unified transformer instead ingests image features and text tokens into one shared model, so the same network reasons across both. Salesforce's grant US11562147B2 (issued January 24, 2023) describes exactly this, built on BERT.
Under the hood, the trick is a shared representation. Image regions get encoded into vectors that live in the same space as word vectors, and the transformer's attention mechanism lets text attend to image parts and vice versa. Ask "what color is the car?" and the model's attention can actually point at the car. The CPC tags G06F 40/35 (dialogue) and G06N 3/08 (neural learning) capture the dual job.
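To make the shared-representation idea concrete, here is a minimal sketch in NumPy, not the patent's actual method. It assumes image regions arrive as feature vectors, projects them to the text embedding width, concatenates them with the text tokens, and runs one single-head self-attention pass over the combined sequence. Every name, dimension, and weight matrix here is hypothetical and randomly initialized; a real model would learn the weights and stack many such layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(image_regions, text_tokens, w_img, w_q, w_k, w_v):
    """One single-head self-attention pass over image and text together.

    image_regions: (n_regions, d_img) region features (assumed input format)
    text_tokens:   (n_tokens, d_model) token embeddings
    Returns the attended sequence and the full attention weight matrix.
    """
    # Project image features into the same width as the text embeddings.
    img = image_regions @ w_img                        # (n_regions, d_model)
    # One shared sequence: image regions first, then text tokens.
    seq = np.concatenate([img, text_tokens], axis=0)   # (n_regions + n_tokens, d_model)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # scaled dot-product
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_regions, n_tokens, d_img, d_model = 4, 6, 32, 16
image_regions = rng.normal(size=(n_regions, d_img))
text_tokens = rng.normal(size=(n_tokens, d_model))
w_img = rng.normal(size=(d_img, d_model)) / np.sqrt(d_img)
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

output, weights = joint_self_attention(
    image_regions, text_tokens, w_img, w_q, w_k, w_v
)

# Rows are text tokens, columns are image regions: how much each word
# attends to each region. The transposed slice is image-to-text attention.
text_to_image = weights[n_regions:, :n_regions]
image_to_text = weights[:n_regions, n_regions:]
```

Because both modalities sit in one sequence, the same attention matrix holds text-to-text, image-to-image, and cross-modal weights. In a trained model, the `text_to_image` slice is where you would look to see whether the word "car" puts its weight on the region containing the car.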
This is the conceptual ancestor of the multimodal assistants everyone uses now. The leap from a model that reads to a model that reads, sees, and talks about both is exactly what a unified transformer enables. Salesforce filed on a concrete, enterprise-flavored version (vision-grounded dialogue) before the consumer multimodal wave crested.
Connect it to the sector narrative: the value of unifying modalities is that one model, one training run, and one serving stack replace several. That's cheaper and more capable at once, which is why nearly every major lab converged on multimodal transformers. The 2023 grant is a dated waypoint on that convergence.
The careful note: it's a granted patent with claims that bound the real scope, and unified vision and dialogue is an architecture, not a capability score. As a marker it's clean: by early 2023, fusing sight and conversation in one transformer was mature enough for a major software company to own.