Sequence-to-Sequence Speech Patent | NeuralDocket

A February 2020 grant describes mapping one sequence to another to fix speech-recognition output. It offers a small window into how transcription gets cleaned up.

Here's the question under the hood: when a speech recognizer hears audio and guesses words, those guesses are imperfect. How do you fix them automatically? One answer is to treat correction as a translation problem: translate the noisy guess into the clean intended text. That's sequence-to-sequence.

The grant, US10573296B1 (issued February 25, 2020), describes reconciliation between a simulator and recognizer output using this kind of seq-to-seq mapping, with CPC tags in G10L (speech processing) and G06N 20/00 (machine learning generally). The same family includes US10559299B1, issued two weeks earlier. Together they document a learned sequence-mapping layer built around recognition, although the claims point it in a different direction than the correction framing suggests.

“A synthetic training data item comprising a first sequence of symbols that represent a synthetic sentence output by a simulator is received.” (U.S. Patent No. 10,573,296)

What the claims actually describe is more specific than “clean up transcripts.” The independent method claim starts with a training dataset where each item pairs acoustic features derived from audio with a first sequence of symbols that represents the utterance. An acoustic model processes those features and emits a second sequence of symbols, one that differs from the first. The dataset is then rebuilt: the acoustic features are stripped out, and the acoustic model's output sequence is added in. A machine-learning model is trained on that modified dataset as a sequence-to-sequence converter, with the first symbol sequence fed in as input and the second used as the target that adjusts the network's weights. In plain terms, the system learns to reproduce the specific mistakes and quirks a given acoustic model makes, so that a downstream component can be trained to expect them.
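The dataset rebuild in that claim can be sketched in a few lines. This is an illustration under stated assumptions, not the patent's implementation: `TrainingItem`, `rebuild_dataset`, and `toy_acoustic_model` are hypothetical names, and the "acoustic model" here is a trivial stand-in that picks a symbol per frame by its largest feature value.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class TrainingItem:
    """One original item: acoustic features plus the reference symbols."""
    acoustic_features: List[List[float]]  # assumed layout: one vector per frame
    reference_symbols: List[str]          # the "first sequence of symbols"


def rebuild_dataset(
    items: Sequence[TrainingItem],
    acoustic_model: Callable[[List[List[float]]], List[str]],
) -> List[Tuple[List[str], List[str]]]:
    """Drop the acoustic features; keep (reference, acoustic-model output) pairs."""
    pairs = []
    for item in items:
        predicted = acoustic_model(item.acoustic_features)  # "second sequence"
        pairs.append((item.reference_symbols, predicted))
    return pairs


def toy_acoustic_model(features: List[List[float]]) -> List[str]:
    """Hypothetical stand-in: map each frame to the symbol with the largest value."""
    symbols = ["k", "ae", "t", "d"]
    return [symbols[max(range(len(f)), key=f.__getitem__)] for f in features]


items = [
    TrainingItem(
        acoustic_features=[
            [0.9, 0.0, 0.1, 0.0],
            [0.1, 0.8, 0.0, 0.1],
            [0.0, 0.1, 0.4, 0.5],  # the toy model mishears this frame as "d"
        ],
        reference_symbols=["k", "ae", "t"],
    )
]
pairs = rebuild_dataset(items, toy_acoustic_model)
print(pairs)  # [(['k', 'ae', 't'], ['k', 'ae', 'd'])]
```

The point of the sketch is the shape of the output: after the rebuild, no audio features remain, only paired symbol sequences that a converter can be trained on.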

The reason this matters is buried in the abstract's notion of a “statistically significant mismatch.” The disclosure draws a careful three-way distinction: a synthetic sentence from a simulator, the symbols an acoustic model would actually emit if it heard that sentence spoken, and a corrected sequence. The trained model rewrites the clean synthetic text into the form the acoustic model would really produce, so that there is no significant mismatch between the rewritten version and real acoustic-model output. That is the inversion that makes the method useful: rather than fixing recognizer errors after the fact, it makes synthetic training text look like recognizer output, so a conversational agent's downstream model can be trained on cheap synthetic data that behaves like the messy real thing.

The mechanism matters because it belongs to the same family as modern systems. Today's speech and language models are seq-to-seq at their core: audio-to-text or text-to-text, one sequence in and another out. This 2020 grant is a narrow, applied instance of the architecture that would soon swallow the whole field. It also names its own end goal explicitly: the claim states that using the converter to help train a transcoder of a conversational agent “improves an accuracy of the conversational agent.” The patent is, in other words, a piece of voice-assistant plumbing: a way to manufacture realistic training data at scale.

Set against the sector story, the timing fits: by 2020, the transformer-driven seq-to-seq revolution was underway in research, and patents like this show it diffusing into shipped speech products. The reconciliation framing (using a learned model to clean up or simulate another system's output) also prefigures today's ensembles where one model checks or feeds another. The dependent claims push further in the same direction, adding a speaker-characteristic vector to each data item so the converter can condition its rewriting on who is speaking.

It helps to walk the data path the claim lays out, step by step. Start with real recordings: each training item carries acoustic features derived from audio for an utterance, paired with a first symbol sequence that represents what was said. The acoustic model ingests those features and emits a second symbol sequence, its best guess, which per the claim differs from the reference. The method then performs a deliberate substitution at the dataset level: it removes the acoustic features entirely and replaces them with the acoustic model's output sequence. The result is a corpus of paired symbol sequences (reference in, acoustic-model output out) with the audio thrown away. A sequence-to-sequence model trained on those pairs learns the mapping from “what was meant” to “what this particular recognizer hears,” with its weights adjusted against the recognizer output as the target.
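To make the training direction concrete, here is a deliberately tiny stand-in for the converter. It is not a neural sequence-to-sequence model and not the patent's method: it is a per-symbol substitution table learned from paired sequences, under the simplifying assumption that each pair is the same length and already aligned (a real seq-to-seq model needs neither). The names `fit_substitution_table` and `convert` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[List[str], List[str]]  # (reference symbols, recognizer symbols)


def fit_substitution_table(pairs: Sequence[Pair]) -> Dict[str, str]:
    """For each reference symbol, learn the recognizer symbol seen most often."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for reference, recognized in pairs:
        # Simplifying assumption: sequences are equal-length and aligned.
        for ref_sym, rec_sym in zip(reference, recognized):
            counts[ref_sym][rec_sym] += 1
    return {ref: c.most_common(1)[0][0] for ref, c in counts.items()}


def convert(table: Dict[str, str], reference: List[str]) -> List[str]:
    """Rewrite reference symbols into the form the recognizer tends to emit."""
    # Symbols never seen in training pass through unchanged.
    return [table.get(sym, sym) for sym in reference]


pairs = [
    (["k", "ae", "t"], ["k", "ae", "d"]),
    (["b", "ae", "t"], ["b", "ae", "d"]),
    (["t", "ae", "k"], ["t", "ae", "k"]),
]
table = fit_substitution_table(pairs)
print(convert(table, ["m", "ae", "t"]))  # ['m', 'ae', 'd']
```

Note which side is the input and which is the target: the reference goes in, and the recognizer's typical output comes out. The toy table has picked up this recognizer's habit of hearing "t" as "d" and reproduces it on a sequence it has never seen.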

That direction is the counterintuitive part and the source of the leverage. Most people imagine error correction running forward, from a garbled transcript to a clean one. Here the trained converter runs the other way for data generation: it takes clean synthetic sentences from a simulator and rewrites them into the noisy, error-shaped form the acoustic model would produce, so that, in the patent's words, there is no “statistically significant mismatch” between the synthetic data and real recognized text. A conversational agent's downstream components can then be trained on essentially unlimited simulated dialogue that nonetheless carries the fingerprints of real-world recognition error. It is a way to buy realism without buying recordings.
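One rough way to picture "mismatch" is to compare how often each symbol appears in two corpora. The sketch below uses total variation distance between symbol frequencies. That choice is an assumption made here for illustration: it is a descriptive distance rather than a significance test, and nothing quoted above says which statistic the patent has in mind. The function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def symbol_distribution(corpus: Sequence[List[str]]) -> Dict[str, float]:
    """Relative frequency of each symbol across all sequences in a corpus."""
    counts = Counter(sym for seq in corpus for sym in seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sym: n / total for sym, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the summed absolute difference between two distributions (0 to 1)."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)


# Hypothetical data: real recognizer output, clean simulator text, and the
# same simulator text after a converter has rewritten it.
real_output = [["k", "ae", "d"], ["b", "ae", "d"]]
clean_synthetic = [["k", "ae", "t"], ["b", "ae", "t"]]
rewritten_synthetic = [["k", "ae", "d"], ["b", "ae", "d"]]

real_dist = symbol_distribution(real_output)
gap_before = total_variation(symbol_distribution(clean_synthetic), real_dist)
gap_after = total_variation(symbol_distribution(rewritten_synthetic), real_dist)
print(gap_before > gap_after)  # True
```

In this contrived case the rewritten text matches the real output exactly, so the gap drops to zero. Real corpora would need far more data and a proper statistical test before anyone could call a remaining gap insignificant.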

One more disclosed wrinkle is worth surfacing because it shows the method anticipating speaker variation. A dependent claim has a second machine-learning model process one or more of the training data items to generate “a vector representing one or more speaker characteristics,” which is then added to the data items as the dataset is modified. That conditioning vector lets the sequence-to-sequence converter shape its rewriting around who is speaking, since the errors a recognizer makes for one voice (with its accent, pitch, and speaking rate) differ from those it makes for another. Folding a speaker embedding into the training data is how the method keeps its simulated error patterns realistic across a population rather than collapsing to an average voice. It is a small claim, but it points at the same theme as the rest of the grant: most of the work is in manufacturing training data that behaves like the real, messy, speaker-dependent output of a deployed recognizer.
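The speaker-vector step can be sketched the same way. Everything named here is hypothetical: `PairedItem`, `attach_speaker_vector`, and especially `toy_speaker_model`, which simply averages each feature dimension over the frames. The claim calls for a second machine-learning model at that point, and the averaging is only a placeholder for it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PairedItem:
    """A modified data item: two symbol sequences plus a speaker vector."""
    reference_symbols: List[str]
    recognized_symbols: List[str]
    speaker_vector: List[float]


def toy_speaker_model(features: List[List[float]]) -> List[float]:
    """Hypothetical stand-in: average each feature dimension over all frames."""
    n_frames = len(features)
    return [sum(column) / n_frames for column in zip(*features)]


def attach_speaker_vector(
    reference: List[str],
    recognized: List[str],
    acoustic_features: List[List[float]],
    speaker_model: Callable[[List[List[float]]], List[float]],
) -> PairedItem:
    """Compute the speaker vector from the original item, then drop the audio."""
    return PairedItem(reference, recognized, speaker_model(acoustic_features))


item = attach_speaker_vector(
    reference=["k", "ae", "t"],
    recognized=["k", "ae", "d"],
    acoustic_features=[[1.0, 3.0], [3.0, 5.0]],
    speaker_model=toy_speaker_model,
)
print(item.speaker_vector)  # [2.0, 4.0]
```

The modified item still carries no acoustic features, only a compact summary of the speaker that a converter could take as extra input alongside the reference symbols.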

A note of caution: this is a granted patent, so it's an enforceable right, but its scope is whatever the claims say, not the broad idea of seq-to-seq. The value here is historical: a dated, concrete marker of when learned sequence mapping became production speech infrastructure, and a reminder that a surprising amount of the work is not recognizing speech at all but generating training data that mimics the errors of the systems that do.

