How NVIDIA's Invertible Audio-Synthesis Neural Network Works

A patent granted June 30, 2026 describes a flow-style vocoder that runs the same network two ways: training maps recorded speech to Gaussian noise, and inference maps noise back to a waveform. Here is how the mechanism works, and where it sits among a wave of generative-media grants issued to NVIDIA the same day.

Forget the name for a second — "invertible neural network" — and start with the problem it solves. A vocoder is the part of a text-to-speech system that produces the actual sound: given a compact description of what the speech should sound like, it has to output tens of thousands of audio samples per second that a speaker can play. For years the highest-quality neural vocoders did this the slow way, generating one audio sample at a time, each conditioned on the samples before it. That is accurate but sequential, and sequential is expensive when you need 22,050 or 48,000 samples for every second of speech. A patent granted on June 30, 2026 and assigned to NVIDIA, US12670895B2, "Invertible neural network to synthesize audio signals," describes a different route to the same output — one that trades the one-sample-at-a-time loop for a single reversible transform.

"Systems and methods to help synthesize a second audio signal based, at least in part, on one or more neural networks trained using one or more characteristics of a first audio signal. Systems and methods to train one or more neural networks to synthesize a second audio signal based, at least in part, on one or more characteristics of a first audio signal."
(Abstract of "Invertible neural network to synthesize audio signals," US12670895B2)

The way this actually works

The mechanism turns on one word in the title: invertible. Most neural networks are one-way functions: you feed data in, you get a prediction out, and you cannot cleanly run the arrow backward. This network is built from layers that can be run in either direction, and NVIDIA's grant uses that property deliberately. As the claims describe it, training takes a first audio signal together with a compact representation of it (the patent specifies a mel-spectrogram, the frequency-over-time summary that most speech systems already produce) and generates one or more Gaussian values from the pair. In plain terms: during training, real recorded speech is pushed forward through the layers until it comes out the other side looking like samples drawn from a simple bell-curve (Gaussian) distribution. The network learns the exact, reversible transformation that carries messy real audio onto tidy noise.
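A minimal sketch of what "pushing audio toward Gaussian noise" can mean numerically, under assumptions the patent text quoted here does not confirm. The toy "network" is only a per-sample shift and scale, and the score is the change-of-variables likelihood commonly used to train flow models; `flow_nll`, `log_scale`, and `shift` are illustrative names, not terms from the grant.

```python
import numpy as np

def flow_nll(x, log_scale, shift):
    """Negative log-likelihood of x under a toy elementwise affine flow.

    Hypothetical stand-in for the real network: z = (x - shift) / exp(log_scale).
    """
    z = (x - shift) * np.exp(-log_scale)   # forward direction: audio -> latent
    log_det = -np.sum(log_scale)           # log |det dz/dx| of that map
    # Log-density of z under a standard (zero-mean, unit-variance) Gaussian.
    log_gauss = -0.5 * np.sum(z ** 2) - 0.5 * x.size * np.log(2.0 * np.pi)
    return -(log_gauss + log_det)

rng = np.random.default_rng(0)
audio = rng.normal(loc=3.0, scale=2.0, size=1000)   # fake "recorded" samples

# An untrained identity map versus one whose parameters match the data.
untrained = flow_nll(audio, np.zeros(1000), np.zeros(1000))
fitted = flow_nll(audio, np.full(1000, np.log(2.0)), np.full(1000, 3.0))
print(fitted < untrained)   # lower is better: the fitted map makes the data look Gaussian
```

Training a flow amounts to adjusting the transform's parameters to lower a score of this kind, so that the forward output is as close as possible to standard Gaussian samples.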

That is only half the machine, and the useful half is the reverse. Because every layer is invertible, once training has fixed the weights you can run the whole thing backward: sample fresh Gaussian noise, condition on the mel-spectrogram you want to voice, and push the noise back through the inverted layers to land on a waveform. The record states the point directly — the neural networks are "trained in a first direction and to generate inferences in a second direction." One network, two directions: forward to learn, backward to synthesize. There is no separate generator to train; the synthesizer is the analysis network run in reverse.
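The two directions can be sketched with a single affine coupling step, a common building block for reversible networks. This is an illustration under stated assumptions, not the patented implementation: the class name, the dimensions, and the one-layer `tanh` stand-in for the transform network are all hypothetical, and a real vocoder would stack many such steps.

```python
import numpy as np

class AffineCoupling:
    """Toy invertible layer: half the signal passes through, half is rescaled."""

    def __init__(self, half_dim, cond_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for the transform network: one random linear map plus tanh.
        self.w = rng.normal(scale=0.1, size=(half_dim + cond_dim, 2 * half_dim))

    def _scale_shift(self, x_a, cond):
        # Scale and shift depend only on the untouched half and the conditioning.
        h = np.tanh(np.concatenate([x_a, cond]) @ self.w)
        log_s, t = np.split(h, 2)
        return log_s, t

    def forward(self, x, cond):
        """Training direction: audio -> latent, plus the log-determinant."""
        x_a, x_b = np.split(x, 2)
        log_s, t = self._scale_shift(x_a, cond)
        z_b = np.exp(log_s) * x_b + t
        return np.concatenate([x_a, z_b]), log_s.sum()

    def inverse(self, z, cond):
        """Inference direction: latent -> audio, undoing forward exactly."""
        z_a, z_b = np.split(z, 2)
        log_s, t = self._scale_shift(z_a, cond)   # same values as in forward
        x_b = (z_b - t) * np.exp(-log_s)
        return np.concatenate([z_a, x_b])

layer = AffineCoupling(half_dim=4, cond_dim=3)
audio = np.linspace(-1.0, 1.0, 8)       # pretend waveform chunk
mel = np.array([0.2, -0.5, 0.9])        # pretend mel-spectrogram frame

z, log_det = layer.forward(audio, mel)
recovered = layer.inverse(z, mel)
print(np.allclose(recovered, audio))
```

The inverse never has to invert the transform network itself. Because the first half passes through unchanged, the same scale and shift can be recomputed from it and simply undone, which is why the synthesizer can reuse the trained weights as they are.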

Two more disclosed details make the approach practical. First, the Gaussian values are produced by invertible layers that the patent describes as coupling layers containing an audio transform — the standard way to build a reversible network is to split the signal, transform one part using the other, and keep the operation algebraically undoable. Second, the record specifies that this audio transform "uses dilated convolutions." Dilated convolutions are how a network sees far up and down a waveform without exploding in size: by skipping samples at widening intervals, a modest stack of filters can cover a long stretch of audio, which is exactly what you need to model the long-range structure of speech. Put together, the disclosed vocoder is a conditional normalizing-flow model: a mel-spectrogram sets the target, Gaussian noise supplies the randomness, and a reversible dilated-convolution network maps between them.
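The reach of dilated convolutions is easy to check with arithmetic. The sketch below computes the receptive field of a stack of stride-1 one-dimensional convolutions. The kernel size of 3, the eight layers, and the doubling dilation schedule are illustrative assumptions, not figures taken from the patent.

```python
def receptive_field(kernel_size, dilations):
    """Input samples visible to one output of a stack of stride-1 dilated 1-D convolutions."""
    # Each layer widens the view by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

doubling = [2 ** i for i in range(8)]   # 1, 2, 4, ..., 128 (assumed schedule)
plain = [1] * 8                         # same depth, no dilation

wide = receptive_field(3, doubling)
narrow = receptive_field(3, plain)
print(wide, narrow)
```

With these assumed settings the dilated stack sees 511 input samples per output while the undilated stack of the same depth sees 17, with an identical number of filter weights in both cases.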

Where it sits in the field

The state of the art the disclosure speaks to is the split between autoregressive vocoders, which are high-fidelity but generate sequentially, and flow-based vocoders, which learn an invertible mapping so that synthesis can run in a single parallel pass. NVIDIA's grant sits squarely in the second camp; the classification reflects it, leading with G10L 13/047 for the details of speech synthesis and adding G06N 3/045 and G06N 3/047 for the neural-architecture substance. The named inventors — Ryan Prenger, Rafael Valle, and Bryan Catanzaro — sit in NVIDIA's applied deep-learning research line, and one claim even contemplates the synthesis system being embodied in "a vehicle," a reminder that a fast parallel vocoder is as much an in-car voice-assistant component as a cloud one.

What makes this grant worth reading as a mechanism rather than a one-off is the company it keeps. The same June 30 grant drop issued a run of NVIDIA patents directed at generating and cleaning up media with neural networks. US12670600B2 describes disentangling image attributes with a neural network by using pose and appearance information to construct a foreground and background and reconstruct the input image — an unsupervised way to learn what parts of an image can be varied independently. US12670691B2 is directed to preserving fine detail that a denoiser would otherwise erase, by extracting high-frequency pixel data and reweighting the denoised output. The throughline is the same instinct that drives the vocoder: learn a transformation that adds or removes structure without throwing away the parts that carry perceptual quality.

The video-side grants extend the pattern into time. US12670541B2 describes a "warped external recurrent neural network" for reconstructing motion blur, depth-of-field, and anti-aliasing effects across a sequence of frames — notable because, rather than making every layer recurrent, it warps the final layer's output and feeds it back as part of the next frame's input, a lighter way to keep results temporally stable. US12670543B2 is narrower and more telling about productization: it claims an application programming interface that simply indicates whether one or more neural networks are available to perform frame interpolation. That is the plumbing layer — the way a generative model gets exposed to software that calls it — sitting in the same drop as the models themselves.

Read together, the vocoder is one instrument in a section. NVIDIA's June 30 grants describe neural networks that synthesize speech, disentangle and reconstruct images, interpolate and stabilize video, and preserve detail through denoising — and, in the frame-interpolation API grant, the interface that lets applications reach those networks. The invertible audio-synthesis patent is the clearest single illustration of the underlying idea, because its whole design is a reversible bridge between structured signal and simple noise. Learn the bridge in one direction; walk across it in the other. Strip the terminology and that is the mechanism: a network trained to see speech as noise, so that it can turn noise back into speech.
