INTERSPEECH 2026

Towards Robust Generative Speech Enhancement Using Vector Quantisation-Based Neural Audio Codec

Haixin Zhao and Nilesh Madhu · Ghent University - imec, Belgium

We propose cNAC-SE, a generative speech enhancement model built on a continuous neural audio codec (NAC) framework. Unlike discrete latent modelling approaches, cNAC-SE operates in the continuous latent space of a VQ-based codec, enabling robust and high-fidelity noise suppression.
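The distinction between the continuous and discrete latent spaces of a VQ-based codec can be illustrated with a toy example. The sketch below is not the cNAC-SE or dNAC-SE implementation; it assumes a single hypothetical codebook and random stand-in latents (real codecs typically use learned encoders and often several quantiser stages). It only shows that the discrete path keeps a codebook index per frame, whereas the continuous path keeps the pre-quantisation vector itself.

```python
import numpy as np

def vector_quantise(latents, codebook):
    """Map each latent frame to its nearest codebook entry (Euclidean distance)."""
    # Pairwise squared distances, shape (frames, codebook_size)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)      # discrete representation: one index per frame
    return indices, codebook[indices]   # indices and their dequantised vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # hypothetical: 16 entries, 4-dimensional latents
latents = rng.normal(size=(10, 4))   # hypothetical stand-in for encoder output, 10 frames

indices, quantised = vector_quantise(latents, codebook)

# Information discarded when only the discrete indices are kept
quantisation_error = np.mean((latents - quantised) ** 2)
print("discrete tokens:", indices)
print("mean squared quantisation error:", quantisation_error)
```

In this toy setting, a model that works on `latents` sees the full continuous vectors, while a model that works on `indices` sees them only up to the quantisation error.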

Below we provide listening examples from the DNS3 public test set, comparing the noisy input, the open-source StoRM baseline, the discrete variant dNAC-SE, and our proposed cNAC-SE. In these examples, dNAC-SE achieves substantial noise reduction but occasionally loses speech brightness. In contrast, cNAC-SE uses continuous latent-space modelling to better preserve speech fidelity and high-frequency detail. It also suppresses noise further than both StoRM and dNAC-SE, yielding cleaner enhanced speech with less residual noise and better overall listening quality.

Noisy Input | StoRM Enhanced | dNAC-SE | cNAC-SE (Proposed)

Audio Samples