Symposium on Advances in Audio Signal Processing

Welcome to the Symposium on Advances in Audio Signal Processing, where we bring together leading researchers in the field to share their latest insights into speech, voice, and audio technologies. This event will cover a range of topics, from auditory feedback modulation to speaker recognition and deep learning-based audio coding. Below, you will find an overview of our distinguished speakers and their talks.

Byeong Hyeon Kim – DNN-Based Audio Coding with Prior Knowledge in Signal Processing and Psychoacoustics:
Byeong Hyeon Kim is a Ph.D. student at Yonsei University, specializing in deep neural network-based audio coding and generative models. His research bridges traditional signal processing techniques with modern AI-driven approaches to optimize audio compression. In his talk, Byeong Hyeon will explore how incorporating psychoacoustic models into deep learning frameworks can enhance neural audio codecs. By leveraging human auditory perception, these techniques improve performance under constrained bitrates, demonstrating a powerful synergy between classical and machine learning-based audio coding methods.
Dr. Isabel Schiller – Auditory Feedback Modulation with VQ-Synth: A New Tool for Voice Therapy and Training?:
Dr. Isabel Schiller is a clinical linguist and researcher at the Institute of Psychology at RWTH Aachen University, specializing in auditory cognition and speech perception. Her research explores how voice quality perturbations affect auditory perception and vocal adaptation. In this talk, Dr. Schiller will introduce VQ-Synth, a real-time voice-quality manipulation tool developed for auditory feedback modulation experiments. She will discuss its potential applications in voice therapy and training, alongside insights gained from participant studies on induced dysphonia and vocal compensation.
Jenthe Thienpondt – Speaker Embeddings: Advances, Challenges, and Real-World Applications:
Jenthe Thienpondt is a senior machine learning researcher at IDLab (imec-Ghent University) with expertise in speaker recognition, diarization, and speech pathology analysis. His work has contributed to the development of the ECAPA-TDNN architecture, widely used in speaker verification challenges. In this presentation, Jenthe will provide an overview of recent advancements in speaker embeddings, highlighting how deep neural networks capture diverse speaker characteristics. He will also discuss their potential role in monitoring speech-related pathologies, demonstrating the intersection of AI and healthcare applications.
Chuan Wen – Sound Quality in DNN-based Hearing-Aid Algorithms:
Chuan Wen's research focuses on improving the sound quality of deep neural network-based hearing-aid algorithms. Traditional hearing aids primarily address outer-hair-cell (OHC) damage but often neglect age- or noise-induced cochlear synaptopathy (CS), which affects auditory-nerve fiber function. To compensate for these impairments, closed-loop systems built around AI-driven auditory models such as CoNNear have been developed. However, current implementations introduce undesirable artifacts, affecting sound quality. In this talk, Chuan Wen will present dCoNNear, a dilated CNN-based architecture designed to minimize these artifacts while maintaining accurate auditory processing. He will discuss its implementation across various auditory processing stages, its impact on sound quality, and its potential for real-world hearing-aid applications.

A detailed schedule and further information about the talks can be found below. Join us for an insightful day of discussions and networking as we explore the latest advancements in audio signal processing!

Here is the program:

Time Speaker Topic
9:55-10:00 Nilesh Madhu Introduction
10:00-10:30 Byeong Hyeon Kim (Yonsei University, KR) DNN-Based Audio Coding with Prior Knowledge in Signal Processing and Psychoacoustics
Abstract:
This work provides an overview of DNN-based audio coding research conducted at the DSP&AI Lab. Audio coding aims to represent signals with minimal bits while preserving quality, formulated as an optimization problem constrained by a limited bit budget. Traditional codecs have been designed by leveraging digital signal processing theories and psychoacoustics, which analyze how humans perceive sound. DNN-based audio coding can also benefit from this prior knowledge to address such constrained optimization problems. For instance, psychoacoustic models can enhance the performance of neural audio coding by weighting loss functions and discriminators or serving as differentiable training objectives. Compared to conventional loss functions, these objectives align more closely with human perception, improving performance under limited bitrates and model capacities. Additionally, the expressive power of DNNs and generative models can be better utilized with guidance from psychoacoustics. By applying DNNs selectively to quantization or using generative models only for perceptually irrelevant components, the codec pipeline can effectively combine the strengths of traditional codecs and DNNs.
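To make the loss-weighting idea above concrete, here is a toy numpy sketch (illustrative only, not the DSP&AI Lab's actual implementation): a spectral reconstruction error is weighted by the inverse of a masking threshold, so that coding errors the ear would mask contribute less to the training objective. The `masking_threshold` array is assumed to come from a psychoacoustic model; the normalization is chosen so that a flat threshold reduces the loss to plain MSE.

```python
import numpy as np

def perceptually_weighted_loss(ref_mag, est_mag, masking_threshold, eps=1e-8):
    """Spectral MSE weighted by the inverse masking threshold:
    errors hidden below the masker barely contribute to the loss."""
    weights = 1.0 / (masking_threshold + eps)
    weights = weights / weights.mean()  # normalise: flat threshold -> plain MSE
    return float(np.mean(weights * (ref_mag - est_mag) ** 2))
```

With a flat masking threshold this is ordinary MSE; where the threshold is high, errors become cheap, which mirrors how classical codecs shape quantization noise under the masking curve.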
10:30-10:35 Break
10:35-11:15 Dr. Isabel Schiller (RWTH Aachen University, DE) Auditory Feedback Modulation with VQ-Synth: A New Tool for Voice Therapy and Training?
Abstract:
The ability to manipulate voice quality in real time has significant implications for both research and clinical applications, particularly in the context of auditory feedback modulation (AFM). In AFM experiments, participants hear an acoustically altered version of their own voice through headphones as they speak, with these perturbations typically triggering vocal adaptations in response.
In this talk, I will give an overview of the VQ-Synth project, in which we investigated how different voice-quality perturbations affect speakers’ auditory perception and vocal responses. The AFM system used for this purpose, VQ-Synth, was developed in collaboration with Kiel University. Initially implemented in MATLAB, it was later optimized in ANSI C to minimize processing delay. Integrated with a graphical user interface (GUI), VQ-Synth provides a robust framework for psychological experiments.
As part of this project, we conducted a series of participant experiments to identify resynthesis settings that would successfully induce the perception of hoarseness (dysphonia) in participants’ auditory feedback and to determine whether this would trigger compensatory vocal responses, leading to voice quality improvements. This talk will include our first findings as well as a discussion of potential applications of the VQ-Synth system in voice therapy and training.
11:15-11:20 Break
11:20-12:00 Jenthe Thienpondt (Ghent University, BE) Speaker Embeddings: Advances, Challenges, and Real-World Applications
Abstract:
Following the performance leap that deep learning brought to many scientific fields, the speaker recognition community adopted it swiftly. Current deep neural network-based speaker embeddings robustly capture a wide range of speaker characteristics, including gender, language, and emotional tonality, from surprisingly short speech utterances. In this presentation, we will provide a concise overview of recent advancements in speaker embeddings. Subsequently, we will discuss our recent research investigating their potential application in monitoring speech-related pathologies.
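As a minimal sketch of the verification back-end described above: an embedding is pooled from frame-level features and length-normalised, and two recordings are compared by cosine score against a decision threshold. The mean-pooling `extract_embedding` below is a hypothetical stand-in for a trained network such as ECAPA-TDNN, and the threshold value is illustrative, not calibrated.

```python
import numpy as np

def extract_embedding(frames):
    """Stand-in for a DNN extractor: pool frame-level features and
    length-normalise, mirroring the statistics pooling + normalisation
    used by real embedding networks."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def verify_speaker(emb_a, emb_b, threshold=0.7):
    """Cosine scoring: accept the trial if the score clears the threshold."""
    score = float(np.dot(emb_a, emb_b))
    return score, score >= threshold
```

Because the embeddings are unit-length, the cosine score is just a dot product; identical speakers score near 1.0 and unrelated ones near 0.0.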
12:00-12:05 Break
12:05-12:45 Chuan Wen Sound Quality in DNN-based Hearing-Aid Algorithms
Abstract:
Current hearing aids typically focus on addressing outer-hair-cell (OHC) damage and the associated loss of hearing sensitivity, but do not consider age- or noise-exposure-related damage to auditory-nerve fibers (i.e., cochlear synaptopathy, CS). To compensate for individual and combined CS and OHC damage patterns, closed-loop systems that include biophysical models of (impaired) auditory signal processing can generate personalized sound-processing algorithms. These systems are particularly powerful when implemented as deep neural networks (DNNs), which allow the sound-processing algorithms to be optimized via backpropagation and make the systems suitable for AI hardware integration. One such system, CoNNear, employs autoencoder-based models of auditory processing that simulate cochlear mechanics, inner-hair-cell function, and auditory-nerve fiber activity, and is trained to minimize the differences in auditory processing between normal and impaired hearing models. However, such end-to-end systems introduce artifacts that differ from those of traditional sound processors, such as tonal artifacts caused by the transposed convolutions in CNN-based auditory modules. These artifacts propagate within the closed-loop framework and ultimately become overamplified and audible in the resulting hearing-aid algorithm.
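The tonal-artifact mechanism can be seen in a toy numpy experiment (illustrative only, not the CoNNear code): a stride-2 transposed convolution with a 3-tap kernel covers alternate output samples with two taps versus one, so even a constant input comes out with a periodic ripple, i.e. a tone, whereas a dilated convolution of the same constant input stays flat.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: each input sample stamps the kernel
    into the output at stride-spaced positions (overlap-add)."""
    out = np.zeros(stride * (len(x) - 1) + len(kernel))
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += v * kernel
    return out

def dilated_conv1d(x, kernel, dilation):
    """Naive 1-D dilated convolution, 'valid' output region only."""
    span = dilation * (len(kernel) - 1) + 1
    return np.array([sum(k * x[i + j * dilation] for j, k in enumerate(kernel))
                     for i in range(len(x) - span + 1)])

kernel = np.ones(3) / 3   # simple smoothing kernel
x = np.ones(16)           # constant (DC) input: no tone present
up = transposed_conv1d(x, kernel, stride=2)
flat = dilated_conv1d(x, kernel, dilation=2)
# interior of `up` alternates 2/3, 1/3 -> a periodic (tonal) ripple,
# while the dilated output is perfectly flat.
```

The ripple appears because the kernel length (3) is not divisible by the stride (2), the classic checkerboard condition; the dilated convolution has stride 1, so every output sample sees the same number of taps.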
To address this challenge, we propose a dilated CNN-based architecture (dCoNNear) built from a sequence of stacked memory blocks, which proved the most promising, artifact-free building blocks for closed-loop audio processing. We applied the dCoNNear architecture to all auditory elements inside the closed-loop system as well as to the sound processors, and evaluated the sound quality and compensation accuracy of the resulting algorithms. Our results show that dCoNNear not only accurately simulates all processing stages of a non-DNN-based, state-of-the-art biophysical auditory processing system, but also does so without introducing spurious, audible artifacts in the resulting sound processors. The predicted restoration accuracy for simulated auditory-nerve population responses suggests that our algorithms can compensate for both OHC and CS pathologies.