Symposium on Advances in Audio Signal Processing

Welcome to the Symposium on Advances in Audio Signal Processing, where we bring together leading researchers in the field to share their latest insights into speech, voice, and audio technologies. This event will cover a range of topics, from auditory feedback modulation to speaker recognition and deep learning-based audio coding. Below, you will find an overview of our distinguished speakers and their talks.

Byeong Hyeon Kim – DNN-Based Audio Coding with Prior Knowledge in Signal Processing and Psychoacoustics:
Byeong Hyeon Kim is a Ph.D. student at Yonsei University, specializing in deep neural network-based audio coding and generative models. His research bridges traditional signal processing techniques with modern AI-driven approaches to optimize audio compression. In his talk, Byeong Hyeon will explore how incorporating psychoacoustic models into deep learning frameworks can enhance neural audio codecs. By leveraging human auditory perception, these techniques improve performance under constrained bitrates, demonstrating a powerful synergy between classical and machine learning-based audio coding methods.
Dr. Isabel Schiller – Auditory Feedback Modulation with VQ-Synth: A New Tool for Voice Therapy and Training?:
Dr. Isabel Schiller is a clinical linguist and researcher at the Institute of Psychology at RWTH Aachen University, specializing in auditory cognition and speech perception. Her research explores how voice quality perturbations affect auditory perception and vocal adaptation. In this talk, Dr. Schiller will introduce VQ-Synth, a real-time voice-quality manipulation tool developed for auditory feedback modulation experiments. She will discuss its potential applications in voice therapy and training, alongside insights gained from participant studies on induced dysphonia and vocal compensation.
Jenthe Thienpondt – Speaker Embeddings: Advances, Challenges, and Real-World Applications:
Jenthe Thienpondt is a senior machine learning researcher at IDLab (imec-Ghent University) with expertise in speaker recognition, diarization, and speech pathology analysis. His work has contributed to the development of the ECAPA-TDNN architecture, widely used in speaker verification challenges. In this presentation, Jenthe will provide an overview of recent advancements in speaker embeddings, highlighting how deep neural networks capture diverse speaker characteristics. He will also discuss their potential role in monitoring speech-related pathologies, demonstrating the intersection of AI and healthcare applications.
Chuan Wen – Sound Quality in DNN-based Hearing-Aid Algorithms:
Chuan Wen's research focuses on improving the sound quality of deep neural network-based hearing-aid algorithms. Traditional hearing aids primarily address outer-hair-cell (OHC) damage but often neglect age- or noise-induced cochlear synaptopathy (CS), which affects auditory-nerve fiber function. To compensate for these impairments, closed-loop systems built around AI-driven auditory models such as CoNNear have been developed. However, current implementations introduce undesirable artifacts, affecting sound quality. In this talk, Chuan Wen will present dCoNNear, a dilated CNN-based architecture designed to minimize these artifacts while maintaining accurate auditory processing. He will discuss its implementation across various auditory processing stages, its impact on sound quality, and its potential for real-world hearing-aid applications.

A detailed schedule and further information about the talks can be found below. Join us for an insightful day of discussions and networking as we explore the latest advancements in audio signal processing!

Here is the program:

Time Speaker Topic
9:55-10:00 Nilesh Madhu Introduction
10:00-10:30 Byeong Hyeon Kim (Yonsei University, KR) DNN-Based Audio Coding with Prior Knowledge in Signal Processing and Psychoacoustics
Abstract:
This work provides an overview of DNN-based audio coding research conducted at the DSP&AI Lab. Audio coding aims to represent signals with minimal bits while preserving quality, formulated as an optimization problem constrained by a limited bit budget. Traditional codecs have been designed by leveraging digital signal processing theories and psychoacoustics, which analyze how humans perceive sound. DNN-based audio coding can also benefit from this prior knowledge to address such constrained optimization problems. For instance, psychoacoustic models can enhance the performance of neural audio coding by weighting loss functions and discriminators or serving as differentiable training objectives. Compared to conventional loss functions, these objectives align more closely with human perception, improving performance under limited bitrates and model capacities. Additionally, the expressive power of DNNs and generative models can be better utilized with guidance from psychoacoustics. By applying DNNs selectively to quantization or using generative models only for perceptually irrelevant components, the codec pipeline can effectively combine the strengths of traditional codecs and DNNs.
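To make the loss-weighting idea above concrete, here is a toy numpy sketch (illustrative only, not the DSP&AI Lab's actual implementation): a spectral reconstruction error is weighted by the inverse of a masking threshold, so that coding errors the ear would mask contribute less to the training objective. The `masking_threshold` array is assumed to come from a psychoacoustic model; the normalization is chosen so that a flat threshold reduces the loss to plain MSE.

```python
import numpy as np

def perceptually_weighted_loss(ref_mag, est_mag, masking_threshold, eps=1e-8):
    """Spectral MSE weighted by the inverse masking threshold:
    errors hidden below the masker barely contribute to the loss."""
    weights = 1.0 / (masking_threshold + eps)
    weights = weights / weights.mean()  # normalise: flat threshold -> plain MSE
    return float(np.mean(weights * (ref_mag - est_mag) ** 2))
```

With a flat masking threshold this is ordinary MSE; where the threshold is high, errors become cheap, which mirrors how classical codecs shape quantization noise under the masking curve.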
10:30-10:35 Break
10:35-11:15 Dr. Isabel Schiller (RWTH Aachen University, DE) Auditory Feedback Modulation with VQ-Synth: A New Tool for Voice Therapy and Training?
Abstract:
The ability to manipulate voice quality in real time has significant implications for both research and clinical applications, particularly in the context of auditory feedback modulation (AFM). In AFM experiments, participants hear an acoustically altered version of their own voice through headphones as they speak, with these perturbations typically triggering vocal adaptations in response.
In this talk, I will give an overview of the VQ-Synth project, in which we investigated how different voice-quality perturbations affect speakers’ auditory perception and vocal responses. The AFM system used for this purpose, VQ-Synth, was developed in collaboration with Kiel University. Initially implemented in MATLAB, it was later optimized in ANSI C to minimize processing delay. Integrated with a graphical user interface (GUI), VQ-Synth provides a robust framework for psychological experiments.
As part of this project, we conducted a series of participant experiments to identify resynthesis settings that would successfully induce the perception of hoarseness (dysphonia) in participants’ auditory feedback and to determine whether this would trigger compensatory vocal responses, leading to voice quality improvements. This talk will include our first findings as well as a discussion of potential applications of the VQ-Synth system in voice therapy and training.
11:15-11:20 Break
11:20-12:00 Jenthe Thienpondt (Ghent University, BE) Speaker Embeddings: Advances, Challenges, and Real-World Applications
Abstract:
Following the performance leap that deep learning brought to many scientific fields, the speaker recognition community adopted it swiftly. Current deep neural network-based speaker embeddings robustly capture a wide range of speaker characteristics, including gender, language, and emotional tonality, from surprisingly short speech utterances. In this presentation, we will provide a concise overview of recent advancements in speaker embeddings. Subsequently, we will discuss our recent research investigating their potential application in monitoring speech-related pathologies.
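As a minimal sketch of the verification back-end described above: an embedding is pooled from frame-level features and length-normalised, and two recordings are compared by cosine score against a decision threshold. The mean-pooling `extract_embedding` below is a hypothetical stand-in for a trained network such as ECAPA-TDNN, and the threshold value is illustrative, not calibrated.

```python
import numpy as np

def extract_embedding(frames):
    """Stand-in for a DNN extractor: pool frame-level features and
    length-normalise, mirroring the statistics pooling + normalisation
    used by real embedding networks."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def verify_speaker(emb_a, emb_b, threshold=0.7):
    """Cosine scoring: accept the trial if the score clears the threshold."""
    score = float(np.dot(emb_a, emb_b))
    return score, score >= threshold
```

Because the embeddings are unit-length, the cosine score is just a dot product; identical speakers score near 1.0 and unrelated ones near 0.0.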
12:00-12:05 Break
12:05-12:45 Chuan Wen Sound Quality in DNN-based Hearing-Aid Algorithms
Abstract:
Current hearing aids typically focus on addressing outer-hair-cell (OHC) damage and the associated loss of hearing sensitivity, but do not consider age- or noise-exposure-related damage to auditory-nerve fibers (i.e., cochlear synaptopathy, CS). To compensate for individual and combined CS and OHC damage patterns, closed-loop systems that include biophysical models of (impaired) auditory signal processing can generate personalized sound-processing algorithms. These systems are particularly powerful when implemented as deep neural networks (DNNs), which allow the sound-processing algorithms to be optimized via backpropagation and make the systems suitable for AI hardware integration. One such system, CoNNear, employs autoencoder-based models of auditory processing that simulate cochlear mechanics, inner-hair-cell function, and auditory-nerve fiber activity, and is trained to minimize the differences in auditory processing between normal and impaired hearing models. However, such end-to-end systems introduce artifacts that differ from those of traditional sound processors, such as tonal artifacts caused by the transposed convolutions in CNN-based auditory modules. These artifacts propagate within the closed-loop framework and ultimately become overamplified and audible in the resulting hearing-aid algorithm.
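The tonal-artifact mechanism can be seen in a toy numpy experiment (illustrative only, not the CoNNear code): a stride-2 transposed convolution with a 3-tap kernel covers alternate output samples with two taps versus one, so even a constant input comes out with a periodic ripple, i.e. a tone, whereas a dilated convolution of the same constant input stays flat.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: each input sample stamps the kernel
    into the output at stride-spaced positions (overlap-add)."""
    out = np.zeros(stride * (len(x) - 1) + len(kernel))
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += v * kernel
    return out

def dilated_conv1d(x, kernel, dilation):
    """Naive 1-D dilated convolution, 'valid' output region only."""
    span = dilation * (len(kernel) - 1) + 1
    return np.array([sum(k * x[i + j * dilation] for j, k in enumerate(kernel))
                     for i in range(len(x) - span + 1)])

kernel = np.ones(3) / 3   # simple smoothing kernel
x = np.ones(16)           # constant (DC) input: no tone present
up = transposed_conv1d(x, kernel, stride=2)
flat = dilated_conv1d(x, kernel, dilation=2)
# interior of `up` alternates 2/3, 1/3 -> a periodic (tonal) ripple,
# while the dilated output is perfectly flat.
```

The ripple appears because the kernel length (3) is not divisible by the stride (2), the classic checkerboard condition; the dilated convolution has stride 1, so every output sample sees the same number of taps.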
To address this challenge, we propose a dilated CNN-based architecture (dCoNNear) built from a sequence of stacked memory blocks, which proved the most promising, artifact-free building blocks for closed-loop audio processing. We applied the dCoNNear architecture to all auditory elements inside the closed-loop system as well as to the sound processors, and evaluated the sound quality and compensation accuracy of the resulting algorithms. Our results show that dCoNNear not only accurately simulates all processing stages of a non-DNN-based, state-of-the-art biophysical auditory processing system, but also does so without introducing spurious, audible artifacts in the resulting sound processors. The predicted restoration accuracy for simulated auditory-nerve population responses suggests that our algorithms can compensate for both OHC and CS pathologies.