Symposium on Advances in Audio Signal Processing

We are pleased to welcome Professor Emanuël Habets, who is jointly affiliated with Fraunhofer IIS and the Friedrich-Alexander Universität Erlangen-Nürnberg, to present a keynote on "Neural Directional Filtering and Dynamic Slimmable Networks".

Abstract:
In this talk, I will provide a brief overview of the research conducted in my group at the International Audio Laboratories Erlangen. I will then delve into two recent contributions.
First, I will introduce Neural Directional Filtering, a novel approach to far-field speech capture that utilizes deep neural networks to achieve precise directivity control. It outperforms traditional linear and parametric filtering approaches, enabling high spatial selectivity with a small number of microphones. This offers significant advantages for applications where spatial selectivity is critical, such as teleconferencing and hearables.
Next, I will discuss Dynamic Slimmable Neural Networks for speech separation, designed to address the computational inefficiency of conventional speech separation networks. In resource-constrained environments, such as mobile or embedded devices, processing all input frames with the entire network can lead to unnecessary computational overhead, particularly during silent intervals or when a single speaker dominates. To mitigate this inefficiency without increasing memory requirements, we propose a dynamic architecture with slimmable layers that adjust the computational effort to the input's characteristics. This approach, applied to dual-path networks for speech separation, significantly reduces computational cost while maintaining high separation performance, as demonstrated on the WSJ0-2mix dataset.
These two advancements reflect a growing trend in speech processing: leveraging neural networks to enhance both the efficiency and effectiveness of spatial and spectral processing, with promising implications for real-world applications.
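To make the slimmable-layer idea more concrete, here is a minimal PyTorch sketch of a layer whose active width is selected per input by a small gate network; the class and parameter names are illustrative assumptions and not taken from the presented work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    """Linear layer whose active width is chosen per input by a tiny gate network.
    All widths share the same weight matrix, so memory does not grow with the
    number of width options (illustrative sketch only)."""

    def __init__(self, in_dim, out_dim, width_ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.full = nn.Linear(in_dim, out_dim)
        self.width_ratios = width_ratios
        self.gate = nn.Linear(in_dim, len(width_ratios))  # scores each candidate width

    def forward(self, x):  # x: (batch, in_dim), e.g. one frame per batch element
        # Pick one width for the whole batch (simplification); training would need a
        # differentiable surrogate such as Gumbel-softmax for this discrete choice.
        idx = int(self.gate(x).mean(dim=0).argmax())
        k = max(1, int(self.width_ratios[idx] * self.full.out_features))
        y = F.linear(x, self.full.weight[:k], self.full.bias[:k])  # use only the first k units
        return F.pad(y, (0, self.full.out_features - k))           # keep the output shape fixed
```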

Following this keynote, there will be six short presentations from international researchers at UGent, Aalborg University, and the University of Hamburg. These talks will cover the latest advancements in acoustic signal processing, including topics such as source separation, noise reduction, and speech and audio analytics, providing insights into future research directions.

Here is the program:

Time Speaker Topic
10:00-10:10 Nilesh Madhu Introduction
10:10-11:00 Emanuël Habets (Fraunhofer IIS, DE) Neural Directional Filtering and Dynamic Slimmable Networks
11:00-11:05 Break
11:05-11:25 Alina Mannanova (University of Hamburg, DE) Meta-Learning for Variable Array Configurations in End-to-End Few-shot Multichannel Speech Enhancement
Abstract:
Nowadays, deep neural networks are a common choice for multichannel speech processing, as they can outperform the traditional concatenation of a linear beamformer and a postfilter in challenging scenarios. To obtain strong spatial selectivity, these approaches are typically trained for a specific microphone array configuration. However, it was recently shown that such models are sensitive even to small perturbations in the microphone placements. In this work, we propose a method for handling variable array configurations based on model-agnostic meta-learning. We demonstrate that the proposed approach increases robustness to changes in the array configuration, i.e., mismatched conditions, while maintaining the same performance as the array-specific model under matched conditions.
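As a rough illustration of how model-agnostic meta-learning can be applied here, the first-order sketch below treats each microphone array configuration as a separate task with its own support and query data; the function names, learning rates, and loss are assumptions, not the presenters' code.

```python
import copy
import torch
import torch.nn.functional as F

def maml_outer_step(model, tasks, meta_opt, inner_lr=1e-3):
    """tasks: iterable of ((x_support, y_support), (x_query, y_query)),
    one task per microphone array configuration.
    meta_opt: optimizer constructed over model.parameters()."""
    meta_opt.zero_grad()
    for (x_s, y_s), (x_q, y_q) in tasks:
        adapted = copy.deepcopy(model)                     # fast weights for this array layout
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        # Inner loop: one adaptation step on a few examples from this array layout.
        inner_opt.zero_grad()
        F.mse_loss(adapted(x_s), y_s).backward()
        inner_opt.step()
        # Outer loop: evaluate the adapted model on query data from the same array.
        adapted.zero_grad()
        F.mse_loss(adapted(x_q), y_q).backward()           # first-order MAML approximation
        for p, q in zip(model.parameters(), adapted.parameters()):
            p.grad = q.grad.clone() if p.grad is None else p.grad + q.grad
    meta_opt.step()                                        # update the shared meta-initialisation
```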
11:25-11:45 Jakob Kienegger (University of Hamburg, DE) Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation
Abstract:
Due to their robustness and flexibility, neural-driven beamformers are a popular choice for speech separation in challenging environments with a varying number of simultaneous speakers alongside noise and reverberation. Time-frequency masks and the directions of the speakers relative to a fixed spatial grid can be used to estimate the beamformer's parameters. To some degree, speaker independence is achieved by ensuring a greater number of spatial partitions than speech sources. In this work, we analyze how to encode both mask and position into such a grid to enable joint estimation of both quantities. We propose mask-weighted spatial likelihood coding and show that it achieves competitive performance in both tasks compared to baseline encodings optimized for either localization or mask estimation. In the same setup, we demonstrate its superiority for joint estimation of both quantities. Finally, we propose a universal approach that can replace an upstream sound source localization system solely by adapting the training framework, making it highly relevant for performance-critical scenarios.
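The exact coding is defined in the presented work; the NumPy sketch below only illustrates one plausible reading of the idea, in which each speaker's time-frequency mask is spread over a fixed angular grid according to an angular likelihood centred on that speaker's direction. The kernel choice and all names are assumptions for illustration.

```python
import numpy as np

def encode_targets(masks, doas, grid_deg, kappa=5.0):
    """masks: (S, T, F) per-speaker time-frequency masks; doas: (S,) speaker directions
    in degrees; grid_deg: (G,) grid directions. Returns (G, T, F) training targets."""
    targets = np.zeros((len(grid_deg),) + masks.shape[1:])
    for mask, doa in zip(masks, doas):
        # von Mises-like angular likelihood of this speaker on each grid direction
        diff = np.deg2rad(grid_deg - doa)
        lik = np.exp(kappa * (np.cos(diff) - 1.0))      # peaks at the speaker's direction
        lik /= lik.sum()
        targets += lik[:, None, None] * mask[None]      # weight the likelihood by the mask
    return targets
```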
11:45-12:00 Yanjue Song (ASPIRE, UGent) Generative Speech Enhancement with Self-Supervised Learning Models
Abstract:
Self-supervised learning (SSL) models offer the potential to encode essential speech-related features into the latent space by leveraging large, unannotated datasets. These pre-trained models have demonstrated their effectiveness across various phonetic, semantic, and paralinguistic tasks, such as emotion recognition, speaker identification, and automatic speech recognition. But can these pre-trained models be applied to the task of speech enhancement, particularly through a re-synthesis framework? This presentation explores our recent investigations into this question. Given the abundance of pre-trained models, our research begins with SSL model selection. Rather than empirically training all potential systems and selecting the best-performing one, we develop an analytic framework to identify the optimal model for a given task by directly analysing the extracted embeddings (the features produced by SSL models). We then focus on improving the performance of neural vocoders based on this optimal model. Specifically, we propose providing the noisy spectrogram, in addition to the SSL embeddings, to the neural vocoder to enhance the naturalness of the synthesized speech signals.
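As a rough picture of the re-synthesis framework, the sketch below combines frozen SSL embeddings with the noisy spectrogram as conditioning for a neural vocoder. The module names, and the assumption that SSL frames and spectrogram frames are time-aligned, are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SSLResynthesisEnhancer(nn.Module):
    def __init__(self, ssl_encoder, vocoder, ssl_dim, spec_dim, cond_dim):
        super().__init__()
        self.ssl = ssl_encoder.eval()              # pre-trained SSL model, kept frozen
        for p in self.ssl.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(ssl_dim + spec_dim, cond_dim)
        self.vocoder = vocoder                     # trained to re-synthesise clean speech

    def forward(self, noisy_wave, noisy_spec):
        # noisy_wave: (B, samples); noisy_spec: (B, T, spec_dim),
        # assumed time-aligned with the SSL frame rate.
        emb = self.ssl(noisy_wave)                 # (B, T, ssl_dim)
        cond = self.proj(torch.cat([emb, noisy_spec], dim=-1))
        return self.vocoder(cond)                  # enhanced waveform
```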
12:00-12:05 Break
12:05-12:25 Shuai Tao (Aalborg University, DK) DNN-guided Parameter Estimation for Speech Enhancement
Abstract:
Learning-based parameter estimation has shown high accuracy in non-stationary noise environments. In this work, we propose using multi-channel information to estimate the multi-channel speech presence probability (MC-SPP) with deep neural networks (DNNs), which helps improve multi-channel speech enhancement performance. First, with the observed signal and the MC-SPP as training data pairs, a low-parameter DNN model is trained to estimate the MC-SPP. Based on the MC-SPP estimate, the noise power spectral density (PSD) and clean speech PSD matrices are updated recursively. From the clean speech PSD matrix, the steering vector is computed using the covariance subtraction method. Subsequently, the minimum variance distortionless response (MVDR) weights are computed from the clean speech and noise PSD matrices. To further improve multi-channel speech enhancement performance, a new MVDR modification guided by the MC-SPP estimate is proposed. Finally, spatial filtering is performed with the MVDR beamformer. For the experiments, we spatially synthesize a real speech dataset in isotropic noise fields for training and testing. PESQ, STOI, and DNSMOS scores are used to evaluate speech quality. The experimental results show that, compared with a recently proposed DNN-guided approach, our method provides an effective statistics estimation approach that further improves multi-channel speech enhancement performance.
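For orientation, the sketch below strings together the classical steps the abstract refers to: recursive PSD updates driven by a DNN-estimated speech presence probability, covariance subtraction for the steering vector, and MVDR weights. It is a simplified single-pass illustration with assumed smoothing constants, not the presented method or its proposed MVDR modification.

```python
import numpy as np

def spp_guided_mvdr(Y, spp, alpha=0.9):
    """Y: (T, F, M) multichannel STFT; spp: (T, F) DNN-estimated speech presence probability."""
    T, F, M = Y.shape
    Phi_n = np.tile(np.eye(M, dtype=complex) * 1e-3, (F, 1, 1))   # noise PSD matrices
    Phi_y = np.tile(np.eye(M, dtype=complex) * 1e-3, (F, 1, 1))   # noisy-speech PSD matrices
    out = np.zeros((T, F), dtype=complex)
    for t in range(T):
        for f in range(F):
            y = Y[t, f, :, None]
            yyH = y @ y.conj().T
            a_n = alpha + (1.0 - alpha) * spp[t, f]    # freeze the noise update when speech is present
            Phi_n[f] = a_n * Phi_n[f] + (1.0 - a_n) * yyH
            Phi_y[f] = alpha * Phi_y[f] + (1.0 - alpha) * yyH
            Phi_s = Phi_y[f] - Phi_n[f]                # covariance subtraction -> speech PSD matrix
            d = Phi_s[:, 0] / (Phi_s[0, 0] + 1e-12)    # steering vector relative to microphone 0
            h = np.linalg.solve(Phi_n[f] + 1e-9 * np.eye(M), d)
            w = h / (d.conj() @ h + 1e-12)             # MVDR weights
            out[t, f] = w.conj() @ Y[t, f]
    return out
```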
12:25-12:45 Haixin Zhao (ASPIRE, UGent) Bitrate-Informed Coded Speech Enhancement Model
Abstract:
Enhancing speech processed by lossy codecs can significantly improve the resultant signal quality, providing a richer listening experience while reducing listening fatigue. Since codecs generally support several bitrates, deep-learning-based solutions typically train networks in a codec-specific manner or use multi-condition training for each codec-specific network. To utilise the available utterance-level bitrate information, we propose a bitrate-informed model. The experimental study shows that using bitrate-informed layers improves inter-bitrate generalisation capability. More importantly, this only causes a small increase (<1%) in model footprint and no increase in the computational cost.
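One simple way such a bitrate-informed layer could be realised is a FiLM-style modulation of intermediate features by an embedding of the utterance-level bitrate, which adds very few parameters. The sketch below is an illustrative assumption, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BitrateInformedLayer(nn.Module):
    def __init__(self, feat_dim, num_bitrates):
        super().__init__()
        self.embed = nn.Embedding(num_bitrates, 2 * feat_dim)  # produces a scale and a shift

    def forward(self, feats, bitrate_idx):
        # feats: (B, T, feat_dim); bitrate_idx: (B,) index of the codec bitrate for each utterance
        scale, shift = self.embed(bitrate_idx).chunk(2, dim=-1)
        return feats * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```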
12:45-13:00 Stijn Kindt (ASPIRE, UGent) Leveraging Distributed Microphones for Enhanced Speech Separation
Abstract:
In many modern meeting rooms, attendees bring their laptops and phones, all equipped with at least one microphone. Typically, hybrid meetings rely on the microphones of one of these devices, or on an extra dedicated device present in the room. However, why not use all the microphones already present in the room, positioned near the different speakers? The answer lies in the limitations of traditional speech enhancement algorithms, which make strong assumptions about the number and positions of the microphones. This talk will explore methods to overcome these challenges and show how DNNs can be designed to exploit the distributed microphones.
More specifically, the microphones are clustered around the sources of interest and fed into an array-agnostic speech extraction DNN. Cross-cluster information is then shared to suppress interfering noise sources.
Such a system would eliminate the need for a dedicated device in hybrid meeting settings, while improving the experience of virtual attendees with clearer, more intelligible speech signals!
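The sketch below gives a high-level picture of the clustered set-up: each cluster of microphones is processed by a shared, array-agnostic extraction network, and a compact summary is exchanged between clusters so that interfering speakers can be suppressed. The interface of the extraction network and all names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CrossClusterExtractor(nn.Module):
    def __init__(self, extractor, feat_dim, summary_dim):
        super().__init__()
        # array-agnostic extraction DNN shared by all clusters,
        # assumed to accept (cluster features, cross-cluster summary)
        self.extractor = extractor
        self.summarize = nn.Linear(feat_dim, summary_dim)

    def forward(self, cluster_feats):
        # cluster_feats: list of (B, mics_c, T, feat_dim) tensors, one per cluster;
        # the number of microphones mics_c may differ between clusters.
        summaries = [self.summarize(x.mean(dim=1)) for x in cluster_feats]   # (B, T, summary_dim) each
        outputs = []
        for c, x in enumerate(cluster_feats):
            # Cross-cluster information: the averaged summaries of all *other* clusters.
            others = torch.stack([s for i, s in enumerate(summaries) if i != c]).mean(dim=0)
            outputs.append(self.extractor(x, others))  # helps each cluster reject interfering speakers
        return outputs                                 # one extracted target signal per cluster
```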
13:00-13:05 Nilesh Madhu Closing