Dynamic Slimmable Speech Enhancement Network with Metric Guided Training



This page presents audio samples and additional, more detailed figures for the results reported in our ICASSP 2026 submission, 'Dynamic Slimmable Speech Enhancement Network with Metric Guided Training'.

Note: Study on the contributions of the individual dynamic modules:

The main contribution of each individual dynamic module lies in reducing computational complexity rather than improving performance. When all dynamic modules are applied, the dynamic model achieves performance comparable to the static FTF-Net while requiring only 69% of the multiply-accumulate operations (MACs). Ablation studies on individual modules are therefore unnecessary: the performance of any partial configuration is expected to fall within the negligible gap between the DSN and the static FTF-Net, so such studies would not provide additional insight. A more appropriate way to quantify the contribution of each dynamic module is its reduction in MACs, which is already reported in Table 1 in Section 2.1.

Audio samples and more detailed figures for the results in the paper: Given the space constraints of a conference paper, this page provides audio examples together with more detailed figures for the same results reported in the paper, to better illustrate the effectiveness of the proposed dynamic slimmable network and to support the discussions in the submission.
Section 1, Audio samples, presents audio samples with their corresponding average activation ratios and MACs/s.
Section 2, More figures with more details about the same results in the paper, provides additional scatter plots that were omitted from the main submission due to space constraints. These plots further confirm that the proposed MGT dynamic model achieves performance comparable to the static FTF-Net and better metric scores than the standard dynamic model and the zero-activation baseline, not only in terms of average metrics but also consistently across the majority of individual samples.




1. Audio samples of the proposed MGT-DSN model, labelled with the average activation ratio and MACs/s of each sample

Please note that the unit used for reporting the average computational complexity per sample is MACs/s. The unit "MACs" used in the current submission is a typo and should be "MACs/s" (Multiply-Accumulate operations per second). We will correct this in the camera-ready version.
Average Activation Ratio & MACs/s for MGT Dynamic Model | Noisy Speech | Enhanced Speech by MGT Dynamic Model | Enhanced Speech by Static FTF-Net (339 M MACs/s) | Clean Target
0.62 (240 M MACs/s)
0.61 (239 M MACs/s)
0.40 (205 M MACs/s)
0.36 (199 M MACs/s)
0.57 (232 M MACs/s)
0.59 (235 M MACs/s)
0.58 (234 M MACs/s)
0.46 (215 M MACs/s)
0.48 (218 M MACs/s)
0.42 (208 M MACs/s)
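
The per-sample figures listed above can be related to the cost of the static FTF-Net (339 M MACs/s) with simple arithmetic. The short Python sketch below does only that. The values are copied from the list above, and the variable names are illustrative assumptions rather than anything taken from the authors' code.

```python
# (average activation ratio, cost in M MACs/s) pairs copied from the list above.
samples = [
    (0.62, 240), (0.61, 239), (0.40, 205), (0.36, 199), (0.57, 232),
    (0.59, 235), (0.58, 234), (0.46, 215), (0.48, 218), (0.42, 208),
]
# Cost of the static FTF-Net in M MACs/s, as stated in the column header.
STATIC_FTF_NET_MMACS = 339

# Fraction of the static model's cost used by the dynamic model on each sample.
relative_cost = [macs / STATIC_FTF_NET_MMACS for _, macs in samples]

# Averages over these ten listed samples only (not over the full test set).
mean_ratio = sum(ratio for ratio, _ in samples) / len(samples)
mean_macs = sum(macs for _, macs in samples) / len(samples)

for (ratio, macs), rel in zip(samples, relative_cost):
    print(f"R={ratio:.2f}  {macs} M MACs/s  {rel:.1%} of static FTF-Net")
print(f"mean R={mean_ratio:.3f}, mean cost={mean_macs:.1f} M MACs/s "
      f"({mean_macs / STATIC_FTF_NET_MMACS:.1%} of static FTF-Net)")
```

These ten samples are only a small illustration, so their mean need not match the dataset-level percentage reported in the paper.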




2. More figures with more details about the same results in the paper

2.1 Overall Evaluation Across All Data (distinct from the evaluation across SNR in Fig. 6)

To further evaluate the proposed metric-guided training (MGT) and the DSN model, overall evaluation results across all data (distinct from the evaluation across SNR in Fig. 6) are presented for various metrics in Fig A.1 below. The results indicate that the proposed MGT-DSN model achieves performance comparable to the static lightweight SOTA, FTF-Net, with only 76% of the computational complexity.

Notably, the effectiveness of MGT is further highlighted by its improvement over three standard dynamic models with equal or greater computational load.
Overall Evaluation
Fig A.1. Scatter plot of ESTOI scores for the proposed dynamic models benchmarked against two static baselines. (R is the average activation ratio.)

2.2 Scatter plots of the proposed dynamic models benchmarked against two static baselines (scatter-plot extensions of Fig. 6 in the submission)

To further illustrate that the improvements achieved by the proposed MGT dynamic slimmable model are not limited to average metrics but are broadly consistent across most individual test samples, the scatter plots in the following figures (Fig A.2 - A.5) extend the analysis of Fig. 6 in the submission. These plots compare the instrumental metric scores of the two proposed dynamic models and the static FTF-Net baseline, each against the zero-activation baseline. The x-axis shows the score of the zero-activation baseline, and the y-axis shows the scores of the three evaluated methods. Points lying above the y=x line indicate samples on which the evaluated method improves over the zero-activation baseline.
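
One way to summarize such a scatter plot numerically is the share of points that lie above the y=x line. The Python sketch below illustrates this under stated assumptions: the function name `fraction_above_diagonal` is hypothetical and the scores are made-up placeholders, not results from the paper.

```python
def fraction_above_diagonal(baseline_scores, method_scores):
    """Return the share of samples whose method score exceeds the baseline score."""
    if len(baseline_scores) != len(method_scores):
        raise ValueError("score lists must be paired sample by sample")
    if not baseline_scores:
        raise ValueError("at least one sample is required")
    # A point (x, y) is above the y=x line exactly when y > x.
    wins = sum(1 for x, y in zip(baseline_scores, method_scores) if y > x)
    return wins / len(baseline_scores)


# Hypothetical ESTOI-like scores for five samples (NOT data from the paper).
zero_activation = [0.50, 0.62, 0.71, 0.80, 0.90]  # x-axis: zero-activation baseline
mgt_dynamic = [0.58, 0.70, 0.75, 0.80, 0.89]      # y-axis: evaluated method

share = fraction_above_diagonal(zero_activation, mgt_dynamic)
print(f"{share:.0%} of samples lie above the y=x line")
```

Ties are counted as not above the line here; whether to count them as improvements is a design choice that depends on the metric's resolution.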

The observed trends in these scatter plots are consistent with those presented in Fig. 6. Notably, the proposed MGT dynamic model achieves performance comparable to the static FTF-Net baseline across varying SNR conditions while using only 69% of the multiply-accumulate operations (MACs). At lower SNRs, most test samples enhanced by the proposed MGT dynamic model outperform the zero-activation baseline, as shown by the clustering of points above the diagonal. The MGT dynamic model also improves on the standard dynamic model at the same activation ratio and computational cost. At higher SNRs, the improvement trend becomes less prominent. This is expected, as the MGT dynamic model allocates fewer network resources to easier (i.e., higher-quality) samples, a behavior discussed in Section 3.B and illustrated in Fig. 7 of the submission.

Overall, as further highlighted by the scatter plots, the proposed MGT dynamic model consistently achieves results comparable to the static FTF-Net with much lower computational complexity. In line with the discussions in the submission, these scatter plots also confirm that MGT facilitates appropriate allocation of network resources according to the severity of input signal distortion, thereby contributing to improved enhancement performance.
ESTOI Scatter
Fig A.2. Scatter plot of ESTOI scores for the proposed dynamic models and the static FTF-Net baseline, each benchmarked against the zero-activation baseline. (R is the average activation ratio.)
DNSOVL Scatter
Fig A.3. Scatter plot of DNS-MOS OVRL scores for the proposed dynamic models and the static FTF-Net baseline, each benchmarked against the zero-activation baseline. (R is the average activation ratio.)
SI-SDR Scatter
Fig A.4. Scatter plot of SI-SDR scores for the proposed dynamic models and the static FTF-Net baseline, each benchmarked against the zero-activation baseline. (R is the average activation ratio.)
PESQ Scatter
Fig A.5. Scatter plot of PESQ scores for the proposed dynamic models and the static FTF-Net baseline, each benchmarked against the zero-activation baseline. (R is the average activation ratio.)