Goal

Nowadays, many smart devices, such as smartphones, smartwatches, laptops, and even smart glasses, are present in environments such as meeting rooms. Add to this the trend towards smart homes and the Internet of Things, and almost every device includes one or often multiple microphones. We want to share information between these microphones to perform optimal processing, exploiting the extensive spatial coverage of the room that such distributed microphones provide. However, the microphone positions are unknown, as are the relations and correlations between the microphones.

Our solution is to cluster the microphones according to their underlying dominant sources. This way, we know which microphones pick up similar content and which microphones can help suppress interference and noise. We then use these clusters for subsequent speaker separation and denoising of two concurrent talkers.
WASN generated image

Clustering

We showed that neural speaker-verification networks generate embeddings that work well as speaker-specific features to cluster on. We first demonstrated this in rooms simulated with shoe-box acoustics and later confirmed the same for more realistic, CATT-model-generated room impulse responses. There we also increased the difficulty of the simulated scenario by bringing the sources closer together and reducing the time frame available for clustering, where the speaker-verification features showed impressive robustness!
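To make the idea concrete, here is a minimal sketch of embedding-based clustering. It assumes one speaker-verification embedding per microphone has already been extracted (by any d-vector/x-vector-style network; the extractor itself is not shown) and groups microphones whose embeddings have high cosine similarity. The function name and threshold are illustrative, not the exact method used in our papers.

```python
import numpy as np

def cluster_by_embedding(embeddings, threshold=0.8):
    """Group microphones whose speaker embeddings are similar.

    embeddings: (n_mics, dim) array, one speaker-verification embedding
    per microphone. Returns one cluster label per microphone.
    """
    # Normalise rows so the dot product equals cosine similarity.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T

    n = len(e)
    labels = [-1] * n  # -1 means "not yet assigned"
    cur = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        # Start a new cluster and absorb every microphone reachable
        # through pairwise similarity above the threshold.
        stack = [i]
        labels[i] = cur
        while stack:
            j = stack.pop()
            for k in range(n):
                if labels[k] < 0 and sim[j, k] >= threshold:
                    labels[k] = cur
                    stack.append(k)
        cur += 1
    return labels
```

In practice the number of dominant sources is unknown, which is why a threshold on similarity (rather than a fixed number of clusters) is a natural fit here.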

Further, we compared magnitude-squared-coherence-based clustering with the speaker-embedding-based clustering. We found that both methods perform comparably, at least for speakers that are decently separated. Additionally, to respect potential bandwidth limitations, we encoded the signals with the LC3 codec (used in, for example, Bluetooth Low Energy) before calculating the coherence, and found that the clustering did not change drastically. Although sending all encoded signals still requires more bandwidth than sending the embeddings, having the signals at the central processing unit is necessary for much of the subsequent processing, such as speech enhancement, making the transmission worthwhile.
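The coherence-based alternative can be sketched as follows: compute the pairwise mean magnitude-squared coherence between microphone signals and use it as a clustering affinity. This sketch uses SciPy's Welch-based coherence estimate; the LC3 encoding step and the clustering itself are omitted, and the function name and parameters are illustrative.

```python
import numpy as np
from scipy.signal import coherence

def msc_affinity(signals, fs=16000, nperseg=1024):
    """Pairwise mean magnitude-squared coherence between microphones.

    signals: (n_mics, n_samples) array of time-domain signals.
    Returns an (n_mics, n_mics) affinity matrix; high values suggest
    two microphones share the same dominant source.
    """
    n = len(signals)
    aff = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            # Welch estimate of the magnitude-squared coherence,
            # averaged over frequency to get a single affinity value.
            _, cxy = coherence(signals[i], signals[j], fs=fs, nperseg=nperseg)
            aff[i, j] = aff[j, i] = cxy.mean()
    return aff
```

Any affinity-based clustering (such as the threshold-and-group approach used for the embeddings) can then be applied to this matrix.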

Speech Separation

We have shown that these improved clusters are also beneficial for speaker separation. With the clusters, we know which microphones have similar content and which microphones can help suppress interference and noise. The separation scheme first selects a reference microphone for each cluster. By sharing these reference signals between clusters, the target and interfering source signals can be compared and the unwanted component suppressed with a mask-based approach. Applying that mask to each in-cluster microphone provides high-SIR signals, ideal for estimating the relative delays (due both to time differences of arrival and to clock offsets). Compensating for these delays with a delay-and-sum beamformer delivers a result with fewer artifacts than the mask-based enhancement alone, while still suppressing the interferer. Finally, a postfilter on the delay-and-sum-beamformed signals, again comparing the results between clusters, gives the best results. Results of the clustering and the separation are available here for rooms simulated with shoe-box acoustics and for realistic CATT-model-generated room impulse responses.
WASN generated image
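The delay-compensation step above can be sketched in a few lines: estimate each microphone's delay relative to the cluster reference by cross-correlation, align, and average. This is a simplified integer-sample version under ideal conditions; the mask estimation, fractional-delay handling, and postfilter from the full scheme are omitted, and the function names are illustrative.

```python
import numpy as np

def estimate_delay(ref, sig):
    """Integer-sample delay of `sig` relative to `ref` via cross-correlation.

    Covers both time-difference-of-arrival and clock-offset delays,
    assuming the inputs are already high-SIR (e.g. mask-enhanced).
    """
    corr = np.correlate(sig, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

def delay_and_sum(signals, ref_idx=0):
    """Align each in-cluster microphone to the reference and average.

    signals: list of equal-length 1-D arrays from one cluster.
    """
    ref = signals[ref_idx]
    out = np.zeros(len(ref))
    for sig in signals:
        d = estimate_delay(ref, sig)
        # Crude circular alignment; a real system would use
        # fractional-delay filtering instead.
        out += np.roll(sig, -d)
    return out / len(signals)
```

Averaging the aligned signals is what gives the beamformer its low-artifact output: uncorrelated noise and residual interference partially cancel, while the aligned target adds up coherently.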