ISBN: (print) 9798350361865; 9798350361858
Deep neural networks are now a common choice for multichannel speech processing, as they may outperform the traditional concatenation of a linear beamformer and a post-filter in challenging scenarios. To obtain strong spatial selectivity, these approaches are typically trained for a specific microphone array configuration. However, it was recently shown that such models are sensitive even to small perturbations in the microphone placements. In this paper, we propose a method for handling variable array configurations based on model-agnostic meta-learning. We demonstrate that the proposed approach increases robustness to changes in the array configuration, i.e., mismatched conditions, while maintaining the same performance as the array-specific model in matched conditions.
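As a rough illustration of the meta-learning idea named in this abstract, the sketch below runs a first-order variant of model-agnostic meta-learning on a toy linear regression problem. It is not the authors' implementation: the abstract does not describe the network, the loss, or whether second-order gradients are used. Here each "task" is assumed to stand in for one array configuration, the linear model stands in for the enhancement network, and all names (`mse_and_grad`, `inner_adapt`, `fomaml_step`, `make_task`) and step sizes are hypothetical.

```python
import numpy as np


def mse_and_grad(w, X, y):
    """Mean squared error of a linear model and its gradient w.r.t. w."""
    err = X @ w - y
    loss = float(np.mean(err ** 2))
    grad = 2.0 * (X.T @ err) / len(y)
    return loss, grad


def inner_adapt(w, X_support, y_support, alpha=0.05):
    """One inner-loop gradient step on a task's support set."""
    _, grad = mse_and_grad(w, X_support, y_support)
    return w - alpha * grad


def fomaml_step(w, tasks, alpha=0.05, beta=0.05):
    """One outer-loop update using the first-order MAML approximation.

    The query-set gradient is evaluated at the adapted parameters and
    applied directly to the shared initialization (second-order terms
    are ignored, which is a simplification of full MAML).
    """
    meta_grad = np.zeros_like(w)
    for X_s, y_s, X_q, y_q in tasks:
        w_task = inner_adapt(w, X_s, y_s, alpha)
        _, grad_q = mse_and_grad(w_task, X_q, y_q)
        meta_grad += grad_q
    return w - beta * meta_grad / len(tasks)


def make_task(rng, n=20, dim=3):
    """Hypothetical task: a random linear map, standing in for one array."""
    w_true = rng.standard_normal(dim)
    X_s = rng.standard_normal((n, dim))
    X_q = rng.standard_normal((n, dim))
    return X_s, X_s @ w_true, X_q, X_q @ w_true


rng = np.random.default_rng(0)
w_meta = np.zeros(3)
for _ in range(100):
    batch = [make_task(rng) for _ in range(4)]
    w_meta = fomaml_step(w_meta, batch)
```

The point of the sketch is the two-level structure: the inner step adapts to one configuration, and the outer step moves the shared initialization so that such adaptation works well across configurations.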
Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to specific microphone arrays, necessitating a different model for varying channel permutations, numbers, or geometries. To improve the robustness of the ASA module against such variations, in this paper, we investigate three approaches: training with random channel configurations, employing the transform-average-concatenate method to process multi-channel input features, and utilizing robust input features. Our experiments on the CHiME-3 and DEMAND datasets show that these approaches enable the ASA-augmented beamformer to track moving speakers across different microphone arrays unseen in training.
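To make the transform-average-concatenate idea concrete, the following is a minimal sketch of one such layer in NumPy. It follows the general pattern the name suggests (a shared per-channel transform, an average over channels, then concatenation of the pooled summary back onto each channel) and is not the ASA module from the paper. The function name `tac_layer`, the `tanh` nonlinearity, the layer sizes, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def tac_layer(x, w_transform, w_average, w_concat):
    """Apply one transform-average-concatenate step.

    x has shape (channels, features); the output has shape
    (channels, out_features). The same weights are used for any
    number of channels.
    """
    # Transform: shared weights applied to every channel independently.
    h = np.tanh(x @ w_transform)                      # (C, H)
    # Average: pool over the channel axis, then transform the summary.
    pooled = np.tanh(h.mean(axis=0) @ w_average)      # (H2,)
    # Concatenate: append the pooled summary to each channel's features.
    pooled_tiled = np.tile(pooled, (h.shape[0], 1))   # (C, H2)
    merged = np.concatenate([h, pooled_tiled], axis=1)
    return np.tanh(merged @ w_concat)                 # (C, out_features)


rng = np.random.default_rng(0)
n_feat, n_hid, n_out = 8, 16, 8
w_t = rng.standard_normal((n_feat, n_hid)) * 0.1
w_a = rng.standard_normal((n_hid, n_hid)) * 0.1
w_c = rng.standard_normal((2 * n_hid, n_out)) * 0.1

x4 = rng.standard_normal((4, n_feat))   # hypothetical 4-microphone input
x6 = rng.standard_normal((6, n_feat))   # hypothetical 6-microphone input
y4 = tac_layer(x4, w_t, w_a, w_c)
y6 = tac_layer(x6, w_t, w_a, w_c)

# Reordering the channels only reorders the output rows.
perm = [2, 0, 3, 1]
y4_perm = tac_layer(x4[perm], w_t, w_a, w_c)
```

Because the only cross-channel operation is a mean, the same weights accept any channel count, and permuting the input channels permutes the output rows in the same way. These are the two properties relevant to the channel-number and channel-permutation variations mentioned in the abstract.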