Author affiliations: School of Computer Science, Wuhan University, Wuhan, China; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Duke Kunshan University, Kunshan, China; Data & AI Engineering System, OPPO, Beijing, China
Publication: arXiv
Year: 2022
Abstract: Common target speech separation methods directly estimate the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model to simultaneously extract each speaker's voice from the mixed speech, rather than just optimally estimating the target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS), which consists of an SD module and an MTSS module. By exploiting the TSVAD decision and the estimated mask, our SD-MTSS model can extract the speech signal of each speaker concurrently in a conversational recording without additional enrollment audio in advance. Experimental results show that our MTSS model achieves 1.38 dB SDR, 1.34 dB SI-SDR, and 0.13 PESQ improvements over the baseline on the WSJ0-2mix-extr dataset, respectively. The SD-MTSS system achieves a 19.2% relative speaker-dependent character error rate (CER) reduction on the Alimeeting dataset. Copyright © 2022, The Authors. All rights reserved.
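The abstract's core idea, extracting every speaker simultaneously via per-speaker masks applied to a shared mixture, can be sketched as below. This is a minimal illustration with hypothetical shapes and random stand-in masks, not the paper's actual MTSS network; the function name `separate_all_speakers` is an assumption for illustration.

```python
import numpy as np

def separate_all_speakers(mixture, masks):
    """Apply one mask per speaker to a shared mixture spectrogram.

    mixture: [T, F] magnitude spectrogram of the mixed speech.
    masks:   [num_speakers, T, F], each entry in [0, 1].
    Returns: [num_speakers, T, F], one estimate per speaker,
    so all sources are extracted concurrently rather than one target.
    """
    return masks * mixture[None, :, :]  # broadcast over the speaker axis

# Toy example: 2 speakers, 100 frames, 257 frequency bins.
rng = np.random.default_rng(0)
mixture = rng.random((100, 257))
raw = rng.random((2, 100, 257))
masks = raw / raw.sum(axis=0, keepdims=True)  # masks sum to 1 per bin

estimates = separate_all_speakers(mixture, masks)
# When the masks sum to one, the speaker estimates sum back to the mixture.
assert estimates.shape == (2, 100, 257)
assert np.allclose(estimates.sum(axis=0), mixture)
```

In the paper's SD-MTSS setting, the diarization (TSVAD) decision would determine which speakers are active per frame; here the masks are simply random placeholders.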