Author affiliations: Key Laboratory of Intelligent Computing in Medical Image, Northeastern University, Shenyang 110819, China; Shenzhen, Guangdong 518000, China; Machine Listening Lab, University of Bremen, Bremen 28359, Germany; Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119077, Singapore
Publication: Journal of Shanghai Jiaotong University (Science) (J. Shanghai Jiaotong Univ. Sci.)
Year, volume, issue: 2025
Pages: 1-6
Funding: Foundation item: the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (University Allowance, EXC 2077, University of Bremen), the National Natural Science Foundation of China (Nos. 62401377 and 62271432), and the Internal Project of Shenzhen Research Institute of Big Data (No. T00120220002)
Subject: Speech recognition
Abstract: Target speaker extraction (TSE) models are expected to extract the target speech from a cocktail-party mixture signal. When trained only on samples in which the target speaker is present (PT), these models output noise when the target speaker is absent (AT). TSE quality can be improved by informing the model whether the condition is PT or AT; however, detection of the target speaker is not perfect. In this paper, we propose a new model, TSEV, which performs target speaker extraction and speaker verification simultaneously. In each inference, the TSEV model outputs the extracted speech and generates two speaker embeddings that are used to detect the target speaker. Because the speaker encoder and low-level modules are shared between the two tasks, speaker verification can be performed in low signal-to-noise ratio scenarios. We train the TSEV model on multi-talker PT and AT conditions with fully overlapped speech. Experiments verify the advantage of performing the two tasks jointly in the proposed model: compared with the baseline, the TSEV model achieves better verification performance without degrading extraction performance. © Shanghai Jiao Tong University 2025.
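The abstract says two speaker embeddings are generated per inference to detect the target speaker, but it does not say which signals they come from or how they are compared. As a minimal sketch only, the following assumes the two embeddings are scored with cosine similarity against a fixed threshold, which is a common speaker verification back end. The function names, the threshold value, and the scoring rule are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no speaker information; treat as no match.
        return 0.0
    return dot / (norm_a * norm_b)


def target_is_present(emb_a, emb_b, threshold=0.5):
    """Hypothetical PT/AT decision: present if the embeddings are similar enough.

    The threshold is an illustrative assumption; in practice it would be
    tuned on held-out data.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold
```

Under this assumption, a score above the threshold would be read as the PT condition and a score below it as the AT condition, in which case the extracted output could be discarded rather than passed downstream.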