
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Authors: Wei, Kun; Li, Bei; Lv, Hang; Lu, Quan; Jiang, Ning; Xie, Lei

Author affiliations: The Audio, Speech and Language Processing Group, School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China; School of Computer Science and Engineering, Northeastern University, Shenyang 110167, China; Mashang Consumer Finance Co., Ltd., Chongqing 401121, China

Publication: arXiv

Year: 2023


Subject: Signal encoding

Abstract: Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system that extends the Conformer encoder-decoder model with a cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversation-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model. Copyright © 2023, The Authors. All rights reserved.
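The abstract's core mechanism, a cross-modal extractor driven by a modal-level mask input, can be sketched briefly. Below is a minimal PyTorch sketch, not the authors' released code: the class name CrossModalExtractor, the dimensions, and the masking probability are illustrative assumptions, and random tensors stand in for features from the pre-trained speech and text models. It shows the modal-level masking idea, in which an entire modality's history embedding is zeroed during training so the fused representation remains usable when decoded transcripts (a source of error propagation) are missing or unreliable. The conditional latent variational modules mentioned in the abstract are omitted for brevity.

import torch
import torch.nn as nn

class CrossModalExtractor(nn.Module):
    """Fuse speech and text history embeddings with a shared encoder,
    randomly zeroing an entire modality during training (modal-level mask)."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, p_mask=0.3):
        super().__init__()
        self.p_mask = p_mask  # probability of dropping a whole modality
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, speech_ctx, text_ctx):
        # speech_ctx: (batch, T_s, d_model) history from a pre-trained speech model
        # text_ctx:   (batch, T_t, d_model) history from a pre-trained text model
        if self.training:
            # Modal-level mask: independently zero each modality with
            # probability p_mask, so the encoder learns context it can
            # recover from either stream alone.
            keep_s = torch.rand(speech_ctx.size(0), 1, 1) >= self.p_mask
            keep_t = torch.rand(text_ctx.size(0), 1, 1) >= self.p_mask
            speech_ctx = speech_ctx * keep_s.to(speech_ctx)
            text_ctx = text_ctx * keep_t.to(text_ctx)
        fused = torch.cat([speech_ctx, text_ctx], dim=1)  # concat along time
        return self.encoder(fused)  # cross-modal contextual representation

# Illustrative usage with random tensors in place of real features.
extractor = CrossModalExtractor()
speech = torch.randn(2, 50, 256)   # e.g. pooled speech-model features
text = torch.randn(2, 20, 256)     # e.g. text-model token embeddings
context = extractor(speech, text)  # (2, 70, 256), consumed by the ASR decoder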
