Author affiliations: Graduate School of Informatics, Kyoto University, Kyoto, Japan; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Publication: IEEE Transactions on Audio, Speech and Language Processing
Year/Volume: 2025, Vol. 33
Pages: 472-482
Funding: JSPS KAKENHI
Keywords: Training; Adaptation models; Translation; Neural machine translation; Data models; Speech processing
Abstract: Pretrained models have taken full advantage of monolingual corpora and achieved impressive results in training Unsupervised Neural Machine Translation (UNMT) models. However, when adapting UNMT models with in-domain monolingual corpora for domain-specific translation tasks, one of the languages may lack in-domain corpora, resulting in an unequal amount and proportion of in-domain monolingual data between the two languages. This situation is known as Domain Mismatch (DM). This study investigates the impact of DM in UNMT. We find that DM causes a translation quality disparity: while in-domain monolingual corpora in one language can enhance in-domain translation quality into that language, the enhancement does not generalize to the other language, whose translation quality remains deficient. To address this problem, we propose Domain-Aware Adaptation (DAA), which can be embedded in the vanilla UNMT training process. By passing sentence-level domain information to the model during training and inference, DAA assigns higher weight to in-domain data drawn from open-domain corpora to alleviate the mismatch. Experimental results on German-English and Romanian-English translation tasks in the IT, Koran, medical, and TED2020 domains demonstrate that DAA efficiently exploits open-domain corpora to mitigate the translation quality disparity caused by DM.
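The abstract's core mechanism, tagging each sentence with domain information and up-weighting in-domain examples from open-domain corpora, can be illustrated with a minimal sketch. The tag format, the relevance score, and the weighting rule below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
# Sketch of sentence-level domain tagging plus loss weighting, as
# described at a high level in the abstract. The "<domain>" token
# format and the linear weight rule are assumptions made for
# illustration; the relevance score stands in for a hypothetical
# in-domain scorer over the open-domain corpus.

def tag_and_weight(sentence, domain, relevance, base_weight=1.0):
    """Prepend a domain token and scale the example's training weight.

    relevance: score in [0, 1]; true in-domain sentences (1.0) keep
    full weight, open-domain sentences are down-weighted in proportion
    to their estimated relevance to the target domain.
    """
    tagged = f"<{domain}> {sentence}"
    weight = base_weight * relevance
    return tagged, weight

# Usage: an open-domain sentence judged partially relevant to "medical".
tagged, w = tag_and_weight("The dosage was adjusted.", "medical", 0.8)
```

At inference time the same domain token would be supplied to the model so that generation is conditioned on the target domain, mirroring the training-time signal.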