
Advancing Emotional Voice Conversion: Transforming Fundamental Frequency and Mel-Cepstral Coefficients Using Cycle Consistent Adversarial Networks with Two-Step Adversarial Loss and Patch-Based Discriminators

Authors: Larraín, Pablo Díaz; Patricio, Miguel A.; Berlanga, Antonio; Molina, José M.

Affiliations: Engineering Team, Grupo MasMovil, Madrid, Spain; Applied Artificial Intelligence Group, Computer Science and Engineering Department, Universidad Carlos III de Madrid, Spain

Published in: Human-centric Computing and Information Sciences (Hum.-centric Comput. Inf. Sci.)

Year/Volume: 2025, Vol. 15

Pages: 1-18


Funding: This study was funded by the Spanish company Grupo MasMovil; the public research projects of the Spanish Ministry of Science and Innovation, PID2020-118249RB-C22 and PDC2021-121567-C22-AEI/10.13039/501100011033; and the project under the call PEICTI 2021–2023 with the identifier TED2021-131520B-C22.

Keywords: Discriminators

Abstract: The aim of emotional voice conversion (EVC) is to alter the emotional content of spoken utterances without compromising the speaker's identity or linguistic content. Many EVC frameworks rely on scarce parallel data recorded by actors. This paper proposes a novel framework for EVC that leverages non-parallel data through cycle-consistent adversarial networks (CycleGANs). CycleGANs learn to transform input data between domains using a cycle loss that regularizes training by ensuring the reconstructed inputs match the original inputs in both domains. Despite their use in various voice conversion tasks, CycleGANs often produce audio with degraded quality, largely due to the oversmoothing of speech features. To address these issues, we devised two distinct CycleGAN-based methods within the aforementioned framework: the first method incorporates a two-step adversarial loss, while the second enhances this by incorporating patch-based (PatchGAN) discriminators. Prior research has demonstrated that these techniques alleviate the oversmoothing of the spectrum and have shown superior capability in capturing dynamic spectral variations. In this work, we incorporate these enhancements to transform not only the spectrum but also the fundamental frequency (F0), a speech feature that is strongly related to intonation and the expression of emotion. The objective evaluation of the proposed methods shows improvements over the baseline in both Mel-cepstrum distortion and root-mean-square error, as well as in the Pearson correlation coefficient of the F0 transformation. Furthermore, subjective evaluations using the mean opinion score (MOS) and similarity MOS indicate that our model outperforms the baseline model in terms of naturalness and similarity to the target emotion. © (2025), (Korea Information Processing Society). All rights reserved.
