
Advancing Emotional Voice Conversion: Transforming Fundamental Frequency and Mel-Cepstral Coefficients Using Cycle Consistent Adversarial Networks with Two-Step Adversarial Loss and Patch-Based Discriminators

Authors: Larraín, Pablo Díaz; Patricio, Miguel A.; Berlanga, Antonio; Molina, José M.

Affiliations: Engineering Team, Grupo MasMovil, Madrid, Spain; Applied Artificial Intelligence Group, Computer Science and Engineering Department, Universidad Carlos III de Madrid, Spain

Published in: Human-centric Computing and Information Sciences (Hum.-centric Comput. Inf. Sci.)

Year/Volume: 2025, Vol. 15

Pages: 1-18


Funding: This study was funded by the Spanish company Grupo MasMovil; the public research projects of the Spanish Ministry of Science and Innovation, PID2020-118249RB-C22 and PDC2021-121567-C22-AEI/10.13039/501100011033; and the project under the call PEICTI 2021–2023 with the identifier TED2021-131520B-C22.

Keywords: Discriminators

Abstract: The aim of emotional voice conversion (EVC) is to alter the emotional content of spoken utterances without compromising the speaker's identity or linguistic content. Many EVC frameworks rely on scarce parallel data recorded by actors. This paper proposes a novel framework for EVC that leverages non-parallel data through cycle-consistent adversarial networks (CycleGANs). CycleGANs learn to transform input data between domains using a cycle loss that regularizes training by ensuring the reconstructed inputs match the original inputs in both domains. Despite their use in various voice conversion tasks, CycleGANs often produce audio with degraded quality, largely due to the oversmoothing of speech features. To address these issues, we devised two distinct CycleGAN-based methods within the aforementioned framework: the first method incorporates a two-step adversarial loss, while the second enhances this by incorporating patch-based (PatchGAN) discriminators. Prior research has demonstrated that these techniques alleviate the oversmoothing of the spectrum and have shown superior capability in capturing dynamic spectral variations. In this work, we incorporate these enhancements to transform not only the spectrum but also the fundamental frequency (F0), a speech feature that is strongly related to intonation and the expression of emotion. The objective evaluation of the proposed methods shows improvements over the baseline in both Mel-cepstrum distortion and root-mean-square error, as well as in the Pearson correlation coefficient of the F0 transformation. Furthermore, subjective evaluations using the mean opinion score (MOS) and similarity MOS indicate that our model outperforms the baseline model in terms of naturalness and similarity to the target emotion. © (2025), (Korea Information Processing Society). All rights reserved.
