
Refine Search Results

Document Type

  • 107 conference papers
  • 62 journal articles
  • 1 thesis

Collection Scope

  • 170 electronic documents
  • 0 print holdings

Discipline Classification

  • 161 Engineering
    • 129 Computer Science & Technology...
    • 68 Electrical Engineering
    • 23 Information & Communication Engineering
    • 23 Software Engineering
    • 9 Control Science & Engineering
    • 6 Electronic Science & Technology (...
    • 2 Materials Science & Engineering (...
    • 1 Mechanical Engineering
    • 1 Instrument Science & Technology
    • 1 Architecture
    • 1 Civil Engineering
  • 65 Science
    • 58 Physics
    • 6 Mathematics
    • 1 Chemistry
    • 1 History of Science & Technology (...
  • 30 Literature
    • 16 Foreign Languages & Literature
    • 1 Chinese Language & Literature
  • 24 Medicine
    • 23 Clinical Medicine
    • 1 Basic Medicine (...
  • 7 Management
    • 5 Management Science & Engineering (...
    • 2 Library, Information & Archives Manage...
  • 3 Law
    • 3 Sociology
  • 1 History
    • 1 Archaeology
  • 1 Art
    • 1 Design (...

Subject

  • 170 篇 text-to-speech s...
  • 11 篇 speech recogniti...
  • 10 篇 prosody
  • 9 篇 speech synthesis
  • 7 篇 voice conversion
  • 7 篇 automatic speech...
  • 6 篇 deep learning
  • 6 篇 deep neural netw...
  • 5 篇 generative adver...
  • 5 篇 unit selection
  • 4 篇 hidden markov mo...
  • 4 篇 natural language...
  • 4 篇 sequence-to-sequ...
  • 4 篇 intonation model...
  • 4 篇 data augmentatio...
  • 3 篇 speech-to-speech...
  • 3 篇 style transfer
  • 3 篇 speaking style
  • 3 篇 intelligibility
  • 3 篇 physiology

Institution

  • 4 篇 oregon hlth & sc...
  • 4 篇 univ tokyo
  • 3 篇 univ tokyo grad ...
  • 2 篇 chinese univ hon...
  • 2 篇 fraunhofer sit a...
  • 2 篇 google mountain ...
  • 2 篇 indian inst tech...
  • 2 篇 department of in...
  • 2 篇 nara inst sci & ...
  • 2 篇 usdb signal & co...
  • 2 篇 univ novi sad fa...
  • 2 篇 south china univ...
  • 2 篇 indian inst tech...
  • 2 篇 line corp
  • 2 篇 univ edinburgh c...
  • 2 篇 csir meraka inst...
  • 2 篇 univ rennes cnrs...
  • 2 篇 univ algarve dee...
  • 2 篇 univ patras dept...
  • 2 篇 idiap res inst m...

Author

  • 7 篇 saruwatari hiros...
  • 6 篇 takamichi shinno...
  • 5 篇 nakamura satoshi
  • 4 篇 lazaridis alexan...
  • 4 篇 van santen jan
  • 4 篇 sakti sakriani
  • 4 篇 secujski milan
  • 4 篇 kasparaitis piju...
  • 4 篇 rao k. sreenivas...
  • 3 篇 mporas iosif
  • 3 篇 yamagishi junich...
  • 3 篇 rojc matej
  • 3 篇 ungurean catalin
  • 3 篇 murthy hema a.
  • 3 篇 saito yuki
  • 3 篇 garner philip n.
  • 3 篇 ganchev todor
  • 3 篇 langarani mahsa ...
  • 3 篇 fakotakis nikos
  • 3 篇 kacic zdravko

Language

  • 162 English
  • 5 Other
  • 2 Turkish
  • 1 Chinese
Search query: Subject = "Text-to-speech synthesis"
170 records; showing 1-10
Leveraging Low-Rank Adaptation for Parameter-Efficient Fine-Tuning in Multi-Speaker Adaptive Text-to-Speech Synthesis
IEEE Access, 2024, Vol. 12, pp. 190711-190727
Authors: Hong, Changi; Lee, Jung Hyuk; Kim, Hong Kook (Gwangju Inst Sci & Technol, AI Grad Sch, Gwangju 61005, South Korea; Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea; AunionAI Co Ltd, Gwangju 61005, South Korea)
Text-to-speech (TTS) technology is commonly used to generate personalized voices for new speakers. Despite considerable progress in TTS technology, personal voice synthesis remains problematic in achieving high-qualit...
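The abstract above centers on Low-Rank Adaptation (LoRA), which fine-tunes a frozen weight matrix W by learning only a low-rank additive update B·A. The sketch below is a minimal illustration of that idea in NumPy; all shapes, the scaling factor, and variable names are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4          # rank r << min(d_in, d_out)

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init
alpha = 8.0                                # common LoRA scaling hyperparameter

def lora_forward(x):
    """y = W x + (alpha / r) * B A x; equals the frozen layer at init."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, adaptation starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters per adapted matrix: r*(d_in+d_out) vs d_in*d_out.
full, lora = d_in * d_out, r * (d_in + d_out)
print(f"full: {full} params, LoRA: {lora} params")  # 4096 vs 512
```

Only A and B would be updated during fine-tuning, which is what makes the method parameter-efficient for per-speaker adaptation.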
ENHANCING LOW-RESOURCE SPOKEN LANGUAGE IDENTIFICATION VIA CROSS-MODALITY RETRIEVAL AND CROSS-LINGUAL TEXT-TO-SPEECH SYNTHESIS
2024 Spoken Language Technology Workshop
Authors: Ma, Min; Wang, Gary; Kastner, Kyle; Caswell, Isaac; Yoon, Charles; Rosenberg, Andrew (Google, Mountain View, CA 94043, USA)
Spoken language identification (SLID) for low-resource languages remains challenging due to limited data availability. In this paper, we present two novel approaches to address the issue: cross-modality retrieval-base...
Text-to-speech synthesis using spectral modeling based on non-negative autoencoder
Interspeech Conference
Authors: Gorai, Takeru; Saito, Daisuke; Minematsu, Nobuaki (Univ Tokyo, Tokyo, Japan)
This paper proposes a statistical parametric speech synthesis system that uses a non-negative autoencoder (NAE) for spectral modeling. NAE is a model that extends non-negative matrix factorization (NMF) as neural networ...
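The snippet above describes the NAE as a neural-network generalisation of NMF, which factorises a magnitude spectrogram V ≈ W·H with W, H ≥ 0. A minimal forward-pass sketch of that idea follows; the shapes, the ReLU encoder, and the absolute-value reparameterisation of the decoder bases are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n_freq, n_frames, n_basis = 128, 50, 16

# Magnitude spectrogram (non-negative by construction)
V = np.abs(rng.standard_normal((n_freq, n_frames)))

relu = lambda z: np.maximum(z, 0.0)

W_enc = rng.standard_normal((n_basis, n_freq)) * 0.1  # encoder weights
W_dec = np.abs(rng.standard_normal((n_freq, n_basis)))  # non-negative "bases"

H = relu(W_enc @ V)   # non-negative activations, analogous to NMF's H
V_hat = W_dec @ H     # reconstruction, non-negative like NMF's W @ H

assert (H >= 0).all() and (V_hat >= 0).all()
```

In a real NAE both weight sets would be trained to minimise a reconstruction loss; the point of the sketch is that non-negativity of the code and bases is preserved, which is the property inherited from NMF.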
SOCODEC: A SEMANTIC-ORDERED MULTI-STREAM SPEECH CODEC FOR EFFICIENT LANGUAGE MODEL BASED TEXT-TO-SPEECH SYNTHESIS
2024 Spoken Language Technology Workshop
Authors: Guo, Haohan; Xie, Fenglong; Xie, Kun; Yang, Dongchao; Guo, Dake; Wu, Xixin; Meng, Helen (Chinese Univ Hong Kong, Hong Kong, Peoples R China; Xiaohongshu Inc, Shanghai, Peoples R China; Northwestern Polytech Univ, Xian, Peoples R China)
The long speech sequence has been troubling language model (LM) based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to addres...
StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis
7th Chinese Conference on Pattern Recognition and Computer Vision
Authors: Chene, Zhiyong; Li, Xinnuo; Ai, Zhiqi; Xu, Shugong (Shanghai Univ, Sch Commun & Informat Engn, Shanghai, Peoples R China)
We introduce StyleFusion-TTS, a prompt and/or audio referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to enhance the editability and naturalness of current research ...
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
25th Interspeech Conference
Authors: Guo, Yinlin; Lv, Yening; Dou, Jinqiao; Zhang, Yan; Wang, Yuehai (Zhejiang Univ, Coll Informat Sci & Elect Engn, Hangzhou, Peoples R China)
While recent advances in text-to-speech synthesis have yielded remarkable improvements in generating high-quality speech, research on lightweight and fast models is limited. This paper introduces FLY-TTS, a new fast, ...
Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis
32nd European Signal Processing Conference (EUSIPCO)
Authors: Cong-Thanh Do; Imai, Shuhei; Doddipatla, Rama; Hain, Thomas (Toshiba Res Europe, Cambridge, England; Tohoku Univ, Sendai, Miyagi, Japan; Univ Sheffield, Sheffield, S Yorkshire, England)
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech train...
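The augmentation strategy the abstract describes, synthesising additional accented utterances with TTS and pooling them with scarce real recordings for ASR training, can be outlined as below. The `synthesise` function and all data values are hypothetical stand-ins, not the paper's system.

```python
# Hypothetical TTS stand-in: maps a transcript to an (audio, transcript)
# pair. In the paper's setting this would be a TTS model trained on a
# small amount of accented speech; here the "audio" is a dummy sample list.
def synthesise(text):
    return ([0.0] * 16000, text)

# Scarce real accented data: (audio, transcript) pairs
real_data = [([0.1] * 16000, "hello world")]

# Text-only transcripts with no accented audio available
extra_transcripts = ["good morning", "nice to meet you"]

# Pool real and synthetic utterances into one ASR training set
augmented = real_data + [synthesise(t) for t in extra_transcripts]
assert len(augmented) == 3
```

The gain comes from the synthetic pairs exposing the recogniser to accent-matched acoustics for transcripts it would otherwise never hear.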
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
25th Interspeech Conference
Authors: Xue, Jinlong; Deng, Yayue; Gao, Yingming; Li, Ya (Beijing Univ Posts & Telecommun, Beijing, Peoples R China)
Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, an...
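Retrieval augmentation with a CLAP-style joint text-audio embedding space amounts to embedding the text prompt and selecting the stored speech clip whose audio embedding is most similar, then using that clip as the TTS speech prompt. The sketch below illustrates the retrieval step only, with random vectors standing in for real CLAP embeddings; dimensions and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_clips = 32, 5

# Stand-ins for precomputed CLAP audio embeddings of candidate speech clips
audio_emb = rng.standard_normal((n_clips, dim))
# Stand-in for the CLAP text embedding of the user's prompt
text_emb = rng.standard_normal(dim)

def retrieve(text_vec, audio_matrix):
    """Index of the clip with highest cosine similarity to the text."""
    a = audio_matrix / np.linalg.norm(audio_matrix, axis=1, keepdims=True)
    t = text_vec / np.linalg.norm(text_vec)
    return int(np.argmax(a @ t))

best = retrieve(text_emb, audio_emb)
assert 0 <= best < n_clips
```

Because CLAP is trained contrastively to align paired text and audio, cosine similarity in this shared space is a reasonable proxy for "which reference clip matches the described style".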
The Sound of Language: A Bilingual Analysis of Voice Conversion and Text-to-Speech Synthesis
2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Authors: Choi, Jeong-Eun; Schäfer, Karla; Steinebach, Martin (Fraunhofer SIT, ATHENE, Darmstadt, Germany)
With the rise of audio deepfakes, there is an increasing need for comprehensive studies on their generation methods, especially regarding their quality. Areas such as languages beyond English and Chinese, as well as c...
ZET-Speech: Zero-Shot Adaptive Emotion-Controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
Interspeech Conference
Authors: Kang, Minki; Han, Wooseok; Hwang, Sung Ju; Yang, Eunho (AITRICS, Seoul, South Korea; Korea Adv Inst Sci & Technol, Daejeon, South Korea)
Emotional text-to-speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotion...