IEEE Transactions on Audio, Speech and Language Processing

TTSlow: Slow Down Text-to-Speech With Efficiency Robustness Evaluations

Authors: Xiaoxue Gao, Yiming Chen, Xianghu Yue, Yu Tsao, Nancy F. Chen

Affiliations: Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore; Department of Electrical and Computer Engineering, National University of Singapore, Singapore; Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan; Department of Electrical Engineering, Chung Yuan Christian University, Taoyuan City, Taiwan

Publication: IEEE Transactions on Audio, Speech and Language Processing

Year, Volume: 2025, Vol. 33

Pages: 693-704

Funding: National Research Foundation, Singapore, through AI Singapore Programme; National Research Foundation Singapore; Infocomm Media Development Authority, Singapore; National Large Language Models Funding Initiative

Subjects: Perturbation methods; Speech processing; Robustness; Computational modeling; Security; Optimization; Text to speech; Speech recognition; Predictive models; Real-time systems

Abstract: Text-to-speech (TTS) has been extensively studied for generating high-quality speech from textual inputs, playing a crucial role in various real-time applications. For real-world deployment, ensuring stable and timely generation in TTS models against minor input perturbations is of paramount importance. Therefore, evaluating the robustness of TTS models against such perturbations, commonly known as adversarial attacks, is highly desirable. In this paper, we propose TTSlow, a novel adversarial approach specifically tailored to slow down the speech generation process in TTS systems. To induce long TTS waiting times, we design a novel efficiency-oriented adversarial loss that encourages an endless generation process. TTSlow encompasses two attack strategies, targeting text inputs and speaker embeddings respectively. Specifically, we propose TTSlow-text, which combines homoglyph-based and swap-based perturbations, and TTSlow-spk, which applies a gradient optimization attack to the speaker embedding. TTSlow serves as the first attack approach targeting a wide range of TTS models, including autoregressive and non-autoregressive ones, thereby advancing exploration in audio security. Extensive experiments are conducted to evaluate the inference efficiency of TTS models, and an in-depth analysis of generated speech intelligibility is performed using Gemini. The results demonstrate that TTSlow can effectively slow down two TTS models across three publicly available datasets.
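
Illustrative note: the abstract names two kinds of text perturbation, homoglyph-based and swap-based, without giving details. The Python sketch below shows what single-character versions of each could look like, plus a brute-force selection step. It is a minimal sketch under stated assumptions, not the authors' method: the `HOMOGLYPHS` table, the function names, and the `output_length` scoring callback (a stand-in for whatever efficiency signal a real TTS model would provide, such as the number of decoder steps) are all hypothetical, and the paper's efficiency-oriented adversarial loss is not reproduced here.

```python
# Hypothetical Latin -> Cyrillic look-alike table (illustrative, not from the paper).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}


def homoglyph_perturb(text, index):
    """Replace the character at `index` with a look-alike, if one is known."""
    ch = text[index]
    if ch not in HOMOGLYPHS:
        return text
    return text[:index] + HOMOGLYPHS[ch] + text[index + 1:]


def swap_perturb(text, index):
    """Swap the characters at `index` and `index + 1`, if both exist."""
    if index < 0 or index + 1 >= len(text):
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def candidates(text):
    """Enumerate every distinct single-edit perturbation of `text`."""
    out = set()
    for i in range(len(text)):
        out.add(homoglyph_perturb(text, i))
        out.add(swap_perturb(text, i))
    out.discard(text)  # drop no-op edits
    return sorted(out)


def pick_slowest(text, output_length):
    """Return the candidate that maximizes `output_length`.

    `output_length` is a caller-supplied function (an assumption of this
    sketch) mapping a text to a proxy for generation time.
    """
    return max(candidates(text), key=output_length, default=text)
```

In use, `output_length` would wrap a call to the TTS model under test and return, for example, the length of the generated audio. The exhaustive search over single edits is chosen only for clarity; it does not scale to long inputs or multi-character edits.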
