Search Results - Inner Mongolia University Library

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; ICT Cluster, Singapore Institute of Technology, Singapore.

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique.
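The abstract's point about "independent loss contributions from different prototypes" can be illustrated with a small sketch. This is not the authors' implementation: the function names `bce_terms` and `softmax_ce_terms`, the logits, and the multi-hot targets are all hypothetical, and the softmax cross-entropy here only stands in for the normalized form that InfoNCE-style losses share. The sketch shows that each binary cross-entropy term depends only on its own prototype's logit, while every softmax term changes when any other logit changes.

```python
import math

def bce_terms(logits, targets):
    """Per-prototype binary cross-entropy terms (a sigmoid on each logit)."""
    terms = []
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid of this prototype's logit only
        terms.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return terms

def softmax_ce_terms(logits, targets):
    """Per-prototype softmax cross-entropy terms (shared normalizer)."""
    m = max(logits)
    # log-sum-exp over ALL logits couples every term to every prototype
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [-t * (z - log_norm) for z, t in zip(logits, targets)]

# Hypothetical frame with two active prototypes (multi-hot target).
targets = [1.0, 0.0, 1.0]
before = [2.0, -1.0, 0.5]
after = [2.0, 5.0, 0.5]   # only the second prototype's logit changes

# BCE: terms for prototypes 0 and 2 are unchanged.
print(bce_terms(before, targets)[0] == bce_terms(after, targets)[0])
# Softmax: the term for prototype 0 changes because the normalizer changed.
print(softmax_ce_terms(before, targets)[0] == softmax_ce_terms(after, targets)[0])
```

Under these assumptions the first comparison prints `True` and the second prints `False`, which is the sense in which a per-prototype sigmoid loss suits frames that may carry several labels at once.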

Keywords: Representation learning; Event detection; Computational modeling; Prototypes; Signal processing algorithms; Self-supervised learning; Signal processing; Transformers; Data models; Speech processing

Regularizing Contrastive Predictive Coding for Speech Applications

arXiv, 2023

Authors: Bhati, Saurabhchand; Villalba, Jesús; Zelasko, Piotr; Moro-Velazquez, Laureano; Dehak, Najim. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States; Meaning.Team Inc, United States.

Self-supervised methods such as Contrastive Predictive Coding (CPC) have greatly improved the quality of the unsupervised representations. These representations significantly reduce the amount of labeled data needed for downstream task performance, such as automatic speech recognition. CPC learns representations by learning to predict future frames given current frames. Based on the observation that the acoustic information, e.g., phones, changes slower than the feature extraction rate in CPC, we propose regularization techniques that impose slowness constraints on the features. Here we propose two regularization techniques: Self-expressing constraint and Left-or-right regularization. We evaluate the proposed model on ABX and linear phone classification tasks, acoustic unit discovery, and automatic speech recognition. The regularized CPC trained on 100 hours of unlabeled data matches the performance of the baseline CPC trained on 360 hours of unlabeled data. We also show that our regularization techniques are complementary to data augmentation and can further boost the system's performance. In monolingual, cross-lingual, or multilingual settings, with/without data augmentation, regardless of the amount of data used for training, our regularized models outperformed the baseline CPC models. Copyright © 2023, The Authors. All rights reserved.

Keywords: Supervised learning

Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China.

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram facilitates the subsequent disentanglement of the speaker and text factors with the condition of the speaker embeddings, which are extracted from the environmental speech using a pretrained environment-robust speaker encoder. Finally, both the speaker and environment embeddings are conditioned into the decoder for environment-aware speech generation. Experimental results demonstrate that IDEA-TTS achieves superior performance in the environment-aware TTS task, excelling in speech quality, speaker similarity, and environmental similarity. Additionally, IDEA-TTS is also capable of the acoustic environment conversion task and achieves state-of-the-art performance.

Keywords: Speech enhancement; Acoustics; Text to speech; Decoding; Data mining; Spectrogram

Towards robust one-shot voice conversion with cycle phonetic posteriorgrams and multi-scale speaker representations

24th International Congress on Acoustics, ICA 2022

Authors: Chen, Yannian; Liu, Lijuan; Hu, Yajun; Ling, Zhenhua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; IFLYTEK Research, IFLYTEK Co. Ltd., China.

One-shot voice conversion (VC) aims to convert the voice across arbitrary speakers even unseen during training, with only one reference utterance from the target speaker. It is still a challenging task as both content and speaker representations estimated from speech are required to be reliable. In this paper, we propose a novel method which combines phonetic posteriorgrams (PPGs) and multi-scale speaker representations to achieve robust one-shot VC. PPGs are extracted by a pretrained automatic speech recognition (ASR) model and contain robust linguistic information. Cycle PPGs which are generated from a cycle conversion process are used for training to eliminate the influence of residual speaker information in PPGs. Furthermore, multi-scale speaker representations composed of global and local ones are utilized. Global speaker representations are modeled by an advanced speaker embedding network which integrates squeeze-excitation blocks and attentive statistics pooling to get utterance-level vectors. In order to extract time-varying and content-dependent local speaker representations, an attention mechanism is adopted to select the most suitable features depending on each content frame, which is expected to refine the coarse speaker information given by utterance-level speaker representations. Experimental results showed that the proposed method outperformed baseline methods on one-shot VC. © 2022 Proceedings of the International Congress on Acoustics. All Rights Reserved.

Keywords: Speech recognition

Document-Level Machine Translation with Effective Batch-Level Context Representation

International Joint Conference on Neural Networks (IJCNN)

Authors: Kang Zhong, Jie Zhang, Wu Guo. Affiliation: National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China.

ISBN (digital): 9798350359312

ISBN (print): 9798350359329

It is critical to provide inter-sentential context for document-level neural machine translation (DocNMT) to achieve higher-quality translations. As the document-level information is naturally preserved in mini-batches in case sentences are not shuffled, in this work we propose an effective batch-level context representation (EBCR) for DocNMT by leveraging structural contextual clues in the mini-batches. The EBCR is a plug-in module that is added to each encoder layer of the conventional Transformer model and can condense the inter-sentential contextual information within the mini-batch and reinforce the inter-sentential local context through gating operation. The proposed method is evaluated on three English-German document translation datasets, and results show that our model can present the wide-range context more effectively than existing methods.

Keywords: Neural networks; Neural machine translation; Transformers; Context modeling

Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qian Wang, Jia-Chen Gu, Zhen-Hua Ling. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China.

Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.

PNP-RKD: A Positive-Negative Pair based Relational Knowledge Distillation Method for Cross-Domain Speaker Verification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qing Gu, Yan Song, Nan Jiang, Pengfei Cai, Ian McLoughlin. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; ICT Cluster, Singapore Institute of Technology, Singapore.

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Existing deep embedding learning based speaker verification (SV) methods suffer from performance degradation under domain shift conditions. This can be alleviated through unsupervised domain adaptation (UDA) techniques. While UDA improves global statistical consistency across domains, discriminative information may be overlooked or misaligned in the process. To combat this, we propose PNP-RKD, a relational knowledge distillation method that utilizes positive and negative pairs from both the source and target domains within a multitask learning framework. Two auxiliary tasks are conducted separately in the source and target domains to support PNP-RKD. Embeddings are learned in a supervised fashion from the labeled source domain, providing a robust foundation of prior knowledge. For the unlabeled target domain, we apply contrastive learning based on swapped prediction, a key component that enhances noise robustness and improves the quality of learned prototypes. More importantly, it facilitates reliable sampling in PNP-RKD, thereby enhancing the alignment of discriminative knowledge across domains. Extensive experiments conducted on the NIST SRE16 and SRE18 datasets demonstrate the superior performance of the proposed PNP-RKD method, achieving EERs of 6.83% and 8.28%, respectively.

Keywords: Degradation; Prototypes; Contrastive learning; NIST; Signal processing; Multitasking; Acoustics; Noise robustness; Reliability; Speech processing

Considering Temporal Connection between Turns for Conversational Speech Synthesis

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Kangdi Mei, Zhaoci Liu, Huipeng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China.

Conversational speech synthesis aims to synthesize speech of an individual speaker based on history conversation. However, most studies in conversational speech synthesis only focus on the synthesis performance of the current speaker’s turn and neglect the temporal relationship between turns of interlocutors. Therefore, we consider the temporal connection between turns for conversational speech synthesis, which is crucial for the naturalness and coherence of conversations. Specifically, this paper formulates a task in which there is no overlap between turns and only one history turn is considered. To complete this task, an acoustic model is proposed which leverages multi-modal (including text and speech) information from previous turn to predict the acoustic features of not only current turn but also the inter-turn gap. The model is designed based on MQTTS and incorporates the global acoustic representation and BERT-based local semantic representation of previous turn when predicting the acoustic features of each frame. Experimental results demonstrate that with the introduction of global acoustic information and local semantic information, our model achieves better performance on the temporal connection between turns and the quality of synthetic speech. Audio samples can be found in https://***/icassp2024.

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

arXiv, 2024

Authors: Cai, Pengfei; Song, Yan; Jiang, Nan; Gu, Qing; McLoughlin, Ian. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; ICT Cluster, Singapore Institute of Technology, Singapore.

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique. © 2024, CC BY.

Keywords: Self-supervised learning

Clustering Unsupervised Representations as Defense Against Poisoning Attacks on Speech Commands Classification System

IEEE Workshop on Automatic Speech Recognition and Understanding

Authors: Thomas Thebaud, Sonal Joshi, Henry Li, Martin Sustek, Jesús Villalba, Sanjeev Khudanpur, Najim Dehak. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, USA; Faculty of Information Technology, Brno University of Technology, Czechia.

Poisoning attacks entail attackers intentionally tampering with training data. In this paper, we consider a dirty-label poisoning attack scenario on a speech commands classification system. The threat model assumes that certain utterances from one of the classes (source class) are poisoned by superimposing a trigger on it, and its label is changed to another class selected by the attacker (target class). We propose a filtering defense against such an attack. First, we use DIstillation with NO labels (DINO) to learn unsupervised representations for all the training examples. Next, we use K-means and LDA to cluster these representations. Finally, we keep the utterances with the most repeated label in their cluster for training and discard the rest. For a 10% poisoned source class, we demonstrate a drop in attack success rate from 99.75% to 0.25%. We test our defense against a variety of threat models, including different target and source classes, as well as trigger variations.
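The final filtering step described in the abstract (keep the utterances whose label is the most repeated one in their cluster, discard the rest) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `filter_by_cluster_majority` is hypothetical, the cluster assignments are assumed to be already available (the DINO representation learning and the K-means and LDA clustering are not reproduced), and ties are broken by whichever label appears first in the cluster.

```python
from collections import Counter, defaultdict

def filter_by_cluster_majority(cluster_ids, labels):
    """Return sorted indices of examples whose label is the most
    frequent label within their own cluster."""
    # Group example indices by their (assumed precomputed) cluster id.
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    kept = []
    for members in by_cluster.values():
        # Most repeated label in this cluster; ties go to the first seen.
        majority_label, _ = Counter(labels[i] for i in members).most_common(1)[0]
        kept.extend(i for i in members if labels[i] == majority_label)
    return sorted(kept)

# Hypothetical toy data: example 2 sits in a cluster dominated by "yes"
# but carries the label "no", as a poisoned utterance might.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["yes", "yes", "no", "no", "no", "no"]
print(filter_by_cluster_majority(cluster_ids, labels))  # [0, 1, 3, 4, 5]
```

In this toy case the mislabeled example at index 2 is dropped and the remaining five are kept for training.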
