Search results - Inner Mongolia University Library

ACOUSTIC BPE FOR SPEECH GENERATION WITH DISCRETE TOKENS

arXiv, 2023

Authors: Shen, Feiyu; Guo, Yiwei; Du, Chenpeng; Chen, Xie; Yu, Kai. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, the current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling process. To address this issue, we propose acoustic BPE, which encodes frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in the token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax capturing capabilities. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by a rich-diversity TTS system. Experiments prove that rescore selection aligns closely with human preference, which highlights acoustic BPE's potential for other speech generation tasks. Copyright © 2023, The Authors. All rights reserved.

Keywords: Signal encoding
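
The abstract above describes applying byte-pair encoding to sequences of discrete audio tokens so that frequent patterns collapse into single units. The following is a minimal sketch of that general idea, not the authors' implementation: the integer token IDs, the function names (`train_bpe`, `apply_merge`, `encode`, `decode`), and the choice of `first_new_id=100` are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences, num_merges, first_new_id):
    """Learn up to `num_merges` merges over integer token sequences.

    Returns the ordered merge list and the encoded training sequences.
    """
    seqs = [list(s) for s in sequences]
    merges = []  # ordered list of ((left, right), new_id)
    next_id = first_new_id
    for _ in range(num_merges):
        pair_counts = Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging gains nothing
            break
        merges.append((best_pair, next_id))
        seqs = [apply_merge(s, best_pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


def encode(seq, merges):
    """Apply learned merges, in training order, to a new token sequence."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = apply_merge(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged units back into the original audio tokens."""
    expansion = {new_id: pair for pair, new_id in merges}
    out = []
    stack = list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in expansion:
            left, right = expansion[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


# Hypothetical audio-token sequences (made-up IDs, not real model output).
corpus = [[5, 5, 7, 7, 5, 5, 7, 7, 3], [5, 5, 7, 7, 3, 3]]
merges, encoded = train_bpe(corpus, num_merges=3, first_new_id=100)
```

In this toy corpus the encoded sequences are shorter than the originals while `decode` recovers them exactly, which is the sequence-length reduction the abstract refers to. New IDs start above the original vocabulary so merged units never collide with raw tokens.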

Complementary Classifier Induced Partial Label Learning

arXiv, 2023

Authors: Jia, Yuheng; Si, Chongjie; Zhang, Min-Ling. Affiliations: Key Laboratory of Computer Network and Information Integration, School of Computer Science and Engineering, Nanjing 210096, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai 200240, China

In partial label learning (PLL), each training sample is associated with a set of candidate labels, among which only one is valid. The core of PLL is to disambiguate the candidate labels to identify the ground-truth one. In disambiguation, existing works usually do not fully investigate the effectiveness of the non-candidate label set (a.k.a. complementary labels), which accurately indicates a set of labels that do not belong to a sample. In this paper, we use the non-candidate labels to induce a complementary classifier, which naturally forms an adversarial relationship against the traditional PLL classifier, to eliminate the false-positive labels in the candidate label set. Besides, we assume the feature space and the label space share the same local topological structure captured by a dynamic graph, and use it to assist disambiguation. Extensive experimental results validate the superiority of the proposed approach against state-of-the-art PLL methods on 4 controlled UCI data sets and 6 real-world data sets, and reveal the usefulness of complementary learning in PLL. The code is available at https://***/Chongjie-Si/PL-CL. Copyright © 2023, The Authors. All rights reserved.

Keywords: Topology

Light-Weight Visualvoice: Neural Network Quantization On Audio Visual Speech Separation

IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

Authors: Yifei Wu; Chenda Li; Yanmin Qian. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

As multi-modal systems show superior performance on more tasks, the huge amount of computational resources they need becomes one of the critical problems to be solved. In this work, we explore neural network quantization methods to reduce the resource requirements of VisualVoice, a state-of-the-art audio-visual speech separation system. The model is first fine-tuned with an ADMM-based quantization-aware training approach to produce the fixed-precision quantized version. Then three strategies, namely manual selection, Hessian trace-based selection, and KL divergence-based greedy search, are explored to find the optimal mixed-precision setting of the model. The results show that by applying the optimal strategy, we obtain a satisfactory trade-off between space, speed, and performance for the final system. The KL divergence-based strategy reaches 7.2 dB in SDR at a 3-bit equivalent setup, which outperforms the fixed-precision setup and the other two mixed-precision strategies. Moreover, we also discuss the influence of quantizing different parts of the multi-modal system.

Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xun Gong; Wei Wang; Hang Shao; Xie Chen; Yanmin Qian. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

End-to-end (E2E) automatic speech recognition (ASR) systems have gained popularity given their simplified architecture and promising results. However, text-only domain adaptation remains a big challenge for E2E systems. Text-to-speech (TTS) based approaches fine-tune ASR models on speech synthesized by an auxiliary TTS model, thus increasing deployment costs. Language model (LM) fusion based approaches can achieve good performance but are sensitive to interpolation parameters. In order to factorize out the language component in the AED model, we propose the factorized attention-based encoder-decoder (Factorized AED) model, whose decoder takes as input the posterior probabilities of a jointly trained LM. Moreover, in the context of domain adaptation, the domain-specific LM serves as a plug-and-play component for a well-trained factorized AED model. In-domain experiments on LibriSpeech and out-of-domain experiments adapting from LibriSpeech to a variety of domains in GigaSpeech are conducted to validate the effectiveness of our proposed methods. Results show 20% / 24% relative word error rate (WER) reduction on the LibriSpeech test sets and 8% to 34% relative WER reduction on the test sets of 8 GigaSpeech target domains compared to the AED baseline.

Keywords: Adaptation models; Interpolation; Costs; Error analysis; Signal processing; Transformers; Decoding

Exploring Binary Classification Loss for Speaker Verification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Bing Han; Zhengyang Chen; Yanmin Qian. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

The mismatch between closed-set training and open-set testing usually leads to significant performance degradation in the speaker verification task. Among existing loss functions, metric learning-based objectives depend strongly on searching for effective pairs, which might hinder further improvements, and popular multi-classification methods are usually observed to degrade when evaluated on unseen speakers. In this work, we introduce the SphereFace2 framework, which uses several binary classifiers to train the speaker model in a pair-wise manner instead of performing multi-classification. Benefiting from this learning paradigm, it can efficiently alleviate the gap between training and evaluation. Experiments conducted on VoxCeleb show that SphereFace2 outperforms other existing loss functions, especially on hard trials. Besides, the large margin fine-tuning strategy is proven to be compatible with it for further improvements. Finally, SphereFace2 also shows strong robustness to class-wise noisy labels, which gives it the potential to be applied in the semi-supervised training scenario with inaccurately estimated pseudo labels.

Keywords: Training; Degradation; Signal processing; Search problems; Robustness; Acoustics; Noise measurement

DIFFVOICE: TEXT-TO-SPEECH WITH LATENT DIFFUSION

arXiv, 2023

Authors: Liu, Zhijun; Guo, Yiwei; Yu, Kai. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Keywords: Speech synthesis

VOICEFLOW: EFFICIENT TEXT-TO-SPEECH WITH RECTIFIED FLOW MATCHING

arXiv, 2023

Authors: Guo, Yiwei; Du, Chenpeng; Ma, Ziyang; Chen, Xie; Yu, Kai. Affiliation: X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai, China

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow. Copyright © 2023, The Authors. All rights reserved.

Keywords: Graphic methods
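
The VoiceFlow abstract above frames generation as solving an ordinary differential equation and notes that straightening the sampling trajectory permits fewer sampling steps. The toy sketch below illustrates only that numerical point with a fixed-step Euler solver in two dimensions. It is not the VoiceFlow model: the real vector field is estimated by a network conditioned on text, whereas `straight_field` and `rotation_field` here are hand-written stand-ins, and all function names are assumptions.

```python
import math


def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_field(start, target):
    """Constant velocity along the straight line from `start` to `target`."""
    velocity = [b - a for a, b in zip(start, target)]
    return lambda x, t: velocity


def rotation_field(x, t):
    """Curved path: a quarter-turn rotation that maps (1, 0) to (0, 1) at t=1."""
    omega = math.pi / 2
    return [-omega * x[1], omega * x[0]]


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


start, target = [1.0, 0.0], [0.0, 1.0]

# Both flows end at `target` when solved exactly; they differ in path shape.
straight_1 = euler_sample(straight_field(start, target), start, num_steps=1)
curved_1 = euler_sample(rotation_field, start, num_steps=1)
curved_1000 = euler_sample(rotation_field, start, num_steps=1000)

print("straight, 1 step error:  ", distance(straight_1, target))
print("curved, 1 step error:    ", distance(curved_1, target))
print("curved, 1000 steps error:", distance(curved_1000, target))
```

Along the straight path the velocity is constant, so a single Euler step already lands on the target, while the curved path needs many small steps to get close. This is the intuition behind trading trajectory curvature for sampling efficiency.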

Code-Switching Text Generation and Injection in Mandarin-English ASR

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Haibin Yu; Yuxuan Hu; Yao Qian; Ma Jin; Linquan Liu; Shujie Liu; Yu Shi; Yanmin Qian; Edward Lin; Michael Zeng. Affiliations: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University; Microsoft Corporation

Code-switching speech refers to a means of expression that mixes two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling for such speech can be a challenging task due to the lack of data. In this study, we investigate text generation and injection for improving the performance of a streaming model commonly used in industry, Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces. Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models, i.e., a 16% relative Token-based Error Rate (TER) reduction averaged over three evaluation sets. The approach of tying speech and text latent spaces is superior to TTS conversion on the evaluation set whose data is more homogeneous with the training set.

Keywords: Training; Learning systems; Industries; Speech coding; Error analysis; Signal processing; Transformers

EXPLORING BINARY CLASSIFICATION LOSS FOR SPEAKER VERIFICATION

arXiv, 2023

Authors: Han, Bing; Chen, Zhengyang; Qian, Yanmin. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Keywords: Speech recognition