Search Results - Inner Mongolia University Library

HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit Bert for Robust Speech Recognition

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Wei Wang, Yanmin Qian. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

Self-supervised learning (SSL) has attracted widespread research interest since many successful SSL approaches such as wav2vec 2.0 and Hidden-unit BERT (HuBERT) have achieved promising results on speech-related tasks such as automatic speech recognition (ASR). However, little work has been done to improve the noise robustness of SSL models. In this paper, we propose HuBERT-AGG, a novel method that learns noise-invariant SSL representations for robust speech recognition by distilling aggregated layer-wise representations. Specifically, we learn an aggregator that computes the weighted sum of all hidden states of a pretrained vanilla HuBERT by fine-tuning it on a small portion of labeled data. Then a noise-robust HuBERT is trained on the simulated noisy speech by distilling from the aggregated representations and layer-wise hidden states produced by a pretrained vanilla HuBERT with parallel original speech as input. Experiments on LibriSpeech simulated noisy test sets show 13.1%-17.0% relative word error rate (WER) reduction with very slight degradation on the original test sets. On CHiME-4 1-channel real speech test sets, we have surpassed the best results achieved by all published fully supervised ASR models as well as other SSL approaches adopting the same data usage as ours.

Keywords: Training; Computational modeling; Supervised learning; Self-supervised learning; Signal processing; Encoding; Noise robustness
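
The HuBERT-AGG abstract above describes an aggregator that computes a weighted sum of all hidden states. A minimal sketch of that one step follows, in plain Python on nested lists. It assumes the layer weights are learnable scalars normalized with a softmax, which the abstract does not state; the names `softmax`, `aggregate_layers`, and `layer_logits` are illustrative and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(hidden_states, layer_logits):
    """Weighted sum of per-layer hidden states.

    hidden_states: list of L layers; each layer is a list of T frames;
                   each frame is a list of D floats.
    layer_logits:  list of L scalars (assumption: softmax-normalized
                   into the layer weights).
    Returns a list of T frames, each a list of D floats.
    """
    weights = softmax(layer_logits)
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    aggregated = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, hidden_states):
        for t in range(num_frames):
            for d in range(dim):
                aggregated[t][d] += weight * layer[t][d]
    return aggregated


# Toy input: 2 layers, 1 frame, 2 dimensions, equal layer logits.
toy_states = [[[1.0, 2.0]], [[3.0, 4.0]]]
toy_result = aggregate_layers(toy_states, [0.0, 0.0])
```

With equal logits both layers receive weight 0.5, so the toy result is the element-wise mean of the two layers. In the method described above, such aggregated outputs would come from the vanilla model on clean speech and serve as distillation targets for the model fed noisy speech; the distillation loss itself is not specified in the abstract and is left out here.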

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

arXiv, 2023

Authors: Zhang, Hanglei; Guo, Yiwei; Liu, Sen; Chen, Xie; Yu, Kai. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire. Moreover, they may have limited adaptability due to fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations. Our approach utilizes a large language model (LLM) to transform expressive TTS into a style retrieval task. The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions. The selected reference guides the TTS pipeline to synthesize speech with the intended style. This innovative approach provides flexible, versatile, and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus demonstrate FS-TTS’s proficiency in leveraging the LLM’s semantic inference ability to retrieve desired styles from either input text or user-defined descriptions. This results in synthetic speech that is closely aligned with the specified styles. Copyright © 2023, The Authors. All rights reserved.

Keywords: Speech synthesis

Diverse and Vivid Sound Generation from Text Descriptions

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Guangwei Li, Xuenan Xu, Lingfeng Dai, Mengyue Wu, Kai Yu. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

Previous audio generation work mainly focuses on specified sound classes such as speech or music, whose form and content are greatly restricted. In this paper, we go beyond specific audio generation by using natural language descriptions as a clue to generate broad sounds. Unlike visual information, a text description is concise by nature but carries rich hidden meaning, which makes the space of audio to be generated larger and more complex. A Variation-Quantized GAN is used to train a codebook that learns discrete representations of spectrograms. For a given text description, its pre-trained embedding is fed to a Transformer to sample codebook indices, which are decoded into a spectrogram that is then transformed into a waveform by a MelGAN vocoder. The generated waveform has high quality and fidelity and corresponds closely to the given text. Experiments show that our proposed method is capable of generating natural, vivid audio, achieving superb quantitative and qualitative results.

Keywords: Measurement; Visualization; Vocoders; Natural languages; Transformers; SPICE; Complexity theory
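
The abstract above mentions a codebook of discrete spectrogram representations and a Transformer that samples codebook indices. As a rough illustration of what "codebook indices" means, the sketch below maps continuous vectors to their nearest codebook entry and back. Nearest-neighbour lookup under squared Euclidean distance is the usual choice in vector-quantized models and is assumed here; the abstract does not confirm it, and `quantize`, `dequantize`, and the toy codebook are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    indices = []
    for vector in vectors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(vector, codebook[k]),
        )
        indices.append(nearest)
    return indices


def dequantize(indices, codebook):
    """Replace each index with a copy of the codebook entry it points to."""
    return [list(codebook[k]) for k in indices]


# Toy codebook with three 2-dimensional entries (illustrative values only).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_vectors = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]
toy_indices = quantize(toy_vectors, toy_codebook)
toy_decoded = dequantize(toy_indices, toy_codebook)
```

In a pipeline like the one described, a sequence of such indices would be what the Transformer predicts from the text embedding, and the decoded entries would feed the spectrogram decoder.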

Adaptive Large Margin Fine-Tuning For Robust Speaker Verification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Leying Zhang, Zhengyang Chen, Yanmin Qian. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

Large margin fine-tuning (LMFT) is an effective strategy to improve the performance of speaker verification systems and is widely used in speaker verification challenge systems. Because a large margin in the loss function can make the training task too difficult, longer training segments are usually used in LMFT to alleviate this problem. However, the LMFT model can then have a duration mismatch with real-world verification scenarios, where the verification speech may be very short. In our experiments, we also find that LMFT fails in short-duration and other verification scenarios. To solve this problem, we propose duration-based and similarity-based adaptive large margin fine-tuning (ALMFT) strategies. To verify their effectiveness, we constructed fixed-length, variable-length, and asymmetric verification trials based on VoxCeleb1. Experimental results demonstrate that the ALMFT algorithms are effective and robust: they achieve improvements comparable to LMFT on the official VoxCeleb evaluation trials and also overcome the performance degradation in short-duration and asymmetric scenarios.

Keywords: Training; Degradation; Adaptation models; Signal processing algorithms; Focusing; Collaboration; Signal processing

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

arXiv, 2023

Authors: Du, Chenpeng; Guo, Yiwei; Shen, Feiyu; Yu, Kai. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

In this paper, we describe the systems developed by the SJTU X-LANCE team for the LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours of data from each speaker can be selected to train the TTS model. Our system is based on the recently proposed VQTTS, which uses VQ acoustic features rather than mel-spectrograms. We introduce additional speaker embeddings and language embeddings to VQTTS for controlling the speaker and language information. In the cross-lingual evaluations, where we need to synthesize speech in a cross-lingual speaker’s voice, we provide a native speaker’s embedding to the acoustic model and the target speaker’s embedding to the vocoder. In the subjective MOS listening test on naturalness, our system achieves 4.77, which ranks first.

Keywords: Subjective testing

Improving Dino-Based Self-Supervised Speaker Verification with Progressive Cluster-Aware Training

IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

Authors: Bing Han, Wen Huang, Zhengyang Chen, Yanmin Qian. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

Self-supervised contrastive learning has recently emerged as one of the promising approaches in the speaker verification task, due to its independence from labeled data. Among them, the DINO-based self-supervised framework, trained without exploiting negative pairs, is very popular and achieves excellent performance in the speaker verification task. However, limited by the duration of each utterance, segments cropped from it overlap heavily, which may mislead the model to pay attention to irrelevant information. To tackle this problem, we propose a cluster-aware (CA) training strategy to make the model crop positive segments from several utterances in the same cluster rather than from a single utterance. Besides, in the clustering stage, we also investigate strategies of fixed-number clustering as well as progressive clustering. With these strategies, our CA-DINO achieves the state-of-the-art result on the Vox-O test set. Finally, we explore the effect of fine-tuning CA-DINO with a small amount of labeled data. Our proposed model with only 10% labeled data outperforms the fully supervised system trained on all data.

Keywords: none listed
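
The cluster-aware strategy in the abstract above crops positive segments from several utterances in the same cluster rather than from a single utterance. The sketch below illustrates that sampling idea only. It assumes cluster assignments are already available as a dictionary and that every utterance is at least as long as the requested segment; `crop` and `sample_positive_pair` are hypothetical helpers, not the authors' code.

```python
import random


def crop(samples, segment_len, rng):
    """Random contiguous crop; assumes len(samples) >= segment_len."""
    start = rng.randint(0, len(samples) - segment_len)
    return samples[start:start + segment_len]


def sample_positive_pair(anchor_id, cluster_of, utterances, segment_len, rng):
    """Crop one segment from the anchor utterance and one from an
    utterance drawn from the same cluster (possibly the anchor itself).

    cluster_of: dict mapping utterance id -> cluster id (assumed given).
    utterances: dict mapping utterance id -> list of samples.
    Returns (anchor_segment, partner_segment, partner_id).
    """
    cluster_id = cluster_of[anchor_id]
    members = [u for u, c in cluster_of.items() if c == cluster_id]
    partner_id = rng.choice(members)
    anchor_segment = crop(utterances[anchor_id], segment_len, rng)
    partner_segment = crop(utterances[partner_id], segment_len, rng)
    return anchor_segment, partner_segment, partner_id


# Toy data: "a" and "b" share cluster 0, "c" is alone in cluster 1.
toy_utterances = {
    "a": list(range(0, 10)),
    "b": list(range(100, 110)),
    "c": list(range(200, 210)),
}
toy_clusters = {"a": 0, "b": 0, "c": 1}
toy_rng = random.Random(0)
seg_a, seg_p, partner = sample_positive_pair(
    "a", toy_clusters, toy_utterances, 4, toy_rng
)
```

Drawing the partner from the whole cluster means the two positive segments can come from different recordings, which is the property the abstract credits with reducing overlap between positives. How clusters are formed and refined progressively is outside this sketch.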

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu. Affiliation: Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China

Keywords: Vocoders; Signal processing; Acoustics; Data models; Speech processing

On the Structural Generalization in Text-to-SQL

arXiv, 2023

Authors: Li, Jieyu; Chen, Lu; Cao, Ruisheng; Zhu, Su; Xu, Hongshen; Chen, Zhi; Zhang, Hanchong; Yu, Kai. Affiliation: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Exploring the generalization of a text-to-SQL parser is essential for a system to automatically adapt to real-world databases. Previous works provided investigations focusing on lexical diversity, including the influence of synonyms and perturbations in both natural language questions and databases. However, research on the structural variety of database schemas (DS) is lacking. Specifically, given the same input question, the target SQL is likely to be represented in different ways when the DS has a different structure. In this work, we provide in-depth discussions about the structural generalization of text-to-SQL tasks. We observe that current datasets are too templated to study structural generalization. To collect eligible test data, we propose a framework to generate novel text-to-SQL data via automatic and synchronous (DS, SQL) pair altering. In the experiments, a significant performance drop when evaluating well-trained text-to-SQL models on the synthetic samples demonstrates the limitation of current research regarding structural generalization. Based on a comprehensive analysis, we suggest the practical reason is overfitting to (NL, SQL) patterns.

Keywords: Database systems

ASTormer: An AST Structure-aware Transformer Decoder for Text-to-SQL

arXiv, 2023

Authors: Cao, Ruisheng; Zhang, Hanchong; Xu, Hongshen; Li, Jieyu; Ma, Da; Chen, Lu; Yu, Kai. Affiliation: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Text-to-SQL aims to generate an executable SQL program given the user utterance and the corresponding database schema. To ensure the well-formedness of output SQL queries, one prominent approach adopts a grammar-based recurrent decoder to produce the equivalent SQL abstract syntax tree (AST). However, previous methods mainly utilize an RNN-series decoder, which 1) is time-consuming and inefficient and 2) introduces very few structure priors. In this work, we propose an AST structure-aware Transformer decoder (ASTormer) to replace traditional RNN cells. The structural knowledge, such as node types and positions in the tree, is seamlessly incorporated into the decoder via both absolute and relative position embeddings. Besides, the proposed framework is compatible with different traversing orders, even considering adaptive node selection. Extensive experiments on five text-to-SQL benchmarks demonstrate the effectiveness and efficiency of our structured decoder compared to competitive baselines.

Keywords: Decoding