Search Results - Inner Mongolia University Library

Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

arXiv, 2022

Authors: Gong, Xun; Lu, Yizhou; Zhou, Zhikai; Qian, Yanmin. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Accent variability has posed a huge challenge to automatic speech recognition (ASR) modeling. Although one-hot accent vector based adaptation systems are commonly used, they require prior knowledge about the target accent and cannot handle unseen accents. Furthermore, simply concatenating accent embeddings does not make good use of accent knowledge, which has limited improvements. In this work, we aim to tackle these problems with a novel layer-wise adaptation structure injected into the E2E ASR model encoder. The adapter layer encodes an arbitrary accent in the accent space and assists the ASR model in recognizing accented speech. Given an utterance, the adaptation structure extracts the corresponding accent information and transforms the input acoustic feature into an accent-related feature through the linear combination of all accent bases. We further explore the injection position of the adaptation layer, the number of accent bases, and different types of accent bases to achieve better accent adaptation. Experimental results show that the proposed adaptation structure brings 12% and 10% relative word error rate (WER) reduction on the AESRC2020 accent dataset and the LibriSpeech dataset, respectively, compared to the baseline. Copyright © 2022, The Authors. All rights reserved.
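The "linear combination of all accent bases" mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax weighting, and the per-basis scale-and-shift form are assumptions chosen only to show the idea of mixing several basis transforms with utterance-dependent weights.

```python
import math

def softmax(scores):
    """Turn unnormalized scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adapt_feature(feature, bases, accent_scores):
    """Mix per-basis transforms of one acoustic frame.

    feature: list of floats (one frame).
    bases: list of (scale, shift) pairs, one per accent basis (hypothetical form).
    accent_scores: unnormalized scores from a hypothetical accent estimator.
    """
    weights = softmax(accent_scores)
    adapted = [0.0] * len(feature)
    for w, (scale, shift) in zip(weights, bases):
        for i, x in enumerate(feature):
            adapted[i] += w * (scale[i] * x + shift[i])
    return adapted

# Two toy bases: an identity transform and a "double and add one" transform.
frame = [1.0, 2.0]
bases = [([1.0, 1.0], [0.0, 0.0]), ([2.0, 2.0], [1.0, 1.0])]
adapted = adapt_feature(frame, bases, accent_scores=[0.0, 0.0])
```

Because the weights come from the utterance itself rather than from a one-hot accent label, a scheme of this kind can, in principle, place an unseen accent somewhere between the known bases.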

Keywords: Speech recognition

Divide and Conquer: a Two-Step Method for High Quality Face De-identification with Model Explainability

International Conference on Computer Vision (ICCV)

Authors: Yunqian Wen; Bo Liu; Jingyi Cao; Rong Xie; Li Song. Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; School of Computer Science, University of Technology Sydney; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

Face de-identification involves concealing the true identity of a face while retaining other facial characteristics. Current target-generic methods typically disentangle identity features in the latent space, using adversarial training to balance privacy and utility. However, this pattern often leads to a trade-off between privacy and utility, and the latent space remains difficult to explain. To address these issues, we propose IDeudemon, which employs a "divide and conquer" strategy to protect identity and preserve utility step by step while maintaining good explainability. In Step I, we obfuscate the 3D disentangled ID code calculated by a parametric NeRF model to protect identity. In Step II, we incorporate visual similarity assistance and train a GAN with adjusted losses to preserve image utility. Thanks to the powerful 3D prior and delicate generative designs, our approach protects identity naturally, produces high-quality details, and is robust to different poses and expressions. Extensive experiments demonstrate that the proposed IDeudemon outperforms previous state-of-the-art methods.

Keywords:

Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction

arXiv, 2022

Authors: Han, Bing; Chen, Zhengyang; Qian, Yanmin. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

For self-supervised speaker verification, the quality of pseudo labels determines the upper bound of system performance, because many of the estimated labels are unreliable. In this work, we propose dynamic loss-gate and label correction (DLG-LC) to alleviate the performance degradation caused by unreliable estimated labels. In DLG, we adopt a Gaussian Mixture Model (GMM) to dynamically model the loss distribution and use the estimated GMM to distinguish reliable from unreliable labels automatically. In addition, to make better use of the unreliable data instead of dropping it directly, we correct the unreliable labels with model predictions. Moreover, we apply the negative-pairs-free DINO framework in our experiments for further improvement. Compared to the best-known speaker verification system with self-supervised learning, our proposed DLG-LC converges faster and achieves 11.45%, 18.35% and 15.16% relative improvement on the Vox-O, Vox-E and Vox-H trials of the VoxCeleb1 evaluation dataset. Copyright © 2022, The Authors. All rights reserved.
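The loss-gate idea (fit a GMM to per-sample losses, then treat samples explained by the low-loss component as reliable) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: it assumes a one-dimensional, two-component GMM fitted with plain EM, and the function names and toy loss values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_gmm(losses, iterations=50):
    """Fit a 1-D, 2-component GMM to per-sample losses with plain EM."""
    means = [min(losses), max(losses)]   # initialize at the extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each loss value
        resp = []
        for x in losses:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)   # floor to avoid a collapsed component
    return weights, means, variances

def reliable_mask(losses, weights, means, variances):
    """True where the low-mean (low-loss) component is the more likely one."""
    low = 0 if means[0] <= means[1] else 1
    mask = []
    for x in losses:
        p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
        mask.append(p[low] >= p[1 - low])
    return mask

# Toy losses: four small (likely clean labels) and three large (likely noisy).
losses = [0.10, 0.20, 0.15, 0.12, 2.0, 2.2, 1.9]
params = fit_two_component_gmm(losses)
mask = reliable_mask(losses, *params)
```

Refitting the mixture as training proceeds is what would make such a gate "dynamic": the boundary between the two components moves with the loss distribution instead of being a fixed threshold.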

Keywords: Gaussian distribution

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

arXiv, 2022

Authors: Gong, Xun; Zhou, Zhikai; Qian, Yanmin. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Modern non-autoregressive (NAR) speech recognition systems aim to accelerate inference; however, they suffer from performance degradation compared with autoregressive (AR) models, as well as from large model sizes. We propose a novel knowledge transfer and distillation architecture that leverages knowledge from AR models to improve NAR performance while reducing the model's size. Frame- and sequence-level objectives are carefully designed for transfer learning. To further boost the performance of NAR, a beam search method on Mask-CTC is developed to enlarge the search space during the inference stage. Experiments show that the proposed NAR beam search reduces CER by over 5% relative on the AISHELL-1 benchmark with a tolerable real-time-factor (RTF) increment. With knowledge transfer, the NAR student, which has the same size as the AR teacher, obtains relative CER reductions of 8%/16% on the AISHELL-1 dev/test sets, and over 25% relative WER reductions on the LibriSpeech test-clean/other sets. Moreover, the ∼9x smaller NAR models achieve ∼25% relative CER/WER reductions on both the AISHELL-1 and LibriSpeech benchmarks with the proposed knowledge transfer and distillation. Copyright © 2022, The Authors. All rights reserved.
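The abstract does not spell out the frame-level objective. A common choice for frame-level distillation is the KL divergence between teacher and student posteriors, averaged over frames; the sketch below assumes that form purely for illustration, and the function names and toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def frame_level_kd_loss(teacher_probs, student_probs):
    """Average KL(teacher || student) over frames.

    teacher_probs, student_probs: lists of per-frame probability vectors,
    assumed here to be aligned frame by frame.
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must cover the same frames")
    total = sum(kl_divergence(t, s) for t, s in zip(teacher_probs, student_probs))
    return total / len(teacher_probs)

# Toy example: two frames over a three-token vocabulary.
teacher = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
student = [[0.6, 0.2, 0.2], [0.3, 0.5, 0.2]]
loss = frame_level_kd_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's per-frame distributions exactly and grows as they diverge, which is the property a distillation objective needs.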

Keywords: Speech recognition

Collaborative Positional-Motion Excitation Module for Efficient Action Recognition

18th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2021

Authors: Alsarhan, Tamam; Lu, Hongtao. Affiliations: Key Lab of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

ISBN: (print) 9783030893699

Massive progress in vision-based action recognition has been made in the last few years, owing to the advancement of deep convolutional neural networks (CNNs). In contrast with 2D CNN-based approaches, 3D CNN-based approaches can effectively capture spatial and temporal features. However, they are computationally intensive. To boost 2D-CNN performance, most existing methods leverage channel attention (e.g., squeeze-and-excitation), which, despite its strong impact on model performance, operates only on the channel space and ignores the spatial space. In this work, we design a generic and collaborative excitation module, namely the Collaborative Positional-Motion Excitation Module (CPME), for action recognition. CPME is a dual-pathway excitation module designed to embed the crucial types of information, mainly the positional information and the motion information, for efficient action recognition. The Positional Enhancement Pathway (PEP), the first pathway of CPME, encodes direction-aware and position-sensitive information. The Motion Enhancement Pathway (MEP), the second pathway, encodes the motion information by emphasizing the informative features in each frame and exciting motion-sensitive channels. We integrate the proposed CPME into 2D CNNs to form a simple yet effective CPME-Net with limited extra computational cost. Finally, a discriminative and diverse video-level representation for action recognition is generated by end-to-end training. Experiments on two popular action recognition datasets demonstrate that CPME blocks bring performance improvements over the 2D CNN baseline, and our method achieves competitive results against the state-of-the-art methods. © 2021, Springer Nature Switzerland AG.
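For context, the squeeze-and-excitation style of channel attention that the abstract contrasts CPME with can be sketched as below. This shows the baseline mechanism only, not CPME itself; biases are omitted for brevity, and the weight matrices and toy feature map are hypothetical values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squeeze_excite(feature_map, w1, w2):
    """Channel attention on a [C][H][W] feature map.

    w1: [C_reduced][C] weights of the first fully connected layer.
    w2: [C][C_reduced] weights of the second fully connected layer.
    """
    # Squeeze: global average pooling collapses each channel to one number
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: every spatial position in a channel gets the same gate
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature_map)]
    return gates, scaled

# Two channels, each 1 x 2 pixels.
feature_map = [[[1.0, 3.0]], [[2.0, 2.0]]]
w1 = [[1.0, 0.0]]      # reduce 2 channels to 1 hidden unit
w2 = [[0.0], [1.0]]    # expand back to 2 channel gates
gates, scaled = squeeze_excite(feature_map, w1, w2)
```

The pooling step discards where in the frame the activation occurred, so each channel is rescaled by a single number. That is the limitation the abstract points to when it says channel attention ignores the spatial space.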

Keywords: Signal encoding

Spatial Gradient Guided Learning and Semantic Relation Transfer for Facial Landmark Detection

27th International Conference on MultiMedia Modeling, MMM 2021

Authors: Wang, Jian; Li, Yaoyi; Lu, Hongtao. Affiliations: Key Lab of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

ISBN: (print) 9783030678319

Pixel-wise losses are widely used in heatmap regression networks to detect facial landmarks; however, those losses are not consistent with the evaluation criterion used in testing, which measures the error between the highest-valued pixel position in the predicted heatmap and that in the ground-truth heatmap. In this paper, we propose a novel spatial-gradient consistency loss function (called Grad loss), which keeps the spatial structure of the predicted heatmap similar to that of the ground truth. To reduce the quantization error caused by downsampling in the network, we also propose a new post-processing strategy based on the Gaussian prior. To further improve face alignment accuracy, we introduce Spatial-Gradient Enhance attention and a Relation-based Reweighing Module to transfer semantic information and spatial information between high-resolution and low-resolution representations. Extensive experiments on several benchmarks (e.g., 300W, AFLW, COFW, WFLW) show that our method outperforms the state-of-the-art by impressive margins. © 2021, Springer Nature Switzerland AG.
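The abstract does not give the exact form of the Grad loss. One plausible reading of a "spatial-gradient consistency" loss is a penalty on the difference between the finite-difference gradients of the predicted and ground-truth heatmaps; the sketch below assumes that form, and all function names are hypothetical.

```python
def spatial_gradients(heatmap):
    """Forward differences along x and y for a 2-D heatmap (list of rows)."""
    h, w = len(heatmap), len(heatmap[0])
    gx = [[heatmap[y][x + 1] - heatmap[y][x] for x in range(w - 1)] for y in range(h)]
    gy = [[heatmap[y + 1][x] - heatmap[y][x] for x in range(w)] for y in range(h - 1)]
    return gx, gy

def gradient_consistency_loss(pred, target):
    """Mean squared difference between predicted and target spatial gradients."""
    pgx, pgy = spatial_gradients(pred)
    tgx, tgy = spatial_gradients(target)
    diffs = []
    for p, t in ((pgx, tgx), (pgy, tgy)):
        for prow, trow in zip(p, t):
            diffs.extend((a - b) ** 2 for a, b in zip(prow, trow))
    return sum(diffs) / len(diffs)

# A prediction that equals the target plus a constant offset has the same
# spatial structure, so this loss does not penalize it.
target = [[0.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 0.0]]
shifted = [[v + 0.25 for v in row] for row in target]
loss_shifted = gradient_consistency_loss(shifted, target)
```

A loss of this kind rewards getting the shape of the peak right rather than each pixel's absolute value, which is closer to an evaluation that only looks at where the maximum lies.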

Keywords: Face recognition

UNSUPERVISED WORD-LEVEL PROSODY TAGGING FOR CONTROLLABLE SPEECH SYNTHESIS

arXiv, 2022

Authors: Guo, Yiwei; Du, Chenpeng; Yu, Kai. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies using a GMM within each type of word separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody. Copyright © 2022, The Authors. All rights reserved.
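The two-stage structure (group words by type, then cluster prosody separately inside each group) can be sketched as follows. To keep the example short, it swaps in simpler stand-ins for both stages: a phoneme-count threshold replaces the decision tree, and a tiny one-dimensional k-means replaces the GMM. The single "pitch" value per word, the field names, and the threshold are all hypothetical.

```python
def group_words(words, threshold=4):
    """Stage 1 stand-in: split words into types by phoneme count."""
    groups = {"short": [], "long": []}
    for word in words:
        key = "short" if len(word["phones"]) < threshold else "long"
        groups[key].append(word)
    return groups

def kmeans_1d(values, k=2, iterations=20):
    """Stage 2 stand-in: tiny 1-D k-means (the paper uses a GMM instead)."""
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            buckets[idx].append(v)
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

def tag_words(words, k=2):
    """Assign each word a tag of the form '<type>-<cluster index>'."""
    tags = {}
    for name, members in group_words(words).items():
        if not members:
            continue
        centers = kmeans_1d([w["pitch"] for w in members], k)
        for w in members:
            cluster = min(range(k), key=lambda i: abs(w["pitch"] - centers[i]))
            tags[w["text"]] = f"{name}-{cluster}"
    return tags

words = [
    {"text": "go", "phones": ["G", "OW"], "pitch": 100.0},
    {"text": "no", "phones": ["N", "OW"], "pitch": 200.0},
    {"text": "window", "phones": ["W", "IH", "N", "D", "OW"], "pitch": 110.0},
    {"text": "happy", "phones": ["HH", "AE", "P", "IY", "X"], "pitch": 210.0},
]
tags = tag_words(words)
```

Clustering inside each group means the label sets are independent: cluster 0 for short words and cluster 0 for long words are different tags, which matches the assumption stated in the abstract.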

Keywords: Speech synthesis

AUDIO-TEXT RETRIEVAL IN CONTEXT

arXiv, 2022

Authors: Lou, Siyu; Xu, Xuenan; Wu, Mengyue; Yu, Kai. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling, which directly works with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and results are compared with the previous state-of-the-art system. With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval, on all metrics including recall, median and mean rank. Copyright © 2022, The Authors. All rights reserved.
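The metrics named in the abstract (recall, median rank, mean rank) can be computed from a query-by-candidate similarity matrix. The sketch below assumes exactly one correct candidate per query, located on the diagonal; real benchmarks with several captions per audio clip need a slightly different bookkeeping. The function names and toy scores are hypothetical.

```python
from statistics import mean, median

def rank_of_match(similarities, correct_index):
    """1-based rank of the correct candidate under descending similarity."""
    target = similarities[correct_index]
    return 1 + sum(1 for s in similarities if s > target)

def retrieval_metrics(similarity_matrix, ks=(1, 5, 10)):
    """similarity_matrix[i][j]: score of query i against candidate j.

    Candidate i is assumed to be the single correct match for query i.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(similarity_matrix)]
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["median_rank"] = median(ranks)
    metrics["mean_rank"] = mean(ranks)
    return metrics

# Three text queries scored against three audio clips.
scores = [
    [0.9, 0.1, 0.2],   # correct clip ranked 1st
    [0.8, 0.3, 0.5],   # correct clip ranked 3rd
    [0.1, 0.2, 0.7],   # correct clip ranked 1st
]
metrics = retrieval_metrics(scores, ks=(1, 2))
```

Transposing the matrix gives the other retrieval direction (audio-to-text instead of text-to-audio), which is how a bidirectional evaluation can reuse the same function.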

Keywords: Semantics

The SJTU X-LANCE Lab System for CNSRC 2022

arXiv, 2022

Authors: Chen, Zhengyang; Liu, Bei; Han, Bing; Zhang, Leying; Qian, Yanmin. Affiliation: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

This technical report describes the SJTU X-LANCE Lab system for the three tracks in CNSRC 2022. In this challenge, we explored the speaker embedding modeling ability of deep ResNet (Deeper r-vector). All the systems are trained only on the Cnceleb training set, and we use the same systems for all three tracks. Our system ranks first in the fixed track of the speaker verification task. Our best single system and fusion system achieve 0.3164 and 0.2975 minDCF, respectively. In addition, we submitted the result of ResNet221 to the speaker retrieval track and achieved 0.4626 mAP. More importantly, we have helped the wespeaker [1] toolkit reproduce our result: https://***/wenet-e2e/wespeaker. Copyright © 2022, The Authors. All rights reserved.

Keywords: