Search Results - Inner Mongolia University Library

When and Why is Unsupervised Neural Machine Translation Useless?

22nd Annual Conference of the European Association for Machine Translation, EAMT 2020

Authors: Kim, Yunsu; Graça, Miguel; Ney, Hermann (Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany)

ISBN: (print) 9789893305898

This paper studies the practicality of the current state-of-the-art unsupervised methods in neural machine translation (NMT). In ten translation tasks with various data settings, we analyze the conditions under which the unsupervised methods fail to produce reasonable translations. We show that their performance is severely affected by linguistic dissimilarity and domain mismatch between source and target monolingual data. Such conditions are common for low-resource language pairs, where unsupervised learning works poorly. In all of our experiments, supervised and semi-supervised baselines with 50k-sentence bilingual data outperform the best unsupervised results. Our analyses pinpoint the limits of the current unsupervised NMT and also suggest immediate research directions. © 2020 The authors.

Keywords: Neural machine translation

Dynamic Encoder Size Based on Data-Driven Layer-wise Pruning for Speech Recognition

arXiv, 2024

Authors: Xu, Jingjing; Zhou, Wei; Yang, Zijian; Beck, Eugen; Schlüter, Ralf (Machine Learning and Human Language Technology Group, Computer Science Dept., RWTH Aachen University, Germany; AppTek GmbH, Aachen 52062, Germany)

Varying-size models are often required to deploy ASR systems under different hardware and/or application constraints such as memory and latency. To avoid redundant training and optimization efforts for individual models of different sizes, we present the dynamic encoder size approach, which jointly trains multiple performant models within one supernet from scratch. These subnets of various sizes are layer-wise pruned from the supernet, and thus, enjoy full parameter sharing. By combining score-based pruning with supernet training, we propose two novel methods, Simple-Top-k and Iterative-Zero-Out, to automatically select the best-performing subnets in a data-driven manner, avoiding resource-intensive search efforts. Our experiments using CTC on both Librispeech and TED-LIUM-v2 corpora show that our methods can achieve on-par performance as individually trained models of each size category. Also, our approach consistently brings small performance improvements for the full-size supernet. Copyright © 2024, The Authors. All rights reserved.

Keywords: (none listed)
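
The abstract above describes selecting layer-wise pruned subnets from a supernet by score. As a rough illustration only, the sketch below keeps the k highest-scoring layers and shows that smaller selections nest inside larger ones, which is what makes full parameter sharing possible. The function name `select_top_k_layers` and the score values are hypothetical; how the paper learns its scores, and the details of Simple-Top-k and Iterative-Zero-Out, are not given in this record.

```python
def select_top_k_layers(scores, k):
    """Return the indices of the k highest-scoring layers, in network order.

    `scores` is a hypothetical list with one importance score per encoder layer.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of layers")
    # Rank layer indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k and restore the original layer order.
    return sorted(ranked[:k])


# Made-up scores for a 6-layer encoder (illustrative values only).
layer_scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8]
small_subnet = select_top_k_layers(layer_scores, 3)
medium_subnet = select_top_k_layers(layer_scores, 4)
print(small_subnet, medium_subnet)
```

Because both selections come from one ranking, every layer of the smaller subnet also belongs to the larger one.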

Efficient Training of Neural Transducer for Speech Recognition

arXiv, 2022

Authors: Zhou, Wei; Michel, Wilfried; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-Transducer has achieved evolving performance with more and more sophisticated neural network models of growing size and increasing training epochs. While strong computation resources seem to be the prerequisite of training superior models, we try to overcome it by carefully designing a more efficient training pipeline. In this work, we propose an efficient 3-stage progressive training pipeline to build highly-performing neural transducer models from scratch with very limited computation resources in a reasonable short time period. The effectiveness of each stage is experimentally verified on both Librispeech and Switchboard corpora. The proposed pipeline is able to train transducer models approaching state-of-the-art performance with a single GPU in just 2-3 weeks. Our best conformer transducer achieves 4.1% WER on Librispeech test-other with only 35 epochs of training. Copyright © 2022, The Authors. All rights reserved.

Keywords: Speech recognition

Sample drop detection for asynchronous devices distributed in space

28th European Signal Processing Conference, EUSIPCO 2020

Authors: Raissi, Tina; Pascual, Santiago; Omologo, Maurizio (Human Language Technology and Pattern Recognition, RWTH Aachen University, Aachen, Germany; Universitat Politècnica de Catalunya, Barcelona, Spain; Trento, Italy; Dolby Laboratories, Barcelona, Spain)

ISBN: (print) 9789082797053

In many applications of multi-microphone multi-device processing, the synchronization among different input channels can be affected by the lack of a common clock and isolated drops of samples. In this work, we address the issue of sample drop detection in the context of a conversational speech scenario, recorded by a set of microphones distributed in space. The goal is to design a neural-based model that given a short window in the time domain, detects whether one or more devices have been subjected to a sample drop event. The candidate time windows are selected from a set of large time intervals, possibly including a sample drop, and by using a preprocessing step. The latter is based on the application of normalized cross-correlation between signals acquired by different devices. The architecture of the neural network relies on a CNN-LSTM encoder, followed by multi-head attention. The experiments are conducted using both artificial and real data. Our proposed approach obtained F1 score of 88% on an evaluation set extracted from the CHiME-5 corpus. A comparable performance was found in a larger set of experiments conducted on a set of multi-channel artificial scenes. © 2021 European Signal Processing Conference, EUSIPCO. All rights reserved.

Keywords: Drops
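
The abstract above mentions a preprocessing step based on normalized cross-correlation between signals from different devices. A minimal sketch of that general technique follows, assuming two equally sampled channels and a small search range of lags. The helper names, the synthetic signal, and the 3-sample offset are all illustrative assumptions, not the authors' actual pipeline.

```python
import math


def normalized_cross_correlation(x, y, lag):
    """Normalized cross-correlation between x[i + lag] and y[i] over their overlap."""
    pairs = [(x[i + lag], y[i]) for i in range(len(y)) if 0 <= i + lag < len(x)]
    if len(pairs) < 2:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    den = math.sqrt(sum((a - mean_a) ** 2 for a, _ in pairs)
                    * sum((b - mean_b) ** 2 for _, b in pairs))
    return num / den if den > 0 else 0.0


def best_lag(x, y, max_lag):
    """Return the lag in [-max_lag, max_lag] with the highest correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: normalized_cross_correlation(x, y, lag))


# Synthetic test signal; device_b sees the same signal but starts 3 samples later.
signal = [math.sin(0.3 * i) + 0.5 * math.sin(0.11 * i) for i in range(200)]
device_a = signal
device_b = signal[3:]
estimated = best_lag(device_a, device_b, max_lag=10)
print("estimated lag:", estimated)
```

Under these assumptions, a sample drop on one device would appear as a change in the estimated lag between successive windows, which is one plausible way to pick candidate windows for a detector.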

MONOTONIC SEGMENTAL ATTENTION FOR AUTOMATIC SPEECH RECOGNITION

arXiv, 2022

Authors: Zeyer, Albert; Schmitt, Robin; Zhou, Wei; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52062, Germany; AppTek GmbH, Aachen 52062, Germany)

We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental model generalizes much better to long sequences of up to several minutes. © 2022, CC BY-SA.

Keywords: Decoding

LATTICE-FREE SEQUENCE DISCRIMINATIVE TRAINING FOR PHONEME-BASED NEURAL TRANSDUCERS

arXiv, 2022

Authors: Yang, Zijian; Zhou, Wei; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid models, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypotheses generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain 40% - 70% relative training time speedup with a small degradation in performance. Copyright © 2022, The Authors. All rights reserved.

Keywords: Transducers

Improving the Training Recipe for a Robust Conformer-based Hybrid Model

arXiv, 2022

Authors: Zeineldeen, Mohammad; Xu, Jingjing; Lüscher, Christoph; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

Speaker adaptation is important to build robust automatic speech recognition (ASR) systems. In this work, we investigate various methods for speaker adaptive training (SAT) based on feature-space approaches for a conformer-based acoustic model (AM) on the Switchboard 300h dataset. We propose a method, called Weighted-Simple-Add, which adds weighted speaker information vectors to the input of the multi-head self-attention module of the conformer AM. Using this method for SAT, we achieve 3.5% and 4.5% relative improvement in terms of WER on the CallHome part of Hub5'00 and Hub5'01 respectively. Moreover, we build on top of our previous work where we proposed a novel and competitive training recipe for a conformer-based hybrid AM. We extend and improve this recipe where we achieve 11% relative improvement in terms of word-error-rate (WER) on Switchboard 300h Hub5'00 dataset. We also make this recipe efficient by reducing the total number of parameters by 34% relative. Copyright © 2022, The Authors. All rights reserved.

Keywords: Speech recognition
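
Several abstracts in this list report word error rate (WER) and relative improvements in WER. For readers unfamiliar with the metric, here is a small sketch of the standard definitions: WER is the word-level edit distance divided by the number of reference words, and relative improvement is the WER reduction divided by the baseline WER. The example sentences and the 10.0 to 8.9 figures are made-up illustrations, not results from any of the listed papers.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: one substitution and one deletion in four words.
wer = word_error_rate("a b c d", "a x c")
gain = relative_improvement(10.0, 8.9)
print(f"WER = {wer:.2f}, relative improvement = {gain:.1%}")
```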

Combining TF-GridNet And Mixture Encoder For Continuous Speech Separation For Meeting Transcription

IEEE Spoken Language Technology Workshop

Authors: Peter Vieting; Simon Berger; Thilo von Neumann; Christoph Boeddeker; Ralf Schlüter; Reinhold Haeb-Umbach (Machine Learning and Human Language Technology Group, RWTH Aachen University, Germany; AppTek GmbH, Germany; Paderborn University, Germany)

ISBN: (digital) 9798350392258

ISBN: (print) 9798350392265

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extended the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by the integration with separators of varying strength including TF-GridNet. Our experiments result in a new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation independent of mixture encoding. We further investigate the remaining potential for improvement.

Keywords: Particle separators; Conferences; Encoding; Data models; Reverberation; Speech processing; Streams; Microphones; Automatic speech recognition

ENHANCING AND ADVERSARIAL: IMPROVE ASR WITH SPEAKER LABELS

arXiv, 2022

Authors: Zhou, Wei; Wu, Haotian; Xu, Jingjing; Zeineldeen, Mohammad; Lüscher, Christoph; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

ASR can be improved by multi-task learning (MTL) with domain enhancing or domain adversarial training, which are two opposite objectives with the aim to increase/decrease domain variance towards domain-aware/agnostic ASR, respectively. In this work, we study how to best apply these two opposite objectives with speaker labels to improve conformer-based ASR. We also propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort. Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) to apply speaker enhancing and adversarial training. We also explore their combination for further improvement, achieving the same performance as i-vectors plus adversarial training. Our best speaker-based MTL achieves 7% relative improvement on the Switchboard Hub5’00 set. We also investigate the effect of such speaker-based MTL w.r.t. cleaner dataset and weaker ASR NN. Copyright © 2022, The Authors. All rights reserved.

Keywords: Linearization