Search results - Inner Mongolia University Library

The RWTH ASR system for TED-LIUM release 2: improving hybrid HMM with SpecAugment

arXiv, 2020

Authors: Zhou, Wei; Michel, Wilfried; Irie, Kazuki; Kitza, Markus; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing model size and training time. A subsequent sMBR training is applied to fine-tune the final acoustic model, and both LSTM and Transformer language models are trained and evaluated. Our best system achieves a 5.6% WER on the test set, which outperforms the previous state-of-the-art by 27% relative. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition
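
The abstract above applies SpecAugment masking to the acoustic features. The following is a minimal sketch of time and frequency masking on a plain `[time][feature]` matrix. The function name `spec_augment`, its parameters, and the choice to fill masked regions with zeros are illustrative assumptions; the masking configurations actually investigated by the authors are not given in this record.

```python
import random


def spec_augment(features, max_time_mask=2, max_freq_mask=2, num_masks=1, seed=None):
    """Return a masked copy of a [time][feature] matrix (hypothetical helper).

    Each round zeroes one block of consecutive frames (time mask) and one
    block of consecutive feature bins (frequency mask). Mask widths are
    drawn uniformly from 0..max_*_mask, so a mask may be empty.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    num_bins = len(features[0]) if features else 0
    masked = [list(row) for row in features]  # do not modify the input

    for _ in range(num_masks):
        # Time mask: zero out `width` consecutive frames.
        width = rng.randint(0, min(max_time_mask, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [0.0] * num_bins

        # Frequency mask: zero out `width` consecutive bins in every frame.
        width = rng.randint(0, min(max_freq_mask, num_bins))
        start = rng.randint(0, num_bins - width)
        for row in masked:
            for f in range(start, start + width):
                row[f] = 0.0
    return masked
```

Because masking only overwrites existing feature values, it changes neither the input dimensions nor the number of model parameters, which is consistent with the abstract's remark that model size does not increase.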

CONFORMER-BASED HYBRID ASR SYSTEM FOR SWITCHBOARD DATASET

arXiv, 2021

Authors: Zeineldeen, Mohammad; Xu, Jingjing; Lüscher, Christoph; Michel, Wilfried; Gerstenberger, Alexander; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures achieving state-of-the-art performance on different datasets. To our best knowledge, the impact of using conformer acoustic model for hybrid ASR is not investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe. We study different training aspects and methods to improve word-error-rate as well as to increase training speed. We apply time downsampling methods for efficient training and use transposed convolutions to upsample the output sequence again. We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results compared to other architectures. It generalizes very well on Hub5'01 test set and outperforms the BLSTM-based hybrid model significantly. Copyright © 2021, The Authors. All rights reserved.

Keywords: Speech recognition
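
The abstract above mentions time downsampling followed by transposed convolutions that upsample the output sequence again. A minimal single-channel sketch of that idea follows. The helper names `downsample` and `transposed_conv1d`, the use of frame skipping for downsampling, and the fixed kernel are assumptions for illustration; the paper's actual layers are learned and multi-channel.

```python
def downsample(sequence, stride):
    """Keep every `stride`-th frame (one simple form of time downsampling)."""
    return sequence[::stride]


def transposed_conv1d(sequence, kernel, stride):
    """Single-channel 1D transposed convolution without padding.

    Each input value is scaled by the kernel and added into the output at
    offset i * stride, giving (len(sequence) - 1) * stride + len(kernel)
    output frames.
    """
    if not sequence:
        return []
    out = [0.0] * ((len(sequence) - 1) * stride + len(kernel))
    for i, value in enumerate(sequence):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


frames = [1.0, 9.0, 2.0, 9.0, 3.0, 9.0]
short = downsample(frames, stride=2)                 # [1.0, 2.0, 3.0]
restored = transposed_conv1d(short, [1.0, 1.0], 2)   # back to 6 frames
```

With a kernel as long as the stride, the output length equals the original sequence length, so frame-level targets of a hybrid model can still be aligned one-to-one with the upsampled outputs.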

RADMM: RECURRENT ADAPTIVE MIXTURE MODEL WITH APPLICATIONS TO DOMAIN ROBUST LANGUAGE MODELING

IEEE International Conference on Acoustics, Speech and Signal Processing

Authors: Kazuki Irie; Shankar Kumar; Michael Nirschl; Hank Liao. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany; Google Inc., New York, NY 10011, USA

ISBN: (print) 9781538646595

We present a new architecture and a training strategy for an adaptive mixture of experts with applications to domain robust language modeling. The proposed model is designed to benefit from the scenario where the training data are available in diverse domains as is the case for YouTube speech recognition. The two core components of our model are an ensemble of parallel long short-term memory (LSTM) expert layers for each domain and another LSTM based network which generates state dependent mixture weights for combining expert LSTM states by linear interpolation. The resulting model is a recurrent adaptive mixture model (RADMM) of domain experts. We train our model on 4.4B words from YouTube speech recognition data. We report results on the YouTube speech recognition test set. Compared with a background LSTM model, we obtain up to 12% relative improvement in perplexity and an improvement in word error rate from 12.3% to 12.1% while using a lattice rescoring with strong pruning.

Keywords: language modeling; neural networks; speech recognition; mixture of experts; domain adaptation; modelling languages; Speech recognition; Neural network; Professional personnel; YouTube; Mixture models; training policy; new buildings
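
The abstract above describes combining the states of per-domain LSTM experts by linear interpolation, with weights produced by a separate mixer network. The combination step alone can be sketched as below. The function names and the use of a softmax to normalize the mixer outputs are assumptions; the record does not state how the weights are normalized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_expert_states(expert_states, mixer_logits):
    """Linearly interpolate expert state vectors at one time step.

    expert_states: one state vector per domain expert (equal lengths).
    mixer_logits: one unnormalized score per expert, assumed to come
    from the mixer network for the current state.
    """
    weights = softmax(mixer_logits)
    dim = len(expert_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, expert_states))
        for i in range(dim)
    ]


# Two experts with equal scores contribute equally.
mixed = mix_expert_states([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Since the weights are recomputed at every step from the current state, the mixture can shift between domain experts within a single sentence rather than committing to one domain per utterance.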

Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias

arXiv, 2020

Authors: Zhou, Wei; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

As one popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer the length bias and corresponding beam problem. Different approaches have been applied in simple beam search to ease the problem, most of which are heuristic-based and require considerable tuning. We show that heuristics are not proper modeling refinement, which results in severe performance degradation with largely increased beam sizes. We propose a novel beam search derived from reinterpreting the sequence posterior with an explicit length modeling. By applying the reinterpreted probability together with beam pruning, the obtained final probability leads to a robust model modification, which allows reliable comparison among output sequences of different lengths. Experimental verification on the LibriSpeech corpus shows that the proposed approach solves the length bias problem without heuristics or additional tuning effort. It provides robust decision making and consistently good performance under both small and very large beam sizes. Compared with the best results of the heuristic baseline, the proposed approach achieves the same WER on the 'clean' sets and 4% relative improvement on the 'other' sets. We also show that it is more efficient with the additional derived early stopping criterion. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition
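
To illustrate the length bias that the abstract above refers to, the toy example below compares two hypotheses by their summed token log-probabilities and by a common heuristic, length normalization. This shows only the problem and one heuristic baseline of the kind the abstract criticizes; it is not the paper's proposed beam search, and the scores are made-up values.

```python
def sequence_log_prob(token_log_probs):
    """Plain sequence score: sum of per-token log-probabilities."""
    return sum(token_log_probs)


def length_normalized(token_log_probs):
    """Heuristic score: average log-probability per token."""
    return sum(token_log_probs) / len(token_log_probs)


# Hypothetical hypotheses: a short one with weak tokens and a longer one
# whose individual tokens are all more probable.
short_hyp = [-0.9, -0.9]
long_hyp = [-0.5, -0.5, -0.5, -0.5, -0.5]

prefers_short = sequence_log_prob(short_hyp) > sequence_log_prob(long_hyp)
prefers_long = length_normalized(long_hyp) > length_normalized(short_hyp)
```

Every additional token adds a non-positive term to the plain score, so shorter hypotheses tend to win as the beam grows. Normalization reverses the ranking here, but it is exactly the sort of tuned heuristic that the abstract reports as degrading with large beam sizes.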

A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models

arXiv, 2020

Authors: Zeineldeen, Mohammad; Zeyer, Albert; Zhou, Wei; Ng, Thomas; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52062 Aachen, Germany; AppTek GmbH, Aachen 52062, Germany

Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set by auxiliary units to be able to distinguish homophones. Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive to grapheme-based encoder-decoder-attention modeling. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition

Comparing the benefit of synthetic training data for various automatic speech recognition architectures

arXiv, 2021

Authors: Rossenbach, Nick; Zeineldeen, Mohammad; Hilmes, Benedikt; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures which tend to suffer from over-fitting in low resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech system (TTS) if additional text is available. This was successfully applied in many publications with AED systems, but only to a very limited extent in the context of other ASR architectures. We investigate the effect of varying pre-processing, the speaker embedding and input encoding of the TTS system w.r.t. the effectiveness of the synthesized data for AED-ASR training. Additionally, we also consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone based system using connectionist-temporal-classification (CTC) and a monotonic transducer based system. We show that for the latter systems the addition of synthetic data has no relevant effect, but they still outperform the AED systems on LibriSpeech-100h. We achieve a final word-error-rate of 3.3%/10.0% with a hybrid system on the clean/noisy test-sets, surpassing any previous state-of-the-art systems on LibriSpeech-100h that do not include unlabeled audio data. Copyright © 2021, The Authors. All rights reserved.

Keywords: Speech recognition

Confidence scores for acoustic model adaptation

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Christian Gollan; Michiel Bacchiani. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department 6, RWTH Aachen University, Germany; Google Inc., New York, NY, USA

This paper focuses on confidence scores for use in acoustic model adaptation. Frame-based confidence estimates are used in linear transform (CMLLR and MLLR) and MAP adaptation. We show that adaptation approaches with a limited number of free parameters such as linear transform-based approaches are robust in the face of frame labeling errors whereas adaptation approaches with a large number of free parameters such as MAP are sensitive to the quality of the supervision and hence benefit most from use of confidences. Different approaches for using confidence information in adaptation are investigated. This analysis shows that a thresholding approach is effective in that it improves the frame labeling accuracy with little detrimental effect on frame recall. Experimental results show an absolute WER reduction of 2.1% over a CMLLR adapted system on a video transcription task.

Keywords: Adaptation model; Maximum likelihood linear regression; Parameter estimation; Speech; Robustness; Labeling; Multimedia systems; humans; pattern recognition; computer science
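
The thresholding approach described in the abstract above can be sketched as a filter over frame-level confidence estimates, together with the two quantities the abstract mentions: frame labeling accuracy and frame recall. The tuple layout, the function names, the definition of recall as the share of correctly labeled frames that survive the filter, and the toy values are all assumptions made for illustration.

```python
def select_confident_frames(frames, threshold):
    """Keep (hypothesis, reference, confidence) frames at or above threshold."""
    return [frame for frame in frames if frame[2] >= threshold]


def labeling_accuracy(frames):
    """Fraction of frames whose hypothesized label matches the reference."""
    if not frames:
        return 0.0
    return sum(1 for hyp, ref, _ in frames if hyp == ref) / len(frames)


def frame_recall(all_frames, kept_frames):
    """Fraction of correctly labeled frames that survive the filter."""
    correct_total = sum(1 for hyp, ref, _ in all_frames if hyp == ref)
    if correct_total == 0:
        return 0.0
    correct_kept = sum(1 for hyp, ref, _ in kept_frames if hyp == ref)
    return correct_kept / correct_total


# Toy supervision: three correct and two incorrect frame labels.
frames = [
    ("a", "a", 0.9),
    ("b", "b", 0.8),
    ("a", "b", 0.3),
    ("c", "c", 0.6),
    ("b", "a", 0.7),
]
kept = select_confident_frames(frames, threshold=0.5)
```

In this toy case the filter raises labeling accuracy from 0.6 to 0.75 while keeping every correctly labeled frame. Only the kept frames would then contribute to the adaptation statistics, which matters most for methods with many free parameters such as MAP.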

ROBUST KNOWLEDGE DISTILLATION FROM RNN-T MODELS WITH NOISY TRAINING LABELS USING FULL-SUM LOSS

arXiv, 2023

Authors: Zeineldeen, Mohammad; Audhkhasi, Kartik; Baskar, Murali Karthick; Ramabhadran, Bhuvana. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; Google LLC, New York, United States

This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNNT) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft distillation between RNNT architectures having different posterior distributions is challenging. In addition, bad teachers having high word-error-rate (WER) reduce the efficacy of KD. We investigate how to effectively distill knowledge from variable quality ASR teachers, which has not been studied before to the best of our knowledge. We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models, especially for bad teachers. We also propose a variant of full-sum distillation that distills the sequence discriminative knowledge of the teacher leading to further improvement in WER. We conduct experiments on public datasets namely SpeechStew and LibriSpeech, and on in-house production data. Copyright © 2023, The Authors. All rights reserved.

Keywords: Recurrent neural networks

Warp that smile on your face: Optimal and smooth deformations for face recognition

International Conference on Automatic Face and Gesture Recognition

Authors: Tobias Gass; Leonid Pishchulin; Philippe Dreuw; Hermann Ney. Affiliations: Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Germany; Computer Vision Laboratory, ETH Zurich, Switzerland; Computer Vision and Multimodal Computing, MPI Informatics, Saarbruecken, Germany

In this work, we present novel warping algorithms for full 2D pixel-grid deformations for face recognition. Due to high variation in face appearance, face recognition is considered a very difficult task, especially if only a single reference image, for example a mug-shot, per face is available. Usually model-based approaches with additional training data are used to cope with several types of variation occurring in facial imaging. Image warping contrarily yields a distance measure which is invariant with regard to several types of variation. This allows for precise recognition even using only very few reference observations. Due to the computationally complex problem of optimal 2D warping, pseudo-2D warping-based approaches in the past represented strong approximations of the original problem, and were mainly successful on data with low variability or rectified images. We propose a novel 2D warping method which is globally optimal and makes no prior assumptions on the data variability besides two-dimensional smoothness constraints which both avoid local mirroring and gaps and significantly speed up the optimization. Furthermore, we show that occlusion handling is imperative to obtain smooth warpings in a variety of domains. We evaluate our novel algorithm on various well known databases, such as the AR-Face and CMU-PIE database, and provide a detailed comparison to existing warping approaches. We show that by using simple relative 2D constraints, strong local features and a kernel, which is robust w.r.t. occlusions, our computationally complex approaches outperform state-of-the-art results for recognizing faces under varying expressions, occlusions and poses. Most interestingly, we achieve higher accuracy using fewer training instances per class compared to methods learning a model of the 3D shape.

Keywords: Pixel; Face; Face recognition; Optimization; Databases; Approximation methods; Hidden Markov models