Search results - Inner Mongolia University Library

CONFORMER-BASED HYBRID ASR SYSTEM FOR SWITCHBOARD DATASET

arXiv, 2021

Authors: Zeineldeen, Mohammad; Xu, Jingjing; Lüscher, Christoph; Michel, Wilfried; Gerstenberger, Alexander; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany; AppTek GmbH, 52062 Aachen, Germany

The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures, achieving state-of-the-art performance on different datasets. To the best of our knowledge, the impact of using a conformer acoustic model for hybrid ASR has not been investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe. We study different training aspects and methods to improve word-error-rate as well as to increase training speed. We apply time downsampling methods for efficient training and use transposed convolutions to upsample the output sequence again. We conduct experiments on the Switchboard 300h dataset, and our conformer-based hybrid model achieves competitive results compared to other architectures. It generalizes very well on the Hub5'01 test set and outperforms the BLSTM-based hybrid model significantly. Copyright © 2021, The Authors. All rights reserved.

Keywords: Speech recognition
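
A hybrid model is typically trained against frame-level targets, so after time downsampling the output sequence has to regain the input length. The following Python sketch shows only the length bookkeeping for that step. The function names, the factor of 3, and the choice of kernel size equal to stride are illustrative assumptions, not settings taken from the paper; the transposed-convolution length formula is the standard one for dilation 1 and no output padding.

```python
def downsampled_length(num_frames: int, factor: int) -> int:
    """Frames left after downsampling by `factor`, assuming ceil-style padding."""
    return -(-num_frames // factor)  # ceiling division


def transposed_conv_length(num_frames: int, kernel_size: int, stride: int,
                           padding: int = 0) -> int:
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (num_frames - 1) * stride - 2 * padding + kernel_size


def restore_length(num_frames: int, factor: int) -> int:
    """Downsample by `factor`, upsample with kernel_size == stride == factor, then crop."""
    short = downsampled_length(num_frames, factor)
    upsampled = transposed_conv_length(short, kernel_size=factor, stride=factor)
    # Crop the frames introduced by padding so outputs align with the targets.
    return min(upsampled, num_frames)


# Hypothetical example: 100 input frames, downsampling factor 3.
print(downsampled_length(100, 3), restore_length(100, 3))
```

With kernel size equal to stride, the upsampled length is always at least the original length, so a final crop is enough to realign outputs and targets.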

The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment

arXiv, 2020

Authors: Zhou, Wei; Michel, Wilfried; Irie, Kazuki; Kitza, Markus; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany; AppTek GmbH, 52062 Aachen, Germany

We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing model size and training time. A subsequent sMBR training is applied to fine-tune the final acoustic model, and both LSTM and Transformer language models are trained and evaluated. Our best system achieves a 5.6% WER on the test set, which outperforms the previous state-of-the-art by 27% relative. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition
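
The maskings mentioned in the abstract can be pictured with a small Python sketch of SpecAugment-style time and frequency masking. It is a simplified illustration, not the authors' configuration: it applies one mask of each kind, uses 0.0 as the mask value, and omits time warping. The function name and the `max_time_mask` and `max_freq_mask` parameters are hypothetical.

```python
import random


def spec_augment(features, max_time_mask=2, max_freq_mask=2, rng=None):
    """Return a masked copy of `features` (a list of frames, each a list of floats).

    One random block of consecutive frames and one random block of
    consecutive feature bins are set to 0.0. The input is not modified.
    """
    if rng is None:
        rng = random.Random()
    num_frames = len(features)
    num_bins = len(features[0])
    masked = [list(frame) for frame in features]

    # Time mask: zero out up to `max_time_mask` consecutive frames.
    t_width = rng.randint(0, min(max_time_mask, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * num_bins

    # Frequency mask: zero out up to `max_freq_mask` consecutive bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for frame in masked:
        for f in range(f_start, f_start + f_width):
            frame[f] = 0.0
    return masked


# Hypothetical toy input: 6 frames with 4 feature bins each.
toy = [[1.0] * 4 for _ in range(6)]
augmented = spec_augment(toy, rng=random.Random(0))
```

Because masking only overwrites existing values, it changes neither the feature dimensions nor the model, which is consistent with the abstract's remark that model size is not increased.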

Improved training of end-to-end attention models for speech recognition

arXiv, 2018

Authors: Zeyer, Albert; Irie, Kazuki; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52062 Aachen, Germany; AppTek, United States; NNAISENSE, Switzerland

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model. Copyright © 2018, The Authors. All rights reserved.

Keywords: Long short-term memory
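
Shallow fusion, as the term is commonly used, combines the recognizer's score with a scaled language model score in log space during search. The Python sketch below illustrates that combination for a single decoding step. The function names, the toy probabilities, and the scale of 0.3 are assumptions for illustration and are not taken from the paper.

```python
import math


def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_scale):
    """Log-linear combination: log p_asr + lm_scale * log p_lm."""
    return asr_log_prob + lm_scale * lm_log_prob


def pick_next_token(asr_log_probs, lm_log_probs, lm_scale=0.3):
    """Choose the token with the best fused score among the ASR candidates."""
    return max(
        asr_log_probs,
        key=lambda tok: shallow_fusion_score(
            asr_log_probs[tok], lm_log_probs[tok], lm_scale
        ),
    )


# Hypothetical toy distributions over two homophones.
asr = {"there": math.log(0.55), "their": math.log(0.45)}
lm = {"there": math.log(0.10), "their": math.log(0.90)}

print(pick_next_token(asr, lm, lm_scale=0.0))  # acoustic evidence only
print(pick_next_token(asr, lm, lm_scale=0.3))  # language model can change the choice
```

The scale controls how much the language model may override the recognizer and is normally tuned on held-out data.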

Open Vocabulary Arabic Handwriting Recognition Using Morphological Decomposition

International Conference on Document Analysis and Recognition

Authors: Mahdi Hamdani; Amr El-Desoky Mousa; Hermann Ney. Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Germany; Rheinisch-Westfälische Technische Hochschule Aachen, Aachen, Nordrhein-Westfalen, DE

ISBN: (print) 9781479901937

Language Models (LMs) are a very important component in large and open vocabulary recognition systems. This paper presents an open-vocabulary approach for Arabic handwriting recognition. The proposed approach makes use of Arabic word decomposition based on morphological analysis. The vocabulary is a combination of words and sub-words obtained by the decomposition process. Out Of Vocabulary (OOV) words can be recognized by combining different elements from the lexicon. The recognition system is based on Hidden Markov Models (HMMs) with position and context dependent character models. An n-gram LM trained on the decomposed text is used along with the HMMs during the search. The approach is evaluated using two Arabic handwriting datasets. The open vocabulary approach leads to a significant improvement in the system performance. Two different types of experiments for two Arabic handwriting recognition tasks are conducted in this work. The proposed open-vocabulary approach yields an absolute improvement of up to 1% in the Word Error Rate (WER) for the constrained task and keeps the same performance as the baseline system for the unconstrained one.

Keywords: Vocabulary; Hidden Markov models; Handwriting recognition; Training; Feature extraction; Context; Character recognition
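
The idea of recognizing out-of-vocabulary words by combining lexicon elements can be sketched in Python as a recombination step over recognized units. Everything specific here is an assumption for illustration: the trailing "+" marker for prefix units, the transliterated example, and the function name are hypothetical and do not reproduce the paper's decomposition scheme, which may also split off suffixes.

```python
def recombine(units, marker="+"):
    """Join recognized units into words.

    A unit ending in `marker` is treated as a prefix that attaches to the
    unit that follows it; any other unit closes the current word.
    """
    words = []
    current = ""
    for unit in units:
        if unit.endswith(marker):
            current += unit[: -len(marker)]  # prefix: keep the word open
        else:
            words.append(current + unit)     # stem or full word: close it
            current = ""
    if current:
        words.append(current)                # dangling prefix at the end
    return words


# Hypothetical lexicon entries "al+" and "kitab" combine into a word that
# need not be listed in the lexicon as a whole.
print(recombine(["al+", "kitab", "kabir"]))
```

A lexicon of prefixes and stems of this kind covers many more surface forms than a word list of the same size, which is the motivation for the sub-word vocabulary.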

May the force be with you: Force-aligned signwriting for automatic subunit annotation of corpora

International Conference on Automatic Face and Gesture Recognition

Authors: Oscar Koller; Hermann Ney; Richard Bowden. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK; Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Germany

ISBN: (print) 9781467355452

We propose a method to generate linguistically meaningful subunits in a fully automated fashion for sign language corpora. The ability to automate the process of subunit annotation has profound effects on the data available for training sign language recognition systems. The approach is based on the idea that subunits are shared among different signs. With sufficient data and knowledge of possible signing variants, accurate automatic subunit sequences are produced, matching the specific characteristics of given sign language data. Specifically, we demonstrate how an iterative forced alignment algorithm can be used to transfer the knowledge of a user-edited open sign language dictionary to the task of annotating a challenging, large-vocabulary, multi-signer corpus recorded from public TV. Existing approaches focus on labour-intensive manual subunit annotations or on data-driven approaches. Our method yields an average precision and recall of 15% under the maximum achievable accuracy with little user intervention beyond providing a simple word gloss.

Keywords: Assistive technology; Gesture recognition; Dictionaries; Hidden Markov models; Databases; Manuals; Vocabulary

Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias

arXiv, 2020

Authors: Zhou, Wei; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany; AppTek GmbH, 52062 Aachen, Germany

As one popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer from length bias and the corresponding beam problem. Different approaches have been applied in simple beam search to ease the problem, most of which are heuristic-based and require considerable tuning. We show that heuristics are not a proper modeling refinement, which results in severe performance degradation with largely increased beam sizes. We propose a novel beam search derived from reinterpreting the sequence posterior with an explicit length modeling. By applying the reinterpreted probability together with beam pruning, the obtained final probability leads to a robust model modification, which allows reliable comparison among output sequences of different lengths. Experimental verification on the LibriSpeech corpus shows that the proposed approach solves the length bias problem without heuristics or additional tuning effort. It provides robust decision making and consistently good performance under both small and very large beam sizes. Compared with the best results of the heuristic baseline, the proposed approach achieves the same WER on the 'clean' sets and 4% relative improvement on the 'other' sets. We also show that it is more efficient with the additional derived early stopping criterion. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition
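
The length bias described in the abstract can be made concrete with a few lines of Python. A sequence score is a sum of per-token log probabilities, so every extra token can only lower it, and a short hypothesis may beat a longer one whose individual tokens are all more probable. The sketch also shows length normalization, one of the common heuristics of the kind the abstract argues against. It does not implement the paper's proposed method, and the toy probabilities and function names are assumptions.

```python
import math


def sequence_log_prob(token_log_probs):
    """Plain sequence score: sum of per-token log probabilities."""
    return sum(token_log_probs)


def length_normalized(token_log_probs):
    """Heuristic score: average log probability per token."""
    return sum(token_log_probs) / len(token_log_probs)


# Hypothetical hypotheses: one token at p=0.6 versus five tokens at p=0.9 each.
short_hyp = [math.log(0.6)]
long_hyp = [math.log(0.9)] * 5

# The plain score prefers the short hypothesis (0.6 > 0.9 ** 5).
print(sequence_log_prob(short_hyp) > sequence_log_prob(long_hyp))
# The normalized score prefers the long one.
print(length_normalized(long_hyp) > length_normalized(short_hyp))
```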

A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models

arXiv, 2020

Authors: Zeineldeen, Mohammad; Zeyer, Albert; Zhou, Wei; Ng, Thomas; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52062 Aachen, Germany; AppTek GmbH, 52062 Aachen, Germany

Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on, e.g., byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we do a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simplified and efficient decoder design, we also extend the phoneme set by auxiliary units to be able to distinguish homophones. Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive with grapheme-based encoder-decoder-attention modeling. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition

Comparing the benefit of synthetic training data for various automatic speech recognition architectures

arXiv, 2021

Authors: Rossenbach, Nick; Zeineldeen, Mohammad; Hilmes, Benedikt; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany; AppTek GmbH, 52062 Aachen, Germany

Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures which tend to suffer from over-fitting in low resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech system (TTS) if additional text is available. This was successfully applied in many publications with AED systems, but only to a very limited extent in the context of other ASR architectures. We investigate the effect of varying pre-processing, the speaker embedding and input encoding of the TTS system w.r.t. the effectiveness of the synthesized data for AED-ASR training. Additionally, we also consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone based system using connectionist-temporal-classification (CTC) and a monotonic transducer based system. We show that for the latter systems the addition of synthetic data has no relevant effect, but they still outperform the AED systems on LibriSpeech-100h. We achieve a final word-error-rate of 3.3%/10.0% with a hybrid system on the clean/noisy test-sets, surpassing any previous state-of-the-art systems on LibriSpeech-100h that do not include unlabeled audio data. Copyright © 2021, The Authors. All rights reserved.

Keywords: Speech recognition

Efficient Utilization of Large Pre-Trained Models for Low Resource ASR

arXiv, 2022

Authors: Vieting, Peter; Lüscher, Christoph; Dierkes, Julian; Schlüter, Ralf; Ney, Hermann. Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany; AppTek GmbH, 52062 Aachen, Germany

Unsupervised representation learning has recently helped automatic speech recognition (ASR) to tackle tasks with limited labeled data. Following this, hardware limitations and applications give rise to the question of how to take advantage of large pre-trained models efficiently and reduce their complexity. In this work, we study a challenging low resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pre-training techniques. Further gains of 29% can be achieved by refinements of architecture and training and 6% by adding 0.8 h of in-domain adaptation data. Copyright © 2022, The Authors. All rights reserved.

Keywords: Speech recognition