Search Results - Inner Mongolia University Library

Self-Normalized Importance Sampling for Neural Language Modeling

arXiv 2021

Authors: Yang, Zijian; Gao, Yingbo; Gerstenberger, Alexander; Jiang, Jintao; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

To mitigate the problem of having to traverse the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large-vocabulary word-based neural language models. These training criteria typically enjoy the benefit of faster training and testing, at the cost of slightly degraded performance in terms of perplexity and almost no visible drop in word error rate. While noise contrastive estimation is one of the most popular choices, we recently showed that other sampling-based criteria can also perform well, as long as an extra correction step is done, in which the intended class posterior probability is recovered from the raw model outputs. In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized, so there is no need for a further correction step. Through self-normalized language model training as well as lattice rescoring experiments, we show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks. Copyright © 2021, The Authors. All rights reserved.
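
Sampling-based criteria avoid computing the softmax normalizer Z, the sum of exp(score) over the whole vocabulary. As a minimal sketch of the general idea only (plain importance sampling, not the paper's self-normalized criterion), the Python below estimates log Z from words drawn from a noise distribution q. The uniform q, the vocabulary size, the sample count, and the random scores are all hypothetical choices for illustration.

```python
import math
import random

def full_log_normalizer(scores):
    """Exact log Z = log sum_v exp(s_v) over the whole vocabulary (stable log-sum-exp)."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def sampled_log_normalizer(scores, noise_probs, num_samples, rng):
    """Importance-sampling estimate of log Z.

    Draws word indices v ~ q (noise_probs) and averages exp(s_v) / q(v), which is
    an unbiased estimate of Z; its log is a consistent (slightly biased) estimate
    of log Z that only evaluates the sampled vocabulary entries.
    """
    samples = rng.choices(range(len(scores)), weights=noise_probs, k=num_samples)
    z_hat = sum(math.exp(scores[v]) / noise_probs[v] for v in samples) / num_samples
    return math.log(z_hat)

# Toy usage with hypothetical values: 5000 samples for a 50000-word vocabulary.
rng = random.Random(0)
vocab_size = 50000
scores = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
uniform_q = [1.0 / vocab_size] * vocab_size
exact = full_log_normalizer(scores)
approx = sampled_log_normalizer(scores, uniform_q, 5000, rng)
print(f"exact log Z = {exact:.3f}, sampled estimate = {approx:.3f}")
```

In practice q is often a unigram distribution rather than uniform; any q works for the estimator as long as it is positive wherever exp(score) is non-negligible.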

Keywords: Importance sampling

On architectures and training for raw waveform feature extraction in ASR

arXiv 2021

Authors: Vieting, Peter; Lüscher, Christoph; Michel, Wilfried; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

With the success of neural network based modeling in automatic speech recognition (ASR), many studies have investigated acoustic modeling and the learning of feature extractors directly from the raw waveform. Recently, one line of research has focused on unsupervised pre-training of feature extractors on audio-only data to improve downstream ASR performance. In this work, we investigate the usefulness of one of these front-end frameworks, namely wav2vec, in a setting without additional untranscribed data for hybrid ASR systems. We compare this framework both to the manually defined standard Gammatone feature set and to features extracted as part of the acoustic model of an ASR system trained in a supervised manner. We study the benefits of using the pre-trained feature extractor and explore how to additionally exploit an existing acoustic model trained with different features. Finally, we systematically examine combinations of the described features in order to further advance the performance. Copyright © 2021, The Authors. All rights reserved.

Keywords: Feature extraction

A study of latent monotonic attention variants

arXiv 2021

Authors: Zeyer, Albert; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany

End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic, which might lead to convergence problems, instability, and poor generalisation; it also cannot be used for online streaming and is computationally inefficient. Monotonicity can potentially fix all of this. There are several ad-hoc solutions or heuristics to introduce monotonicity, but a principled introduction has rarely been found in the literature so far. In this paper, we present a mathematically clean solution to introduce monotonicity, by introducing a new latent variable which represents the audio position or segment boundaries. We compare several monotonic latent models, such as a hard attention model, a local windowed soft attention model, and a segmental soft attention model, to our global soft attention baseline. We show that our monotonic models perform as well as the global soft attention model. We perform our experiments on Switchboard 300h. We carefully outline the details of our training and release our code and configs. © 2021, CC BY.
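
The difference between global soft attention and a local windowed variant can be sketched in a few lines of Python. This is a hedged illustration, not the paper's model: there, the window position is a latent variable, whereas here the window `center` is simply passed in, and the energies are hypothetical numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def global_attention_weights(energies):
    """Global soft attention: normalize over every encoder frame."""
    return softmax(energies)

def windowed_attention_weights(energies, center, half_width):
    """Local soft attention: normalize only inside [center - w, center + w].

    Frames outside the window get weight 0, so if `center` only moves forward
    from one output label to the next, the attended region is monotonic.
    """
    lo = max(0, center - half_width)
    hi = min(len(energies), center + half_width + 1)
    local = softmax(energies[lo:hi])
    return [0.0] * lo + local + [0.0] * (len(energies) - hi)

energies = [0.1, 2.0, 0.5, 3.0, 0.2, 1.0]  # hypothetical attention energies
print(global_attention_weights(energies))
print(windowed_attention_weights(energies, center=1, half_width=1))
```

With the window at frame 1, the strongest global peak at frame 3 is ignored entirely, which is exactly the restriction that makes streaming possible.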

Keywords: Speech recognition

Why does CTC result in peaky behavior?

arXiv 2021

Authors: Zeyer, Albert; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; AppTek GmbH, Aachen, Germany

The peaky behavior of CTC models is well known experimentally. However, it is not understood why peaky behavior occurs and whether it is a good property. We provide a formal analysis of the peaky behavior and gradient descent convergence properties of the CTC loss and related training criteria. Our analysis provides a deep understanding of why peaky behavior occurs and when it is suboptimal. On a simple example which should be trivial to learn for any model, we prove that a feed-forward neural network trained with CTC from uniform initialization converges towards peaky behavior with a 100% error rate. Our analysis further explains why CTC only works well together with the blank label. We further demonstrate that peaky behavior does not occur on other related losses including a label prior model, and that this improves convergence. © 2021, CC BY-SA.
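
The CTC loss sums the probability of every frame-level alignment that collapses to the target label sequence. The Python sketch below shows the collapse mapping (merge repeats, then drop blanks) and computes that sum by brute-force enumeration. It is an illustration only: real implementations use the forward algorithm, and the frame outputs here are hypothetical uniform probabilities, not the paper's setup.

```python
from itertools import product

BLANK = "<b>"

def ctc_collapse(alignment):
    """CTC mapping B: merge consecutive repeated symbols, then delete blanks."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)

def ctc_prob_bruteforce(frame_probs, target):
    """Sum of per-frame probability products over all alignments collapsing to target.

    frame_probs: list of dicts (symbol -> probability), one per frame, all sharing
    the same symbol set. Exponential in the number of frames: tiny inputs only.
    """
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for alignment in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(alignment) == tuple(target):
            p = 1.0
            for t, sym in enumerate(alignment):
                p *= frame_probs[t][sym]
            total += p
    return total

frames = [{BLANK: 0.5, "a": 0.5} for _ in range(3)]  # hypothetical outputs, 3 frames
print(ctc_collapse(("a", "a", BLANK, "a")))  # → ('a', 'a')
print(ctc_prob_bruteforce(frames, ["a"]))    # → 0.75
```

Both a peaky alignment such as `<b> a <b>` and a non-peaky one such as `a a a` collapse to the same label sequence, so the loss alone does not prefer either; which one training converges to is the question the paper analyzes.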

Keywords: Gradient methods

Investigation on data adaptation techniques for neural named entity recognition

arXiv 2021

Authors: Tokarchuk, Evgeniia; Thulke, David; Wang, Weiyue; Dugast, Christian; Ney, Hermann. Affiliations: Informatics Institute, University of Amsterdam; Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University

Data processing is an important step in various natural language processing tasks. As the commonly used datasets in named entity recognition contain only a limited number of samples, it is important to obtain additional labeled data in an efficient and reliable manner. A common practice is to utilize large monolingual unlabeled corpora. Another popular technique is to create synthetic data from the original labeled data (data augmentation). In this work, we investigate the impact of these two methods on the performance of three different named entity recognition tasks. Copyright © 2021, The Authors. All rights reserved.
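
The abstract does not specify which augmentation techniques were studied. As a hedged example of creating synthetic data from labeled data, the Python below sketches mention replacement, one commonly used technique: each entity mention is swapped for another mention of the same type. BIO tagging is assumed, and the sentence and `mention_bank` are hypothetical.

```python
import random

def extract_mentions(tags):
    """Return (start, end, type) spans for BIO-tagged entities, end exclusive."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans

def mention_replacement(tokens, tags, mention_bank, rng):
    """Replace every entity with a random same-type mention from mention_bank.

    mention_bank: dict mapping entity type -> list of token lists (hypothetical).
    Non-entity tokens are copied unchanged; new BIO tags are regenerated.
    """
    new_tokens, new_tags = [], []
    prev = 0
    for start, end, ent_type in extract_mentions(tags):
        new_tokens += tokens[prev:start]
        new_tags += tags[prev:start]
        replacement = rng.choice(mention_bank[ent_type])
        new_tokens += replacement
        new_tags += ["B-" + ent_type] + ["I-" + ent_type] * (len(replacement) - 1)
        prev = end
    new_tokens += tokens[prev:]
    new_tags += tags[prev:]
    return new_tokens, new_tags

tokens = ["Marie", "Curie", "moved", "to", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
bank = {"PER": [["Alan", "Turing"], ["Ada", "Lovelace"]], "LOC": [["New", "York"], ["Berlin"]]}
new_tokens, new_tags = mention_replacement(tokens, tags, bank, random.Random(1))
print(list(zip(new_tokens, new_tags)))
```

Because tags are regenerated from the replacement length, multi-token and single-token mentions can be swapped freely without breaking the BIO scheme.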

Keywords: Natural language processing systems

Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model

arXiv 2022

Authors: Daheim, Nico; Thulke, David; Dugast, Christian; Ney, Hermann. Affiliations: Ubiquitous Knowledge Processing Lab, Department of Computer Science, Technical University of Darmstadt, Germany; Human Language Technology and Pattern Recognition, RWTH Aachen University, Germany; AppTek GmbH, Germany

In this work, we present a model for document-grounded response generation in dialog that is decomposed into two components according to Bayes' theorem. One component is a traditional ungrounded response generation model and the other component models the reconstruction of the grounding document based on the dialog context and generated response. We propose different approximate decoding schemes and evaluate our approach on multiple open-domain and task-oriented document-grounded dialog datasets. Our experiments show that the model is more factual in terms of automatic factuality metrics than the baseline model. Furthermore, we outline how introducing scaling factors between the components allows for controlling the tradeoff between factuality and fluency in the model output. Finally, we compare our approach to a recently proposed method to control factuality in grounded dialog, CTRL (Rashkin et al., 2021), and show that both approaches can be combined to achieve additional improvements. © 2022, CC BY.
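
By Bayes' theorem, p(u | c, d) is proportional to p(u | c) · p(d | c, u), where u is the response, c the dialog context, and d the grounding document. The Python below is a minimal sketch of combining the two components in log space with a scaling factor on the reconstruction term. Reranking a fixed candidate list is only one possible approximate decoding scheme and is an assumption here; the candidate texts and log-probabilities are hypothetical.

```python
def noisy_channel_score(log_p_response, log_p_document, channel_scale):
    """log p(u | c) + scale * log p(d | c, u): the Bayes decomposition in log space,
    with a scaling factor on the document-reconstruction (channel) term."""
    return log_p_response + channel_scale * log_p_document

def rerank(candidates, channel_scale):
    """Return the candidate text with the highest combined score.

    candidates: list of (text, log p(u | c), log p(d | c, u)) triples.
    """
    best = max(candidates, key=lambda c: noisy_channel_score(c[1], c[2], channel_scale))
    return best[0]

candidates = [
    ("fluent but unsupported reply", -2.0, -9.0),
    ("reply grounded in the document", -3.0, -4.0),
]
print(rerank(candidates, channel_scale=0.0))  # → fluent but unsupported reply
print(rerank(candidates, channel_scale=1.0))  # → reply grounded in the document
```

Raising the scale shifts the choice from the response favored by the ungrounded model toward the one that best explains the document, mirroring the factuality/fluency tradeoff described in the abstract.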

Keywords:

Automatic learning of subword dependent model scales

arXiv 2021

Authors: Meyer, Felix; Michel, Wilfried; Zeineldeen, Mohammad; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

To improve the performance of state-of-the-art automatic speech recognition systems, it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination using separate scaling parameters for each model. Typically these parameters are manually optimized on some held-out data. In this work, we propose to optimize these scaling parameters via automatic differentiation and stochastic gradient descent, similarly to the neural network model parameters. We show on the LibriSpeech (LBS) and Switchboard (SWB) corpora that the model scales for a combination of an attention-based encoder-decoder acoustic model and a language model can be learned as effectively as with manual tuning. We further extend this approach to subword-dependent model scales, which could not be tuned manually; this leads to a 7% improvement on LBS and 3% on SWB. We also show that joint training of scales and model parameters is possible and gives an additional 6% improvement on LBS. Copyright © 2021, The Authors. All rights reserved.
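
The core idea of treating log-linear scales as trainable parameters can be shown on a toy problem. The Python below assumes a much simpler setup than the paper: a single n-best list with hypothetical (acoustic model, language model) log-probabilities, a softmax over the combined hypothesis scores as the training criterion, and a hand-derived gradient in place of automatic differentiation. It learns one global scale per model, not subword-dependent scales.

```python
import math

def combined_scores(features, scales):
    """Log-linear score of each hypothesis: sum_i scale_i * log p_i(hypothesis)."""
    return [sum(w * f for w, f in zip(scales, feats)) for feats in features]

def nbest_loss(features, ref_index, scales):
    """-log of the reference's posterior under a softmax over the n-best scores."""
    scores = combined_scores(features, scales)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[ref_index]

def train_scales(features, ref_index, scales, lr=0.1, steps=200):
    """Plain gradient descent on nbest_loss with respect to the model scales."""
    scales = list(scales)
    for _ in range(steps):
        scores = combined_scores(features, scales)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        post = [e / z for e in exps]
        # d loss / d scale_i = E_posterior[f_i] - f_i(reference)
        grads = [
            sum(p * feats[i] for p, feats in zip(post, features)) - features[ref_index][i]
            for i in range(len(scales))
        ]
        scales = [w - lr * g for w, g in zip(scales, grads)]
    return scales

# Hypothetical n-best list: (log p_AM, log p_LM) per hypothesis; index 0 is the reference.
features = [(-5.0, -3.0), (-4.0, -8.0), (-6.0, -7.0)]
initial = [1.0, 0.0]  # acoustic model only: hypothesis 1 wins
learned = train_scales(features, ref_index=0, scales=initial)
print(learned, nbest_loss(features, 0, initial), nbest_loss(features, 0, learned))
```

With the acoustic model alone, the wrong hypothesis scores best; gradient descent raises the language model scale until the reference is ranked first, which is the effect manual tuning would otherwise have to find.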

Keywords: Stochastic models

On language model integration for RNN transducer based speech recognition

arXiv 2021

Authors: Zhou, Wei; Zheng, Zuoyun; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of RNN-Transducer (RNN-T) can limit the performance of LM integration such as simple shallow fusion. A Bayesian interpretation suggests removing this sequence prior as ILM correction. In this work, we study various ILM correction-based LM integration methods formulated in a common RNN-T framework. We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction, which is further verified experimentally with a detailed analysis. We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer, which enables a theoretical justification for other ILM approaches. A systematic comparison is conducted for both in-domain and cross-domain evaluation on the Librispeech and TED-LIUM Release 2 corpora, respectively. Our proposed exact-ILM training can further improve the best ILM method. Copyright © 2021, The Authors. All rights reserved.
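
ILM correction extends shallow fusion by subtracting a scaled estimate of the internal LM score, which in log space corresponds to dividing out the sequence prior. The Python below is a minimal sketch under stated assumptions: the ILM estimate (how to obtain it is what the paper studies) is simply given as a number, the scores are applied to complete hypotheses for simplicity rather than per label step during search, and all log-probabilities and scales are hypothetical.

```python
def shallow_fusion(log_p_am, log_p_lm, lm_scale):
    """Shallow fusion: transducer score plus scaled external LM score."""
    return log_p_am + lm_scale * log_p_lm

def ilm_corrected(log_p_am, log_p_lm, log_p_ilm, lm_scale, ilm_scale):
    """Shallow fusion with ILM correction: also subtract the scaled internal LM
    score, i.e. divide out the implicitly learned sequence prior."""
    return shallow_fusion(log_p_am, log_p_lm, lm_scale) - ilm_scale * log_p_ilm

# Hypothetical hypotheses: name -> (log p_RNNT, log p_LM, log p_ILM).
hyps = {
    "favoured by training-domain prior": (-4.0, -5.0, -2.0),
    "favoured by target-domain LM": (-5.0, -4.0, -7.0),
}
LM_SCALE, ILM_SCALE = 0.5, 0.4  # hypothetical scales

best_sf = max(hyps, key=lambda h: shallow_fusion(hyps[h][0], hyps[h][1], LM_SCALE))
best_ilm = max(hyps, key=lambda h: ilm_corrected(*hyps[h], LM_SCALE, ILM_SCALE))
print(best_sf)   # → favoured by training-domain prior
print(best_ilm)  # → favoured by target-domain LM
```

The transducer score already contains the training-text prior, so without the correction that prior and the external LM effectively vote twice on the language side; subtracting the ILM lets the external LM dominate, which matters most in the cross-domain setting.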

Keywords: Transducers

Investigating methods to improve language model integration for attention-based encoder-decoder ASR models

arXiv 2021

Authors: Zeineldeen, Mohammad; Glushko, Aleksandr; Michel, Wilfried; Zeyer, Albert; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A Bayesian interpretation as in the hybrid autoregressive transducer (HAT) suggests dividing by the prior of the discriminative acoustic model, which corresponds to this implicit LM, similarly to the hybrid hidden Markov model approach. The implicit LM cannot be calculated efficiently in general, and it is still unclear what the best methods to estimate it are. In this work, we compare different approaches from the literature and propose several novel methods to estimate the ILM directly from the AED model. Our proposed methods outperform all previous approaches. We also investigate other methods to suppress the ILM, mainly by decreasing the capacity of the AED model, limiting the label context, and also by training the AED model together with a pre-existing LM. Copyright © 2021, The Authors. All rights reserved.

Keywords: Speech recognition