Protecting privacy in contemporary NLP models is gaining in importance. So is the need to mitigate the social biases of such models. But can we have both at the same time? Existing research suggests that privacy preserv...
Current neural translation networks are based on an effective attention mechanism that can be considered an implicit probabilistic notion of alignment. Such architectures do not guarantee a high-quality alignment, ...
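As a reminder of why attention can be read as a probabilistic alignment (generic notation, not taken from this paper): for each target position i, the normalized attention weights over source positions j form a distribution, for example

    \alpha_{i,j} = \frac{\exp e_{i,j}}{\sum_{j'} \exp e_{i,j'}},
    \qquad \sum_{j} \alpha_{i,j} = 1,

so a hard alignment can be read off as \arg\max_j \alpha_{i,j}, even though nothing in training forces these weights to match true word alignments.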
In this work, we present a model for document-grounded response generation in dialog that is decomposed into two components according to Bayes' theorem. One component is a traditional ungrounded response generatio...
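The abstract names a Bayes decomposition with an ungrounded response model as one component; one way such a factorization can be written (my notation, with r the response, c the dialog context, and d the grounding document) is

    p(r \mid c, d) \;=\; \frac{p(d \mid r, c)\, p(r \mid c)}{p(d \mid c)}
                   \;\propto\; p(d \mid r, c)\, p(r \mid c),

where p(r \mid c) is the traditional ungrounded response generation model and p(d \mid r, c) scores how well the document is explained by the response in context.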
With the success of neural network based modeling in automatic speech recognition (ASR), many studies investigated acoustic modeling and learning of feature extractors directly based on the raw waveform. Recently, one...
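As a minimal, self-contained sketch of what a learnable feature extractor on the raw waveform looks like (assumed PyTorch; the layer sizes and the parameterization are illustrative, not those of any particular study):

    import torch
    import torch.nn as nn

    class RawWaveformFrontend(nn.Module):
        """Learnable front end operating directly on raw audio samples."""
        def __init__(self, num_filters=80, kernel_size=400, stride=160):
            # 400/160 samples correspond to a 25 ms window and 10 ms shift
            # at 16 kHz (an illustrative choice, not taken from the paper).
            super().__init__()
            self.conv = nn.Conv1d(1, num_filters, kernel_size, stride=stride)
            self.act = nn.ReLU()

        def forward(self, waveform):
            # waveform: (batch, num_samples) -> features: (batch, frames, num_filters)
            x = self.conv(waveform.unsqueeze(1))
            return self.act(x).transpose(1, 2)

    features = RawWaveformFrontend()(torch.randn(4, 16000))  # 1 s of 16 kHz audio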
As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria have been proposed and investigated. The essence of these sampling methods is that the softmax-related t...
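To make the bottleneck concrete (generic notation, not taken from this abstract): the full softmax requires a normalizer over the whole vocabulary V, which sampling-based criteria approximate with a small set of sampled words, e.g. by importance sampling from a proposal q:

    p_\theta(w \mid h) = \frac{\exp s_\theta(w, h)}{\sum_{v \in V} \exp s_\theta(v, h)},
    \qquad
    \sum_{v \in V} \exp s_\theta(v, h) \;\approx\; \frac{1}{K} \sum_{k=1}^{K}
        \frac{\exp s_\theta(v_k, h)}{q(v_k)}, \quad v_k \sim q .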
To mitigate the problem of having to traverse the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large vo...
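A minimal sketch of one such sampling-based criterion (assumed PyTorch; uniform sampling and no bias-correction terms, so this is a simplified illustration rather than any specific criterion studied here):

    import torch
    import torch.nn.functional as F

    def sampled_softmax_loss(hidden, emb, target, num_samples=1024):
        # hidden: (B, D) context vectors; emb: (V, D) output word embeddings; target: (B,)
        V = emb.size(0)
        samples = torch.randint(V, (num_samples,), device=hidden.device)   # proposal q = uniform
        tgt_logit = (hidden * emb[target]).sum(-1, keepdim=True)           # (B, 1) true-word score
        smp_logit = hidden @ emb[samples].t()                              # (B, K) sampled scores
        logits = torch.cat([tgt_logit, smp_logit], dim=1)                  # softmax over K+1 words only
        labels = torch.zeros(target.size(0), dtype=torch.long, device=hidden.device)
        return F.cross_entropy(logits, labels)                             # true word is index 0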
End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic, which might lead to convergence problems, instability and bad generalisation, and it cannot be used ...
The peaky behavior of CTC models is well known experimentally. However, an understanding of why peaky behavior occurs, and whether it is a good property, is still missing. We provide a formal analysis of the peaky behav...
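For reference, the standard CTC criterion in which this behavior is observed (standard notation, independent of this paper's analysis): with a blank symbol and the collapsing function B that removes blanks and label repetitions,

    p(y \mid x) \;=\; \sum_{\pi \in B^{-1}(y)} \; \prod_{t=1}^{T} p_t(\pi_t \mid x),

and "peaky" refers to the frame-level posteriors p_t(\cdot \mid x) putting almost all mass on blank at most frames, with isolated sharp spikes for the actual labels.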
Recent studies have shown the impressive efficacy of counterfactually augmented data (CAD) for reducing NLU models’ reliance on spurious features and improving their generalizability. However, current methods still h...
Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to be...
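A common log-linear form of such LM integration with internal LM compensation (a generic shallow-fusion-style formulation, not necessarily the exact method of this work) scores a hypothesis y given acoustics x as

    \log p_{\text{AED}}(y \mid x) \;+\; \lambda_{1} \log p_{\text{ext}}(y) \;-\; \lambda_{2} \log p_{\text{ILM}}(y),

where p_{\text{ILM}} is an estimate of the internal LM implicitly learned from the training transcriptions and p_{\text{ext}} is the external LM trained on unpaired text.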