Search results - Inner Mongolia University Library

On architectures and training for raw waveform feature extraction in ASR

arXiv, 2021

Authors: Vieting, Peter; Lüscher, Christoph; Michel, Wilfried; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

With the success of neural network based modeling in automatic speech recognition (ASR), many studies investigated acoustic modeling and learning of feature extractors directly based on the raw waveform. Recently, one line of research has focused on unsupervised pre-training of feature extractors on audio-only data to improve downstream ASR performance. In this work, we investigate the usefulness of one of these front-end frameworks, namely wav2vec, in a setting without additional untranscribed data for hybrid ASR systems. We compare this framework both to the manually defined standard Gammatone feature set, as well as to features extracted as part of the acoustic model of an ASR system trained supervised. We study the benefits of using the pre-trained feature extractor and explore how to additionally exploit an existing acoustic model trained with different features. Finally, we systematically examine combinations of the described features in order to further advance the performance. Copyright © 2021, The Authors. All rights reserved.

Keywords: Feature extraction

On sampling-based training criteria for neural language modeling

arXiv, 2021

Authors: Gao, Yingbo; Thulke, David; Gerstenberger, Alexander; Tran, Khoa Viet; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria are proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem we notice about the current landscape of such sampling methods is the lack of a systematic comparison and some myths about preferring one over another. In this work, we consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation. Linking back to the three traditional criteria, namely mean squared error, binary cross-entropy, and cross-entropy, we derive the theoretical solutions to the training problems. Contrary to some common belief, we show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities. Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim, with all sampling-based methods showing similar perplexities and word error rates while giving the expected speedups. Copyright © 2021, The Authors. All rights reserved.

Keywords: Importance sampling
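
The record above describes replacing the full softmax traversal over the vocabulary with a sampled approximation. As a minimal sketch of that general idea only (not the paper's training criteria or its correction step), the following estimates the log of the softmax normalizer by importance sampling. The uniform proposal, the synthetic scores, and all function names are assumptions made for illustration.

```python
import math
import random

def exact_log_normalizer(scores):
    """Log of the full softmax denominator: log sum_w exp(s_w)."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def sampled_log_normalizer(scores, proposal, num_samples, rng):
    """Importance-sampling estimate of the same quantity.

    Draws word indices from `proposal` (a list of probabilities) and
    averages exp(s_w) / q(w), an unbiased estimate of the full sum.
    """
    indices = rng.choices(range(len(scores)), weights=proposal, k=num_samples)
    total = sum(math.exp(scores[i]) / proposal[i] for i in indices)
    return math.log(total / num_samples)

# Hypothetical setup: synthetic scores and a uniform proposal distribution.
rng = random.Random(0)
vocab_size = 50000
scores = [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]
uniform = [1.0 / vocab_size] * vocab_size

exact = exact_log_normalizer(scores)
approx = sampled_log_normalizer(scores, uniform, 2000, rng)
print(f"exact={exact:.4f} sampled={approx:.4f}")
```

Only 2000 of the 50000 scores are touched by the sampled estimate, which is where a speedup would come from. The estimate's variance depends on how well the proposal matches the score distribution.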

Two-way neural machine translation: A proof of concept for bidirectional translation modeling using a two-dimensional grid

arXiv, 2020

Authors: Bahar, Parnia; Brix, Christopher; Ney, Hermann (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

Neural translation models have proven to be effective in capturing sufficient information from a source sentence and generating a high-quality target sentence. However, it is not easy to get the best effect for bidirectional translation, i.e., both source-to-target and target-to-source translation using a single model. If we exclude some pioneering attempts, such as multilingual systems, all other bidirectional translation approaches are required to train two individual models. This paper proposes to build a single end-to-end bidirectional translation model using a two-dimensional grid, where the left-to-right decoding generates source-to-target, and the bottom-to-up decoding creates target-to-source output. Instead of training two models independently, our approach encourages a single network to jointly learn to translate in both directions. Experiments on the WMT 2018 German↔English and Turkish↔English translation tasks show that the proposed model is capable of generating a good translation quality and has sufficient potential to direct the research. © 2020, CC-BY.

Keywords: Decoding

Self-Normalized Importance Sampling for Neural Language Modeling

arXiv, 2021

Authors: Yang, Zijian; Gao, Yingbo; Gerstenberger, Alexander; Jiang, Jintao; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

To mitigate the problem of having to traverse over the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large vocabulary word-based neural language models. These training criteria typically enjoy the benefit of faster training and testing, at a cost of slightly degraded performance in terms of perplexity and almost no visible drop in word error rate. While noise contrastive estimation is one of the most popular choices, recently we show that other sampling-based criteria can also perform well, as long as an extra correction step is done, where the intended class posterior probability is recovered from the raw model outputs. In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step. Through self-normalized language model training as well as lattice rescoring experiments, we show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks. Copyright © 2021, The Authors. All rights reserved.

Keywords: Importance sampling

Language modeling with deep Transformers

arXiv, 2019

Authors: Irie, Kazuki; Zeyer, Albert; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well configured Transformer models outperform our baseline models based on the shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. The positional encoding is an essential augmentation for the self-attention mechanism which is invariant to sequence ordering. However, in an autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal on its own. The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech recognition
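
The record above argues that an autoregressive setup carries a positional signal even without positional encoding. One way to read that argument is sketched below: under a causal mask, position t sees t + 1 tokens, so even with identical attention scores the resulting weights differ per position. This is an illustrative toy with a hypothetical helper name, not the paper's model or analysis.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    Position t may only attend to positions 0..t; later positions get 0.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Identical scores everywhere: no content or positional input is injected.
n = 4
weights = causal_attention_weights([[0.0] * n for _ in range(n)])
for t, row in enumerate(weights):
    print(t, [round(w, 3) for w in row])
```

Each row spreads its mass evenly over the visible prefix, so the weight at position t is 1 / (t + 1). The size of that weight alone identifies the position.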

Tight integrated end-to-end training for cascaded speech translation

arXiv, 2020

Authors: Bahar, Parnia; Bieschke, Tobias; Schlüter, Ralf; Ney, Hermann (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany)

A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance is often behind the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previous studies have proposed using two-stage models by passing the hidden vectors of the recognizer into the decoder of the MT model and ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model by optimizing all parameters of ASR and MT models jointly without ignoring any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors and enables backpropagation. Therefore, it provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models by up to 1.8% in BLEU and 2.0% in TER and is superior compared to direct models. © 2020, CC-BY.

Keywords: Backpropagation
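
The record above mentions passing renormalized source word posteriors as a soft decision instead of one-hot vectors. A minimal sketch of what such a soft decision could look like follows. The sharpening exponent `gamma`, the toy embedding table, and both function names are assumptions for illustration; the abstract does not specify the renormalization.

```python
def renormalize(posterior, gamma):
    """Sharpen a probability vector: p_i**gamma / sum_j p_j**gamma."""
    powered = [p ** gamma for p in posterior]
    total = sum(powered)
    return [p / total for p in powered]

def expected_embedding(posterior, embeddings):
    """Posterior-weighted average of embedding rows (a soft decision)."""
    dim = len(embeddings[0])
    return [
        sum(p * row[d] for p, row in zip(posterior, embeddings))
        for d in range(dim)
    ]

# Hypothetical posterior over a 3-word vocabulary and a toy embedding table.
posterior = [0.6, 0.3, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft = expected_embedding(renormalize(posterior, 1.0), embeddings)
sharp = expected_embedding(renormalize(posterior, 8.0), embeddings)
print("gamma=1:", soft)
print("gamma=8:", sharp)
```

With `gamma = 1` the input is a plain mixture of embeddings. A larger `gamma` moves it toward the embedding of the most probable word, approximating a one-hot choice while staying differentiable.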

Robust knowledge distillation from RNN-T models with noisy training labels using full-sum loss

arXiv, 2023

Authors: Zeineldeen, Mohammad; Audhkhasi, Kartik; Baskar, Murali Karthick; Ramabhadran, Bhuvana (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; Google LLC, New York, United States)

This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models. In hard distillation, a teacher model transcribes large amounts of unlabelled speech to train a student model. Soft distillation is another popular KD method that distills the output logits of the teacher model. Due to the nature of RNN-T alignments, applying soft distillation between RNN-T architectures having different posterior distributions is challenging. In addition, bad teachers having high word-error-rate (WER) reduce the efficacy of KD. We investigate how to effectively distill knowledge from variable quality ASR teachers, which has not been studied before to the best of our knowledge. We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models, especially for bad teachers. We also propose a variant of full-sum distillation that distills the sequence discriminative knowledge of the teacher leading to further improvement in WER. We conduct experiments on public datasets, namely SpeechStew and LibriSpeech, and on in-house production data. Copyright © 2023, The Authors. All rights reserved.

Keywords: Recurrent neural networks
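
The record above contrasts soft distillation, which matches the teacher's output logits, with the sequence-level full-sum distillation it proposes. As a reference point for the former only, the sketch below shows a common formulation of soft distillation for a single output distribution, the KL divergence from teacher to student. It does not model the RNN-T alignment lattice or the full-sum loss, and the function names and example logits are assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one output distribution."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

# Hypothetical logits for one output step.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]
same = soft_distillation_loss(teacher, teacher)
diff = soft_distillation_loss(teacher, student)
print(f"identical: {same:.6f}  different: {diff:.6f}")
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the student toward the teacher one distribution at a time.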

Non-stationary feature extraction for automatic speech recognition

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zoltán Tüske; Pavel Golik; Ralf Schlüter; Friedhelm R. Drepper (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany; Zentralinstitut für Elektronik, Forschungszentrum Jülich (KFA), Jülich, Germany)

In current speech recognition systems mainly Short-Time Fourier Transform based features like MFCC are applied. Dropping the short-time stationarity assumption of the voiced speech, this paper introduces the non-stationary signal analysis into the ASR framework. We present new acoustic features extracted by a pitch-adaptive Gammatone filter bank. The noise robustness was proved on AURORA 2 and 4 tasks, where the proposed features outperform the standard MFCC. Furthermore, successful combination experiments via ROVER indicate the differences between the new features and MFCC.

Keywords: Mel frequency cepstral coefficient; Harmonic analysis; Feature extraction; Speech; Speech recognition; Time frequency analysis

On the choice of modeling unit for sequence-to-sequence speech recognition

arXiv, 2019

Authors: Irie, Kazuki; Prabhavalkar, Rohit; Kannan, Anjuli; Bruguier, Antoine; Rybach, David; Nguyen, Patrick (Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen D-52056, Germany; Google, Mountain View, CA 94043, United States)