Search results - Inner Mongolia University Library

Cumulative adaptation for BLSTM acoustic models

arXiv, 2019

Authors: Kitza, Markus; Golik, Pavel; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

This paper addresses the robust speech recognition problem as an adaptation task. Specifically, we investigate the cumulative application of adaptation methods. A bidirectional Long Short-Term Memory (BLSTM) based neural network, capable of learning temporal relationships and translation-invariant representations, is used for robust acoustic modeling. Further, i-vectors were used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing an 8% relative improvement in word error rate on the NIST Hub5 2000 evaluation test set. By enhancing the first-pass i-vector based adaptation with a second-pass adaptation using speaker- and environment-dependent transformations within the network, a further relative improvement of 5% in word error rate was achieved. We have reevaluated the features used to estimate i-vectors and their normalization to achieve the best performance in a modern large-scale automatic speech recognition system. Copyright © 2019, The Authors. All rights reserved.
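
As a rough illustration of how the two relative gains quoted in this abstract compound, the sketch below assumes the 5% figure is measured relative to the i-vector-adapted system. The baseline WER of 10.0% is a hypothetical value chosen for readability; the abstract gives no absolute numbers.

```python
def apply_relative_improvement(wer: float, relative_gain: float) -> float:
    """Return the WER after a relative reduction (0.08 means 8% relative)."""
    return wer * (1.0 - relative_gain)


def cumulative_relative_gain(gains) -> float:
    """Compound successive relative gains into one overall relative gain."""
    remaining = 1.0
    for gain in gains:
        remaining *= 1.0 - gain
    return 1.0 - remaining


baseline_wer = 10.0  # hypothetical baseline WER in percent, not from the abstract
first_pass_wer = apply_relative_improvement(baseline_wer, 0.08)     # i-vector adaptation
second_pass_wer = apply_relative_improvement(first_pass_wer, 0.05)  # second-pass adaptation
overall_gain = cumulative_relative_gain([0.08, 0.05])

print(f"first pass:  {first_pass_wer:.2f}%")
print(f"second pass: {second_pass_wer:.2f}%")
print(f"overall relative gain: {overall_gain:.1%}")
```

Under that assumption the two steps combine multiplicatively, so the overall relative gain is slightly below the plain sum of 13%.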

Keywords: Speech recognition

On using 2D sequence-to-sequence models for speech recognition

arXiv, 2019

Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. In these architectures, one-dimensional input and output sequences are related by an attention mechanism, which replaces more explicit alignment processes such as those in classical HMM-based modeling. In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture to directly model the input/output relation between audio/feature vector sequences and word sequences. The proposed model is an alternative: instead of using any type of attention component, we apply a 2DLSTM layer to assimilate the context from both input observations and output transcriptions. The experimental evaluation on the Switchboard 300h automatic speech recognition task shows word error rates for the 2DLSTM model that are competitive with those of an end-to-end attention-based model. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech recognition

Language modeling with deep Transformers

arXiv, 2019

Authors: Irie, Kazuki; Zeyer, Albert; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. The positional encoding is an essential augmentation for the self-attention mechanism, which is invariant to sequence ordering. However, in an autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is a positional signal on its own. The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models. Copyright © 2019, The Authors. All rights reserved.
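
The positional argument in this abstract can be illustrated with a small NumPy sketch. It is not the authors' model: it assumes a single attention head with identity projections and no positional encoding, and the function name `self_attention` is hypothetical. Without a causal mask, permuting the input merely permutes the output; with the causal mask used in autoregressive models, that symmetry breaks, so position leaves a trace in the output.

```python
import numpy as np


def self_attention(x: np.ndarray, causal: bool) -> np.ndarray:
    """Single-head self-attention with identity projections, no positional encoding."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Position i may only attend to positions j <= i.
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))       # 5 positions, 4 features (arbitrary toy sizes)
perm = np.array([4, 3, 2, 1, 0])  # reverse the sequence

# Does permuting the input simply permute the output?
full_equivariant = np.allclose(
    self_attention(x[perm], causal=False), self_attention(x, causal=False)[perm]
)
causal_equivariant = np.allclose(
    self_attention(x[perm], causal=True), self_attention(x, causal=True)[perm]
)
print(full_equivariant, causal_equivariant)
```

In the causal case the first position can attend only to itself while the last attends to the whole prefix, which is the growing "amount of information" the abstract refers to.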

Keywords: Speech recognition

Modernizing Historical Documents: a User Study

arXiv, 2019

Authors: Domingo, Miguel; Casacuberta, Francisco. Affiliation: Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain

Accessibility to historical documents is mostly limited to scholars. This is due to the language barrier inherent in human language and the linguistic properties of these documents. Given a historical document, modernization aims to generate a new version of it, written in the modern version of the document’s language. Its goal is to tackle the language barrier, decreasing the comprehension difficulty and making historical documents accessible to a broader audience. In this work, we proposed a new neural machine translation approach that profits from modern documents to enrich its systems. We tested this approach with both automatic and human evaluation, and conducted a user study. Results showed that modernization is successfully reaching its goal, although it still has room for improvement. Copyright © 2019, The Authors. All rights reserved.

Keywords: History

Improved training of end-to-end attention models for speech recognition

19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018

Authors: Zeyer, Albert; Irie, Kazuki; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52062, Germany; AppTek, United States; NNAISENSE, Switzerland

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model. © 2018 International Speech Communication Association. All rights reserved.
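
Shallow fusion, as mentioned in this abstract, is commonly described as a log-linear combination of the recognizer's score with a weighted language-model score. The sketch below is a minimal illustration under that assumption; the hypotheses, log-probabilities, and the weight of 0.3 are all made-up values. In practice the combination is typically applied per token during beam search; for brevity the same formula is applied here to whole hypotheses.

```python
def shallow_fusion_score(log_p_asr: float, log_p_lm: float, lm_weight: float) -> float:
    """Log-linear combination: ASR log-probability plus weighted LM log-probability."""
    return log_p_asr + lm_weight * log_p_lm


def pick_best(hypotheses, lm_weight: float) -> str:
    """Return the text of the best (text, log_p_asr, log_p_lm) hypothesis."""
    best = max(hypotheses, key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))
    return best[0]


# Hypothetical candidates with made-up scores, for illustration only.
hypotheses = [
    ("recognize speech", -4.2, -3.0),
    ("wreck a nice beach", -4.0, -9.0),
]

best_without_lm = pick_best(hypotheses, lm_weight=0.0)
best_with_lm = pick_best(hypotheses, lm_weight=0.3)
print(best_without_lm)
print(best_with_lm)
```

With a weight of zero the choice depends on the ASR score alone; a positive weight lets the language model overturn a hypothesis that is acoustically plausible but linguistically unlikely.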

Keywords: Speech recognition

Creating the best development corpus for Statistical Machine Translation systems

21st Annual Conference of the European Association for Machine Translation, EAMT 2018

Authors: Chinea-Rios, Mara; Sanchis-Trilles, Germán; Casacuberta, Francisco. Affiliations: Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, València, Spain; Sciling, València, Spain

ISBN: (print) 9788409019014

We propose and study three different novel approaches for tackling the problem of development set selection in Statistical Machine Translation. We focus on a scenario where a machine translation system is leveraged for translating a specific test set, without further data from the domain at hand. Such test set stems from a real application of machine translation, where the texts of a specific e-commerce were to be translated. For developing our development-set selection techniques, we first conducted experiments in a controlled scenario, where labelled data from different domains was available, and evaluated the techniques both with classification and translation quality metrics. Then, the best-performing techniques were evaluated on the e-commerce data at hand, yielding consistent improvements across two language directions. © 2018 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

Keywords: Machine translation

On the choice of modeling unit for sequence-to-sequence speech recognition

arXiv, 2019

Authors: Irie, Kazuki; Prabhavalkar, Rohit; Kannan, Anjuli; Bruguier, Antoine; Rybach, David; Nguyen, Patrick. Affiliations: Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, Aachen D-52056, Germany; Google, Mountain View, CA 94043, United States

Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018

Authors: Granell, Emilio; Martínez-Hinarejos, Carlos-D.; Romero, Verónica. Affiliation: Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Camí de Vera s/n, València 46022, Spain

The transcription of digitalised documents is useful to ease digital access to their contents. Natural language technologies, such as Automatic Speech Recognition (ASR) for speech audio signals and Handwritten Text Recognition (HTR) for text images, have become common tools for assisting transcribers, by providing a draft transcription from the digital document that they may amend. This draft is useful when its error rate is low enough to make the amending process more comfortable than a complete transcription from scratch. The work described in this thesis is focused on the improvement of the transcription offered by an HTR system in three scenarios: multimodality, interactivity and crowdsourcing. The image transcription can be obtained by dictating its textual contents to an ASR system. Besides, when both sources of information (image and speech) are available, a multimodal combination is possible, and this can be used to provide assistive systems with additional sources of information. Moreover, speech dictation can be used in a multimodal crowdsourcing platform, where collaborators may provide their speech by using mobile devices. Different solutions for each scenario were tested on two Spanish historical manuscripts, obtaining statistically significant improvements. © 4th International Conference, IberSPEECH 2018.

Keywords: Crowdsourcing

Calibration of Deep Probabilistic Models with Decoupled Bayesian Neural Networks

arXiv, 2019

Authors: Maroñas, Juan; Paredes, Roberto; Ramos, Daniel. Affiliations: PRHLT - Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de Valencia, Spain; AUDIAS - Audio Data Intelligence and Speech, Universidad Autónoma de Madrid, Spain

Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy performance in many tasks. However, recent works have pointed out that the outputs provided by these models are not well-calibrated, seriously limiting their use in critical decision scenarios. In this work, we propose to use a decoupled Bayesian stage, implemented with a Bayesian Neural Network (BNN), to map the uncalibrated probabilities provided by a DNN to calibrated ones, consistently improving calibration. Our results evidence that incorporating uncertainty provides more reliable probabilistic models, a critical condition for achieving good calibration. We report a generous collection of experimental results using high-accuracy DNNs in standardized image classification benchmarks, showing the good performance, flexibility and robust behavior of our approach with respect to several state-of-the-art calibration methods. Code for reproducibility is provided. Copyright © 2019, The Authors. All rights reserved.
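
The abstract does not say which calibration metric is used. One widely used measure is the Expected Calibration Error (ECE), which bins predictions by confidence and averages the gap between accuracy and mean confidence per bin. The sketch below assumes that metric; the function name, the choice of 10 equal-width bins, and the toy data are illustrative and not taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE over equal-width confidence bins (lo, hi]; confidences are assumed to lie in (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# Toy data: a model that claims 95% confidence but is right only half the time.
overconfident = expected_calibration_error([0.95] * 10, [1] * 5 + [0] * 5)
# Toy data: 75% confidence and 75% accuracy.
well_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(overconfident, well_calibrated)
```

A lower value indicates that the reported confidences track the observed accuracy more closely.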

Keywords: Calibration