Search results - Inner Mongolia University Library

Generative models for deep learning with very scarce data

arXiv, 2019

Authors: Maroñas, Juan; Paredes, Roberto; Ramos, Daniel. Affiliations: Pattern Recognition and Human Language Technology, Universitat Politecnica de Valencia, Valencia, Spain; AUDIAS, Universidad Autonoma de Madrid, Madrid, Spain

The goal of this paper is to deal with a data scarcity scenario where deep learning techniques tend to fail. We compare two well-established techniques, Restricted Boltzmann Machines (RBM) and Variational Auto-encoders (VAE), as generative models for enlarging the training set in a classification framework. Essentially, we rely on Markov Chain Monte Carlo (MCMC) algorithms for generating new samples. We show that this methodology improves generalization compared with other state-of-the-art techniques, e.g. semi-supervised learning with ladder networks. Furthermore, we show that the RBM is better than the VAE at generating new samples for training a classifier with good generalization capabilities. Copyright © 2019, The Authors. All rights reserved.

Keywords: Supervised learning
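
The abstract above mentions MCMC sampling from an RBM to generate extra training samples. As a rough illustration only, the following sketch runs block Gibbs sampling in a tiny binary RBM. The weights `W`, biases `b` and `c`, the seed sample and the number of steps are made-up toy values, not anything taken from the paper, and the paper's actual sampler settings are not stated in this record.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gibbs_step(v, W, b, c, rng):
    """One block Gibbs step in a binary RBM: sample h given v, then v given h."""
    n_vis, n_hid = len(b), len(c)
    h = [
        int(rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(n_vis))))
        for j in range(n_hid)
    ]
    return [
        int(rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(n_hid))))
        for i in range(n_vis)
    ]


def generate(seed_sample, W, b, c, steps, rng):
    """Run a short Markov chain starting from a real sample (assumed setup)."""
    v = list(seed_sample)
    for _ in range(steps):
        v = gibbs_step(v, W, b, c, rng)
    return v


# Toy, untrained parameters (hypothetical): 4 visible units, 2 hidden units.
W = [[1.5, -1.0], [1.2, -0.8], [-1.0, 1.4], [-0.9, 1.1]]
b = [0.0, 0.0, 0.0, 0.0]
c = [0.0, 0.0]

rng = random.Random(0)
sample = generate([1, 1, 0, 0], W, b, c, steps=10, rng=rng)
print(sample)
```

In a data augmentation setting, each generated vector would presumably inherit the label of the sample that seeded the chain, but that labeling choice is an assumption here.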

The RWTH Arabic-to-English spoken language translation system

IEEE Workshop on Automatic Speech Recognition and Understanding

Authors: Oliver Bender; Evgeny Matusov; Stefan Hahn; Sasa Hasan; Shahram Khadivi; Hermann Ney. Affiliation: Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik 6, Computer Science Department, RWTH Aachen University, Aachen, Germany

ISBN: (print) 9781424413690; 1424413699

We present the RWTH phrase-based statistical machine translation system designed for the translation of Arabic speech into English text. This system was used in the Global Autonomous Language Exploitation (GALE) Go/No-Go Translation Evaluation 2007. Using a two-pass approach, we first generate n-best translation candidates and then rerank these candidates using additional models. We give a short review of the decoder as well as of the models used in both passes. We stress the difficulties of spoken language translation, i.e. how to combine the recognition and translation systems and how to compensate for missing punctuation. In addition, we cover our work on domain adaptation for the applied language models. We present translation results for the official GALE 2006 evaluation set and the GALE 2007 development set.

Keywords: Natural languages; Automatic speech recognition; Surface-mount technology; Vocabulary; Decoding; Speech analysis; Broadcasting; Humans; Pattern recognition; Computer science
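
The two-pass approach in the abstract above (generate n-best candidates, then rerank them with additional models) can be sketched roughly as follows. This is a generic illustration, not the RWTH system: the candidate list, the `length_bonus` model and the weight are hypothetical.

```python
def rerank(nbest, extra_models, weights):
    """Second pass: add weighted scores from additional models to the
    first-pass score and sort candidates from best to worst."""
    def total(hyp):
        extra = sum(w * model(hyp["text"]) for model, w in zip(extra_models, weights))
        return hyp["score"] + extra

    return sorted(nbest, key=total, reverse=True)


def length_bonus(text):
    """Hypothetical second-pass model: rewards longer translations."""
    return float(len(text.split()))


# First-pass n-best list with log-scores (made-up values).
nbest = [
    {"text": "the cat sat", "score": -2.0},
    {"text": "the cat sat down", "score": -2.3},
]

reranked = rerank(nbest, [length_bonus], [0.5])
print(reranked[0]["text"])
```

With these toy numbers the second candidate overtakes the first after reranking, which is the situation a second pass is meant to catch.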

Skin-color based videos categorization

International Journal of Computer Science Issues, 2012, Vol. 9, Issue 1 1-3, pp. 473-477

Authors: Khan, Rehanullah; Maqsood, Asad; Khan, Zeeshan; Ishaq, Muhammad; Arif, Arsalan. Affiliations: Sarhad University of Science and Information Technology, Peshawar, Pakistan; RWTH Aachen, Human Language Technology and Pattern Recognition, Peshawar, Pakistan; UET Mardan, Peshawar, Pakistan

On dedicated websites, people can upload videos and share them with the rest of the world. Currently these videos are categorized manually with the help of the user community. In this paper, we propose a combination of color spaces with the Bayesian network approach for robust detection of skin color followed by an automated video categorization. Experimental results show that our method can achieve satisfactory performance for categorizing videos based on skin color. © 2012 International Journal of Computer Science Issues.

Keywords: Bayesian networks

Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech

arXiv, 2019

Authors: Menne, Tobias; Sklyar, Ilya; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

Significant performance degradation of automatic speech recognition (ASR) systems is observed when the audio signal contains cross-talk. One of the recently proposed approaches to solve the problem of multi-speaker ASR is the deep clustering (DPCL) approach. Combining DPCL with a state-of-the-art hybrid acoustic model, we obtain a word error rate (WER) of 16.5 % on the commonly used wsj0-2mix dataset, which is the best performance reported thus far to the best of our knowledge. The wsj0-2mix dataset contains simulated cross-talk where the speech of multiple speakers overlaps for almost the entire utterance. In a more realistic ASR scenario the audio signal contains significant portions of single-speaker speech and only part of the signal contains speech of multiple competing speakers. This paper investigates obstacles of applying DPCL as a preprocessing method for ASR in such a scenario of sparsely overlapping speech. To this end we present a data simulation approach, closely related to the wsj0-2mix dataset, generating sparsely overlapping speech datasets of arbitrary overlap ratio. The analysis of applying DPCL to sparsely overlapping speech is an important interim step between the fully overlapping datasets like wsj0-2mix and more realistic ASR datasets, such as CHiME-5 or AMI. Copyright © 2019, The Authors. All rights reserved.

Keywords: Speech
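
Several abstracts in this list report results as word error rate (WER). For reference, a minimal sketch of the usual definition (word-level edit distance divided by the number of reference words) is given below. The example sentences are made up, and real scoring tools typically apply text normalization that is omitted here.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
example = wer("the cat sat on the mat", "the cat sit on mat")
print(round(100 * example, 1))
```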

Efficient Training of Neural Transducer for Speech Recognition

arXiv, 2022

Authors: Zhou, Wei; Michel, Wilfried; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-Transducer has achieved evolving performance with more and more sophisticated neural network models of growing size and increasing training epochs. While strong computation resources seem to be the prerequisite of training superior models, we try to overcome this by carefully designing a more efficient training pipeline. In this work, we propose an efficient 3-stage progressive training pipeline to build highly-performing neural transducer models from scratch with very limited computation resources in a reasonably short time period. The effectiveness of each stage is experimentally verified on both Librispeech and Switchboard corpora. The proposed pipeline is able to train transducer models approaching state-of-the-art performance with a single GPU in just 2-3 weeks. Our best conformer transducer achieves 4.1% WER on Librispeech test-other with only 35 epochs of training. Copyright © 2022, The Authors. All rights reserved.

Keywords: Speech recognition

Phoneme based neural transducer for large vocabulary speech recognition

arXiv, 2020

Authors: Zhou, Wei; Berger, Simon; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora. Copyright © 2020, The Authors. All rights reserved.

Keywords: Transducers

FULL-SUM DECODING FOR HYBRID HMM BASED SPEECH RECOGNITION USING LSTM LANGUAGE MODEL

arXiv, 2020

Authors: Zhou, Wei; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

In hybrid HMM based speech recognition, LSTM language models have been widely applied and achieved large improvements. The theoretical capability of modeling unlimited context suggests that no recombination should be applied in decoding. This motivates reconsidering full summation over the HMM-state sequences instead of the Viterbi approximation in decoding. We explore the potential gain from more accurate probabilities in terms of decision making and apply the full-sum decoding with a modified prefix-tree search framework. The proposed full-sum decoding is evaluated on both Switchboard and Librispeech corpora. Different models using CE and sMBR training criteria are used. Additionally, both MAP and confusion network decoding as approximated variants of the general Bayes decision rule are evaluated. Consistent improvements over strong baselines are achieved in almost all cases without extra cost. We also discuss tuning effort, efficiency and some limitations of full-sum decoding. Copyright © 2020, The Authors. All rights reserved.

Keywords: Decoding
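
The difference between full summation over HMM-state sequences and the Viterbi approximation, as contrasted in the abstract above, can be illustrated on a toy HMM. The sketch below scores a single hypothesis in the log domain; it does not reproduce the paper's prefix-tree search, and all probabilities are made-up values.

```python
import math


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(init, trans, emit, combine):
    """Score an observation sequence under an HMM in the log domain.

    combine=logsumexp sums over all state sequences (full sum);
    combine=max keeps only the best state sequence (Viterbi approximation).
    init[s], trans[s_prev][s] and emit[t][s] are log-probabilities.
    """
    n = len(init)
    q = [init[s] + emit[0][s] for s in range(n)]
    for t in range(1, len(emit)):
        q = [
            combine([q[p] + trans[p][s] for p in range(n)]) + emit[t][s]
            for s in range(n)
        ]
    return combine(q)


def log_rows(rows):
    return [[math.log(p) for p in row] for row in rows]


# Toy two-state HMM and two frames (hypothetical numbers).
init = [math.log(0.6), math.log(0.4)]
trans = log_rows([[0.7, 0.3], [0.4, 0.6]])
emit = log_rows([[0.5, 0.1], [0.2, 0.9]])

full_sum = sequence_score(init, trans, emit, logsumexp)
viterbi = sequence_score(init, trans, emit, max)
print(math.exp(full_sum), math.exp(viterbi))
```

The full-sum score is never smaller than the Viterbi score, since the best path is one of the summed terms. How much the gap matters for the final decision is the empirical question the abstract addresses.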

LATTICE-FREE SEQUENCE DISCRIMINATIVE TRAINING FOR PHONEME-BASED NEURAL TRANSDUCERS

arXiv, 2022

Authors: Yang, Zijian; Zhou, Wei; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52074, Germany; AppTek GmbH, Aachen 52062, Germany

Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid models, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypotheses generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain 40% - 70% relative training time speedup with a small degradation in performance. Copyright © 2022, The Authors. All rights reserved.

Keywords: Transducers

Early stage LM integration using local and global log-linear combination

arXiv, 2020

Authors: Michel, Wilfried; Schlüter, Ralf; Ney, Hermann. Affiliations: Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen 52056, Germany; AppTek GmbH, Aachen 52062, Germany

Sequence-to-sequence models with an implicit alignment mechanism (e.g. attention) are closing the performance gap towards traditional hybrid hidden Markov models (HMM) for the task of automatic speech recognition. One important factor to improve word error rate in both cases is the use of an external language model (LM) trained on large text-only corpora. Language model integration is straightforward with the clear separation of acoustic model and language model in classical HMM-based modeling. In contrast, multiple integration schemes have been proposed for attention models. In this work, we present a novel method for language model integration into implicit-alignment based sequence-to-sequence models. Log-linear model combination of acoustic and language model is performed with a per-token renormalization. This allows us to compute the full normalization term efficiently both in training and in testing. This is compared to a global renormalization scheme which is equivalent to applying shallow fusion in training. The proposed methods show good improvements over standard model combination (shallow fusion) on our state-of-the-art Librispeech system. Furthermore, the improvements are persistent even if the LM is exchanged for a more powerful one after training. Copyright © 2020, The Authors. All rights reserved.

Keywords: Hidden Markov models
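
The per-token renormalization described in the abstract above can be sketched for a single decoding step as follows. Shallow fusion adds the scaled LM log-probability to the acoustic (sequence-to-sequence) model log-probability; the local log-linear variant additionally renormalizes over the vocabulary so the combined scores form a distribution again. The three-word vocabulary, the probabilities and the scale `lm_scale` are assumptions for illustration, and only one scale is used here.

```python
import math


def log_softmax(scores):
    """Renormalize log-scores so that their exponentials sum to one."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def shallow_fusion(am_logp, lm_logp, lm_scale):
    """Unnormalized log-linear combination of the two models for one step."""
    return [a + lm_scale * l for a, l in zip(am_logp, lm_logp)]


def local_log_linear(am_logp, lm_logp, lm_scale):
    """Same combination, followed by per-token renormalization over the vocabulary."""
    return log_softmax(shallow_fusion(am_logp, lm_logp, lm_scale))


# Toy next-token distributions over a three-word vocabulary (made-up values).
am = [math.log(p) for p in (0.7, 0.2, 0.1)]
lm = [math.log(p) for p in (0.1, 0.6, 0.3)]

fused = shallow_fusion(am, lm, lm_scale=0.5)
local = local_log_linear(am, lm, lm_scale=0.5)
print(sum(math.exp(x) for x in fused), sum(math.exp(x) for x in local))
```

Within a single step the renormalization does not change the ranking of tokens. It matters across steps, because each step's normalization term differs and therefore changes how complete hypotheses compare.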