We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (...
详细信息
ISBN:
(纸本)9781450395960
We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme based losses. We report a relative WER improvement of 15.7% on top of state-of-the-art LAS models using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system on the Flipkart Voice Search dataset. The paper also provides an overview of different components that can be tuned in a LAS based system.
Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we int...
详细信息
ISBN:
(纸本)9781665437394
Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter.
We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (...
详细信息
ISBN:
(纸本)9781450395960
We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme based losses. We report a relative WER improvement of 15.7% on top of state-of-the-art LAS models using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system on the Flipkart Voice Search dataset. The paper also provides an overview of different components that can be tuned in a LAS based system.
Distributed Ledger technologies are becoming a standard for the management of online transactions, mainly due to their capability to ensure data privacy, trustworthiness and security. Still, they are not immune to sec...
详细信息
ISBN:
(纸本)9783030594916;9783030594909
Distributed Ledger technologies are becoming a standard for the management of online transactions, mainly due to their capability to ensure data privacy, trustworthiness and security. Still, they are not immune to security issues, as witnessed by recent successful cyber-attacks. Under a statistical perspective, attacks can be characterized as anomalous observations concerning the underlying activity. In this work, we propose an Ensemble Deep Learning approach to detect deviant behaviors on Blockchain where the base learner, an encoder-decoder model, is strengthened by iteratively learning and aggregating multiple instances, to compute an outlier score for each observation. Our experiments on historical logs of the Ethereum Classic network and synthetic data prove the capability of our model to effectively detect cyber-attacks.
Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts neede...
详细信息
ISBN:
(纸本)9781510872219
Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts needed to train a speech recognizer are usually not available for low-resource languages. Recent work has found that neural encoder-decoder models can learn to directly translate foreign speech in high-resource scenarios, without the need for intermediate transcription. We investigate whether this approach also works in settings where both data and computation are limited. To make the approach efficient, we make several architectural changes, including a change from character-level to word-level decoding. We find that this choice yields crucial speed improvements that allow us to train with fewer computational resources, yet still performs well on frequent words. We explore models trained on between 20 and 160 hours of data, and find that although models trained on less data have considerably lower BLEU scores, they can still predict words with relatively high precision and recall around 50% for a model trained on 50 hours of data, versus around 60% for the full 160 hour model. Thus, they may still be useful for some low-resource scenarios.
We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL). Our methodol...
详细信息
ISBN:
(纸本)9781510872219
We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing between recordings in the UL with translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones that is segmented using neural soft-alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi;comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.
暂无评论