ISBN (print): 9781509066315
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
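As a sketch of how the GMM-based family computes alignments, a minimal NumPy version of one attention step might look as follows (the function and parameter names are ours; in the actual model the step sizes, widths and mixture weights are produced by the decoder network):

```python
import numpy as np

def gmm_attention_step(prev_means, deltas, sigmas, weights, enc_len):
    """One decoder step of location-relative GMM attention (illustrative).

    The mixture means advance monotonically: mu_t = mu_{t-1} + softplus(delta),
    so the alignment can never jump backwards through the encoder states.
    """
    new_means = prev_means + np.log1p(np.exp(deltas))  # softplus keeps each step positive
    positions = np.arange(enc_len)[None, :]            # shape (1, enc_len)
    # Mixture of Gaussians evaluated over encoder positions, shape (n_mix, enc_len)
    comp = weights[:, None] * np.exp(
        -0.5 * ((positions - new_means[:, None]) / sigmas[:, None]) ** 2
    )
    alpha = comp.sum(axis=0)
    alpha = alpha / alpha.sum()  # normalise to an attention distribution
    return new_means, alpha
```

Because the softplus keeps every step positive, the mixture means can only move forward through the encoder states, which is the property that lets this family generalize to very long utterances.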
ISBN (print): 9781509066315
End-to-end architectures have recently been proposed for spoken language understanding (SLU) and semantic parsing. Trained on large amounts of data, these models learn acoustic and linguistic-sequential features jointly. While such architectures give very good results for domain, intent and slot detection, their application to the more complex task of semantic chunking and tagging is less straightforward, so in many cases models are combined with an external language model to enhance their performance. In this paper we introduce a data-efficient system that is trained end-to-end, with no additional pre-trained external module. One key feature of our approach is an incremental training procedure in which acoustic, language and semantic models are trained sequentially, one after the other. The proposed model has a reasonable size and achieves results competitive with the state of the art while using a small training dataset. In particular, we reach a 24.02% Concept Error Rate (CER) on MEDIA/test while training on MEDIA/train without any additional data.
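The incremental procedure amounts to running training stages in a fixed order, each stage warm-starting from the previous checkpoint. A hedged sketch (the stage names follow the abstract; `train_fn` stands in for a real per-stage trainer):

```python
def train_incrementally(stages, train_fn):
    """Run training stages in order (e.g. acoustic -> language -> semantic),
    each stage resuming from the previous stage's checkpoint.

    `train_fn(name, ckpt)` is an assumed per-stage trainer that returns the
    new checkpoint; the real system's freezing/fine-tuning details are not
    modeled here.
    """
    ckpt = None
    history = []
    for name in stages:
        ckpt = train_fn(name, ckpt)  # warm-start from previous stage
        history.append(name)
    return ckpt, history
```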
ISBN (print): 9781713820697
Recent advances in neural TTS have led to models that can produce high-quality synthetic speech. However, these models typically require large amounts of training data, which can make it costly to produce a new voice with the desired quality. Although multi-speaker modeling can reduce the data requirements necessary for a new voice, this approach is usually not viable for many low-resource languages for which abundant multi-speaker data is not available. In this paper, we therefore investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data. We found that multilingual modeling can increase the naturalness of low-resource language speech, showed that multilingual models can produce speech with a naturalness comparable to monolingual multi-speaker models, and saw that the target language naturalness was affected by the strategy used to add foreign language data.
ISBN (print): 9781713820697
Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully-chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
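The subset-selection idea can be sketched as a plain k-means over per-speaker embeddings. Everything here is illustrative: the paper's clustering details and its criterion for choosing which cluster to keep are not modeled, and we fall back to the most populous cluster as a stand-in.

```python
import numpy as np

def select_speaker_subset(embeddings, n_clusters=3, keep_cluster=None,
                          n_iter=20, seed=0):
    """Cluster per-speaker acoustic embeddings with k-means and return the
    indices of the speakers in one cluster (unsupervised subset selection)."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance of every speaker to every centroid, shape (n_speakers, n_clusters)
        d = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = embeddings[labels == k].mean(axis=0)
    if keep_cluster is None:
        # Stand-in criterion: keep the most populous cluster.
        keep_cluster = np.bincount(labels, minlength=n_clusters).argmax()
    return np.where(labels == keep_cluster)[0]
```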
Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where...
With the development of deep neural networks, sequence-to-sequence (Seq2Seq) models have become a popular technique for conversation models. Current Seq2Seq models with a single encoder-decoder structure tend to generate responses that reflect high-frequency patterns in the dataset. However, these patterns are often generic and meaningless, and generic, meaningless responses quickly bring a conversation between computer and human to an end. According to our observations, human conversations are always topic-related. If the conversation data are divided into clusters according to topic, high-frequency patterns become topic-related rather than generic, so a model trained on separate clusters should generate more topic-related and meaningful responses. Inspired by this idea, we propose a Multi-Encoder Neural Conversation (MENC) model, which exploits topic information through its multi-encoder structure. To the best of our knowledge, this is the first work to apply multi-encoder structures to conversation models. We conduct experiments on two daily-conversation datasets and show that MENC outperforms other mainstream models on both subjective and objective evaluation metrics.
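The data-partitioning step behind the multi-encoder structure can be sketched as follows (`topic_of` stands in for an assumed external topic classifier; MENC's actual clustering procedure may differ):

```python
from collections import defaultdict

def split_by_topic(pairs, topic_of):
    """Partition (query, response) training pairs into topic clusters, so a
    separate encoder can be trained per cluster. `topic_of` is an assumed
    query -> topic-label function."""
    clusters = defaultdict(list)
    for query, response in pairs:
        clusters[topic_of(query)].append((query, response))
    return dict(clusters)
```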
ISBN (print): 9781479981311
Voice interfaces are becoming wildly popular and driving demand for more advanced speech synthesis and voice transformation systems. Current text-to-speech methods produce realistic sounding voices, but they lack the emotional expressivity that listeners expect, given the context of the interaction and the phrase being spoken. Emotional voice conversion is a research domain concerned with generating expressive speech from neutral synthesised speech or natural human voice. This research investigated the effectiveness of using a sequence-to-sequence (seq2seq) encoder-decoder based model to transform the intonation of a human voice from neutral to expressive speech, with some preliminary introduction of linguistic conditioning. A subjective experiment conducted on the task of speech emotion recognition by listeners successfully demonstrated the effectiveness of the proposed sequence-to-sequence models to produce convincing voice emotion transformations. In particular, conditioning the model on the position of the syllable in the phrase significantly improved recognition rates.
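The positional conditioning the last sentence refers to can be as simple as a normalised syllable-position feature. A minimal sketch (our own formulation, not necessarily the exact encoding used in the paper):

```python
def syllable_positions(n_syllables):
    """Normalised position in [0, 1] of each syllable within its phrase,
    usable as a per-syllable conditioning feature for the seq2seq model."""
    if n_syllables == 1:
        return [0.0]
    return [i / (n_syllables - 1) for i in range(n_syllables)]
```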
ISBN (print): 9781479981311
Improving the representation of contextual information is key to unlocking the potential of end-to-end (E2E) automatic speech recognition (ASR). In this work, we present a novel and simple approach for training an ASR context mechanism with difficult negative examples. The main idea is to focus on proper nouns (e.g., unique entities such as names of people and places) in the reference transcript and use phonetically similar phrases as negative examples, encouraging the neural model to learn more discriminative representations. We apply our approach to an end-to-end contextual ASR model that jointly learns to transcribe and select the correct context items. We show that our proposed method gives up to 53.1% relative improvement in word error rate (WER) across several benchmarks.
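A sketch of difficult-negative mining by phonetic similarity (edit distance over phoneme sequences stands in for whatever similarity measure the model actually uses, and `phonemes` is an assumed word-to-phoneme lookup such as a G2P front end would provide):

```python
def mine_hard_negatives(target, candidates, phonemes, k=2):
    """Return the k candidate phrases whose phoneme sequences are closest
    (by Levenshtein distance) to the target proper noun -- these serve as
    difficult negative examples for the context mechanism."""
    def edit(a, b):
        # Single-row Levenshtein DP over phoneme sequences.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                         dp[j - 1] + 1,  # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]

    return sorted(candidates, key=lambda w: edit(phonemes[target], phonemes[w]))[:k]
```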
Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed performance from previous end-to-end approaches and even match the conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both the encoder and decoder, combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieves 9.9% and 17.7% WER on the Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.
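Stochastic residual connections amount to stochastic depth: each residual branch is dropped with some probability during training and rescaled at inference. A minimal sketch (our own simplification; the paper applies this per Transformer layer):

```python
import random

def stochastic_residual(x, layer, survive_p, training=True):
    """Residual connection with stochastic depth: during training the layer
    is skipped with probability 1 - survive_p; at inference its output is
    scaled by survive_p so the expected contribution matches training."""
    if training:
        if random.random() < survive_p:
            return x + layer(x)
        return x  # branch dropped for this step
    return x + survive_p * layer(x)
```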