Search Results - Inner Mongolia University Library

18th Annual Conference of the International-Speech-Communication-Association (INTERSPEECH 2017)

Authors: Prabhavalkar, Rohit; Rao, Kanishka; Sainath, Tara N.; Li, Bo; Johnson, Leif; Jaitly, Navdeep. Affiliations: Google Inc, Mountain View, CA 94043, USA; NVIDIA, Santa Clara, CA, USA

ISBN: (print) 9781510848764

In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN transducer with an attention mechanism. We find that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.

Keywords: sequence-to-sequence models; attention models; end-to-end models; RNN transducer

Exploring Sequence-to-Sequence Transformer-Transducer Models for Keyword Spotting

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Labrador, Beltrán; Zhao, Guanlong; López Moreno, Ignacio; Scorza Scarpati, Angelo; Fowl, Liam; Wang, Quan. Affiliations: Spain; Google LLC, United States

ISBN: (print) 9781728163277

In this paper, we present a novel approach to adapt a sequence-to-sequence Transformer-Transducer ASR system to the keyword spotting (KWS) task. We achieve this by replacing the keyword in the text transcription with a special token and training the system to detect the token in an audio stream. At inference time, we create a decision function inspired by conventional KWS approaches, to make our approach more suitable for the KWS task. Furthermore, we introduce a specific keyword spotting loss by adapting the sequence-discriminative Minimum Bayes-Risk training technique. We find that our approach significantly outperforms ASR based KWS systems. When compared with a conventional keyword spotting system, our proposal has similar performance while bringing the advantages and flexibility of sequence-to-sequence training. Additionally, when combined with the conventional KWS system, our approach can improve the performance at any operation point. © 2023 IEEE.

Keywords: Keyword spotting; sequence-to-sequence models; speech recognition; transformer transducer
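
The transcript-rewriting step described in the abstract above (replacing the keyword in the text transcription with a special token) can be illustrated with a minimal Python sketch. The token name `<kw>`, the function name `mask_keyword`, and the case-insensitive whole-phrase matching are assumptions made for illustration; the abstract does not specify how the authors perform the replacement.

```python
import re

# Hypothetical special token; the paper's actual token is not given in the abstract.
KEYWORD_TOKEN = "<kw>"


def mask_keyword(transcript: str, keyword: str, token: str = KEYWORD_TOKEN) -> str:
    """Replace each whole-phrase, case-insensitive occurrence of `keyword` with `token`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    # A function replacement avoids any backslash processing of the token string.
    return pattern.sub(lambda _match: token, transcript)


if __name__ == "__main__":
    print(mask_keyword("hey assistant play some music", "hey assistant"))
```

Training targets rewritten this way would let a sequence-to-sequence recognizer emit a single symbol where the keyword is spoken; the decision function and the keyword spotting loss mentioned in the abstract are not covered by this sketch.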

Prosody recognition in Persian poetry

SPEECH COMMUNICATION, 2025, Vol. 170

Authors: Shahrestani, Mohammadreza; Chehreghani, Mostafa Haghir. Affiliation: Amirkabir Univ Technol (Tehran Polytech), Dept Comp Engn, Tehran, Iran

Classical Persian poetry, like traditional poetry from other cultures, follows set metrical patterns, known as prosody. Recognizing the prosody of a given poem is very useful for understanding and analyzing Persian language and literature. With advances in artificial intelligence (AI), AI techniques have become popular for recognizing prosody. However, the application of advanced AI methodologies to the task of detecting prosody in Persian poetry is not well explored. Additionally, the lack of an extensive collection of traditional Persian poems, each meticulously annotated with its prosodic pattern, is another challenge. In this paper, we first create a large dataset of prosodic meters comprising about 1.3 million couplets with detailed prosodic annotations. Then, we introduce five models that harness advanced deep learning methodologies to discern the prosody of Persian poetry. These models include: (i) a transformer-based classifier, (ii) a grapheme-to-phoneme mapping-based method, (iii) a sequence-to-sequence model, (iv) a sequence-to-sequence model with phonemic sequences, and (v) a hybrid approach that leverages the strengths of both the textual information of poetry and its phonemic sequence. Our experimental results reveal that the hybrid model typically outperforms the other models, especially when applied to large samples of the created dataset. Our code is publicly available at https://***/m-shahrestani/Prosody-Recognition-in-Persian-Poetry/.

Keywords: Persian poetry; prosody detection; Deep learning; Transformers; sequence-to-sequence models

Advancing machine learning with OCR2SEQ: an innovative approach to multi-modal data augmentation

JOURNAL OF BIG DATA, 2024, Vol. 11, No. 1, p. 86

Authors: Lowe, Michael; Prusa, Joseph D.; Leevy, Joffrey L.; Khoshgoftaar, Taghi M. Affiliation: Florida Atlantic Univ, 777 Glades Rd, Boca Raton, FL 33431, USA

OCR2SEQ represents an innovative advancement in Optical Character Recognition (OCR) technology, leveraging a multi-modal generative augmentation strategy to overcome traditional limitations in OCR systems. This paper introduces OCR2SEQ's unique approach, tailored to enhance data quality for sequence-to-sequence models, especially in scenarios characterized by sparse character sets and specialized vocabularies. At the heart of OCR2SEQ lies a set of novel augmentation techniques designed to simulate realistic text extraction errors. These techniques are adept at generating diverse and challenging data scenarios, thereby substantially improving the training efficacy and accuracy of text-to-text transformers. The application of OCR2SEQ has shown notable improvements in data processing accuracy, particularly in sectors heavily dependent on OCR technologies such as healthcare and library sciences. This paper demonstrates the capability of OCR2SEQ to transform OCR systems by enriching them with augmented, domain-specific data, paving the way for more sophisticated and reliable machine learning interpretations. This advancement in OCR technology, as presented in the study, not only enhances the accuracy and reliability of data processing but also sets a new benchmark in the integration of augmented data for refining OCR capabilities.

Keywords: Large language models; Optical character recognition; sequence-to-sequence models; Text-to-text transformers; Data augmentation; Noise correction
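
As a rough illustration of what "simulating realistic text extraction errors" might look like, the following Python sketch corrupts clean text with character confusions of the kind OCR engines commonly produce. The confusion table, the function name `add_ocr_noise`, and the per-character substitution rate are all hypothetical; the abstract does not describe OCR2SEQ's actual augmentation techniques, and this is not a reconstruction of them.

```python
import random

# Hypothetical confusion table of visually similar glyphs (an assumption, not from the paper).
CONFUSIONS = {"l": "1", "O": "0", "S": "5", "m": "rn"}


def add_ocr_noise(text: str, rate: float, rng: random.Random) -> str:
    """Swap each confusable character for its look-alike with probability `rate`."""
    noisy = []
    for ch in text:
        # rng.random() lies in [0, 1), so rate=0.0 never substitutes and rate=1.0 always does.
        if ch in CONFUSIONS and rng.random() < rate:
            noisy.append(CONFUSIONS[ch])
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the corruption is reproducible
    clean = "Small Oval lamp"
    print(clean, "->", add_ocr_noise(clean, 0.5, rng))
```

Pairs of (noisy, clean) strings produced this way could serve as source and target sequences for a text-to-text correction model.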

Context-Relevant Denoising for Unsupervised Domain-Adapted Sentence Embeddings

25th IEEE International Conference on Information Reuse and Integration for Data Science (IEEE IRI)

Authors: Lowe, Michael; Prusa, Joseph D.; Leevy, Joffrey L.; Khoshgoftaar, Taghi M. Affiliation: Florida Atlantic Univ, Boca Raton, FL 33431, USA

ISBN: (print) 9798350351194; 9798350351187

In closed-system domains, such as healthcare databases, record scarcity and data quality often act as barriers to applying state-of-the-art language processing techniques. Addressing these challenges requires the adjustment of both domain and task to effectively deliver meaningful value. A common approach for adapting domains with limited and poorly annotated data is data augmentation. Transformers and Sequential Denoising Auto-Encoders (TSDAEs) offer an inductive, unsupervised pre-training method that efficiently leverages unlabeled data by learning from many-to-one corrupted training samples. This approach reduces the need for extensive manual data annotation typically associated with domain adaptation. We advance this method by using transduction-based noise generation, which simulates the kind of noise commonly encountered in text generation within targeted domains. Our study investigates the effects of corruption and contextual noise introduced by this augmentation, thus enhancing the practical ability of domain-adapted models in specialized fields.

Keywords: optical character recognition; sequence-to-sequence models; text-to-text transformers; data augmentation; noise correction

Feature Extraction Approach for Predicting Protein-DNA Binding Residues Using Transformer Encoder-Decoder Architecture

20th International Conference on Intelligent Computing (ICIC)

Authors: Qiu, Yi; Cheng, Long; Xu, Man; Chen, Jing; Wu, Hongjie. Affiliation: Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Jiangsu, Peoples R China

ISBN: (print) 9789819756889; 9789819756896

In the realm of biology, the effects of protein binding with other molecules are of paramount importance, especially in the context of DNA binding. Precisely identifying the residues implicated in protein-DNA binding is crucial for gaining a more profound insight into the mechanisms governing protein-DNA interactions. The majority of existing methods presently utilize a two-step approach, which is plagued by drawbacks including low prediction efficiency and poor usability, thereby constraining their practical applicability. In the present study, we propose a novel method grounded in sequence-to-sequence (seq2seq) models. This model has the capability to accept variable-length complete protein sequences as input and employs Transformer encoder blocks along with feature extraction blocks for hierarchical feature extraction. Through this approach, our objective is to augment the identification capability of protein-DNA binding residues. We conducted comparative experiments on the benchmark datasets, with the results demonstrating the remarkable effectiveness of our proposed method in identifying protein-DNA binding residues. This approach presents a promising new avenue for tackling research on protein-DNA interactions.

Keywords: Protein-DNA binding; sequence-to-sequence models; Residue identification; Transformer encoder blocks; Feature extraction

Sequence-to-Sequence Multi-Modal Speech In-Painting

Interspeech Conference

Authors: Elyaderani, Mahsa Kadkhodaei; Shirani, Shahram. Affiliation: McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada

Speech in-painting is the task of regenerating missing audio contents using reliable context information. Despite various recent studies in multi-modal perception of audio in-painting, there is still a need for an effective infusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages the visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip-reader for facial recordings and the decoder takes both encoder outputs as well as the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech inpainting model and has comparable results with a recent multimodal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which proves the effectiveness of the introduced multi-modality in speech in-painting.

Keywords: speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

Interspeech Conference

Authors: Bai, Ye; Yi, Jiangyan; Tao, Jianhua; Tian, Zhengkun; Wen, Zhengqi. Affiliations: Chinese Acad Sci, Inst Automat, NLPR, Beijing, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China; CAS Ctr Excellence Brain Sci & Intelligence Techn, Shanghai, Peoples R China

Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works utilize linear interpolation or a fusion network to integrate external language models. However, these approaches introduce external components and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach to integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, which is trained on large-scale external text, generates soft labels to guide the sequence-to-sequence model training. Thus, the language model plays the role of the teacher. This approach does not add any external component to the sequence-to-sequence model during testing. It can also be combined with the shallow fusion technique during decoding. The experiments are conducted on the public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, a relative reduction of 18.42% compared with the vanilla sequence-to-sequence model.

Keywords: knowledge distillation; external language models; end-to-end; sequence-to-sequence models
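
The idea of a teacher language model supplying soft labels can be sketched for a single output step as follows. This is a generic knowledge-distillation objective, assuming the common formulation that interpolates a hard-label cross-entropy with a soft-label cross-entropy; the weight `lam`, the function names, and the toy three-character vocabulary are assumptions, and the abstract does not state the exact loss the authors use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted); zero-probability targets contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_probs, gold_index, lam=0.5):
    """Interpolate hard-label and teacher soft-label cross-entropy (lam is an assumed weight)."""
    student_probs = softmax(student_logits)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, student_probs)        # loss against the transcript label
    soft = cross_entropy(teacher_probs, student_probs)  # loss against the LM's soft labels
    return (1.0 - lam) * hard + lam * soft


if __name__ == "__main__":
    # Toy vocabulary of three characters; the teacher distribution comes from a language model.
    print(distillation_loss([2.0, 0.5, -1.0], [0.7, 0.2, 0.1], gold_index=0))
```

Because the teacher only shapes the training targets, nothing from the language model needs to be kept at test time, which matches the abstract's point that no external component is added during testing.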

Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data

Interspeech Conference

Authors: Fong, Jason; Gallegos, Pilar Oplustil; Hodari, Zack; King, Simon. Affiliation: Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Sequence-to-sequence (S2S) text-to-speech (TTS) models can synthesise high quality speech when large amounts of annotated training data are available. Transcription errors exist in all data and are especially prevalent in found data such as audiobooks. In previous generations of TTS technology, alignment using Hidden Markov models (HMMs) was widely used to identify and eliminate bad data. In S2S models, the use of attention replaces HMM-based alignment, and there is no explicit mechanism for removing bad data. It is not yet understood how such models deal with transcription errors in the training data. We evaluate the quality of speech from S2S-TTS models when trained on data with imperfect transcripts, simulated using corruption, or provided by an Automatic Speech Recogniser (ASR). We find that attention can skip over extraneous words in the input sequence, providing robustness to insertion errors. But substitutions and deletions pose a problem because there is no ground truth input available to align to the ground truth acoustics during teacher-forced training. We conclude that S2S-TTS systems are only partially robust to training on imperfectly-transcribed data and further work is needed.

Keywords: speech synthesis; sequence-to-sequence models; found data

Abstract Representation for Multi-Intent Spoken Language Understanding

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Abrougui, Rim; Damnati, Géraldine; Heinecke, Johannes; Béchet, Frédéric. Affiliations: Orange Innovation, Lannion, France; Aix-Marseille University, CNRS, Marseille, France

ISBN: (print) 9781728163277

Current sequence tagging models based on Deep Neural Network models with pretrained language models achieve almost perfect results on many SLU benchmarks with a flat semantic annotation at the token level such as ATIS or SNIPS. When dealing with more complex human-machine interactions (multi-domain, multi-intent, dialog context), relational semantic structures are needed in order to encode the links between slots and intents within an utterance and through dialog history. We propose in this study a new way to project annotation in an abstract structure with more compositional expressive power and a model to directly generate this abstract structure. We evaluate it on the MultiWoz dataset in a contextual SLU experimental setup. We show that this projection can be used to extend the existing flat annotations towards graph-based structures. © 2023 IEEE.

Keywords: Natural Language Understanding; sequence tagging; sequence-to-sequence models; Spoken Language Understanding
