Search results - Inner Mongolia University Library

A Comparative Analysis of Generative Neural Attention-based Service Chatbot

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, vol. 13, no. 8, pp. 742-751

Authors: Suhaili, Sinarwati Mohamad; Salim, Naomie; Jambli, Mohamad Nazim. Affiliations: Pre Univ Kota Samarahan Sarawak Malaysia; Univ Teknol Malaysia Fac Comp Skudai 81310 Johor Malaysia; Univ Teknol Malaysia Ibnu Sina Inst Sci & Ind Res UTM Big Data Ctr Skudai 81310 Johor Malaysia; Univ Malaysia Sarawak Fac Comp Sci & Informat Technol Kota Samarahan Sarawak Malaysia

Companies constantly rely on customer support to deliver pre- and post-sale services to their clients through websites, mobile devices or social media platforms such as Twitter. In assisting customers, companies employ virtual service agents (chatbots) to provide support via communication devices. The primary focus is to automate the generation of conversational chat between a computer and a human by constructing virtual service agents that can predict appropriate and automatic responses to customers' queries. This paper aims to present and implement a seq2seq-based learning task model based on encoder-decoder architectural solutions by training generative chatbots on customer support Twitter datasets. The model is based on deep Recurrent Neural Network (RNN) structures, namely uni-directional and bi-directional encoder types of Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). The RNNs are augmented with an attention layer to focus on important information between input and output sequences. Word-level embeddings such as Word2Vec, GloVe, and FastText are employed as input to the model. Building on the base architecture, a comparative analysis is applied in which baseline models are compared with and without attention, as well as with different types of input embedding for each experiment. Bilingual Evaluation Understudy (BLEU) was employed to evaluate the model's performance. Results revealed that while biLSTM performs better with GloVe, biGRU operates better with FastText. Thus, the findings indicate that the attention-based, bi-directional RNN (LSTM or GRU) models significantly outperformed baseline approaches in BLEU score, making them promising for use in future work.

Keywords: Sequence-to-sequence; encoder-decoder; service chatbot; attention-based encoder-decoder; Recurrent Neural Network (RNN); Long Short-Term Memory (LSTM); Gated Recurrent Unit (GRU); word embedding
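The attention layer mentioned in the abstract above can be illustrated with a minimal sketch of a single decoder step. The abstract does not say which scoring function the authors use, so the dot-product scoring here is an assumption, and the names `attend`, `encoder_states`, and `decoder_state` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Assumption: dot-product scoring between the decoder state and each
    encoder state; the paper may use a different scoring function.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy inputs: three encoder states of dimension 2 and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

In this toy case the first and third encoder states receive equal, larger weights because they align with the decoder state, while the second receives the smallest weight.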

Residual Language Model for End-to-end Speech Recognition 23

Interspeech Conference

Authors: Tsunoo, Emiru; Kashiwagi, Yosuke; Narisetty, Chaitanya; Watanabe, Shinji. Affiliations: Sony Grp Corp Tokyo Japan; Carnegie Mellon Univ Pittsburgh PA 15213 USA

End-to-end automatic speech recognition suffers from adaptation to unknown target domain speech despite being trained with a large amount of paired audio-text data. Recent studies estimate a linguistic bias of the model as the internal language model (LM). To effectively adapt to the target domain, the internal LM is subtracted from the posterior during inference and fused with an external target-domain LM. However, this fusion complicates the inference and the estimation of the internal LM may not always be accurate. In this paper, we propose a simple external LM fusion method for domain adaptation, which considers the internal LM estimation in its training. We directly model the residual factor of the external and internal LMs, namely the residual LM. To stably train the residual LM, we propose smoothing the estimated internal LM and optimizing it with a combination of cross-entropy and mean-squared-error losses, which consider the statistical behaviors of the internal LM in the target domain data. We experimentally confirmed that the proposed residual LM performs better than the internal LM estimation in most of the cross-domain and intra-domain scenarios.

Keywords: speech recognition; language model; attention-based encoder-decoder; internal language model estimation

Review of methods of end-to-end automatic recognition of Kazakh speech

Procedia Computer Science, 2024, vol. 251, pp. 615-620

Authors: Yerlan Karabaliyev; Kateryna Kolesnikova; Nurkhan Batyrkhan. Affiliation: International IT University, 34/1 Manas str., Almaty, Kazakhstan

This paper provides a comprehensive review of end-to-end automatic speech recognition methods for the Kazakh language, which is considered a low-resource language with unique phonetic and grammatical features. These features present significant challenges for automatic speech recognition systems. The review, conducted in accordance with the Arksey and O'Malley framework, evaluated the effectiveness of various approaches, including traditional methods that involve separate stages of acoustic modeling, language modeling, and lexicon development, as well as modern end-to-end models. Our analysis identified 22 relevant studies published between 2019 and 2024, including 8 controlled trials, 2 design studies, and 12 intervention studies. The findings demonstrate that end-to-end models, such as Connectionist Temporal Classification (CTC) and Recurrent Neural Network Transducer (RNN-T), show promise in improving the accuracy of Kazakh speech recognition by integrating all recognition stages into a single neural network. However, challenges remain, particularly in the availability of annotated data and the need for more robust models tailored to the Kazakh language.

Keywords: End-to-end models; Automatic Speech Recognition (ASR); Kazakh Language; Connectionist Temporal Classification (CTC); Recurrent Neural Network Transducer (RNN-T); attention-based encoder-decoder

An End-to-End Network for Continuous Human Motion Recognition via Radar Radios

IEEE SENSORS JOURNAL, 2021, vol. 21, no. 5, pp. 6487-6496

Authors: Zhao, Running; Ma, Xiaolin; Liu, Xinhua; Liu, Jian. Affiliations: Wuhan Univ Technol Sch Informat Engn Hubei Key Lab Broadband Wireless Commun & Sensor Wuhan 430070 Peoples R China; Univ Tennessee Dept Elect Engn & Comp Sci Knoxville TN 37996 USA

Micro-Doppler-based continuous human motion recognition (HMR) has gained considerable attention recently. However, existing methods mainly rely on individual recurrent neural networks or sliding-window-based approaches, which makes it hard for them to effectively exploit all the temporal information to predict motions. Additionally, they need to transform the raw radar data into other domains and then perform feature extraction and classification. Thus, the representation cannot be optimized, and its high computational complexity and independence from the learning model make the network consume significant time. In this article, to address these issues, we propose a new end-to-end network that uses radar radios to recognize continuous motion. Specifically, the fusion layer fuses the raw I & Q radar data without the need for representations, and it is integrated with subsequent networks in an end-to-end manner for joint optimization. Moreover, the attention-based encoder-decoder structure encodes the fused data and selects useful temporal information for recognition, which guarantees the effective use of all the temporal information. The experiments show that in continuous HMR, the proposed network outperforms existing methods in terms of accuracy and inference time.

Keywords: Radar; Sensors; Feature extraction; Spectrogram; Vibrations; Convolution; Data models; Continuous human motion recognition; micro-Doppler; raw radar data; deep learning; end-to-end; attention-based encoder-decoder

INTERNAL LANGUAGE MODEL TRAINING FOR DOMAIN-ADAPTIVE END-TO-END SPEECH RECOGNITION

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Meng, Zhong; Kanda, Naoyuki; Gaur, Yashesh; Parthasarathy, Sarangarajan; Sun, Eric; Lu, Liang; Chen, Xie; Li, Jinyu; Gong, Yifan. Affiliation: Microsoft Corp Redmond WA 98052 USA

ISBN: (print) 9781728176055

The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method [1]. In this method, the internal LM score is subtracted from the score obtained by interpolating the E2E score with the external LM score, during inference. To improve the ILME-based inference, we propose an internal LM training (ILMT) method to minimize an additional internal LM loss by updating only the E2E model components that affect the internal LM estimation. ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy. After ILMT, the more modular E2E model with matched training and inference criteria enables a more thorough elimination of the source-domain internal LM, and therefore leads to a more effective integration of the target-domain external LM. Experimented with 30K-hour trained recurrent neural network transducer and attention-based encoder-decoder models, ILMT with ILME-based inference achieves up to 31.5% and 11.4% relative word error rate reductions from standard E2E training with Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.

Keywords: Speech recognition; language model; recurrent neural network transducer; attention-based encoder-decoder

STREAMING END-TO-END SPEECH RECOGNITION WITH JOINTLY TRAINED NEURAL FEATURE ENHANCEMENT

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Kim, Chanwoo; Garg, Abhinav; Gowda, Dhananjaya; Mun, Seongkyu; Han, Changwoo. Affiliation: Samsung Res Seoul South Korea

ISBN: (print) 9781728176055

In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, speech recognition accuracy of a MoCha-based model for clean speech drops significantly when a multi-style training approach is applied. Inspired by Curriculum Learning [1], we introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL). With GAEF, the model is initially trained using clean features. Subsequently, the portion of outputs from the enhancement layers gradually increases. With GREL, the portion of the Mean Squared Error (MSE) loss for the enhanced output gradually reduces as training proceeds. In experimental results on the LibriSpeech corpus and noisy far-field test sets, the proposed model with GAEF-GREL training strategies shows significantly better results than the conventional multi-style training approach.

Keywords: end-to-end speech recognition; data augmentation; monotonic chunkwise attention; attention-based encoder-decoder; acoustic simulator
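The two curriculum strategies described in the abstract above (GAEF and GREL) can be pictured with a small scheduling sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the linear ramp, the element-wise mixing of clean and enhanced features, and the names `ramp`, `gaef_mix`, and `grel_loss` are all hypothetical and not taken from the paper.

```python
def ramp(step, start, end):
    """Linear ramp: 0.0 up to `start`, 1.0 from `end` on, linear in between."""
    if step <= start:
        return 0.0
    if step >= end:
        return 1.0
    return (step - start) / (end - start)


def gaef_mix(clean, enhanced, step, start, end):
    """GAEF-style sketch: begin with clean features, then gradually raise
    the share taken from the enhancement-layer output.
    Assumption: a simple element-wise linear interpolation."""
    share = ramp(step, start, end)
    return [(1.0 - share) * c + share * e for c, e in zip(clean, enhanced)]


def grel_loss(asr_loss, mse_loss, step, start, end, initial_weight=1.0):
    """GREL-style sketch: the weight on the enhancement MSE loss decays
    from `initial_weight` to 0.0 as training proceeds."""
    mse_weight = initial_weight * (1.0 - ramp(step, start, end))
    return asr_loss + mse_weight * mse_loss


# Halfway through a ramp from step 100 to step 200.
mixed = gaef_mix([1.0, 0.0], [3.0, 4.0], step=150, start=100, end=200)
total = grel_loss(asr_loss=2.0, mse_loss=4.0, step=150, start=100, end=200)
```

At the midpoint of the ramp the features are an equal blend of clean and enhanced values, and the MSE term contributes at half its initial weight.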

Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture 22

Interspeech Conference

Authors: Moriya, Takafumi; Tanaka, Tomohiro; Ashihara, Takanori; Ochiai, Tsubasa; Sato, Hiroshi; Ando, Atsushi; Masumura, Ryo; Delcroix, Marc; Asami, Taichi. Affiliation: NTT Corp Chiyoda City Tokyo Japan

ISBN: (print) 9781713836902

We present a novel architecture with its decoding approach for improving recurrent neural network-transducer (RNN-T) performance. RNN-T is promising for building time-synchronous automatic speech recognition (ASR) systems and thus enhancing streaming ASR applications. We note that encoder-decoder-based sequence-to-sequence models (S2S) have also been used successfully by the ASR community. In this paper, we integrate these popular models in the RNN-T+S2S approach; higher recognition performance than either is achieved due to their integration. However, it is generally deemed to be complicated to use S2S in streaming systems, because the attention mechanism can use arbitrarily long past and future contexts during decoding. Our RNN-T+S2S is composed of the shared encoder, an RNN-T decoder and a triggered attention-based decoder which uses time-restricted encoder outputs for attention weight computation. By using the trigger points generated from RNN-T outputs, the S2S branch of RNN-T+S2S activates only when the triggers are detected, which makes streaming ASR practical. Experiments on public and private datasets created to research various tasks demonstrate that our proposal can yield superior recognition performance.

Keywords: speech recognition; end-to-end; recurrent neural network-transducer; attention-based encoder-decoder

INTERNAL LANGUAGE MODEL ESTIMATION FOR DOMAIN-ADAPTIVE END-TO-END SPEECH RECOGNITION

IEEE Spoken Language Technology Workshop (SLT)

Authors: Meng, Zhong; Parthasarathy, Sarangarajan; Sun, Eric; Gaur, Yashesh; Kanda, Naoyuki; Lu, Liang; Chen, Xie; Zhao, Rui; Li, Jinyu; Gong, Yifan. Affiliation: Microsoft Corp Redmond WA 98052 USA

ISBN: (print) 9781728170664

External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of an E2E model when eliminating its acoustic components. ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR. Experimented with 30K-hour trained RNN-T and AED models, ILME achieves up to 15.5% and 6.8% relative word error rate reductions from Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.

Keywords: Speech recognition; language model; recurrent neural network transducer; attention-based encoder-decoder
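The score combination described in the abstract above (internal LM score subtracted from the log-linear interpolation of E2E and external LM scores) can be sketched for hypothesis rescoring as follows. The interpolation weights, the function names, and the toy log-probabilities are assumptions chosen for illustration; none of them come from the paper.

```python
def shallow_fusion_score(e2e_logp, ext_lm_logp, ext_weight):
    """Log-linear interpolation of the E2E score and the external LM score."""
    return e2e_logp + ext_weight * ext_lm_logp


def ilme_score(e2e_logp, ext_lm_logp, int_lm_logp, ext_weight, int_weight):
    """Shallow Fusion score with the estimated internal LM score subtracted."""
    return (
        shallow_fusion_score(e2e_logp, ext_lm_logp, ext_weight)
        - int_weight * int_lm_logp
    )


# Toy hypotheses with made-up log-probabilities (illustrative values only).
# "A" is favoured by the source-domain internal LM, "B" by the target-domain LM.
hypotheses = {
    "A": {"e2e": -4.0, "ext": -5.0, "int": -2.0},
    "B": {"e2e": -5.0, "ext": -4.0, "int": -5.0},
}

best_shallow_fusion = max(
    hypotheses,
    key=lambda h: shallow_fusion_score(
        hypotheses[h]["e2e"], hypotheses[h]["ext"], 0.5
    ),
)
best_ilme = max(
    hypotheses,
    key=lambda h: ilme_score(
        hypotheses[h]["e2e"], hypotheses[h]["ext"], hypotheses[h]["int"], 0.5, 0.5
    ),
)
```

With these toy numbers, Shallow Fusion alone selects hypothesis "A", while subtracting the internal LM score removes the source-domain bias and selects "B".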

END-TO-END SILENT SPEECH RECOGNITION WITH ACOUSTIC SENSING

IEEE Spoken Language Technology Workshop (SLT)

Authors: Luo, Jian; Wang, Jianzong; Cheng, Ning; Jiang, Guilin; Xiao, Jing. Affiliation: Ping An Technol Shenzhen Co Ltd Shenzhen Peoples R China

ISBN: (print) 9781728170664

Silent speech interfaces (SSI) have been an exciting area of recent interest. In this paper, we present a non-invasive silent speech interface that uses inaudible acoustic signals to capture people's lip movements when they speak. We exploit the speaker and microphone of the smartphone to emit signals and listen to their reflections, respectively. The extracted phase features of these reflections are fed into deep learning networks to recognize speech. We also propose an end-to-end recognition framework, which combines a CNN and an attention-based encoder-decoder network. Evaluation results on a limited vocabulary (54 sentences) yield word error rates of 8.4% in speaker-independent and environment-independent settings, and 8.1% for unseen sentence testing.

Keywords: silent speech interfaces; inaudible acoustic signals; attention-based encoder-decoder

TREE-CONSTRAINED POINTER GENERATOR FOR END-TO-END CONTEXTUAL SPEECH RECOGNITION

IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Authors: Sun, Guangzhi; Zhang, Chao; Woodland, Philip C. Affiliation: Univ Cambridge Engn Dept Trumpington St Cambridge CB2 1PZ England

ISBN: (print) 9781665437394

Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut between the tree and the final ASR output distribution to facilitate recognising biasing words during decoding. Systems were trained and evaluated on the Librispeech corpus where biasing words were extracted at the scales of an utterance, a chapter, or a book to simulate different application scenarios. Experimental results showed that TCPGen consistently improved word error rates (WERs) compared to the baselines, and in particular, achieved significant WER reductions on the biasing words. TCPGen is highly efficient: it can handle 5,000 biasing words and distractors and only add a small overhead to memory use and computation cost.

Keywords: pointer generator; contextual speech recognition; attention-based encoder-decoder; transducer; end-to-end
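The prefix tree of biasing words mentioned in the abstract above can be illustrated with a minimal sketch. It only shows the symbolic side (which symbols may continue a partially decoded biasing word), not the neural pointer generator itself. The character-level units, the `<end>` marker, and the function names are assumptions made here for illustration; the paper may use different subword units.

```python
END = "<end>"  # hypothetical marker for "a biasing word ends here"


def build_prefix_tree(words):
    """Build a nested-dict prefix tree (trie) over the characters of each word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def valid_continuations(tree, prefix):
    """Return the set of symbols that can follow `prefix` inside the tree.

    An empty set means the prefix does not match any biasing word.
    """
    node = tree
    for ch in prefix:
        if ch not in node:
            return set()
        node = node[ch]
    return set(node)


# Toy biasing list (illustrative words only).
tree = build_prefix_tree(["turner", "turn", "tune"])
after_tu = valid_continuations(tree, "tu")
after_turn = valid_continuations(tree, "turn")
```

Here `after_tu` contains `r` and `n`, and `after_turn` contains `e` together with the end marker, since "turn" is both a complete biasing word and a prefix of "turner".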
