Search Results - Inner Mongolia University Library

IEEE International Conference on Image Processing (ICIP)

Authors: Chang, Yen-Cheng; Chen, Yi-Chang; Chang, Yu-Chuan; Yeh, Yi-Ren. Affiliations: E SUN Financial Holding Co Ltd Taipei Taiwan; Natl Kaohsiung Normal Univ Dept Math Kaohsiung Taiwan

ISBN: (digital) 9781665496209

ISBN: (print) 9781665496209

Excellent text recognition results have been obtained by training recognition models with synthetic images. However, recognizing text from real-world images still faces challenges due to the domain shift between synthetic and real-world text images. One strategy to eliminate this domain difference without manual annotation is unsupervised domain adaptation (UDA). Due to the characteristics of sequential labeling tasks, most popular UDA methods cannot be directly applied to text recognition. To tackle this problem, we propose a UDA method that minimizes latent entropy on sequence-to-sequence attention-based models with class-balanced self-paced learning. Experimental results show that our proposed framework achieves better recognition results than existing methods on most UDA text recognition benchmarks. All code is publicly available.

Keywords: domain adaptation; sequence-to-sequence; entropy minimization; self-paced learning
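
The abstract above builds on entropy minimization. As a rough illustration only (this is not the authors' latent-entropy objective, and the function names and toy logits are assumptions made for this sketch), the quantity being reduced can be computed as the average Shannon entropy of the decoder's per-step output distributions:

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(step_logits):
    """Average the per-step entropy over all decoding steps of one sequence."""
    return sum(entropy(softmax(l)) for l in step_logits) / len(step_logits)


# Hypothetical decoder outputs: 3 decoding steps over a 4-character vocabulary.
confident = [[8.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0], [0.0, 0.0, 7.0, 0.0]]
uncertain = [[1.0, 1.0, 1.0, 1.0]] * 3

print(mean_sequence_entropy(confident))  # close to zero: peaked predictions
print(mean_sequence_entropy(uncertain))  # log(4): uniform predictions
```

Peaked predictions give low entropy and uniform predictions give the maximum, which is why an entropy term can act as a training signal on unlabeled target-domain images.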

Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes

15th IAPR International Workshop on Document Analysis Systems (DAS)

Authors: Wick, Christoph; Zollner, Jochen; Gruning, Tobias. Affiliations: Planet AI GmbH Warnowufer 60 D-18057 Rostock Germany; Univ Rostock Computat Intelligence Technol Lab Dept Math D-18051 Rostock Germany

ISBN: (print) 9783031065552; 9783031065545

In contrast to Connectionist Temporal Classification (CTC) approaches, sequence-to-sequence (S2S) models for Handwritten Text Recognition (HTR) suffer from errors such as skipped or repeated words, which often occur at the end of a sequence. In this paper, to combine the best of both approaches, we propose to use the CTC-Prefix-Score during S2S decoding. During beam search, paths that are invalid according to the CTC confidence matrix are thereby penalised. Our network architecture is composed of a Convolutional Neural Network (CNN) as visual backbone, bidirectional Long-Short-Term-Memory-Cells (LSTMs) as encoder, and a decoder which is a Transformer with inserted mutual attention layers. The CTC confidences are computed on the encoder while the Transformer is only used for character-wise S2S decoding. We evaluate this setup on three HTR data sets: IAM, Rimes, and StAZH. On IAM, we achieve a competitive Character Error Rate (CER) of 2.95% when pretraining our model on synthetic data and including a character-based language model for contemporary English. Compared to other state-of-the-art approaches, our model requires about 10-20 times fewer parameters. Our implementations are shared on GitHub.

Keywords: Text Line Recognition; Handwritten text recognition; Document analysis; sequence-to-sequence; CTC
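
The abstract above reports results as a Character Error Rate (CER). As a minimal sketch, assuming the common definition (character-level edit distance divided by the reference length; the paper's exact evaluation script is not shown here, and the function names are illustrative), CER can be computed like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using one rolling row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # one substitution in four characters: 0.25
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.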

An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion

Interspeech Conference

Authors: Yang, Zijiang; Jing, Xin; Triantafyllopoulos, Andreas; Song, Meishu; Aslan, Ilhan; Schuller, Bjoern W. Affiliations: Univ Augsburg Chair Embedded Intelligence Hlth Care & Wellbeing Augsburg Germany; Univ Tokyo Educ Physiol Lab Tokyo Japan; Huawei Technol Device Software Lab Munich Res Ctr Munich Germany; Imperial Coll London GLAM Grp Language Audio & Mus London England

Emotional voice conversion (EVC) focuses on converting a speech utterance from a source to a target emotion; it can thus be a key enabling technology for human-computer interaction applications and beyond. However, EVC remains an unsolved research problem with several challenges. In particular, as speech rate and rhythm are two key factors of emotional conversion, models have to generate output sequences of differing length. Sequence-to-sequence modelling has recently emerged as a competitive paradigm for models that can overcome those challenges. In an attempt to stimulate further research in this promising new direction, recent sequence-to-sequence EVC papers were systematically investigated and reviewed from six perspectives: their motivation, training strategies, model architectures, datasets, model inputs, and evaluation methods. This information is organised to provide the research community with an easily digestible overview of the current state-of-the-art. Finally, we discuss existing challenges of sequence-to-sequence EVC.

Keywords: affective computing; emotional text-to-speech; emotional voice conversion; sequence-to-sequence

UnitNet: A Sequence-to-Sequence Acoustic Model for Concatenative Speech Synthesis

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, vol. 29, pp. 2643-2655

Authors: Zhou, Xiao; Ling, Zhen-Hua; Dai, Li-Rong. Affiliations: Univ Sci & Technol China Natl Engn Lab Speech & Language Informat Proc Hefei 230027 Peoples R China

This paper presents UnitNet, a sequence-to-sequence (Seq2Seq) acoustic model for concatenative speech synthesis. Compared with the Tacotron2 model for Seq2Seq speech synthesis, UnitNet utilizes the phone boundaries of training data, and its decoder contains autoregressive structures at both phone and frame levels. This hierarchical architecture can not only extract embedding vectors for representing phone-sized units in the corpus but also measure the dependency among consecutive units, which makes the UnitNet model capable of guiding the selection of phone-sized units for concatenative speech synthesis. A byproduct of this model is that it can also be applied to statistical parametric speech synthesis (SPSS) and improve the robustness of Seq2Seq acoustic feature prediction, since it adopts interpretable transition probability prediction rather than an attention mechanism for frame-level alignment. Experimental results show that our UnitNet-based concatenative speech synthesis method not only outperforms the unit selection methods using hidden Markov models and Tacotron-based unit embeddings, but also achieves better naturalness and faster inference speed than the SPSS method using FastSpeech and Parallel WaveGAN. In addition, the UnitNet-based SPSS method makes fewer synthesis errors than Tacotron2 and FastSpeech without naturalness degradation.

Keywords: Hidden Markov models; Acoustics; Decoding; Predictive models; Speech synthesis; Linguistics; Computational modeling; speech synthesis; text-to-speech; unit selection; sequence-to-sequence; Tacotron

SSS-AE: Anomaly Detection Using Self-Attention Based Sequence-to-Sequence Auto-Encoder in SMD Assembly Machine Sound

IEEE ACCESS, 2021, vol. 9, pp. 131191-131202

Authors: Nam, Ki Hyun; Song, Young Jong; Yun, Il Dong. Affiliations: Hankuk Univ Foreign Studies Dept Comp Engn Yongin 17035 South Korea

A Surface-Mounted Device (SMD) assembly machine continuously assembles various products in the field. Unwanted situations such as assembly failure and device breakdown can occur at any time during the assembly process and result in costly losses. Anomaly detection techniques using deep learning are effective in detecting such abnormal situations. Two training scenarios, single-product learning and multi-product learning, can be considered for SMD anomaly detection workflows. Because previous studies covered only a few products, single-product learning was sufficient. However, multi-product learning is required when the number of products increases gradually. Successful multi-product learning on various assembly sound data in an industrial environment with limited resources requires efficient and lightweight learning methods. In this paper, we propose a robust model, the Self-Attention based Sequence-to-Sequence Auto-Encoder (SSS-AE), and an effective data preprocessing method, Temporal Adaptive Average Pooling (TAAP). For a more accurate evaluation than in previous SMD anomaly detection studies, a new large-scale SMD dataset containing observed real abnormal products was collected and evaluated. As a result, we show that SSS-AE and TAAP are powerful and practical approaches for both single-product learning and multi-product learning.

Keywords: Anomaly detection; Data models; Decoding; Training; Unsupervised learning; Adaptation models; Task analysis; Anomaly detection; auto-encoder; self-attention; sequence-to-sequence

Multi-Step Prediction of Wind Power Based on Hybrid Model with Improved Variational Mode Decomposition and Sequence-to-Sequence Network

PROCESSES, 2024, vol. 12, no. 1, p. 191

Authors: Bai, Wangwang; Jin, Mengxue; Li, Wanwei; Zhao, Juan; Feng, Bin; Xie, Tuo; Li, Siyao; Li, Hui. Affiliations: Econ & Tech Res Inst State Grid Gansu Power Co Lanzhou 730050 Peoples R China; State Grid Changzhi Power Supply Co Changzhi 046011 Peoples R China; Northwest Power Design Inst Co Ltd China Power Engn Consultant Grp Xian 710075 Peoples R China; Xian Univ Technol Sch Elect Engn Xian 710048 Peoples R China

Due to the complexity of wind power, traditional prediction models are incapable of fully extracting the hidden features of multidimensional, strongly fluctuating data, which results in poor multi-step prediction performance. To effectively predict continuous future power, an improved wind power multi-step prediction model combining variational mode decomposition (VMD) with a sequence-to-sequence (Seq2Seq) network is proposed. First, the wind power sequence is smoothed using VMD, and the decomposition parameters of VMD are tuned with the squirrel search algorithm (SSA) to improve the decomposition. Then, the subsequences obtained from decomposition, together with the original wind power data, are reconstructed into multivariate time series features. Finally, a Seq2Seq model is constructed, in which convolutional neural networks (CNNs) with bidirectional gated recurrent units (BiGRUs) learn the coupling and timing relationships of the input data and encode them, and a gated recurrent unit (GRU) decodes them to achieve continuous power prediction. Based on the actual operating data of a wind farm, a case analysis is conducted. Experimental results show that SSA-VMD can effectively optimize the decomposition effect, and the subsequences obtained with its decomposition are highly accurate when applied to predictions. The Seq2Seq model has better multi-step prediction results than traditional prediction methods, and as the prediction step size increases, the advantages are more obvious.

Keywords: convolutional neural network; multi-step prediction of wind power; sequence-to-sequence; squirrel search algorithm; variational mode decomposition
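
The abstract above frames forecasting as multi-step prediction with an encoder-decoder model. As a hedged sketch of the data preparation such a setup typically needs (the window lengths, the function name, and the toy series are assumptions for illustration; the paper's actual settings are not given here), a series can be sliced into input windows paired with multi-step targets:

```python
def make_windows(series, input_len, horizon):
    """Slice a series into (input window, multi-step target) pairs.

    Each input has `input_len` past values and each target holds the
    `horizon` values that immediately follow it.
    """
    if input_len <= 0 or horizon <= 0:
        raise ValueError("input_len and horizon must be positive")
    pairs = []
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        split = start + input_len
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs


# Hypothetical wind power readings (arbitrary units).
power = [0, 1, 2, 3, 4, 5]
for inputs, targets in make_windows(power, input_len=3, horizon=2):
    print(inputs, "->", targets)
```

With multivariate features (for example the decomposed subsequences stacked next to the raw series), each element of `series` would be a feature vector rather than a single number; the slicing logic stays the same.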

AN INVESTIGATION OF STREAMING NON-AUTOREGRESSIVE SEQUENCE-TO-SEQUENCE VOICE CONVERSION

47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Hayashi, Tomoki; Kobayashi, Kazuhiro; Toda, Tomoki. Affiliations: TARVO Inc Nagoya Aichi Japan; Nagoya Univ Nagoya Aichi Japan

ISBN: (print) 9781665405409

Recent advances in sequence-to-sequence (S2S) models have improved the quality of voice conversion (VC), but these models require the entire sequence to perform inference, which prevents their use in real-time applications. To address this issue, this paper extends the non-autoregressive (NAR) S2S-VC model to enable streaming VC. We introduce streamable architectures such as causal convolution and self-attention with causal masking for the FastSpeech2-based NAR-S2S-VC model. The streamable architecture also tries to convert durations, which are kept as is in conventional real-time VC methods. To further improve the performance of the streaming VC model, we utilize instant knowledge distillation with a dual-mode architecture, which performs non-causal and causal inference by sharing the network parameters. Through an experimental evaluation with a Japanese parallel corpus, we investigate the impact of the streamable architecture on performance. The experimental results reveal that the use of future context frames increases latency but improves the conversion quality, and that the difference in speaking rate affects the performance of streaming inference.

Keywords: Voice conversion; streaming; non-autoregressive; sequence-to-sequence
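
The abstract above relies on self-attention with causal masking to make the model streamable. As a minimal sketch of the general idea (plain Python on toy attention scores; the function names are illustrative and this is not the authors' architecture), a causal mask lets each frame attend only to itself and earlier frames:

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives exactly zero weight to masked positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, ok in zip(row, allowed) if ok)  # stability shift
        exps = [math.exp(s - peak) if ok else 0.0
                for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical raw attention scores for a 3-frame sequence.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
for row in masked_softmax(scores, causal_mask(3)):
    print(row)
```

Since no weight ever falls on future frames, the output for a frame can be emitted as soon as that frame arrives, which is the property streaming inference needs.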

A Realistic Drum Accompaniment Generator Using Sequence-to-Sequence Model and MIDI Music Database

30th IEEE Signal Processing and Communications Applications Conference (SIU)

Authors: Akyuz, Yavuz Batuhan; Gumustekin, Sevket. Affiliations: Izmir Yuksek Teknol Enstitusu Elekt Elekt Muhendisligi TR-35430 Urla Izmir Turkiye

ISBN: (print) 9781665450928

In this work, artificial intelligence is used to reinterpret and/or add drum parts for musical pieces supplied in Musical Instrument Digital Interface (MIDI) format. To achieve this, a sequence-to-sequence learning method and an Encoder-Decoder Long Short-Term Memory (LSTM) artificial neural network model were used. To improve the training of this neural network, the teacher forcing method was used. In the generation of new drum parts, the quality and originality of the samples were improved by using temperature sampling. Our proposed method produces high-quality drum accompaniments with adjustable complexity.

Keywords: MIDI; sequence-to-sequence; encoder and decoder; long-short term memory; teacher forcing; temperature sampling; autonomous music accompany
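
The abstract above uses temperature sampling when generating new drum parts. As a minimal sketch of the standard technique (the logits, function names, and seed are assumptions for illustration and are not taken from the paper), the output scores are divided by a temperature before the softmax, and the next event is then drawn from the resulting distribution:

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one class index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate drum events.
logits = [2.0, 1.0, 0.0]
rng = random.Random(0)  # seeded so that runs are repeatable
print(temperature_probs(logits, 0.5))  # sharper: favours the top event
print(temperature_probs(logits, 2.0))  # flatter: more varied choices
print(sample_index(logits, 1.0, rng))
```

Low temperatures push the output toward the most likely pattern, while higher temperatures trade predictability for variety, which matches the adjustable complexity the abstract mentions.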

A Hierarchical Sequence-to-Sequence Model for Korean POS Tagging

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, vol. 20, no. 2, pp. 1–13

Authors: Jin, Guozhe; Yu, Zhezhou. Affiliations: Jilin Univ Coll Comp Sci & Technol Qianjin St 2699 Changchun Jilin Peoples R China

Part-of-speech (POS) tagging is a fundamental task in natural language processing. Korean POS tagging consists of two subtasks: morphological analysis and POS tagging. In recent years, scholars have tended to use the seq2seq model to solve this problem. The full context of a sentence is considered in these seq2seq-based Korean POS tagging methods. However, Korean morphological analysis relies more on local contextual information, and in many cases there is a one-to-one match between a morpheme's surface form and its base form. To make better use of these characteristics, we propose a hierarchical seq2seq model. In our model, the low-level Bi-LSTM encodes the syllable sequence, whereas the high-level Bi-LSTM models the context information of the whole sentence, and the decoder generates the morpheme base form syllables as well as the POS tags. To improve the accuracy of morpheme base form recovery, we introduce a convolution layer and an attention mechanism into our model. The experimental results on the Sejong corpus show that our model outperforms strong baseline systems in both morpheme-level F1-score and eojeol-level accuracy, achieving state-of-the-art performance.

Keywords: Korean POS tagging; sequence-to-sequence; hierarchical; convolution

Pretraining Techniques for Sequence-to-Sequence Voice Conversion

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, vol. 29, pp. 745-755

Authors: Huang, Wen-Chin; Hayashi, Tomoki; Wu, Yi-Chiao; Kameoka, Hirokazu; Toda, Tomoki. Affiliations: Nagoya Univ Grad Sch Informat Nagoya Aichi 4648601 Japan; Nagoya Univ Human Dataware Lab Co Ltd Nagoya Aichi 4648601 Japan; Nagoya Univ Grad Sch Informat Sci Nagoya Aichi 4648601 Japan; NTT Corp NTT Commun Sci Labs Atsugi Kanagawa 2430198 Japan; Nagoya Univ Informat Technol Ctr Nagoya Aichi 4648601 Japan

Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, making them far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech. In this work, we examine our proposed method in a parallel, one-to-one setting. We employed recurrent neural network (RNN)-based and Transformer-based models, and through systematic experiments, we demonstrate the effectiveness of the pretraining scheme and the superiority of Transformer-based models over RNN-based models in terms of intelligibility, naturalness, and similarity.

Keywords: Task analysis; Speech processing; Decoding; Training; Data models; Training data; Spectrogram; Voice conversion; sequence-to-sequence; pretraining; transformer
