Search Results - Inner Mongolia University Library

Dynamic Sampling-Based Meta-Learning Using Multilingual Acoustic Data for Under-Resourced Speech Recognition

IEEE Access, 2024, vol. 12, pp. 106070-106083

Authors: Hsieh, I-Ting; Wu, Chung-Hsien; Zhao, Zhe-Hong. Affiliations: National Cheng Kung University, Graduate Program of Multimedia Systems and Intelligent Computing, Tainan 70101, Taiwan; National Cheng Kung University, Department of Computer Science and Information Engineering, Tainan 70101, Taiwan

Under-resourced automatic speech recognition (ASR) has become an active field of research and has experienced significant progress during the past decade. However, the performance of under-resourced ASR trained by existing methods is still far inferior to that of high-resourced ASR for practical applications. In this paper, speech data from languages that share the most phonemes with the under-resourced language are selected as supplementary resources for meta-training based on the Model-Agnostic Meta-Learning (MAML) strategy. Besides supplementary language selection, this paper proposes a dynamic sampling method, in place of the original random sampling method, to select support and query sets for each task in MAML and thereby improve meta-training performance. In this study, Taiwanese is selected as the under-resourced language, and speech corpora from five languages (Mandarin, English, Japanese, Cantonese, and Thai) are chosen as supplementary training data for acoustic model training. The proposed dynamic sampling approach uses phonemes, pronunciation, and speech recognition models as the basis for determining the proportion of each supplementary language, so that helpful utterances are selected for MAML. For evaluation, with the selected utterances from each supplementary language used for meta-training, we obtained a Word Error Rate of 20.24% and a Syllable Error Rate of 8.35% for Taiwanese ASR, which were better than those of the baseline model trained only on the Taiwanese corpus (26.18% and 13.99%) and of other methods. © 2013 IEEE.
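
The supplementary-language selection described in this abstract (preferring languages that share the most phonemes with the under-resourced language) can be sketched roughly as follows. This is only an illustration and not the authors' implementation: the function name `rank_by_shared_phonemes`, the language labels, and the toy phoneme inventories are all assumptions made for the example.

```python
def rank_by_shared_phonemes(target, candidates):
    """Rank candidate languages by how many phonemes they share with the target."""
    target_set = set(target)
    overlap = {
        lang: len(target_set & set(phones))
        for lang, phones in candidates.items()
    }
    # Sort by descending overlap; ties are broken alphabetically for determinism.
    return sorted(overlap.items(), key=lambda item: (-item[1], item[0]))


# Toy, hypothetical inventories (NOT real phoneme sets of any language).
target_phonemes = ["a", "i", "u", "p", "t", "k", "m", "n", "ng"]
candidate_phonemes = {
    "lang_A": ["a", "i", "u", "p", "t", "k", "m"],
    "lang_B": ["a", "i", "s", "z", "f"],
    "lang_C": ["a", "i", "u", "n", "ng", "r"],
}

ranking = rank_by_shared_phonemes(target_phonemes, candidate_phonemes)
print(ranking)
```

A raw shared-phoneme count is the simplest possible criterion; the abstract indicates that the actual sampling proportions also depend on pronunciation and on speech recognition models, which this sketch does not attempt to reproduce.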

Keywords: Speech recognition

Decomposition and Reorganization of Phonetic Information for Speaker Embedding Learning

IEEE/ACM Transactions on Audio Speech and Language Processing, 2023, vol. 31, pp. 1745-1757

Authors: Hong, Qian-Bei; Wu, Chung-Hsien; Wang, Hsin-Min. Affiliations: National Cheng Kung University and Academia Sinica, Graduate Program of Multimedia Systems and Intelligent Computing, Tainan 701, Taiwan; Academia Sinica, Taipei 115, Taiwan; National Cheng Kung University, Department of Computer Science and Information Engineering, Tainan 701401, Taiwan; Academia Sinica, Institute of Information Science, Taipei 115, Taiwan

Speech content is closely related to the stability of speaker embeddings in speaker verification tasks. In this paper, we propose a novel architecture based on self-constraint learning (SCL) and reconstruction task (RT) to remove the influence of phonetic information on speaker embedding generation. First, SCL is used to reduce the divergence of frame-level features, which can avoid ambiguity between the resulting embeddings of the two utterances being compared. Second, RT is used to further remove phonetic information in frame-level layers, focusing on speaker-discriminative feature transformation. In our experiments, the speaker embedding models were trained on the VoxCeleb2 dataset and evaluated on the VoxCeleb1, Librispeech, SITW and VoxMovies datasets. Experimental results on VoxCeleb1 show that the proposed DROP-TDNN system reduced the EER by 7.5%, compared to the state-of-the-art ECAPA-TDNN system. Furthermore, the proposed DROP-TDNN system also outperformed the ECAPA-TDNN system in the experiments on SITW, Librispeech and VoxMovies under cross-dataset conditions. In the experiments on SITW, the proposed system reduced the EER by 3.4% compared to the ECAPA-TDNN system. In the experiments on Librispeech, the proposed system demonstrated the advantage of removing phonetic information under the clean speech condition, with a significant reduction of 25.5% in EER compared to the ECAPA-TDNN system. In the experiments on VoxMovies, the proposed system reduced the EER by up to 7.9% compared to the ECAPA-TDNN system under different pronunciation and background conditions. © 2014 IEEE.

Keywords: Speech recognition

Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music

IEEE/ACM Transactions on Audio Speech and Language Processing, 2023, vol. 31, pp. 2824-2835

Authors: Chiu, Ching-Yu; Muller, Meinard; Davies, Matthew E. P.; Su, Alvin Wen-Yu; Yang, Yi-Hsuan. Affiliations: National Cheng Kung University, Academia Sinica, Graduate Program of Multimedia Systems and Intelligent Computing, Tainan 701401, Taiwan; International Audio Laboratories Erlangen, Erlangen 91058, Germany; University of Coimbra, Department of Informatics Engineering, Centre for Informatics and Systems, Coimbra 3004-531, Portugal; National Cheng Kung University, Department of Computer Science and Information Engineering, Tainan 701401, Taiwan; Taiwan AI Labs, Yating Music Team, Taipei 103622, Taiwan; Academia Sinica, Research Center for IT Innovation, Taipei 11529, Taiwan

To model the periodicity of beats, state-of-the-art beat tracking systems use 'post-processing trackers' (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin's Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called 'predominant local pulses' (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838). © 2014 IEEE.
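
The trade-off mentioned in this abstract (higher recall at the cost of lower precision, with a net gain in F1-score) follows from F1 being the harmonic mean of precision and recall. The sketch below uses made-up precision and recall values purely for illustration; they are not figures from the paper, and `f1_score` is a helper name chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values (assumptions for illustration, not results from the paper):
# a tracker with high precision but low recall ...
f1_before = f1_score(precision=0.80, recall=0.50)
# ... versus one that gives up some precision for a larger recall gain.
f1_after = f1_score(precision=0.75, recall=0.70)

print(round(f1_before, 3), round(f1_after, 3))
```

Because the harmonic mean is dominated by the smaller of its two inputs, raising a low recall tends to help F1 more than a comparable drop in an already high precision hurts it.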

Keywords: Hidden Markov models

Data Selection Based on Phoneme Affinity Matrix for Electrolarynx Speech Recognition

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)

Authors: I-Ting Hsieh; Chung-Hsien Wu; Shu-Wei Tsa. Affiliations: Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University, Tainan, Taiwan; Dept. of Otolaryngology, National Cheng Kung University Hospital, Taiwan

An electrolarynx (EL) is a communication aid that allows patients to produce intelligible speech after laryngectomy. Since EL speech exhibits low intelligibility and the device produces loud noise, understanding the content of the speech remains challenging for listeners, even if the patient is proficient in using the EL device. Accordingly, it is important to develop tools that offer additional communication methods. Automatic speech recognition (ASR) of EL speech emerges as a method worth considering in this regard. However, the scarcity of EL speech data dramatically degrades recognition performance. Data augmentation is one viable solution to the issue of under-resourced speech data. However, even with an enlarged training corpus of healthy speech, the improvement in EL speech recognition may not be satisfactory, because the characteristics of EL speech still differ significantly from those of healthy speech. This paper proposes a data selection method that uses a phoneme affinity matrix to prioritize the selection of healthy speech that closely resembles EL speech for data augmentation. The affinity between two phonemes is defined as the similarity of the Phone Posteriorgrams (PPGs) of the two phonemes, considering the phoneme models. The experimental results demonstrate that data selection based on the phoneme affinity matrix yields superior results compared to both the baseline and the method that selects the augmented healthy speech corpus by random sampling.
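
One way to picture a phoneme affinity matrix of the kind this abstract describes is as a table of pairwise similarities between per-phoneme posteriorgram vectors. The sketch below assumes cosine similarity over averaged PPG vectors; the abstract does not state which similarity measure is used, so that choice, the helper names, and the toy vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_matrix(mean_ppgs):
    """Build a nested dict: affinity[p][q] = similarity of the mean PPGs of p and q."""
    phones = sorted(mean_ppgs)
    return {
        p: {q: cosine_similarity(mean_ppgs[p], mean_ppgs[q]) for q in phones}
        for p in phones
    }


# Toy, hypothetical mean posteriorgram vectors over three phoneme classes.
mean_ppgs = {
    "a": [0.80, 0.10, 0.10],
    "e": [0.60, 0.30, 0.10],
    "s": [0.05, 0.05, 0.90],
}

affinity = affinity_matrix(mean_ppgs)
print(affinity["a"]["e"] > affinity["a"]["s"])
```

In a selection pipeline, utterances of healthy speech whose phonemes have high affinity to their EL counterparts would presumably be ranked first, though the exact scoring rule is not given in the abstract.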

Improvement of Spatial Ambiguity in Multi-Channel Speech Separation Using Channel Attention

2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021

Authors: Hong, Qian-Bei; Wu, Chung-Hsien; Nguyen, Thanh Binh; Wang, Hsin-Min. Affiliations: National Cheng Kung University and Academia Sinica, Graduate Program of Multimedia Systems and Intelligent Computing, Tainan, Taiwan; National Cheng Kung University, Department of Computer Science and Information Engineering, Tainan, Taiwan

ISBN: (print) 9789881476890

Multi-channel speech separation has been successfully applied in complex real-world environments such as far-field conditions. The common solution for far-field conditions is to use a multi-channel signal captured by a structured microphone array and to leverage the differences between channels to enhance speech separation performance. The spatial feature has been widely used in recent speech separation research. However, this feature appears to be insufficient when the location information becomes ambiguous, which is known as the spatial ambiguity problem. To deal with the spatial ambiguity problem, this study proposes an attention mechanism for the Temporal-Spatial Neural Filter (TSNF), in which channel attention is applied to the merged features and to the feature maps of the 1D convolution blocks in the temporal convolution network. The proposed method is evaluated on a multi-channel reverberant dataset built from the WSJ0-2mix dataset. The dataset is simulated in a real-environment room using the Room Impulse Response generator. In the experimental results, the proposed methods produced an SI-SNR improvement of about 1.2 dB in the case of closely located speakers, with a small decrease of 0.1 dB in other cases. © 2021 APSIPA.
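
The SI-SNR figures reported in this abstract refer to scale-invariant signal-to-noise ratio. A minimal sketch of the commonly used definition follows: both signals are made zero-mean, the estimate is projected onto the reference, and the energy ratio of that projection to the residual is expressed in dB. The function name `si_snr`, the `eps` stabilizer, and the toy signals are assumptions for illustration, not code from the paper.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean from both signals.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything that is not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise) + eps
    return 10 * math.log10(target_energy / noise_energy + eps)


# Toy signals (hypothetical): a short periodic reference plus a small perturbation.
reference = [0.0, 1.0, 0.0, -1.0] * 4
perturbation = [0.1, -0.1, 0.05, -0.05] * 4
estimate = [r + p for r, p in zip(reference, perturbation)]

score = si_snr(estimate, reference)
rescaled_score = si_snr([2.0 * x for x in estimate], reference)
print(round(score, 2), round(rescaled_score, 2))
```

An "SI-SNR improvement" is then usually the SI-SNR of the separated output minus the SI-SNR of the unprocessed mixture, both measured against the same reference.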

Keywords: Convolution

HAEE: Question Classification Using Hierarchical Intra-Attention Enhancement Encoder

26th International Conference on Technologies and Applications of Artificial Intelligence, TAAI 2021

Authors: Wang, Jen-Wei; Chen, Kai-Hsiang; Huang, Jen-Wei. Affiliations: National Cheng Kung University, Department of Electrical Engineering, Tainan, Taiwan; National Cheng Kung University and Academia Sinica, Graduate Program for Multimedia Systems and Intelligent Computing, Taiwan

ISBN: (print) 9781665408257

With the development of e-commerce, automated question-answering systems play a crucial part in customer service. Question classification, which assigns labels to questions according to their answer types, is one of the tasks in question answering. Previous methods usually relied on handcrafted features such as named entity recognition, which require predefined dictionaries or tools. Machine learning approaches have recently been applied to this task and achieve high accuracy. In this paper, we propose HAEE, a Hierarchical intra-Attention Enhancement Encoder composed of bidirectional GRUs and intra-attentions. In addition, we adopt character input to address the OOV (Out-Of-Vocabulary) problem and create multiple intra-attentions to model the relationships between characters (Chinese) or words (English), enhancing the influence of tokens on the sentence. We evaluate the HAEE model in an actual corporate setting and on several datasets. As shown in the experimental results, our HAEE model outperforms the existing state-of-the-art models on question classification tasks, especially for the Chinese corpus. © 2021 IEEE.

Keywords: Signal encoding

Learning Adaptation and Generalization from Human-Inspired Meta-Reinforcement Learning Using Bayesian Knowledge and Analysis

6th IEEE International Conference on Artificial Intelligence and Knowledge Engineering, AIKE 2023

Authors: Ho, Joshua; Wang, Chien-Min; King, Chung-Ta; You, Yi-Hsin; Feng, Chi-Wei; Chen, Yen-Min; Kuo, Bo-Yi. Affiliations: Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan; Social Networks and Human-Centered Computing Program, Taiwan International Graduate Program, Taiwan; Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu 30013, Taiwan; Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan; Department of Computer Science, National Taiwan University, Taipei 10631, Taiwan; Center of Intelligent Healthcare, National Taiwan University Hospital, Taipei 10022, Taiwan

ISBN: (print) 9798350331288

Over the last decades, there has been growing interest in research in multiple and interdisciplinary fields of human-AI computing. In particular, approaches integrating the intersecting design with reinforcement learning (RL) have received more attention. However, current research on RL may need to further consider its enhancement from a human-inspired approach. In the present work, we focus on enabling a meta-reinforcement learning (meta-RL) agent to achieve adaptation and generalization by modeling Markov decision processes using Bayesian knowledge and analysis. By introducing a novel framework called human-inspired meta-RL (HMRL), we incorporate the agent performing resilient actions to leverage the dynamic dense reward based on the knowledge and prediction of a Bayesian analysis. The proposed framework can make the agent learn generalization and prevent the agent from failing catastrophically. The experimental results show that our approach helps the agent reduce computational costs with learning adaptation. Finally, we conclude and anticipate that integrating human-inspired meta-RL can enable learning more formulations relating to robustness and scalability, leading to promising directions and more complex AI goals in the future. © 2023 IEEE.

Keywords: Reinforcement learning

HAEE: Question Classification Using Hierarchical Intra-Attention Enhancement Encoder

International Conference on Technologies and Applications of Artificial Intelligence (TAAI)

Authors: Jen-Wei Wang; Kai-Hsiang Chen; Jen-Wei Huang. Affiliations: National Cheng Kung University, Tainan, Taiwan; Graduate Program for Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Taiwan

ISBN: (print) 9781665408264

Keywords: Dictionaries; Customer services; Machine learning; Predictive models; Electronic commerce; Task analysis

Improvement of Spatial Ambiguity in Multi-Channel Speech Separation Using Channel Attention

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)

Authors: Qian-Bei Hong; Chung-Hsien Wu; Thanh Binh Nguyen; Hsin-Min Wang. Affiliations: Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Tainan, Taiwan; National Cheng Kung University, Tainan, Taiwan

ISBN: (print) 9781665441629

Keywords: Convolution; Information processing; Speech enhancement; Information filters; Microphone arrays; Generators; Reliability

Speaker-Specific Articulatory Feature Extraction Based on Knowledge Distillation for Speaker Recognition

APSIPA Transactions on Signal and Information Processing, 2023, vol. 12, no. 2

Authors: Qian-Bei Hong; Hsin-Min Wang; Chung-Hsien Wu. Affiliations: Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Taiwan; Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, and Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan

This paper proposes a novel speaker-specific articulatory feature (AF) extraction model based on knowledge distillation (KD) for speaker recognition. First, an AF extractor is trained as a teacher model for extracting the AF profiles of the input speaker dataset. Next, a KD-based speaker embedding extraction method is proposed to distill the speaker-specific information from the AF profiles in the teacher model to a student model based on multi-task learning, in which the lower layers not only capture the speaker characteristics from acoustic features, but also learn the speaker-specific features from the AF profiles for robust speaker representation. Finally, speaker embeddings are extracted from the high-level layer, and the obtained speaker embeddings are further used to train a probabilistic linear discriminant analysis (PLDA) model for speaker recognition. In the experiments, speaker embedding models were trained using the VoxCeleb2 dataset and the AF extractor was trained based on the LibriSpeech dataset, and the performance was evaluated using the VoxCeleb1 dataset. The experiments showed that the proposed KD-based models outperformed the baseline models without KD. Furthermore, feature concatenation of multimodal results can further improve the performance.

Keywords: Speaker recognition; articulatory feature; knowledge distillation
