Under-resourced automatic speech recognition (ASR) has become an active field of research and has experienced significant progress during the past decade. However, the performance of under-resourced ASR trained by exi...
Speech content is closely related to the stability of speaker embeddings in speaker verification tasks. In this paper, we propose a novel architecture based on self-constraint learning (SCL) and reconstruction task (R...
To model the periodicity of beats, state-of-the-art beat tracking systems use 'post-processing trackers' (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work w...
An electrolarynx (EL) is a communication aid that allows patients to produce speech after laryngectomy. Because EL speech has low intelligibility and is accompanied by loud device noise, listeners struggle to understand it even when the patient is proficient with the device. It is therefore important to develop tools that offer additional communication methods, and automatic speech recognition (ASR) of EL speech is one worth considering. However, the scarcity of EL speech data dramatically degrades recognition performance. Data augmentation is a viable way to address under-resourced speech data, but even with an enlarged healthy-speech training corpus the improvement in EL speech recognition may not be satisfactory, because the characteristics of EL speech still differ significantly from those of healthy speech. This paper proposes a data selection method that uses a phoneme affinity matrix to prioritize healthy speech that closely resembles EL speech for data augmentation. The affinity between two phonemes is defined as the similarity of the Phone Posteriorgrams (PPGs) of the two phonemes, considering the phoneme models. The experimental results demonstrate that data selection based on the phoneme affinity matrix yields better results than both the baseline and random sampling of the augmented healthy speech corpus.
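As a rough illustration of the idea, the sketch below builds an affinity matrix from frame-level PPGs. It assumes that the "similarity" is cosine similarity between per-phoneme mean posterior vectors, and that candidate healthy-speech sets are ranked by Frobenius distance to the EL matrix. Both choices, and the function names `phoneme_affinity` and `rank_by_affinity`, are hypothetical; the abstract does not give the exact formulation.

```python
import numpy as np

def phoneme_affinity(ppg, labels, n_phones):
    """Cosine similarity between per-phoneme mean PPG vectors (assumed measure).

    ppg:    (frames, n_phones) array, one posterior vector per frame
    labels: (frames,) array with the aligned phoneme index of each frame
    """
    means = np.zeros((n_phones, ppg.shape[1]))
    for p in range(n_phones):
        frames = ppg[labels == p]
        if len(frames):                      # phonemes never observed stay all-zero
            means[p] = frames.mean(axis=0)
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    unit = means / np.maximum(norms, 1e-12)  # avoid division by zero
    return unit @ unit.T                     # (n_phones, n_phones), symmetric

def rank_by_affinity(el_affinity, candidate_affinities):
    """Order candidates from most to least similar to the EL affinity matrix.

    Frobenius distance is an illustrative choice, not the paper's criterion.
    """
    dists = [np.linalg.norm(el_affinity - a) for a in candidate_affinities]
    return np.argsort(dists).tolist()
```

Under these assumptions, healthy speakers or utterance sets whose matrices lie closest to the EL matrix would be selected first for augmentation.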
Over the last decades, there has been growing interest in research in multiple and interdisciplinary fields of human-AI computing. In particular, approaches integrating the intersecting design with reinforcement learn...
ISBN: 9781665408264 (print)
With the development of e-commerce, automated question-answering systems play a crucial part in customer service. Question classification, which assigns labels to questions according to their answer types, is one of the tasks in question answering. Previous methods usually relied on handcrafted features such as named entity recognition, which require predefined dictionaries or tools. Machine learning approaches have recently been applied to this task and achieve high accuracy. In this paper, we propose HAEE, a Hierarchical intra-Attention Enhancement Encoder composed of bidirectional GRUs and intra-attentions. In addition, we adopt character input to address the OOV (out-of-vocabulary) problem and create multiple intra-attentions to model the relationships between characters (Chinese) or words (English), enhancing the influence of tokens on the sentence. We evaluate the HAEE model in an actual corporate setting and on several datasets. The experimental results show that HAEE outperforms existing state-of-the-art models on question classification tasks, especially on the Chinese corpus.
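To make "multiple intra-attentions" over encoder states more concrete, here is a minimal NumPy sketch of multi-head attention pooling over a sequence of hidden states (for example, bidirectional GRU outputs). The additive scoring form, the weight names `W1` and `W2`, and the function name `intra_attention` are assumptions for illustration; the abstract does not specify HAEE's exact equations.

```python
import numpy as np

def intra_attention(H, W1, W2):
    """Pool a sequence of hidden states with several attention heads.

    H:  (T, d) hidden states, one row per character or word
    W1: (a, d) projection into an attention space (hypothetical parameter)
    W2: (r, a) one scoring vector per attention head (hypothetical parameter)
    Returns the attention weights (r, T) and the pooled vectors (r, d).
    """
    scores = W2 @ np.tanh(W1 @ H.T)                     # (r, T)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over tokens
    return weights, weights @ H
```

Each row of the returned weights is a distribution over tokens, so different heads can emphasize different characters or words before the pooled vectors are passed to a classifier.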
ISBN: 9781665441629 (print)
Multi-channel speech separation has been successfully applied in complex real-world environments such as far-field conditions. The common approach to the far-field condition is to use a multi-channel signal captured by a structured microphone array and to leverage the differences between channels to enhance speech separation performance. Spatial features have been widely used in recent speech separation research, but they appear to be insufficient when the location information becomes ambiguous. This is known as the spatial ambiguity problem. To deal with it, this study proposes an attention mechanism for the Temporal-Spatial Neural Filter (TSNF), in which channel attention is applied to the merged features and to the feature map of the 1D convolution block in the temporal convolution network. The proposed method is evaluated on a multi-channel reverberant dataset built from the WSJ0-2mix dataset, with room conditions simulated using the Room Impulse Response generator. In the experiments, the proposed method achieved an SI-SNR improvement of about 1.2 dB when the speakers are close together, with a small decrease of 0.1 dB in the other cases.
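For readers unfamiliar with the reported metric, the following sketch computes the scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed mixture using the commonly used definition. The function names and the `eps` constant are illustrative choices, and the paper's evaluation code may differ in details.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    estimate = estimate - estimate.mean()    # remove DC offset
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_snr_improvement(separated, mixture, reference):
    """Gain in SI-SNR of the separated signal over the raw mixture, in dB."""
    return si_snr(separated, reference) - si_snr(mixture, reference)
```

Because the estimate is projected onto the reference, rescaling the separated signal leaves the score essentially unchanged, which is why the metric suits separation systems whose output gain is arbitrary.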
This paper proposes a novel speaker-specific articulatory feature (AF) extraction model based on knowledge distillation (KD) for speaker recognition. First, an AF extractor is trained as a teacher model for extracting the AF profiles of the input speaker dataset. Next, a KD-based speaker embedding extraction method is proposed to distill the speaker-specific information from the AF profiles in the teacher model to a student model based on multi-task learning, in which the lower layers not only capture the speaker characteristics from acoustic features, but also learn the speaker-specific features from the AF profiles for robust speaker representation. Finally, speaker embeddings are extracted from the high-level layer, and the obtained speaker embeddings are further used to train a probabilistic linear discriminant analysis (PLDA) model for speaker recognition. In the experiments, speaker embedding models were trained using the VoxCeleb2 dataset and the AF extractor was trained based on the LibriSpeech dataset, and the performance was evaluated using the VoxCeleb1 dataset. The experiments showed that the proposed KD-based models outperformed the baseline models without KD. Furthermore, feature concatenation of multimodal results can further improve the performance.
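The multi-task distillation objective described above can be sketched as a speaker-classification loss plus a term that pulls the student's AF predictions toward the teacher's AF profiles. The mean-squared-error distillation term, the `kd_weight` parameter, and the function names are assumptions made for illustration; the abstract does not state the actual loss.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels given (batch, classes) logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def multitask_kd_loss(speaker_logits, speaker_labels, student_af, teacher_af, kd_weight=0.5):
    """Speaker loss plus a weighted distillation term (assumed to be MSE).

    student_af / teacher_af: (batch, af_dim) AF predictions from the student's
    lower layers and from the frozen teacher AF extractor, respectively.
    """
    speaker_loss = softmax_cross_entropy(speaker_logits, speaker_labels)
    distill_loss = np.mean((student_af - teacher_af) ** 2)
    return speaker_loss + kd_weight * distill_loss
```

In this sketch the teacher outputs are treated as fixed targets, so only the student would receive gradients during training.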