This paper introduces the theory of factor analysis of the mixture of Auto-Associative Neural Networks (AANNs) with application in speaker verification. First, we formulate the problem of learning a low-dimensional su...
We present a new approach of using Auto-Associative Neural Networks (AANNs) in the conventional GMM speaker verification framework with i-vector feature extraction and PLDA modeling. In this technique, an i-vector fea...
We introduce a new approach to training multilayer perceptrons (MLPs) for large vocabulary continuous speech recognition (LVCSR) in new languages that have only a few hours of annotated in-domain training data (for example, 1 hour of data). In our approach, large amounts of annotated out-of-domain data from multiple languages are used to train multilingual MLP systems without dealing with the different phoneme sets for these languages. Features extracted from these MLP systems are used to train LVCSR systems in the low-resource language, similar to the Tandem approach. In our experiments, the proposed features provide a relative improvement of about 30% in a low-resource LVCSR setting with only one hour of training data.
In the real world, natural conversational speech is an amalgam of speech segments, silences, and environmental/background and channel effects. Labeling the different regions of an acoustic signal according to their information levels would greatly benefit all automatic speech processing tasks. In the current work, we propose a novel segmentation approach based on a perception-based measure of speech intelligibility. Unlike segmentation approaches based on various forms of voice-activity detection (VAD), the proposed parsing approach exploits higher-level perceptual information about signal intelligibility levels. This labeling information is integrated into a novel multilevel framework for the automatic speaker recognition task. The system processes the input acoustic signal along independent streams reflecting various levels of intelligibility and then fuses the decision scores from the multiple streams according to their intelligibility contribution. Our results show that the proposed system achieves significant improvements over standard baseline and VAD-based approaches, and attains performance similar to that obtained with oracle speech segmentation information.
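The abstract does not state the fusion rule, so the following is only a minimal sketch of one plausible reading: a weighted linear fusion in which each stream's decision score is weighted by an intelligibility-derived weight. The function name `fuse_scores`, the three-stream layout, and all numbers are assumptions made for illustration, not values from the paper.

```python
def fuse_scores(stream_scores, stream_weights):
    """Fuse per-stream decision scores with a weighted average.

    Assumption: linear fusion with weights normalised to sum to 1.
    The paper may use a different fusion rule.
    """
    if len(stream_scores) != len(stream_weights):
        raise ValueError("need exactly one weight per stream")
    total = sum(stream_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(s * w for s, w in zip(stream_scores, stream_weights))
    return weighted / total


# Hypothetical scores for high-, medium- and low-intelligibility streams.
scores = [1.0, 0.5, -0.5]
# Hypothetical weights: the most intelligible stream counts twice as much.
weights = [2.0, 1.0, 1.0]
fused = fuse_scores(scores, weights)
```

With these illustrative weights, the low-intelligibility stream still contributes to the decision but cannot outvote the most intelligible one on its own.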
ISBN (print): 9787560848693
In this paper, we propose to evaluate the quality of emotional speech synthesis by means of an automatic emotion identification system. We test this approach using five different parametric speech synthesis systems, ranging from plain non-emotional synthesis to full re-synthesis of pre-recorded speech. We compare the results achieved with the automatic system to those of human perception tests. While preliminary, our results indicate that automatic emotion identification can be used to assess the quality of emotional speech synthesis, potentially replacing time-consuming and expensive human perception tests.
For better performance in multilayer or hierarchical classification of handwritten text, appropriate grouping of similar symbols is very important. Here we aim to develop a reliable grouping scheme for the similar-looking basic characters, numerals and vowel modifiers of the Bangla language. We experimented with thickened and thinned segmented handwritten text to compare which type of image is better for which group. For classification we chose the Support Vector Machine (SVM), as it outperforms other classifiers in this field. We used both the “one against one” and “one against all” strategies for multiclass SVM and compared their performance.
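The two multiclass strategies named above differ in how binary decisions are combined. The sketch below illustrates only that combination logic: "one against one" trains one binary classifier per pair of classes and takes a majority vote, while "one against all" trains one classifier per class and picks the highest score. The toy nearest-centroid rules stand in for trained SVMs, and the labels, centroids and helper names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def n_one_vs_one(k):
    """Number of binary classifiers needed for k classes, one per class pair."""
    return k * (k - 1) // 2


def one_vs_one_predict(classes, pairwise_winner):
    """Majority vote over all pairwise binary decisions."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def one_vs_all_predict(classes, score):
    """Pick the class whose 'this class vs. the rest' score is highest."""
    return max(classes, key=score)


# Toy stand-ins for trained SVMs: 1-D nearest-centroid rules (hypothetical).
centroids = {"A": 0.0, "B": 1.0, "C": 3.0}
x = 0.8  # sample to classify


def pairwise_winner(a, b):
    return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b


def score(c):
    return -abs(x - centroids[c])


classes = list(centroids)
ovo_label = one_vs_one_predict(classes, pairwise_winner)
ova_label = one_vs_all_predict(classes, score)
```

For k classes, "one against one" needs k(k-1)/2 binary classifiers and "one against all" needs k, so the former grows much faster as the number of symbols in a group increases.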
Short Utterance Speaker Recognition (SUSR) is an important area of speaker recognition when only a small amount of speech data is available for testing and training. We list the most commonly used state-of-the-art methods of speaker recognition and discuss the significance of prosodic speaker recognition. A short survey of SUSR is conducted, highlighting various methodologies for recognizing speakers from short utterances. We also specify future research directions in the field of SUSR which, together with modern technologies and the ongoing research in prosodic speaker recognition, can lead to better results in speaker recognition.
The impact of short utterances on speaker recognition is significant. Despite the advancements in short utterance speaker recognition (SUSR), text dependence and the role of phonemes in carrying speaker information need further investigation. This paper presents a novel method of using vowel categories for SUSR. We define Vowel Categories (VCs) considering the Chinese and English languages. After recognition and extraction of phonemes, the obtained vowels are divided into VCs, which are then used to develop a Universal Background VC Model (UBVCM) for each VC. A conventional GMM-UBM system is used for training and testing. The proposed categories give minimum EERs of 13.76%, 14.03% and 16.18% for 3-, 2- and 1-second utterances, respectively. Experimental results show that in text-dependent SUSR, significant speaker-specific information is present at the phoneme level. The similar properties of phonemes can be exploited so that accurate speech recognition is not required; rather, Phoneme Categories can be used effectively for SUSR. It is also shown that vowels contain a large amount of speaker information, which remains undisturbed when VCs are employed.
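The results above are reported as equal error rates (EERs). As a reminder of what that metric measures, here is a minimal sketch that sweeps a decision threshold over verification scores and returns the point where the false acceptance and false rejection rates are closest. The function name and the score lists are hypothetical and unrelated to the paper's data; evaluation toolkits typically interpolate between thresholds rather than picking the nearest one.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for two lists of verification scores.

    Assumption: a higher score means "more likely the target speaker",
    and a trial is accepted when its score is >= the threshold.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # False rejection rate: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance rate: impostor trials at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
genuine_scores = [2.1, 1.7, 0.9, 1.4, 0.2]
impostor_scores = [-1.0, 0.3, -0.4, 1.0, -2.2]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

An EER of 13.76% therefore means that, at the chosen operating point, roughly 13.76% of impostor trials are accepted and roughly the same share of genuine trials are rejected.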