In recent years, Minzu University of China has established several speech databases of ethnic languages with funding from the national "211" and "985" programs. These databases include the "Speech Corpus of Chinese Endangered Ethnic Languages", "The Interlanguage Corpus of Tibetan Speaking Mandarin Chinese", "Phonetic File of Ethnic Languages in China", and "Multimedia Archive in International Phonetic Alphabet of Ethnic Languages in China". They were established to meet the needs of protecting the intangible culture of ethnic languages in China, developing information technology, promoting Putonghua, teaching national languages, and supporting linguistic research, among other purposes. This article provides basic information on the establishment and study of these databases and introduces the supporting platform for language database construction.
This paper reviews our proposed approach to voice conversion (VC) and voice quality control based on an eigenvoice technique. VC is a technique to modify nonlinguistic information such as speaker individuality while keeping linguistic information unchanged. In the traditional VC framework, a conversion model for a source and target speaker-pair needs to be trained in advance using a parallel data set consisting of utterance-pairs of these two speakers. To make VC technologies more practical, we have developed a new VC paradigm for flexibly building the conversion model for an arbitrary speaker-pair by effectively using speech samples of many other speakers. In this paper, we give an overview of eigenvoice conversion (EVC) as one of our proposed VC techniques.
This paper presents the development of the speech, text and pronunciation dictionary resources required to build a large vocabulary speech recognizer for the Malay language. The project is a collaboration among three universities: USM and MMU from Malaysia and NTU from Singapore. The Malay speech corpus consists of read speech (speaker independent/dependent and accent independent/dependent) and broadcast news. To date, 90 speakers have been recorded, for a total of nearly 70 hours of read speech, and 10 hours of broadcast news from local TV stations in Malaysia have been transcribed. The text corpus consists of 700 MB of data extracted from Malaysia’s local news web pages from 1998 to 2008, and a rule-based G2P tool was developed to generate the pronunciation dictionary.
This paper presents a machine learning method that enables robots to learn to communicate linguistically from scratch through verbal and behavioral interaction with users. The method combines speech, visual, and tactile information obtained by interaction in the real world. It learns speech units, words, concepts of objects, motions, grammar, and pragmatic and communicative capabilities, which are integrated in a dynamic graphical model. Experimental results show that through a practical, small number of learning episodes with a user, the robot was eventually able to understand even fragmental and ambiguous utterances, respond to them with confirmation questions and/or acting, generate directive utterances appropriate for the given situation, and answer questions. This paper discusses the importance of a developmental approach to realize natural situated human-robot conversations.
Singing voice synthesis is one of the hot topics in speech synthesis applications. This paper introduces a singing voice synthesis framework based on speech modification and proposes several detailed implementations for improving the quality of the synthesized singing voice. Subjective evaluation showed that the proposed methods are effective in enhancing the generated singing voice.
In this presentation, we describe the development of Chinese conversational segmented and POS-tagged corpora currently used in the NICT speech-to-speech translation system, including the design of specifications and the methods of annotation work. Over 500K manually checked utterances provide 3.5M words of Chinese corpora. As far as we know, they are the largest conversational textual corpora in the domain of travel. A set of three parallel corpora is obtained with the corresponding pairs of Japanese and English words from which the Chinese words are translated. Comparative analyses of the statistical characteristics of the corpora and of the performance of language models and speech recognition are conducted using these parallel corpora, and the problems with the Chinese corpora and solutions to them are discussed.
Spoken language translation (SLT), which bridges the gap between different languages, plays an important role in speech-to-speech translation. An SLT system has to deal with problems such as disfluency, ungrammaticality, absence of punctuation marks, and speech recognition errors. Statistical Machine Translation (SMT) is very popular for its adequate mathematical model, unsupervised learning capacity, and robustness. Rule-Based Machine Translation (RBMT) and Example-Based Machine Translation (EBMT) methods are also valuable: rules are good at modeling linguistic theory and phenomena, and the EBMT method is able to translate similar input very well. Hybrid machine translation is becoming more and more important in the machine translation community. This report introduces rule-based, example-based and statistical SLT methods and their combination.
The similarity between the speaker characteristics of synthetic speech and the natural voice of the target speaker is an important measure for evaluating the performance of a speech synthesis system. However, the similarity performance of the HMM-based parametric speech synthesis method is generally unsatisfactory. This paper studies the factors that may cause such similarity degradation through a group of subjective listening tests. These factors include the influence of the speech vocoder, the statistical model based generation of duration, F0 and spectral parameters, and rapid voice building by model adaptation. The results show that the generated duration and spectral parameters are the major factors that impair the speaker characteristics of synthetic speech. Hence, we experiment with increasing the number of distributions for context-dependent phone duration modeling and with introducing a unit selection method into spectral parameter generation to improve the similarity performance. The results of listening tests show that these two measures can improve the similarity performance effectively while preserving the naturalness of synthetic speech.
This paper presents a perception experiment to measure the ability of Japanese children in fourth and fifth grade elementary school to recognize culturally encoded expressions of politeness and impoliteness in their native language. Audio-visual stimuli were presented to listeners, who rated the degree of politeness and indicated a possible situation where such an expression could be used. Analysis of the results focuses on the differences and similarities between adult listeners and children, for each attitude and modality. Facial information seems to be retrieved earlier than audio information, and expressions of different degrees of Japanese politeness, including expressions of kyoshuku, are still not understood at around 10 years of age.
This paper describes the design of a unified framework for a multilingual text-to-speech (TTS) synthesis engine, Crystal. The unified framework defines the common TTS modules for different languages and/or dialects. The interfaces between consecutive modules conform to the Speech Synthesis Markup Language (SSML) specification for standardization, interoperability, multilinguality, and extensibility. Detailed module divisions and implementation technologies for the unified framework are introduced, together with possible extensions for algorithm research and evaluation of TTS synthesis.