Speech communication involves several steps: production (encoding), transmission, and hearing (decoding). At every step, acoustic and static distortions are inevitably introduced by differences of gender, age, microphone, room, line, auditory characteristics, etc. Despite these variations, human listeners extract linguistic information from speech as easily as if the variations did not disturb the communication at all. One may hypothesize that listeners modify their internal acoustic models whenever a speaker, room, microphone, or line changes. Alternatively, one may hypothesize that the linguistic information in speech can be represented separately from the extra-linguistic factors. In this study, inspired by the behaviors of infants and animals, our solution to these intrinsic and inevitable variations in speech is described [1,2,3]. Speech structures, invariant to these variations, are derived as completely transform-invariant features [4], and their linguistic and psychological validity is discussed here. Further, speech applications of ASR [3] and CALL [5] using the structures are shown, in which extremely robust performance against speaker variability is obtained with speech structures.
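The transform-invariance idea can be illustrated with a small sketch: if each speech event is modeled as a Gaussian in cepstral space, the matrix of pairwise Bhattacharyya distances between events is unchanged by any invertible affine transform of the feature space (a common model of speaker- and channel-induced distortion). The Gaussian modeling and this particular distance are illustrative assumptions, not necessarily the exact formulation of the cited work.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians; invariant under
    any invertible affine map x -> A x + b applied to both."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def structure(means, covs):
    """Speech 'structure': symmetric matrix of pairwise event distances."""
    n = len(means)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya(means[i], covs[i],
                                              means[j], covs[j])
    return D
```

Because every entry of the matrix survives an affine distortion of the whole feature space, the matrix itself serves as a speaker- and channel-invariant representation.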
A three-dimensional (3D) physiological articulatory model has been developed to account for effects of the biomechanical properties of the speech organs in speech production [1]. To control the model and investigate the mechanism of speech production, an efficient control module is necessary to estimate the muscle activation patterns that drive the 3D physiological articulatory model toward a desired articulatory posture. For this purpose, a feedforward control strategy is elaborated that maps an articulatory target to its corresponding muscle activation pattern via an intrinsic representation of vowel articulation. In this process, the articulatory postures are first mapped to their corresponding intrinsic representations; second, the articulatory postures are clustered in the space of intrinsic representations; third, for each cluster, a nonlinear function mapping the intrinsic representation of vowel articulation to a muscle activation pattern is approximated using a General Regression Neural Network (GRNN). The results show that the proposed feedforward control module can drive the 3D physiological articulatory model for vowel production with high accuracy, both acoustically and articulatorily.
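A GRNN is, in essence, Nadaraya-Watson kernel regression: the output for a query is a Gaussian-weighted average of the stored training outputs. The minimal sketch below shows such a mapping from an intrinsic representation to a muscle activation vector; the dimensionalities and the bandwidth are made-up illustration values, not the published model's.

```python
import numpy as np

class GRNN:
    """General Regression Neural Network (Nadaraya-Watson kernel
    regression) with an isotropic Gaussian kernel of width sigma."""
    def __init__(self, sigma=0.1):
        self.sigma = sigma

    def fit(self, X, Y):
        # X: (n, d_in) intrinsic representations
        # Y: (n, d_out) muscle activation patterns
        self.X = np.asarray(X, float)
        self.Y = np.asarray(Y, float)
        return self

    def predict(self, x):
        d2 = np.sum((self.X - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * self.sigma ** 2))
        return w @ self.Y / w.sum()
```

With a small bandwidth the network reproduces the training pairs almost exactly; a larger bandwidth smooths across neighboring postures, which is why the abstract's per-cluster training keeps each mapping locally simple.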
This paper introduces a speech-to-singing synthesis system, called SingBySpeaking, which can synthesize a singing voice given a speaking voice reading the lyrics of a song and its musical score. The system is based on the speech manipulation system STRAIGHT and comprises four models controlling three acoustic parameters: the fundamental frequency (F0), phoneme duration, and spectrum. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four types of F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by taking into consideration the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. SingBySpeaking enables us to synthesize natural singing voices merely by reading the lyrics of a song and to better understand differences between speaking and singing voices.
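Overshoot-style F0 fluctuations are commonly approximated by a damped second-order system responding to the step-wise note target, with vibrato added as a sinusoidal modulation. The sketch below uses that textbook approximation with illustrative parameter values (damping, natural frequency, vibrato depth/rate), not the published model's:

```python
import numpy as np

def f0_contour(target_cents, dur_s, fs=200, zeta=0.6, omega=40.0,
               vib_depth=30.0, vib_rate=6.0):
    """F0 contour (in cents, relative to the previous note) for one note:
    underdamped second-order response to the note target (overshoot),
    plus sinusoidal vibrato. All parameter values are illustrative."""
    n = int(dur_s * fs)
    t = np.arange(n) / fs
    y = np.zeros(n)
    v = 0.0
    for i in range(1, n):                      # semi-implicit Euler
        a = omega ** 2 * (target_cents - y[i - 1]) - 2 * zeta * omega * v
        v += a / fs
        y[i] = y[i - 1] + v / fs
    return t, y + vib_depth * np.sin(2 * np.pi * vib_rate * t)
```

With zeta < 1 the contour first shoots past the target before settling, which is the "overshoot" behavior the F0 control model reproduces at note transitions.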
Our goal is to represent commonsense knowledge as computational models, which are applied to spoken dialogue systems that realize smart man-machine communication by correctly understanding speakers' intentions and emotions. For this purpose, we have constructed a multimodal speech behavior corpus that includes metadata annotated from various viewpoints, such as utterances, actions, emotions, and thinking, for analyzing behavioral factors in thinking processes from various perspectives in everyday life. This paper describes a methodology for modeling thinking processes in problem solving based on child development, by analyzing the multimodal interaction data stored in the corpus. We especially focus on demonstrative-expression behavior, which serves as a signal when communicating with other people. We formulated a hypothesis on the developmental process in children that links physical expression skills with mental situations such as attentive ability and sociality. Based on this hypothesis, we constructed a demonstrative expression model that is used to visualize actual scenes. The results of our analysis show that the proposed method enables in-depth analysis of thinking processes in demonstrative-expression behavior.
Thai language processing has been investigated since the early 1980s; however, the performance of current applications that employ human language technologies is still far from market expectations. A shortage of shared resources and the lack of standard evaluation guidelines are among the causes that hinder progress in this field. As a national research organization of Thailand, the National Electronics and Computer Technology Center (NECTEC) has taken responsibility for developing and sharing language resources for research and education purposes since 1997. For speech processing research, a variety of speech corpora are necessary, as the acoustic characteristics of speech signals, the conditions of input channels, and the application domains are diverse. As with other major languages, our development started from a read-speech corpus collected in a controlled environment. The subsequent distributions, however, aim more toward spontaneous speech collected in real environments, such as telephone speech and broadcast-news speech. In addition, we design and construct speech corpora that cover extensive acoustic events, such as phone sequences and intonation patterns, suitable for speech analysis and speech synthesis research. A larger collection of speech corpora will help drive speech technology research in Thailand toward real-world applications. NECTEC has also taken the initiative in setting up standards for various issues in Thai language processing. Together with experts from various universities and organizations, we have organized BEST (Benchmark for Enhancing the Standard of Thai language processing), a series of contests on Thai language processing tasks such as word segmentation and named-entity recognition. With a standard evaluation protocol and annotation guidelines, along with a large amount of annotated data, the BEST events can help accelerate the progress of Thai language processing technologies through knowledge and resource sharing and benchmarking.
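Word segmentation contests are typically scored by word-level precision, recall, and F1 over aligned character spans: a predicted word counts as correct only if both its boundaries match the gold segmentation. A minimal scorer under that common convention (the exact BEST protocol may differ in detail):

```python
def seg_spans(words):
    """Character (start, end) spans implied by a segmentation."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def seg_f1(gold, pred):
    """Word-level F1: a predicted word is correct iff its span
    exactly matches a gold-word span."""
    g, p = seg_spans(gold), seg_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

Both segmentations must cover the same character string; comparing spans rather than word strings makes repeated words unambiguous.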
In China, there are many different dialects and sub-dialects. Because there are grammatical, lexical, phonological, and phonetic differences among them in varying degrees, people from different dialect regions often have difficulty in oral communication. Since 1956, standard Mandarin has been popularized across the country as the official language, and almost every dialect speaker has begun to learn Mandarin as a second language. But affected by their native dialects, many of them speak Mandarin with regional accents. In modern speech processing technologies, speech is represented by its spectrum, which contains not only the dialectal linguistic information but also extra-linguistic information such as the gender and age of the speaker. To focus exclusively on the linguistic features of dialectal utterances, a speaker-invariant structural representation of speech, originally proposed by the second author inspired by infants' language acquisition [1, 2], is proposed to represent the pronunciation of Chinese dialect speakers. Since purely dialectal information can be extracted by removing the extra-linguistic information from dialect speech, this pronunciation structure can be applied to estimate which dialect or sub-dialect region a speaker belongs to and to assess pronunciation. To verify the validity of our approach, speaker classification based on dialectal utterances of 38 Chinese finals is investigated, especially in terms of robustness to speaker variability. The result is linguistically reasonable and highly independent of age and gender. After that, a sub-dialect corpus is developed with a list of characters as reading materials, originally used for linguists' investigation of dialect speakers' pronunciation. Then, after a sub-dialect pronunciation structure is built for every speaker, their pronunciations are classified based on the distances among their structures. The result shows that the sub-dialect ...
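Classifying speakers "based on the distances among their structures" can be sketched simply: each pronunciation structure is a symmetric matrix of pairwise event distances, its upper triangle is vectorized, and a speaker is assigned the label of the nearest reference structure. The labels, matrix sizes, and values below are toy illustrations, not data from the study:

```python
import numpy as np

def structure_vector(D):
    """Upper-triangle entries of a structure matrix as a feature vector."""
    iu = np.triu_indices_from(D, k=1)
    return D[iu]

def structure_distance(D1, D2):
    """Euclidean distance between two speakers' structures."""
    return np.linalg.norm(structure_vector(D1) - structure_vector(D2))

def classify(query, refs):
    """Nearest-structure classification.
    refs: dict mapping a (sub-)dialect label to a reference structure."""
    return min(refs, key=lambda label: structure_distance(query, refs[label]))
```

Because the structure matrices themselves carry no absolute spectral values, this comparison operates only on the relative geometry of the speech events, which is what makes it robust to speaker variability.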