This paper presents a model of a lower level contextual effect that can cope with coarticulation problems, especially vowel neutralization. The model makes spectral peak trajectories overshoot on the basis of spectral peak interaction, assuming that the lower level contextual effect is represented as the sum of the interactions between each pair of spectral peaks. The interaction function is determined experimentally so as to reduce the distance between a real spectral peak and its target, which is the mean spectral peak computed for vowels uttered in isolation. The interaction function thus determined suggests that: (1) there can be a time‐frequency lateral inhibition in the auditory system like that on the retina in the visual system; (2) the interaction function is consistent with the results of psychoacoustic experiments concerning assimilation and/or contrast effects using paired single‐formant stimuli; and (3) the contextual effect between adjacent phonemes can be represented as the sum of the assimilation and/or contrast effects between each spectral peak pair. When the determined interaction function is applied to real speech data to cope with the coarticulation problem, spectral peak trajectories overshoot, spectral peaks at the vowel center approach their own targets, and the distance between each pair of vowel categories increases.
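As a rough illustration of how such a pairwise peak-interaction model can be organized, the following sketch adds to each spectral peak the summed interaction with every other peak. The functional form, time constants, and Bark-scale units are illustrative assumptions, not the experimentally determined function described above.

```python
import numpy as np

def interaction(df, dt):
    """Hypothetical signed interaction (in Bark) exerted on a spectral peak by a
    neighboring peak at frequency distance df (Bark) and time distance dt (ms):
    attraction (assimilation) when the peaks are close in frequency, repulsion
    (contrast) when they are farther apart, both decaying with |dt|."""
    pull = -df * np.exp(-(df / 1.0) ** 2)                              # assimilation toward the neighbor
    push = 0.3 * np.sign(df) * np.exp(-((abs(df) - 3.0) / 1.0) ** 2)   # contrast away from it
    return (pull + push) * np.exp(-abs(dt) / 50.0)

def apply_contextual_effect(peaks):
    """peaks: list of (time_ms, freq_bark) spectral peaks of one utterance.
    The contextual effect on each peak is modeled as the sum of its pairwise
    interactions with all other peaks; adding it makes the corrected peak
    trajectories overshoot toward the isolated-vowel targets."""
    return [(t_i, f_i + sum(interaction(f_i - f_j, t_i - t_j)
                            for j, (t_j, f_j) in enumerate(peaks) if j != i))
            for i, (t_i, f_i) in enumerate(peaks)]
```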
The McGurk effect is a phenomenon that demonstrates a perceptual fusion between auditory and visual (lip‐read) information in speech perception under a visual‐auditory discrepancy condition (using dubbed videotapes). This paper examined the relation between the McGurk effect and the intelligibility of the auditory stimuli. A female narrator's speech was videotaped for ten Japanese syllables (/ba/, /pa/, /ma/, /wa/, /da/, /ta/, /na/, /ra/, /ga/, /ka/). The video and audio signals for these ten syllables were combined, resulting in 100 audio‐visual stimuli. These stimuli were presented to ten subjects who were required to identify the stimuli as heard speech in both noisy and noise‐free conditions. For both conditions, the intelligibility of the auditory stimuli was measured by presenting the auditory stimuli alone. In the noise‐free condition, the McGurk effect was small and found only in conditions in which the intelligibility of the auditory stimuli was not 100%. In the noisy condition, the McGurk effect was very strong and widespread. These results suggest that incomplete intelligibility of the auditory stimuli is necessary for the McGurk effect.
This study investigates the perceptual characteristics of American English /r,l/ for Japanese listeners using synthesized stimuli. Five major findings are obtained. (1) The Japanese listeners identify the stimuli using a variety of acoustic cues, and their response patterns are strongly influenced by the acoustic features of the stimuli. In contrast, American listeners can identify /r/ and /l/ as long as a primary cue remains, even when some of the other acoustic cues are missing. (2) Since Japanese listeners tend to perceive some stimuli as /w/ more often than American listeners do, perception experiments that include /w/ as well as /r/ and /l/ among the identification choices better clarify the perception mode of the Japanese listeners. (3) A positive relationship between the ability to identify natural /r,l,w/ spoken by native Americans and the ability to identify the synthesized /r,l,w/ is found for the Japanese listeners. (4) Contextual effects in words strongly affect the Japanese listeners' identification of /r/ and /l/. (5) Japanese listeners who have lived in English‐speaking countries before a certain age are able to identify /r/ and /l/ as well as native Americans do.
Segmental duration of each phoneme changes depending upon the speaking rate. Generally, vowel parts are more easily compressed or expanded than consonant parts in fast or slow speech, respectively. Questions raised in this paper include how the speaking rate can be extracted from the speech signal without knowing its content (i.e., phonetic information) and what kind of time‐scale modification can be chosen in order to control the speaking rate. First, the segmental duration compressibility of the speech signal was defined by the path slopes in DTW spectral matching when utterances at various speaking rates were matched to a reference utterance at a normal speaking rate. On the assumption that the compressibility is inversely proportional to segmental spectrum changes, the relationship between the compressibility and the average cepstral time difference Δcep [S. Furui, IEEE Trans. Acoust. Speech Signal Process. ASSP‐34, 52–59 (1986)] was studied. The results showed that Δcep is an efficient parameter for representing the compressibility. By using Δcep as a control parameter of the speaking rate, nonlinear time‐scale modification can be achieved without speech quality degradation.
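The Δcep measure is, in essence, the utterance-averaged magnitude of the cepstral time derivative estimated by regression over a short window. The sketch below follows that reading; the regression window length and the use of a Euclidean norm are assumptions here, and the exact definition is given in the cited Furui (1986) paper.

```python
import numpy as np

def delta_cepstrum(cep, K=2):
    """cep: (n_frames, n_coeffs) array of cepstral coefficients.
    First-order regression (delta) coefficients over a +/-K frame window,
    the usual way a cepstral time derivative is estimated."""
    cep = np.asarray(cep, dtype=float)
    n = len(cep)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")
    return np.array([sum(k * (padded[t + K + k] - padded[t + K - k])
                         for k in range(1, K + 1)) / denom
                     for t in range(n)])

def average_cepstral_time_difference(cep, K=2):
    """Average over the utterance of the norm of the delta-cepstrum: small values
    indicate spectrally stable (highly compressible) segments such as vowel
    centers, large values indicate rapid spectral change."""
    return float(np.mean(np.linalg.norm(delta_cepstrum(cep, K), axis=1)))
```

A rate-control scheme along the lines described above would then stretch or compress local time by an amount that decreases as this measure grows, so that spectrally stable regions absorb most of the duration change.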
For the purpose of natural and high‐quality speech synthesis, the role of prosody in speech perception has been studied. The prosodic components that contribute to the expression of emotions and their intensity were clarified by analyzing emotional speech and by performing listening tests on synthetic speech. It has been confirmed that the prosodic components, which comprise pitch structure, temporal structure, and amplitude structure, contribute to the expression of emotions more than the spectral structure of speech does. Listening test results also showed that the temporal structure was the most important for the expression of anger, while both the amplitude structure and the pitch structure were much more important for the intensity of anger. Pitch structure also played a significant role in the expression of joy and its intensity. These results suggest the possibility of converting a neutral utterance (i.e., one with no particular emotion) into utterances expressing various kinds of emotions. They can also be applied to controlling the emotional characteristics of speech in synthesis by rule.
Japanese word accent is known to be characterized by changes in vocal pitch. For speech with a transcervical‐type electrolarynx, producing pitch accent is a difficult task, since most electrolarynges are not designed to facilitate quick pitch changes. However, skilled Japanese electrolarynx talkers manage to produce word accent contrasts to a certain extent. This paper reports some perceptual cues, and their effects, for word accent produced by Japanese electrolarynx talkers. The subjects are four male laryngectomees who use electrolarynges and have good speech intelligibility. Audio recordings were made during multiple repetitions of ten types of word pairs. Each pair consisted of two‐mora words with identical phoneme sequences and contrasting accent types. Acoustic analyses revealed that the talkers successfully produced accent contrasts mainly by changing segmental duration. Some speech strategies will be discussed that normal talkers do not use to produce the major perceptual cues but that are sometimes effective for the speech‐handicapped population, for whom the normal cues are not available.
Teuvo Kohonen has recently developed an algorithm similar to that used in his feature map classifiers, but in which learning is supervised rather than unsupervised. This algorithm, known as learning vector quantization (LVQ), is similar to a K‐nearest‐neighbor algorithm and allows a system to learn a vector quantization that assigns inputs to different categories. The algorithm is very simple, does not require a large number of training trials, and is capable of forming complex decision regions. As a recognition task, the speaker‐dependent recognition of the phonemes /b/, /d/, and /g/ in different phonetic contexts is considered. The training procedure is applied to speech patterns that are stepped through in time, thus providing the system with a measure of shift invariance. Preliminary results indicate that LVQ can yield a recognition rate of 98.3% for 1880 testing tokens from three speakers. The simple vector operations that constitute the core of LVQ allow very easy parallelization and thus high learning speed, i.e., training in less than an hour.
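For reference, a minimal LVQ1-style sketch of the codebook update described above; the codebook size, learning-rate schedule, and initialization are illustrative assumptions and not the settings used in the experiments.

```python
import numpy as np

def train_lvq1(X, y, n_codebooks_per_class=8, epochs=20, lr0=0.05, seed=0):
    """LVQ1: move the nearest codebook vector toward a training vector when
    their class labels match, and away from it when they differ."""
    rng = np.random.default_rng(seed)
    proto, proto_y = [], []
    for c in np.unique(y):
        # Initialize codebook vectors from random training samples of each class.
        idx = rng.choice(np.flatnonzero(y == c), n_codebooks_per_class, replace=False)
        proto.append(X[idx])
        proto_y.append(np.full(n_codebooks_per_class, c))
    proto, proto_y = np.vstack(proto).astype(float), np.concatenate(proto_y)

    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)          # linearly decaying learning rate
        for i in rng.permutation(len(X)):
            w = np.argmin(np.sum((proto - X[i]) ** 2, axis=1))   # nearest codebook vector
            sign = 1.0 if proto_y[w] == y[i] else -1.0
            proto[w] += sign * lr * (X[i] - proto[w])
    return proto, proto_y

def classify_lvq(x, proto, proto_y):
    """Nearest-codebook (1-NN over the learned codebook) classification."""
    return proto_y[np.argmin(np.sum((proto - x) ** 2, axis=1))]
```

For the task above, X would hold fixed-length spectral patterns stepped through time around each /b/, /d/, /g/ token, and y the corresponding phoneme labels; both are placeholders here.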
Vowel identification tests were carried out using 200 synthesized vowel‐like stimuli to examine the role of the fundamental frequency F0 in vowel perception. These stimuli were synthetic versions of the five Japanese vowels, /i/, /e/, /a/, /o/, and /u/, in which the F0 and/or the formant frequencies Fi (i = 1,2,3,4) were modified: ten F0 values were formed by adding n/3 Bark (n = 0,1,…,9) to the original F0, and four formant frequency sets were formed by adding m Bark (m = 0,1,2,3) to the original formant frequencies of each vowel. The results are the following: (1) the perceived vowel height shifts upward when the F0 shifts upward while all formant frequencies remain the same; (2) this shift in vowel height is more distinct for mid and low vowels than for high vowels; and (3) the vowel height does not change when the F0 and all formant frequencies are shifted upward by the same amount along the Bark scale. Further results, along with the hypothesis that a high F0 is regarded as the first formant in mid and low vowel perception, will be discussed.
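The stimulus manipulation amounts to converting a frequency to the Bark scale, adding a fixed offset, and converting back. A minimal sketch, assuming the Traunmüller Hz-to-Bark approximation (the abstract does not state which Bark formula was actually used), with an illustrative original F0:

```python
def hz_to_bark(f_hz):
    """Traunmüller (1990) approximation of the Bark scale (an assumption here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z_bark):
    """Inverse of the Traunmüller approximation."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)

def shift_by_bark(f_hz, delta_bark):
    """Shift a frequency upward by a fixed amount on the Bark scale, as done for
    the F0 (n/3 Bark steps) and formant (m Bark steps) manipulations."""
    return bark_to_hz(hz_to_bark(f_hz) + delta_bark)

# Example: the ten F0 values derived from a hypothetical original F0 of 125 Hz.
f0_values = [shift_by_bark(125.0, n / 3.0) for n in range(10)]
```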
It was revealed that a trained human spectrogram reader could perform accurate speech labeling and that this accuracy was based on the flexibility of his/her decision process, which uses many kinds of spectrographic features [S. Katagiri, SP87‐115, IEICE Tech. Rep. (1988)]. In this paper, a new flexible speech labeling system that simulates the trained reader's capability is proposed. The main task of the system is to apply the trial‐and‐error process used in a human reader's labeling work; therefore, a relaxation method is adopted here [S. Katagiri, 2‐1‐19, A. S. J. Spring Meeting (1988)]. The system consists of three parts: an acoustic analyzer, a verifier, and a supervisor. In the acoustic analyzer, many kinds of acoustic feature candidates, e.g., formant and pitch frequencies, are calculated. In the verifier, possible speech labels are verified. The supervisor, whose behavior principle is based on the relaxation method, controls the whole system. Experimental results show that the system's performance is comparable to a human reader's and that very accurate labels are created automatically.
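As an illustration of the kind of iterative refinement a relaxation method provides, the following is a generic Rosenfeld-style relaxation-labeling update, not the authors' algorithm; the simple left/right neighbor structure and the compatibility matrix (with entries assumed in [-1, 1]) are assumptions for the sketch.

```python
import numpy as np

def relaxation_labeling(p, compat, iterations=10):
    """Generic relaxation-labeling update.

    p:      (n_segments, n_labels) initial label probabilities from a verifier
    compat: (n_labels, n_labels) compatibility between labels of adjacent segments
    Each iteration reweights every segment's label probabilities by the support
    they receive from the current labeling of the neighboring segments."""
    p = np.array(p, dtype=float)
    for _ in range(iterations):
        left = np.vstack([p[:1], p[:-1]])     # previous segment (edge repeated)
        right = np.vstack([p[1:], p[-1:]])    # next segment (edge repeated)
        support = (left @ compat + right @ compat) / 2.0
        p *= 1.0 + support
        p /= p.sum(axis=1, keepdims=True)     # renormalize to probabilities
    return p
```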
Perceptual units for category identification of infant cries have been studied. The three cry categories discussed in this paper are the hunger cry, the call cry (i.e., a cry calling for infant‐mother interaction), and the anger cry. The original samples have been classified into these three categories. In order to generate the stimuli used in the perceptual experiments, each cry sample is first segmented into single‐segment units according to the breath groups of the sample. Next, the single‐segment units are combined with each other in temporal order to generate two‐, three‐, five‐, and seven‐segment unit stimuli. In addition to the multisegment unit stimuli thus obtained, the single‐segment units and the three original samples are used as one‐segment unit stimuli and full‐segment unit stimuli, respectively, in the perceptual experiments. In identifying the cry stimuli, subjects are instructed to make a forced choice among the three cry categories. The experimental results show that category identification rates depend greatly upon the number of segments making up each stimulus. However, the identification rates saturate at two‐segment units in the call cry and at three‐ to five‐segment units in the hunger and anger cries. This fact indicates that units of two to five segments are the perceptual units. The temporal duration of the perceptual units is similar across all three categories (about 6–8 s).