This paper presents a model of a lower level contextual effect that can cope with coarticulation problems, especially vowel neutralization. The model makes spectral peak trajectories overshoot on the basis of spectral peak interaction, assuming that the lower level contextual effect is represented as the sum of the interactions between each pair of spectral peaks. The interaction function is determined experimentally so as to reduce the distance between a real spectral peak and its target, which is the mean spectral peak computed for vowels uttered in isolation. The interaction function thus determined suggests that: (1) there can be a time‐frequency lateral inhibition in the auditory system like that on the retina in the visual system; (2) the interaction function is consistent with the results of psychoacoustic experiments concerning assimilation and/or contrast effects using paired single‐formant stimuli; and (3) the contextual effect between adjacent phonemes can be represented as the sum of the assimilation and/or contrast effects between each spectral peak pair. When the determined interaction function is applied to real speech data to cope with the coarticulation problem, spectral peak trajectories overshoot, spectral peaks at the vowel center approach their own targets, and the distance between each pair of vowel categories increases.
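As a rough illustration of how such a pairwise peak-interaction model can be organized, the following sketch adds to each spectral peak the summed interaction with every other peak. The functional form, time constants, and Bark-scale units are illustrative assumptions, not the experimentally determined function described above.

```python
import numpy as np

def interaction(df, dt):
    """Hypothetical signed interaction (in Bark) exerted on a spectral peak by a
    neighboring peak at frequency distance df (Bark) and time distance dt (ms):
    attraction (assimilation) when the peaks are close in frequency, repulsion
    (contrast) when they are farther apart, both decaying with |dt|."""
    pull = -df * np.exp(-(df / 1.0) ** 2)                              # assimilation toward the neighbor
    push = 0.3 * np.sign(df) * np.exp(-((abs(df) - 3.0) / 1.0) ** 2)   # contrast away from it
    return (pull + push) * np.exp(-abs(dt) / 50.0)

def apply_contextual_effect(peaks):
    """peaks: list of (time_ms, freq_bark) spectral peaks of one utterance.
    The contextual effect on each peak is modeled as the sum of its pairwise
    interactions with all other peaks; adding it makes the corrected peak
    trajectories overshoot toward the isolated-vowel targets."""
    return [(t_i, f_i + sum(interaction(f_i - f_j, t_i - t_j)
                            for j, (t_j, f_j) in enumerate(peaks) if j != i))
            for i, (t_i, f_i) in enumerate(peaks)]
```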
The McGurk effect is a phenomenon that demonstrates a perceptual fusion between auditory and visual (lip‐read) information in speech perception under a visual‐auditory discrepancy condition (using dubbed videotapes). This paper examined the relation between the McGurk effect and the intelligibility of the auditory stimuli. A female narrator's speech was videotaped for ten Japanese syllables (/ba/, /pa/, /ma/, /wa/, /da/, /ta/, /na/, /ra/, /ga/, /ka/). The video and audio signals for these ten syllables were combined, resulting in 100 audio‐visual stimuli. These stimuli were presented to ten subjects who were required to identify the stimuli as heard speech in both noisy and noise‐free conditions. For both conditions, the intelligibility of the auditory stimuli was measured by presenting the auditory stimuli alone. In the noise‐free condition, the McGurk effect was small and found only in conditions in which the intelligibility of the auditory stimuli was not 100%. In the noisy condition, the McGurk effect was very strong and widespread. These results suggest that incomplete intelligibility of the auditory stimuli is necessary for the McGurk effect.
This study investigates the perceptual characteristics of American English /r,l/ for Japanese listeners using synthesized stimuli. Five major findings are obtained. (1) The Japanese listeners identify the stimuli using a variety of acoustic cues, and their response patterns are strongly influenced by the acoustic features of the stimuli. In contrast, American listeners can identify /r/ and /l/ as long as a primary cue remains, even when some of the other acoustic cues are missing. (2) Since Japanese listeners tend to perceive some stimuli as /w/ more often than American listeners do, perception experiments that include /w/ as well as /r/ and /l/ among the identification choices better clarify the perception mode of the Japanese listeners. (3) A positive relationship between the ability to identify natural /r,l,w/ spoken by native Americans and the ability to identify the synthesized /r,l,w/ is found for the Japanese listeners. (4) Contextual effects in words strongly affect the Japanese listeners' identification of /r/ and /l/. (5) Japanese listeners who have lived in English‐speaking countries before a certain age are able to identify /r/ and /l/ as well as native Americans do.
Segmental duration of each phoneme changes depending upon the speaking rate. Generally, vowel parts are more easily compressed or expanded than consonant parts in fast or slow speech, respectively. Questions raised in this paper include how the speaking rate can be extracted from the speech signal without knowing its content (i.e., phonetic information) and what kind of time‐scale modification can be chosen in order to control the speaking rate. First, the segmental duration compressibility of the speech signal was defined by the path slopes in DTW spectral matching when utterances at various speaking rates were matched to a reference utterance at a normal speaking rate. On the assumption that the compressibility is inversely proportional to segmental spectrum changes, the relationship between the compressibility and the average cepstral time difference Δcep [S. Furui, IEEE Trans. Acoust. Speech Signal Process. ASSP‐34, 52–59 (1986)] was studied. The results showed that Δcep is an efficient parameter for representing the compressibility. By using Δcep as a control parameter of the speaking rate, nonlinear time‐scale modification can be achieved without speech quality degradation.
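The Δcep measure is, in essence, the utterance-averaged magnitude of the cepstral time derivative estimated by regression over a short window. The sketch below follows that reading; the regression window length and the use of a Euclidean norm are assumptions here, and the exact definition is given in the cited Furui (1986) paper.

```python
import numpy as np

def delta_cepstrum(cep, K=2):
    """cep: (n_frames, n_coeffs) array of cepstral coefficients.
    First-order regression (delta) coefficients over a +/-K frame window,
    the usual way a cepstral time derivative is estimated."""
    cep = np.asarray(cep, dtype=float)
    n = len(cep)
    denom = 2 * sum(k * k for k in range(1, K + 1))
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")
    return np.array([sum(k * (padded[t + K + k] - padded[t + K - k])
                         for k in range(1, K + 1)) / denom
                     for t in range(n)])

def average_cepstral_time_difference(cep, K=2):
    """Average over the utterance of the norm of the delta-cepstrum: small values
    indicate spectrally stable (highly compressible) segments such as vowel
    centers, large values indicate rapid spectral change."""
    return float(np.mean(np.linalg.norm(delta_cepstrum(cep, K), axis=1)))
```

A rate-control scheme along the lines described above would then stretch or compress local time by an amount that decreases as this measure grows, so that spectrally stable regions absorb most of the duration change.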
For the purpose of natural and high‐quality speech synthesis, the role of prosody in speech perception has been studied. The prosodic components that contribute to the expression of emotions and their intensity were clarified by analyzing emotional speech and by performing listening tests on synthetic speech. It has been confirmed that the prosodic components, which comprise pitch structure, temporal structure, and amplitude structure, contribute to the expression of emotions more than the spectral structure of speech does. Listening test results also showed that the temporal structure was the most important for the expression of anger, while both the amplitude structure and the pitch structure were much more important for the intensity of anger. Pitch structure also played a significant role in the expression of joy and its intensity. These results suggest the possibility of converting a neutral utterance (i.e., one with no particular emotion) into utterances expressing various kinds of emotions. They can also be applied to controlling the emotional characteristics of speech in synthesis by rule.
Japanese word accent is known to be characterized by changes in vocal pitch. For speech with a transcervical‐type electrolarynx, producing pitch accent is a difficult task, since most electrolarynges are not designed to facilitate quick pitch changes. However, skilled Japanese electrolarynx talkers manage to produce word accent contrasts to a certain extent. This paper reports some perceptual cues, and their effects, for word accent produced by Japanese electrolarynx talkers. The subjects are four male laryngectomees who use electrolarynges and have good speech intelligibility. Audio recordings were made during multiple repetitions of ten types of word pairs. Each pair consisted of two‐mora words with identical phoneme sequences and contrasting accent types. Acoustic analyses revealed that the talkers successfully produced accent contrasts mainly by changing segmental duration. Some speech strategies will be discussed that normal talkers do not use to produce the major perceptual cues but that are sometimes effective for the speech‐handicapped population, for whom the normal cues are not available.
Teuvo Kohonen has recently developed an algorithm similar to that used in his feature map classifiers, but in which learning is supervised rather than unsupervised. This algorithm, known as learning vector quantization (LVQ), is similar to a K‐nearest‐neighbor algorithm and allows a system to learn a vector quantization that assigns inputs to different categories. The algorithm is very simple, does not require a large number of training trials, and is capable of forming complex decision regions. As a recognition task, the speaker‐dependent recognition of the phonemes /b/, /d/, and /g/ in different phonetic contexts is considered. The training procedure is applied to speech patterns that are stepped through in time, thus providing the system with a measure of shift invariance. Preliminary results indicate that LVQ can yield a recognition rate of 98.3% for 1880 testing tokens from three speakers. The simple vector operations that constitute the core of LVQ allow very easy parallelization and thus high learning speed, i.e., training in less than an hour.
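For reference, a minimal LVQ1-style sketch of the codebook update described above; the codebook size, learning-rate schedule, and initialization are illustrative assumptions and not the settings used in the experiments.

```python
import numpy as np

def train_lvq1(X, y, n_codebooks_per_class=8, epochs=20, lr0=0.05, seed=0):
    """LVQ1: move the nearest codebook vector toward a training vector when
    their class labels match, and away from it when they differ."""
    rng = np.random.default_rng(seed)
    proto, proto_y = [], []
    for c in np.unique(y):
        # Initialize codebook vectors from random training samples of each class.
        idx = rng.choice(np.flatnonzero(y == c), n_codebooks_per_class, replace=False)
        proto.append(X[idx])
        proto_y.append(np.full(n_codebooks_per_class, c))
    proto, proto_y = np.vstack(proto).astype(float), np.concatenate(proto_y)

    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)          # linearly decaying learning rate
        for i in rng.permutation(len(X)):
            w = np.argmin(np.sum((proto - X[i]) ** 2, axis=1))   # nearest codebook vector
            sign = 1.0 if proto_y[w] == y[i] else -1.0
            proto[w] += sign * lr * (X[i] - proto[w])
    return proto, proto_y

def classify_lvq(x, proto, proto_y):
    """Nearest-codebook (1-NN over the learned codebook) classification."""
    return proto_y[np.argmin(np.sum((proto - x) ** 2, axis=1))]
```

For the task above, X would hold fixed-length spectral patterns stepped through time around each /b/, /d/, /g/ token, and y the corresponding phoneme labels; both are placeholders here.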
Vowel identification tests were carried out using 200 synthesized vowel‐like stimuli to examine the role of the fundamental frequency F0 in vowel perception. These stimuli were synthetic versions of the five Japanese vowels, /i/, /e/, /a/, /o/, and /u/, in which the F0 and/or the formant frequencies Fi (i = 1,2,3,4) were modified: ten F0 values were formed by adding n/3 Bark (n = 0,1,…,9) to the original F0, and four formant frequency sets were formed by adding m Bark (m = 0,1,2,3) to the original formant frequencies of each vowel. The results are the following: (1) the perceived vowel height shifts upward when the F0 shifts upward while all formant frequencies remain the same; (2) this shift in vowel height is more distinct for mid and low vowels than for high vowels; and (3) the vowel height does not change when the F0 and all formant frequencies are shifted upward by the same amount along the Bark scale. Further results, along with the hypothesis that a high F0 is regarded as the first formant in mid and low vowel perception, will be discussed.
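The stimulus manipulation amounts to converting a frequency to the Bark scale, adding a fixed offset, and converting back. A minimal sketch, assuming the Traunmüller Hz-to-Bark approximation (the abstract does not state which Bark formula was actually used), with an illustrative original F0:

```python
def hz_to_bark(f_hz):
    """Traunmüller (1990) approximation of the Bark scale (an assumption here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z_bark):
    """Inverse of the Traunmüller approximation."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)

def shift_by_bark(f_hz, delta_bark):
    """Shift a frequency upward by a fixed amount on the Bark scale, as done for
    the F0 (n/3 Bark steps) and formant (m Bark steps) manipulations."""
    return bark_to_hz(hz_to_bark(f_hz) + delta_bark)

# Example: the ten F0 values derived from a hypothetical original F0 of 125 Hz.
f0_values = [shift_by_bark(125.0, n / 3.0) for n in range(10)]
```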
It was revealed that a trained human spectrogram reader could perform accurate speech labeling and that this accuracy was based on the flexibility of his/her decision process, which uses many kinds of spectrographic features [S. Katagiri, SP87‐115, IEICE Tech. Rep. (1988)]. In this paper, a new flexible speech labeling system that simulates the trained reader's capability is proposed. The main task of the system is to apply the trial‐and‐error process used in a human reader's labeling work; therefore, a relaxation method is adopted here [S. Katagiri, 2‐1‐19, A. S. J. Spring Meeting (1988)]. The system consists of three parts: an acoustic analyzer, a verifier, and a supervisor. In the acoustic analyzer, many kinds of acoustic feature candidates, e.g., formant and pitch frequencies, are calculated. In the verifier, possible speech labels are verified. The supervisor, whose behavior principle is based on the relaxation method, controls the whole system. Experimental results show that the system's performance is comparable to a human reader's and that very accurate labels are created automatically.
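As an illustration of the kind of iterative refinement a relaxation method provides, the following is a generic Rosenfeld-style relaxation-labeling update, not the authors' algorithm; the simple left/right neighbor structure and the compatibility matrix (with entries assumed in [-1, 1]) are assumptions for the sketch.

```python
import numpy as np

def relaxation_labeling(p, compat, iterations=10):
    """Generic relaxation-labeling update.

    p:      (n_segments, n_labels) initial label probabilities from a verifier
    compat: (n_labels, n_labels) compatibility between labels of adjacent segments
    Each iteration reweights every segment's label probabilities by the support
    they receive from the current labeling of the neighboring segments."""
    p = np.array(p, dtype=float)
    for _ in range(iterations):
        left = np.vstack([p[:1], p[:-1]])     # previous segment (edge repeated)
        right = np.vstack([p[1:], p[-1:]])    # next segment (edge repeated)
        support = (left @ compat + right @ compat) / 2.0
        p *= 1.0 + support
        p /= p.sum(axis=1, keepdims=True)     # renormalize to probabilities
    return p
```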
Perceptual units for category identification of infant cries have been studied. The three cry categories discussed in this paper are the hunger cry, the call cry (i.e., a cry calling for infant‐mother interaction), and the anger cry. The original samples have been classified into these three categories. In order to generate the stimuli used in the perceptual experiments, each cry sample is first segmented into single‐segment units according to the breath groups of the sample. Next, the single‐segment units are combined with each other in temporal order to generate two‐, three‐, five‐, and seven‐segment unit stimuli. In addition to the multisegment unit stimuli thus obtained, the single‐segment units and the three original samples are used as one‐segment unit stimuli and full‐segment unit stimuli, respectively, in the perceptual experiments. In identifying the cry stimuli, subjects are instructed to make a forced choice among the three cry categories. The experimental results show that category identification rates depend greatly upon the number of segments making up each stimulus. However, the identification rates saturate at two‐segment units in the call cry and at three‐ to five‐segment units in the hunger and anger cries. This fact indicates that units of two to five segments are the perceptual units. The temporal duration of the perceptual units is similar across all three categories (about 6–8 s).