ISBN: (print) 7801501144
This paper gives an overview of the progress and results of the VODIS (Voice-operated Driver information Systems) project, an EU-funded project involving many industrial and academic partners within Europe. It describes the architecture and functionality of a prototype driver information system that can control telephone and audio devices, and that goes beyond these applications by letting the user enter a navigation destination by speech. The demonstrator allows speaker-independent input of up to 70 predefined command words and phrases, as well as dynamically generated names (e.g. radio stations, phonebook entries), for hands-free and eyes-free voice operation of the car infotainment functions mentioned above. A prototype system has been realized in four European languages: German, French, English and Italian. User evaluations with drivers under realistic conditions have been conducted with this system, and the results are presented in this paper.
This paper describes the development and testing of a pilot spoken dialogue system for bus travel information in the city of Trondheim, Norway. The system-driven dialogue was designed on the basis of analyzed recordings of human-human operator dialogues and Wizard-of-Oz (WoZ) dialogues, together with a text-based inquiry system for the web. The dialogue system employs a flexible speech recognizer and an utterance concatenation procedure for speech output. Even though the system is intended for research only, it has been accessible through a public phone number since October 1999. During this period all dialogues have been recorded. From these, approximately 350 dialogues were selected for annotation and comparison with 120 dialogues from the WoZ recordings. The experiments showed that the turn error rate was more than twice as large for the real dialogues as for the WoZ calls, i.e., 13.3% versus 5.7%. Thus, the WoZ results did not give a reliable estimate of the true performance. Our experiments indicate that the current flexible speech recognizer should be further optimized.
This paper describes MILER (Multi-modal data Logger for Evaluation and Report), a web-based multi-service monitoring, logging and reporting tool for advanced multimodal dialog systems. MILER has been designed to directly arrange and synchronize logging data collected from live services and to provide real-time reports about service usage and system performance. Special attention has been given to the architecture design in order to achieve service and access-device independence and reliable synchronization of data from distributed logs. MILER allows researchers to analyze multi-modal interactions, inspect the call flow, reconstruct the system/user dialogue turns, play the recorded user utterances, and obtain a preliminary dialogue performance evaluation. It also supports labeling and annotation of the dialogue turns for further offline analysis. Once the user inputs (i.e. speech and other input modalities) are manually transcribed and labeled, along with detailed log events from each dialog, MILER derives a set of objective measures, which includes word and concept accuracy, number of attempts per concept, dialog turn counts and duration, and task completion rates. Subjective measures extracted from user surveys, including perceived task success and ease-of-use measures, can be combined with the objective measures and the results used later for accuracy computation.
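As an illustration of two of the objective measures listed above, the following is a minimal sketch and not MILER's actual implementation. It assumes that word accuracy and concept accuracy are both computed as one minus the edit distance divided by the reference length, which is a common convention but is not stated in the abstract. The function names and the `completed` field are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Accuracy of hyp against ref; works for word lists or concept lists."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


def task_completion_rate(dialogs):
    """Fraction of dialogs whose (hypothetical) 'completed' flag is True."""
    if not dialogs:
        raise ValueError("no dialogs given")
    return sum(1 for d in dialogs if d["completed"]) / len(dialogs)


if __name__ == "__main__":
    ref_words = "i want to fly to boston".split()
    hyp_words = "i want fly to austin".split()
    print(accuracy(ref_words, hyp_words))

    ref_concepts = ["origin=paris", "dest=boston"]
    hyp_concepts = ["origin=paris", "dest=austin"]
    print(accuracy(ref_concepts, hyp_concepts))
```

The same alignment routine serves both measures; only the unit changes (transcribed words versus labeled concepts), which is why manual transcription and labeling have to come first.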
This paper proposes a novel combined compound splitting and phrase recombination method that optimizes the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high-frequency combinations. Language model perplexity and size are the criteria used to identify a balance between compound decomposition, which reduces the out-of-vocabulary (OOV) rate, and lexical unit recombination, which packs additional context into a fixed-size vocabulary. The method provides a basis for lexicon design for an LVCSR system in the domain of German parliamentary speeches, which is to be used as the foundation of a spoken document information retrieval system. The approach achieves a 35% reduction in OOV without a prohibitively large sacrifice in recognition performance.
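The effect of compound decomposition on the OOV rate can be sketched on a toy example. This is only an illustration under strong assumptions: the splitter below is a simple lexicon lookup that tries two-part splits, not the data-driven splitting described in the paper, and the recombination step is not shown. The names `oov_rate`, `split_compound` and `min_len`, and the toy lexicon, are hypothetical.

```python
def oov_rate(tokens, lexicon):
    """Fraction of running tokens that are not in the lexicon."""
    if not tokens:
        raise ValueError("no tokens given")
    return sum(1 for t in tokens if t not in lexicon) / len(tokens)


def split_compound(word, lexicon, min_len=3):
    """Return [head, tail] if the word divides into two in-lexicon parts.

    Words already in the lexicon, and words with no such split, are
    returned unchanged as a one-element list.
    """
    if word in lexicon:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [word]


if __name__ == "__main__":
    lexicon = {"der", "bundes", "tag", "haus", "halt"}
    tokens = ["der", "bundestag", "haushalt", "debatte"]

    before = oov_rate(tokens, lexicon)
    decomposed = [part for t in tokens for part in split_compound(t, lexicon)]
    after = oov_rate(decomposed, lexicon)
    print(before, after)
```

In this toy setting two of the three unknown words become sequences of known units, so the OOV rate drops, while the units themselves become shorter and carry less context, which is the trade-off the recombination step is meant to address.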
Accurate training data plays a very important role in training effective acoustic models for speech recognition. In conversational speech, the transcribed data in several cases has a significant word error rate, which leads to poor acoustic models. In this paper we explore a method to automatically identify such mislabelled data in the context of a hybrid Support Vector Machine/hidden Markov model (HMM) system, thereby building accurate acoustic models. The effectiveness of this method is demonstrated on both synthetic and real speech data. A hybrid system for OGI alphadigits using this methodology gives a significant improvement in performance over a comparable baseline HMM system.
In this paper, we describe problems in recognizing large-vocabulary Korean continuous speech and propose solutions to them. Korean sentences consist of eojeols, which are separated by spaces in text and consist of morphemes. When we use morpheme units, there are many word insertion and deletion errors because morpheme units are too short. We introduce a between-word phone variation lexicon that can represent many alternative phone sequences of words in one structure. The decoding algorithm consists of a single pass, which is a modification of the token-passing algorithm. In this algorithm, we allow multiple tokens in a state at a time in order to obtain the global best path without expanding the states when we use trigram language models. We confirmed that the between-word phone variation lexicon is useful for morpheme-based recognition by observing that the improvement is higher for morpheme units than for eojeol units. Allowing multiple tokens in a state also improved the performance.
This paper examines several hypotheses based on a 'strategic' view of word repetitions in English. We test whether these hypotheses also apply to Japanese with its fundamentally different syntax. Analyses of 10 task-oriented Japanese dialogues reveal two effects. First, pauses are more frequent before and just after a word at a suspension of the speech than after a repetition of that word. Second, the first token of the repeated word is abnormally prolonged. These results support the 'strategic' view of repetitions. Speakers often suspend speaking after making a preliminary commitment to a constituent, but they prefer to produce that constituent with a continuous delivery. These findings suggest the generality of these strategies across languages.
The means of the long temporal trajectories of logarithmic critical band energies in the vicinity of an individual phoneme show distinct patterns (TRAPs, Fig. 1) in each critical band for different phonemes. These temporal patterns have been used successfully in automatic speech recognition [1]. Since they contain not only the spectral evolution but also the average co-articulation of the phonemes, we examine to what extent they capture information about sound units by synthesizing speech from them.
A dialogue management method for a speech-based interactive system is described. The construction of an effective spoken dialogue system that accepts unrestricted input requires user-initiative dialogue management techniques. In order to realize user-initiative dialogue, however, the system must basically accept all sentences included in its language model for recognition. That creates the problem of lower recognition accuracy, because the constraints with respect to speech understanding are weaker. We have thus proposed a method for scoring how appropriate each of the N-best speech understanding results is for the dialogue context. In this method, the user's behavioral goal is inferred from a history of utterances, and a dialogue context score is calculated for each of the N-best candidates based on transition probabilities of the behavioral goals. According to the results of preliminary evaluation experiments using a dialogue system focusing on a task of making hotel reservations, we succeeded in reducing the error rate (misunderstandings in acoustic scores only) to 66% for 15 sentences.
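The rescoring idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it assumes each N-best candidate already carries a log-domain recognition score and a behavioral goal label, that the context score is the log transition probability from the previous goal, and that the two are combined by a weighted sum. The function name, the `weight` and `floor` parameters, the goal labels and all numbers are hypothetical.

```python
import math


def rescore_nbest(candidates, prev_goal, transitions, weight=1.0, floor=1e-6):
    """Pick the candidate with the best combined score.

    candidates  : list of dicts with 'text', 'goal' and log-domain 'score'
    prev_goal   : goal inferred from the dialogue history so far
    transitions : dict mapping (prev_goal, goal) to a probability
    weight      : scaling of the dialogue context score (assumed)
    floor       : probability used for unseen goal transitions (assumed)
    """
    best, best_total = None, -math.inf
    for cand in candidates:
        p = transitions.get((prev_goal, cand["goal"]), floor)
        total = cand["score"] + weight * math.log(p)
        if total > best_total:
            best, best_total = cand, total
    return best


if __name__ == "__main__":
    nbest = [
        {"text": "i want to cancel", "goal": "cancel", "score": -10.0},
        {"text": "i want a twin", "goal": "choose_room", "score": -10.5},
    ]
    transitions = {
        ("give_dates", "choose_room"): 0.6,
        ("give_dates", "cancel"): 0.05,
    }
    print(rescore_nbest(nbest, "give_dates", transitions)["text"])
```

With `weight=0` the sketch falls back to the recognizer's own ranking, which makes it easy to compare decisions with and without the dialogue context score.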
A voice conversion algorithm based on the Gaussian mixture model (GMM) has been proposed by Stylianou et al. In this algorithm, the acoustic space of a speaker is represented continuously. In this paper, we apply this GMM-based voice conversion algorithm to STRAIGHT, proposed by Kawahara et al., which is recognized as a high-quality vocoder. In order to evaluate this voice conversion algorithm, we perform subjective and objective experiments on speech quality and speaker individuality, comparing it with the method based on codebook mapping. The results show that the GMM-based voice conversion algorithm performs better than the codebook mapping method. The effects of the amount of training data and of the number of Gaussian mixtures on the voice conversion algorithms are also investigated. These evaluation results confirm that the GMM-based voice conversion algorithm can be successfully applied to STRAIGHT.
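For orientation, the conversion function usually associated with GMM-based voice conversion in the style of Stylianou et al. can be written as below. The abstract itself gives no formula, so the notation is an assumption: x is a source feature vector, m the number of mixture components, and alpha_i, mu_i, Sigma_i the weight, mean and covariance of component i of the GMM fitted to the source speaker, while nu_i and Gamma_i are the conversion parameters estimated from aligned source and target data.

```latex
F(x) = \sum_{i=1}^{m} P(C_i \mid x)\,
       \bigl[\nu_i + \Gamma_i \Sigma_i^{-1} (x - \mu_i)\bigr],
\qquad
P(C_i \mid x) = \frac{\alpha_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i)}
                     {\sum_{j=1}^{m} \alpha_j\, \mathcal{N}(x;\, \mu_j, \Sigma_j)}
```

Because the posterior weights P(C_i | x) vary smoothly with x, the mapping is continuous over the acoustic space, in contrast to the hard class assignment of a codebook mapping.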