In this paper, we study methods to discover words and extract their pronunciations from audio data for non-written and under-resourced languages. We examine the potential and the challenges of pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. In our scenario, a human translator produces utterances in the (non-written) target language from prompts in a resource-rich source language. We add the resource-rich source language prompts to help the word discovery and pronunciation extraction process. By aligning the source language words to the target language phonemes, we segment the phoneme sequences into word-like chunks. The resulting chunks are interpreted as putative word pronunciations but are very prone to alignment and phoneme recognition errors. We therefore propose our alignment model, Model 3P, which is specifically designed for cross-lingual word-to-phoneme alignment. We present and compare two different methods (source-word-dependent and source-word-independent clustering) that extract word pronunciations from word-to-phoneme alignments. We show that both methods compensate for phoneme recognition and alignment errors. We also extract a parallel corpus consisting of 15 different translations in 10 languages from the Christian Bible to evaluate our alignment model and error recovery methods. For example, based on noisy target language phoneme sequences with 45.1% errors, we build a dictionary for an English Bible using a Spanish Bible translation with a 4.5% OOV rate, where 64% of the extracted pronunciations contain no more than one wrong phoneme. Finally, we use the extracted pronunciations in an automatic speech recognition system for the target language and report promising word error rates, given that the pronunciation dictionary and language model are learned completely unsupervised and no written form of the target language is required for our approach. (C) 2014 Elsevier Ltd. All rights reserved.
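As a rough illustration of the source-word-dependent clustering described above, the sketch below picks, for every source word, the medoid (by edit distance) of the phoneme chunks aligned to it; the function names, the medoid criterion, and the toy data are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                          # deletion
                           dp[i][j - 1] + 1,                          # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])) # substitution
    return dp[len(a)][len(b)]


def extract_pronunciations(aligned_chunks):
    """Source-word-dependent extraction: for each source word, keep the
    phoneme chunk closest to all other chunks aligned to that word,
    i.e. the medoid of the cluster, to absorb recognition errors."""
    by_word = defaultdict(list)
    for source_word, phonemes in aligned_chunks:
        by_word[source_word].append(tuple(phonemes))

    dictionary = {}
    for word, chunks in by_word.items():
        dictionary[word] = min(
            chunks, key=lambda c: sum(edit_distance(c, o) for o in chunks))
    return dictionary


# Toy example: noisy phoneme chunks aligned to the Spanish word "casa".
chunks = [("casa", ["k", "a", "s", "a"]),
          ("casa", ["k", "a", "s"]),        # phoneme deletion error
          ("casa", ["g", "a", "s", "a"]),   # substitution error
          ("casa", ["k", "a", "s", "a"])]
print(extract_pronunciations(chunks))       # {'casa': ('k', 'a', 's', 'a')}
```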
ISBN (Print): 9789811081804; 9789811081798
The basic functionality of optical character recognition (OCR) is to recognize text in document images and extract it as digitally editable text. Beyond this, OCR has other potential uses in document image processing, such as automatic document sorting and writer identification/verification. At present, various commercially available OCR systems exist, mostly for the Roman script. Developing an unconstrained offline handwritten character recognition system is one of the most challenging tasks for the research community. Things get more complicated for Indic scripts like Bangla, which contains more than 280 modified and compound characters in addition to the isolated characters. For recognition of handwritten documents, the most convenient way is to segment the text into characters or character parts, so line-, word-, and character-level segmentation plays a vital role in the development of such a system. In this paper, a scheme for tri-level segmentation (line, word, and character) is presented. Encouraging segmentation results are achieved on a set of 50 handwritten text documents.
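The abstract does not detail the segmentation algorithm; a classical projection-profile baseline for the line and word levels of such a tri-level scheme might look like the following sketch (function names, the synthetic page, and the blank-gap criterion are illustrative assumptions, not necessarily the paper's method).

```python
import numpy as np


def ink_runs(binary_image, axis):
    """Projection-profile segmentation: sum ink along `axis` and return
    the (start, end) index ranges where the profile is non-zero."""
    profile = binary_image.sum(axis=axis)
    runs, start = [], None
    for i, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = i                      # a band of text begins
        elif ink == 0 and start is not None:
            runs.append((start, i))        # the band ends at a blank row/column
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# Tri-level use on a synthetic binarized page: lines from the horizontal
# profile, then word/character candidates from each line's vertical profile.
page = np.zeros((60, 80), dtype=np.uint8)
page[5:15, 10:30] = 1                      # first "word" on line one
page[5:15, 40:70] = 1                      # second "word" on line one
page[30:40, 10:40] = 1                     # a "word" on line two
for top, bottom in ink_runs(page, axis=1):         # text lines
    words = ink_runs(page[top:bottom], axis=0)     # gaps between columns
    print((top, bottom), words)
```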
ISBN (Print): 9781509017928
Word segmentation refers to the process of defining the word regions of a text line. It is a critical stage towards word and character recognition as well as word spotting, and it mainly comprises three basic stages, namely preprocessing, distance computation, and gap classification. In this paper, we propose a novel word segmentation method which uses the Student's t-distribution for the gap classification stage. The main advantage of the Student's t-distribution is its robustness to the presence of outliers. To test the efficiency of the proposed method, we used the four benchmarking datasets of the ICDAR/ICFHR Handwriting Segmentation Contests as well as a historical typewritten dataset of Greek polytonic text. We observe that using mixtures of Student's t-distributions for word segmentation outperforms other gap classification methods in terms of Recognition Accuracy and F-Measure. Also, across all examined benchmarks, the Student's t-mixture produces a perfect segmentation result in significantly more cases than the state-of-the-art Gaussian mixture model.
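A minimal sketch of the gap-classification idea, assuming the horizontal gaps of a text line are modeled with a two-component Student's t mixture fitted by EM with the degrees of freedom held fixed for brevity (the paper estimates the full mixture; the function name, initialization, and toy gaps are illustrative):

```python
import numpy as np
from scipy.stats import t as student_t


def classify_gaps(gaps, nu=3.0, n_iter=50):
    """Fit a 2-component Student's t mixture to the gaps of a text line
    with a simple EM loop, then label each gap as 'within-word' or
    'between-word' by the posterior of the wider component."""
    gaps = np.asarray(gaps, dtype=float)
    # Crude initialisation: a small-gap and a large-gap component.
    mu = np.array([np.percentile(gaps, 25), np.percentile(gaps, 75)])
    sigma = np.array([gaps.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each gap.
        dens = np.stack([pi[k] * student_t.pdf(gaps, df=nu, loc=mu[k], scale=sigma[k])
                         for k in range(2)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # Robustness weights of the t distribution down-weight outlying gaps.
        w = (nu + 1) / (nu + ((gaps - mu[:, None]) / sigma[:, None]) ** 2)
        # M-step: weighted updates of means, scales and mixing weights.
        mu = (resp * w * gaps).sum(axis=1) / (resp * w).sum(axis=1)
        sigma = np.sqrt((resp * w * (gaps - mu[:, None]) ** 2).sum(axis=1)
                        / resp.sum(axis=1)) + 1e-6
        pi = resp.mean(axis=1)

    wide = int(np.argmax(mu))
    return ["between-word" if r else "within-word" for r in (resp[wide] > 0.5)]


print(classify_gaps([2, 3, 2, 15, 3, 2, 18, 2]))
```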
Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentation are worth exploiting toward this overall task. We propose a sequence labeling based system for speculation detection, which relies on features from bag of characters, bag of words, character embeddings, and word embeddings. We experiment on a novel dataset of 36,828 clinical notes with 5103 gold-standard speculation annotations on 2000 notes, and compare systems in which word embeddings are computed from word segmentations given by a general and by a domain-specific segmenter, respectively. Our systems reach performance as high as 92.2% measured by F-score. We demonstrate that word segmentation is critical to produce high-quality word embeddings that facilitate downstream information extraction applications, and suggest that a domain-dependent word segmenter can be vital to such a clinical NLP task in the Chinese language. (C) 2016 Elsevier Inc. All rights reserved.
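To illustrate how segmenter output can feed a sequence labeler of this kind, the hypothetical helper below attaches both a character-level and a word-level embedding to each character position, so that a better (domain-specific) segmentation directly improves the word-embedding lookups; the feature names and toy vectors are assumptions, not the paper's exact feature set.

```python
def char_features(chars, seg_words, char_vecs, word_vecs):
    """Character-level features for a sequence labeler (e.g. a CRF or
    BiLSTM tagger): each character carries its own embedding plus the
    embedding of the word the upstream segmenter placed it in."""
    word_of = []                              # word covering each character position
    for w in seg_words:
        word_of.extend([w] * len(w))

    return [{"char": ch,
             "char_vec": char_vecs.get(ch),
             "word": w,
             "word_vec": word_vecs.get(w)}
            for ch, w in zip(chars, word_of)]


# Toy example: "胸部不适" segmented as ["胸部", "不适"] ("chest" / "discomfort").
feats = char_features("胸部不适", ["胸部", "不适"],
                      char_vecs={"胸": [0.1], "部": [0.2], "不": [0.3], "适": [0.4]},
                      word_vecs={"胸部": [0.5], "不适": [0.9]})
print(feats[0]["word"], feats[2]["word_vec"])   # 胸部 [0.9]
```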
This paper presents a comparative study of two approaches to statistical machine translation (SMT). We present a study on factored machine translation for the Arabic–English language pair. We describe the pre-processing step for the Arabic source language and the new factors added to the English target side. The experiments that enriched the English side with part-of-speech (POS) tags, Combinatory Categorial Grammar (CCG) supertags, and segmented Arabic sentences showed considerable progress in terms of the BLEU score. The experiments first evaluated the baseline phrase-based models under two approaches, segmented (#S1) and non-segmented (#S2). In both approaches, the CCG models achieved the highest BLEU score. The results show that the segmented approach with CCG produces the highest score at 30.91% (1.84% above the baseline). As shown, the results present a considerable improvement when translating segmented rather than non-segmented Arabic into English.
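Assuming a Moses-style factored representation (which the abstract does not spell out), injecting POS tags and CCG supertags into the English side could look like this short sketch; the tags in the example are illustrative.

```python
def add_factors(tokens, pos_tags, ccg_supertags):
    """Attach POS and CCG supertag factors to each English token in the
    pipe-separated format used by factored phrase-based systems such as
    Moses (word|POS|CCG)."""
    return " ".join(f"{w}|{p}|{c}"
                    for w, p, c in zip(tokens, pos_tags, ccg_supertags))


print(add_factors(["the", "book"], ["DT", "NN"], ["NP/N", "N"]))
# the|DT|NP/N book|NN|N
```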
ISBN (Print): 9783319690049
The word segmentation problem is widely treated as a sequence labeling problem. The traditional way to approach this kind of problem is machine learning methods such as conditional random fields with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, and a popular choice among them is LSTM networks. This paper presents a method to introduce numerical statistics-based features, computed on unlabeled data, into LSTM networks and analyzes how they enhance the performance of word segmentation models. We add pre-trained character-bigram embeddings, pointwise mutual information, accessor variety, and punctuation variety to our model and compare their performance on different datasets, including three datasets from the CoNLL-2017 shared task and three datasets of simplified Chinese. We achieve state-of-the-art performance on two of them and obtain comparable results on the rest.
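A sketch of how two of the statistics-based features, pointwise mutual information and accessor variety of character bigrams, could be counted on unlabeled text; the function name and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def bigram_statistics(corpus_lines):
    """Unsupervised character-bigram statistics: pointwise mutual
    information (how strongly two adjacent characters attract each
    other) and accessor variety (how many distinct characters can
    precede/follow the bigram)."""
    uni, bi = Counter(), Counter()
    left_acc, right_acc = defaultdict(set), defaultdict(set)

    for line in corpus_lines:
        for i, ch in enumerate(line):
            uni[ch] += 1
            if i + 1 < len(line):
                pair = line[i:i + 2]
                bi[pair] += 1
                if i > 0:
                    left_acc[pair].add(line[i - 1])
                if i + 2 < len(line):
                    right_acc[pair].add(line[i + 2])

    total_uni, total_bi = sum(uni.values()), sum(bi.values())
    pmi = {p: math.log((c / total_bi)
                       / ((uni[p[0]] / total_uni) * (uni[p[1]] / total_uni)))
           for p, c in bi.items()}
    av = {p: min(len(left_acc[p]), len(right_acc[p])) for p in bi}
    return pmi, av


pmi, av = bigram_statistics(["北京大学", "北京天气", "大学生活"])
print(pmi.get("北京"), av.get("北京"))
```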
In an eye-tracking experiment we examined whether Chinese readers are sensitive to information concerning how often a Chinese character appears as a single-character word versus as the first character of a two-character word, and whether readers use this information to segment words and adjust the amount of parafoveal processing of subsequent characters during reading. Participants read sentences containing a two-character target word whose first character was more or less likely to be a single-character word. The boundary paradigm was used: the boundary appeared between the first and second characters of the target word, and we manipulated whether readers saw an identity or a pseudocharacter preview of the second character of the target. Linear mixed-effects models revealed reduced preview benefit from the second character when the first character was more likely to be a single-character word. This suggests that Chinese readers use probabilistic combinatorial information about the likelihood of a Chinese character being a single-character word or part of a two-character word online to modulate the extent of parafoveal processing.
Word segmentation is the first step in processing languages written in non-Latin letters, such as the Javanese script. In this study, we report our work on word segmentation based on a dictionary approach. In the first phase, we generate all possible segmented word series using a word dictionary. The correct word is then selected based on the last character of a word, the last two characters of a word, the difference between two consecutive words, and the frequency of the word in an additional corpus. The experimental results show that identifying words using word frequencies from the additional corpus yields the best accuracy, 91.08%. (C) 2016 The Authors. Published by Elsevier B.V.
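A simplified sketch of the two-phase dictionary approach: enumerate every dictionary-consistent segmentation, then prefer the candidate whose words are most frequent in the extra corpus. The last-character cues are omitted here, and the Javanese-like words and frequencies below are made up for illustration.

```python
def all_segmentations(text, dictionary):
    """First phase: enumerate every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


def best_segmentation(text, dictionary, corpus_freq):
    """Second phase (simplified): among the candidates, keep the one
    whose words are most frequent in the additional corpus."""
    candidates = all_segmentations(text, dictionary)
    return max(candidates,
               key=lambda words: sum(corpus_freq.get(w, 0) for w in words),
               default=None)


dictionary = {"wis", "mangan", "wisman", "gan"}
freq = {"wis": 120, "mangan": 80, "wisman": 3, "gan": 1}
print(best_segmentation("wismangan", dictionary, freq))   # ['wis', 'mangan']
```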
Recent studies have shown that English-, French-, and German-learning infants begin to use determiners to segment adjacent nouns before their first birthday. The present research extended the investigation to a typologically different language, Japanese, focusing on infants' use of a high-frequency particle, the subject marker ga. In Japanese, a particle follows, rather than precedes, the noun and is usually followed by a predicate verb; thus, particles rarely occur at utterance edges. Furthermore, a particle is not a free morpheme (like a determiner) but a bound morpheme. Although particles are frequently omitted in colloquial speech, the frequency of their occurrence is comparable to that of high-frequency determiners in a previously studied language, French. The results demonstrated that Japanese-learning infants used particles for word segmentation not at 10 and 12 months but at 15 months, which is later than the age at which infants begin to use determiners in previously studied languages. The reason for this delay is discussed in light of the properties of Japanese particles. (C) 2015 Elsevier Inc. All rights reserved.
ISBN (Print): 9781510833135
In the present study, we examined whether phonotactic constraints of the first language affect speech processing by Japanese learners of English and whether L2 proficiency influences this effect. Seventeen native English speakers (ES), 18 Japanese speakers with high English proficiency (JH), and 20 Japanese speakers with relatively low English proficiency (JL) took part in a word-monitoring task. Two types of target words (CVC/CV, e.g., team/tea) were embedded in bisyllabic non-words (e.g., teamfesh) and presented to the participants together with other non-words in the lists. The three groups were instructed to respond as soon as they spotted a target, and response times and error rates were analyzed. The results showed that all groups segmented the CVC target words significantly faster and more accurately than the CV targets. L1 phonotactic constraints did not hinder L2 speech processing, and the word segmentation strategy was not language-specific in the case of Japanese learners of English.