In this paper, we study methods to discover words and extract their pronunciations from audio data for non-written and under-resourced languages. We examine the potential and the challenges of pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. In our scenario, a human translator produces utterances in the (non-written) target language from prompts in a resource-rich source language. We add the resource-rich source language prompts to help the word discovery and pronunciation extraction process. By aligning the source language words to the target language phonemes, we segment the phoneme sequences into word-like chunks. The resulting chunks are interpreted as putative word pronunciations but are very prone to alignment and phoneme recognition errors. We therefore propose our alignment model, Model 3P, which is specifically designed for cross-lingual word-to-phoneme alignment. We present and compare two different methods (source-word-dependent and source-word-independent clustering) that extract word pronunciations from word-to-phoneme alignments. We show that both methods compensate for phoneme recognition and alignment errors. We also extract a parallel corpus consisting of 15 different translations in 10 languages from the Christian Bible to evaluate our alignment model and error recovery methods. For example, based on noisy target language phoneme sequences with 45.1% errors, we build a dictionary for an English Bible using a Spanish Bible translation with a 4.5% OOV rate, where 64% of the extracted pronunciations contain no more than one wrong phoneme. Finally, we use the extracted pronunciations in an automatic speech recognition system for the target language and report promising word error rates, given that the pronunciation dictionary and language model are learned completely unsupervised and no written form of the target language is required for our approach. (C) 2014 Elsevier Ltd. All rights reserved.
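As a rough illustration of the source-word-dependent clustering described above, the sketch below picks, for every source word, the medoid (by edit distance) of the phoneme chunks aligned to it; the function names, the medoid criterion, and the toy data are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                          # deletion
                           dp[i][j - 1] + 1,                          # insertion
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])) # substitution
    return dp[len(a)][len(b)]


def extract_pronunciations(aligned_chunks):
    """Source-word-dependent extraction: for each source word, keep the
    phoneme chunk closest to all other chunks aligned to that word,
    i.e. the medoid of the cluster, to absorb recognition errors."""
    by_word = defaultdict(list)
    for source_word, phonemes in aligned_chunks:
        by_word[source_word].append(tuple(phonemes))

    dictionary = {}
    for word, chunks in by_word.items():
        dictionary[word] = min(
            chunks, key=lambda c: sum(edit_distance(c, o) for o in chunks))
    return dictionary


# Toy example: noisy phoneme chunks aligned to the Spanish word "casa".
chunks = [("casa", ["k", "a", "s", "a"]),
          ("casa", ["k", "a", "s"]),        # phoneme deletion error
          ("casa", ["g", "a", "s", "a"]),   # substitution error
          ("casa", ["k", "a", "s", "a"])]
print(extract_pronunciations(chunks))       # {'casa': ('k', 'a', 's', 'a')}
```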
ISBN (Print): 9789811081804; 9789811081798
The basic functionality of optical character recognition (OCR) is to recognize text in document images and extract it as digitally editable text. Beyond this, OCR has other potential uses in document image processing, such as automatic document sorting and writer identification/verification. At present, various commercially available OCR systems exist, mostly for the Roman script. Developing an unconstrained offline handwritten character recognition system is one of the most challenging tasks for the research community. Things get more complicated for Indic scripts like Bangla, which contains more than 280 modified and compound characters in addition to the isolated characters. For recognition of handwritten documents, the most convenient way is to segment the text into characters or character parts, so line-, word-, and character-level segmentation plays a vital role in the development of such a system. In this paper, a scheme for tri-level segmentation (line, word, and character) is presented. Encouraging segmentation results are achieved on a set of 50 handwritten text documents.
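The abstract does not detail the segmentation algorithm; a classical projection-profile baseline for the line and word levels of such a tri-level scheme might look like the following sketch (function names, the synthetic page, and the blank-gap criterion are illustrative assumptions, not necessarily the paper's method).

```python
import numpy as np


def ink_runs(binary_image, axis):
    """Projection-profile segmentation: sum ink along `axis` and return
    the (start, end) index ranges where the profile is non-zero."""
    profile = binary_image.sum(axis=axis)
    runs, start = [], None
    for i, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = i                      # a band of text begins
        elif ink == 0 and start is not None:
            runs.append((start, i))        # the band ends at a blank row/column
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# Tri-level use on a synthetic binarized page: lines from the horizontal
# profile, then word/character candidates from each line's vertical profile.
page = np.zeros((60, 80), dtype=np.uint8)
page[5:15, 10:30] = 1                      # first "word" on line one
page[5:15, 40:70] = 1                      # second "word" on line one
page[30:40, 10:40] = 1                     # a "word" on line two
for top, bottom in ink_runs(page, axis=1):         # text lines
    words = ink_runs(page[top:bottom], axis=0)     # gaps between columns
    print((top, bottom), words)
```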
ISBN (Print): 9781509017928
Word segmentation refers to the process of defining the word regions of a text line. It is a critical stage towards word and character recognition as well as word spotting, and it mainly comprises three basic stages, namely preprocessing, distance computation, and gap classification. In this paper, we propose a novel word segmentation method which uses the Student's t-distribution for the gap classification stage. The main advantage of the Student's t-distribution is its robustness to the presence of outliers. To test the efficiency of the proposed method, we used the four benchmarking datasets of the ICDAR/ICFHR Handwriting Segmentation Contests as well as a historical typewritten dataset of Greek polytonic text. We observe that using mixtures of Student's t-distributions for word segmentation outperforms other gap classification methods in terms of Recognition Accuracy and F-Measure. Also, across all examined benchmarks, the Student's t-mixture produces a perfect segmentation result in significantly more cases than the state-of-the-art Gaussian mixture model.
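A minimal sketch of the gap-classification idea, assuming the horizontal gaps of a text line are modeled with a two-component Student's t mixture fitted by EM with the degrees of freedom held fixed for brevity (the paper estimates the full mixture; the function name, initialization, and toy gaps are illustrative):

```python
import numpy as np
from scipy.stats import t as student_t


def classify_gaps(gaps, nu=3.0, n_iter=50):
    """Fit a 2-component Student's t mixture to the gaps of a text line
    with a simple EM loop, then label each gap as 'within-word' or
    'between-word' by the posterior of the wider component."""
    gaps = np.asarray(gaps, dtype=float)
    # Crude initialisation: a small-gap and a large-gap component.
    mu = np.array([np.percentile(gaps, 25), np.percentile(gaps, 75)])
    sigma = np.array([gaps.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each gap.
        dens = np.stack([pi[k] * student_t.pdf(gaps, df=nu, loc=mu[k], scale=sigma[k])
                         for k in range(2)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # Robustness weights of the t distribution down-weight outlying gaps.
        w = (nu + 1) / (nu + ((gaps - mu[:, None]) / sigma[:, None]) ** 2)
        # M-step: weighted updates of means, scales and mixing weights.
        mu = (resp * w * gaps).sum(axis=1) / (resp * w).sum(axis=1)
        sigma = np.sqrt((resp * w * (gaps - mu[:, None]) ** 2).sum(axis=1)
                        / resp.sum(axis=1)) + 1e-6
        pi = resp.mean(axis=1)

    wide = int(np.argmax(mu))
    return ["between-word" if r else "within-word" for r in (resp[wide] > 0.5)]


print(classify_gaps([2, 3, 2, 15, 3, 2, 18, 2]))
```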
Speculations represent uncertainty toward certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentation are worth exploiting toward this overall task. We propose a sequence labeling based system for speculation detection, which relies on features from bag of characters, bag of words, character embeddings, and word embeddings. We experiment on a novel dataset of 36,828 clinical notes with 5103 gold-standard speculation annotations on 2000 notes, and compare systems in which word embeddings are computed from word segmentations given by a general and by a domain-specific segmenter, respectively. Our systems reach performance as high as 92.2% measured by F-score. We demonstrate that word segmentation is critical to produce high-quality word embeddings that facilitate downstream information extraction applications, and suggest that a domain-dependent word segmenter can be vital to such a clinical NLP task in the Chinese language. (C) 2016 Elsevier Inc. All rights reserved.
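To illustrate how segmenter output can feed a sequence labeler of this kind, the hypothetical helper below attaches both a character-level and a word-level embedding to each character position, so that a better (domain-specific) segmentation directly improves the word-embedding lookups; the feature names and toy vectors are assumptions, not the paper's exact feature set.

```python
def char_features(chars, seg_words, char_vecs, word_vecs):
    """Character-level features for a sequence labeler (e.g. a CRF or
    BiLSTM tagger): each character carries its own embedding plus the
    embedding of the word the upstream segmenter placed it in."""
    word_of = []                              # word covering each character position
    for w in seg_words:
        word_of.extend([w] * len(w))

    return [{"char": ch,
             "char_vec": char_vecs.get(ch),
             "word": w,
             "word_vec": word_vecs.get(w)}
            for ch, w in zip(chars, word_of)]


# Toy example: "胸部不适" segmented as ["胸部", "不适"] ("chest" / "discomfort").
feats = char_features("胸部不适", ["胸部", "不适"],
                      char_vecs={"胸": [0.1], "部": [0.2], "不": [0.3], "适": [0.4]},
                      word_vecs={"胸部": [0.5], "不适": [0.9]})
print(feats[0]["word"], feats[2]["word_vec"])   # 胸部 [0.9]
```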
This paper presents a comparative study of two approaches to statistical machine translation (SMT). We present a study on factored machine translation for the Arabic–English language pair. We describe the pre-processing step for the Arabic source language and the new factors added to the English target side. The experiments that enriched the English side with part-of-speech (POS) tags, Combinatory Categorial Grammar (CCG) supertags, and segmented Arabic sentences showed considerable progress in terms of the BLEU score. The experiments first evaluated the baseline phrase-based models under two approaches, segmented (#S1) and non-segmented (#S2). In both approaches, the CCG models achieved the highest BLEU score. The results show that the segmented approach with CCG produces the highest score at 30.91% (1.84% above the baseline). As shown, the results present a considerable improvement when translating segmented rather than non-segmented Arabic into English.
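Assuming a Moses-style factored representation (which the abstract does not spell out), injecting POS tags and CCG supertags into the English side could look like this short sketch; the tags in the example are illustrative.

```python
def add_factors(tokens, pos_tags, ccg_supertags):
    """Attach POS and CCG supertag factors to each English token in the
    pipe-separated format used by factored phrase-based systems such as
    Moses (word|POS|CCG)."""
    return " ".join(f"{w}|{p}|{c}"
                    for w, p, c in zip(tokens, pos_tags, ccg_supertags))


print(add_factors(["the", "book"], ["DT", "NN"], ["NP/N", "N"]))
# the|DT|NP/N book|NN|N
```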
ISBN (Print): 9783319690049
The word segmentation problem is widely treated as a sequence labeling problem. The traditional way to approach this kind of problem is machine learning methods such as conditional random fields with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, and a popular choice among them is LSTM networks. This paper presents a method to introduce numerical statistics-based features, computed on unlabeled data, into LSTM networks and analyzes how they enhance the performance of word segmentation models. We add pre-trained character-bigram embeddings, pointwise mutual information, accessor variety, and punctuation variety to our model and compare their performance on different datasets, including three datasets from the CoNLL-2017 shared task and three datasets of simplified Chinese. We achieve state-of-the-art performance on two of them and obtain comparable results on the rest.
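A sketch of how two of the statistics-based features, pointwise mutual information and accessor variety of character bigrams, could be counted on unlabeled text; the function name and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def bigram_statistics(corpus_lines):
    """Unsupervised character-bigram statistics: pointwise mutual
    information (how strongly two adjacent characters attract each
    other) and accessor variety (how many distinct characters can
    precede/follow the bigram)."""
    uni, bi = Counter(), Counter()
    left_acc, right_acc = defaultdict(set), defaultdict(set)

    for line in corpus_lines:
        for i, ch in enumerate(line):
            uni[ch] += 1
            if i + 1 < len(line):
                pair = line[i:i + 2]
                bi[pair] += 1
                if i > 0:
                    left_acc[pair].add(line[i - 1])
                if i + 2 < len(line):
                    right_acc[pair].add(line[i + 2])

    total_uni, total_bi = sum(uni.values()), sum(bi.values())
    pmi = {p: math.log((c / total_bi)
                       / ((uni[p[0]] / total_uni) * (uni[p[1]] / total_uni)))
           for p, c in bi.items()}
    av = {p: min(len(left_acc[p]), len(right_acc[p])) for p in bi}
    return pmi, av


pmi, av = bigram_statistics(["北京大学", "北京天气", "大学生活"])
print(pmi.get("北京"), av.get("北京"))
```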
In an eye-tracking experiment we examined whether Chinese readers are sensitive to information concerning how often a Chinese character appears as a single-character word versus as the first character of a two-character word, and whether readers use this information to segment words and adjust the amount of parafoveal processing of subsequent characters during reading. Participants read sentences containing a two-character target word whose first character was more or less likely to be a single-character word. The boundary paradigm was used: the boundary appeared between the first and second characters of the target word, and we manipulated whether readers saw an identity or a pseudocharacter preview of the second character of the target. Linear mixed-effects models revealed reduced preview benefit from the second character when the first character was more likely to be a single-character word. This suggests that Chinese readers use probabilistic combinatorial information about the likelihood of a Chinese character being a single-character word or part of a two-character word online to modulate the extent of parafoveal processing.
Word segmentation is the first step in processing languages written in non-Latin letters, such as the Javanese script. In this study, we report our work on word segmentation based on a dictionary approach. In the first phase, we generate all possible segmented word series using a word dictionary. The correct word is then selected based on the last character of a word, the last two characters of a word, the difference between two consecutive words, and the frequency of the word in an additional corpus. The experimental results show that identifying words using word frequencies from the additional corpus yields the best accuracy, 91.08%. (C) 2016 The Authors. Published by Elsevier B.V.
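A simplified sketch of the two-phase dictionary approach: enumerate every dictionary-consistent segmentation, then prefer the candidate whose words are most frequent in the extra corpus. The last-character cues are omitted here, and the Javanese-like words and frequencies below are made up for illustration.

```python
def all_segmentations(text, dictionary):
    """First phase: enumerate every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


def best_segmentation(text, dictionary, corpus_freq):
    """Second phase (simplified): among the candidates, keep the one
    whose words are most frequent in the additional corpus."""
    candidates = all_segmentations(text, dictionary)
    return max(candidates,
               key=lambda words: sum(corpus_freq.get(w, 0) for w in words),
               default=None)


dictionary = {"wis", "mangan", "wisman", "gan"}
freq = {"wis": 120, "mangan": 80, "wisman": 3, "gan": 1}
print(best_segmentation("wismangan", dictionary, freq))   # ['wis', 'mangan']
```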
Recent studies have shown that English-, French-, and German-learning infants begin to use determiners to segment adjacent nouns before their first birthday. The present research extended the investigation to a typologically different language, Japanese, focusing on infants' use of a high-frequency particle, the subject marker ga. In Japanese, a particle follows, rather than precedes, the noun and is usually followed by a predicate verb; thus, particles rarely occur at utterance edges. Furthermore, a particle is not a free morpheme (like a determiner) but a bound morpheme. Although particles are frequently omitted in colloquial speech, the frequency of their occurrence is comparable to that of high-frequency determiners in a previously studied language, French. The results demonstrated that Japanese-learning infants used particles for word segmentation not at 10 and 12 months but at 15 months, which is later than the age at which infants begin to use determiners in previously studied languages. The reason for this delay is discussed in light of the properties of Japanese particles. (C) 2015 Elsevier Inc. All rights reserved.
ISBN (Print): 9781510833135
In the present study, we examined whether phonotactic constraints of the first language affect speech processing by Japanese learners of English and whether L2 proficiency influences this effect. Seventeen native English speakers (ES), 18 Japanese speakers with high English proficiency (JH), and 20 Japanese speakers with relatively low English proficiency (JL) took part in a word-monitoring task. Two types of target words (CVC/CV, e.g., team/tea) were embedded in bisyllabic non-words (e.g., teamfesh) and presented to the participants together with other non-words in the lists. The three groups were instructed to respond as soon as they spotted a target, and response times and error rates were analyzed. The results showed that all groups segmented the CVC target words significantly faster and more accurately than the CV targets. L1 phonotactic constraints did not hinder L2 speech processing, and the word segmentation strategy was not language-specific in the case of Japanese learners of English.