Search results - Inner Mongolia University Library

Efficient word segmentation for enhancing Chinese spelling check in pre-trained language model

KNOWLEDGE AND INFORMATION SYSTEMS, 2025, Vol. 67, No. 1, pp. 603-632

Authors: Li, Fangfang; Jiang, Jie; Tang, Dafu; Shan, Youran; Duan, Junwen; Zhang, Shichao. Affiliations: Cent South Univ Comp Sci & Engn Changsha 410083 Peoples R China; Natl Univ Def Technol Coll Syst Engn Changsha 410073 Peoples R China; Zhejiang Coll Secur Technol Coll Artificial Intelligence Wenzhou 325016 Peoples R China; Guangxi Normal Univ Key Lab Educ Blockchain & Intelligent Technol Minist Educ Guilin 541004 Peoples R China; Guangxi Normal Univ Guangxi Key Lab Multisource Informat Min & Secur Guilin 541004 Peoples R China

In existing pre-trained language models, Chinese spelling check (CSC) often considers phonetic and graphic details at the character level and ignores the essential role of word segmentation. To address this issue, this paper studies efficient word segmentation for CSC and proposes a framework referred to as the Word Segmentation-Enhanced Speller (WOSES). WOSES comprises two distinct models, the Word Speller (WSpeller) and the Hierarchical Word Speller (H-WSpeller), both designed to mitigate the often-ignored word boundary errors in CSC. The WOSES framework outperforms existing benchmarks on the standard datasets SIGHAN13 and SIGHAN15, which is attributed to its innovative use of word segmentation and the specialized pre-trained model, W-MLM. Notably, the WSpeller model within the WOSES framework achieves F1 score improvements of 3.3% and 2.1% on SIGHAN13 and SIGHAN15, respectively, compared to existing methods. This paper not only underscores the importance of word segmentation in CSC but also proposes a novel performance standard in the domain.

Keywords: Chinese spelling check Pre-trained language model word segmentation

Exploring the role of word segmentation on parafoveal processing during Chinese reading

JOURNAL OF COGNITIVE PSYCHOLOGY, 2025, Vol. 37, No. 1, pp. 1-14

Authors: Xie, Fang; Chen, Wanying; Zhang, Lei; Cao, Xiaohua; Warrington, Kayleigh L. Affiliations: Zhejiang Normal Univ Zhejiang Philosophy & Social Sci Lab Mental Hlth & Jinhua Peoples R China; Zhejiang Normal Univ Sch Psychol Jinhua Peoples R China; Univ Leicester Sch Psychol & Vis Sci Leicester England

The importance of the word as a unit of meaning is well-established for readers of both alphabetic languages and Chinese. However, the unspaced nature of written Chinese raises questions about how readers use upcoming information to guide word segmentation and to adjust the parafoveal processing of subsequent characters. Using an eye-tracking experiment, we investigated whether Chinese readers pre-process character C2 more when it forms a word with C1 than when they belong to separate words. The boundary paradigm was used to manipulate the preview of C2, such that readers saw either an identity (normal) or pseudo-character preview. Linear mixed-effects models revealed reduced preview benefit when C1 and C2 were separate words. These results suggest that despite the absence of visual segmentation cues, Chinese readers are able to utilise the parafoveal preview to support the identification of word boundaries and modulate the extent of their parafoveal processing to prioritise the processing of word units.

Keywords: Parafoveal processing word segmentation preview benefit eye-tracking experiment Chinese reading

Word segmentation for Burmese (Myanmar)

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2016, Vol. 15, No. 4, pp. 1–10

Authors: Ding, Chenchen; Thu, Ye Kyaw; Utiyama, Masao; Sumita, Eiichiro. Affiliation: Natl Inst Informat & Commun Technol Adv Translat Technol Lab ASTREC 3-5 Hikaridai Seika Kyoto 6190289 Japan

Experiments on various word segmentation approaches for the Burmese language are conducted and discussed in this note. Specifically, dictionary-based, statistical, and machine learning approaches are tested. Experimental results demonstrate that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe that this note, based on an annotated corpus of relatively considerable size (containing approximately half a million words), is the first systematic comparison of word segmentation approaches for Burmese. This work aims to discover the properties of, and proper approaches to, Burmese textual processing and to promote further research on this understudied language.

Keywords: Burmese Myanmar syllable word segmentation algorithm
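
The dictionary-based family of approaches mentioned in the abstract above can be illustrated with a minimal sketch of greedy forward maximum matching. This is a generic textbook technique and not necessarily the variant tested in the note; the function name `max_match`, the `max_len` parameter, and the toy lexicon of placeholder syllables are all assumptions made for illustration.

```python
def max_match(units, lexicon, max_len=4):
    """Greedy forward maximum matching over a sequence of units (e.g. syllables).

    `lexicon` is a set of tuples of units; `max_len` caps the candidate length.
    Both are hypothetical inputs chosen for this sketch.
    """
    words = []
    i = 0
    while i < len(units):
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for size in range(min(max_len, len(units) - i), 0, -1):
            candidate = tuple(units[i:i + size])
            # A single unit is always accepted so that unknown units still advance.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon over placeholder syllables, not real Burmese data.
LEXICON = {("a", "b"), ("a", "b", "c"), ("d", "e")}
segmented = max_match(["a", "b", "c", "d", "e", "f"], LEXICON)
print(segmented)
```

With this toy lexicon the input is split into (a, b, c), (d, e), and the unknown single unit (f). Because the method always prefers the longest lexicon entry, it cannot recover from an early wrong match.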

Word Segmentation Method for Handwritten Documents based on Structured Learning

IEEE SIGNAL PROCESSING LETTERS, 2015, Vol. 22, No. 8, pp. 1161-1165

Authors: Ryu, Jewoong; Koo, Hyung Il; Cho, Nam Ik. Affiliations: Seoul Natl Univ Dept Elect & Comp Engn INMC Seoul South Korea; Ajou Univ Dept Elect & Comp Engn Suwon 441749 South Korea

Segmentation of handwritten document images into text-lines and words is an essential task for optical character recognition. However, since the features of handwritten documents are irregular and vary from person to person, it is considered a challenging problem. To address the problem, we formulate the word segmentation problem as a binary quadratic assignment problem that considers pairwise correlations between the gaps as well as the likelihoods of individual gaps. Even though many parameters are involved in our formulation, we estimate all parameters based on the Structured SVM framework so that the proposed method works well regardless of writing styles and written languages without user-defined parameters. Experimental results on ICDAR 2009/2013 handwriting segmentation databases show that the proposed method achieves state-of-the-art performance on Latin-based and Indian languages.

Keywords: Handwritten documents structured SVM word segmentation

Word Segmentation for Burmese Based on Dual-Layer CRFs

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2019, Vol. 18, No. 1, pp. 1–11

Authors: Zhang, Shaoning; Mao, Cunli; Yu, Zhengtao; Wang, Hongbin; Li, Zhongwei; Zhang, Jiafu. Affiliation: Kunming Univ Sci & Technol Sch Informat Engn & Automat Kunming 650500 Yunnan Peoples R China

Burmese is an isolating language in which the syllable is the smallest unit. With matching-based syllable segmentation, word segmentation performance depends on the quality of the syllable segmentation. This article proposes a word segmentation method that fuses syllable features in two layers. It combines word segmentation and syllable segmentation into one process, thus reducing the impact of syllable segmentation errors in Burmese. In the first layer of the conditional random fields (CRFs) model, Burmese characters are used as atomic features and are combined with Backus normal form (BNF) features of Burmese to produce the Burmese syllable sequence tags. In the second layer of the CRFs model, the tagged syllables are taken as input, and sequence labeling is performed by building a feature template with syllables as atomic features. The experimental results show that the proposed method performs better than the method based on syllable matching.

Keywords: Burmese word segmentation CRFs BNF syllable segmentation

Word segmentation and pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment

COMPUTER SPEECH AND LANGUAGE, 2016, Vol. 35, pp. 234-261

Authors: Stahlberg, Felix; Schlippe, Tim; Vogel, Stephan; Schultz, Tanja. Affiliations: KIT Cognit Syst Lab Karlsruhe Germany; Qatar Fdn Qatar Comp Res Inst Doha Qatar

In this paper, we study methods to discover words and extract their pronunciations from audio data for non-written and under-resourced languages. We examine the potential and the challenges of pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. In our scenario a human translator produces utterances in the (non-written) target language from prompts in a resource-rich source language. We add the resource-rich source language prompts to help the word discovery and pronunciation extraction process. By aligning the source language words to the target language phonemes, we segment the phoneme sequences into word-like chunks. The resulting chunks are interpreted as putative word pronunciations but are very prone to alignment and phoneme recognition errors. Thus we suggest our alignment model Model 3P that is particularly designed for cross-lingual word-to-phoneme alignment. We present two different methods (source word dependent and independent clustering) that extract word pronunciations from word-to-phoneme alignments and compare them. We show that both methods compensate for phoneme recognition and alignment errors. We also extract a parallel corpus consisting of 15 different translations in 10 languages from the Christian Bible to evaluate our alignment model and error recovery methods. For example, based on noisy target language phoneme sequences with 45.1% errors, we build a dictionary for an English Bible with a Spanish Bible translation with 4.5% OOV rate, where 64% of the extracted pronunciations contain no more than one wrong phoneme. Finally, we use the extracted pronunciations in an automatic speech recognition system for the target language and report promising word error rates given that pronunciation dictionary and language model are learned completely unsupervised and no written form for the target language is required for our approach. (C) 2014 Elsevier Ltd. All rights reserved.

Keywords: Pronunciation dictionary Non-written languages Lexical language discovery Under-resourced languages Speech-to-speech translation word segmentation

Word segmentation with universal prosodic cues

COGNITIVE PSYCHOLOGY, 2010, Vol. 61, No. 2, pp. 177-199

Authors: Endress, Ansgar D.; Hauser, Marc D. Affiliation: Harvard Univ MIT Dept Psychol Cambridge MA 02139 USA

When listening to speech from one's native language, words seem to be well separated from one another, like beads on a string. When listening to a foreign language, in contrast, words seem almost impossible to extract, as if there was only one bead on the same string. This contrast reveals that there are language-specific cues to segmentation. The puzzle, however, is that infants must be endowed with a language-independent mechanism for segmentation, as they ultimately solve the segmentation problem for any native language. Here, we approach the acquisition problem by asking whether there are language-independent cues to segmentation that might be available to even adult learners who have already acquired a native language. We show that adult learners recognize words in connected speech when only prosodic cues to word-boundaries are given from languages unfamiliar to the participants. In both artificial and natural speech, adult English speakers, with no prior exposure to the test languages, readily recognized words in natural languages with critically different prosodic patterns, including French, Turkish and Hungarian. We suggest that, even though languages differ in their sound structures, they carry universal prosodic characteristics. Further, these language-invariant prosodic cues provide a universally accessible mechanism for finding words in connected speech. These cues may enable infants to start acquiring words in any language even before they are fine-tuned to the sound structure of their native language. (C) 2010 Published by Elsevier Inc.

Keywords: word segmentation Prosody Language universals Language acquisition Statistical learning

Word segmentation for the Myanmar language

JOURNAL OF INFORMATION SCIENCE, 2008, Vol. 34, No. 5, pp. 688-704

Authors: Thet, Tun Thura; Na, Jin-Cheon; Ko, Wunna Ko. Affiliations: Nanyang Technol Univ Sch Commun & Informat Div Informat Studies Singapore 637718 Singapore; Myanmar NLP Res Ctr Hlaing Yangon Myanmar

This study reports the development of a Myanmar word segmentation method using Unicode standard encoding. Word segmentation is an essential step prior to natural language processing in the Myanmar language, because a Myanmar text is a string of characters without explicit word boundary delimiters. The proposed method has two phases: syllable segmentation and syllable merging. A rule-based heuristic approach was adopted for syllable segmentation, and a dictionary-based statistical approach for syllable merging. Evaluation of test results showed that the method is very effective for the Myanmar language.

Keywords: Myanmar language word segmentation natural language processing syllable segmentation syllable merging collocation strength mutual information

Word segmentation by 8-month-olds: When speech cues count more than statistics

JOURNAL OF MEMORY AND LANGUAGE, 2001, Vol. 44, No. 4, pp. 548-567

Authors: Johnson, EK; Jusczyk, PW. Affiliations: Johns Hopkins Univ Dept Psychol Baltimore MD 21218 USA; Johns Hopkins Univ Dept Cognit Sci Baltimore MD 21218 USA

Fluent speech contains few pauses between adjacent words. Cues such as stress, phonotactic constraints, and the statistical structure of the input aid infants in discovering word boundaries. None of the many available segmentation cues is foolproof. So, we used the headturn preference procedure to investigate infants' integration of multiple cues. We also explored whether infants find speech cues produced by coarticulation useful in word segmentation. Using natural speech syllables, we replicated Saffran, Aslin, et al.'s (1996) study demonstrating that 8-month-olds can segment a continuous stream of speech based on statistical cues alone. Next, we added conflicting segmentation cues. Experiment 2 pitted stress against statistics, whereas Experiment 3 pitted coarticulation against statistics. In both cases, 8-month-olds weighed speech cues more heavily than statistical cues. This observation was verified in Experiment 4, which indicated that greater complexity of the familiarization sequence does not necessarily lead to familiarity effects. (C) 2001 Academic Press.

Keywords: coarticulation word segmentation stress cues statistical cues prosodic stress transitional probabilities statistical learning
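
The statistical cue discussed in the abstract above, the transitional probability between adjacent syllables, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure or stimuli of the study: the three made-up words, their ordering, the threshold of 0.9, and the function names are all hypothetical.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Return P(next | current) for each adjacent syllable pair in the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor (all but the last position).
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(syllables, threshold):
    """Place a word boundary wherever the transitional probability drops below threshold."""
    if not syllables:
        return []
    tp = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tp[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words


# Hypothetical artificial language: three trisyllabic words in a fixed order.
WORDS = {"A": ["bi", "da", "ku"], "B": ["pa", "do", "ti"], "C": ["go", "la", "bu"]}
ORDER = ["A", "B", "C", "A", "C", "B", "A", "B", "C", "B", "A", "C"]
STREAM = [syl for key in ORDER for syl in WORDS[key]]

tp = transitional_probabilities(STREAM)
segmented = segment(STREAM, threshold=0.9)
print(segmented)
```

In this toy stream every within-word transition has probability 1.0 while every transition across a word boundary is lower, so thresholding recovers the words. The sketch models only the statistical cue; it does not model stress or coarticulation, which the study reports that 8-month-olds weighed more heavily.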

Word segmentation from noise-band vocoded speech

LANGUAGE COGNITION AND NEUROSCIENCE, 2017, Vol. 32, No. 10, pp. 1344-1356

Authors: Grieco-Calub, Tina M.; Simeon, Katherine M.; Snyder, Hillary E.; Lew-Williams, Casey. Affiliations: Northwestern Univ Roxelyn & Richard Pepper Dept Commun Sci & Disord Evanston IL 60208 USA; Princeton Univ Dept Psychol Princeton NJ 08544 USA

Spectral degradation reduces access to the acoustics of spoken language and compromises how learners break into its structure. We hypothesised that spectral degradation disrupts word segmentation, but that listeners can exploit other cues to restore detection of words. Normal-hearing adults were familiarised to artificial speech that was unprocessed or spectrally degraded by noise-band vocoding into 16 or 8 spectral channels. The monotonic speech stream was pause-free (Experiment 1), interspersed with isolated words (Experiment 2), or slowed by 33% (Experiment 3). Participants were tested on segmentation of familiar vs. novel syllable sequences and on recognition of individual syllables. As expected, vocoding hindered both word segmentation and syllable recognition. The addition of isolated words, but not slowed speech, improved segmentation. We conclude that syllable recognition is necessary but not sufficient for successful word segmentation, and that isolated words can facilitate listeners' access to the structure of acoustically degraded speech.

Keywords: word segmentation noise-band vocoding spectral degradation speech rate isolated words
