In existing pre-trained language models, Chinese spelling check (CSC) often considers phonetic and graphic details at the character level and ignores the essential role of word segmentation. To address this issue, this paper studies efficient word segmentation for CSC in a framework referred to as the Word Segmentation-Enhanced Speller (WOSES). WOSES comprises two distinct models, the Word Speller (WSpeller) and the Hierarchical Word Speller (H-WSpeller), both designed to mitigate the often-ignored word boundary errors in CSC. The WOSES framework outperforms existing benchmarks on the standard datasets SIGHAN13 and SIGHAN15, which is attributed to its use of word segmentation and the specialized pre-trained model, W-MLM. Notably, the WSpeller model within the WOSES framework improves the F1 score by 3.3% and 2.1% on SIGHAN13 and SIGHAN15, respectively, compared to existing methods. The paper thus underscores the importance of word segmentation in CSC and sets a new performance standard in the domain.
The importance of the word as a unit of meaning is well-established for readers of both alphabetic languages and Chinese. However, the unspaced nature of written Chinese raises questions about how readers use upcoming information to guide word segmentation and to adjust the parafoveal processing of subsequent characters. Using an eye-tracking experiment, we investigated whether Chinese readers pre-process character C2 more when it forms a word with C1 than when they belong to separate words. The boundary paradigm was used to manipulate the preview of C2, such that readers saw either an identity (normal) or pseudo-character preview. Linear mixed-effects models revealed reduced preview benefit when C1 and C2 were separate words. These results suggest that despite the absence of visual segmentation cues, Chinese readers are able to utilise the parafoveal preview to support the identification of word boundaries and modulate the extent of their parafoveal processing to prioritise the processing of word units.
Experiments on various word segmentation approaches for the Burmese language are conducted and discussed in this note. Specifically, dictionary-based, statistical, and machine learning approaches are tested. Experimental results demonstrate that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe that this note, based on a relatively large annotated corpus (approximately half a million words), is the first systematic comparison of word segmentation approaches for Burmese. This work aims to discover the properties of, and proper approaches to, Burmese textual processing and to promote further research on this understudied language.
Segmentation of handwritten document images into text-lines and words is an essential task for optical character recognition. However, since the features of handwritten documents are irregular and vary from writer to writer, it is considered a challenging problem. To address the problem, we formulate word segmentation as a binary quadratic assignment problem that considers pairwise correlations between the gaps as well as the likelihoods of individual gaps. Even though many parameters are involved in our formulation, we estimate all of them with the Structured SVM framework, so that the proposed method works well regardless of writing style and written language, without user-defined parameters. Experimental results on the ICDAR 2009/2013 handwriting segmentation databases show that the proposed method achieves state-of-the-art performance on Latin-based and Indian languages.
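The gap-labelling formulation in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function name `segment_gaps`, the hand-picked weights (the paper learns its parameters with a Structured SVM), the restriction of the pairwise term to adjacent gaps, and the brute-force search, which is only practical for a handful of gaps.

```python
from itertools import product


def segment_gaps(gap_widths, unary_weight=1.0, pair_weight=0.5):
    """Label each gap 0 (inside a word) or 1 (between words).

    Toy energy: a unary term per gap plus a pairwise term on adjacent gaps.
    The weights are hand-picked assumptions, not learned parameters.
    """
    n = len(gap_widths)
    if n == 0:
        return ()
    mean_width = sum(gap_widths) / n

    def energy(labels):
        total = 0.0
        # Unary term: wide gaps are cheap to label 1, narrow gaps cheap to label 0.
        for width, label in zip(gap_widths, labels):
            score = (width - mean_width) / mean_width
            total += unary_weight * (-score if label == 1 else score)
        # Pairwise term: adjacent gaps of similar width pay a cost for disagreeing.
        for i in range(n - 1):
            if labels[i] != labels[i + 1]:
                similarity = 1.0 / (1.0 + abs(gap_widths[i] - gap_widths[i + 1]))
                total += pair_weight * similarity
        return total

    # Exhaustive search over all 2^n labelings (fine for a short text line).
    return min(product((0, 1), repeat=n), key=energy)


if __name__ == "__main__":
    print(segment_gaps([2, 3, 12, 2, 11, 3]))
```

With the example widths, the two wide gaps are the ones labelled as word boundaries. A realistic system would replace the exhaustive search with a proper solver and the hand-set weights with learned ones.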
Burmese is an isolating language in which the syllable is the smallest unit. The performance of word segmentation methods based on syllable matching depends on the quality of the syllable segmentation. This article proposes a word segmentation method that fuses syllable features in a two-layer conditional random field (CRF) model. It combines word segmentation and syllable segmentation into one process, thus reducing the impact of syllable segmentation errors in Burmese. In the first layer of the CRF model, Burmese characters, as atomic features, are combined with features derived from the Backus normal form (BNF) description of Burmese syllables to produce syllable sequence tags. In the second layer, the tagged syllables are taken as input, and word sequence tags are produced using a feature template with syllables as atomic features. The experimental results show that the proposed method performs better than the method based on syllable matching.
In this paper, we study methods to discover words and extract their pronunciations from audio data for non-written and under-resourced languages. We examine the potential and the challenges of pronunciation extraction from phoneme sequences through cross-lingual word-to-phoneme alignment. In our scenario, a human translator produces utterances in the (non-written) target language from prompts in a resource-rich source language. We add the resource-rich source language prompts to help the word discovery and pronunciation extraction process. By aligning the source language words to the target language phonemes, we segment the phoneme sequences into word-like chunks. The resulting chunks are interpreted as putative word pronunciations but are very prone to alignment and phoneme recognition errors. We therefore propose an alignment model, Model 3P, that is designed specifically for cross-lingual word-to-phoneme alignment. We present two different methods (source word dependent and independent clustering) that extract word pronunciations from word-to-phoneme alignments and compare them. We show that both methods compensate for phoneme recognition and alignment errors. We also extract a parallel corpus consisting of 15 different translations in 10 languages from the Christian Bible to evaluate our alignment model and error recovery methods. For example, based on noisy target language phoneme sequences with 45.1% errors, we build a dictionary for an English Bible with a Spanish Bible translation with a 4.5% OOV rate, where 64% of the extracted pronunciations contain no more than one wrong phoneme. Finally, we use the extracted pronunciations in an automatic speech recognition system for the target language and report promising word error rates, given that the pronunciation dictionary and language model are learned completely unsupervised and no written form for the target language is required for our approach. (C) 2014 Elsevier Ltd. All rights reserved.
When listening to speech from one's native language, words seem to be well separated from one another, like beads on a string. When listening to a foreign language, in contrast, words seem almost impossible to extract, as if there was only one bead on the same string. This contrast reveals that there are language-specific cues to segmentation. The puzzle, however, is that infants must be endowed with a language-independent mechanism for segmentation, as they ultimately solve the segmentation problem for any native language. Here, we approach the acquisition problem by asking whether there are language-independent cues to segmentation that might be available to even adult learners who have already acquired a native language. We show that adult learners recognize words in connected speech when only prosodic cues to word-boundaries are given from languages unfamiliar to the participants. In both artificial and natural speech, adult English speakers, with no prior exposure to the test languages, readily recognized words in natural languages with critically different prosodic patterns, including French, Turkish and Hungarian. We suggest that, even though languages differ in their sound structures, they carry universal prosodic characteristics. Further, these language-invariant prosodic cues provide a universally accessible mechanism for finding words in connected speech. These cues may enable infants to start acquiring words in any language even before they are fine-tuned to the sound structure of their native language. (C) 2010 Published by Elsevier Inc.
This study reports the development of a Myanmar word segmentation method using Unicode standard encoding. Word segmentation is an essential step prior to natural language processing in the Myanmar language, because a Myanmar text is a string of characters without explicit word boundary delimiters. The proposed method has two phases: syllable segmentation and syllable merging. A rule-based heuristic approach was adopted for syllable segmentation, and a dictionary-based statistical approach for syllable merging. Evaluation of test results showed that the method is very effective for the Myanmar language.
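As a rough illustration of the second phase (syllable merging), the sketch below greedily merges already-segmented syllables into the longest dictionary entry. It is an assumption-laden simplification: the function name `merge_syllables`, the `max_len` limit, the tiny dictionary, and the romanised placeholder syllables are all hypothetical, and plain longest-match lookup stands in for the dictionary-based statistical approach the abstract mentions.

```python
def merge_syllables(syllables, dictionary, max_len=4):
    """Greedily merge syllables into the longest word found in the dictionary.

    Hypothetical helper: uses plain longest-match lookup with no statistics.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable is always accepted.
        for length in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + length])
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


if __name__ == "__main__":
    # Romanised placeholders, not real Myanmar syllables or words.
    toy_dictionary = {"kala", "mata", "kalama"}
    print(merge_syllables(["ka", "la", "ma", "ta", "na"], toy_dictionary))
```

Because the match is greedy, "kalama" is preferred over "kala" followed by "mata"; resolving such ambiguities is where a statistical component would come in.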
Fluent speech contains few pauses between adjacent words. Cues such as stress, phonotactic constraints, and the statistical structure of the input aid infants in discovering word boundaries. None of the many available segmentation cues is foolproof. So, we used the headturn preference procedure to investigate infants' integration of multiple cues. We also explored whether infants find speech cues produced by coarticulation useful in word segmentation. Using natural speech syllables, we replicated Saffran, Aslin, et al.'s (1996) study demonstrating that 8-month-olds can segment a continuous stream of speech based on statistical cues alone. Next, we added conflicting segmentation cues. Experiment 2 pitted stress against statistics, whereas Experiment 3 pitted coarticulation against statistics. In both cases, 8-month-olds weighed speech cues more heavily than statistical cues. This observation was verified in Experiment 4, which indicated that greater complexity of the familiarization sequence does not necessarily lead to familiarity effects. (C) 2001 Academic Press.
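One common way to make "statistical cues" concrete is the transitional probability between adjacent syllables: high inside a word, lower across a word boundary. The sketch below assumes that operationalisation; the function names, the threshold of 0.9, and the three nonsense words are illustrative choices, not details taken from the abstract.

```python
from collections import Counter


def transitional_probabilities(stream):
    """Estimate P(next syllable | current syllable) from a syllable stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {pair: count / first_counts[pair[0]] for pair, count in pair_counts.items()}


def segment(stream, threshold=0.9):
    """Insert a word boundary wherever the transitional probability dips below threshold."""
    if not stream:
        return []
    tps = transitional_probabilities(stream)
    words, current = [], [stream[0]]
    for prev, nxt in zip(stream, stream[1:]):
        if tps[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words


if __name__ == "__main__":
    # Illustrative nonsense words concatenated without pauses.
    order = ["bidaku", "padoti", "golabu", "bidaku", "golabu", "padoti", "bidaku"]
    stream = [word[i:i + 2] for word in order for i in range(0, 6, 2)]
    print(segment(stream))
```

In this toy stream every within-word transition has probability 1.0 and every cross-word transition 0.5, so a fixed threshold recovers the words; real speech is far noisier, which is one reason additional cues such as stress and coarticulation matter.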
Spectral degradation reduces access to the acoustics of spoken language and compromises how learners break into its structure. We hypothesised that spectral degradation disrupts word segmentation, but that listeners can exploit other cues to restore detection of words. Normal-hearing adults were familiarised to artificial speech that was unprocessed or spectrally degraded by noise-band vocoding into 16 or 8 spectral channels. The monotonic speech stream was pause-free (Experiment 1), interspersed with isolated words (Experiment 2), or slowed by 33% (Experiment 3). Participants were tested on segmentation of familiar vs. novel syllable sequences and on recognition of individual syllables. As expected, vocoding hindered both word segmentation and syllable recognition. The addition of isolated words, but not slowed speech, improved segmentation. We conclude that syllable recognition is necessary but not sufficient for successful word segmentation, and that isolated words can facilitate listeners' access to the structure of acoustically degraded speech.