The text embedded in images provides important information for image understanding. Text segmentation is an essential step for text recognition. It is often difficult to segment text from images at low resolution or with complex backgrounds. In this paper, a novel text segmentation framework is proposed to solve the problem. The proposed framework adopts a hybrid strategy that integrates two different text segmentation methods to produce text candidates. One segmentation method is based on the intensity uniformity of text regions, while the other integrates the intensity and stroke-width features of text. To separate text pixels from the text candidates, a new non-text pixel filtering method is proposed. In the filtering method, a classifier is designed based on the number of breaking elements and the k-means clustering algorithm. The performance of the proposed segmentation framework is tested with pixel-based and recognition-based evaluation methods. Experimental results show that the F-scores of the proposed framework on the video caption dataset and the born-digital dataset of ICDAR2013 are 95.29% and 89.09% respectively, while the correctly recognized character rate and word rate on the German TV public dataset are 91.00% and 72.33%. The experimental results indicate that the proposed text segmentation framework is accurate and robust in text segmentation and recognition.
In this article we present a new method for text segmentation. The method relies on the number of lexical chains (LCs) that end in a sentence, that begin in the following sentence, and that traverse the two successive sentences. The lexical chains are based on Roget's thesaurus (the 1987 and the 1911 versions). We evaluate the method on ten texts from the DUC 2002 conference and on twenty texts from the CAST project corpus, using a manual segmentation as the gold standard.
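The three counts described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each lexical chain is already given as a hypothetical (first_sentence, last_sentence) span, and the `boundary_score` formula that combines the counts is an assumption, since the abstract does not state how they are combined.

```python
from typing import List, Tuple

# Hypothetical representation: a chain is the inclusive span of
# sentence indices it covers, (first_sentence, last_sentence).
Chain = Tuple[int, int]


def gap_counts(chains: List[Chain], gap: int) -> Tuple[int, int, int]:
    """Count chains around the gap between sentence `gap` and `gap + 1`.

    Returns (ends, begins, crosses):
      ends    - chains whose last sentence is `gap`
      begins  - chains whose first sentence is `gap + 1`
      crosses - chains that cover both sentences
    """
    ends = sum(1 for start, end in chains if end == gap)
    begins = sum(1 for start, end in chains if start == gap + 1)
    crosses = sum(1 for start, end in chains if start <= gap and end >= gap + 1)
    return ends, begins, crosses


def boundary_score(chains: List[Chain], gap: int) -> float:
    """Assumed scoring rule: many ending/beginning chains and few
    crossing chains suggest a topic boundary at this gap."""
    ends, begins, crosses = gap_counts(chains, gap)
    return (ends + begins) / (1 + crosses)


chains = [(0, 2), (1, 2), (3, 5), (3, 4), (0, 5)]
print(gap_counts(chains, 2), boundary_score(chains, 2))
```

With these example chains, the gap after sentence 2 has two chains ending, two beginning, and one crossing, so it scores higher than the gap after sentence 0.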
In this paper, the task of text segmentation is approached from a topic modeling perspective. We investigate the use of two unsupervised topic models, latent Dirichlet allocation (LDA) and multinomial mixture (MM), to segment a text into semantically coherent parts. The proposed topic-model-based approaches consistently outperform a standard baseline method on several datasets. A major benefit of the proposed LDA-based approach is that, along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications such as segment retrieval and discourse analysis. However, the proposed approaches, especially the LDA-based method, have high computational requirements. Based on an analysis of the dynamic programming (DP) algorithm typically used for segmentation, we suggest a modification to DP that dramatically speeds up the process with no loss in performance. The proposed modification to the DP algorithm is not specific to topic models; it is applicable to all algorithms that use DP for the task of text segmentation. (C) 2010 Elsevier Ltd. All rights reserved.
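For orientation, the kind of dynamic programming typically used for segmentation can be sketched as below. This is the plain quadratic DP, not the speed-up proposed in the paper (the abstract does not describe it). The function name `segment_dp`, the per-segment `penalty`, and the toy squared-error cost over one number per sentence are all assumptions made for illustration; a topic-model approach would plug in a segment likelihood instead.

```python
from typing import Callable, List


def segment_dp(n: int, cost: Callable[[int, int], float], penalty: float) -> List[int]:
    """Split items 0..n-1 into contiguous segments minimising
    sum(cost(i, j) + penalty) over segments [i, j).

    Returns the exclusive end index of each segment, in order.
    """
    best = [0.0] + [float("inf")] * n   # best[j]: cheapest split of the first j items
    back = [0] * (n + 1)                # back[j]: start of the last segment in that split
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j] = candidate
                back[j] = i
    # Walk the back-pointers from the end to recover the boundaries.
    ends = []
    j = n
    while j > 0:
        ends.append(j)
        j = back[j]
    return ends[::-1]


# Toy data (assumption): one score per sentence; a coherent segment
# has low squared deviation from its own mean.
values = [0.0, 0.1, 0.0, 5.0, 5.1, 4.9]


def sse(i: int, j: int) -> float:
    chunk = values[i:j]
    mean = sum(chunk) / len(chunk)
    return sum((v - mean) ** 2 for v in chunk)


print(segment_dp(len(values), sse, penalty=0.5))
```

The penalty term discourages over-segmentation; without it, placing every sentence in its own segment would always give zero cost.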
With the rapid development of natural language processing technology, text segmentation has become an important task in text processing. However, existing text segmentation methods often perform poorly when faced with long texts and complex structures, requiring a more efficient and accurate approach. In this paper, we propose a new text segmentation method based on the Hierarchical Document Attention (HDA), which automatically identifies and segments different paragraphs in the text by analyzing and weighting the hierarchical structure of the text sequence data. Compared with existing methods, the model has higher accuracy and efficiency, and better supports tasks such as text analysis and information extraction. The main contribution of this paper is the proposal of a text segmentation method based on the HDA, which effectively models text sequences through multi-level attention mechanisms. Experimental verification on public datasets shows that this model exhibits good performance in text segmentation tasks.
In natural scenes, text elements are corrupted by many types of noise, such as streaks, highlights, or cracks. These effects make clean, automatic segmentation very difficult and can reduce the accuracy of further analysis such as optical character recognition. We propose a method to drastically improve segmentation using tensor voting as the main filtering step. We first decompose an image into chromatic and achromatic regions. We then identify text layers using tensor voting, and remove noise by iteratively applying an adaptive median filter. Finally, density estimation for center mode detection and the K-means clustering algorithm are applied to segment the values of the improved image according to the hue or intensity component. Excellent results are achieved in experiments on real images. (c) 2006 Elsevier B.V. All rights reserved.
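The final clustering step can be illustrated with a small one-dimensional K-means over intensity values. This is only a sketch under stated assumptions: it omits tensor voting, median filtering, and the density-based estimation of center modes, and instead initialises the centers at evenly spaced positions, which is a simplification not taken from the paper. The function name `kmeans_1d` is hypothetical.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int, iters: int = 50) -> Tuple[List[float], List[int]]:
    """Cluster scalar values (e.g. pixel intensities) into k groups.

    Centers start evenly spaced between min and max (an assumption;
    the paper estimates center modes by density estimation instead).
    Returns (centers, labels).
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Dark text pixels versus bright background pixels (toy intensities).
intensities = [10, 12, 11, 200, 205, 198]
print(kmeans_1d(intensities, k=2))
```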
Text segmentation is important for text image analysis and recognition; however, it is challenging due to noise and complex backgrounds in natural scenes. Superpixel-based image representation can enhance robustness to noise and local disturbances, but conventional superpixel algorithms struggle to obtain complete stroke regions and accurate boundaries for text images. In this study, a text segmentation method based on superpixel clustering is proposed. First, to generate accurate superpixels for text images, an adaptive text superpixel generation algorithm based on simple linear iterative clustering is proposed. The adaptive superpixel size and compactness are calculated to enhance boundary adherence. Second, to cover strokes more completely, superpixel clustering merges homogeneous superpixels into larger regions for both strokes and the background. A modified density-based spatial clustering of applications with noise algorithm is proposed for this step. Finally, stroke superpixel verification assigns each region to a stroke or to the background, and the text segmentation result is obtained. The proposed method shows promising robustness to noise and complex background textures. Experimental results on the Korea Advanced Institute of Science and Technology (KAIST) scene text dataset, the International Conference on Document Analysis and Recognition (ICDAR) 2003 natural scene text image dataset, and the Street View Text dataset verify that this method is effective and significantly outperforms existing methods.
In this paper, we use Barry and Hartigan's Product Partition Models to formulate text segmentation as an optimization problem, which we solve by a fast dynamic programming algorithm. We test the algorithm on Choi's segmentation benchmark and achieve the best segmentation results so far reported in the literature. (C) 2004 Elsevier Ltd. All rights reserved.
Text segmentation has played an important role in information retrieval as well as natural language processing. Current segmentation methods are well suited to written and structured texts, making use of their distinctive macro-level structures; however, text segmentation of transcribed multi-party conversation presents a different challenge, given its ill-formed sentences and lack of macro-level text units. This paper describes an algorithm suitable for segmenting spoken meeting transcripts that combines semantically complex lexical relations with speech cue phrases to build lexical chains for determining topic boundaries.
ISBN: (print) 9781479919598
Images of ancient maps and floor plans can present a great challenge for common character recognition tools. Besides the damage caused by time and handling, these documents have an important part of their information described graphically. In most examples, drawings of rivers or walls occupy most of the document. Usually, text has different styles, sizes, and orientations, and may overlap with graphics. This paper presents a new method for text segmentation in images of ancient topographic maps and floor plans that uses a machine learning algorithm specialized in novelty detection to decide which components of the image are textual. Despite using artificial text examples for training, the method is able to outperform other state-of-the-art methods when applied to real images.
ISBN: (print) 9783038351153
Text segmentation is very important for many fields, including information retrieval, summarization, language modeling, and anaphora resolution. Text segmentation based on PLSA-TextTiling associates different latent topics with observable pairs of word and sentence. In the experiments, whole sentences are taken as elementary blocks. The PLSA model is used to calculate a similarity metric based on the idea of TextTiling, and several approaches to discovering boundaries are tried. The results show a Pμ value of 0.87, which is better than that of other text segmentation algorithms.
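The TextTiling idea of scoring the gap between adjacent blocks can be sketched as follows. Note the substitution: this sketch compares plain bag-of-words vectors with cosine similarity, whereas the paper derives the similarity from a PLSA model. Whole sentences are used as blocks, as in the abstract; the function names and the fixed `threshold` are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences: List[str]) -> List[float]:
    """Similarity across each gap between consecutive sentences.
    (Bag-of-words stand-in for the PLSA-based similarity.)"""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def depth_scores(sims: List[float]) -> List[float]:
    """TextTiling-style depth: how far each gap's similarity sits
    below the nearest peaks to its left and right."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def boundaries(sims: List[float], threshold: float) -> List[int]:
    """Gaps whose depth exceeds the (assumed, fixed) threshold."""
    return [i for i, d in enumerate(depth_scores(sims)) if d > threshold]


print(boundaries([0.8, 0.2, 0.9], threshold=0.5))
```

A returned index i means a boundary between block i and block i + 1. The abstract mentions that several boundary-discovery approaches were tried; the fixed threshold here is only one simple option.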