We propose simple and fast algorithms for detecting italic, bold and all-capital words without performing character recognition. We present a statistical study showing that detecting such words may play a key role in automatic information retrieval from documents. Moreover, detecting italic words can improve the recognition accuracy of a text recognition system. The algorithms were tested on a considerable number of document images and gave accurate results on all of them; they are also very easy to implement.
This paper presents a pronominal anaphora resolution (PAR) approach that uses global discourse knowledge along with other traditional features. So far, the features used to find the referent of an anaphoric pronoun have been computed locally: normally, the sentence containing the anaphor and a few sentences immediately before it form the local context. In this process, the knowledge base is updated as more of the discourse is processed. Keeping this approach as the core, the present paper explores the use of prior knowledge obtained by examining the entire discourse (the whole article). Adding this processing step improves the PAR's efficiency. The improvement is demonstrated on the ICON 2011 Bangla dataset.
ISBN: 0769519601 (print)
To handle the variability in the writing styles of different individuals, we propose a robust scheme to segment unconstrained handwritten Bangla text into lines, words and characters. For line segmentation, we first divide the text into vertical stripes. The stripe width for a document is computed by statistical analysis of the text height in that document. Next, we compute the horizontal histogram of each stripe and use the relationship between the minima of these histograms to segment text lines. Lines are then segmented into words based on the vertical projection profile. Segmenting characters from a handwritten word is difficult because the characters are seldom vertically separable, so we use a concept based on the water reservoir principle. We first identify isolated and connected (touching) characters in a word. The touching characters are then segmented based on the reservoir base area points and the structural features of the component.
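The word-segmentation step based on a vertical projection profile can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `segment_words`, the representation of a text line as rows of 0/1 pixels, and the `min_gap` threshold are all hypothetical choices made here, since the abstract does not specify them.

```python
from typing import List, Tuple


def segment_words(line_img: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Split a binarized text line into words using its vertical projection profile.

    line_img is a list of rows; each pixel is 1 (ink) or 0 (background).
    min_gap (hypothetical parameter) is the smallest run of empty columns
    treated as a word boundary; narrower gaps stay inside a word.
    Returns (start_col, end_col) pairs with end_col exclusive.
    """
    if not line_img:
        return []
    width = len(line_img[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[c] for row in line_img) for c in range(width)]

    words: List[Tuple[int, int]] = []
    start = None      # first column of the word currently being built
    last_ink = None   # most recent column that contained ink
    for c, count in enumerate(profile):
        if count == 0:
            continue
        if start is None:
            start = c
        elif c - last_ink - 1 >= min_gap:
            # The empty run since the last ink column is wide enough: close the word.
            words.append((start, last_ink + 1))
            start = c
        last_ink = c
    if start is not None:
        words.append((start, last_ink + 1))
    return words
```

In practice a gap threshold would likely be derived from the document itself (for example from the distribution of gap widths) rather than fixed; the constant default here is only for illustration.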
Segmentation of handwritten words into characters is one of the important components of handwritten text OCR. In this paper we put forward a method for segmenting handwritten Bangla (an Indo-Bangladeshi language) text into characters. Based on certain characteristics of Bangla writing, different zones across the height of the word are detected. These zones provide structural information about the constituent characters of the word. In handwritten Bangla text the rectangular hulls of successive characters often overlap, so the characters are seldom vertically separable. We therefore propose recursive contour following in one of the zones across the height of the word to find the extents within which the main portion of each character lies. If successive characters do not touch in the zone used for contour following, the algorithm gives fairly good results.
ISBN: 0769519601 (print)
This paper deals with an optical character recognition (OCR) system for printed Urdu, a popular Indian script. Developing OCR for this script is difficult because (i) a large number of characters have to be recognized and (ii) many characters have similar shapes. In the proposed system, individual characters are recognized using a combination of topological, contour and water reservoir concept based features. The feature detection methods are simple and robust. A prototype of the system has been tested on printed Urdu characters and currently achieves 97.8% character-level accuracy on average.
The backpropagation algorithm helps a multilayer perceptron learn to map a set of inputs to a set of outputs, but its function approximation performance is often not impressive. In this paper the authors demonstrate that self-adaptation of the learning rate of the backpropagation algorithm improves the approximation of a function. The modified backpropagation algorithm with self-adaptive learning rates combines two updating rules: one for the connection weights and one for the learning rate. The learning rate update implements the gradient descent principle on the error surface. Simulation results with astrophysical data are presented.
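The idea of updating the learning rate by gradient descent on the error surface can be illustrated with a small sketch. The abstract does not give the authors' update rule, so the code below uses one common form as an assumption: since the weights at step t are w_t = w_{t-1} - eta * g_{t-1}, the derivative of the error with respect to eta is -(g_t . g_{t-1}), and descending on eta therefore adds beta * (g_t . g_{t-1}). For brevity it trains a single linear unit rather than a multilayer perceptron; the function name `train_adaptive_lr` and the parameters `eta`, `beta` and `steps` are hypothetical.

```python
from typing import Sequence, Tuple


def train_adaptive_lr(
    xs: Sequence[float],
    ys: Sequence[float],
    eta: float = 0.01,    # initial learning rate (assumed value)
    beta: float = 0.001,  # step size for the learning-rate update (assumed value)
    steps: int = 200,
) -> Tuple[float, float, float, float]:
    """Fit y = w*x + b by gradient descent with a self-adapting learning rate.

    Returns (w, b, final_eta, final_mean_squared_error).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    prev_grad = None
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to (w, b).
        grad = (
            2.0 * sum(e * x for e, x in zip(errs, xs)) / n,
            2.0 * sum(errs) / n,
        )
        if prev_grad is not None:
            # Rule 2: gradient descent on eta. dE/d(eta) = -(grad . prev_grad),
            # so eta grows while successive gradients agree and shrinks otherwise.
            dot = grad[0] * prev_grad[0] + grad[1] * prev_grad[1]
            eta = max(eta + beta * dot, 1e-6)  # keep eta positive
        # Rule 1: ordinary gradient step on the weights with the current eta.
        w -= eta * grad[0]
        b -= eta * grad[1]
        prev_grad = grad
    mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, eta, mse
```

For example, calling `train_adaptive_lr([-1, -0.5, 0, 0.5, 1], [-1, 0, 1, 2, 3])` fits the line y = 2x + 1, and the learning rate rises from its initial value while successive gradients point the same way.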
In this paper, we describe an approach to distinguish hand-written text from machine-printed text in annotated machine-printed Bangla document images. In applications involving OCR, distinguishing machine-printed from hand-written characters is important so that they can be sent to separate recognition engines. Identifying hand-written parts is also useful for deleting them and cleaning the document image. We present a classification system that takes each connected component in the document image and assigns it to one of two classes, "machine-printed" or "hand-written". The proposed system contains a preprocessing step that smooths the object border and finds the connected component. Bangla script specific features are extracted from the connected component image, and a standard SVM-based classifier generates the final response. Experimental results on a data set show that the proposed approach achieves an overall accuracy of 96.49%.
ISBN: 0769512631 (print)
The existence of touching characters in scanned documents is a major problem in designing an effective character segmentation procedure for OCR systems. In this paper, new techniques are presented for the identification and segmentation of touching characters. The techniques are based on fuzzy multifactorial analysis. A predictive algorithm is developed for effectively selecting cut-points to segment touching characters. Initially, the proposed method has been applied to touching characters that appear in Devnagari (Hindi) and Bangla, two major scripts of the Indian subcontinent. The results obtained from a test set of considerable size show that a high recognition rate can be achieved with a reasonable amount of computation.
India is a multilingual, multiscript country with more than 18 languages and 10 different major scripts. Not enough research has been done on the recognition of handwritten characters of these Indian scripts. Tamil, an official as well as popular script of the southern part of India, Singapore, Malaysia and Sri Lanka, has a large character set that includes many compound characters. Only a few works on handwriting recognition for this large character set have been reported in the literature. Recently, HP Labs India developed a database of handwritten Tamil characters. In the present paper, we describe an off-line recognition approach based on this database. The proposed method consists of two stages. In the first stage, we apply an unsupervised clustering method to create a smaller number of groups of handwritten Tamil character classes. In the second stage, we apply a supervised classification technique within each of these smaller groups for final recognition. The features considered in the two stages are different. The proposed two-stage recognition scheme provided acceptable classification accuracies on both the training and test sets of the present database.
In the present article, we describe a novel direction code based feature extraction approach for the recognition of online handwritten Bangla basic characters. We have implemented the proposed approach on a database of 7043 online handwritten Bangla (a major script of the Indian subcontinent) character samples, which we developed ourselves. This is a 50-class recognition problem, and we achieved recognition accuracies of 93.90% and 83.61% on its training and test sets, respectively.
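A generic direction-code feature for an online stroke can be sketched as follows. The abstract does not describe the authors' feature in detail, so this is only an illustration of the general idea under stated assumptions: each pen movement between consecutive sample points is quantized into one of 8 direction codes (0 = right, counting counter-clockwise with the y axis pointing up), and the normalized histogram of codes serves as a feature vector. The function names `direction_codes` and `direction_histogram` and the parameter `n_dirs` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def direction_codes(stroke: Sequence[Point], n_dirs: int = 8) -> List[int]:
    """Quantize each segment of a pen stroke into one of n_dirs direction codes."""
    sector = 2.0 * math.pi / n_dirs  # angular width of one direction code
    codes: List[int] = []
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # repeated sample point carries no direction
        angle = math.atan2(dy, dx)  # in [-pi, pi]
        # Round to the nearest sector centre and wrap negatives into 0..n_dirs-1.
        codes.append(int(round(angle / sector)) % n_dirs)
    return codes


def direction_histogram(stroke: Sequence[Point], n_dirs: int = 8) -> List[float]:
    """Normalized histogram of direction codes, usable as a simple feature vector."""
    codes = direction_codes(stroke, n_dirs)
    hist = [0.0] * n_dirs
    for c in codes:
        hist[c] += 1.0
    total = len(codes)
    return [h / total for h in hist] if total else hist
```

Digitizers often report y increasing downward; in that case the codes are mirrored vertically, which is harmless as long as training and test data use the same convention.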