There are many documents where text lines are not parallel to each other i.e. these lines have different inclinations with the horizontal lines (multi-skew documents). For the OCR of such a document we have to estimat...
详细信息
ISBN:
(纸本)0769512631
There are many documents where text lines are not parallel to each other i.e. these lines have different inclinations with the horizontal lines (multi-skew documents). For the OCR of such a document we have to estimate the skew angle of individual text lines because a single rotation cannot de-skew all text lines of the document. In this paper, we describe a robust technique for multi-skew angle detection from Indian documents containing the most popular Indian scripts Devnagari and Bangla. Most characters in these scripts have horizontal lines at the top, called head-lines. The character head-lines usually connect one another in a word and the word appears as a single component. In the proposed method, the connected components are at first labeled and selected. The upper envelopes of selected components are found by column-wise scanning from the top of the component. Portions of the upper envelope satisfying the properties of a digital straight line are detected. They are then clustered into groups belonging to single text lines. Estimates from these individual clusters give the skew angle of each text line. The proposed multi-skew detection technique has an accuracy about 98.3%.
The paper deals with an optical character recognition system for printed Oriya, a popular Indian script. The development of OCR for this script is difficult because a large number of characters have to be recognized. ...
详细信息
ISBN:
(纸本)0769512631
The paper deals with an optical character recognition system for printed Oriya, a popular Indian script. The development of OCR for this script is difficult because a large number of characters have to be recognized. In the proposed system, the digitized document image is first passed through preprocessing modules like skew correction, line segmentation, zone detection, word and character segmentation, etc. These modules have been developed by combining some conventional techniques with some newly proposed ones. Next, individual characters are recognized using a combination of stroke and run-number based features, along with features obtained from the concept of a water reservoir. The feature detection methods are simple and robust. A prototype of the system has been tested on a variety of printed Oriya material, and currently achieves 96.3% character level accuracy on average.
In a general situation, a document page may contain several scriptforms. For optical character recognition (OCR) of such a document page, it is necessary to separate the scripts before feeding them to their individual...
详细信息
In a general situation, a document page may contain several scriptforms. For optical character recognition (OCR) of such a document page, it is necessary to separate the scripts before feeding them to their individual OCR systems. An automatic technique for the identification of printed Roman, Chinese, Arabic, Devnagari and Bangla text lines from a single document is proposed. Shape based features, statistical features and some features obtained from the concept of a water reservoir are used for script identification. The proposed scheme has an accuracy of about 97.33%.
A modified Genetic Algorithm (GA) based search strategy is presented here that is computationally more efficient than the conventional GA. Here the idea is to start a GA with the chromosomes of small length. Such chro...
详细信息
Deals with a scheme for automatic segmentation of unconstrained handwritten connected numerals. The scheme is mainly based on features obtained from a new concept based on a water reservoir. A reservoir is a metaphor ...
详细信息
ISBN:
(纸本)0769512631
Deals with a scheme for automatic segmentation of unconstrained handwritten connected numerals. The scheme is mainly based on features obtained from a new concept based on a water reservoir. A reservoir is a metaphor to illustrate the region where numerals touch. The reservoir is obtained by considering accumulation of water poured from the top or from the bottom of the numerals. At first, considering the reservoir location and size, touching positions (top, middle and bottom) are decided. Next, by analyzing the reservoir boundary, touching position and topological features of the touching pattern, the best cutting point is determined. Finally, combined with morphological structural features the cutting path for segmentation is generated.
Bangla is the second most widely spoken language in the Indian subcontinent, yet has not been the focus of much research activity in either corpus linguistics or language engineering to date. This paper describes the ...
We propose an approach for understanding mathematical expressions in printed documents. The overall approach is divided into three main steps: (i) detection of mathematical expressions in a document, (ii) recognition ...
详细信息
ISBN:
(纸本)0769507506
We propose an approach for understanding mathematical expressions in printed documents. The overall approach is divided into three main steps: (i) detection of mathematical expressions in a document, (ii) recognition of the symbols present in the expression and (iii) arrangement of the recognized symbols. The detection of mathematical expressions is done through recognition of a few most common symbols and exploiting some structural features of the expressions. A hybrid of feature based and a template-based technique is used for the recognition of symbols. A two-pass approach is used for arrangement of the symbols. The first pass (scanning or lexical analysis) performs a micro-level examination of the symbols in order to identify the symbol groups occurring in them and to determine their categories or descriptors. The second pass (parsing or syntax analysis) processes the descriptors synthesized in the first pass, to determine the syntactic structure of the expression. A set of predefined rules guides the activities in both the passes. Experiments conducted using this approach on a large number of documents show high accuracy.
Over the last decade or so, remarkable developments in computer technology have given a major impetus to research in the field of multimedia. With the proliferation of the Internet and the increasingly widespread use ...
详细信息
Extraction of some meta-information from printed documents without an OCR approach is considered. It can be statistically verified that important terms in articles are printed in italic, bold and all capital style. De...
Segmentation of handwritten words into characters is one of the important components in handwritten text OCR. In this paper we put forward a method for the segmentation of handwritten Bangla (an Indo-Bangladeshi langu...
暂无评论