Bangla is the second most widely spoken language in the Indian subcontinent, yet it has not been the focus of much research activity in either corpus linguistics or language engineering to date. This paper describes the ...
ISBN: 0769507506 (print)
We propose an approach for understanding mathematical expressions in printed documents. The approach is divided into three main steps: (i) detection of mathematical expressions in a document, (ii) recognition of the symbols present in each expression, and (iii) arrangement of the recognized symbols. Mathematical expressions are detected by recognizing a few of the most common symbols and exploiting some structural features of the expressions. A hybrid of feature-based and template-based techniques is used to recognize the symbols. A two-pass approach is used to arrange them. The first pass (scanning, or lexical analysis) examines the symbols at the micro level to identify the symbol groups occurring in the expression and to determine their categories, or descriptors. The second pass (parsing, or syntax analysis) processes the descriptors synthesized in the first pass to determine the syntactic structure of the expression. A set of predefined rules guides both passes. Experiments conducted using this approach on a large number of documents show high accuracy.
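The two-pass arrangement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Symbol` and `Token` classes, the three descriptor names, the geometric thresholds, and the superscript-only layout rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    label: str   # recognized character, e.g. "x" or "2"
    x: float     # left edge of the bounding box
    y: float     # vertical centre (image coordinates: smaller means higher)
    h: float     # bounding-box height

@dataclass
class Token:
    text: str
    kind: str    # descriptor: "number", "identifier", or "operator" (assumed set)
    y: float
    h: float

def categorize(ch):
    """Assign an assumed descriptor to a single recognized character."""
    if ch.isdigit():
        return "number"
    if ch.isalpha():
        return "identifier"
    return "operator"

def scan(symbols):
    """Pass 1 (lexical analysis): group symbols and attach descriptors."""
    tokens = []
    for s in sorted(symbols, key=lambda s: s.x):
        kind = categorize(s.label)
        last = tokens[-1] if tokens else None
        same_line = last is not None and abs(last.y - s.y) < 0.25 * max(last.h, s.h)
        if kind == "number" and same_line and last.kind == "number":
            last.text += s.label          # merge adjacent digits on one line
        else:
            tokens.append(Token(s.label, kind, s.y, s.h))
    return tokens

def parse(tokens):
    """Pass 2 (syntax analysis): use descriptors and position to build a string."""
    out, base = [], None
    for t in tokens:
        raised = (base is not None and t.kind != "operator"
                  and t.y < base.y - 0.3 * base.h and t.h < base.h)
        if raised:
            out.append("^{" + t.text + "}")   # smaller and higher: superscript
        else:
            out.append(t.text)
            if t.kind != "operator":
                base = t                      # new reference for later symbols
    return "".join(out)

# Hypothetical input: the symbols of "x squared plus ten".
symbols = [
    Symbol("x", 0, 10, 10), Symbol("2", 6, 5, 6), Symbol("+", 12, 10, 8),
    Symbol("1", 20, 10, 10), Symbol("0", 26, 10, 10),
]
print(parse(scan(symbols)))  # x^{2}+10
```

Keeping the grouping rules in `scan` and the layout rules in `parse` mirrors the separation between the two passes; a fuller rule set (fractions, subscripts, radicals) would extend `parse` without changing `scan`.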
Over the last decade or so, remarkable developments in computer technology have given a major impetus to research in the field of multimedia. With the proliferation of the Internet and the increasingly widespread use ...
In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other official Indian la...
There are many types of documents where machine-printed and hand-written texts appear intermixed. Since the optical character recognition (OCR) methodologies for machine-printed and hand-written texts are different, i...
In this paper, we propose an approach for understanding mathematical expressions in printed documents. The system consists of three main components, namely (i) detection of mathematical expressions in a document, (ii) r...
Extraction of meta-information from printed documents without an OCR approach is considered. It can be statistically verified that important terms in articles are printed in italic, bold, or all-capital style. Detecting these type styles helps in automatically extracting the lines containing titles, authors' names, subtitles, and references, as well as sentences in which important terms occur. It also helps improve OCR performance on italic text. Experimental results on the performance of the approach on good-quality as well as degraded document images are presented.
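The abstract does not say how the type styles are detected. As one plausible heuristic only, and not the method of the paper, bold words can be flagged without OCR by comparing stroke thickness across words. The sketch below assumes each word is already segmented into a binary image (rows of 0/1 values, 1 meaning ink); the function names and the `ratio` threshold are hypothetical.

```python
from statistics import median

def run_lengths(row):
    """Lengths of consecutive runs of ink pixels (value 1) in one image row."""
    runs, count = [], 0
    for px in row:
        if px:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def stroke_width(word_image):
    """Estimate stroke width as the median horizontal ink run length."""
    runs = [r for row in word_image for r in run_lengths(row)]
    return median(runs) if runs else 0

def find_bold(word_images, ratio=1.5):
    """Return indices of words whose strokes are much thicker than typical."""
    if not word_images:
        return []
    widths = [stroke_width(w) for w in word_images]
    typical = median(widths)   # most words on a page are assumed to be regular
    return [i for i, w in enumerate(widths) if typical and w >= ratio * typical]
```

Using the median rather than the mean keeps long horizontal strokes from dominating the estimate, which matters on degraded images where touching characters produce occasional very long runs.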
Segmentation of handwritten words into characters is one of the important components in handwritten text OCR. In this paper we put forward a method for the segmentation of handwritten Bangla (an Indo-Bangladeshi language) text into characters. Based on certain characteristics of Bangla writing, different zones across the height of the word are detected. These zones provide structural information about the constituent characters of the word. In handwritten Bangla text, the rectangular hulls of successive characters often overlap, so the characters are seldom vertically separable. We therefore propose a method of recursive contour following in one of the zones across the height of the word to find the extents within which the main portion of each character lies. If successive characters do not touch in the zone of contour following, the algorithm gives fairly good results.
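The idea of restricting the search to one zone can be illustrated with a simplified sketch. It substitutes plain connected-component tracing (with an explicit stack instead of recursion) for the recursive contour following of the paper, and it assumes the zone boundaries `top` and `bottom` have already been found; the function name and the binary-image representation (rows of 0/1 values, 1 meaning ink) are assumptions for the example.

```python
def character_extents(image, top, bottom):
    """Return (left, right) column extents of each ink component in rows top..bottom-1."""
    zone = [row[:] for row in image[top:bottom]]   # copy, so the input is untouched
    height = len(zone)
    width = len(zone[0]) if zone else 0
    extents = []
    # Scan column by column so components are reported from left to right.
    for c0 in range(width):
        for r0 in range(height):
            if not zone[r0][c0]:
                continue
            left = right = c0
            stack = [(r0, c0)]
            zone[r0][c0] = 0                       # mark as visited
            while stack:
                r, c = stack.pop()
                left, right = min(left, c), max(right, c)
                for dr in (-1, 0, 1):              # visit 8-connected neighbours
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width and zone[rr][cc]:
                            zone[rr][cc] = 0
                            stack.append((rr, cc))
            extents.append((left, right))
    return extents

# Hypothetical word image: the upper stroke in row 0 overhangs the second
# character, so the full bounding rectangles overlap, but rows 1-2 separate.
image = [
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(character_extents(image, 1, 3))  # [(1, 2), (5, 6)]
```

As in the abstract, the sketch only separates characters that do not touch inside the chosen zone; two characters joined within rows `top..bottom-1` would be reported as a single extent.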