In this paper, we propose an approach for understanding mathematical expressions in printed documents. The system consists of three main components, namely (i) detection of mathematical expressions in a document, (ii) r...
Extraction of meta-information from printed documents without an OCR approach is considered. It can be statistically verified that important terms in articles are printed in italic, bold, and all-capital styles. Detecting these type styles helps in automatically extracting the lines containing titles, authors' names, subtitles, and references, as well as sentences in which important terms occur. It also helps improve OCR performance on italic text. Experimental results on the performance of the approach on good-quality as well as degraded document images are presented.
Segmentation of handwritten words into characters is one of the important components of handwritten text OCR. In this paper we put forward a method for segmenting handwritten Bangla (an Indo-Bangladeshi language) text into characters. Based on certain characteristics of Bangla writing, different zones across the height of the word are detected. These zones provide structural information about the constituent characters of the word. In handwritten Bangla text, the rectangular hulls of successive characters often overlap, so the characters are seldom vertically separable. We therefore propose a method of recursive contour following in one of the zones across the height of the word to find the extents within which the main portion of each character lies. If successive characters do not touch in the zone of contour following, the algorithm gives fairly good results.
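The abstract does not say how the zones are detected. As a rough illustration only, the sketch below assumes one common approach for Bangla: treating the row with the most ink in a binary word image as the headline and splitting the word into an upper zone and a body zone around it. The function names `horizontal_profile` and `detect_zones` and the max-ink rule are assumptions made for this example, not the authors' method, and handwritten headlines that are slanted or broken would need something more robust.

```python
def horizontal_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def detect_zones(image):
    """Split a binary word image into zones around an estimated headline.

    Assumption (not from the paper): the headline is the row with the
    most ink. Row ranges are half-open: (start, end) covers start..end-1.
    """
    profile = horizontal_profile(image)
    if not profile or max(profile) == 0:
        raise ValueError("image contains no ink")

    headline = profile.index(max(profile))
    ink_rows = [i for i, count in enumerate(profile) if count > 0]
    top, bottom = ink_rows[0], ink_rows[-1]

    return {
        "upper": (top, headline),            # strokes above the headline
        "headline": headline,                # estimated headline row
        "body": (headline + 1, bottom + 1),  # main portion of the characters
    }


# Tiny hand-made word image: row 1 plays the role of the headline.
word = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
zones = detect_zones(word)
```

A contour-following step such as the one the abstract describes would then operate inside one of these row ranges rather than on the whole word.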
In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other official Indian languages. For OCR of such a document page, it is necessary to separate these three script forms before feeding them to the OCRs of the individual scripts. In this paper, an automatic technique for separating the text lines using script characteristics and shape-based features is presented. At present, the system has an overall accuracy of about 98.5%.
There are many types of documents where machine-printed and hand-written texts appear intermixed. Since the optical character recognition (OCR) methodologies for machine-printed and hand-written texts are different, it is necessary to separate these two types of text before feeding them to the respective OCR systems. In this paper, we present such a scheme for both Bangla and Devnagari characters. The scheme is based on the structural and statistical features of the machine-printed and hand-written text lines. The classification scheme has an accuracy of about 98.3%.
Over the last decade or so, remarkable developments in computer technology have given a major impetus to research in the field of multimedia. With the proliferation of the Internet and the increasingly widespread use of sophisticated computers, the multimedia revolution has arrived in India as well. It is therefore time to take stock of the situation: to evaluate how existing techniques can be used in the Indian context and to determine what new methods have to be developed. This paper summarizes the current state of multimedia technology in India and points to directions for further work. As more and more people in India begin to use computers and the Internet, multimedia capabilities will start playing a vital role in solving problems in many different areas. Education is probably one of the most important areas where multimedia technology can have a major impact. Already, multimedia educational systems are being developed in Indian languages. Several interactive encyclopaedia-like environments are also being marketed on CD-ROMs, and cover topics ranging from Indian classical music to Indian history, using text, images and sound. Some of the other possible applications of multimedia technology are: the development of digital libraries, news and information dissemination services, medicine, business and commerce, and the entertainment industry. Multimedia information technology is thus poised to become an exciting area for research and development activities in India.
We propose simple and fast algorithms for detecting italic, bold and all-capital words without performing actual character recognition. We present a statistical study which reveals that the detection of such words may play a key role in automatic information retrieval from documents. Moreover, detection of italic words can be used to improve the recognition accuracy of a text recognition system. We tested a considerable number of document images; the algorithms give accurate results on all of them and are very easy to implement.
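The abstract does not give the detection algorithms themselves. To make the idea of recognition-free style detection concrete, here is a minimal sketch of one plausible bold-word heuristic: estimate stroke thickness from the mean horizontal run length of ink pixels in each word image and flag words that are much thicker than the page median. The names `mean_run_length` and `flag_bold_words` and the `ratio` threshold of 1.5 are assumptions chosen for illustration, not the authors' procedure.

```python
from statistics import median


def mean_run_length(image):
    """Average length of horizontal runs of ink (value 1) in a binary image."""
    runs = []
    for row in image:
        length = 0
        for pixel in row:
            if pixel:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:  # run that reaches the right edge
            runs.append(length)
    return sum(runs) / len(runs) if runs else 0.0


def flag_bold_words(word_images, ratio=1.5):
    """Flag words whose stroke thickness exceeds ratio * page median.

    Assumption: most words on the page are set in regular weight, so the
    median thickness approximates the normal stroke width.
    """
    widths = [mean_run_length(img) for img in word_images]
    positive = [w for w in widths if w > 0]
    if not positive:
        return [False] * len(widths)
    baseline = median(positive)
    return [w > ratio * baseline for w in widths]


# Toy word images: three thin-stroke words and one thick-stroke word.
thin = [[0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0]]
thick = [[0, 1, 1, 1, 0, 1, 1, 1]]
flags = flag_bold_words([thin, thin, thick, thin])
```

Because the baseline is a page-level median, a heuristic like this would misfire on pages set mostly in bold; a real system would likely need a per-font or per-block reference.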