ISBN: 9781509056286 (print)
Text compression algorithms typically operate at the character level. Bangla text has some distinctive features, such as the absence of distinct upper- and lower-case letters, consonant clusters (CC), and consonants with a dependent vowel sign (CV). The conventional Lempel-Ziv-Welch (LZW) algorithm is not well suited to compressing Bangla text. Therefore, in this paper, we propose a modified LZW (MLZW) algorithm that compresses Bangla text effectively and efficiently. In the proposed method, a dictionary with Unicode ranges from 1-90 is used for Bangla characters. Compression starts by checking the input character. If the input character is part of a CC or CV, the whole CC or CV is treated as a single character and searched for in the dictionary. If the character to be encoded is already in the dictionary, it is encoded with its dictionary index. Otherwise, the character is added to the dictionary and encoded with its corresponding dictionary index. Simulation results indicate that the proposed MLZW algorithm compresses Bangla text effectively and efficiently. We observed that the proposed MLZW provides a higher compression rate than the LZW algorithm, by approximately 3% for the dictionary index and 33% for the output sequence.
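The dictionary step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' MLZW implementation: the function names (`segment`, `encode`, `decode`), the code point ranges used to detect consonants and dependent vowel signs, and the choice to start from an empty dictionary (rather than the preloaded one the abstract mentions) are all assumptions made for the example.

```python
VIRAMA = "\u09cd"  # Bangla hasanta; joins consonants into a cluster (CC)


def is_consonant(ch):
    # Assumed range: Bangla consonants KA..HA plus RRA, RHA, YYA.
    return "\u0995" <= ch <= "\u09b9" or ch in "\u09dc\u09dd\u09df"


def is_vowel_sign(ch):
    # Assumed range: dependent vowel signs AA..AU plus the AU length mark.
    return "\u09be" <= ch <= "\u09cc" or ch == "\u09d7"


def segment(text):
    """Split text into units, keeping each CC or CV together as one unit."""
    units, i, n = [], 0, len(text)
    while i < n:
        j = i + 1
        if is_consonant(text[i]):
            # Absorb every "virama + consonant" pair (consonant cluster).
            while j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2
            # Absorb one trailing dependent vowel sign, if present.
            if j < n and is_vowel_sign(text[j]):
                j += 1
        units.append(text[i:j])
        i = j
    return units


def encode(text):
    """Emit one dictionary index per unit, adding unseen units on the fly."""
    dictionary, codes = {}, []
    for unit in segment(text):
        if unit not in dictionary:
            dictionary[unit] = len(dictionary) + 1  # indices start at 1
        codes.append(dictionary[unit])
    return codes, dictionary


def decode(codes, dictionary):
    """Invert encode(), given the dictionary it produced."""
    reverse = {index: unit for unit, index in dictionary.items()}
    return "".join(reverse[code] for code in codes)
```

In this sketch the decoder needs the dictionary that the encoder built; how the paper transmits or reconstructs the dictionary is not stated in the abstract.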
ISBN: 9781450343060 (print)
In this note the author describes an initiative to create a keyboard for Android mobile devices that can type characters for a West African language called Kaansa, spoken by perhaps 10,000 Kaan people in Burkina Faso. The Kaan community has only recently established a written orthography and begun formal literacy training for adults and youths. This note examines certain currently available mobile technologies that allow texting in Kaansa and considers future efforts to measure the impact of such technologies on the literacy rate among several demographics.
ISBN: 9781509009220 (print)
In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu, and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semiautomatic approach to identifying and unifying them.
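The unification step can be illustrated with a small sketch. It is not the semiautomatic method the abstract proposes: it assumes a hand-written mapping table (`AMBIGUOUS`) with two well-known look-alike pairs, and the helper names `unify` and `find_duplicates` are hypothetical.

```python
from collections import defaultdict

# Assumed mapping: Arabic code points that look like their Farsi/Kurdish
# counterparts in many fonts. A real table would be larger and language-specific.
AMBIGUOUS = {
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TABLE = str.maketrans(AMBIGUOUS)


def unify(text):
    """Replace each ambiguous character with its chosen canonical form."""
    return text.translate(_TABLE)


def find_duplicates(words):
    """Group raw word forms that collapse to the same unified form."""
    groups = defaultdict(set)
    for word in words:
        groups[unify(word)].add(word)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}
```

Running `find_duplicates` over a harvested word list surfaces the duplicated entries that a human reviewer could then confirm, which is one way to read the "semiautomatic" part of the approach.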
ISBN: 9781509008490 (print)
Preserving old archives in a readable and editable structure helps people to gain additional experience. Tulu is one of five noteworthy Dravidian dialects, and numerous Tulu historical documents are available in handwritten form. Tulu scripts are rich in patterns, with many combinations of connected characters. Hence, machine recognition is a major challenge. To date, no method has been reported for recognizing the Tulu script, an ancient script of South India. The main aim of this paper is to introduce the salient features of the Tulu script and to list the approaches used for handwritten character recognition. The paper then gives future research directions on recognition and understanding of the Tulu script.
Unicode 6.1 (2012) encoded more than 74,000 Han characters. This large repertoire could solve the problem of unencoded Han characters to a significant extent. However, most information systems today still support input and display of only the first 20,902 Han characters encoded in Unicode 1.0 (1991). Even in the latest systems, designed to support 32-bit Unicode and with suitable fonts installed, it is not easy to use these newly encoded Han characters. We note that many of them are rarely used in users' everyday life. An ordinary user may be unsure of their glyph shapes, pronunciations, meanings, and usages. IMEs (input method editors) for Han characters usually require users to have good knowledge of the characters they want. It is not unusual for users to try and fail to input unfamiliar Han characters. In this paper, we present an auxiliary service for looking up Unicode Han characters by radicals. One can use any Han character IME to key in one or more radicals to look up a wanted Han character. Every Unicode Han character is decomposed into a glyph expression of radicals. The similarity between the glyph expression and the user input is estimated by a derived edit distance algorithm, and the most similar Unicode Han characters are returned. As a result, the system gives users a convenient way to look up unfamiliar Unicode Han characters.
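A rough sketch of radical-based lookup is given below. It uses plain Levenshtein distance over radical sequences, not the derived edit distance the abstract refers to, and the `DECOMPOSITION` table is a tiny hypothetical stand-in for the full glyph-expression data.

```python
# Hypothetical decomposition table: character -> sequence of components.
DECOMPOSITION = {
    "明": ["日", "月"],
    "林": ["木", "木"],
    "森": ["木", "木", "木"],
    "好": ["女", "子"],
    "休": ["亻", "木"],
}


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lookup(radicals, top_k=3):
    """Return the top_k characters whose decomposition is closest to the query."""
    query = list(radicals)
    ranked = sorted(
        DECOMPOSITION,
        key=lambda ch: edit_distance(query, DECOMPOSITION[ch]),
    )
    return ranked[:top_k]
```

For example, a user who types the radical 木 twice with any IME would call `lookup("木木")` and get 林 ranked first, followed by near matches.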
ISBN: 9781509010257 (print)
This paper presents an alternative communication technique to help people who, for various reasons, suffer from speech and language difficulties. Electronic speech synthesis is the process of generating human-like speech from any text input to emulate a human speaker. The objective of the text-to-speech system is to convert arbitrary Kannada text into its corresponding spoken waveform, using the phoneme as the basic unit for speech synthesis. A standard syllable-level speech database consisting of 525 syllables is built for synthesizing natural-sounding speech. The main advantage of this system is its real-time approach to converting entered text to the corresponding speech. The initial and final points of a speech waveform are determined using maximum energy and zero-crossing rate. The unit-selection-based concatenation method is chosen for syllable concatenation, and the system is implemented using MATLAB.
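Endpoint detection from energy and zero-crossing rate can be sketched as follows. The paper's system is implemented in MATLAB and its exact rule is not given in the abstract, so this Python version is only an assumption-laden illustration: the frame length, the thresholds (`energy_ratio`, `zcr_threshold`), and the function names are all hypothetical.

```python
def frame_features(samples, frame_len):
    """Return (short-time energy, zero-crossing rate) for each full frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        features.append((energy, crossings / (frame_len - 1)))
    return features


def find_endpoints(samples, frame_len=20, energy_ratio=0.1, zcr_threshold=0.3):
    """Return (start, end) sample indices of the detected speech, or None.

    Assumed rule: a frame counts as speech if its energy reaches a fraction
    of the maximum frame energy, or if it has moderate energy together with
    a high zero-crossing rate (a common cue for unvoiced sounds).
    """
    features = frame_features(samples, frame_len)
    max_energy = max((e for e, _ in features), default=0.0)
    if max_energy == 0:
        return None  # silence only, or signal shorter than one frame
    high = energy_ratio * max_energy
    low = 0.1 * high
    active = [
        k for k, (e, z) in enumerate(features)
        if e >= high or (e >= low and z >= zcr_threshold)
    ]
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

The returned indices could then be used to trim each recorded syllable before it is stored in the database or concatenated.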
ISBN: 9781509007745 (print)
Automatic recognition of handwritten characters from scanned images helps to convert the characters in an image into a convenient editable and readable form. Tulu is a South Indian Dravidian language with a rich set of handwritten patterns. This paper presents an approach to recognizing the Tulu script using an automatic character recognition mechanism. The recognition of handwritten Tulu characters is based on the AdaBoost algorithm using Haar features. Finally, recognized characters are mapped into an equivalent editable document of Kannada characters. This makes the script readable for the next generation through digital technology.
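As background for the Haar features mentioned here, the sketch below computes a two-rectangle Haar-like feature from an integral image. It only illustrates the feature computation; the AdaBoost training that would select and weight thresholded features is omitted, and the function names and the list-of-lists image format are assumptions for the example.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column at the top/left."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y) and size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect(ii, x, y, w, h):
    """Left half minus right half: responds to vertical edges."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

Because every rectangle sum takes four table lookups, many such features can be evaluated cheaply per character image, which is what makes them practical as weak learners for boosting.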
ISBN: 9781509011117 (print)
Character recognition from scanned images is a very complex task. For record keeping, however, all data must be in digital format so that various manipulation operations can be performed. The main issue in character recognition is the variety of styles and fonts in which text is written. We propose a new approach that uses an artificial neural network together with a nearest-neighbour approach for character recognition from scanned images. Three layers are used for classification. The input layer takes the segmented characters as input, the hidden layer consists of neurons trained by the training network, and the output layer consists of output neurons that generate Unicode.
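The nearest-neighbour part of such a pipeline, ending in a Unicode code point, might look like the sketch below. It does not reproduce the paper's neural network; the 3x3 binary templates in `TEMPLATES` and the function names are hypothetical and chosen only to keep the example small.

```python
# Hypothetical training set: flattened 3x3 binary bitmaps with their labels.
TEMPLATES = [
    ((0, 1, 0,
      0, 1, 0,
      0, 1, 0), "I"),
    ((1, 0, 0,
      1, 0, 0,
      1, 1, 1), "L"),
    ((1, 1, 1,
      0, 1, 0,
      0, 1, 0), "T"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(bitmap):
    """Return the Unicode code point of the nearest template's label."""
    _, label = min(TEMPLATES, key=lambda t: squared_distance(bitmap, t[0]))
    return ord(label)
```

A segmented character that differs from the stored "L" by one pixel still maps to code point 0x4C, so `chr(classify(bitmap))` yields editable text.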