This paper introduces a new Lossless Segment Based DNA compression (LSBD) method for compressing DNA sequences. It stores the position of each individual gene in the compressed file. Since the LSBD method performs gene-wise compression, further processing of the compressed data reduces memory usage. The biggest advantage of this algorithm is that it enables part-by-part decompression and can work on data of any size. The method identifies individual gene locations and then constructs triplets that are mapped to an eight-bit number. Each gene's information is stored in a pointer table, and a pointer is provided to the corresponding location in the compressed file. The LSBD technique appropriately compresses non-base characters and performs well on repeating sequences.
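As an illustration of the triplet mapping, here is a minimal Python sketch assuming a 2-bits-per-base code (the paper only states that triplets map to eight-bit numbers, so the concrete assignment is an assumption; the spare byte values 64-255 would remain free for non-base characters):

    # Minimal sketch of triplet-to-byte packing in the spirit of LSBD.
    # The base-to-code assignment below is an assumption.
    BASE_CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def pack_triplets(seq: str) -> bytes:
        """Map each 3-base window to one byte (values 0..63)."""
        out = bytearray()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            b0, b1, b2 = (BASE_CODE[c] for c in seq[i:i + 3])
            out.append(b0 * 16 + b1 * 4 + b2)  # 4^3 = 64 combinations per byte
        return bytes(out)

    print(list(pack_triplets("ACGTGA")))  # [6, 56]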
In this paper we propose a BWT-based LZW algorithm that reduces both the compressed size and the compression time. BWT and MTF can expose potential redundancies in the input and thereby significantly improve the compression ratio of LZW. To avoid LZW's poor matching speed on long runs of the same character, we propose a variant of RLE named RLE-N. RLE-N does not affect the compression ratio, but it helps LZW reduce the execution time noticeably. The experimental results show that our algorithm performs well on ordinary files.
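The BWT and MTF stages are standard; the move-to-front sketch below shows why they help LZW, turning the long same-character runs that BWT produces into runs of small indices (RLE-N is not specified in enough detail here to reproduce):

    def mtf_encode(data: bytes) -> list[int]:
        """Move-to-front: recently seen symbols get small indices, so the
        long same-character runs produced by BWT become runs of zeros."""
        table = list(range(256))
        out = []
        for b in data:
            i = table.index(b)
            out.append(i)
            table.pop(i)
            table.insert(0, b)
        return out

    print(mtf_encode(b"aaaabbbb"))  # [97, 0, 0, 0, 98, 0, 0, 0]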
The usual way of ensuring the confidentiality of compressed data is to encrypt it with a standard encryption algorithm such as AES. However, encryption not only brings additional computational complexity, but also lacks the flexibility to perform pattern matching on the compressed data, which is an active research topic in stringology. In this study, we investigate secure compression solutions and propose a practical method to keep the contents of the compressed data hidden. The method is based on the Burrows-Wheeler transform (BWT), such that a randomly selected permutation of the input symbols is used as the lexicographical ordering during the construction. The motivation is the observation that, on the BWT of an input, it is not possible to perform a successful search, nor to reconstruct any part of the input, without correct knowledge of the character ordering. The proposed method is intended as an elegant alternative to the standard encryption approaches, with the advantage of supporting compressed pattern matching while still preserving confidentiality. When the input data is homophonic, such that the frequencies of the symbols are flat and the alphabet is sufficiently large, the proposed technique makes it possible to unify compression and security in a single framework instead of the two-level compress-then-encrypt paradigm.
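A minimal sketch of the core idea: a naive BWT whose rotation sort compares characters by a secret permutation of the alphabet rather than the standard order. The sentinel handling and key management here are assumptions:

    import random

    def bwt_secret_order(text: str, key_order: str) -> str:
        """Naive BWT sorted by a secret permutation of the alphabet."""
        rank = {c: i for i, c in enumerate(key_order)}
        s = text + "$"  # sentinel, kept first in the ordering here
        rotations = [s[i:] + s[:i] for i in range(len(s))]
        rotations.sort(key=lambda r: [rank[c] for c in r])
        return "".join(r[-1] for r in rotations)

    secret = "$" + "".join(random.sample("abn", 3))  # the secret key
    print(bwt_secret_order("banana", secret))

Without knowledge of the permutation, the rotation order used during construction cannot be reproduced, which is what makes searching on or inverting the transform fail.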
To reduce the storage space of finite automata used for regular expression matching, this paper studies the main idea of the delayed input DFA algorithm based on bounded default paths, and analyzes the problems the algorithm exhibits when the bound on the default path length is small. We then propose optimized algorithms based on a weight-first principle and a node-first principle and assess them on an actual rule set; the results show that the optimized algorithms can effectively improve the compression ratio when the default path bound is small.
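A toy sketch of the default-transition idea behind delayed input DFAs: each state stores only the transitions that differ from those of its default state, and a lookup follows default pointers until an explicit transition is found. The states and tables below are illustrative:

    full = {                          # full DFA: state -> {symbol: next}
        0: {'a': 1, 'b': 2, 'c': 0},
        1: {'a': 1, 'b': 2, 'c': 3},  # differs from state 0 only on 'c'
    }
    default = {1: 0}                  # state 1 defers to state 0
    compressed = {0: full[0], 1: {'c': 3}}

    def step(state: int, sym: str) -> int:
        while sym not in compressed[state]:
            state = default[state]    # each hop lengthens the default path
        return compressed[state][sym]

    print(step(1, 'b'), step(1, 'c'))  # 2 3

Bounding the default path length caps the extra lookups per input symbol, which is exactly the trade-off the optimized algorithms tune.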
Data compression algorithms are usually designed to process data symbol by symbol. The input symbols of these algorithms are typically taken from the ASCII table, i.e. the input alphabet has 256 symbols, representable by 8-bit numbers. Several other techniques have been developed: syllable-based compression, which uses the syllable as the basic compression symbol, and word-based compression, which uses words as the basic symbols. These three approaches are strictly separated and no overlap is allowed. This can be a problem, because it may be helpful to have an overlap between them and to use a character-based approach in which a few symbols are sequences of characters. This paper describes an algorithm that looks for the optimal alphabet for different text files. The alphabet may contain both characters and 2-grams.
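A hypothetical sketch of how such a mixed alphabet could be assembled, keeping all characters and greedily adding the most frequent 2-grams; the paper's actual optimality criterion is tied to the target compressor and is not reproduced here:

    from collections import Counter

    def build_alphabet(text: str, max_bigrams: int = 64) -> list[str]:
        """All characters plus the most frequent 2-grams (greedy assumption)."""
        chars = sorted(set(text))
        bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
        return chars + [bg for bg, _ in bigrams.most_common(max_bigrams)]

    print(build_alphabet("abracadabra abracadabra", 3))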
Currently, a large number of web sites are generated from web templates so as to improve the productivity of web site construction. However, the prevalence of web templates has a negative impact on the efficiency of search engines in many respects, including the relevance judgments of web IR and the resource usage of analysis tools. In this paper, we present a direct and fast method to detect pages built from the same template using DOM tree characteristics. After analyzing and compressing the DOM tree nodes of an HTML page, our method generates a hash-value digest, also called a fingerprint, for each page to identify its DOM structure. In addition, we introduce some other page features to aid in judging the page's template type. Through experimental evaluations over thirty thousand sub-domains, we show that our approach obtains the analysis results rapidly while maintaining an accuracy rate above 95 percent.
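A minimal sketch of structural fingerprinting in this spirit: ignore text and attributes, hash only the tag skeleton, so pages sharing a template collide on the same digest. The paper's node compression and auxiliary page features are omitted:

    import hashlib
    from html.parser import HTMLParser

    class StructureDigest(HTMLParser):
        """Hash only the open/close tag sequence of a page."""
        def __init__(self):
            super().__init__()
            self.shape = []
        def handle_starttag(self, tag, attrs):
            self.shape.append("<" + tag)
        def handle_endtag(self, tag):
            self.shape.append(tag + ">")
        def fingerprint(self, html: str) -> str:
            self.feed(html)
            return hashlib.md5("".join(self.shape).encode()).hexdigest()

    print(StructureDigest().fingerprint("<html><body><p>Hi</p></body></html>"))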
ISBN (print): 9781937284114
In this paper, we propose a novel approach to the automatic generation of aspect-oriented summaries from multiple documents. We first develop an event-aspect LDA model to cluster sentences into aspects. We then use an extended LexRank algorithm to rank the sentences in each cluster, and use Integer Linear Programming for sentence selection. Key features of our method include the automatic grouping of semantically related sentences and sentence ranking based on an extension of the random walk model. We also implement a new sentence compression algorithm that uses dependency trees instead of parse trees. We compare our method with four baseline methods. Quantitative evaluation based on the Rouge metric demonstrates the effectiveness and advantages of our method.
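The LexRank core is a damped random walk over a sentence-similarity graph; a minimal power-iteration sketch follows (the paper's extension, similarity measure, and damping value are assumptions here):

    import numpy as np

    def lexrank(sim: np.ndarray, d: float = 0.85, iters: int = 50) -> np.ndarray:
        """Power iteration over a row-normalized similarity matrix."""
        n = sim.shape[0]
        P = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transitions
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - d) / n + d * (r @ P)         # damped random walk step
        return r

    sim = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
    print(lexrank(sim))  # higher score = more central sentence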
The Normalized Compression Distance (NCD) has gained considerable interest in pattern recognition as a similarity measure applicable to unstructured data from very different domains, such as text, DNA sequences, or images. NCD uses existing compression programs such as gzip to compute the similarity between objects. NCD has unique features: it does not require any prior knowledge, data preprocessing, feature extraction, domain adaptation or any parameter settings. Further, the NCD can be applied to symbolic data and raw signals alike. In this paper we decompose the NCD and introduce a method to measure compression-based similarity without the need to use compression. The Length Delimiting Dictionary Distance (LD3) takes the one component essential to compression methods, dictionary generation, and strips the NCD of all dispensable components. The LD3 performs "compression-based pattern recognition without compression", keeping all of the above benefits of the NCD while achieving better speed and recognition rates. We first review the NCD, introduce LD3 as the "essence" of NCD, and evaluate LD3 on language tree experiments, authorship recognition, and genome phylogeny data.
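The NCD has a standard closed form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length; a minimal sketch using zlib as the reference compressor (the LD3 dictionary-only variant is not reproduced here):

    import zlib

    def C(data: bytes) -> int:
        """Compressed length under the reference compressor."""
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy = C(x), C(y)
        return (C(x + y) - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy cat " * 20
    c = b"lorem ipsum dolor sit amet consectetur elit " * 20
    print(ncd(a, b), ncd(a, c))  # the similar pair scores lower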
Data compression algorithms are usually designed to process data symbol by symbol. Symbols are usually characters or bytes, but several other techniques may be used. The best-known alternative is to use syllables or words as symbols. Another approach is to take 2-grams, 3-grams or, in general, n-grams as symbols. All these approaches have pros and cons, but none of them is the best for every file. This paper describes an approach that evolves an alphabet of characters and 2-grams which is optimal for the text files being compressed. The efficiency of the approach is tested on three compression algorithms.
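Complementing the alphabet-construction sketch above, a greedy longest-match tokenizer shows how a text could be segmented over a mixed alphabet of characters and 2-grams before compression; the segmentation policy is an illustrative assumption:

    def tokenize(text: str, alphabet: set[str]) -> list[str]:
        """Greedy segmentation preferring 2-gram symbols when available."""
        out, i = [], 0
        while i < len(text):
            if text[i:i + 2] in alphabet:
                out.append(text[i:i + 2])
                i += 2
            else:
                out.append(text[i])  # fall back to a single character
                i += 1
        return out

    print(tokenize("banana", {"an", "a", "b", "n"}))  # ['b', 'an', 'an', 'a']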
The DSRC standard provides for processing messages in XML format instead of the default binary encoding scheme. Although the use of XML encoding provides flexibility, it also introduces bandwidth overhead, since the message size increases, and end-to-end communication delay, since the processing time also increases. Many XML compression algorithms have been proposed to decrease message size, but few have been tested for processing and communication delays in inter-vehicle communication. In this paper, performance measurements for various compression algorithms on DSRC messages are presented.
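An illustrative measurement harness in this spirit: time a general-purpose compressor on a small XML message and report the sizes. The field names in the sample message and the choice of compressor are assumptions, not the paper's actual DSRC message sets:

    import time, zlib

    msg = ("<BasicSafetyMessage><speed>12.4</speed>"
           "<heading>271</heading><lat>40.1</lat><lon>-83.2</lon>"
           "</BasicSafetyMessage>").encode()

    t0 = time.perf_counter()
    packed = zlib.compress(msg, 9)
    dt = time.perf_counter() - t0

    print(f"original {len(msg)} B, compressed {len(packed)} B, {dt * 1e6:.0f} us")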