Background: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered a...
详细信息
Background: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees. Results: Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs. Conclusions: We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.
In this paper we present a simple linear-time algorithm constructing a context-free grammar of size O(glog(N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar gene...
详细信息
In this paper we present a simple linear-time algorithm constructing a context-free grammar of size O(glog(N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar generating this string. The algorithm works for arbitrary size alphabets, but the running time is linear assuming that the alphabet Sigma of the input string can be identified with numbers from (1,..., N-C) for some constant c. Otherwise, additional cost of O(N log vertical bar Sigma vertical bar) is needed. Algorithms with such an approximation guarantee and running time are known, the novelty of this paper is a particular simplicity of the algorithm as well as the analysis of the algorithm, which uses a general technique of recompression recently introduced by the author. Furthermore, contrary to the previous results, this work does not use the LZ representation of the input string in the construction, nor in the analysis. (C) 2015 Elsevier B.V. All rights reserved.
A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algori...
详细信息
A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.
A space-efficient approximation algorithm for the grammar-based compression problem, which requests for a given string to find a smallest context-free grammar deriving the string, is presented. For the input length it...
详细信息
A space-efficient approximation algorithm for the grammar-based compression problem, which requests for a given string to find a smallest context-free grammar deriving the string, is presented. For the input length it and an optimum CFG size g, the algorithm consumes only O(g log g) space and O(n log*n) time to achieve O((log*n) log n) approximation ratio to the optimum compression, where log*n is the maximum number of logarithms satisfying log log...log n > 1. This ratio is thus regarded to almost O(log n), which is the currently best approximation ratio. While g depends on the string, it is known that g = Omega(log n) and g = O(n/log(k)n) for strings from k-letter alphabet [12].
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiti...
详细信息
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T [1, u], then an SLP-compressed representation of T requires 2n log(2) n bits. For that same SLP, our self-index takes O(n log n) + n log(2) u bits. It extracts any text substring of length m in time O ((m + h) log n), and finds occ occurrences of a pattern string of length m in time O ((m (m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log(2) n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules;(ii) a representation for binary relations with labels supporting various extended queries;(iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.
grammar-based compression is a well-studied technique to construct a context-free grammar (CFG) deriving a given text uniquely. In this work, we propose an online algorithm for grammar-based compression. Our algorithm...
详细信息
grammar-based compression is a well-studied technique to construct a context-free grammar (CFG) deriving a given text uniquely. In this work, we propose an online algorithm for grammar-based compression. Our algorithm guarantees O (log(2) n) approximation ratio for the minimum grammar size, where n is an input size, and it runs in input linear time and output linear space. In addition, we propose a practical encoding, which transforms a restricted CFG into a more compact representation. Experimental results by comparison with standard compressors demonstrate that our algorithm is especially effective for highly repetitive text.
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary...
详细信息
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary tree sources with certain properties. In this paper, a natural class of binary tree sources is presented such that the demanded properties are fulfilled. Moreover, for both subclasses considered in the paper of Zhang, Yang, and Kieffer, their result is improved by deriving bounds on the maximal pointwise redundancy (or worst-case redundancy) instead of the average-case redundancy. Finally, using context-free tree grammars instead of DAGs, upper bounds on the maximal pointwise redundancy for certain binary tree sources are derived. This yields universal codes for new classes of binary tree sources.
We present a simple linear-time algorithm constructing a context-free grammar of size O(g log(N/g)) for the input string of size N, where g the size of the optimal grammar generating this string. The algorithm works f...
详细信息
ISBN:
(纸本)9783642389054;9783642389047
We present a simple linear-time algorithm constructing a context-free grammar of size O(g log(N/g)) for the input string of size N, where g the size of the optimal grammar generating this string. The algorithm works for arbitrary size alphabets, but the running time is linear assuming that the alphabet S of the input string is a subset of {1, ... , N-c} for some constant c. Algorithms with such an approximation guarantees and running time are known, the novelty of this paper is the particular simplicity of the algorithm as well as the analysis of the algorithm, which uses a general technique of recompression recently introduced by the author. Furthermore, contrary to the previous results, this work does not use the LZ representation of the input string in the construction, nor in the analysis.
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary...
详细信息
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary tree sources with certain properties. In this paper, a natural class of binary tree sources is presented such that the demanded properties are fulfilled. Moreover, for both subclasses considered in the paper of Zhang, Yang, and Kieffer, their result is improved by deriving bounds on the maximal pointwise redundancy (or worst-case redundancy) instead of the average-case redundancy. Finally, using context-free tree grammars instead of DAGs, upper bounds on the maximal pointwise redundancy for certain binary tree sources are derived. This yields universal codes for new classes of binary tree sources.
Revealing evolution of organisms is one of important biological research topics, and is also useful for understanding the origin of organisms. Hence, genomic sequences have been compared and aligned for finding conser...
详细信息
ISBN:
(纸本)9789897583988
Revealing evolution of organisms is one of important biological research topics, and is also useful for understanding the origin of organisms. Hence, genomic sequences have been compared and aligned for finding conserved and functional regions. A protein can contain several domains, which are known as structural and functional units. In the previous work, a proteome, whole kinds of proteins in an organism, was regarded as a set of sequences of protein domains, and a grammar-based compression algorithm was developed for a proteome, where production rules in the grammar represented evolutionary processes, mutation and duplication. In this paper, we propose a similarity measure based on the grammar-based compression, and apply it to hierarchical clustering of seven organisms, Homo sapiens. Mus musculus, Drosophila melanogaster, Caenorhabditis elegans. Sacchammyces cerevisiae. Arabidopsis thaliana. and Escherichia coli. The results suggest that our similarity measure could classify the organisms very well.
暂无评论