Neighborhood queries are the most common queries on graphs;thus, it is desirable to answer them efficiently on compressed data structures. Our full paper [1] presents a grammar-based compressor called Incidence-Type-R...
详细信息
Revealing evolution of organisms is one of important biological research topics, and is also useful for understanding the origin of organisms. Hence, genomic sequences have been compared and aligned for finding conser...
详细信息
ISBN:
(纸本)9789897583988
Revealing evolution of organisms is one of important biological research topics, and is also useful for understanding the origin of organisms. Hence, genomic sequences have been compared and aligned for finding conserved and functional regions. A protein can contain several domains, which are known as structural and functional units. In the previous work, a proteome, whole kinds of proteins in an organism, was regarded as a set of sequences of protein domains, and a grammar-based compression algorithm was developed for a proteome, where production rules in the grammar represented evolutionary processes, mutation and duplication. In this paper, we propose a similarity measure based on the grammar-based compression, and apply it to hierarchical clustering of seven organisms, Homo sapiens. Mus musculus, Drosophila melanogaster, Caenorhabditis elegans. Sacchammyces cerevisiae. Arabidopsis thaliana. and Escherichia coli. The results suggest that our similarity measure could classify the organisms very well.
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary...
详细信息
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary tree sources with certain properties. In this paper, a natural class of binary tree sources is presented such that the demanded properties are fulfilled. Moreover, for both subclasses considered in the paper of Zhang, Yang, and Kieffer, their result is improved by deriving bounds on the maximal pointwise redundancy (or worst-case redundancy) instead of the average-case redundancy. Finally, using context-free tree grammars instead of DAGs, upper bounds on the maximal pointwise redundancy for certain binary tree sources are derived. This yields universal codes for new classes of binary tree sources.
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary...
详细信息
The problem of universal source coding for binary trees is considered. Zhang, Yang, and Kieffer derived upper bounds on the average-case redundancy of codes based on directed acyclic graph (DAG) compression for binary tree sources with certain properties. In this paper, a natural class of binary tree sources is presented such that the demanded properties are fulfilled. Moreover, for both subclasses considered in the paper of Zhang, Yang, and Kieffer, their result is improved by deriving bounds on the maximal pointwise redundancy (or worst-case redundancy) instead of the average-case redundancy. Finally, using context-free tree grammars instead of DAGs, upper bounds on the maximal pointwise redundancy for certain binary tree sources are derived. This yields universal codes for new classes of binary tree sources.
Background: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered a...
详细信息
Background: Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees. Results: Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs. Conclusions: We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.
In this paper we present a simple linear-time algorithm constructing a context-free grammar of size O(glog(N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar gene...
详细信息
In this paper we present a simple linear-time algorithm constructing a context-free grammar of size O(glog(N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar generating this string. The algorithm works for arbitrary size alphabets, but the running time is linear assuming that the alphabet Sigma of the input string can be identified with numbers from (1,..., N-C) for some constant c. Otherwise, additional cost of O(N log vertical bar Sigma vertical bar) is needed. Algorithms with such an approximation guarantee and running time are known, the novelty of this paper is a particular simplicity of the algorithm as well as the analysis of the algorithm, which uses a general technique of recompression recently introduced by the author. Furthermore, contrary to the previous results, this work does not use the LZ representation of the input string in the construction, nor in the analysis. (C) 2015 Elsevier B.V. All rights reserved.
Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually inte...
详细信息
ISBN:
(数字)9783031661594
ISBN:
(纸本)9783031661587;9783031661594
Finding maximal exact matches (MEMs) between strings is an important task in bioinformatics, but it is becoming increasingly challenging as geneticists switch to pangenomic references. Fortunately, we are usually interested only in the relatively few MEMs that are longer than we would expect by chance. In this paper we show that under reasonable assumptions we can find all MEMs of length at least L between a pattern of length m and a text of length n in O(m) time plus extra O(log n) time only for each MEM of length at least nearly L using a compact index for the text, suitable for pangenomics.
grammar-based compression is a widely-accepted model of string compression that allows for efficient and direct manipulations on the compressed data. Most, if not all, such manipulations rely on the primitive random a...
详细信息
ISBN:
(纸本)9783031721991;9783031722004
grammar-based compression is a widely-accepted model of string compression that allows for efficient and direct manipulations on the compressed data. Most, if not all, such manipulations rely on the primitive random access queries, a task of quickly returning the character at a specified position of the original uncompressed string without explicit decompression. While there are advanced data structures for random access to grammar-compressed strings that guarantee theoretical query time and space bounds, little has been done for the practical perspective of this important problem. In this paper, we revisit a wellknown folklore random access algorithm for grammars in the Chomsky normal form, modify it to work directly on general grammars, and show that this modified version is fast and memory efficient in practice.
A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algori...
详细信息
A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.
We show that a context-free grammar of size m that produces a single string w of length n (such a grammar is also called a string straight-line program) can be transformed in linear time into a context-free grammar fo...
详细信息
We show that a context-free grammar of size m that produces a single string w of length n (such a grammar is also called a string straight-line program) can be transformed in linear time into a context-free grammar for w of size O(m), whose unique derivation tree has depth O(log n). This solves an open problem in the area of grammar-based compression, improves many results in this area, and greatly simplifies many existing constructions. Similar results are shown for two formalisms for grammar-based tree compression: top dags and forest straight-line programs. These balancing results can be all deduced from a single meta-theorem stating that the depth of an algebraic circuit over an algebra with a certain finite base property can be reduced to O(log n) with the cost of a constant multiplicative size increase. Here, n refers to the size of the unfolding (or unravelling) of the circuit. In particular, this results applies to standard arithmetic circuits over (noncommutative) semirings.
暂无评论