Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an ...
详细信息
We propose new succinct representations of ordinal trees and match various space/time lower bounds. It is known that any n-node static tree can be represented in 2n + o(n) bits so that a number of operations on the tr...
详细信息
We propose new succinct representations of ordinal trees and match various space/time lower bounds. It is known that any n-node static tree can be represented in 2n + o(n) bits so that a number of operations on the tree can be supported in constant time under the word-RAM model. However, the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on polylog-sized trees. The result is extended to trees of arbitrary size, retaining constant time and reaching 2n + O(n/polylog(n)) bits of space. This space is optimal for a core subset of the operations supported and significantly lower than in any previous proposal. For the dynamic case, where insertion/deletion (indels) of nodes is allowed, the existing data structures support a very limited set of operations. Our data structure builds on the range min-max tree to achieve 2n + O(nllog n) bits of space and O(log n) time for all operations supported in the static scenario, plus indels. We also propose an improved data structure using 2n + O(n log log n/ log n) bits and improving the time to the optimal O(log n/ log log n) for most operations. We extend our support to forests, where whole subtrees can be attached to or detached from others, in time O(log(1+is an element of) n) for any is an element of> O. Such operations had not been considered before. Our techniques are of independent interest. An immediate derivation yields an improved solution to range minimum/maximum queries where consecutive elements differ by 1, achieving n + O(n/polylog(n)) bits of space. A second one stores an array of numbers supporting operations sum and search and limited updates, in optimal time O(log n/ log log n). A third one allows representing dynamic bitmaps and sequences over alphabets of size sigma, supporting ra
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiti...
详细信息
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T [1, u], then an SLP-compressed representation of T requires 2n log(2) n bits. For that same SLP, our self-index takes O(n log n) + n log(2) u bits. It extracts any text substring of length m in time O ((m + h) log n), and finds occ occurrences of a pattern string of length m in time O ((m (m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log(2) n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules;(ii) a representation for binary relations with labels supporting various extended queries;(iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.
We give new solutions to the SEARCHABLE PARTIAL SUMS WITH INDELS problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert,...
详细信息
We give new solutions to the SEARCHABLE PARTIAL SUMS WITH INDELS problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert, and delete, all in O(log n) worst-case time, for any k = O(log n). This extends previous results by Hon et al. [2003c] achieving the same space and O(log n/log log n) time complexities for the queries, yet offering complexities for insert and delete that are amortized and worse than ours, and supported only for k = O(1). Our result matches an existing lower bound for large values of k. We also give new solutions to the DYNAMIC SEQUENCE problem. Given a sequence of n symbols in the range [1, sigma] with binary zero-order entropy H-0, we present a dynamic data structure that requires nH(0) + o(n log sigma) bits of space, which is able of performing rank and select, as well as inserting and deleting symbols at arbitrary positions, in O(log n log sigma) time. Our result is the first entropy-bound dynamic data structure for rank and select over general sequences. In the case sigma = 2, where both previous problems coincide, we improve the dynamic solution of Hon et al. [2003c] in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity. Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size sigma, of overall length n and hth order empirical entropy H-h. The index requires n H-h + o(n log sigma) bits of space, for any h <= alpha log(sigma) n and constant 0 < alpha < 1. It can count the number of occurrences of a pattern of length m in time O(m log n log sigma). Each such occurrence can be reported in O(log(2) n log log n) time, and displaying a context of length l from a text
We give new solutions to the SEARCHABLE PARTIAL SUMS WITH INDELS problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert,...
详细信息
We give new solutions to the SEARCHABLE PARTIAL SUMS WITH INDELS problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert, and delete, all in O(log n) worst-case time, for any k = O(log n). This extends previous results by Hon et al. [2003c] achieving the same space and O(log n/log log n) time complexities for the queries, yet offering complexities for insert and delete that are amortized and worse than ours, and supported only for k = O(1). Our result matches an existing lower bound for large values of k. We also give new solutions to the DYNAMIC SEQUENCE problem. Given a sequence of n symbols in the range [1, sigma] with binary zero-order entropy H-0, we present a dynamic data structure that requires nH(0) + o(n log sigma) bits of space, which is able of performing rank and select, as well as inserting and deleting symbols at arbitrary positions, in O(log n log sigma) time. Our result is the first entropy-bound dynamic data structure for rank and select over general sequences. In the case sigma = 2, where both previous problems coincide, we improve the dynamic solution of Hon et al. [2003c] in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity. Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size sigma, of overall length n and hth order empirical entropy H-h. The index requires n H-h + o(n log sigma) bits of space, for any h <= alpha log(sigma) n and constant 0 < alpha < 1. It can count the number of occurrences of a pattern of length m in time O(m log n log sigma). Each such occurrence can be reported in O(log(2) n log log n) time, and displaying a context of length l from a text
We describe a compression model for semistructured documents, called Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use ...
详细信息
We describe a compression model for semistructured documents, called Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., different XML tag). The intuition behind SCM is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We mainly focus on semistatic models, and test our idea using a word-based Huffman method. This is the standard for compressing large natural language textdatabases, because random access, partial decompression, and direct search of the compressed collection is possible. This variant, dubbed SCM Huff, retains those features and improves Huffman's compression ratios. We consider the possibility that storing separate models may not pay off if the distribution of different structure types is not different enough, and present a heuristic to merge models with the aim of minimizing the total size of the compressed database. This gives an additional improvement over the plain technique. The comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCM Huff achieves the best compression ratios, 2-4% better than the closest alternative. From a purely compression-aimed perspective, we combine SCM with PPM modeling. A separate PPM model is used to compress the text that lies inside each different structure type. The result, SCMPPM, does not permit random access nor direct search in the compressedtext, but it gives 2-5% better compression ratios than other techniques for texts longer than 5 MB. (c) 2006 Elsevier Ltd. All rights reserved.
暂无评论