检索结果-内蒙古大学图书馆

General document retrieval in compact space

ACM Journal of Experimental Algorithmics 2014年第2期19卷 1–46页

作者： Navarro, Gonzalo Puglisi, Simon J. Valenzuela, Daniel Department of Computer Science University of Chile Santiago Chile Department of Computer Science University of Helsinki Finland

Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natural language text collections, document retrieval problems arise nowadays in applications like bioinformatics, multimedia databases, and web mining. This requires a more general setup where text and pattern can be general sequences of symbols, and the classical inverted indexes developed for words cannot be applied. While linear-space time-optimal solutions have been developed for most interesting queries in this general case, space usage is a serious problem in practice. In this article, we develop compact data structures that solve various important document retrieval problems on general text collections. More specifically, we provide practical solutions for listing the documents where a query pattern appears, together with its frequency in each document, and for listing k documents where a query pattern appears most frequently. Some of our techniques build on existing theoretical proposals, while others are new. In particular, we introduce a novel grammar-based compressed bitmap representation that may be of independent interest when dealing with repetitive sequences. Ours are the first practical indexes that use less space when the text collection is compressible. Our experimental results show that, in various real-life text collections, our data structures are significantly smaller than the most space-efficient previous solutions, using up to half the space without noticeably increasing the query time. Overall, document listing can be carried out in 10 to 40 milliseconds for patterns that appear 100 to 10,000 times in the collection, whereas top-k retrieval is carried out in k to 10 k milliseconds. © 2014 ACM.

关键词： compressed text databases Document retrieval Top-k retrieval Wavelet trees

来源：评论

学校读者我要写书评

暂无评论

Fully Functional Static and Dynamic Succinct Trees

引用

ACM TRANSACTIONS ON ALGORITHMS 2014年第3期10卷 16-16页

作者： Navarro, Gonzalo Sadakane, Kunihiko Univ Chile Dept Comp Sci Santiago Chile Natl Inst Informat Principles Informat Res Div Chiyoda Ku Tokyo 1018430 Japan

We propose new succinct representations of ordinal trees and match various space/time lower bounds. It is known that any n-node static tree can be represented in 2n + o(n) bits so that a number of operations on the tree can be supported in constant time under the word-RAM model. However, the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on polylog-sized trees. The result is extended to trees of arbitrary size, retaining constant time and reaching 2n + O(n/polylog(n)) bits of space. This space is optimal for a core subset of the operations supported and significantly lower than in any previous proposal. For the dynamic case, where insertion/deletion (indels) of nodes is allowed, the existing data structures support a very limited set of operations. Our data structure builds on the range min-max tree to achieve 2n + O(nllog n) bits of space and O(log n) time for all operations supported in the static scenario, plus indels. We also propose an improved data structure using 2n + O(n log log n/ log n) bits and improving the time to the optimal O(log n/ log log n) for most operations. We extend our support to forests, where whole subtrees can be attached to or detached from others, in time O(log(1+is an element of) n) for any is an element of> O. Such operations had not been considered before. Our techniques are of independent interest. An immediate derivation yields an improved solution to range minimum/maximum queries where consecutive elements differ by 1, achieving n + O(n/polylog(n)) bits of space. A second one stores an array of numbers supporting operations sum and search and limited updates, in optimal time O(log n/ log log n). A third one allows representing dynamic bitmaps and sequences over alphabets of size sigma, supporting ra

关键词： Succinct tree representations compressed sequence representations compressed text databases

来源：评论

学校读者我要写书评

暂无评论

Self-Indexed Grammar-Based Compression

引用

FUNDAMENTA INFORMATICAE 2011年第3期111卷 313-337页

作者： Claude, Francisco Navarro, Gonzalo Univ Waterloo David R Cheriton Sch Comp Sci Waterloo ON N2L 3G1 Canada Univ Chile Dept Comp Sci Santiago Chile

Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T [1, u], then an SLP-compressed representation of T requires 2n log(2) n bits. For that same SLP, our self-index takes O(n log n) + n log(2) u bits. It extracts any text substring of length m in time O ((m + h) log n), and finds occ occurrences of a pattern string of length m in time O ((m (m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log(2) n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules;(ii) a representation for binary relations with labels supporting various extended queries;(iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.

关键词： Grammar-based Compression Straight-Line Programs Self-Indexes compressed text databases Highly Repetitive Sequences Pattern Matching Data Structures

来源：评论

学校读者我要写书评

暂无评论

Dynamic Entropy-compressed Sequences and Full-text Indexes

Dynamic Entropy-Compressed Sequences and Full-Text Indexes

引用

14th International Symposium on String Processing and Information Retrieval

作者： Maekinen, Veli Navarro, Gonzalo Univ Helsinki Dept Comp Sci FIN-00014 Helsinki Finland Univ Chile Dept Comp Sci Santiago 2120 Chile

We give new solutions to the SEARCHABLE PARTIAL SUMS WITH INDELS problem. Given a sequence of n k-bit numbers, we present a structure taking kn + o(kn) bits of space, able of performing operations sum, search, insert, and delete, all in O(log n) worst-case time, for any k = O(log n). This extends previous results by Hon et al. [2003c] achieving the same space and O(log n/log log n) time complexities for the queries, yet offering complexities for insert and delete that are amortized and worse than ours, and supported only for k = O(1). Our result matches an existing lower bound for large values of k. We also give new solutions to the DYNAMIC SEQUENCE problem. Given a sequence of n symbols in the range [1, sigma] with binary zero-order entropy H-0, we present a dynamic data structure that requires nH(0) + o(n log sigma) bits of space, which is able of performing rank and select, as well as inserting and deleting symbols at arbitrary positions, in O(log n log sigma) time. Our result is the first entropy-bound dynamic data structure for rank and select over general sequences. In the case sigma = 2, where both previous problems coincide, we improve the dynamic solution of Hon et al. [2003c] in that we compress the sequence. The only previous result with entropy-bound space for dynamic binary sequences is by Blandford and Blelloch [2004], which has the same complexities as our structure, but does not achieve constant 1 multiplying the entropy term in the space complexity. Finally, we present a new dynamic compressed full-text self-index, for a collection of texts over an alphabet of size sigma, of overall length n and hth order empirical entropy H-h. The index requires n H-h + o(n log sigma) bits of space, for any h <= alpha log(sigma) n and constant 0 < alpha < 1. It can count the number of occurrences of a pattern of length m in time O(m log n log sigma). Each such occurrence can be reported in O(log(2) n log log n) time, and displaying a context of length l from a text

关键词： compressed dynamic data structures compressed text databases entropy partial sums sequences

来源：评论

学校读者我要写书评

暂无评论

Dynamic Entropy-compressed Sequences and Full-text Indexes

引用

ACM TRANSACTIONS ON ALGORITHMS 2008年第3期4卷 32-32页

作者： Maekinen, Veli Navarro, Gonzalo Univ Helsinki Dept Comp Sci FIN-00014 Helsinki Finland Univ Chile Dept Comp Sci Santiago 2120 Chile

关键词： compressed dynamic data structures compressed text databases entropy partial sums sequences

来源：评论

学校读者我要写书评

暂无评论

Using structural contexts to compress semistructured text collections

引用

INFORMATION PROCESSING & MANAGEMENT 2007年第3期43卷 769-790页

作者： Adiego, Joaquin Navarro, Gonzalo de la Fuente, Pablo Univ Valladolid Depto Informat ETIyT Valladolid 47011 Spain Univ Chile Depto Ciencias Computac Santiago Chile

We describe a compression model for semistructured documents, called Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., different XML tag). The intuition behind SCM is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We mainly focus on semistatic models, and test our idea using a word-based Huffman method. This is the standard for compressing large natural language text databases, because random access, partial decompression, and direct search of the compressed collection is possible. This variant, dubbed SCM Huff, retains those features and improves Huffman's compression ratios. We consider the possibility that storing separate models may not pay off if the distribution of different structure types is not different enough, and present a heuristic to merge models with the aim of minimizing the total size of the compressed database. This gives an additional improvement over the plain technique. The comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCM Huff achieves the best compression ratios, 2-4% better than the closest alternative. From a purely compression-aimed perspective, we combine SCM with PPM modeling. A separate PPM model is used to compress the text that lies inside each different structure type. The result, SCMPPM, does not permit random access nor direct search in the compressed text, but it gives 2-5% better compression ratios than other techniques for texts longer than 5 MB. (c) 2006 Elsevier Ltd. All rights reserved.

关键词： text compression semistructured documents compressed text databases

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：