We introduce a new text-indexing datastructure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching datastructures. In a short phrase, it is a combination of B-...
详细信息
We introduce a new text-indexing datastructure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching datastructures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They also can be successfully applied to database indexing and software duplication.
We introduce a new paradigm for querying strings in externalmemory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S-1, . . . , S-n, we aim at supporting a search sequen...
详细信息
We introduce a new paradigm for querying strings in externalmemory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S-1, . . . , S-n, we aim at supporting a search sequence for m not necessarily distinct strings T-1, T-2 , . . . , T-m, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches B items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting datastructure (SASL) based on skip lists, that is also interesting per se. The search for the whole sequence T-1, T-2 , . . . , T-m can be done in an expected number of I/Os: O(Sigma(m)(j=1) vertical bar T-j vertical bar/B + Sigma(n)(j=1) (n(i) log(B) m/n(i))), where each T-j may or may not be present in the dictionary, and n(i) is the number of times Si is queried (i.e., the number of T(j)s equal to S-i). Moreover, inserting or deleting a string S-i takes an expected amortized number O(vertical bar S-i vertical bar/B + log(B)n) of I/Os. The term Sigma(n)(j=1) vertical bar T-j vertical bar/B in the search formula is a lower bound for reading the input, and the term Sigma(n)(i=1) n(i) log(B) m/n(i) (entropy of the query sequence) is a standard information-theoretic lower bound. We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally, we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.
暂无评论