We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar ...
详细信息
ISBN:
(纸本)9798350385885;9798350385878
We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function delta. As a matter of fact, delta itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors. We present a data sketch of O(epsilon(-3) log n + epsilon(-1) log(2) n) words that allows computing a multiplicative (1 +/- epsilon)-approximation of delta with high probability, where n is the string length. The sketches of two strings S-1, S-2 can be merged in O(epsilon(-1) log(2) n) time to yield the sketch of {S-1, S-2}, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(epsilon(-1) log(2) n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating delta, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in O(pn center dot log n) working space and O(epsilon(-1) log(2) n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate delta on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs
暂无评论