Indexing very large collections of strings, such as those produced by the widespread next-generation sequencing technologies, heavily relies on the multi-string generalization of the Burrows-Wheeler Transform (BWT): the large memory requirements of in-memory approaches have stimulated recent developments of external-memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to computing the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external-memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. Over a set of strings of constant length k, the algorithm has O(mkl) time and I/O volume, using O(k + m) main memory, where l is the maximum value in the LCP array. (C) 2020 Elsevier B.V. All rights reserved.
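To make the objects being computed concrete, here is a minimal in-memory Python sketch that builds the multi-string BWT and LCP array by naively sorting all suffixes. It only illustrates the definitions and is unrelated to the paper's external-memory method; using a shared '$' terminator with ties broken by string index is a simplifying assumption made here.

```python
# Naive in-memory multi-string BWT and LCP construction, for
# illustration only. Every string gets the terminator '$' (which
# sorts before letters in ASCII); ties between equal suffixes are
# broken by string index, simulating distinct end-markers.

def multistring_bwt_lcp(strings):
    suffixes = []
    for i, s in enumerate(strings):
        t = s + "$"
        for j in range(len(t)):
            suffixes.append((t[j:], i, j))
    suffixes.sort(key=lambda x: (x[0], x[1]))

    bwt, lcp = [], []
    prev = None
    for suf, i, j in suffixes:
        t = strings[i] + "$"
        bwt.append(t[j - 1])           # char preceding the suffix (wraps to '$')
        if prev is None:
            lcp.append(0)
        else:
            k = 0
            while k < min(len(prev), len(suf)) and prev[k] == suf[k]:
                k += 1
            lcp.append(k)              # longest common prefix with predecessor
        prev = suf
    return "".join(bwt), lcp

if __name__ == "__main__":
    print(multistring_bwt_lcp(["ACG", "ACT"]))
```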
Given an input stream S of size N, a phi-heavy hitter is an item that occurs at least phi N times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = phi N-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Omega(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., approximately 100K observations per second.
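The reporting condition is easy to state in code. The sketch below is a naive exact detector, which is why it needs Omega(N) words in the worst case (one counter per distinct item); it illustrates the TED problem statement, not the paper's external-memory structures.

```python
# Exact timely event detection: report an item the moment its count
# reaches T = phi * N. Exactness forces a counter per distinct item,
# which is the Omega(N)-word space cost the abstract refers to.

def ted_exact(stream, N, phi):
    T = phi * N
    counts = {}
    for pos, item in enumerate(stream):
        counts[item] = counts.get(item, 0) + 1
        if counts[item] == T:        # exactly the T-th occurrence
            yield pos, item          # timely: reported immediately

if __name__ == "__main__":
    s = ["a", "b", "a", "c", "a", "b"]
    for pos, item in ted_exact(s, N=len(s), phi=0.5):
        print(f"{item!r} became a heavy hitter at position {pos}")
```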
ISBN (print): 9781450367356
Given an input stream S of size N, a phi-heavy hitter is an item that occurs at least phi N times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = phi N-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Omega(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., approximately 100K observations per second.
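For contrast with the exact detector sketched earlier, a classic in-RAM heavy-hitters algorithm such as Misra-Gries uses only a bounded number of counters but may return false positives: this is precisely the accuracy sacrifice the abstract mentions. The sketch below is the standard textbook algorithm, not code from the paper.

```python
# Misra-Gries summary with k-1 counters. No phi-heavy hitter (for
# phi >= 1/k) is ever lost from the candidate set, but the candidates
# may include non-heavy items: false positives unless a second
# verification pass is made over the stream.

def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters                  # candidate heavy hitters

if __name__ == "__main__":
    print(misra_gries(list("abababacccc"), k=3))
```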
ISBN (print): 9781450361842
Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-Access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM explains the popularity of B-trees in the 70s and the current popularity of B-epsilon-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all B-epsilon-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this paper, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively. We show that the affine model explains node-size choices in B-trees and B-epsilon-trees. Furthermore, the models predict that the B-tree is highly sensitive to variations in the node size whereas B-epsilon-trees are much less sensitive. These predictions are borne out empirically. Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO size. In the affine model, B-epsilon-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, B-epsilon-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently.
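A rough numerical sketch of the DAM-vs-affine comparison: under the affine view, an IO of x bytes costs one setup (seek) plus a size-proportional transfer term, while the DAM charges one unit per size-B block. The constants below are hypothetical placeholders, not measurements from the paper; note that at the half-bandwidth point the two models differ by exactly a factor of 2, as the abstract states.

```python
# Compare the DAM's unit-cost-per-block accounting with an affine
# cost model: cost(x) = 1 setup + x / (seek_time * bandwidth).
# SEEK_S and BANDWIDTH are illustrative placeholders only.

SEEK_S = 0.010           # hypothetical setup time per IO (seconds)
BANDWIDTH = 100e6        # hypothetical transfer rate (bytes/second)
B = SEEK_S * BANDWIDTH   # half-bandwidth point: latency == transfer time

def dam_cost(x):
    # DAM: number of size-B blocks touched, each at unit cost.
    return -(-x // B)    # ceil(x / B)

def affine_cost(x):
    # Affine: one setup plus transfer time, measured in "seeks".
    return 1 + x / (SEEK_S * BANDWIDTH)

for x in (4096, B, 10 * B):
    print(f"x={x:>10.0f} bytes  DAM={dam_cost(x):4.0f}  "
          f"affine={affine_cost(x):6.2f}")
```

With these placeholder constants, an IO of exactly B bytes costs 1 in the DAM and 2 in the affine model, and the ratio only shrinks for larger IOs, which is the factor-of-2 approximation claim.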
We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs, called IterTS. In the worst case, our algorithm is extremely inefficient and performs O(n · sort(m)) I/Os. However, our experiments show that IterTS achieves good performance in practice. To evaluate IterTS, we compared its running time to those of three competitors: PeelTS, an I/O-efficient implementation of the standard strategy of iteratively removing sources and sinks; ReachTS, an I/O-efficient implementation of a recent parallel divide-and-conquer algorithm based on reachability queries; and SeTS, a standard DFS-based topological sort built on top of a semi-external DFS algorithm. In our evaluation on various types of input graphs, IterTS consistently outperformed PeelTS and ReachTS, by at least an order of magnitude in most cases. SeTS outperformed IterTS on most graphs whose vertex sets fit in memory. However, IterTS often came close to the running time of SeTS on these inputs and, more importantly, SeTS was not able to process graphs whose vertex sets were beyond the size of main memory, while IterTS was able to process such inputs efficiently.
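For orientation, the in-memory form of the "iteratively remove sources" strategy that PeelTS adapts to external memory is Kahn's algorithm; a minimal sketch follows. The I/O-efficient variants evaluated in the paper are considerably more involved.

```python
# Kahn's algorithm: repeatedly remove vertices of in-degree zero.
# This is the internal-memory analogue of the PeelTS strategy.
from collections import deque

def topo_sort(n, edges):
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(u for u in range(n) if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != n:
        raise ValueError("graph has a cycle; no topological order exists")
    return order

if __name__ == "__main__":
    print(topo_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))
```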
Given a geometric graph G = (S, E) in R^d with constant dilation t, and a positive constant epsilon, we show how to construct a (1 + epsilon)-spanner of G with O(|S|) edges using O(sort(|E|)) memory transfers in the cache-oblivious model of computation. The main building block of our algorithm, and of independent interest in itself, is a new cache-oblivious algorithm for constructing a well-separated pair decomposition, which builds such a data structure for a given point set S ⊂ R^d using O(sort(|S|)) memory transfers. (C) 2010 Elsevier B.V. All rights reserved.
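To pin down the notion of dilation used here: the dilation of a geometric graph is the maximum, over all point pairs, of the ratio between graph distance and Euclidean distance, so a t-spanner is exactly a graph with dilation at most t. The naive O(|S|^3) sketch below just evaluates this definition and is unrelated to the cache-oblivious construction.

```python
# Dilation of a geometric graph: max over pairs (p, q) of
# shortest-path distance in G divided by Euclidean distance.
import math

def dilation(points, edges):
    n = len(points)
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for u, v in edges:
        w = math.dist(points[u], points[v])
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in range(n):                  # Floyd-Warshall all-pairs paths
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return max(d[i][j] / math.dist(points[i], points[j])
               for i in range(n) for j in range(i + 1, n))

if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    # Path 0-1-2: dilation is 2 / sqrt(2) ~ 1.414, attained by pair (0, 2).
    print(dilation(pts, [(0, 1), (1, 2)]))
```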
We present I/O-efficient algorithms for computing planar Steiner spanners for point sets and sets of polygonal obstacles in the plane. (c) 2007 Elsevier B.V. All rights reserved.
Graph data in modern scientific and engineering applications are often too large to fit in the computer's main memory. Input/output (I/O) complexity is a major research issue in this context. Minimizing the number of I/O operations is the main focus of current research on external-memory graph algorithms, whereas classical (internal-memory) graph algorithms were designed to minimize time complexity. In this paper, we propose an external-memory depth-first search algorithm for general grid graphs. The I/O complexity of the algorithm is O(sort(N) log_2(N/M)), where N = |V| + |E|, sort(N) = Theta((N/B) log_{M/B}(N/B)) is the sorting I/O complexity, M is the memory size, and B is the block size. The best known algorithm for this class of graphs is the standard (internal-memory) DFS algorithm with appropriate block (sub-grid) I/O access; its I/O complexity is O(N/sqrt(B)). Recently, the authors proposed an O(sort(N)) algorithm for solid grid graphs. (c) 2007 Elsevier B.V. All rights reserved.
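To get a feel for how the stated bound compares with the O(N/sqrt(B)) blocked internal-memory DFS, the small sketch below evaluates both expressions for hypothetical machine parameters (M, B, and the input sizes are illustrative choices, not values from the paper).

```python
# Evaluate the paper's bound sort(N) * log_2(N/M), with
# sort(N) = (N/B) * log_{M/B}(N/B), against N / sqrt(B),
# for hypothetical machine parameters.
import math

def sort_io(n, m, b):
    return (n / b) * math.log(n / b, m / b)

def grid_dfs_io(n, m, b):
    return sort_io(n, m, b) * math.log2(n / m)

M, B = 2**30, 2**16      # hypothetical memory and block sizes (in items)
for N in (2**32, 2**36, 2**40):
    print(f"N=2^{int(math.log2(N))}: "
          f"sort-based bound ~ {grid_dfs_io(N, M, B):.3e}, "
          f"N/sqrt(B) ~ {N / math.sqrt(B):.3e}")
```

For these parameters the sort-based bound is smaller by several orders of magnitude, which is the usual motivation for replacing unstructured I/O with sorting passes.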
We present an external-memory algorithm to compute a well-separated pair decomposition (WSPD) of a given point set S in R^d in O(sort(N)) I/Os, where N is the number of points in S and sort(N) denotes the I/O complexity of sorting N items. (Throughout this paper we assume that the dimension d is fixed.) As applications of the WSPD, we show how to compute a linear-size t-spanner for S within the same I/O bound and how to solve the K-nearest-neighbour and K-closest-pair problems in O(sort(KN)) and O(sort(N + K)) I/Os, respectively.
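As a reminder of the underlying definition: two point sets are s-well-separated when each fits in a ball of radius r and the balls lie at distance at least s·r apart. The sketch below checks this condition directly; using the bounding-box centre as the ball centre is a simplification (not the smallest enclosing ball), and the helper is unrelated to the paper's external-memory construction.

```python
# Check whether point sets A and B are s-well-separated: each set is
# enclosed in a ball of radius r and the gap between the two balls
# is at least s * r.
import math

def enclosing_ball(points):
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    center = [(a + b) / 2 for a, b in zip(lo, hi)]
    radius = max(math.dist(center, p) for p in points)
    return center, radius

def well_separated(A, B, s):
    ca, ra = enclosing_ball(A)
    cb, rb = enclosing_ball(B)
    r = max(ra, rb)                      # common radius for both balls
    return math.dist(ca, cb) - 2 * r >= s * r

if __name__ == "__main__":
    A = [(0.0, 0.0), (1.0, 0.0)]
    B = [(10.0, 0.0), (11.0, 0.0)]
    print(well_separated(A, B, s=2.0))   # True: gap 9 >= 2 * 0.5
```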
In this paper, we propose an external-memory depth-first search algorithm for solid grid graphs, a subclass of grid graphs. The I/O complexity of the algorithm is O(sort(N)), where N = |V| + |E|, sort(N) = Theta((N/B) log_{M/B}(N/B)) is the sorting I/O complexity, M is the memory size, and B is the block size. Since grid graphs might be non-planar (if diagonal edges intersect), they are beyond the reach of existing planar depth-first search algorithms. The best known algorithm for this class of graphs is the standard (internal-memory) DFS algorithm with appropriate block (sub-grid) I/O access; its I/O complexity is O(N/sqrt(B)). (C) 2004 Elsevier B.V. All rights reserved.
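For context, the internal-memory baseline mentioned above is plain DFS over the grid cells; a minimal sketch follows. Restricting to axis-aligned neighbours is a simplifying assumption here, since the grid graphs in the paper may also include diagonal edges.

```python
# Internal-memory DFS over a grid graph given as a set of occupied
# cells; the baseline the paper improves on runs exactly this kind of
# traversal, but with sub-grid-blocked I/O access.

def grid_dfs(cells, start):
    cells = set(cells)
    stack, seen, order = [start], {start}, []
    while stack:
        x, y = stack.pop()
        order.append((x, y))
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return order                      # cells in DFS visit order

if __name__ == "__main__":
    print(grid_dfs([(0, 0), (1, 0), (0, 1), (1, 1)], start=(0, 0)))
```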