externalmemory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are spec...
详细信息
externalmemory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems.
Novel analytical techniques have dramatically enhanced our understanding of many application domains including biological networks inferred from gene expression studies. However, there are clear computational challeng...
详细信息
ISBN:
(纸本)9783642246685
Novel analytical techniques have dramatically enhanced our understanding of many application domains including biological networks inferred from gene expression studies. However, there are clear computational challenges associated to the large datasets generated from these studies. The algorithmic solution of some NP-hard combinatorial optimization problems that naturally arise on the analysis of large networks is difficult without specialized computer facilities (i.e. supercomputers). In this work, we address the data clustering problem of large-scale biological networks with a polynomial-time algorithm that uses reasonable computing resources and is limited by the available memory. We have adapted and improved the MSTkNN graph partitioning algorithm and redesigned it to take advantage of externalmemory (EM) algorithms. We evaluate the scalability and performance of our proposed algorithm on a well-known breast cancer microarray study and its associated dataset.
The externalmemory BDD package Adiar can manipulate Binary Decision Diagrams (BDDs) larger than the RAM of the machine. To do so, it uses one or more priority queues to defer processing each recursion until the relev...
详细信息
ISBN:
(纸本)9783031661488;9783031661495
The externalmemory BDD package Adiar can manipulate Binary Decision Diagrams (BDDs) larger than the RAM of the machine. To do so, it uses one or more priority queues to defer processing each recursion until the relevant nodes are encountered in a sequential scan. We outline how to improve the performance of Adiar's algorithms if the BDD width of one of its inputs is small enough to fit into main memory. In this case, one of the algorithms' priority queues can entirely be replaced with (levelised) random access to the nodes of the narrow BDD. This preserves the I/O efficiency of the original algorithm, is applicable to other types of decision diagrams, and significantly improves performance for many larger BDD computations.
The construction of suffix trees in secondary storage was considered impractical due to its excessive I/O cost. algorithms developed in the last decade show that a suffix tree can efficiently be built in secondary sto...
详细信息
The construction of suffix trees in secondary storage was considered impractical due to its excessive I/O cost. algorithms developed in the last decade show that a suffix tree can efficiently be built in secondary storage for inputs which fit the main memory. In this paper, we analyze the details of algorithmic approaches to the externalmemory suffix tree construction and compare the performance and scalability of existing state-of-the-art software based on these algorithms. Copyright (C) 2010 John Wiley & Sons, Ltd.
Some recent results (Bauer et al. in algorithms in bioinformatics, Springer, Berlin, pp 326-337, 2012;Cox et al. in algorithms in bioinformatics, Springer, Berlin, pp. 214-224, 2012;Rosone and Sciortino in The nature ...
详细信息
Some recent results (Bauer et al. in algorithms in bioinformatics, Springer, Berlin, pp 326-337, 2012;Cox et al. in algorithms in bioinformatics, Springer, Berlin, pp. 214-224, 2012;Rosone and Sciortino in The nature of computation. Logic, algorithms, applications, Springer, Berlin, pp 353-364, 2013) have introduced external-memoryalgorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler transform of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches that are currently used to tackle this problem are memory-intensive. This fact does not bode well with the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs represent two overlapping samples. In this paper we address an open problem of Simpson and Durbin (Bioinformatics 26(12):i367-i373, 2010): to design an external-memory algorithm to compute the string graph.
In this paper, we present two linear-size externalmemory data structures for approximate range searching. Our first structure, the BAR-B-tree, stores a set of N points in a"e (d) and can report all points inside...
详细信息
In this paper, we present two linear-size externalmemory data structures for approximate range searching. Our first structure, the BAR-B-tree, stores a set of N points in a"e (d) and can report all points inside a query range Q by accessing O(log (B) N+epsilon (gamma) +k (epsilon) /B) disk blocks, where B is the disk block size, gamma=1-d for convex queries and gamma=-d otherwise, and k (epsilon) is the number of points lying within a distance of epsilon a <...diam (Q) to the query range Q. Our second structure, the object-BAR-B-tree, is able to store objects of arbitrary shapes of constant complexity and provides similar query guarantees. In addition, both structures can support other types of range searching queries such as range aggregation and nearest-neighbor. Finally, we present I/O-efficient algorithms to build these structures.
We present externalmemory data structures for efficiently answering range-aggregate queries. The range-aggregate problem is defined as follows: Given a set of weighted points in R-d, compute the aggregate of the weig...
详细信息
We present externalmemory data structures for efficiently answering range-aggregate queries. The range-aggregate problem is defined as follows: Given a set of weighted points in R-d, compute the aggregate of the weights of the points that lie inside a d-dimensional orthogonal query rectangle. The aggregates we consider in this paper include COUNT, sum, and MAX. First, we develop a structure for answering two-dimensional range-COUNT queries that uses O(N/B) disk blocks and answers a query in O(log(B) N) I/Os, where N is the number of input points and B is the disk block size. The structure can be extended to obtain a near-linear-size structure for answering range-sum queries using O(log(B) N) I/Os, and a linear-size structure for answering range-MAX queries in O(log(B)(2) N) I/Os. Our structures can be made dynamic and extended to higher dimensions. (C) 2012 Elsevier B.V. All rights reserved.
In recent years a large number of problems have been considered in externalmemory models of computation, where the complexity measure is the number of blocks of data that are moved between slow externalmemory and fa...
详细信息
In recent years a large number of problems have been considered in externalmemory models of computation, where the complexity measure is the number of blocks of data that are moved between slow externalmemory and fast internal memory (also called I/Os). In practice, however, internal memory time often dominates the total running time once I/O-efficiency has been obtained. In this paper we initiate a study of algorithms for fundamental problems that are simultaneously I/O-efficient and internal memory efficient in the RAM model of computation. For sorting the conventional wisdom is to use merging base algorithms in externalmemory but we describe how this leads to suboptimal RAM performance. However, by using a splitting based algorithm in combination with existing RAM sorting techniques we obtain a sorting algorithm that is both I/O and RAM model efficient. Furthermore, we design an I/O- and RAM-efficient priority queue. Finally, we prove a sorting lower bound that shows that in most cases our results are optimal both in terms of I/O and internal computation.
In this paper we study the externalmemory planar point enclosure problem: Given N axis-parallel rectangles in the plane, construct a data structure on disk (an index) such that all K rectangles containing a query poi...
详细信息
In this paper we study the externalmemory planar point enclosure problem: Given N axis-parallel rectangles in the plane, construct a data structure on disk (an index) such that all K rectangles containing a query point can be reported I/O-efficiently. This problem has important applications in e.g. spatial and temporal databases, and is dual to the important and well-studied orthogonal range searching problem. Surprisingly, despite the fact that the problem can be solved optimally in internal memory with linear space and O(log N+K) query time, we show that one cannot construct a linear sized externalmemory point enclosure data structure that can be used to answer a query in O(log (B) N+K/B) I/Os, where B is the disk block size. To obtain this bound, Omega(N/B (1-epsilon) ) disk blocks are needed for some constant epsilon > 0. With linear space, the best obtainable query bound is O(log (2) N+K/B) if a linear output term O(K/B) is desired. To show this we prove a general lower bound on the tradeoff between the size of the data structure and its query cost. We also develop a family of structures with matching space and query bounds.
The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottl...
详细信息
暂无评论