We describe the first optimal randomized in-place algorithm for the basic 3-d convex hull problem (and, in particular, for 2-d Voronoi diagrams). The algorithm runs in O(n log n) expected time using only O(1) extra space; this improves the previous O(n log³ n) bound by Brönnimann, Chan, and Chen (2004) [10]. The same approach leads to an optimal randomized in-place algorithm for the 2-d line segment intersection problem, with O(n log n + K) expected running time for output size K, improving the previous O(n log² n + K) bound by Vahrenhold (2007) [42]. As a bonus, we also point out a simplification of a known optimal cache-oblivious (non-in-place) algorithm by Kumar and Ramos (2002) [33] for 3-d convex hulls, and observe its applicability to 2-d segment intersection, extending a recent result for red/blue segment intersection by Arge, Mølhave, and Zeh (2008) [3]. Our results are all obtained by standard random sampling techniques, with some interesting twists.
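The abstract does not spell out the sampling primitive, but randomized in-place algorithms of this kind rest on drawing a random sample without auxiliary storage. Below is a minimal C++ sketch of that building block, a partial Fisher-Yates shuffle that moves a uniform random s-subset to the array prefix in O(s) time and O(1) extra space; it illustrates the standard technique only, not the paper's convex hull algorithm, and `sample_prefix` is a hypothetical name.

```cpp
#include <cstddef>
#include <random>
#include <utility>

// Move a uniformly random s-subset of a[0..n-1] into the prefix
// a[0..s-1] via a partial Fisher-Yates shuffle: O(s) time, O(1) extra
// space. Requires s <= n. This is the kind of in-place sampling step
// that randomized in-place geometric algorithms build on.
template <typename T>
void sample_prefix(T* a, std::size_t n, std::size_t s, std::mt19937& rng) {
    for (std::size_t i = 0; i < s; ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, n - 1);
        std::swap(a[i], a[pick(rng)]);  // a[i] is now a uniform choice
    }
}
```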
ISBN (print): 9781450300797
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, i.e., critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
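To make the low-depth recipe concrete, here is a toy nested-parallel reduction in the style the paper advocates: O(n) work, O(log n) depth, and a natural sequential execution order that scans the input once, giving the optimal O(n/B) sequential cache complexity. The fork-join style via std::async and the GRAIN cutoff are illustrative choices, not the paper's code.

```cpp
#include <cstddef>
#include <future>
#include <numeric>

// Nested-parallel divide-and-conquer sum: the two halves can run in
// parallel (O(log n) depth), while the sequential evaluation order is
// a single left-to-right scan, so its cache-oblivious sequential cache
// complexity is O(n/B). GRAIN is a hypothetical cutoff; the
// asymptotics do not depend on its exact value.
long long reduce(const long long* a, std::size_t n) {
    const std::size_t GRAIN = 4096;
    if (n <= GRAIN) return std::accumulate(a, a + n, 0LL);
    std::size_t mid = n / 2;
    auto left = std::async(std::launch::async, reduce, a, mid); // fork
    long long right = reduce(a + mid, n - mid);
    return left.get() + right;                                  // join
}
```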
Modern computers employ a sophisticated hierarchy of caches to decrease the latency of memory accesses. This has led to the development of cache-oblivious algorithms, which strive to achieve the best possible performance on such memory hierarchies with minimal knowledge of the hierarchy's exact parameters. A common technique in the design of cache-oblivious algorithms is recursion-based divide and conquer. In this work, we show an alternative technique based on Gray codes. We use the binary reflected Gray code to traverse arrays in a cache-oblivious way, allowing us to design algorithms for problems such as matrix transposition, naive matrix multiplication, and naive convolution that match the asymptotic performance of their recursion-based counterparts. The advantage is that our algorithms can be implemented without recursion (or a stack that simulates it) by using a loopless algorithm. We also introduce a variant of the binary reflected Gray code tuned to certain applications of our technique, together with an almost loopless algorithm to generate it. Apart from the theoretical analysis of our technique's performance, we also examine its practical performance on the problem of matrix transposition.
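A minimal sketch of the core primitive: the binary reflected Gray code g(i) = i XOR (i >> 1), generated looplessly so that each step flips exactly one known bit. The `gray_walk` function below is a hypothetical illustration of the recursion-free traversal idea, not the authors' implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Binary reflected Gray code: g(i) = i ^ (i >> 1). Consecutive codes
// differ in exactly one bit, at position ctz(i + 1), so each step is
// O(1) with no recursion or stack -- a loopless traversal. Visiting
// array positions in this order is the primitive behind the
// recursion-free cache-oblivious traversals described above.
// (__builtin_ctz is GCC/Clang; assumes 1 <= logn <= 31.)
void gray_walk(uint32_t logn) {
    uint32_t g = 0;                       // Gray code of i = 0
    for (uint32_t i = 0; i + 1 < (1u << logn); ++i) {
        printf("%u\n", g);                // visit position g
        g ^= 1u << __builtin_ctz(i + 1);  // flip the one changing bit
    }
    printf("%u\n", g);                    // last code
}
```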
We develop an optimal cache-oblivious priority queue data structure, supporting insertion, deletion, and delete-min operations in O((1/B) log_{M/B}(N/B)) amortized memory transfers, where M and B are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cache-oblivious data structure, M and B are not used in the description of the structure. Our structure is as efficient as several previously developed external-memory (cache-aware) priority queue data structures, which all rely crucially on knowledge of M and B. Priority queues are a critical component in many of the best known external-memory graph algorithms, and using our cache-oblivious priority queue we develop several cache-oblivious graph algorithms.
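As a sanity check on optimality, the stated amortized bound reproduces the cache-oblivious sorting bound when the queue is used to sort: N inserts followed by N delete-mins cost

```latex
% 2N operations at the amortized cost per operation give the optimal
% cache-oblivious sorting bound:
\[
  2N \cdot O\!\left(\frac{1}{B}\log_{M/B}\frac{N}{B}\right)
  = O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)
  = \mathrm{Sort}(N).
\]
```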
The hash table, especially its external-memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size b, searching for a particular key takes only t_q = 1 + 1/2^Ω(b) expected average disk accesses for any load factor α bounded away from 1. However, such near-perfect performance is achieved only when b is known and the hash table is specifically tuned to work with that blocking. In this paper we study whether it is possible to build a cache-oblivious hash table that works well with any blocking. Such a hash table would automatically perform well across all levels of the memory hierarchy and would not need any hardware-specific tuning, an important feature in autonomous databases. We first show that linear probing, a classical collision resolution strategy for hash tables, can easily be made cache-oblivious, but it only achieves t_q = 1 + Θ(α/b) even if a truly random hash function is used. Then we demonstrate that the block probing algorithm (Pagh et al. in SIAM Rev. 53(3):547-558, 2011) achieves t_q = 1 + 1/2^Ω(b), thus matching the cache-aware bound, if the following two conditions hold: (a) b is a power of 2; and (b) every block starts at a memory address divisible by b. Note that the two conditions hold on a real machine, although they are not stated in the cache-oblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is t_q = 1 + O(α/b), which is exactly what linear probing achieves.
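To see why linear probing is "easily made cache-oblivious", note that its probe sequence scans contiguous memory, so any blocking serves it reasonably well. The sketch below is a bare-bones linear probing table illustrating that access pattern; the multiplicative hash and all names are placeholders, and this is not the block probing algorithm of Pagh et al.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear probing over a flat array: the probe sequence h(k), h(k)+1,
// ... scans contiguous memory, so under ANY blocking few probes fall
// outside the first block touched -- the reason linear probing is
// trivially cache-oblivious, achieving t_q = 1 + Theta(alpha/b) as
// discussed above. Assumes load factor strictly below 1.
struct LinearProbeTable {
    static constexpr uint64_t EMPTY = ~0ull;     // sentinel, non-key
    std::vector<uint64_t> slot;
    explicit LinearProbeTable(std::size_t cap) : slot(cap, EMPTY) {}
    std::size_t hash(uint64_t k) const {         // stand-in hash; the
        return (k * 0x9e3779b97f4a7c15ull) % slot.size();
    }                                            // analysis assumes a
    void insert(uint64_t k) {                    // truly random function
        std::size_t i = hash(k);
        while (slot[i] != EMPTY) i = (i + 1) % slot.size();
        slot[i] = k;
    }
    bool contains(uint64_t k) const {
        for (std::size_t i = hash(k); ; i = (i + 1) % slot.size()) {
            if (slot[i] == k) return true;
            if (slot[i] == EMPTY) return false;
        }
    }
};
```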
Many-core systems with a rapidly increasing number of cores pose a significant challenge for parallel applications seeking to use their complex memory hierarchies efficiently. Many such applications rely on collective communications in performance-critical phases, which become a bottleneck if they are not optimized. We address this issue by proposing cache-oblivious algorithms for MPI_Alltoall, MPI_Allgather, and the MPI neighborhood collectives that exploit data locality. To implement the cache-oblivious algorithms, we allocate the send and receive buffers on a shared heap and use Morton order to guide the memory copies. Our analysis shows that our algorithm for MPI_Alltoall is asymptotically optimal. We show an extension of our algorithms that minimizes communication distance on NUMA systems while maintaining optimality within each socket. We further demonstrate how the cache-oblivious algorithms can be applied to multi-node machines. Experiments are conducted on different many-core architectures. For MPI_Alltoall, our implementation achieves on average a 1.40x speedup over the naive shared-heap implementation for small and medium block sizes (less than 16 KB) on a Xeon Phi KNC, a 3.03x average speedup over MVAPICH2 on a Xeon E7-8890, and a 2.23x average speedup over MVAPICH2 on a 256-node Xeon E5-2680 cluster for block sizes less than 1 KB.
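For reference, "Morton order" here means visiting a 2-d index space in the order given by interleaving the coordinate bits, so that nearby steps touch nearby memory. A standard 16-bit-per-coordinate encoder is sketched below; it illustrates the ordering the memory copies follow and is not code from the paper.

```cpp
#include <cstdint>

// Interleave the bits of (x, y) into the Morton (Z-order) index.
// Copying buffers in increasing Morton order keeps the regions touched
// by nearby steps close in memory, which is how the collectives above
// exploit locality.
static uint32_t spread_bits(uint32_t v) {   // 16 -> 32 bits, one gap each
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}
uint32_t morton2d(uint32_t x, uint32_t y) { // requires x, y < 2^16
    return spread_bits(x) | (spread_bits(y) << 1);
}
```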
Nowadays, Big Data security processes require mining large amounts of content that was traditionally not used for security analysis. The RSA algorithm has become the de facto standard for encryption, especially for data sent over the internet, and takes its security from the hardness of the Integer Factorisation Problem. As the size of the modulus of an RSA key grows with the number of bytes to be encrypted, the corresponding linear system to be solved in the adversary's integer factorisation algorithm also grows. In the age of big data this makes it compelling to redesign linear solvers over finite fields so that they exploit the memory hierarchy. To this end, we examine several matrix layouts based on space-filling curves that allow for a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields. The TU algorithm of Dumas and Roche (2002) requires index conversion routines, for which the cost to encode and decode the chosen curve is significant. Using a detailed analysis of the number of bit operations required for the encoding and decoding procedures, and filtering the cost of lookup tables that represent the recursive decomposition of the Hilbert curve, we show that the Morton-hybrid order incurs the least cost for the index conversion routines required throughout the matrix decomposition, as compared to the Hilbert, Peano, or Morton orders. The motivation is that cache-efficient parallel adaptations whose natural sequential evaluation order exhibits a lower cache miss rate run faster overall on parallel machines with private or shared caches and on GPUs.
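A sketch of the Morton-hybrid index conversion whose low cost the analysis favours: Z-order across T x T blocks with plain row-major layout inside each block, so the expensive bit interleaving is applied only to the short block coordinates. The block size T is treated as a free parameter; this is an illustration of the layout, not the paper's routine.

```cpp
#include <cstdint>

// Slow-but-clear 2-d bit interleaver (Z-order / Morton index).
static uint64_t interleave(uint32_t x, uint32_t y) {
    uint64_t z = 0;
    for (int b = 0; b < 32; ++b)
        z |= (uint64_t(x >> b & 1u) << (2 * b)) |
             (uint64_t(y >> b & 1u) << (2 * b + 1));
    return z;
}

// Morton-hybrid layout: Z-order over blocks, row-major within a block.
// Index conversion costs one short interleave plus cheap arithmetic,
// which is why it undercuts full Hilbert/Peano/Morton conversions.
uint64_t morton_hybrid(uint32_t i, uint32_t j, uint32_t T) {
    uint64_t block  = interleave(i / T, j / T);       // Z-order over blocks
    uint64_t offset = uint64_t(i % T) * T + (j % T);  // row-major in block
    return block * (uint64_t(T) * T) + offset;
}
```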
We investigate the design of dynamic programming algorithms in unreliable memories, i.e., in the presence of errors that cause the logical state of some bits to be read differently from how they were last written. Assuming that a limited number of memory faults can be inserted at run-time by an adversary with unbounded computational power, we obtain the first resilient algorithms for a broad range of dynamic programming problems, devising a general framework that can be applied to both iterative and recursive implementations. Besides all local dependency problems, where updates to table entries are determined by the contents of neighboring cells, we also settle challenging non-local problems, such as all-pairs shortest paths and matrix multiplication. All our algorithms are correct with high probability and match the running time of their standard non-resilient counterparts while tolerating a polynomial number of faults. The recursive algorithms are also cache-efficient and can tolerate faults at any level of the memory hierarchy. Our results exploit a careful combination of data replication, majority techniques, fingerprint computations, and lazy fault detection. To cope with the complex data access patterns induced by some of our algorithms, we also devise amplified fingerprints, which might be of independent interest in the design of resilient algorithms for other problems.
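The "data replication + majority" layer of the framework can be illustrated in a few lines: store each logical value in 2f+1 copies and recover it with a majority scan, which tolerates up to f corrupted copies. The Boyer-Moore majority vote below does this in one pass with O(1) extra space; it sketches the general technique only, not the paper's code, and omits the fingerprinting machinery entirely.

```cpp
#include <cstddef>
#include <cstdint>

// Resilient read via replication + majority: with count = 2f + 1
// copies, up to f adversarial corruptions cannot change the value
// recovered by this Boyer-Moore majority scan.
uint64_t majority_read(const uint64_t* copies, std::size_t count) {
    uint64_t cand = copies[0];
    std::size_t votes = 1;
    for (std::size_t i = 1; i < count; ++i) {
        if (copies[i] == cand)   ++votes;              // supports candidate
        else if (votes > 0)      --votes;              // cancels one vote
        else { cand = copies[i]; votes = 1; }          // new candidate
    }
    return cand;  // correct whenever > count/2 copies are uncorrupted
}
```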
We implement and undertake an empirical study of the cache-oblivious variant of the polygon indecomposability testing algorithm of Gao and Lauder, based on a depth-first search (DFS) traversal of the computation tree. According to Abu Salem, the cache-oblivious variant exhibits improved spatial and temporal locality over the original one, and its spatial locality is optimal. Our implementation revolves around eight different variants of the DFS-based algorithm, tailored to assess the trade-offs between computation and memory performance as originally proposed by Abu Salem. We analyse how performance responds to manipulations of the several parameters comprising the input size, describe how to construct suitably random families of input that elicit such variations, and show how to handle redundancies in vector computations at no asymptotic increase in the work and cache complexities. We report extensively on our experimental results. Across all eight variants, the DFS-based algorithm achieves excellent performance in terms of L1 and L2 cache misses as well as total run time when compared to the original variant of Gao and Lauder. We also benchmark the DFS variant against the powerful computer algebra system MAGMA, in the context of bivariate polynomial irreducibility testing using polygons. For sufficiently high-degree polynomials, MAGMA either runs out of memory or fails to terminate after about 4 hours of execution; in contrast, the DFS-based version processes such input in a couple of seconds. In particular, we report on absolute irreducibility testing of bivariate polynomials of total degree reaching 19,000 in about 2 seconds for the DFS variant, using a single processor.
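For readers unfamiliar with the traversal at the heart of the comparison: DFS processes a node's subtree completely before moving on, so nearby subproblems are handled while their data is still cache-resident. The skeleton below shows that traversal with an explicit stack; `Node` and its contents are placeholders standing in for the polygon-decomposition state, which the paper defines.

```cpp
#include <vector>

// Explicit-stack DFS over a computation tree: children are processed
// immediately after their parent, giving the temporal locality the
// DFS-based variants above exploit. No recursion is needed.
struct Node { std::vector<Node*> children; /* per-node state ... */ };

void dfs(Node* root) {
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* v = stack.back();
        stack.pop_back();
        // ... process v's vector computations here ...
        for (Node* c : v->children) stack.push_back(c);
    }
}
```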
In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both parallelism and cache locality. This article proposes one such approach for x86-based architectures that uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads. The authors then use the compiler to exploit SIMD parallelism within each subproblem. Finally, they use autotuning to pick the best parameter values throughout the optimization process. The authors have implemented this approach with the Intel compiler and the newly developed Intel Software Autotuning Tool. Experimental results collected on a dual-socket quad-core Nehalem show that the approach achieves an average speedup of almost 20x over the best serial cases for an important set of computational kernels.
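The divide-then-vectorize recipe can be shown in miniature: recursively split a kernel cache-obliviously, hand subproblems to different cores at the fork points, and let the compiler apply SIMD to the small base case. The sketch below uses a simple scale-and-add kernel; CUTOFF stands in for the kind of parameter the article's autotuner searches over, and none of this is the authors' actual code.

```cpp
#include <cstddef>

constexpr std::size_t CUTOFF = 1024;  // placeholder; autotuned in practice

// Cache-oblivious recursive split with a vectorizable base case. The
// two recursive calls are independent, so a runtime could dispatch
// them to different cores/threads at this fork point.
void scale_add(float* y, const float* x, float a, std::size_t n) {
    if (n <= CUTOFF) {
        #pragma omp simd              // SIMD within the leaf subproblem
        for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
        return;
    }
    std::size_t mid = n / 2;
    scale_add(y, x, a, mid);                    // subproblem 1 (fork)
    scale_add(y + mid, x + mid, a, n - mid);    // subproblem 2
}
```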