We describe the first optimal randomized in-place algorithm for the basic 3-d convex hull problem (and, in particular, for 2-d Voronoi diagrams). The algorithm runs in O(n log n) expected time using only O(1) extra space; this improves the previous O(n log³ n) bound by Brönnimann, Chan, and Chen (2004) [10]. The same approach leads to an optimal randomized in-place algorithm for the 2-d line segment intersection problem, with O(n log n + K) expected running time for output size K, improving the previous O(n log² n + K) bound by Vahrenhold (2007) [42]. As a bonus, we also point out a simplification of a known optimal cache-oblivious (non-in-place) algorithm by Kumar and Ramos (2002) [33] for 3-d convex hulls, and observe its applicability to 2-d segment intersection, extending a recent result for red/blue segment intersection by Arge, Mølhave, and Zeh (2008) [3]. Our results are all obtained by standard random sampling techniques, with some interesting twists.
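The abstract does not spell out the sampling primitive, but randomized in-place algorithms of this kind rest on drawing a random sample without auxiliary storage. Below is a minimal C++ sketch of that building block, a partial Fisher-Yates shuffle that moves a uniform random s-subset to the array prefix in O(s) time and O(1) extra space; it illustrates the standard technique only, not the paper's convex hull algorithm, and `sample_prefix` is a hypothetical name.

```cpp
#include <cstddef>
#include <random>
#include <utility>

// Move a uniformly random s-subset of a[0..n-1] into the prefix
// a[0..s-1] via a partial Fisher-Yates shuffle: O(s) time, O(1) extra
// space. Requires s <= n. This is the kind of in-place sampling step
// that randomized in-place geometric algorithms build on.
template <typename T>
void sample_prefix(T* a, std::size_t n, std::size_t s, std::mt19937& rng) {
    for (std::size_t i = 0; i < s; ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, n - 1);
        std::swap(a[i], a[pick(rng)]);  // a[i] is now a uniform choice
    }
}
```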
ISBN (print): 9781450300797
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, i.e., critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
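To make the low-depth recipe concrete, here is a toy nested-parallel reduction in the style the paper advocates: O(n) work, O(log n) depth, and a natural sequential execution order that scans the input once, giving the optimal O(n/B) sequential cache complexity. The fork-join style via std::async and the GRAIN cutoff are illustrative choices, not the paper's code.

```cpp
#include <cstddef>
#include <future>
#include <numeric>

// Nested-parallel divide-and-conquer sum: the two halves can run in
// parallel (O(log n) depth), while the sequential evaluation order is
// a single left-to-right scan, so its cache-oblivious sequential cache
// complexity is O(n/B). GRAIN is a hypothetical cutoff; the
// asymptotics do not depend on its exact value.
long long reduce(const long long* a, std::size_t n) {
    const std::size_t GRAIN = 4096;
    if (n <= GRAIN) return std::accumulate(a, a + n, 0LL);
    std::size_t mid = n / 2;
    auto left = std::async(std::launch::async, reduce, a, mid); // fork
    long long right = reduce(a + mid, n - mid);
    return left.get() + right;                                  // join
}
```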
Modern computers employ a sophisticated hierarchy of caches to decrease the latency of memory accesses. This has led to the development of cache-oblivious algorithms, which strive to achieve the best possible performance on such memory hierarchies with minimal knowledge of the hierarchy's exact parameters. A common technique in the design of cache-oblivious algorithms is recursion-based divide and conquer. In this work, we show an alternative technique based on Gray codes. We use the binary reflected Gray code to traverse arrays in a cache-oblivious way, allowing us to design algorithms for problems such as matrix transposition, naive matrix multiplication, and naive convolution that match the asymptotic performance of their recursion-based counterparts. The advantage is that our algorithms can be implemented without recursion (or a stack that simulates it) by using a loopless algorithm. We also introduce a variant of the binary reflected Gray code tuned to certain applications of our technique, together with an almost loopless algorithm to generate it. Apart from the theoretical analysis of our technique's performance, we also examine its practical performance on the problem of matrix transposition.
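A minimal sketch of the core primitive: the binary reflected Gray code g(i) = i XOR (i >> 1), generated looplessly so that each step flips exactly one known bit. The `gray_walk` function below is a hypothetical illustration of the recursion-free traversal idea, not the authors' implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Binary reflected Gray code: g(i) = i ^ (i >> 1). Consecutive codes
// differ in exactly one bit, at position ctz(i + 1), so each step is
// O(1) with no recursion or stack -- a loopless traversal. Visiting
// array positions in this order is the primitive behind the
// recursion-free cache-oblivious traversals described above.
// (__builtin_ctz is GCC/Clang; assumes 1 <= logn <= 31.)
void gray_walk(uint32_t logn) {
    uint32_t g = 0;                       // Gray code of i = 0
    for (uint32_t i = 0; i + 1 < (1u << logn); ++i) {
        printf("%u\n", g);                // visit position g
        g ^= 1u << __builtin_ctz(i + 1);  // flip the one changing bit
    }
    printf("%u\n", g);                    // last code
}
```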
We develop an optimal cache-oblivious priority queue data structure, supporting insertion, deletion, and delete-min operations in O((1/B) log_{M/B}(N/B)) amortized memory transfers, where M and B are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cache-oblivious data structure, M and B are not used in the description of the structure. Our structure is as efficient as several previously developed external-memory (cache-aware) priority queue data structures, which all rely crucially on knowledge of M and B. Priority queues are a critical component in many of the best known external-memory graph algorithms, and using our cache-oblivious priority queue we develop several cache-oblivious graph algorithms.
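As a sanity check on optimality, the stated amortized bound reproduces the cache-oblivious sorting bound when the queue is used to sort: N inserts followed by N delete-mins cost

```latex
% 2N operations at the amortized cost per operation give the optimal
% cache-oblivious sorting bound:
\[
  2N \cdot O\!\left(\frac{1}{B}\log_{M/B}\frac{N}{B}\right)
  = O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)
  = \mathrm{Sort}(N).
\]
```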
The hash table, especially its external-memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size b, searching for a particular key takes only t_q = 1 + 1/2^Ω(b) expected average disk accesses for any load factor α bounded away from 1. However, such near-perfect performance is achieved only when b is known and the hash table is specifically tuned to work with that blocking. In this paper we study whether it is possible to build a cache-oblivious hash table that works well with any blocking. Such a hash table would automatically perform well across all levels of the memory hierarchy and would not need any hardware-specific tuning, an important feature in autonomous databases. We first show that linear probing, a classical collision resolution strategy for hash tables, can easily be made cache-oblivious, but it only achieves t_q = 1 + Θ(α/b) even if a truly random hash function is used. Then we demonstrate that the block probing algorithm (Pagh et al. in SIAM Rev. 53(3):547-558, 2011) achieves t_q = 1 + 1/2^Ω(b), thus matching the cache-aware bound, if the following two conditions hold: (a) b is a power of 2; and (b) every block starts at a memory address divisible by b. Note that the two conditions hold on a real machine, although they are not stated in the cache-oblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is t_q = 1 + O(α/b), which is exactly what linear probing achieves.
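To see why linear probing is "easily made cache-oblivious", note that its probe sequence scans contiguous memory, so any blocking serves it reasonably well. The sketch below is a bare-bones linear probing table illustrating that access pattern; the multiplicative hash and all names are placeholders, and this is not the block probing algorithm of Pagh et al.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear probing over a flat array: the probe sequence h(k), h(k)+1,
// ... scans contiguous memory, so under ANY blocking few probes fall
// outside the first block touched -- the reason linear probing is
// trivially cache-oblivious, achieving t_q = 1 + Theta(alpha/b) as
// discussed above. Assumes load factor strictly below 1.
struct LinearProbeTable {
    static constexpr uint64_t EMPTY = ~0ull;     // sentinel, non-key
    std::vector<uint64_t> slot;
    explicit LinearProbeTable(std::size_t cap) : slot(cap, EMPTY) {}
    std::size_t hash(uint64_t k) const {         // stand-in hash; the
        return (k * 0x9e3779b97f4a7c15ull) % slot.size();
    }                                            // analysis assumes a
    void insert(uint64_t k) {                    // truly random function
        std::size_t i = hash(k);
        while (slot[i] != EMPTY) i = (i + 1) % slot.size();
        slot[i] = k;
    }
    bool contains(uint64_t k) const {
        for (std::size_t i = hash(k); ; i = (i + 1) % slot.size()) {
            if (slot[i] == k) return true;
            if (slot[i] == EMPTY) return false;
        }
    }
};
```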
Many-core systems with a rapidly increasing number of cores pose a significant challenge for parallel applications seeking to use their complex memory hierarchies efficiently. Many such applications rely on collective communications in performance-critical phases, which become a bottleneck if they are not optimized. We address this issue by proposing cache-oblivious algorithms for MPI_Alltoall, MPI_Allgather, and the MPI neighborhood collectives that exploit data locality. To implement the cache-oblivious algorithms, we allocate the send and receive buffers on a shared heap and use Morton order to guide the memory copies. Our analysis shows that our algorithm for MPI_Alltoall is asymptotically optimal. We show an extension of our algorithms that minimizes communication distance on NUMA systems while maintaining optimality within each socket. We further demonstrate how the cache-oblivious algorithms can be applied to multi-node machines. Experiments are conducted on different many-core architectures. For MPI_Alltoall, our implementation achieves on average a 1.40x speedup over the naive shared-heap implementation for small and medium block sizes (less than 16 KB) on a Xeon Phi KNC, a 3.03x average speedup over MVAPICH2 on a Xeon E7-8890, and a 2.23x average speedup over MVAPICH2 on a 256-node Xeon E5-2680 cluster for block sizes less than 1 KB.
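For reference, "Morton order" here means visiting a 2-d index space in the order given by interleaving the coordinate bits, so that nearby steps touch nearby memory. A standard 16-bit-per-coordinate encoder is sketched below; it illustrates the ordering the memory copies follow and is not code from the paper.

```cpp
#include <cstdint>

// Interleave the bits of (x, y) into the Morton (Z-order) index.
// Copying buffers in increasing Morton order keeps the regions touched
// by nearby steps close in memory, which is how the collectives above
// exploit locality.
static uint32_t spread_bits(uint32_t v) {   // 16 -> 32 bits, one gap each
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}
uint32_t morton2d(uint32_t x, uint32_t y) { // requires x, y < 2^16
    return spread_bits(x) | (spread_bits(y) << 1);
}
```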
Nowadays, Big Data security processes require mining large amounts of content that was traditionally not used for security analysis. The RSA algorithm has become the de facto standard for encryption, especially for data sent over the internet, and takes its security from the hardness of the Integer Factorisation Problem. As the size of the modulus of an RSA key grows with the number of bytes to be encrypted, the corresponding linear system to be solved in the adversary's integer factorisation algorithm also grows. In the age of big data this makes it compelling to redesign linear solvers over finite fields so that they exploit the memory hierarchy. To this end, we examine several matrix layouts based on space-filling curves that allow for a cache-oblivious adaptation of parallel TU decomposition for rectangular matrices over finite fields. The TU algorithm of Dumas and Roche (2002) requires index conversion routines, for which the cost to encode and decode the chosen curve is significant. Using a detailed analysis of the number of bit operations required for the encoding and decoding procedures, and filtering the cost of lookup tables that represent the recursive decomposition of the Hilbert curve, we show that the Morton-hybrid order incurs the least cost for the index conversion routines required throughout the matrix decomposition, as compared to the Hilbert, Peano, or Morton orders. The motivation is that cache-efficient parallel adaptations whose natural sequential evaluation order exhibits a lower cache miss rate run faster overall on parallel machines with private or shared caches and on GPUs.
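A sketch of the Morton-hybrid index conversion whose low cost the analysis favours: Z-order across T x T blocks with plain row-major layout inside each block, so the expensive bit interleaving is applied only to the short block coordinates. The block size T is treated as a free parameter; this is an illustration of the layout, not the paper's routine.

```cpp
#include <cstdint>

// Slow-but-clear 2-d bit interleaver (Z-order / Morton index).
static uint64_t interleave(uint32_t x, uint32_t y) {
    uint64_t z = 0;
    for (int b = 0; b < 32; ++b)
        z |= (uint64_t(x >> b & 1u) << (2 * b)) |
             (uint64_t(y >> b & 1u) << (2 * b + 1));
    return z;
}

// Morton-hybrid layout: Z-order over blocks, row-major within a block.
// Index conversion costs one short interleave plus cheap arithmetic,
// which is why it undercuts full Hilbert/Peano/Morton conversions.
uint64_t morton_hybrid(uint32_t i, uint32_t j, uint32_t T) {
    uint64_t block  = interleave(i / T, j / T);       // Z-order over blocks
    uint64_t offset = uint64_t(i % T) * T + (j % T);  // row-major in block
    return block * (uint64_t(T) * T) + offset;
}
```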
We investigate the design of dynamic programming algorithms in unreliable memories, i.e., in the presence of errors that cause the logical state of some bits to be read differently from how they were last written. Assuming that a limited number of memory faults can be inserted at run-time by an adversary with unbounded computational power, we obtain the first resilient algorithms for a broad range of dynamic programming problems, devising a general framework that can be applied to both iterative and recursive implementations. Besides all local dependency problems, where updates to table entries are determined by the contents of neighboring cells, we also settle challenging non-local problems, such as all-pairs shortest paths and matrix multiplication. All our algorithms are correct with high probability and match the running time of their standard non-resilient counterparts while tolerating a polynomial number of faults. The recursive algorithms are also cache-efficient and can tolerate faults at any level of the memory hierarchy. Our results exploit a careful combination of data replication, majority techniques, fingerprint computations, and lazy fault detection. To cope with the complex data access patterns induced by some of our algorithms, we also devise amplified fingerprints, which might be of independent interest in the design of resilient algorithms for other problems.
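The "data replication + majority" layer of the framework can be illustrated in a few lines: store each logical value in 2f+1 copies and recover it with a majority scan, which tolerates up to f corrupted copies. The Boyer-Moore majority vote below does this in one pass with O(1) extra space; it sketches the general technique only, not the paper's code, and omits the fingerprinting machinery entirely.

```cpp
#include <cstddef>
#include <cstdint>

// Resilient read via replication + majority: with count = 2f + 1
// copies, up to f adversarial corruptions cannot change the value
// recovered by this Boyer-Moore majority scan.
uint64_t majority_read(const uint64_t* copies, std::size_t count) {
    uint64_t cand = copies[0];
    std::size_t votes = 1;
    for (std::size_t i = 1; i < count; ++i) {
        if (copies[i] == cand)   ++votes;              // supports candidate
        else if (votes > 0)      --votes;              // cancels one vote
        else { cand = copies[i]; votes = 1; }          // new candidate
    }
    return cand;  // correct whenever > count/2 copies are uncorrupted
}
```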
We implement and undertake an empirical study of the cache-oblivious variant of the polygon indecomposability testing algorithm of Gao and Lauder, based on a depth-first search (DFS) traversal of the computation tree. According to Abu Salem, the cache-oblivious variant exhibits improved spatial and temporal locality over the original one, and its spatial locality is optimal. Our implementation revolves around eight different variants of the DFS-based algorithm, tailored to assess the trade-offs between computation and memory performance as originally proposed by Abu Salem. We analyse how performance responds to manipulations of the several parameters comprising the input size, describe how to construct suitably random families of input that elicit such variations, and show how to handle redundancies in vector computations at no asymptotic increase in the work and cache complexities. We report extensively on our experimental results. Across all eight variants, the DFS-based algorithm achieves excellent performance in terms of L1 and L2 cache misses as well as total run time when compared to the original variant of Gao and Lauder. We also benchmark the DFS variant against the powerful computer algebra system MAGMA, in the context of bivariate polynomial irreducibility testing using polygons. For sufficiently high-degree polynomials, MAGMA either runs out of memory or fails to terminate after about 4 hours of execution; in contrast, the DFS-based version processes such input in a couple of seconds. In particular, we report on absolute irreducibility testing of bivariate polynomials of total degree reaching 19,000 in about 2 seconds for the DFS variant, using a single processor.
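For readers unfamiliar with the traversal at the heart of the comparison: DFS processes a node's subtree completely before moving on, so nearby subproblems are handled while their data is still cache-resident. The skeleton below shows that traversal with an explicit stack; `Node` and its contents are placeholders standing in for the polygon-decomposition state, which the paper defines.

```cpp
#include <vector>

// Explicit-stack DFS over a computation tree: children are processed
// immediately after their parent, giving the temporal locality the
// DFS-based variants above exploit. No recursion is needed.
struct Node { std::vector<Node*> children; /* per-node state ... */ };

void dfs(Node* root) {
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* v = stack.back();
        stack.pop_back();
        // ... process v's vector computations here ...
        for (Node* c : v->children) stack.push_back(c);
    }
}
```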
In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both parallelism and cache locality. This article proposes one such approach for x86-based architectures that uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads. The authors then use the compiler to exploit SIMD parallelism within each subproblem. Finally, they use autotuning to pick the best parameter values throughout the optimization process. The authors have implemented this approach with the Intel compiler and the newly developed Intel Software Autotuning Tool. Experimental results collected on a dual-socket quad-core Nehalem show that the approach achieves an average speedup of almost 20x over the best serial cases for an important set of computational kernels.
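The divide-then-vectorize recipe can be shown in miniature: recursively split a kernel cache-obliviously, hand subproblems to different cores at the fork points, and let the compiler apply SIMD to the small base case. The sketch below uses a simple scale-and-add kernel; CUTOFF stands in for the kind of parameter the article's autotuner searches over, and none of this is the authors' actual code.

```cpp
#include <cstddef>

constexpr std::size_t CUTOFF = 1024;  // placeholder; autotuned in practice

// Cache-oblivious recursive split with a vectorizable base case. The
// two recursive calls are independent, so a runtime could dispatch
// them to different cores/threads at this fork point.
void scale_add(float* y, const float* x, float a, std::size_t n) {
    if (n <= CUTOFF) {
        #pragma omp simd              // SIMD within the leaf subproblem
        for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
        return;
    }
    std::size_t mid = n / 2;
    scale_add(y, x, a, mid);                    // subproblem 1 (fork)
    scale_add(y + mid, x + mid, a, n - mid);    // subproblem 2
}
```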