Search Results - Inner Mongolia University Library

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

SIAM REVIEW, 2009, Vol. 51, No. 1, pp. 129-159

Authors: Datta, Kaushik; Kamil, Shoaib; Williams, Samuel; Oliker, Leonid; Shalf, John; Yelick, Katherine. Affiliations: Univ Calif Berkeley Dept Comp Sci Berkeley CA 94720 USA; Univ Calif Berkeley Lawrence Berkeley Lab NERSC CRD Berkeley CA 94720 USA

Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multicore design of the Cell processor, a machine with an explicitly managed memory hierarchy. Overall, our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak while the best competing cache-based processor achieves only 54% of algorithmic peak performance.

Keywords: stencil computations; cache blocking; time skewing; cache-oblivious algorithms; performance modeling; performance evaluation; Intel Itanium2; AMD Opteron; IBM Power5; STI Cell
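
As a rough illustration of the cache-blocking idea named in this abstract, the sketch below tiles one sweep of a 2-D 5-point Jacobi stencil. It is not the paper's code: the choice of stencil, the function names, and the tile sizes `bi` and `bj` are assumptions made for illustration, and pure Python shows only the loop structure, not the cache effect.

```python
def jacobi_sweep(src, dst):
    """One unblocked sweep of a 5-point stencil over the grid interior."""
    n, m = len(src), len(src[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                + src[i][j - 1] + src[i][j + 1])


def jacobi_sweep_blocked(src, dst, bi, bj):
    """Same sweep, visiting the interior in bi x bj tiles (hypothetical sizes)."""
    n, m = len(src), len(src[0])
    for ii in range(1, n - 1, bi):          # tile origin, rows
        for jj in range(1, m - 1, bj):      # tile origin, columns
            # Work inside one tile so its rows can stay resident in cache.
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, m - 1)):
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                        + src[i][j - 1] + src[i][j + 1])
```

Both functions compute the same values; only the traversal order differs, which is the property a blocking transformation has to preserve.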

Cache-oblivious selection in sorted X+Y matrices

INFORMATION PROCESSING LETTERS, 2008, Vol. 109, No. 2, pp. 87-92

Authors: de Berg, Mark; Thite, Shripad. Affiliations: Tech Univ Eindhoven Dept Comp Sci Eindhoven Netherlands; CALTECH Ctr Math Informat IST Pasadena CA 91125 USA

Let X[0 .. n - 1] and Y[0 .. m - 1] be two sorted arrays, and define the m x n matrix A by A[j][i] = X[i] + Y[j]. Frederickson and Johnson [G.N. Frederickson, D.B. Johnson, Generalized selection and ranking: Sorted matrices, SIAM J. Computing 13 (1984) 14-30] gave an efficient algorithm for selecting the k-th smallest element from A. We show how to make this algorithm IO-efficient. Our cache-oblivious algorithm performs O((m + n)/B) IOs, where B is the block size of memory transfers. (C) 2008 Elsevier B.V. All rights reserved.

Keywords: algorithms; cache-oblivious algorithms; Matrix selection
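
To make the implicit matrix A[j][i] = X[i] + Y[j] concrete, here is a minimal sketch of selection that never materializes A. It is not the Frederickson-Johnson algorithm or the paper's cache-oblivious variant: it swaps in a simpler technique (a staircase count inside a binary search over values), assumes integer inputs, and uses hypothetical function names. Each counting pass scans both arrays once, but the binary search repeats that pass, so the total cost exceeds the bound stated in the abstract.

```python
def count_at_most(X, Y, t):
    """Count pairs (i, j) with X[i] + Y[j] <= t, for ascending X and Y."""
    count = 0
    i = len(X) - 1                  # start from the largest X
    for y in Y:                     # Y ascends, so the valid prefix of X only shrinks
        while i >= 0 and X[i] + y > t:
            i -= 1
        count += i + 1              # X[0..i] all pair with y to give a sum <= t
    return count


def kth_smallest(X, Y, k):
    """k-th smallest (1-indexed) entry of the implicit X+Y matrix; integers assumed."""
    lo, hi = X[0] + Y[0], X[-1] + Y[-1]
    while lo < hi:
        mid = (lo + hi) // 2
        if count_at_most(X, Y, mid) >= k:
            hi = mid                # at least k entries are <= mid
        else:
            lo = mid + 1
    return lo
```

For X = [1, 3, 5] and Y = [2, 4] the sorted sums are 3, 5, 5, 7, 7, 9, so `kth_smallest(X, Y, 4)` returns 7.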

Optimizing graph algorithms for improved cache performance

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2004, Vol. 15, No. 9, pp. 769-782

Authors: Park, JS; Penner, M; Prasanna, VK. Affiliations: Univ Calif Los Angeles Dept Comp Sci Los Angeles CA 90095 USA; Univ So Calif Dept Elect Engn Los Angeles CA 90089 USA

In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall Algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of Omega(N^3/sqrt(C)), where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation shows more than six times the improvement in real execution time over that of the iterative implementation with the usual row major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to two times the improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by using the technique of making the algorithm initially work on subproblems to generate a suboptimal solution and, then, solving the whole problem using the suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.

Keywords: cache-friendly algorithms; cache-oblivious algorithms; graph algorithms; shortest path; minimum spanning trees; graph matching; data layout optimizations; algorithm performance
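
The adjacency-array representation mentioned in this abstract can be sketched as follows. This is one common flat layout (an offsets array plus parallel target and weight arrays), paired with a textbook heap-based Dijkstra; the function names and the exact layout are assumptions for illustration, not the authors' implementation.

```python
import heapq


def build_adjacency_arrays(n, edges):
    """Pack directed weighted edges (u, v, w) into flat, contiguous arrays."""
    offsets = [0] * (n + 1)
    for u, _, _ in edges:
        offsets[u + 1] += 1                 # out-degree of u
    for u in range(n):
        offsets[u + 1] += offsets[u]        # prefix sums give each vertex's slice
    targets = [0] * len(edges)
    weights = [0] * len(edges)
    cursor = offsets[:-1]                   # slicing copies: next free slot per vertex
    for u, v, w in edges:
        targets[cursor[u]] = v
        weights[cursor[u]] = w
        cursor[u] += 1
    return offsets, targets, weights


def dijkstra(n, offsets, targets, weights, source):
    """Single-source shortest paths; neighbours of u are one contiguous slice."""
    dist = [float("inf")] * n
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                        # stale heap entry
        for e in range(offsets[u], offsets[u + 1]):
            v, nd = targets[e], d + weights[e]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Compared with linked adjacency lists, the neighbours of each vertex sit next to each other in memory, so scanning them touches consecutive addresses.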

Parallel Minimum Cuts in Near-linear Work and Low Depth

ACM TRANSACTIONS ON PARALLEL COMPUTING, 2021, Vol. 8, No. 2, pp. 1-20

Authors: Geissmann, Barbara; Gianinazzi, Lukas. Affiliation: Swiss Fed Inst Technol Dept Comp Sci Univ Str 6 CAB Zurich Switzerland

We present the first near-linear work and poly-logarithmic depth algorithm for computing a minimum cut in an undirected graph. Previous parallel algorithms with poly-logarithmic depth required at least quadratic work in the number of vertices. In a graph with n vertices and m edges, our randomized algorithm computes the minimum cut with high probability in O(m log^4 n) work and O(log^3 n) depth. This result is obtained by parallelizing a data structure that aggregates weights along paths in a tree, in addition to exploiting the connection between minimum cuts and approximate maximum packings of spanning trees. In addition, our algorithm improves upon bounds on the number of cache misses incurred to compute a minimum cut.

Keywords: Minimum cut; graph algorithms; minimum path data structure; parallel algorithms; cache-oblivious algorithms

Pruning spanners and constructing well-separated pair decompositions in the presence of memory hierarchies

JOURNAL OF DISCRETE ALGORITHMS, 2010, Vol. 8, No. 3, pp. 259-272

Authors: Gieseke, Fabian; Gudmundsson, Joachim; Vahrenhold, Jan. Affiliations: Tech Univ Dortmund Fac Comp Sci LS 11 D-44227 Dortmund Germany; NICTA ATP Sydney NSW 2015 Australia

Given a geometric graph G = (S, E) in R^d with constant dilation t, and a positive constant epsilon, we show how to construct a (1 + epsilon)-spanner of G with O(|S|) edges using O(sort(|E|)) memory transfers in the cache-oblivious model of computation. The main building block of our algorithm, and of independent interest in itself, is a new cache-oblivious algorithm for constructing a well-separated pair decomposition which builds such a data structure for a given point set S ⊂ R^d using O(sort(|S|)) memory transfers. (C) 2010 Elsevier B.V. All rights reserved.

Keywords: External-memory algorithms; cache-oblivious algorithms; Geometric graphs; Spanners; Well-separated pair decomposition

Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures

22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Li, Shigang; Zhang, Yunquan; Hoefler, Torsten. Affiliations: Chinese Acad Sci Inst Comp Technol SKL Comp Architecture Beijing Peoples R China; Swiss Fed Inst Technol Dept Comp Sci Zurich Switzerland

ISBN: (print) 9781450344937

In the many-core era, the performance of MPI collectives is more dependent on the intra-node communication component. However, the communication algorithms generally inherit from the inter-node version and ignore the cache complexity. We propose cache-oblivious algorithms for MPI all-to-all operations, in which data blocks are copied into the receive buffers in Morton order to exploit data locality. Experimental results on different many-core architectures show that our cache-oblivious implementations significantly outperform the naive implementations based on shared heap and the highly optimized MPI libraries.

Keywords: cache-oblivious algorithms; MPI_Alltoall; many-core
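
The Morton (Z-order) traversal this abstract refers to can be illustrated with a small bit-interleaving sketch. The mapping of grid coordinates to (sender, receiver) block pairs, the function names, and the `bits` parameter are assumptions for illustration; the sketch shows only the visiting order, not the authors' MPI implementation.

```python
def morton_index(row, col, bits=16):
    """Interleave the low `bits` bits of row and col into one Z-order key."""
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)        # column bits go to even positions
        z |= ((row >> b) & 1) << (2 * b + 1)    # row bits go to odd positions
    return z


def morton_order(p):
    """All (row, col) cells of a p x p block grid, sorted by Morton index."""
    cells = [(r, c) for r in range(p) for c in range(p)]
    return sorted(cells, key=lambda rc: morton_index(rc[0], rc[1]))
```

For a 4 x 4 grid the order starts (0, 0), (0, 1), (1, 0), (1, 1), then moves to the neighbouring 2 x 2 quadrant, so consecutive copies tend to touch nearby rows and columns at every scale.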

Sorting with Asymmetric Read and Write Costs

27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)

Authors: Blelloch, Guy E.; Fineman, Jeremy T.; Gibbons, Phillip B.; Gu, Yan; Shun, Julian. Affiliations: Carnegie Mellon Univ Pittsburgh PA 15213 USA; Georgetown Univ Washington DC 20057 USA; Intel Labs Santa Clara CA USA; CMU Pittsburgh PA USA

ISBN: (print) 9781450335881

Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in O(n) writes, O(n log n) reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges k > 1 for writing a block of size B to the secondary memory, and present variants of three EM sorting algorithms (multi-way mergesort, sample sort, and heapsort using buffer trees) that asymptotically reduce the number of writes over the original algorithms, and perform roughly k block reads for every block write. Finally, we define a variant of the Ideal-cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield parallel cache complexity bounds for machines with private caches or with a shared cache, respectively.

Keywords: Sorting; asymmetric read-write costs; non-volatile memory; persistent memory; write-efficient; write-avoiding; parallel algorithms; cache-oblivious algorithms; external memory model; mergesort; sample sort; I/O buffer tree; FFT; matrix multiplication

Towards Many-Core Implementation of LU Decomposition using Peano Curves

6th ACM International Conference on Computing Frontiers and Workshops

Authors: Heinecke, Alexander; Bader, Michael. Affiliation: Tech Univ Munich Dept Informat D-80290 Munich Germany

ISBN: (print) 9781605585574

We present our recent research on cache-oblivious algorithms and implementations of parallel LU decomposition on shared-memory multi- and manycore platforms. Our approach uses a block-recursive matrix storage scheme based on space-filling curves, and thus extends our work presented at CF'08 [9]. The data structure is based on Peano curves, and is separated into a coarse-grain recursive block-matrix scheme, and a fine-grain iterative order for the elementary matrix blocks. The block element order is derived from the recursive construction of a Peano space-filling curve. The block matrices are stored in ordinary row-major order, and form elementary data types for the block operations. The block size is chosen to perfectly fit the lowest-level data cache in the CPU's cache hierarchy. All matrix operations on this two-level data structure are implemented via routines working on block matrices as operands, and are optimised in assembler to exploit the SIMD capacities of the CPUs. For parallelisation on shared memory platforms, we compare two different OpenMP implementations: one based on OpenMP 2.0, which requires explicit scheduling of the block operations to processor cores, and an implementation that exploits the new task concept in OpenMP 3.0. Performance tests on various platforms, ranging from desktop systems to an SGI Altix supercomputer, showed that our implementation 'TifaMMy' optimises the use of the available memory hardware by reducing the bandwidth requirements. Hence, the cache-oblivious approach of TifaMMy is also efficient in the context of multi- and manycore environments. We also demonstrated that the OpenMP 3.0 task concept can lead to both well-structured implementations and competitive parallel efficiency for block-recursive, cache-oblivious algorithms.

Keywords: Matrix multiplication; LU decomposition; multicore parallelisation; space-filling curve; cache-oblivious algorithms; OpenMP; matrix data structures

Closing the Gap Between Cache-oblivious and Cache-adaptive Analysis

32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)

Authors: Bender, Michael A.; Chowdhury, Rezaul A.; Das, Rathish; Johnson, Rob; Kuszmaul, William; Lincoln, Andrea; Liu, Quanquan C.; Lynch, Jayson; Xu, Helen. Affiliations: SUNY Stony Brook Stony Brook NY 11794 USA; VMware Res Palo Alto CA USA; MIT CSAIL 77 Massachusetts Ave Cambridge MA 02139 USA

ISBN: (print) 9781450369350

Cache-adaptive analysis was introduced to analyze the performance of an algorithm when the cache (or internal memory) available to the algorithm dynamically changes size. These memory-size fluctuations are, in fact, the common case in multi-core machines, where threads share cache and RAM. An algorithm is said to be efficiently cache-adaptive if it achieves optimal utilization of the dynamically changing cache. Cache-adaptive analysis was inspired by cache-oblivious analysis. Many (or even most) optimal cache-oblivious algorithms have an (a, b, c)-regular recursive structure. Such (a, b, c)-regular algorithms include Longest Common Subsequence, All Pairs Shortest Paths, Matrix Multiplication, Edit Distance, Gaussian Elimination Paradigm, etc. Bender et al. (2016) showed that some of these optimal cache-oblivious algorithms remain optimal even when cache changes size dynamically, but that in general they can be as much as a logarithmic factor away from optimal. However, their analysis depends on constructing a highly structured, worst-case memory profile, or sequences of fluctuations in cache size. These worst-case profiles seem fragile, suggesting that the logarithmic gap may be an artifact of an unrealistically powerful adversary. We close the gap between cache-oblivious and cache-adaptive analysis by showing how to make a smoothed analysis of cache-adaptive algorithms via random reshuffling of memory fluctuations. Remarkably, we also show the limits of several natural forms of smoothing, including random perturbations of the cache size and randomizing the algorithm's starting time. Nonetheless, we show that if one takes an arbitrary profile and performs a random shuffle on when "significant events" occur within the profile, then the shuffled profile becomes optimally cache-adaptive in expectation, even when the initial profile is adversarially constructed. These results suggest that cache-obliviousness is a solid foundation for achieving cache-adaptivity when the me

Keywords: cache-adaptive algorithms; smoothed analysis; cache-oblivious algorithms

Optimal In-Place Algorithms for 3-d Convex Hulls and 2-d Segment Intersection

25th Annual Symposium on Computational Geometry

Authors: Chan, Timothy M.; Chen, Eric Y. Affiliation: Univ Waterloo Sch Comp Sci Waterloo ON N2L 3G1 Canada

ISBN: (print) 9781605585017

We describe the first optimal randomized in-place algorithm for the basic 3-d convex hull problem (and, in particular, for 2-d Voronoi diagrams). The algorithm runs in O(n log n) expected time using only O(1) extra space; this improves the previous O(n log^3 n) bound by Bronnimann, Chan, and Chen [SoCG'04]. The same approach leads to an optimal randomized in-place algorithm for the 2-d line segment intersection problem, with O(n log n + K) expected running time for output size K, improving the previous O(n log^2 n + K) bound by Vahrenhold [WADS'05]. As a bonus, we also point out a simplification of a known optimal cache-oblivious (non-in-place) algorithm by Kumar and Ramos (2002) for 3-d convex hulls, and observe its applicability to 2-d segment intersection, extending a recent result for red/blue segment intersection by Arge, Molhave, and Zeh [ESA'08]. Our results are all obtained by standard random sampling techniques, with some interesting twists.

Keywords: In-place algorithms; convex hulls; Voronoi diagrams; segment intersection; cache-oblivious algorithms
