Search Results - Inner Mongolia University Library

Maximizing Communication-Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2017, Vol. 45, No. 6, pp. 1390-1416

Authors: Barigou, Youcef; Gabriel, Edgar (Univ Houston, Dept Comp Sci, Houston, TX 77204, USA)

Non-blocking collective communication operations extend the concept of collective operations by offering the additional benefit of being able to overlap communication and computation. They are often considered key building blocks for scaling applications to very large process counts. Yet, using non-blocking collective operations in real-world applications is non-trivial. Application codes often have to be restructured significantly in order to maximize the communication-computation overlap. This paper presents an approach to maximize the communication-computation overlap for hybrid OpenMP/MPI applications. The work leverages automatic parallelization by extending the ability of an existing tool to utilize non-blocking collective operations. It further integrates run-time auto-tuning techniques of non-blocking collective operations, optimizing both, the algorithms used for the non-blocking collective operations as well as location and frequency of accompanying progress function calls. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved a performance improvement of up to 43% compared to the version using blocking communication operations, and up to 95% of the maximum theoretical communication-computation overlap identified for each scenario.

Keywords: Non-blocking collective operations; communication-computation overlap; Auto-tuning; MPI; OpenMP

MPI-aware Compiler Optimizations for Improving Communication-Computation Overlap

ACM SIGARCH International Conference on Supercomputing

Authors: Danalis, Anthony; Pollock, Lori; Swany, Martin; Cavazos, John (Univ Tennessee, Knoxville, TN 37996, USA)

ISBN: (print) 9781605584980

Several existing compiler transformations can help improve communication-computation overlap in MPI applications. However, traditional compilers treat calls to the MPI library as a black box with unknown side effects and thus miss potential optimizations. This paper's contributions enable the development of an MPI-aware optimizing compiler that can perform transformations exploiting knowledge of MPI call effects to increase communication-computation overlap. We formulate a set of data flow equations and rules to describe the side effects of key MPI functions so an MPI-aware compiler can automatically assess the safety of transformations. After categorizing existing compiler transformations based on their effect on the application code, we present an optimization algorithm that specifies when and how to apply these optimizing transformations to achieve improved communication-computation overlap. By manually applying the optimization algorithm to kernels extracted from HYCOM and the NAS benchmarks, we show that even when transforming these highly optimized codes, execution time can be decreased by an average of over 30%.

Keywords: mpi-aware compiler optimizations; data flow analysis; communication-computation overlap

The Impact of Application's Micro-Imbalance on the Communication-Computation Overlap

19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP)

Authors: Subotic, Vladimir; Carlos Sancho, Jose; Labarta, Jesus; Valero, Mateo (Barcelona Supercomp Ctr, Barcelona, Spain; Univ Politecn Cataluna, Barcelona Supercomp Ctr, Barcelona, Spain)

ISBN: (print) 9780769543284

Although the community sees overlapping communication and computation as a perspective avenue for advancing parallel execution, it remains unclear what type of applications, under which conditions, and to which extent could benefit from this technique. To tackle this issue, we designed a simulation environment that allowed us to profoundly study overlap. We found out that overlapping potential in an application is determined by the application's parallel behavior and the pattern by which each process locally produces/consumes data involved in communication. We identified two behaviors that directly influence the application's overlapping potential - we name them microscopic imbalance of computation and microscopic imbalance of communication. In an application that expresses some of these two behaviors, a fine-grain overlapping technique can achieve a significant execution speedup, a speedup that can even be higher than 2. We believe that our findings can help a programmer estimate how much his application could benefit from overlap, and therefore decide whether implementing that technique is worth the effort.

Keywords: communication-computation overlap; MPI

A framework for characterizing overlap of communication and computation in parallel applications

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2008, Vol. 11, No. 1, pp. 75-90

Authors: Shet, Aniruddha G.; Sadayappan, P.; Bernholdt, David E.; Nieplocha, Jarek; Tipparaju, Vinod (Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210, USA; Oak Ridge Natl Lab, Comp Sci & Math Div, Oak Ridge, TN 37831, USA; Pacific NW Natl Lab, Appl Comp Sci Grp, Richland, WA 99352, USA)

Effective overlap of computation and communication is a well understood technique for latency hiding and can yield significant performance gains for applications on high-end computers. In this paper, we propose an instrumentation framework for message-passing systems to characterize the degree of overlap of communication with computation in the execution of parallel applications. The inability to obtain precise time-stamps for pertinent communication events is a significant problem, and is addressed by generation of minimum and maximum bounds on achieved overlap. The overlap measures can aid application developers and system designers in investigating scalability issues. The approach has been used to instrument two MPI implementations as well as the ARMCI system. The implementation resides entirely within the communication library and thus integrates well with existing approaches that operate outside the library. The utility of the framework is demonstrated by analyzing communication-computation overlap for micro-benchmarks and the NAS benchmarks, and the insights obtained are used to modify the NAS SP benchmark, resulting in improved overlap.

Keywords: communication-computation overlap; latency hiding; performance instrumentation and monitoring; parallel applications

A Compiler Transformation to Overlap Communication with Dependent Computation

2015 9th International Conference on Partitioned Global Address Space Programming Models (PGAS)

Authors: Murthy, Karthik; Mellor-Crummey, John (Rice Univ, Houston, TX 77251, USA)

ISBN: (print) 9781509001859

Hiding communication latency is essential to achieve scalable performance on current and future parallel systems. In this extended abstract, we present a novel compiler transformation that overlaps communication with computation to hide communication latency. Unlike prior work, we are able to achieve this overlap even in the presence of an overlap-inhibiting data dependence between the communication and computation. We do so by transforming the data dependence into an overlap-amenable one. To achieve this overlap, the Maunam compiler transforms the code by employing array expansion, partial loop peeling, loop alignment, and array contraction. This transformation is useful for optimization of systolic, communication avoiding algorithms.

Keywords: Compilers; communication-computation overlap; loop alignment; parallel code generation

A framework for characterizing overlap of communication and computation in parallel applications

IEEE International Conference on Cluster Computing

Keywords: communication-computation overlap; latency hiding; performance instrumentation and monitoring; parallel applications

IMB-ASYNC: a revised method and benchmark to estimate MPI-3 asynchronous progress efficiency

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2022, Vol. 25, No. 4, pp. 2683-2697

Author: Medvedev, Alexey V. (Lomonosov Moscow State Univ, Inst Mech, Moscow, Russia)

The article presents design and methodology of a novel benchmark suite named IMB-ASYNC. The presented suite and method are aimed at measuring and comparing practical communication-computation overlap levels for Message Passing Interface standard (MPI) implementations with a special accent to some applicable use cases. Some typical MPI communication patterns implying communication-computation overlap are analyzed, and their reflection on a benchmark structure is proposed. We also analyze the previous works on overlap benchmarking and their best practices. We present a new benchmarking approach for non-blocking neighborhood collectives overlap and clarify the overlap estimation methodology. After a short overview of some technical details of currently available MPI asynchronous progress implementations, two benchmarking case studies are presented to illustrate the relevance of the methodology.

Keywords: HPC clusters; MPI; Asynchronous progress; MPI non-blocking; communication-computation overlap

Optimal orthogonal tiling of 2-D iterations

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1997, Vol. 45, No. 2, pp. 159-165

Authors: Andonov, R; Rajopadhye, S (LIMAV, Valenciennes, France; IRISA, Rennes, France)

Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to uniform dependency computations with two-dimensional, parallelogram-shaped iteration domain which can be tiled with lines parallel to the domain boundaries. The target architecture is a linear array (or a ring). Our model is developed in two steps. We first abstract each tile by two simple parameters, namely tile period P_t and intertile latency L_t. We formulate and partially resolve the corresponding optimization problem independent of the machine and program. Next, we refine the model with realistic machine and program parameters, yielding a discrete nonlinear optimization problem. We solve this analytically, yielding a closed form solution, which can be used by a compiler before code generation. (C) 1997 Academic Press.

Keywords: coarse grain pipelining; SPMD programs; loop blocking; nonlinear optimization; communication-computation overlap; supernode partitioning; automatic parallelization; macro-systolic arrays

Static tiling for heterogeneous computing platforms

PARALLEL COMPUTING, 1999, Vol. 25, No. 5, pp. 547-568

Authors: Boulet, P; Dongarra, J; Vivien, F (Ecole Normale Super Lyon, LIP, F-69364 Lyon 07, France; Univ Lille 1, LIFL, F-59655 Villeneuve Dascq, France; Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996, USA; Oak Ridge Natl Lab, Math Sci Sect, Oak Ridge, TN 37831, USA; Univ Strasbourg, ICPS, F-67400 Illkirch Graffenstaden, France)

In the framework of fully permutable loops, tiling has been extensively studied as a source-to-source program transformation. However, little work has been devoted to the mapping and scheduling of the tiles on physical processors. Moreover, targeting heterogeneous computing platforms has to the best of our knowledge, never been considered. In this paper we extend static tiling techniques to the context of limited computational resources with different-speed processors. In particular, we present efficient scheduling and mapping strategies that are asymptotically optimal. The practical usefulness of these strategies is fully demonstrated by MPI experiments on a heterogeneous network of workstations. (C) 1999 Elsevier Science B.V. All rights reserved.

Keywords: tiling; communication-computation overlap; mapping; limited resources; different-speed processors; heterogeneous networks

Reducing Communication Overhead in the High Performance Conjugate Gradient Benchmark on Tianhe-2

13th International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES)

Authors: Liu, Fangfang; Yang, Chao; Liu, Yiqun; Zhang, Xianyi; Lu, Yutong (Chinese Acad Sci, Inst Software, Beijing 100190, Peoples R China; Chinese Acad Sci, State Key Lab Comp Sci, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Beijing 100049, Peoples R China; Natl Univ Def Technol, Dept Comp Sci & Technol, Changsha 410073, Hunan, Peoples R China)

ISBN: (print) 9781479941698

The High Performance Conjugate Gradient (HPCG) benchmark, proposed recently in 2013, has drawn increasingly more attention from both academia and industry. Unlike the High Performance Linpack (HPL) benchmark, which has a very high computation-to-communication ratio, HPCG contains both neighboring and global communication that may severely degrade the parallel performance. To reduce the communication overhead of neighboring communications, we overlap halo updates with halo-independent computations. To hide the cost of the global reductions in vector dot-products, we make use of two reformulated CG algorithms, namely the Gropp's asynchronous CG and the pipelined CG. Some further optimizations are done to decrease the extra overhead introduced in the reformulated CG algorithms. We show by experiments on the world's largest heterogeneous system - Tianhe-2 that the optimized HPCG code scales to 256 nodes (49,920 cores) with a nearly ideal weak scalability of over 90% and an aggregate performance of 10.51Tflops.

Keywords: HPCG; communication-computation overlap; pipelined CG; asynchronous CG; Tianhe-2
