This paper presents a method to derive efficient algorithms for hypercubes. The method exploits two features of the underlying hardware: a) the parallelism provided by the multiple communication links of each node and b) the possibility of overlapping computations and communications, which is a feature of machines supporting an asynchronous communication protocol. The method can be applied to a generic class of hypercube algorithms whose distinguishing features are common among algorithms for hypercubes. Many examples of this class of algorithms are found in the literature for different problems. The paper shows the efficiency of the method for two case studies. The results show that the reduction in communication overhead is very significant in many cases. They also show that the algorithms produced by our method are always very close to the optimum in terms of execution time. (C) 1998 Elsevier Science B.V. All rights reserved.
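To make the first hardware feature concrete: in a d-dimensional hypercube each node has d communication links, one per dimension, and the neighbour across dimension k is found by flipping bit k of the node's label. The short sketch below only illustrates that topology; the function name `hypercube_neighbors` is an assumption for this example and does not come from the paper.

```python
def hypercube_neighbors(node: int, dim: int) -> list[int]:
    """Return the labels of the `dim` neighbours of `node` in a dim-dimensional hypercube."""
    if not 0 <= node < (1 << dim):
        raise ValueError("node label out of range for this dimension")
    # Flipping bit k gives the neighbour reached over the dimension-k link.
    return [node ^ (1 << k) for k in range(dim)]
```

An algorithm that uses all d links of a node at once can, in principle, exchange data with every entry of this list in the same step.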
ISBN: 9781538639146 (print)
We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, and require only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants.
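For readers unfamiliar with the idea, the following is a minimal sketch of what overlapping communication with computation means when written by hand. It is not Toucan's output and does not use MPI: the message exchange is simulated with a sleeping background thread, and every name in it (`exchange_halo`, `compute_interior`, `step_with_overlap`) is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange_halo(delay: float) -> list[int]:
    """Stand-in for a message exchange; sleeps to mimic network latency."""
    time.sleep(delay)
    return [1, 2, 3]  # hypothetical boundary values received from a neighbour


def compute_interior(n: int) -> int:
    """Work that does not depend on the incoming message."""
    return sum(i * i for i in range(n))


def step_with_overlap(delay: float = 0.05, n: int = 1000) -> int:
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(exchange_halo, delay)  # start the communication
        interior = compute_interior(n)               # compute while it is in flight
        halo = pending.result()                      # block only when the data is needed
    return interior + sum(halo)
```

The point of the restructuring is the ordering: the independent work sits between starting the exchange and waiting for it, so the latency is hidden whenever that work takes at least as long as the exchange.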
ISBN: 9781538639153 (print)
We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, and require only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants.
ISBN: 9781510801011 (print)
Co-designing applications and computer architectures has become of major importance due to the growing complexity of both applications and architectures and the need to better match application characteristics to the available hardware. Thus, "mini-applications", which serve as proxies of large-scale ones by highlighting their most intensive parts and major workflow components, have emerged for co-design, tuning, and adaptation purposes. This paper presents work on optimizing the communication subsystem of a classical MD proxy application (CoMD) executed on multi-core computing clusters. The research focuses on hiding communication behind certain buffer handling operations. In particular, two strategies are presented: one that uses two parallel threads for communication and buffer handling, and another that introduces more parallelism by allowing all the available threads to unload the buffers while using two threads to communicate, thereby improving load balancing. The first proposed strategy yields performance gains of up to 61% in the communication routines, corresponding to 6% gains in the overall time, while the second strategy achieves about 73% and 6.3% improvement, respectively.
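The second strategy can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the CoMD implementation: it uses a single simulated communication thread rather than two, copying a list stands in for real buffer unloading, and all names (`receive_all`, `unload`, `run`) are hypothetical.

```python
import queue
import threading


def receive_all(buffers, inbox):
    """Communication thread: hand over each buffer as soon as it 'arrives'."""
    for buf in buffers:
        inbox.put(buf)
    inbox.put(None)  # sentinel: no more messages


def unload(inbox, out, lock):
    """Worker thread: unload buffers until the sentinel is seen."""
    while True:
        buf = inbox.get()
        if buf is None:
            inbox.put(None)  # pass the sentinel on to the remaining workers
            return
        unpacked = list(buf)  # placeholder for the real unloading work
        with lock:
            out.extend(unpacked)


def run(buffers, n_workers=4):
    inbox, out, lock = queue.Queue(), [], threading.Lock()
    threads = [threading.Thread(target=unload, args=(inbox, out, lock))
               for _ in range(n_workers)]
    threads.append(threading.Thread(target=receive_all, args=(buffers, inbox)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(out)  # sorted because worker completion order is not deterministic
```

Because any idle worker can pick up the next buffer, the unloading work spreads across all threads instead of being tied to the thread that received the message, which is the load-balancing effect the abstract describes.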