Most particle methods share the problem of high computational cost, and to satisfy the demands of solvers, currently available hardware technologies must be fully exploited. Two complementary technologies are now accessible. On the one hand, CPUs can be structured into a multi-node framework, allowing massive data exchanges through a high-speed network; in this case, each node usually comprises several cores available to perform multithreaded computations. On the other hand, GPUs, derived from graphics computing technologies, are able to perform highly multithreaded calculations with hundreds of independent threads connected together through a common shared memory. This paper is primarily dedicated to the distributed-memory parallelization of particle methods, targeting several thousand CPU cores. The experience gained clearly shows that parallelizing a particle-based code on moderate numbers of cores can easily lead to acceptable scalability, whilst a scalable speedup on thousands of cores is much more difficult to obtain. The discussion revolves around speeding up particle methods as a whole, in a massive HPC context, by making use of the MPI library. We focus on one particular particle method, Smoothed Particle Hydrodynamics (SPH), one of the most widespread today in the literature as well as in engineering. (C) 2015 Published by Elsevier B.V.
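As a concrete illustration of the distributed-memory approach this abstract describes, the sketch below shows a 1-D domain decomposition halo exchange for an SPH-like particle code using MPI_Sendrecv. It is a minimal sketch under assumed names (the Particle struct and exchange_halo helper are hypothetical), not the paper's implementation.

```cpp
// Minimal sketch: 1-D domain decomposition halo exchange for an
// SPH-like particle code (illustrative only; names are hypothetical).
#include <mpi.h>
#include <vector>

struct Particle { double x, y, z, h; };  // position and smoothing length

// Exchange boundary particles with the left and right neighbor ranks so
// that each rank can evaluate SPH sums near its subdomain edges.
std::vector<Particle> exchange_halo(const std::vector<Particle>& send_left,
                                    const std::vector<Particle>& send_right,
                                    MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    // First exchange counts, then the particle payloads as raw bytes.
    int nl = (int)send_left.size(), nr = (int)send_right.size();
    int recv_from_right = 0, recv_from_left = 0;
    MPI_Sendrecv(&nl, 1, MPI_INT, left, 0, &recv_from_right, 1, MPI_INT,
                 right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&nr, 1, MPI_INT, right, 1, &recv_from_left, 1, MPI_INT,
                 left, 1, comm, MPI_STATUS_IGNORE);

    std::vector<Particle> halo(recv_from_left + recv_from_right);
    MPI_Sendrecv(send_left.data(), nl * (int)sizeof(Particle), MPI_BYTE,
                 left, 2, halo.data() + recv_from_left,
                 recv_from_right * (int)sizeof(Particle), MPI_BYTE,
                 right, 2, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_right.data(), nr * (int)sizeof(Particle), MPI_BYTE,
                 right, 3, halo.data(),
                 recv_from_left * (int)sizeof(Particle), MPI_BYTE,
                 left, 3, comm, MPI_STATUS_IGNORE);
    return halo;
}
```

MPI_PROC_NULL at the domain boundaries turns the edge exchanges into no-ops, which keeps the code free of special cases for the first and last rank.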
We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and Tasks library maps the chunks and tasks to physical resources. In this way we seek to combine user friendliness with high performance. An application programmer can express a parallel algorithm using a few simple building blocks, defining data and work objects and their relationships. No explicit communication calls are needed; the distribution of both work and data is handled by the Chunks and Tasks library. This makes efficient implementation of complex applications that require dynamic distribution of work and data easier. At the same time, Chunks and Tasks imposes restrictions on data access and task dependencies that facilitate the development of high-performance parallel back ends. We discuss the fundamental abstractions underlying the programming model, as well as performance, determinism, and fault resilience considerations. We also present a pilot C++ library implementation for clusters of multicore machines and demonstrate its performance for irregular block-sparse matrix-matrix multiplication. (C) 2013 Elsevier B.V. All rights reserved.
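The split-into-smaller-pieces idea behind chunks and tasks can be conveyed with a shared-memory analogy. The sketch below is not the Chunks and Tasks API; it only illustrates recursive splitting of data and work using std::async, and the reduce_chunk function and cutoff constant are assumptions of this example.

```cpp
// Illustration (not the Chunks and Tasks API): work expressed as tasks
// that recursively split until the pieces are small enough to compute.
#include <future>
#include <numeric>
#include <vector>

// A "task" that reduces a range of data; below a cutoff it computes
// directly, otherwise it splits the range and spawns a child task.
double reduce_chunk(const std::vector<double>& data, size_t lo, size_t hi) {
    const size_t cutoff = 1 << 14;  // leaf size, tuned per machine
    if (hi - lo <= cutoff)
        return std::accumulate(data.begin() + lo, data.begin() + hi, 0.0);
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, reduce_chunk,
                           std::cref(data), lo, mid);
    double right = reduce_chunk(data, mid, hi);  // reuse current thread
    return left.get() + right;
}
```

In the actual model, the library rather than the programmer decides where each piece runs, and the same recursive structure extends across distributed memory because no explicit communication appears in the user code.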
This paper presents a massively parallel approach to the multilevel fast multipole algorithm (PMLFMA) on China's homegrown many-core SW26010 cluster, denoted SW-PMLFMA, for 3-D electromagnetic scattering problems. In this approach, the multilevel fast multipole algorithm (MLFMA) octree is first partitioned among the management processing elements (MPEs) of the SW26010 processors following the ternary partitioning scheme, using the message passing interface (MPI). Then, the computationally intensive parts of the PMLFMA on each MPI process (matrix filling, aggregation, and disaggregation) are accelerated using all 64 computing processing elements (CPEs) in the same core group as the MPE via the Athread parallel programming model. Different parallelization strategies are designed for the many-core accelerators to ensure high computational throughput. To match the special characteristics of local Lagrange interpolation, the compressed sparse row (CSR) and compressed sparse column (CSC) sparse matrix storage formats are used for storing the interpolation and anterpolation matrices, respectively, together with a specially designed cache mechanism of hybrid dynamic and static buffers in the scratchpad memory (SPM) to improve data access efficiency. Numerical results are included to demonstrate the efficiency and versatility of the proposed method. The proposed parallel scheme is shown to have excellent speedup.
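For readers unfamiliar with the storage format mentioned, the following is a plain CSR sparse matrix-vector product, the kind of kernel the interpolation matrices feed. It is a generic sketch in standard C++, not the SW26010 CPE code; the CsrMatrix type and spmv function are hypothetical names.

```cpp
// Standard CSR (compressed sparse row) matrix-vector product.
#include <vector>

struct CsrMatrix {
    int rows;
    std::vector<int>    row_ptr;  // size rows + 1; row i spans
                                  // [row_ptr[i], row_ptr[i+1])
    std::vector<int>    col_idx;  // size nnz: column of each nonzero
    std::vector<double> values;   // size nnz: value of each nonzero
};

// y = A * x; y must already be sized to A.rows.
void spmv(const CsrMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
    for (int i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}
```

CSR makes rows contiguous (good for products against the matrix), while CSC makes columns contiguous (good for products against its transpose), which is why the two formats suit the interpolation and anterpolation matrices respectively.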
In nuclear astrophysics, quantum simulations of large inhomogeneous dense systems, as they appear in the crusts of neutron stars, present major challenges. The number of particles in a simulation with periodic boundary conditions is strongly limited by the immense computational cost of the quantum methods. In this paper, we describe techniques for an efficient and scalable parallel implementation of Sky3D, a nuclear density functional theory solver that operates on an equidistant grid. The presented techniques allow Sky3D to achieve good scaling and high performance on a large number of cores, as demonstrated through detailed performance analysis on a Cray XC40 supercomputer. (C) 2017 Elsevier B.V. All rights reserved.
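One simple way to distribute a solver that operates on an equidistant grid is a 1-D slab decomposition across MPI ranks. The sketch below computes each rank's slab of z-planes; it is a generic illustration under that assumption, not Sky3D's actual scheme, and the Slab type and my_slab helper are hypothetical.

```cpp
// Sketch: 1-D slab decomposition of an equidistant 3-D grid across
// MPI ranks (hypothetical helper, not Sky3D's actual decomposition).
#include <mpi.h>

struct Slab { int z_begin, z_end; };  // local range of z-planes [begin, end)

Slab my_slab(int nz, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int base = nz / size, rem = nz % size;  // spread the remainder so
                                            // slab sizes differ by at most 1
    int begin = rank * base + (rank < rem ? rank : rem);
    int end   = begin + base + (rank < rem ? 1 : 0);
    return {begin, end};
}
```

Balancing the remainder this way matters at scale: if one rank carried all leftover planes, that rank would bound the time step for every other core.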
The Ubiquity Generator (UG) is a general framework for the external parallelization of mixed integer programming (MIP) solvers. In this paper, we present ParaXpress, a distributed-memory parallelization of the powerful commercial MIP solver FICO Xpress. Besides sheer performance, an important feature of Xpress is that it provides internal parallelization for shared memory systems. When aiming for the best possible performance of ParaXpress on a supercomputer, the question arises of how to balance the internal Xpress parallelization and the external parallelization by UG against each other. We provide computational experiments to address this question, and we show computational results for running ParaXpress on a Top500 supercomputer, using up to 43,344 cores in parallel.
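The balancing question can be made concrete with a toy enumeration: for a fixed core budget, each choice of internal Xpress thread count implies a number of external UG ranks. The snippet below is only that arithmetic, with the 43,344-core budget taken from the abstract; it has no connection to UG's actual configuration logic.

```cpp
// Toy enumeration of hybrid (external ranks) x (internal threads)
// splits of a fixed core budget; illustrative arithmetic only.
#include <cstdio>

int main() {
    const int total_cores = 43344;  // core count cited in the abstract
    for (int threads = 1; threads <= 64; threads *= 2) {
        if (total_cores % threads == 0)
            std::printf("%6d ranks x %2d solver threads = %d cores\n",
                        total_cores / threads, threads, total_cores);
    }
    return 0;
}
```

More internal threads strengthen each subtree solve but shrink the number of subproblems explored concurrently; the paper's experiments probe exactly this trade-off.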
ISBN (print): 9783319424323; 9783319424316
The Ubiquity Generator (UG) is a general framework for the external parallelization of mixed integer programming (MIP) solvers. It has been used to develop ParaSCIP, a distributed-memory, massively parallel version of the open source solver SCIP, running on up to 80,000 cores. In this paper, we present a first implementation of ParaXpress, a distributed-memory parallelization of the powerful commercial MIP solver FICO Xpress. Besides sheer performance, an important difference between SCIP and Xpress is that Xpress provides internal parallelization for shared memory systems. When aiming for the best possible performance of ParaXpress on a supercomputer, the question arises of how to balance the internal Xpress parallelization and the external parallelization by UG against each other. We provide computational experiments to address this question and show preliminary computational results for running a first version of ParaXpress on 6,144 cores in parallel.
ISBN (print): 9781665404761
This paper compares the serial, shared memory, and distributed memory parallelizations of the dynamic programming algorithm for the Knapsack Problem. The Knapsack Problem is one of the most popular optimization problems. It is a decision-making problem used in real-world situations such as business projects, the airline cargo business, cryptography, and decision-making in industrial processes. The algorithm under consideration is the table-based dynamic programming algorithm based on Bellman's optimality principle. We used the C-HF programming language. To solve this problem on shared memory systems, we used OpenMP. For the distributed memory parallelization, we employed MPI. The structure of the algorithm, the data distribution, synchronization, and communication schemes are explained in detail. Extensive experiments with the developed algorithms were carried out, and the obtained results enabled a comparative analysis of the developed algorithms.
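As a sketch of the shared memory variant described, the 0/1 knapsack DP below parallelizes the capacity loop with OpenMP, exploiting the fact that each new table row depends only on the previous row. This is an illustrative rolling-array implementation, not the authors' code; the knapsack function name is hypothetical.

```cpp
// Sketch: table-based 0/1 knapsack DP (Bellman's optimality principle)
// with the capacity loop parallelized via OpenMP. Compile with -fopenmp.
#include <algorithm>
#include <vector>

int knapsack(const std::vector<int>& weight, const std::vector<int>& value,
             int capacity) {
    std::vector<int> prev(capacity + 1, 0), cur(capacity + 1, 0);
    for (size_t i = 0; i < weight.size(); ++i) {
        // Each entry of 'cur' reads only 'prev', so the capacity loop
        // has no cross-iteration dependence and is safely parallel.
        #pragma omp parallel for
        for (int w = 0; w <= capacity; ++w) {
            cur[w] = (weight[i] <= w)
                   ? std::max(prev[w], prev[w - weight[i]] + value[i])
                   : prev[w];
        }
        std::swap(prev, cur);  // new row becomes the previous row
    }
    return prev[capacity];
}
```

The item loop itself stays sequential because row i+1 needs row i; a distributed memory version instead partitions each row's capacity range across ranks and communicates the boundary entries, which matches the data distribution and communication schemes the paper examines.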