preCICE is an open-source library that provides comprehensive functionality to couple independent parallelized solver codes to establish a partitioned multi-physics multi-code simulation environment. For data communication between the respective executables at runtime, it implements a peer-to-peer concept, which renders the computational cost of the coupling per time step negligible compared to the typical run time of the coupled codes. To initialize the peer-to-peer coupling, the mesh partitions of the respective solvers need to be compared to determine the point-to-point communication channels between the processes of both codes. This initialization effort can become a limiting factor if we either reach memory limits or have to re-initialize communication relations in every time step. In this contribution, we remove two remaining bottlenecks: (i) we base the neighborhood search between mesh entities of two solvers on a tree data structure to avoid quadratic complexity, and (ii) we replace the sequential gather-scatter comparison of both mesh partitions by a two-level approach that first compares bounding boxes around mesh partitions in a sequential manner, subsequently establishes pairwise communication between processes of the two solvers, and finally compares mesh partitions between connected processes in parallel. We show that the two-level initialization method is five times faster than the old one-level scheme on 24,567 CPU cores using a mesh with 628,898 vertices. In addition, the two-level scheme is able to handle much larger computational meshes, since the central mesh communication of the one-level scheme is replaced with a fully point-to-point mesh communication scheme.
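The first level of the two-level scheme, comparing bounding boxes around mesh partitions, can be illustrated with a minimal Python sketch. This is not preCICE's API: the function names, the 2D restriction, and the `margin` parameter are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, Point]  # (min corner, max corner)


def bounding_box(vertices: Sequence[Point]) -> Box:
    """Axis-aligned bounding box of one mesh partition (2D for brevity)."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys)), (max(xs), max(ys))


def boxes_overlap(a: Box, b: Box, margin: float = 0.0) -> bool:
    """True if the boxes intersect; `margin` is a hypothetical safety distance."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return (ax0 - margin <= bx1 and bx0 - margin <= ax1
            and ay0 - margin <= by1 and by0 - margin <= ay1)


def candidate_pairs(parts_a: Sequence[Sequence[Point]],
                    parts_b: Sequence[Sequence[Point]],
                    margin: float = 0.0) -> List[Tuple[int, int]]:
    """Level 1: compare only the boxes, sequentially.

    Each returned (rank_a, rank_b) pair would then open a pairwise channel
    and compare its actual vertices in parallel (level 2, not shown).
    """
    boxes_a = [bounding_box(p) for p in parts_a]
    boxes_b = [bounding_box(p) for p in parts_b]
    return [(i, j)
            for i, box_a in enumerate(boxes_a)
            for j, box_b in enumerate(boxes_b)
            if boxes_overlap(box_a, box_b, margin)]


# Two partitions per solver; only the first partitions of each overlap.
solver_a = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 0.0), (3.0, 1.0)]]
solver_b = [[(0.5, 0.5), (1.5, 1.5)], [(5.0, 5.0), (6.0, 6.0)]]
print(candidate_pairs(solver_a, solver_b))  # [(0, 0)]
```

Only the boxes, a few numbers per process, are compared centrally in this sketch; the full vertex data would move only between the process pairs that the box test connects.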
ISBN: 9781728173832 (print)
Deterministic execution for GPUs is a desirable property as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety-critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions that are required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, does not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are isolated initially and made visible later in a deterministic order. Unlike GPUDet, this allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations. Our simulation results show that, for atomic-intensive applications, DAB performs 4x better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
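The idea of isolating atomic results and making them visible in a deterministic order can be illustrated in software. The Python sketch below is only an analogy, not the hardware design described in the abstract: the function names and the choice of thread-id order for the flush are assumptions. It relies on the fact that floating-point addition is not associative, so the order in which atomic adds arrive changes the result.

```python
from collections import defaultdict


def commit_in_arrival_order(ops, memory):
    """Non-deterministic baseline: apply each atomic add as it arrives."""
    for _tid, addr, value in ops:
        memory[addr] += value
    return memory


def commit_buffered(ops, memory):
    """Buffer atomic adds per thread, then flush in a fixed order.

    The flush order (ascending thread id, program order within a thread)
    is an assumption chosen for illustration.
    """
    buffers = defaultdict(list)
    for tid, addr, value in ops:
        buffers[tid].append((addr, value))  # isolated, not yet visible
    for tid in sorted(buffers):             # deterministic commit order
        for addr, value in buffers[tid]:
            memory[addr] += value
    return memory


# The same three atomic adds, (thread id, address, value), in two arrival orders.
run_1 = [(0, 0, 1e17), (1, 0, 1.0), (2, 0, -1e17)]
run_2 = [(0, 0, 1e17), (2, 0, -1e17), (1, 0, 1.0)]

print(commit_in_arrival_order(run_1, [0.0]))  # [0.0]
print(commit_in_arrival_order(run_2, [0.0]))  # [1.0]
print(commit_buffered(run_1, [0.0]) == commit_buffered(run_2, [0.0]))  # True
```

In the sketch, the threads still issue their operations in any interleaving; only the commit step is ordered.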
ISBN: 9781509066315 (print)
This work explores the use of low-power heterogeneous devices to parallelize the compute-intensive CCSDS-123 entropy encoders for hyperspectral and multispectral image compression. Multithreaded processing allows near-optimal use of the system's bandwidth, increasing overall system performance. The experimental platform consists of a low-power Jetson TX2 GPU equipped with ARM Cortex-A57 and Denver 2 host processors. It reports more than 1552 Mb/s and, more importantly, 315 Mb/s/W, all running under a global 5 W power budget, which makes it a good candidate for onboard image compression.
A parallel algorithm for solving the TSP (traveling salesman problem) is presented in this paper. The main idea of the algorithm is to combine 2-opt local search optimization with a genetic algorithm. An MPI+TBB hybrid parallel programming model is employed to implement the algorithm. Numerical results indicate that it is possible to arrive at high-quality solutions in reasonable time. As the problem scale increases, the speedup of the parallel algorithm improves. Moreover, as the number of cores grows, the speedup of the parallel algorithm increases nearly linearly.
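For readers unfamiliar with 2-opt, the local search step can be sketched as follows. This is a plain sequential Python illustration of the textbook move (reverse a tour segment and keep the change if the tour gets shorter), not the paper's MPI+TBB implementation; the function names and the tolerance are assumptions.

```python
import math


def tour_length(tour, points):
    """Length of the closed tour visiting `points` in the order given by `tour`."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def two_opt(tour, points):
    """Repeat segment reversals until no reversal shortens the tour."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reverse the segment best[i..j], keeping city 0 fixed as the start.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, points) < tour_length(best, points) - 1e-12:
                    best = candidate
                    improved = True
    return best


# Unit square visited in a crossing order; 2-opt removes the crossing.
cities = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
start = [0, 1, 2, 3]
result = two_opt(start, cities)
print(tour_length(start, cities), tour_length(result, cities))
```

In a hybrid with a genetic algorithm, a routine like this would typically be applied to individuals in the population to refine them between generations; how the paper schedules this is not stated in the abstract.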
Task-parallel programming languages offer a variety of high-level mechanisms for synchronization that trade off between flexibility and deadlock safety. Some approaches are deadlock-free by construction but support limited synchronization patterns, while other approaches are trivial to deadlock. In high-level task-parallel programming, it is imperative that language features offer both flexibility to avoid over-synchronization and also sufficient protection against logical deadlocks. Lack of flexibility leads to code that does not take full advantage of the available parallelism in the computation. Lack of deadlock protection leads to error-prone code in which a single bug can involve arbitrarily many tasks, making it difficult to reason about. We make advances in both flexibility and deadlock protection for existing synchronization mechanisms by carefully designing dynamically verifiable usage policies and language constructs. We first define a deadlock-freedom policy for futures. The rules of the policy follow naturally from the semantics of asynchronous task closures and correspond to a preorder traversal of the task tree. The policy admits an additional class of deadlock-free programs compared to past work. Each blocking wait for a future can be verified by a stateless, lock-free algorithm, resulting in low time and memory overheads at runtime. In order to define and identify deadlocks for promises, we introduce a mechanism for promises to be owned by tasks. Simple annotations make it possible to ensure that each promise is eventually fulfilled by the responsible task or handed off to another task. Ownership semantics allows us to formally define two kinds of promise bugs: omitted sets and deadlock cycles. We present novel detection algorithms for both bugs. We further introduce an approximate deadlock-freedom policy for promises that, instead of precisely detecting cycles, raises an alarm when synchronization dependences occurring between trees of tasks are a
DVM-system is designed for the development of parallel programs of scientific and technical calculations in C-DVMH and Fortran-DVMH languages. These languages use a single parallel programming model (DVMH model) and a...
ParlayLib is a C++ library for developing efficient parallel algorithms and software on shared-memory multicore machines. It provides additional tools and primitives that go beyond what is available in the C++ standar...
ISBN: 9783030451820; 9783030451837 (print)
Parallel computing contributes significantly to most disciplines by helping to solve scientific problems such as partial differential equations (PDEs), load balancing, and deep learning. The primary characteristic of parallelism is its ability to improve performance on many different sets of computers. Consequently, many researchers continually work to produce efficient parallel solutions for various problems such as the heat equation. The heat equation describes a natural phenomenon and is used in many fields, such as mathematics and physics. Usually, its associated model is defined by a set of partial differential equations (PDEs). This paper primarily aims to present two parallel programs for solving the heat equation, which has been discretized using the finite difference method (FDM). These programs have been implemented on different parallel platforms such as SkelGIS and Compute Unified Device Architecture (CUDA).
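As a small illustration of a finite-difference discretization of the heat equation, the sketch below applies one explicit (forward-time, centered-space) update to a 1D rod. The abstract does not say which FDM scheme or dimension the paper uses, so the scheme, the fixed boundary values, and the names `heat_step` and `r` are assumptions; `r` stands for alpha * dt / dx^2.

```python
def heat_step(u, r):
    """One explicit FDM time step for the 1D heat equation.

    u: temperatures at the grid points; both end points are held fixed.
    r: alpha * dt / dx**2; the explicit scheme is stable only for r <= 0.5.
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("explicit scheme requires 0 < r <= 0.5")
    new = list(u)
    for i in range(1, len(u) - 1):
        # Each interior point depends only on old values of its neighbours,
        # so all points can be updated independently of one another.
        new[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
    return new


rod = [0.0, 0.0, 1.0, 0.0, 0.0]
print(heat_step(rod, 0.25))  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Because every interior update reads only the previous time level, the loop body maps naturally onto one GPU thread per grid point or onto a data-parallel skeleton.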
Today's processors become fatter, not faster. However, the exploitation of these massively parallel compute resources remains a challenge for many traditional HPC applications regarding scalability, portability an...
Sequence alignment is a problem in bioinformatics that involves arranging sequences of proteins, RNA or DNA so that similar regions between two or more sequences may be determined. The Smith-Waterman algorithm is a ke...
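For context, the Smith-Waterman recurrence for local alignment can be sketched in a few lines of Python. This is the textbook sequential form with a linear gap penalty, not the implementation discussed in the truncated abstract; the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Keeps only two rows of the dynamic-programming table at a time.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment restart anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
print(smith_waterman_score("AAAA", "TTTT"))  # 0
```

Each cell depends on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, which is the usual starting point for parallel versions.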