ISBN: 9783030816919; 9783030816902 (print)
Algorithms with space-time tiling increase the performance of numerical simulations by increasing data reuse and arithmetic intensity; they also improve parallel scaling by making process synchronization less frequent. The theory of Locally Recursive non-Locally Asynchronous (LRnLA) algorithms provides a performance model that accounts for data localization at all levels of the memory hierarchy. However, effective implementation is difficult, since modern optimizing compilers do not support the required traversal methods and data structures by default. Data exchange is typically implemented by writing the updated values to the main data array. Here, we suggest a new data structure that contains the partially updated state of the simulation domain. Data is arranged within this structure for coalesced access and seamless exchange between subtasks. We present preliminary results demonstrating its advantage over previously used methods: by localizing the processed data in the GPU L2 cache for a Lattice Boltzmann Method (LBM) simulation, performance is no longer limited by GDDR throughput but is determined by the L2 cache access rate. If we estimate the ideal stepwise code performance to be memory-bound with a read/write ratio of 1, and assume that it is localized in GPU memory and performs at 100% of the theoretical memory bandwidth, then our benchmark results exceed that peak by a factor of about 1.2.
For distributions over discrete product spaces $\prod_{i=1}^{n} \Omega'_i$, Glauber dynamics is a Markov chain that, at each step, resamples a random coordinate conditioned on the other coordinates. We show that k-Glauber dynamics, whi...
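The definition of Glauber dynamics given above can be illustrated with a minimal Python sketch. It assumes the target distribution is supplied as an unnormalized weight function on full configurations; the names `glauber_step`, `run_glauber`, and `proper_coloring_weight`, and the choice of proper graph colorings as the example distribution, are hypothetical and not taken from the paper. The sketch covers only the single-coordinate chain, not the k-Glauber variant.

```python
import random


def glauber_step(state, domains, weight, rng):
    """Resample one uniformly random coordinate from its conditional law.

    state   : list with one value per coordinate
    domains : domains[i] is the list of values coordinate i may take
    weight  : unnormalized probability of a full configuration
    Assumes the current state has positive weight, so the conditional
    weights computed below never sum to zero.
    """
    i = rng.randrange(len(state))
    candidates = domains[i]
    # Conditional weights: vary coordinate i, keep all others fixed.
    weights = [weight(state[:i] + [value] + state[i + 1:]) for value in candidates]
    new_state = list(state)
    new_state[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return new_state


def run_glauber(state, domains, weight, steps, seed=0):
    """Run the chain for a fixed number of steps from a given start state."""
    rng = random.Random(seed)
    state = list(state)
    for _ in range(steps):
        state = glauber_step(state, domains, weight, rng)
    return state


def proper_coloring_weight(edges):
    """Uniform distribution over proper colorings: weight 1 if valid, else 0."""
    def weight(state):
        return 1.0 if all(state[u] != state[v] for u, v in edges) else 0.0
    return weight


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3)]          # path graph on four vertices
    domains = [[0, 1, 2] for _ in range(4)]   # three colors per vertex
    start = [0, 1, 0, 1]                      # a proper coloring
    print(run_glauber(start, domains, proper_coloring_weight(edges), steps=1000))
```

Because each step resamples from the exact conditional distribution, a chain started in a proper coloring never leaves the set of proper colorings.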
ISBN: 9781665415576 (print)
Fast domain propagation of linear constraints has become a crucial component of today's best algorithms and solvers for mixed integer programming and pseudo-boolean optimization to achieve peak solving performance. Irregularities in the form of dynamic algorithmic behaviour, dependency structures, and sparsity patterns in the input data make efficient implementations of domain propagation on GPUs and, more generally, on parallel architectures challenging. This is one of the main reasons why domain propagation in state-of-the-art solvers is single-threaded. In this paper, we present a new algorithm for domain propagation which (a) avoids these problems and allows for an efficient implementation on GPUs, and (b) is capable of running propagation rounds entirely on the GPU, without any need for synchronization or communication with the CPU. We present extensive computational results which demonstrate the effectiveness of our approach and show that ample speedups are possible on practically relevant problems: on state-of-the-art GPUs, our geometric mean speedup for reasonably large instances is around 10x to 20x and can be as high as 195x on favorably large instances.
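To make concrete what a propagation round computes, the following sequential Python sketch tightens variable bounds for constraints of the form sum_j a_j x_j <= b using the minimum activity of each constraint. It is a textbook-style CPU illustration, not the GPU algorithm presented in the paper; the names `propagate_leq` and `propagate_rounds`, the tolerance `EPS`, and the assumption that all bounds are finite are illustrative choices.

```python
import math

EPS = 1e-9  # assumed tolerance guarding integer rounding against float noise


def propagate_leq(coeffs, rhs, lower, upper, is_integer):
    """Tighten bounds implied by sum_j coeffs[j] * x[j] <= rhs.

    Returns new (lower, upper) lists; the inputs are not modified.
    All bounds are assumed finite.
    """
    lower, upper = list(lower), list(upper)
    # Each variable's contribution to the smallest possible left-hand side.
    min_terms = [a * (lower[j] if a > 0 else upper[j]) for j, a in enumerate(coeffs)]
    min_activity = sum(min_terms)
    for j, a in enumerate(coeffs):
        if a == 0:
            continue
        # Room left for a * x[j] once all other variables sit at their minimum.
        slack = rhs - (min_activity - min_terms[j])
        bound = slack / a
        if a > 0:  # a * x[j] <= slack gives an upper bound on x[j]
            if is_integer[j]:
                bound = math.floor(bound + EPS)
            upper[j] = min(upper[j], bound)
        else:      # dividing by a negative coefficient flips the inequality
            if is_integer[j]:
                bound = math.ceil(bound - EPS)
            lower[j] = max(lower[j], bound)
    return lower, upper


def propagate_rounds(constraints, lower, upper, is_integer, max_rounds=100):
    """Repeat passes over all constraints until no bound changes."""
    for _ in range(max_rounds):
        changed = False
        for coeffs, rhs in constraints:
            new_lower, new_upper = propagate_leq(coeffs, rhs, lower, upper, is_integer)
            if new_lower != lower or new_upper != upper:
                lower, upper, changed = new_lower, new_upper, True
        if not changed:
            break
    return lower, upper


if __name__ == "__main__":
    # x + y <= 4 and x - y <= -2 with integer x, y in [0, 10]
    constraints = [([1, 1], 4), ([1, -1], -2)]
    print(propagate_rounds(constraints, [0, 0], [10, 10], [True, True]))
```

The outer loop over rounds is inherently sequential here, and a bound changed by one constraint can trigger further tightening in another; this dependency structure is part of what makes a parallel formulation non-trivial.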
ISBN: 9781450328210 (print)
We describe a simple algorithm for spectral graph sparsification, based on iterative computations of weighted spanners and uniform sampling. Leveraging the algorithms of Baswana and Sen for computing spanners, we obtain the first distributed spectral sparsification algorithm. We also obtain a parallel algorithm with improved work and time guarantees. Combining this algorithm with the parallel framework of Peng and Spielman for solving symmetric diagonally dominant linear systems, we get a parallel solver which is much closer to being practical and significantly more efficient in terms of the total work.
We present scalable algorithms to simulate large-scale stochastic particle systems suitable for modeling dense colloidal suspensions, glasses and gels. To handle the large number of particles and the consequent many-body interactions present in such systems, we leverage an Accelerated Stokesian Dynamics (ASD) approach, for which we developed parallel algorithms in a distributed memory architecture. We present parallelization of the sparse near-field (including singular lubrication) interactions, and of the matrix-free many-body far-field interactions, along with a strategy for communicating and mapping the distributed data structures between the near and far field. Scaling up to tens of thousands of processors for a million particles is demonstrated. In addition, we propose a novel algorithm to efficiently simulate correlated Brownian motion with hydrodynamic interactions. The original Accelerated Stokesian Dynamics approach requires the separate computation of far-field and near-field Brownian forces. Recent advancements propose computation of a far-field velocity using positive spectral Ewald decomposition. We present an alternative approach for calculating the far-field Brownian velocity by implementing the fluctuating force coupling method and embedding it using a nested scheme into ASD. This straightforward and flexible approach reduces the computational time of the Brownian far-field force construction from O(N log N)(1+|α|) to O(N log N). © 2021 Elsevier Inc. All rights reserved.
ISBN: 9789897584435 (print)
When working on graphs, reachability is among the most common problems to address, since it is the basis for many other algorithms. With the advent of distributed systems that process large amounts of data, many applications must quickly explore graphs with millions of vertices, so scalable solutions have become of paramount importance. Modern GPUs provide highly parallel systems based on many-core architectures and have gained popularity in parallelizing algorithms that run on large data sets. In this paper, we extend a very efficient state-of-the-art graph-labeling method, namely the GRAIL algorithm, to architectures which exhibit a great amount of data parallelism, i.e., many-core CUDA-based GPUs. GRAIL creates a scalable index for answering reachability queries, and it relies heavily on depth-first searches. As depth-first visits are intrinsically recursive and cannot be efficiently implemented on parallel systems, we devise an alternative approach based on a sequence of breadth-first visits. The paper explores our efforts in this direction, and it analyzes the difficulties encountered and the solutions chosen to overcome them. It also presents a comparison (in terms of the time needed to create the index and to use it for reachability queries) between the CPU and the GPU-based versions.
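The contrast between depth-first and breadth-first visits can be made concrete with a small level-synchronous BFS sketch in Python. It only shows why a breadth-first visit exposes data parallelism (every vertex of the current frontier can be expanded independently); it does not build the GRAIL index and is not the paper's CUDA implementation. The names `bfs_levels` and `reachable` and the dictionary-based adjacency format are assumptions made for illustration.

```python
def bfs_levels(adjacency, source):
    """Level-synchronous BFS: return {vertex: distance from source}.

    adjacency maps each vertex to an iterable of its successors.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # On a GPU, each frontier vertex could be expanded by its own thread;
        # here the loop runs sequentially.
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


def reachable(adjacency, source, target):
    """Answer a single reachability query with one full BFS (no index)."""
    return target in bfs_levels(adjacency, source)


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [0]}
    print(bfs_levels(graph, 0))
    print(reachable(graph, 0, 4))
```

Answering every query with a full traversal costs time linear in the graph size, which is the cost a precomputed reachability index is meant to avoid.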
With the recent release of FPGA boards based on High Bandwidth Memory (HBM), developers can exploit unparalleled external memory bandwidth. HBM provides large-scale aggregated memory bandwidth by exposing multiple memory channels to the processing unit. This allows more memory-constrained applications to benefit from FPGA acceleration. However, it is difficult to take full advantage of the available bandwidth: when an application requires multiple processing elements to access multiple HBM channels, the limited number of horizontal connections of the built-in crossbar in HBM can result in a significant reduction in effective bandwidth for global addressing. To solve this problem, we propose HBM connection, a high-performance custom interconnect for FPGA HBM boards. A high-performance custom switching network based on CLOS is introduced to replace the built-in crossbar in order to optimize HBM access scheduling and increase the throughput of AXI bus hosts and switching components. HBM connection is validated on a Xilinx VCU128 HBM board. We conduct an experimental exploration based on a breadth-first search (BFS) case study. The results show that HBM connection improves the effective performance by 2.5x.
ISBN: 9781665432818 (print)
In this work, we present methods for distributed domain generation within the constraints of our decentralised domain management concept. Here, all participating actors only have knowledge of their immediate neighbours, which are defined by geometric and hierarchical relations between nodes that represent subsets of the computational domain. We generate this domain following a hierarchical spacetree refinement. First, an initial tree is generated on every participating process. Second, this tree is distributed following a space-filling curve linearisation locally. Every process is assigned at least one leaf node of the initial tree, which acts as a starting point for the subsequent domain generation. From here, every process independently refines a subdomain using a decomposition method, which transforms a triangular surface-based geometry description into a volume-based one, using increasingly complex intersection tests. The resulting domain tree is distributed, yet neighbourhood references of neighbouring subtrees are not resolved. We combine the resolution of these relations with a 2:1 tree balancing, which involves the transfer of the surface of neighbouring subtrees. We provide results of a domain generation test case, using an input geometry with 84,072 triangles on up to 896 processes of the CoolMUC-2 cluster segment of LRZ's Linux Cluster System. Here, we bring down the overall time it takes to generate an adaptively refined and balanced octree with depth d = 7 from 5.5 hours on one process to two seconds on 896 processes.
Since the classical molecular dynamics simulator LAMMPS was released as an open source code in 2004, it has become a widely-used tool for particle-based modeling of materials at length scales ranging from atomic to mesoscale to continuum. Reasons for its popularity are that it provides a wide variety of particle interaction models for different materials, that it runs on any platform from a single CPU core to the largest supercomputers with accelerators, and that it gives users control over simulation details, either via the input script or by adding code for new interatomic potentials, constraints, diagnostics, or other features needed for their models. As a result, hundreds of people have contributed new capabilities to LAMMPS and it has grown from fifty thousand lines of code in 2004 to a million lines today. In this paper several of the fundamental algorithms used in LAMMPS are described along with the design strategies which have made it flexible for both users and developers. We also highlight some capabilities recently added to the code which were enabled by this flexibility, including dynamic load balancing, on-the-fly visualization, magnetic spin dynamics models, and quantum-accuracy machine learning interatomic potentials.
Program Summary
Program Title: Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)
CPC Library link to program files: https://doi.org/10.17632/cxbxs9btsv.1
Developer's repository link: https://github.com/lammps/lammps
Licensing provisions: GPLv2
Programming language: C++, Python, C, Fortran
Supplementary material: https://*** .org
Nature of problem: Many science applications in physics, chemistry, materials science, and related fields require parallel, scalable, and efficient generation of long, stable classical particle dynamics trajectories. Within this common problem definition, there lies a great diversity of use cases, distinguished by different particle interaction models, external constraints, as well
The ant colony algorithm is a modern intelligent optimization algorithm; in essence, it is a pseudo-random, parallel search algorithm. It makes some complex problems that are difficult to solve with conventional optimization algorithms more tractable, and it improves our ability to solve practical engineering optimization problems. For the problems of reservoir earthquakes and strong dam earthquakes, this paper analyzes the optimal solution under different parameter settings by studying the ant colony algorithm and its process, combined with sample ***results show that the improved ant colony algorithm can speed up convergence and improve the overall quality of the solution.