Search results - Inner Mongolia University Library

15th International Scientific Conference on Parallel Computational Technologies (PCT)

Authors: Perepelkina, Anastasia; Levchenko, Vadim D. Affiliation: Keldysh Inst Appl Math, Moscow, Russia

ISBN: (print) 9783030816919; 9783030816902

Algorithms with space-time tiling increase the performance of numerical simulations by increasing data reuse and arithmetic intensity; they also improve parallel scaling by making process synchronization less frequent. The theory of Locally Recursive non-Locally Asynchronous (LRnLA) algorithms provides a performance model that accounts for data localization at all levels of the memory hierarchy. However, effective implementation is difficult since modern optimizing compilers do not support the required traversal methods and data structures by default. The data exchange is typically implemented by writing the updated values to the main data array. Here, we suggest a new data structure that contains the partially updated state of the simulation domain. Data is arranged within this structure for coalesced access and seamless exchange between subtasks. We demonstrate the preliminary results of its superiority over previously used methods by localizing the processed data in the L2 GPU cache for the Lattice Boltzmann Method (LBM) simulation so that the performance is not limited by the GDDR throughput but is determined by the L2 cache access rate. If we estimate the ideal stepwise code performance to be memory-bound with a read/write ratio equal to 1 and assume it is localized in the GPU memory and performs at 100% of the theoretical memory bandwidth, then the results of our benchmarks exceed that peak by a factor of the order of 1.2.

Keywords: LRnLA algorithms; Temporal blocking; Loop skewing; Parallel algorithms; Data structure


Parallelising Glauber dynamics


arXiv, 2023

Author: Lee, Holden

For distributions over discrete product spaces $\prod_{i=1}^{n} \Omega'_i$, Glauber dynamics is a Markov chain that, at each step, resamples a random coordinate conditioned on the other coordinates. We show that k-Glauber dynamics, which resamples a random subset of k coordinates, mixes k times faster in $\chi^2$-divergence and, assuming approximate tensorization of entropy, mixes k times faster in KL-divergence. We apply this to obtain parallel algorithms in two settings: (1) For the Ising model (equation presented) with $\|J\|$ ... Copyright © 2023, The Authors. All rights reserved.

Keywords: Parallel algorithms
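
The definitions of Glauber and k-Glauber dynamics in the abstract above can be illustrated with a small sequential sketch. It assumes a standard Ising form, with weight proportional to exp(0.5 * x^T J x + h^T x) on spins in {-1, +1} and a symmetric J; the abstract's own equation is not shown in this listing, so this form is an assumption. The function names are hypothetical, and the joint resampling enumerates all 2^k configurations, so this is only practical for small k and is not the parallel implementation the paper describes.

```python
import itertools
import math
import random


def log_weight(x, J, h):
    """Unnormalised log-probability 0.5 * x^T J x + h^T x (assumed Ising form)."""
    n = len(x)
    quad = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad + sum(h[i] * x[i] for i in range(n))


def glauber_step(x, J, h, rng):
    """Resample one uniformly random spin from its conditional distribution."""
    n = len(x)
    i = rng.randrange(n)
    # Local field on spin i; J is assumed symmetric.
    field = h[i] + sum(J[i][j] * x[j] for j in range(n) if j != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
    x[i] = 1 if rng.random() < p_plus else -1


def k_glauber_step(x, J, h, k, rng):
    """Resample a uniformly random k-subset of spins jointly, given the rest."""
    subset = rng.sample(range(len(x)), k)
    configs = list(itertools.product((-1, 1), repeat=k))
    logs = []
    for vals in configs:
        y = list(x)
        for i, v in zip(subset, vals):
            y[i] = v
        logs.append(log_weight(y, J, h))
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logs]
    chosen = rng.choices(configs, weights=weights)[0]
    for i, v in zip(subset, chosen):
        x[i] = v
```

With k = 1 the second function reduces to the first in distribution; larger k changes more coordinates per step, which is the quantity the abstract's mixing-time statement is about.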


Accelerating Domain Propagation: an Efficient GPU-Parallel Algorithm over Sparse Matrices


10th IEEE/ACM Workshop on Irregular Applications - Architectures and Algorithms (IA3)

Authors: Sofranac, Boro; Gleixner, Ambros; Pokutta, Sebastian. Affiliations: Berlin Inst Technol, Berlin, Germany; Zuse Inst Berlin, Berlin, Germany; HTW Berlin, Berlin, Germany

ISBN: (print) 9781665415576

Fast domain propagation of linear constraints has become a crucial component of today's best algorithms and solvers for mixed integer programming and pseudo-boolean optimization to achieve peak solving performance. Irregularities in the form of dynamic algorithmic behaviour, dependency structures, and sparsity patterns in the input data make efficient implementations of domain propagation on GPUs and, more generally, on parallel architectures challenging. This is one of the main reasons why domain propagation in state-of-the-art solvers is single thread only. In this paper, we present a new algorithm for domain propagation which (a) avoids these problems and allows for an efficient implementation on GPUs, and (b) is capable of running propagation rounds entirely on the GPU, without any need for synchronization or communication with the CPU. We present extensive computational results which demonstrate the effectiveness of our approach and show that ample speedups are possible on practically relevant problems: on state-of-the-art GPUs, our geometric mean speed-up for reasonably-large instances is around 10x to 20x and can be as high as 195x on favorably-large instances.

Keywords: Mixed Integer Linear Programming; MIP; GPU; Domain Propagation; Bound Tightening; Parallel algorithms
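
As a rough illustration of what domain propagation (bound tightening) of linear constraints means, the following sequential sketch applies the textbook activity-based rule to constraints of the form sum(a_j * x_j) <= b over a sparse row representation. It is not the GPU algorithm of the paper: the function name, the data layout, and the restriction to continuous variables with finite bounds and only "<=" rows are all assumptions made for the example, and infeasibility is not detected.

```python
def propagate(rows, rhs, lower, upper, max_rounds=100, eps=1e-9):
    """Tighten variable bounds for constraints sum(a * x[j]) <= rhs[r].

    rows[r] is a sparse row given as a list of (column, coefficient) pairs.
    Returns new (lower, upper) lists; the inputs are left unmodified.
    """
    lower, upper = list(lower), list(upper)
    for _ in range(max_rounds):
        changed = False
        for row, b in zip(rows, rhs):
            # Minimum activity: the smallest value the left-hand side can take.
            min_act = sum(a * (lower[j] if a > 0 else upper[j]) for j, a in row)
            for j, a in row:
                if a > 0:
                    # Remove x[j]'s own contribution, then solve for its upper bound.
                    residual = min_act - a * lower[j]
                    new_upper = (b - residual) / a
                    if new_upper < upper[j] - eps:
                        upper[j] = new_upper
                        changed = True
                elif a < 0:
                    residual = min_act - a * upper[j]
                    new_lower = (b - residual) / a
                    if new_lower > lower[j] + eps:
                        lower[j] = new_lower
                        changed = True
        if not changed:  # fixed point reached, no bound moved in this round
            break
    return lower, upper
```

For example, with x + y <= 4 and -x + y <= -2 on the box [0, 10] x [0, 10], one round tightens the bounds to 2 <= x <= 4 and 0 <= y <= 2, and the second round confirms the fixed point. The row-by-row dependency visible in this loop is the kind of sequential behaviour that the abstract says makes parallel implementations challenging.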


Simple Parallel and Distributed Algorithms for Spectral Graph Sparsification


26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)

Author: Koutis, Ioannis. Affiliation: Univ Puerto Rico Rio Piedras, Comp Sci Dept, San Juan, PR 00925, USA

ISBN: (print) 9781450328210

We describe a simple algorithm for spectral graph sparsification, based on iterative computations of weighted spanners and uniform sampling. Leveraging the algorithms of Baswana and Sen for computing spanners, we obtain the first distributed spectral sparsification algorithm. We also obtain a parallel algorithm with improved work and time guarantees. Combining this algorithm with the parallel framework of Peng and Spielman for solving symmetric diagonally dominant linear systems, we get a parallel solver which is much closer to being practical and significantly more efficient in terms of the total work.

Keywords: Parallel algorithms; Distributed algorithms; Spectral Sparsification; SDD linear systems


Parallel accelerated Stokesian dynamics with Brownian motion


Journal of Computational Physics, 2021, vol. 442

Authors: Ouaknin, Gaddiel Y.; Su, Yu; Zia, Roseanna N. Affiliation: Stanford Univ, Dept Chem Engn, Stanford, CA 94305, USA

We present scalable algorithms to simulate large-scale stochastic particle systems amenable for modeling dense colloidal suspensions, glasses and gels. To handle the large number of particles and consequent many-body interactions present in such systems, we leverage an Accelerated Stokesian Dynamics (ASD) approach, for which we developed parallel algorithms in a distributed memory architecture. We present parallelization of the sparse near-field (including singular lubrication) interactions, and of the matrix-free many-body far-field interactions, along with a strategy for communicating and mapping the distributed data structures between the near and far field. Scaling to up to tens of thousands of processors for a million particles is demonstrated. In addition, we propose a novel algorithm to efficiently simulate correlated Brownian motion with hydrodynamic interactions. The original Accelerated Stokesian Dynamics approach requires the separate computation of far-field and near-field Brownian forces. Recent advancements propose computation of a far-field velocity using positive spectral Ewald decomposition. We present an alternative approach for calculating the far-field Brownian velocity by implementing the fluctuating force coupling method and embedding it using a nested scheme into ASD. This straightforward and flexible approach reduces the computational time of the Brownian far-field force construction from $O(N \log N)^{1+|\alpha|}$ to $O(N \log N)$. © 2021 Elsevier Inc. All rights reserved.

Keywords: Stokesian dynamics; Hydrodynamics; Stochastic calculus; Parallel algorithms; Brownian motion; Stokes flow


A Parallel Many-core CUDA-based Graph Labeling Computation


15th International Conference on Software Technologies (ICSOFT)

Author: Quer, Stefano. Affiliation: Politecn Torino, Dept Control & Comp Engn DAUIN, Turin, Italy

ISBN: (print) 9789897584435

When working on graphs, reachability is among the most common problems to address, since it is the base for many other algorithms. As with the advent of distributed systems, which process large amounts of data, many applications must quickly explore graphs with millions of vertices, scalable solutions have become of paramount importance. Modern GPUs provide highly parallel systems based on many-core architectures and have gained popularity in parallelizing algorithms that run on large data sets. In this paper, we extend a very efficient state-of-the-art graph-labeling method, namely the GRAIL algorithm, to architectures which exhibit a great amount of data parallelism, i.e., many-core CUDA-based GPUs. GRAIL creates a scalable index for answering reachability queries, and it heavily relies on depth-first searches. As depth-first visits are intrinsically recursive and they cannot be efficiently implemented on parallel systems, we devise an alternative approach based on a sequence of breadth-first visits. The paper explores our efforts in this direction, and it analyzes the difficulties encountered and the solutions chosen to overcome them. It also presents a comparison (in terms of times to create the index and to use it for reachability queries) between the CPU and the GPU-based versions.

Keywords: Graph Theory; Graph algorithms; Algorithm Design and Analysis; Parallel algorithms; Graphics Processors
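
To make the idea of a reachability index built from depth-first visits more concrete, here is a small CPU-only sketch in the spirit of GRAIL's interval labels: each randomized traversal gives every vertex a pair (lowest post-order rank below it, its own post-order rank), and a query can be rejected at once when the target's interval is not contained in the source's. The function names and adjacency-list layout are assumptions for the example, the input is assumed to be a DAG, and this does not reproduce the breadth-first GPU formulation the paper proposes.

```python
import random


def grail_labels(adj, rng):
    """One randomized labeling of a DAG given as adjacency lists."""
    n = len(adj)
    has_parent = [False] * n
    for u in range(n):
        for v in adj[u]:
            has_parent[v] = True
    roots = [v for v in range(n) if not has_parent[v]]
    rng.shuffle(roots)
    label, visited, rank = [None] * n, [False] * n, 0
    for root in roots:
        visited[root] = True
        # Explicit stack instead of recursion; children are visited in random order.
        stack = [(root, iter(rng.sample(adj[root], len(adj[root]))))]
        while stack:
            u, children = stack[-1]
            pushed = False
            for v in children:
                if not visited[v]:
                    visited[v] = True
                    stack.append((v, iter(rng.sample(adj[v], len(adj[v])))))
                    pushed = True
                    break
            if not pushed:
                stack.pop()
                rank += 1  # post-order rank of u
                low = min([label[c][0] for c in adj[u]] + [rank])
                label[u] = (low, rank)
    return label


def reachable(adj, labelings, u, v):
    """Answer 'can u reach v?' using the labels to prune a depth-first search."""
    def contains(a, b):
        return all(L[a][0] <= L[b][0] and L[b][1] <= L[a][1] for L in labelings)

    if not contains(u, v):
        return False  # interval not contained, so v is certainly unreachable
    stack, seen = [u], {u}
    while stack:
        w = stack.pop()
        if w == v:
            return True
        for c in adj[w]:
            if c not in seen and contains(c, v):
                seen.add(c)
                stack.append(c)
    return False
```

Containment is only a necessary condition, which is why a positive check still falls back to a pruned search; using several independent labelings reduces the number of such false positives.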


Scheduling Memory Access Optimization for HBM Based on CLOS


International Conference on Advanced Communication Technology (ICACT)

Authors: Shuang Xue; Huawei Liang; Qizhe Wu; Xi Jin. Affiliations: State Key Laboratory of Particle Detection and Electronics, University of Science and Technology of China, Hefei, China; Department of Physics, Institute of Microelectronics, University of Science and Technology of China, Hefei, China

With the recent release of FPGA boards based on High Bandwidth Memory (HBM), developers could employ unparalleled external memory bandwidth. HBM provides large-scale aggregated memory bandwidth by exposing multiple memory channels to the processing unit. This allows more memory-constrained applications to benefit from FPGA acceleration. However, it is difficult to take full advantage of available bandwidth: when an application requires multiple processing elements to access multiple HBM channels, the limited number of horizontal connections of the built-in crossbar in HBM can result in a significant reduction in effective bandwidth for global addressing. To solve this problem, we propose HBM connection, which is a high-performance custom interconnection for FPGA HBM boards. The high-performance custom switching network based on CLOS is introduced to replace the built-in crossbar to optimize HBM access scheduling and increase the throughput of AXI bus hosts and switching components. The validity of HBM connection is demonstrated on the Xilinx VCU128 HBM board. Based on a breadth-first search (BFS) case study, we conduct an experimental exploration. The result shows that HBM connection improves the effective performance by 2.5X.

Keywords: Memory architecture; Bandwidth; Switches; Throughput; Scheduling; Communications technology; Parallel algorithms


Distributed Domain Generation for Large-Scale Scientific Computing


20th International Symposium on Parallel and Distributed Computing (ISPDC)

Authors: Ertl, Christoph; Mundani, Ralf-Peter. Affiliations: Tech Univ Munich, Chair Computat Modeling & Simulat, Munich, Germany; Univ Appl Sci Grisons, Swiss Inst Informat Sci, Chur, Switzerland

ISBN: (print) 9781665432818

In this work, we present methods for distributed domain generation within the constraints of our decentral domain management concept. Here, all participating actors only have knowledge of their immediate neighbours, which are defined by geometric and hierarchical relations between nodes that represent subsets of the computational domain. We generate this domain following a hierarchical spacetree refinement. First, an initial tree is generated on every participating process. Second, this tree is distributed following a space-filling curve linearisation locally. Every process is assigned at least one leaf node of the initial tree, which acts as a starting point for the subsequent domain generation. From here, every process independently refines a subdomain using a decomposition method, which transforms a triangular surface-based geometry description into a volume-based one, using increasingly complex intersection tests. The resulting domain tree is distributed, yet neighbourhood references of neighbouring subtrees are not resolved. We combine the resolution of these relations with a 2:1 tree balancing, which involves the transfer of the surface of neighbouring subtrees. We provide results of a domain generation testcase, using an input geometry with 84,072 triangles on up to 896 processes of the CoolMUC-2 cluster segment of LRZ's Linux Cluster System. Here, we bring down the overall time it takes to generate an adaptively refined and balanced octree with depth d = 7 from 5.5 hours on one process to two seconds on 896 processes.

Keywords: Large-scale scientific computing; parallel algorithms; distributed algorithms; distributed systems; distributed domain generation; spacetrees


LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales


Computer Physics Communications, 2022, vol. 271

Authors: Thompson, Aidan P.; Aktulga, H. Metin; Berger, Richard; Bolintineanu, Dan S.; Brown, W. Michael; Crozier, Paul S.; Veld, Pieter J. in 't; Kohlmeyer, Axel; Moore, Stan G.; Nguyen, Trung Dac; Shan, Ray; Stevens, Mark J.; Tranchida, Julien; Trott, Christian; Plimpton, Steven J. Affiliations: Sandia Natl Labs, Albuquerque, NM 87185, USA; Michigan State Univ, E Lansing, MI 48824, USA; Temple Univ, Philadelphia, PA 19122, USA; Intel Corp, Hillsboro, OR 97124, USA; BASF SE, Ludwigshafen, Germany; Northwestern Univ, Evanston, IL 60208, USA; Mat Design Inc, San Diego, CA 92131, USA

Since the classical molecular dynamics simulator LAMMPS was released as an open source code in 2004, it has become a widely-used tool for particle-based modeling of materials at length scales ranging from atomic to mesoscale to continuum. Reasons for its popularity are that it provides a wide variety of particle interaction models for different materials, that it runs on any platform from a single CPU core to the largest supercomputers with accelerators, and that it gives users control over simulation details, either via the input script or by adding code for new interatomic potentials, constraints, diagnostics, or other features needed for their models. As a result, hundreds of people have contributed new capabilities to LAMMPS and it has grown from fifty thousand lines of code in 2004 to a million lines today. In this paper several of the fundamental algorithms used in LAMMPS are described along with the design strategies which have made it flexible for both users and developers. We also highlight some capabilities recently added to the code which were enabled by this flexibility, including dynamic load balancing, on-the-fly visualization, magnetic spin dynamics models, and quantum-accuracy machine learning interatomic potentials.

Program Summary
Program Title: Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)
CPC Library link to program files: https://doi.org/10.17632/cxbxs9btsv.1
Developer's repository link: https://github.com/lammps/lammps
Licensing provisions: GPLv2
Programming language: C++, Python, C, Fortran
Supplementary material: https://*** .org
Nature of problem: Many science applications in physics, chemistry, materials science, and related fields require parallel, scalable, and efficient generation of long, stable classical particle dynamics trajectories. Within this common problem definition, there lies a great diversity of use cases, distinguished by different particle interaction models, external constraints, as well

Keywords: Molecular dynamics; Materials modeling; Parallel algorithms; LAMMPS


Application Research of Ant Colony Algorithm in Cross Basin Reservoir Earthquake and Dam Strong Earthquake


Integrated Circuits and Communication Systems (ICICACS), IEEE International Conference on

Authors: Kouwu Wang; Zuwen Chen; Jie Huang; Xin Luo; Weihua Du; Aruna T M. Affiliations: Power China Guiyang Engineering Corporation Ltd., Guiyang, Guizhou, China; GuiZhou Building Information Model (BIM) Engineering Technology Research Center, Guiyang, Guizhou, China; GuiZhou Vocational Technology Institute, Guiyang, Guizhou, China; Department of Artificial Intelligence and Machine Learning, Nitte Meenakshi Institute of Technology, Bengaluru, India

The ant colony algorithm is a modern intelligent optimization algorithm, which is essentially a pseudo-random search and parallel algorithm. It handles some complex problems that are difficult to solve with conventional optimization algorithms, and improves people's ability to solve practical engineering optimization problems. For the problems of reservoir earthquake and dam strong earthquake, this paper analyzes the optimal solution under different parameter settings through the study of the ant colony algorithm and process, combined with sample *** results show that the improved ant colony algorithm can speed up the convergence of the algorithm and improve the overall quality of the solution.

Keywords: Integrated circuits; Dams; Communication systems; Earthquakes; Reservoirs; Particle swarm optimization; Parallel algorithms

