Particle shape plays a vital role in granular material behavior, but simulations with realistic particle shapes are uncommon due to the significant computational demands of representing complex particle geometry. In this work, the BLOKS3D polyhedral Discrete Element Method (DEM) and impulse-based DEM (iDEM) codes are parallelized to enable large-scale simulations with realistic particle shapes on readily accessible multi-core machines. Data structures used in the original codes were redesigned and optimized, improving the performance of the original serial codes by 15%. New parallel algorithms were developed, yielding a 28-fold performance improvement on a 48-core (quad-CPU) shared-memory system over the single-core serial algorithm. The parallelized 3D polyhedral DEM and iDEM were applied to a series of column collapse simulations. The codes successfully reproduced the runout distance observed in granular column collapse experiments, and the particle force data from both parallelized DEM and iDEM matched data from the serial algorithm. The new parallel implementation of iDEM was then demonstrated with unprecedented simulations of 52 million 3D polyhedral particles. This work will benefit future granular material studies by introducing the capacity to run large-scale simulations with realistic particle shapes on shared-memory hardware platforms readily accessible to many engineers and researchers.
Cavitating and bubbly flows are encountered in many engineering problems involving propellers, pumps, valves, ultrasonic biomedical applications, etc. In this contribution, an OpenMP-parallelized Euler-Lagrange model of two-phase flow problems and cavitation is presented. The two-phase medium is treated as a continuum and solved on an Eulerian grid, while the discrete bubbles are tracked in a Lagrangian fashion with their dynamics computed. The intimate coupling between the two description levels is realized through the local void fraction, which is computed from the instantaneous bubble volumes and locations and provides the continuum properties. Since, in practice, any such flows will involve large numbers of bubbles, schemes for significant speedup are needed to reduce computation times. We present here a shared-memory parallelization scheme combining domain decomposition for the continuum domain and number decomposition for the bubbles, both selected to realize maximum speedup and good load balance. The Eulerian computational domain is subdivided based on geometry into several subdomains, while for the Lagrangian computations the bubbles are subdivided based on their indices into several subsets. The number of fluid subdomains and bubble subsets matches the number of central processing unit (CPU) cores available in a shared-memory system. Computation of the continuum solution and the bubble dynamics proceeds sequentially. During each computation time step, all selected OpenMP threads are first used to evolve the fluid solution, with each handling one subdomain. Upon completion, the OpenMP threads selected for the Lagrangian solution are then used to execute the bubble computations. All data exchanges are executed through the shared memory. Extra steps are taken to localize the memory access pattern and minimize non-local data fetch latency, since a severe performance penalty may occur on a non-uniform memory architecture (NUMA) multiprocessing system where threads access remote memory.
Scheduling problems occur in a broad range of real-world application fields and have attracted a large body of research. However, there is little research on exact algorithms for scheduling problems, many of which are NP-hard in the strong sense. We investigate the single-machine problem with a total weighted tardiness objective function and sequence-dependent setup times. First, we adopt a serial branch-and-price algorithm from the literature and present a modified branching strategy and a primal heuristic. Second, we exploit the potential of parallel computing architectures by presenting two parallel versions of the branch-and-price algorithm. Third, we conduct extensive computational experiments to show that our parallelization approaches provide substantial parallel speedups on well-known benchmark instances from the literature. We further observe that the speedups achieved by our parallel algorithms are very robust across all tested instances.
OpenMP is a parallel computing framework that provides programmers with a set of directives and clauses to use when writing parallel applications. The most important task in adopting OpenMP is deciding which parallel pattern, with its associated clauses, to employ in an existing sequential program. Shared-memory parallelization is complicated by parallel directives with different roles. Some tools have been developed to assist programmers in developing parallel programs using OpenMP. Many tools, however, have constraints on the size of program they can analyze, on OpenMP scoping, and on scalar and array reduction. Manually selecting clauses with the necessary data-sharing attributes is also error-prone. In this study, we target the classification of variables in directives to exploit loop-level parallelism. We cast the variable classification problem as a type inference task solved with a machine learning method, which understands the attributes of variables in certain contexts and relations. We propose an aligned corpus of tokens and types to predict the attributes of variables used inside the target loop, and we support the reduction clause whenever it is applicable. Experimental results indicate that our method is very promising and well suited to dealing with real-world complex programs, showing high accuracy. (c) 2022 Elsevier B.V. All rights reserved.
The recently developed Hierarchical Poincare-Steklov (HPS) method is a high-order discretization technique that comes with a direct solver. Results from previous papers demonstrate the method's ability to solve Helmholtz problems to high accuracy without the so-called pollution effect. While the asymptotic scaling of the direct solver's computational cost is the same as that of the nested dissection method, serial implementations of the solution technique are not practical for large-scale numerical simulations. This manuscript presents the first parallel implementation of the HPS method. Specifically, we introduce an approach for a shared-memory implementation of the solution technique utilizing parallel linear algebra. This approach is the foundation for future large-scale simulations on supercomputers and clusters with large-memory nodes. Performance results on a desktop computer (resembling a large-memory node) are presented. (C) 2019 Elsevier Ltd. All rights reserved.
The fast marching method is commonly used in expanding-front simulations in various fields, such as fluid dynamics, computer graphics, and microelectronics, to restore the signed-distance-field property of the level-set function, a step also known as re-distancing. To improve the performance of the re-distancing step, parallel algorithms for the fast marching method as well as support for hierarchical meshes have been developed; the latter locally supports higher resolutions of the simulation domain whilst limiting the impact on the overall computational demand. In this work, the previously developed multi-mesh fast marching method is extended by a so-called block-based decomposition step to improve serial and parallel performance on hierarchical meshes. OpenMP tasks are used for the underlying coarse-grained parallelization on a per-mesh basis. The developed approach offers improved load balancing, as the algorithm employs a high mesh-partitioning degree, enabling it to balance mesh partitions of varying sizes. Various benchmarks and parameter studies are performed on representative geometries of varying complexity. The serial performance is increased by up to 21%, and parallel speedups ranging from 7.4 to 19.1 have been achieved for various test cases on a 24-core Intel Skylake computing platform, effectively doubling the parallel efficiency of the previous approach. (C) 2021 The Author(s). Published by Elsevier B.V.
The fast marching method is well known for its worst-case optimal computational complexity in solving the Eikonal equation, and it has been employed in numerous scientific and engineering fields. However, it has barely benefited from fast-advancing multicore and many-core architectures because of the challenges presented by its apparently sequential nature. In this paper, we present a straightforward block-based approach for a highly scalable parallelization of the fast marching method on shared-memory computers. Central to our new algorithm is a simplified restarted narrow-band strategy, which replaces the global bound for terminating the front marching with an incremental bound that increases by a given stride in each restart. Our algorithm greatly reduces load imbalance among blocks through a synchronized exchange step after the marching step. Furthermore, simple activation mechanisms are introduced to skip blocks not involved in either step. Thus the new algorithm mainly consists of two loops, each performing one step on a group of blocks. Notably, it does not contain any data races, which is ideal for a direct loop-level parallelization using multiple threads. A systematic performance study is carried out for several point-source problems with various speed functions on grids with up to 1024^3 points using two different computers. Substantial parallel speedups, e.g., up to 3-5 times the number of cores on a 16-core/32-thread computer and up to 2 orders of magnitude on a 64-core/256-thread computer, are demonstrated in all cases. As an extra bonus, our new algorithm also gives much improved sequential performance in most scenarios. Detailed pseudocodes are provided to illustrate, step by step, the modification from the single-block to the multiblock sequential algorithm.
As a typical Gauss-Seidel method, lower-upper symmetric Gauss-Seidel (LU-SGS) has an inherently strong data dependency that poses tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience with our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm and present a two-level pipeline (TL-Pipeline) approach using nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves 20% performance gains for a regular problem (256 x 256 x 256) on Xeon Phi. We also discuss practical issues, including domain decomposition and algorithm parameter tuning, for realistic CFD simulations. Generally, our work is applicable to the shared-memory parallelization of all Gauss-Seidel-like methods with intrinsically strong data dependencies.
Petroleum system modeling is a crucial technology for numerically simulating the generation, migration, accumulation, and loss of oil and gas through geologic time. The OpenMP programming paradigm is used to achieve modest parallelism on a shared-memory computer for the industrial petroleum system modeling package PetroMod. The significant advantage of this shared-memory parallelization approach is the simplicity of the OpenMP paradigm, which allows a smooth transition to a parallel program by incrementally parallelizing a serial code. The process of using OpenMP to parallelize PetroMod is outlined, and performance results on a Sun Fire E2900 are reported. (C) 2008 Elsevier Ltd. All rights reserved.
To date, there has been a lack of efficient and practical distributed- and shared-memory parallelizations of the data association problem for multitarget tracking. Filling this gap is one of the primary focuses of the present work. We begin by describing our data association algorithm in terms of an Interacting Multiple Model (IMM) state estimator embedded into an optimization framework, namely, a two-dimensional (2D) assignment problem (i.e., weighted bipartite matching). Contrary to conventional wisdom, we show that the data association (or optimization) problem is not the major computational bottleneck; instead, the interface to the optimization problem, namely, computing the rather numerous gating tests, IMM state estimates, covariance calculations, and likelihood function evaluations (used as cost coefficients in the 2D assignment problem), is the primary source of the workload. Hence, for both a general-purpose shared-memory MIMD (Multiple Instruction Multiple Data) multiprocessor system and a distributed-memory Intel Paragon high-performance computer, we developed parallelizations of the data association problem that focus on the interface problem. For the former, a coarse-grained dynamic parallelization was developed that realizes excellent performance (i.e., superlinear speedups) independent of numerous factors influencing problem size (e.g., many models in the IMM, dense/cluttered environments, contentious target-measurement data, etc.). For the latter, an SPMD (Single Program Multiple Data) parallelization was developed that realizes near-linear speedups using relatively simple dynamic task allocation algorithms. Using a real measurement database based on two FAA air traffic control radars, we show that the parallelizations developed in this work offer great promise in practice.