Computing strongly connected components (SCC) is among the most fundamental problems in graph analytics. Given the large size of today's real-world graphs, parallel SCC implementation is increasingly important. SCC is challenging in the parallel setting and is particularly hard on large-diameter graphs. Many existing parallel SCC implementations can be even slower than Tarjan's sequential algorithm on large-diameter graphs. To tackle this challenge, we propose an efficient parallel SCC implementation using a new parallel reachability approach. Our solution is based on a novel idea referred to as vertical granularity control (VGC). It breaks the synchronization barriers to increase parallelism and hide scheduling overhead. To use VGC in our SCC algorithm, we also design an efficient data structure called the parallel hash bag. It uses parallel dynamic resizing to avoid redundant work in maintaining frontiers (vertices processed in a round). We implement the parallel SCC algorithm by Blelloch et al. (J. ACM, 2020) using our new parallel reachability approach. We compare our implementation to state-of-the-art systems, including GBBS, iSpan, Multi-step, and our highly optimized Tarjan's (sequential) algorithm, on 18 graphs, including social, web, k-NN, and lattice graphs. On a machine with 96 cores, our implementation is the fastest on 16 out of 18 graphs. On average (geometric mean) over all graphs, our SCC is 6.0× faster than the best previous parallel code (GBBS), 12.8× faster than Tarjan's sequential algorithm, and 2.7× faster than the best existing implementation on each graph. We believe that our techniques are of independent interest. We also apply our parallel hash bag and VGC scheme to other graph problems, including connectivity and least-element lists (LE-lists). Our implementations improve the performance of the state-of-the-art parallel implementations for these two problems.
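To make the frontier-based reachability pattern these techniques target concrete, here is a minimal round-synchronous reachability sketch in C++ with OpenMP. It is not the paper's VGC/hash-bag implementation: the frontier is a plain vector rebuilt each round, and the implicit barrier between rounds plus the frontier-maintenance cost are exactly the overheads that VGC and the parallel hash bag are designed to reduce. The CSR field names (`offsets`, `targets`) are illustrative.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Minimal CSR graph; field names are illustrative.
struct Graph {
  std::vector<uint32_t> offsets;  // size n+1
  std::vector<uint32_t> targets;  // size m
};

// Round-synchronous parallel reachability from `source`.
// Each round processes the current frontier in parallel and
// collects the next frontier; the barrier between rounds is
// the scheduling overhead that VGC aims to hide.
std::vector<uint8_t> reachable(const Graph& g, uint32_t source) {
  const uint32_t n = (uint32_t)g.offsets.size() - 1;
  std::vector<std::atomic<uint8_t>> visited(n);
  for (auto& v : visited) v.store(0, std::memory_order_relaxed);
  visited[source].store(1);

  std::vector<uint32_t> frontier = {source};
  while (!frontier.empty()) {
    std::vector<uint32_t> next;
#pragma omp parallel
    {
      std::vector<uint32_t> local;  // per-thread output buffer
#pragma omp for nowait
      for (long i = 0; i < (long)frontier.size(); ++i) {
        uint32_t u = frontier[i];
        for (uint32_t e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
          uint32_t v = g.targets[e];
          uint8_t expected = 0;  // claim v exactly once via CAS
          if (visited[v].compare_exchange_strong(expected, 1))
            local.push_back(v);
        }
      }
#pragma omp critical
      next.insert(next.end(), local.begin(), local.end());
    }
    frontier.swap(next);
  }
  std::vector<uint8_t> out(n);
  for (uint32_t i = 0; i < n; ++i) out[i] = visited[i].load();
  return out;
}
```

On a large-diameter graph this loop runs many near-empty rounds, so the per-round barrier and frontier rebuild dominate, which is the failure mode the paper's approach addresses.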
Parallel Givens sequences for solving the General Linear Model (GLM) are developed and analyzed. The block updating GLM estimation problem is also considered. The solution of the GLM employs as its main computational device the Generalized QR Decomposition, where one of the two matrices is initially upper triangular. The proposed Givens sequences efficiently exploit the initial triangular structure of the matrix and special properties of the solution method. The complexity analysis of the sequences is based on an Exclusive-Read Exclusive-Write (EREW) Parallel Random Access Machine (PRAM) model with limited parallelism. Furthermore, the number of operations performed by a Givens rotation is determined by the size of the vectors used in the rotation. Under these assumptions, one conclusion drawn is that a sequence which applies the smallest number of compound disjoint Givens rotations to solve the GLM estimation problem does not necessarily have the lowest computational complexity. The various Givens sequences and their computational complexity analyses will be useful when addressing the solution of other similar factorization problems.
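As a reference point for the rotations these sequences are built from, the sketch below computes and applies a single standard Givens rotation that annihilates one entry of a 2-vector. This is textbook material (the stable formulation from Golub & Van Loan), not the paper's compound-rotation scheduling.

```cpp
#include <cmath>
#include <vector>

// Compute c, s such that [c s; -s c]^T * [a; b] = [r; 0].
// Numerically stable formulation: divide by the larger entry.
void givens(double a, double b, double& c, double& s) {
  if (b == 0.0) { c = 1.0; s = 0.0; return; }
  if (std::fabs(b) > std::fabs(a)) {
    double t = -a / b;
    s = 1.0 / std::sqrt(1.0 + t * t);
    c = s * t;
  } else {
    double t = -b / a;
    c = 1.0 / std::sqrt(1.0 + t * t);
    s = c * t;
  }
}

// Apply the rotation to rows i and k of a dense row-major matrix.
// The cost is linear in the row length, which is why the complexity
// model above charges each rotation by the size of the vectors involved.
void applyGivens(std::vector<double>& m, int ncols, int i, int k,
                 double c, double s) {
  for (int j = 0; j < ncols; ++j) {
    double t1 = m[i * ncols + j], t2 = m[k * ncols + j];
    m[i * ncols + j] = c * t1 - s * t2;
    m[k * ncols + j] = s * t1 + c * t2;
  }
}
```

Rotations acting on disjoint row pairs commute and can be applied concurrently, which is what makes scheduling compound disjoint rotations on an EREW PRAM meaningful.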
Solving the deterministic equivalent formulation of two-stage stochastic programs using interior point algorithms requires the solution of linear systems of the form (AD²Aᵀ)x = b.
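For context, the following is the standard normal-equations system that arises in primal-dual interior point methods, together with the textbook dual block-angular structure of the two-stage deterministic equivalent; this is reconstructed from standard stochastic programming notation rather than quoted from the (truncated) abstract.

```latex
% Standard IPM normal equations: D is the diagonal primal-dual
% scaling matrix, updated at every interior point iteration.
\[
  (A D^{2} A^{\top})\,\Delta y = b, \qquad
  D = \operatorname{diag}(d_1,\dots,d_n).
\]
% Deterministic equivalent of a two-stage stochastic program with
% scenarios 1..N: first-stage matrix A_0, technology matrices T_i,
% and recourse matrices W_i give A its dual block-angular structure,
% which specialized solvers exploit.
\[
  A =
  \begin{pmatrix}
    A_0    &     &        &     \\
    T_1    & W_1 &        &     \\
    \vdots &     & \ddots &     \\
    T_N    &     &        & W_N
  \end{pmatrix}.
\]
```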
Although granular materials have always been an important part of our everyday life, their characteristics and behavior are still only rudimentarily understood. Numerical simulation has therefore gained increasing importance as a means to obtain deeper insight into the properties of granular media. One simulation approach is rigid body dynamics. In contrast to particle-based approaches, it fully resolves the granular particles as geometric objects and incorporates frictional contact dynamics. However, due to its complexity and the lack of large-scale parallelization, rigid body dynamics could so far not be used for very large simulation scenarios. In this paper we demonstrate massively parallel granular media simulations by means of a parallel rigid body dynamics algorithm. We validate the algorithm for granular gas simulations and demonstrate its scalability on up to 131,072 processor cores. Additionally, we show several parallel granular material simulations with both spherical and non-spherical granular particles.
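To illustrate the difference in kind from particle-based (DEM-style penalty force) methods, here is a highly simplified serial time-stepping skeleton for rigid body dynamics: bodies are resolved geometrically (spheres only here), and contacts are resolved jointly at the velocity level. Rotation and friction are omitted, and all names are illustrative; the paper's contribution is the large-scale parallelization of this kind of solver, not shown here.

```cpp
#include <cmath>
#include <vector>

struct Body {
  double pos[3], vel[3];
  double radius, invMass;   // invMass = 0 marks a fixed body
};

struct Contact {
  int a, b;
  double n[3];              // contact normal, from a towards b
};

// Geometric contact detection: bodies are resolved as objects
// (here spheres), not smeared into penalty force fields.
static std::vector<Contact> detectContacts(const std::vector<Body>& bodies) {
  std::vector<Contact> cs;
  for (size_t i = 0; i < bodies.size(); ++i)
    for (size_t j = i + 1; j < bodies.size(); ++j) {
      double d[3], dist2 = 0.0;
      for (int k = 0; k < 3; ++k) {
        d[k] = bodies[j].pos[k] - bodies[i].pos[k];
        dist2 += d[k] * d[k];
      }
      double r = bodies[i].radius + bodies[j].radius;
      if (dist2 < r * r && dist2 > 0.0) {
        double dist = std::sqrt(dist2);
        cs.push_back({(int)i, (int)j,
                      {d[0] / dist, d[1] / dist, d[2] / dist}});
      }
    }
  return cs;
}

// Velocity-level contact resolution: Gauss-Seidel sweeps over the
// coupled contacts, applying non-penetration impulses. This joint
// solve distinguishes rigid body dynamics from the independent
// pairwise spring-dashpot forces of particle-based methods.
static void solveContacts(std::vector<Body>& bodies,
                          const std::vector<Contact>& cs, int iters = 10) {
  for (int it = 0; it < iters; ++it)
    for (const Contact& c : cs) {
      Body &A = bodies[c.a], &B = bodies[c.b];
      double wsum = A.invMass + B.invMass;
      if (wsum == 0.0) continue;               // two fixed bodies
      double vrel = 0.0;                       // closing velocity along n
      for (int k = 0; k < 3; ++k) vrel += (B.vel[k] - A.vel[k]) * c.n[k];
      if (vrel >= 0.0) continue;               // already separating
      double p = -vrel / wsum;                 // normal impulse
      for (int k = 0; k < 3; ++k) {
        A.vel[k] -= p * A.invMass * c.n[k];
        B.vel[k] += p * B.invMass * c.n[k];
      }
    }
}

void step(std::vector<Body>& bodies, double dt) {
  for (auto& b : bodies)
    if (b.invMass > 0.0) b.vel[2] -= 9.81 * dt;  // gravity
  solveContacts(bodies, detectContacts(bodies));
  for (auto& b : bodies)                          // advance positions
    for (int k = 0; k < 3; ++k) b.pos[k] += b.vel[k] * dt;
}
```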
ISBN: (print) 9781450362177
We design a generic method to reduce the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs. This method enables us to provide efficient implementations for approximating weighted matchings in the massively parallel computation (MPC) model and in the streaming model. For the MPC and the multi-pass streaming model, we show that any algorithm computing a (1 − δ)-approximate unweighted matching in bipartite graphs can be translated into an algorithm that computes a (1 − ε(δ))-approximate maximum weighted matching. Furthermore, this translation incurs only a constant factor (that depends on ε > 0) overhead in the complexity. Instantiating this with the current best MPC algorithm for unweighted matching yields a (1 − ε)-approximation algorithm for maximum weighted matching that uses O_ε(log log n) rounds, O(m/n) machines per round, and Õ(n poly(log n)) memory per machine. This improves upon the previous best approximation guarantee of (1/2 − ε) for weighted graphs. In the context of single-pass streaming with random edge arrivals, our techniques yield a (1/2 + c)-approximation algorithm, thus breaking the natural barrier of 1/2.
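The reduction itself is more delicate than an abstract can convey. As a rough illustration of the general shape of weighted-to-unweighted reductions, the hypothetical sketch below buckets edges into geometric weight classes and invokes a black-box unweighted matcher per class, heaviest first. This folklore scheme does not achieve the paper's (1 − ε(δ)) guarantee and is not their construction; all names (`Edge`, `Matcher`, `matchByWeightClasses`, `eps`) are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

struct Edge { int u, v; double w; };
// Black-box unweighted matcher: takes an edge list, returns a matching.
using Matcher = std::function<std::vector<Edge>(const std::vector<Edge>&)>;

std::vector<Edge> matchByWeightClasses(int n, const std::vector<Edge>& edges,
                                       double eps, const Matcher& matcher) {
  double wmax = 0.0;
  for (const Edge& e : edges) wmax = std::max(wmax, e.w);
  if (wmax <= 0.0) return {};

  // Geometric weight classes: class c holds weights in
  // (wmax/(1+eps)^(c+1), wmax/(1+eps)^c]. Edges lighter than
  // eps*wmax/n can change the total weight by at most eps*wmax.
  int levels =
      (int)std::ceil(std::log((double)n / eps) / std::log(1 + eps)) + 1;
  std::vector<std::vector<Edge>> buckets(levels);
  for (const Edge& e : edges) {
    if (e.w <= eps * wmax / n) continue;       // negligible weight
    int c = (int)std::floor(std::log(wmax / e.w) / std::log(1 + eps));
    buckets[std::min(c, levels - 1)].push_back(e);
  }

  std::vector<char> used(n, 0);
  std::vector<Edge> result;
  for (const auto& bucket : buckets) {         // heaviest class first
    std::vector<Edge> alive;                   // drop blocked edges
    for (const Edge& e : bucket)
      if (!used[e.u] && !used[e.v]) alive.push_back(e);
    for (const Edge& e : matcher(alive))       // unweighted call
      if (!used[e.u] && !used[e.v]) {
        used[e.u] = used[e.v] = 1;
        result.push_back(e);
      }
  }
  return result;
}
```

The appeal of this interface is that any improvement to the unweighted matcher transfers immediately, which is the same modularity the paper's stronger reduction provides.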
We propose a parallel preconditioner for the Newton method in the computation of the leftmost eigenpairs of large and sparse symmetric positive definite matrices. A sequence of preconditioners, starting from an enhanced approximate inverse RFSAI (Bergamaschi and Martinez, 2012) and enriched by a BFGS-like update formula, is proposed to accelerate the preconditioned conjugate gradient solution of the linearized Newton system for Au = q(u)u, where q(u) is the Rayleigh quotient. In a previous work (Bergamaschi and Martinez, 2013), the sequence of preconditioned Jacobians was proven to remain close to the identity matrix if the initial preconditioned Jacobian is so. Numerical results on matrices arising from various realistic problems with sizes up to 1.5 million unknowns demonstrate the efficiency and scalability of the proposed low-rank update of the RFSAI preconditioner. The overall RFSAI-BFGS preconditioned Newton algorithm shows efficiency comparable to that of a well-established eigenvalue solver on all the test problems.
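For orientation, the standard objects involved are sketched below in the usual notation; the exact Jacobian and update used in the paper may differ in details (projections, scalings) not recoverable from the abstract.

```latex
% Rayleigh quotient and the nonlinear eigenvalue residual.
\[
  q(u) = \frac{u^{\top} A u}{u^{\top} u}, \qquad
  F(u) = A u - q(u)\,u = 0 .
\]
% Linearized Newton system solved by PCG at step k: find the
% correction s_k with J_k s_k = -F(u_k), where (up to projection
% terms) J_k = A - q(u_k) I, then set u_{k+1} = u_k + s_k.
\[
  J_k \, s_k = -\bigl(A u_k - q(u_k)\,u_k\bigr), \qquad
  u_{k+1} = u_k + s_k .
\]
```

The preconditioner sequence tracks the slowly varying Jacobians J_k, which is why keeping each preconditioned Jacobian close to the identity is the relevant property.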
The compensation of the scale factor imposes significant computation overhead on the CORDIC algorithm. In this paper we present two algorithms and the corresponding architectures (one for both rotation and vectoring modes and the other for rotation mode only) that perform the scale factor compensation in parallel with the classical CORDIC iterations. With these methods, the scale factor compensation overhead is reduced to a couple of iterations for any word length. The architectures presented have been optimized for both conventional and redundant arithmetic.
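As a baseline for what is being optimized, here is a conventional floating-point CORDIC rotation sketch that compensates the scale factor with a final multiplication by 1/K. That serial post-multiply (or the equivalent extra scaling iterations) is exactly the overhead the paper hides by compensating in parallel with the main iterations.

```cpp
#include <cmath>
#include <cstdio>

// Conventional CORDIC in rotation mode: rotates (1, 0) by `angle`
// using shift-add micro-rotations. Every micro-rotation stretches
// the vector by sqrt(1 + 2^-2i); the product of these factors is
// the scale factor K ~= 1.6468, compensated here by a final
// multiply with 1/K (a constant in hardware).
void cordicRotate(double angle, int iters, double& cosOut, double& sinOut) {
  double x = 1.0, y = 0.0, z = angle;
  double invK = 1.0;
  for (int i = 0; i < iters; ++i) {
    double t = std::ldexp(1.0, -i);            // 2^-i
    invK /= std::sqrt(1.0 + t * t);            // accumulate 1/K
    double d = (z >= 0.0) ? 1.0 : -1.0;        // rotation direction
    double xn = x - d * y * t;
    double yn = y + d * x * t;
    z -= d * std::atan(t);                     // subtract atan(2^-i)
    x = xn; y = yn;
  }
  cosOut = x * invK;                           // scale factor compensation
  sinOut = y * invK;
}

int main() {
  double c, s;
  cordicRotate(0.5, 32, c, s);
  std::printf("cos(0.5)=%.6f sin(0.5)=%.6f\n", c, s);  // ~0.877583 0.479426
}
```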
The paper recalls the period 1988-1993, when the research on parallel algorithms and their implementation started in Karl-Marx-Stadt (renamed to Chemnitz in 1990). We consider the research group formed at this time and ...
The purpose of this short note is to show that the problem of reconstructing a directed forest from a collection of leaf-to-root paths can be done efficiently in parallel by reducing the problem to integer sorting. Specifically, given M, the total length of the paths in the collection, and n, the number of distinct node labels, our algorithm reconstructs the corresponding forest (if such a forest exists) in O(M/p) time using p ≤ M/n processors, or, with a different time bound, using M/n < p < M processors, and O(M) space on the EREW PRAM.
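A sequential rendering of the reduction makes the idea concrete: every consecutive pair on a path asserts one (child, parent) edge, so sorting and deduplicating the pairs either yields each node's unique parent or exposes a contradiction. In the paper the sort is a parallel integer sort, which is where the stated bounds come from; the sketch below is serial and the names are illustrative.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Reconstruct parent pointers from leaf-to-root paths, or return
// std::nullopt if no forest is consistent with the collection.
// Nodes are labeled 0..n-1; parent[root] == root by convention here.
std::optional<std::vector<int>> reconstructForest(
    int n, const std::vector<std::vector<int>>& paths) {
  std::vector<std::pair<int, int>> edges;      // (child, parent) claims
  for (const auto& path : paths)
    for (size_t i = 0; i + 1 < path.size(); ++i)
      edges.emplace_back(path[i], path[i + 1]);

  // The reduction: sorting groups all claims about the same child
  // together (the paper performs this step with parallel integer sort).
  std::sort(edges.begin(), edges.end());
  edges.erase(std::unique(edges.begin(), edges.end()), edges.end());

  std::vector<int> parent(n);
  for (int v = 0; v < n; ++v) parent[v] = v;   // default: own root
  for (size_t i = 0; i < edges.size(); ++i) {
    if (i > 0 && edges[i].first == edges[i - 1].first)
      return std::nullopt;                     // two parents for one child
    parent[edges[i].first] = edges[i].second;
  }
  // A complete checker would also verify acyclicity (e.g., by pointer
  // jumping in the parallel setting); omitted here for brevity.
  return parent;
}
```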
ISBN: (print) 9781450392686
Genetic programming (GP) has been applied to image classification and has achieved promising results. However, most GP-based image classification methods are only applied to small-scale image datasets because of their high computation cost. Efficient acceleration techniques are needed when extending GP-based image classification methods to large-scale datasets. Considering that fitness evaluation is the most time-consuming phase of the GP evolution process and is highly parallelizable, this paper proposes a CPU multiprocessing and GPU parallelization approach to perform this phase and thus effectively accelerate GP for image classification. Through various experiments, the results show that the highly parallelized approach can significantly accelerate GP-based image classification without performance degradation. The training time of the GP-based image classification method is reduced from several weeks to tens of hours, enabling it to run on large-scale image datasets.
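The CPU-side half of such a scheme reduces to evaluating the population's fitness in parallel, since individuals are independent of one another. A minimal OpenMP sketch is below; the paper additionally offloads evaluation work to the GPU, which is not shown, and all names here are illustrative.

```cpp
#include <functional>
#include <vector>

struct Individual {            // a GP program plus its fitness slot
  // ... program representation omitted ...
  double fitness = 0.0;
};

// Evaluate every individual on the full dataset in parallel.
// Individuals are independent, so fitness evaluation -- the dominant
// cost of a GP generation -- parallelizes with no synchronization
// beyond the implicit barrier at the end of the loop.
void evaluatePopulation(
    std::vector<Individual>& population,
    const std::function<double(const Individual&)>& fitnessOnDataset) {
#pragma omp parallel for schedule(dynamic)
  for (long i = 0; i < (long)population.size(); ++i)
    population[i].fitness = fitnessOnDataset(population[i]);
}
```

Dynamic scheduling matters here because GP trees vary widely in size, so per-individual evaluation times are uneven.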