Supercomputers are equipped with an increasingly large number of cores to use computational power as a way of solving problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in paralle...
详细信息
Supercomputers are equipped with an increasingly large number of cores to use computational power as a way of solving problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in parallel to take advantage of these computational resources remains a challenge for several application domains. Many parallel algorithms can scale to only hundreds of cores. The limiting factors of such algorithms are usually communication overhead and poor load balancing. Solving NP-hard graph problems to optimality using exact algorithms is an example of an area in which there has so far been limited success in obtaining large scale parallelism. Many of these algorithms use recursive backtracking as their core solution paradigm. In this paper, we propose a lightweight, easy-to-use, scalable approach for transforming almost any recursive backtracking algorithm into a parallel one. Our approach incurs minimal communication overhead and guarantees a load-balancing strategy that is implicit, i.e., does not require any problem-specific knowledge. The key idea behind our approach is the use of efficient traversal operations on an indexed search tree that is oblivious to the problem being solved. We test our approach with parallel implementations of algorithms for the well-known Vertex Cover and Dominating Set problems. On sufficiently hard instances, experimental results show nearly linear speedups for thousands of cores, reducing running times from days to just a few minutes. (C) 2015 Elsevier Inc. All rights reserved.
Numerical reproducibility failures rise in parallel computation because of the non-associativity of floating-point summation. Optimizations on massively parallel systems dynamically modify the floating-point operation...
详细信息
ISBN:
(纸本)9781509057085
Numerical reproducibility failures rise in parallel computation because of the non-associativity of floating-point summation. Optimizations on massively parallel systems dynamically modify the floating-point operation order. Hence, numerical results may change from one run to another. We propose to ensure reproducibility by extending as far as possible the IEEE-754 correct rounding property to larger operation sequences. Our RARE-BLAS (Reproducible, Accurately Rounded and Efficient BLAS) benefits from recent accurate and efficient summation algorithms. Solutions for level 1 (asum, dot and nrm2) and level 2 (gemv) routines are provided. We compare their performance to the Intel MKL library and to other existing reproducible algorithms. For both shared and distributed memory parallel systems, we exhibit an extra-cost of 2× in the worst case scenario, which is satisfying for a wide range of applications. For Intel Xeon Phi accelerator a larger extra-cost (4× to 6×) is observed, which is still helpful at least for debugging and validation.
This paper presents a parallel algorithm and its hardware architecture for implementing 2-D gray-scale morphological operations namely dilation and erosion using rectangular flat top structuring elements. The proposed...
详细信息
ISBN:
(纸本)9781467366229
This paper presents a parallel algorithm and its hardware architecture for implementing 2-D gray-scale morphological operations namely dilation and erosion using rectangular flat top structuring elements. The proposed architecture supports parallel extension whereby throughput and processing frame rate is enhanced. The architecture is fully generic and runtime programmable with respect to image size and structuring elements size respectively. The main advantage of the architecture is its low latency, lower internal memory requirements, higher processing frame rate and throughput which makes it more amenable to real time applications. Additionally, it makes use of stream processing which eliminates the need for buffering image data, whereby memory overhead is minimized. The architecture has been synthesized using Xilinx Design Suite 14.2 ISE and prototyped on Virtex 5 FPGA Board and verified using xilinx ISIM Simulator. The proposed architecture has been tested for images of varied gray-scale geometric dimension and the results shows satisfactory performance.
In this paper, we consider the computation in parallel of several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are dist...
详细信息
In this paper, we consider the computation in parallel of several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are distributed. Entries are efficiently computed by exploiting sparsity of the right-hand sides and the solution vectors in the triangular solution phase. We demonstrate that in this setting, parallelism and computational efficiency are two contrasting objectives. We develop an efficient approach and show its efficiency on a general purpose parallel multifrontal solver.
In this paper a parallel algorithm for branch and bound applications is proposed. The algorithm is a general purpose one and it can be used to parallelize effortlessly any sequential branch and bound style algorithm, ...
详细信息
In this paper a parallel algorithm for branch and bound applications is proposed. The algorithm is a general purpose one and it can be used to parallelize effortlessly any sequential branch and bound style algorithm, that is written in a certain format. It is a distributed dynamic scheduling algorithm, i.e. each node schedules the load of its cores, it can be used with different programming platforms and architectures and is a hybrid algorithm (OpenMP, MPI). To prove its validity and efficiency the proposed algorithm has been implemented and tested with numerous examples in this paper that are described in detail. A speed-up of about 9 has been achieved for the tested examples, for a cluster of three nodes with four cores each.
Using (a, b)-trees as an example, we show how to perform a parallel split with logarithmic latency and parallel join, bulk updates, intersection, union (or merge), and (symmetric) set difference with logarithmic laten...
详细信息
ISBN:
(纸本)9781509054121
Using (a, b)-trees as an example, we show how to perform a parallel split with logarithmic latency and parallel join, bulk updates, intersection, union (or merge), and (symmetric) set difference with logarithmic latency and with information theoretically optimal work. We present both asymptotically optimal solutions and simplified versions that perform well in practice - they are several times faster than previous implementations.
The k-nearest neighbor (kNN) join has recently attracted considerable attention due to its broad applications. However, processing fcNN joins is very expensive due to the quadratic nature of the join operation. Furthe...
详细信息
ISBN:
(纸本)9781467390064
The k-nearest neighbor (kNN) join has recently attracted considerable attention due to its broad applications. However, processing fcNN joins is very expensive due to the quadratic nature of the join operation. Furthermore, since there is an increasing trend of applications to deal with big data, computing fcNN joins becomes more challenging. In order to process such big data, parallel and distributed computing using MapReduce recently have received a lot of attention. In this paper, we propose the efficient parallel algorithm KNN-MR to process the fcNN joins using MapReduce. To reduce not only the computational cost of fcNN joins but also the network cost of communicating across machines, we develop the novel vector projection pruning which enables us to identify non-fcNN points that are guaranteed not to be included in the result of a fcNN join. Our performance study confirms the effectiveness and scalability of the proposed algorithm.
During the last decades, Multigrid methods have been extensively used for solving large sparse linear systems. Considering their efficiency and the convergence behavior, Multigrid methods are used in many scientific f...
详细信息
During the last decades, Multigrid methods have been extensively used for solving large sparse linear systems. Considering their efficiency and the convergence behavior, Multigrid methods are used in many scientific fields as solvers or preconditioners. Herewith, we propose two hybrid parallel algorithms for N-Body simulations using the Particle Mesh method and the Particle Particle Particle Mesh method, respectively, based on the V-Cycle Multigrid method in conjunction with Generic Approximate Sparse Inverses. The N-Body problem resides in a three-dimensional torus space, and the bodies are subject only to gravitational forces. In each time step of the above methods, a large sparse linear system is solved to compute the gravity potential at each nodal point in order to interpolate the solution to each body. Then the Velocity Verlet method is used to compute the new position and velocity from the acceleration of each respective body. Moreover, a parallel Multigrid algorithm, with a truncated approach in the levels computed in parallel, is proposed for solving large linear systems. Furthermore, parallel results are provided indicating the efficiency of the proposed Multigrid N-Body scheme. Theoretical estimates for the complexity of the proposed simulation schemes are provided.
For a given algorithm, the energy consumed in executing the algorithm has a nonlinear relationship with performance. In case of parallel algorithms, energy use and performance are functions of the structure of the alg...
详细信息
For a given algorithm, the energy consumed in executing the algorithm has a nonlinear relationship with performance. In case of parallel algorithms, energy use and performance are functions of the structure of the algorithm. We define the asymptotic energy complexity of algorithms which models the minimum energy required to execute a parallel algorithm for a given execution time as a function of input size. Our methodology provides us with a way of comparing the orders of (minimal) energy required for different algorithms and can be used to define energy complexity classes of parallel algorithms.
Community detection has become an important operation in numerous graph based applications. It is used to reveal groups that exist within real world networks without imposing prior size or cardinality constraints on t...
详细信息
Community detection has become an important operation in numerous graph based applications. It is used to reveal groups that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential, the support for parallel computers is rather limited. This is largely because the algorithm is irregular and the underlying heuristics imply a sequential nature. In this paper I present parallelization heuristics for fast community detection using the Louvain method as it is applied on GPUs. The Louvain method is a multi-phase, iterative heuristic for modularity optimization. It was originally developed by Blondel et al. (2008), the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. The parallel heuristics used, were first introduced by Hao Lu et al. (2015). As the Louvain method is inherently sequential, it limits the possibility of scalable usage. Thanks to the proposed parallel heuristics, I observe how this method can behave on GPUs. For evaluation I implemented the heuristics using CUDA on a GeForce GTX 980M GPU and for testing I used organization landscapes from the CERN developed Collaboration Spotting project that involves patents and publications to visualize the connections in technologies among its collaborators. Compared to the parallel Louvain implementation running on 8 threads on the same machine that has the used GPU, the CUDA implementation is able to produce community outputs comparable to the CPU generated results, while providing absolute speedups of up to 12 using the GeForce GTX 980M mobile GPU.
暂无评论