A new method to construct task graphs for ℋ-matrix arithmetic is introduced. It uses the information associated with all tasks of the standard recursive ℋ-matrix algorithms, e.g., the block index sets of the matrix blocks involved in the computation. Task refinement, i.e., the replacement of tasks by subcomputations, is then used to descend the ℋ-matrix hierarchy until the matrix blocks containing the actual matrix data are reached. This process is a natural extension of the classical, recursive way in which ℋ-matrix arithmetic is defined, and it thereby simplifies the efficient use of many-core systems. Numerical examples for model problems with different block structures demonstrate the properties of the new approach.
ISBN: 9781450369794 (print)
We present a $(1+\epsilon)$-approximate parallel algorithm for computing shortest paths in undirected graphs, achieving $\mathrm{poly}(\log n)$ depth and $m\,\mathrm{poly}(\log n)$ work for graphs with $n$ nodes and $m$ edges. Although sequential algorithms with (nearly) optimal running time have been known for several decades, near-optimal parallel algorithms have turned out to be a much tougher challenge. For $(1+\epsilon)$-approximation, all prior algorithms with $\mathrm{poly}(\log n)$ depth perform at least $\Omega(m n^{c})$ work for some constant $c > 0$. Improving this long-standing upper bound obtained by Cohen (STOC'94) has been open for 25 years. We develop several new tools of independent interest. One of them is a new notion beyond hopsets, the low hop emulator: a $\mathrm{poly}(\log n)$-approximate emulator graph in which every shortest path has at most $O(\log\log n)$ hops (edges). Direct applications of low hop emulators are parallel algorithms for $\mathrm{poly}(\log n)$-approximate single source shortest path (SSSP), Bourgain's embedding, metric tree embedding, and low diameter decomposition, all with $\mathrm{poly}(\log n)$ depth and $m\,\mathrm{poly}(\log n)$ work. To boost the approximation ratio to $(1+\epsilon)$, we introduce compressible preconditioners and apply them inside Sherman's framework (SODA'17) to solve the more general problem of uncapacitated minimum cost flow (a.k.a. the transshipment problem). Our algorithm computes a $(1+\epsilon)$-approximate uncapacitated minimum cost flow in $\mathrm{poly}(\log n)$ depth using $m\,\mathrm{poly}(\log n)$ work. As a consequence, it also improves the state-of-the-art sequential running time from $m \cdot 2^{O(\sqrt{\log n})}$ to $m\,\mathrm{poly}(\log n)$.
We investigated three parallel algorithms for a meshless geometric multigrid (GMG) method recently proposed for the linear finite element discretization of elliptic partial differential equations. These methods are base...
ISBN: 9781450328210 (print)
The analysis of several algorithms and data structures can be framed as a peeling process on a random hypergraph: vertices with degree less than $k$ are removed until there are no vertices of degree less than $k$ left. The remaining hypergraph is known as the $k$-core. In this paper, we analyze parallel peeling processes, where in each round all vertices of degree less than $k$ are removed. It is known that, below a specific edge density threshold, the $k$-core is empty with high probability. We show that, with high probability, below this threshold only $\frac{1}{\log((k-1)(r-1))}\log\log n + O(1)$ rounds of peeling are needed to obtain the empty $k$-core for $r$-uniform hypergraphs. Interestingly, we show that above this threshold, $\Omega(\log n)$ rounds of peeling are required to find the non-empty $k$-core. Since most algorithms and data structures aim to peel to an empty $k$-core, this asymmetry appears fortunate. We verify the theoretical results both with simulation and with a parallel implementation using graphics processing units (GPUs). Our implementation provides insights into how to structure parallel peeling algorithms for efficiency in practice.
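The round-synchronous peeling process described above can be illustrated with a small sequential simulation. This is a minimal sketch, not the authors' GPU implementation: the function name `parallel_peel`, the edge-list representation, and the convention of counting only rounds in which at least one vertex is removed are all assumptions made for illustration.

```python
from collections import defaultdict

def parallel_peel(num_vertices, edges, k):
    """Simulate round-synchronous peeling (hypothetical helper).

    edges: list of tuples of vertex ids (one tuple per hyperedge).
    Returns (rounds, surviving_edges), where surviving_edges is the k-core.
    """
    alive = [True] * len(edges)
    incident = defaultdict(set)          # vertex -> ids of incident edges
    for e_id, edge in enumerate(edges):
        for v in set(edge):
            incident[v].add(e_id)
    degree = {v: len(incident[v]) for v in range(num_vertices)}
    active = set(range(num_vertices))    # vertices not yet removed
    rounds = 0
    while True:
        # Decide the whole batch from the state at the start of the round.
        to_remove = [v for v in active if degree[v] < k]
        if not to_remove:
            break
        rounds += 1
        for v in to_remove:
            active.discard(v)
            for e_id in incident[v]:
                if alive[e_id]:
                    alive[e_id] = False  # removing a vertex deletes its edges
                    for u in set(edges[e_id]):
                        degree[u] -= 1
    core = [edge for e_id, edge in enumerate(edges) if alive[e_id]]
    return rounds, core

# A path on four vertices (2-uniform) peels to an empty 2-core in two rounds.
print(parallel_peel(4, [(0, 1), (1, 2), (2, 3)], 2))
```

Because the removal set is fixed before any degrees are updated, each round corresponds to one parallel step; a GPU version would process `to_remove` concurrently and would need atomic degree updates.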
ISBN: 9781728161617 (print)
A novel parallel solver based on the adaptive integral method (AIM) is proposed for the electromagnetic analysis of electrical interconnects in layered media. We show that graph partitioning techniques can be used to optimally distribute, across thousands of processes, the computations related to both matrix filling and system solution. The proposed workload distribution strategy is compared to existing techniques through a scalability study on a large realistic interposer model in layered media.
ISBN: 9783642551956 (digital)
ISBN: 9783642551956 (print)
In this article we consider parallel numerical algorithms for solving a 3D mathematical model that describes wave propagation in a rectangular waveguide. The main goal is to formulate and analyze a minimal algorithmic template for solving this problem on the CUDA platform. The template is based on explicit finite difference schemes obtained by approximating systems of differential equations on a staggered grid. The parallelization of the discrete algorithm is based on the domain decomposition method. A theoretical complexity model is derived and the scalability of the parallel algorithm is investigated. Results of numerical simulations are presented.
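To make the idea of an explicit finite difference scheme on a staggered grid concrete, here is a one-dimensional toy version. It is only a sketch under stated assumptions: the abstract's model is 3D and targets CUDA, whereas this example uses a 1D first-order wave system in plain Python, with a hypothetical function name, rigid-wall boundaries, and a Courant number chosen for stability.

```python
def staggered_wave_1d(p0, courant, steps):
    """Leapfrog update for p_t + c v_x = 0, v_t + c p_x = 0 (illustrative).

    p lives at cell centres (indices 0..n-1); v lives at cell faces
    (indices 0..n). courant = c*dt/dx should not exceed 1 for stability.
    """
    n = len(p0)
    p = list(p0)
    v = [0.0] * (n + 1)       # v[0] and v[n] stay zero: rigid walls
    for _ in range(steps):
        # Half step: each interior face uses only its two neighbouring cells.
        for i in range(1, n):
            v[i] -= courant * (p[i] - p[i - 1])
        # Half step: each cell uses only its two neighbouring faces.
        for i in range(n):
            p[i] -= courant * (v[i + 1] - v[i])
    return p, v

# A unit spike in the middle of 50 cells, advanced 200 steps.
initial = [0.0] * 50
initial[25] = 1.0
p, v = staggered_wave_1d(initial, 0.5, 200)
print(sum(p))  # the sum of p is preserved up to rounding with rigid walls
```

Within each half step the iterations over `i` are independent of one another, which is the property that lets a domain decomposition assign contiguous index ranges to different processors or GPU thread blocks, exchanging only boundary values between steps.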
ISBN: 9781450369350 (print)
We present the first parallel fixed-parameter algorithm for subgraph isomorphism in planar graphs, bounded-genus graphs, and, more generally, all minor-closed graphs of locally bounded treewidth. Our randomized low depth algorithm has a near-linear work dependency on the size of the target graph. Existing low depth algorithms do not guarantee that the work remains asymptotically the same for any constant-sized pattern. By using a connection to certain separating cycles, our subgraph isomorphism algorithm can decide the vertex connectivity of a planar graph (with high probability) in asymptotically near-linear work and poly-logarithmic depth. Previously, no sub-quadratic work and poly-logarithmic depth bound was known in planar graphs (in particular for distinguishing between four-connected and five-connected planar graphs).
ISBN: 9781728164793 (digital)
ISBN: 9781728164793 (print)
To address the poor control stability of traditional aircraft control systems, a longitudinal anti-disturbance control system based on a distributed parallel algorithm was designed. The anti-disturbance control system was built on the hardware of the original control system, and the software part of the aircraft longitudinal anti-disturbance control system was designed as well. A longitudinal model of the aircraft was established, and an active disturbance rejection controller (ADRC) was designed according to this model. Distributed parallel algorithms were used to set the ADRC parameters, completing the design of the vertical ADRC system. A comparison experiment with the traditional PD-based aircraft control system shows that the control system based on the distributed parallel algorithm exhibits less overshoot, good stability, and broad application prospects.
We present a cache-efficient parallel algorithm for the sequence alignment with gap penalty problem for shared-memory machines using multiway divide-and-conquer and not-in-place matrix transposition. Our $r$-way divide-and-conquer algorithm, for a fixed natural number $r \ge 2$, performs $\Theta(n^{3})$ work, achieves $\Theta(n^{\log_r(2r-1)})$ span, and incurs $O\left(\frac{n^{3}}{BM} + \frac{n^{2}}{B}\log\sqrt{M}\right)$ serial cache misses for $n > \gamma M$ and $O\left(\frac{n^{2}}{B}\log\frac{n}{\sqrt{M}}\right)$ serial cache misses for $\alpha\sqrt{M} < n \le \gamma M$, where $M$ is the cache size, $B$ is the cache line size, and $\alpha$ and $\gamma$ are constants. Published by Elsevier B.V.
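The cubic work bound comes from the structure of the gap-penalty recurrence itself: each table cell looks back over its whole row and column. The following is a plain iterative sketch of that recurrence, not the cache-efficient divide-and-conquer algorithm of the abstract. The minimisation form, the function name, and the cost functions `sub_cost` and `gap_cost` are assumptions chosen for illustration.

```python
def gap_alignment_cost(x, y, sub_cost, gap_cost):
    """Minimum alignment cost with a general gap penalty (illustrative).

    sub_cost(a, b): cost of aligning symbol a with symbol b.
    gap_cost(length): cost of one contiguous gap of the given length.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    G = [[inf] * (m + 1) for _ in range(n + 1)]
    G[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0 and j > 0:               # align x[i-1] with y[j-1]
                best = G[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1])
            for q in range(j):                # gap covering y[q:j]
                best = min(best, G[i][q] + gap_cost(j - q))
            for p in range(i):                # gap covering x[p:i]
                best = min(best, G[p][j] + gap_cost(i - p))
            G[i][j] = best
    return G[n][m]

mismatch = lambda a, b: 0 if a == b else 1
affine = lambda length: 2 + length            # hypothetical gap penalty
print(gap_alignment_cost("abc", "ac", mismatch, affine))  # → 3
```

With both sequences of length $n$, the two inner scans make each of the $\Theta(n^2)$ cells cost $\Theta(n)$, which gives the $\Theta(n^3)$ total work quoted above.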
ISBN: 9781450380751 (print)
This paper describes a new technique for parallelizing protein clustering, an important bioinformatics computation for the analysis of protein sequences. Protein clustering identifies groups of proteins that are similar because they share long sequences of similar amino acids. Given a collection of protein sequences, clustering can significantly reduce the computational effort required to identify all similar sequences by avoiding many negative comparisons. The challenge, however, is to build a clustering that misses as few similar sequences (or elements, more generally) as possible. In this paper, we introduce precise clustering, a property that requires each pair of similar elements to appear together in at least one cluster. We show that transitivity in the data can be leveraged to merge clusters while maintaining a precise clustering, providing a basis for independently forming clusters. This allows us to reformulate clustering as a bottom-up merge of independent clusters in a new algorithm called ClusterMerge. ClusterMerge exposes parallelism, enabling fast and scalable implementations. We apply ClusterMerge to find similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of the similar pairs found by a full $O(n^{2})$ comparison, with only half as many comparisons. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604 times on 768 cores (1400 times faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.
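The precise clustering property defined above can be checked directly from its definition. The sketch below is not ClusterMerge; it is a brute-force checker with a hypothetical name and a toy similarity predicate, shown only to pin down what "each pair of similar elements appears together in at least one cluster" means when clusters may overlap.

```python
from itertools import combinations

def is_precise_clustering(elements, clusters, similar):
    """Return True if every similar pair shares at least one cluster.

    elements: list of hashable items.
    clusters: list of iterables of items (clusters may overlap).
    similar(a, b): symmetric predicate, assumed supplied by the caller.
    """
    cluster_sets = [set(c) for c in clusters]
    for a, b in combinations(elements, 2):
        if similar(a, b) and not any(a in c and b in c for c in cluster_sets):
            return False                 # a similar pair was missed
    return True

# Toy data: integers are "similar" when they differ by at most 1.
close = lambda a, b: abs(a - b) <= 1
items = [1, 2, 3, 10]
print(is_precise_clustering(items, [[1, 2], [2, 3], [10]], close))  # → True
print(is_precise_clustering(items, [[1, 2], [3], [10]], close))     # → False
```

The checker itself performs the full quadratic comparison, so it is useful only for validating a clustering on small inputs, not as a replacement for the merge-based construction.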