ISBN (Print): 9781467388160
We present an efficient parallel algorithm for the following problem: Given an input collection D of n sequences of total length N, a length threshold f and a mismatch threshold κ, report all κ-mismatch maximal common substrings of length at least f over all pairs of strings in D. This problem is motivated by clustering and assembly applications in computational biology, where D is a collection of millions of short DNA sequences. Sequencing errors and the massive size of these datasets necessitate efficient parallel approximate sequence matching algorithms. We present a novel distributed memory parallel algorithm that solves this approximate sequence matching problem in O((N/p · log N + occ) · log^κ N) expected time and takes only O(log^(κ+1) N) expected rounds of global communication, under some realistic assumptions, where p is the number of processors and occ is the output size. To our knowledge, this is the first provably sub-quadratic time algorithm for this problem. We demonstrate the performance and scalability of our algorithm using large high-throughput sequencing datasets.
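To make the matching condition in this problem statement concrete, here is a brute-force sketch in Python. The function names are hypothetical, and the O(|s|·|t|·f) enumeration below bears no relation to the paper's sub-quadratic distributed algorithm; it also reports every qualifying window pair rather than only maximal ones.

```python
def hamming(u, v):
    """Number of mismatching positions between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def k_mismatch_pairs(s, t, f, kappa):
    """Brute-force enumeration of all (i, j) such that s[i:i+f] and
    t[j:j+f] differ in at most kappa positions. Illustrates only the
    kappa-mismatch condition, not maximality or the paper's algorithm."""
    out = []
    for i in range(len(s) - f + 1):
        for j in range(len(t) - f + 1):
            if hamming(s[i:i + f], t[j:j + f]) <= kappa:
                out.append((i, j))
    return out
```

For example, with f = 4 and kappa = 1, the windows `"ACGT"` and `"ACCT"` match because they differ in a single position.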
ISBN (Print): 9781509042982
We live in an era of big data, and the analysis of these data is becoming a bottleneck in many domains, including biology and the internet. To make these analyses feasible in practice, we need efficient data reduction algorithms. The Singular Value Decomposition (SVD) is a data reduction technique that has been used in many different applications; for example, SVDs have been used extensively in text analysis. The best known sequential algorithms for computing the SVD take cubic time, which may not be acceptable in practice. As a result, many parallel algorithms have been proposed in the literature. There are two main families of SVD algorithms, namely QR decomposition and Jacobi iterations. Researchers have found that even though QR is sequentially faster than Jacobi iterations, QR is difficult to parallelize. As a result, most of the parallel algorithms in the literature are based on Jacobi iterations. For example, the Jacobi Relaxation Scheme (JRS) variant of the classical Jacobi algorithm has been shown to be very effective in parallel. In this paper we propose a novel variant of the classical Jacobi algorithm that is more efficient than the JRS algorithm; our experimental results confirm this assertion. The key idea behind our algorithm is to select the pivot elements for each sweep appropriately. We also show how to implement our algorithm efficiently on parallel models such as the PRAM and the mesh.
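For reference, a classical cyclic one-sided (Hestenes) Jacobi SVD sweep looks like the sketch below. The pivot pairs are simply visited in order; the contribution of JRS and of the paper's variant lies precisely in how pivot pairs are selected and processed in parallel, which this sequential sketch does not attempt to reproduce.

```python
import numpy as np

def jacobi_svd_values(A, sweeps=30, tol=1e-12):
    """One-sided Jacobi SVD sketch: rotate pairs of columns until all
    pairs are mutually orthogonal; the column norms are then the
    singular values. Cyclic pivot order, sequential execution."""
    A = np.array(A, dtype=float)
    n = A.shape[1]
    for _ in range(sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                off = max(off, abs(gamma))
                if abs(gamma) <= tol:
                    continue
                # Givens rotation zeroing the inner product of columns p, q
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                A[:, [p, q]] = A[:, [p, q]] @ np.array([[c, s], [-s, c]])
        if off <= tol:
            break  # all column pairs orthogonal: converged
    return np.sort(np.linalg.norm(A, axis=0))[::-1]
```

The singular values returned agree with a standard library SVD up to numerical tolerance.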
ISBN (Print): 9781509054121
Parallel graph algorithms continue to attract a lot of research attention given their applications to several fields of science and engineering. Efficient design and implementation of graph algorithms on modern manycore accelerators must, however, contend with a host of challenges, including irregularity and the difficulty of reaching full memory-system throughput. Focusing on real-world graphs, researchers have lately been addressing these challenges with decomposition and preprocessing techniques guided by the structural properties of such graphs. In this direction, we present a new GPU algorithm for obtaining an ear decomposition of a graph. Our implementation of the proposed algorithm on an NVidia Tesla K40c improves the state-of-the-art by a factor of 2.3x on average on a collection of real-world and synthetic graphs. The improved performance of our algorithm is due to our proposed characterization that identifies edges of the graph as redundant for the purposes of an ear decomposition. We then study an application of the ear decomposition of a graph to computing the betweenness-centrality values of its nodes. We use an ear decomposition of the input graph to systematically remove nodes of degree two; the actual computation of betweenness-centrality is done on the remaining nodes, and the results are extended to the nodes removed in the previous step. We show that this approach improves the state-of-the-art for computing betweenness-centrality on an NVidia K40c GPU by a factor of 1.9x on average over a collection of real-world graphs.
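The degree-two-removal preprocessing mentioned above can be sketched as a chain contraction (names hypothetical). The paper drives this step with an ear decomposition on the GPU and afterwards corrects the centrality values of the removed nodes; neither of those parts is attempted here.

```python
from collections import defaultdict

def contract_degree_two(edges):
    """Contract maximal chains of degree-2 vertices into single weighted
    edges between the retained endpoints. Cycles consisting entirely of
    degree-2 vertices are left untouched by this sketch."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep = {v for v in adj if len(adj[v]) != 2}   # endpoints survive
    weighted, seen = [], set()
    for start in keep:
        for first in adj[start]:
            if (start, first) in seen:
                continue
            prev, cur, w = start, first, 1
            while cur not in keep:                # walk along the chain
                prev, cur, w = cur, (adj[cur] - {prev}).pop(), w + 1
            seen.add((start, first))
            seen.add((cur, prev))
            weighted.append((start, cur, w))
    return weighted
```

On a path 0-1-2-3-4, for instance, the three interior degree-2 vertices collapse into a single weight-4 edge between the endpoints.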
ISBN (Print): 9781509021413
We explore the benefits of using online-autotuning to find an optimal configuration for the parallel construction of Surface Area Heuristic (SAH) kD-trees. Using a quickly converging autotuning mechanism, we achieve a significant performance improvement of up to 1.96x. The SAH kD-tree is a spatial data structure and a fundamental tool in the domain of computer graphics and simulations. The parallel construction of these trees is influenced by several parameters, controlling various aspects of the algorithm. However, the parameter configurations advocated in the literature are hardly ever portable. To boost portability, we apply online-autotuning to four state-of-the-art variants of parallel kD-tree construction. We show that speedups over the variants' standard configurations are possible with low programmer effort. We further demonstrate the performance portability of our approach by evaluating performance on varying multicore platforms and both static and dynamic geometries.
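The SAH cost that such a construction minimizes at each candidate split can be written down in a few lines. The constant values below are illustrative defaults, not parameters from the paper (they are among the knobs an autotuner would be expected to explore).

```python
def sah_cost(sa_left, sa_right, sa_parent, n_left, n_right,
             c_trav=1.0, c_isect=1.5):
    """Surface Area Heuristic cost of a candidate kD-tree split.
    The probability of a ray entering a child is approximated by the
    ratio of the child's surface area to the parent's; c_trav and
    c_isect are the (assumed) traversal and intersection costs."""
    return c_trav + c_isect * (sa_left / sa_parent * n_left
                               + sa_right / sa_parent * n_right)
```

A split that isolates many primitives in a large-area child scores worse than one that balances area against primitive counts, which is what steers the tree build.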
ISBN (Print): 9781509002504
Modularization is a significant way to ease the challenge of evaluating large-scale fault trees, a problem known to be NP-hard, especially for BDD-based algorithms. In our previous work, we proposed an effective parallel modularization algorithm for large-scale coherent fault trees, which contain only AND and OR gates. In real engineering scenarios, however, fault trees usually contain many complex static logic gates (e.g., K-out-of-N gates and NOT gates) and dynamic logic gates (e.g., sequence gates and functional dependency gates). To address these cases, we improve the parallel modularization algorithm and show that it is applicable to all such fault trees. A K-out-of-N gate is usually transformed into a combination of AND and OR gates, but this significantly increases the scale of the tree and consumes more computational resources. Instead, we treat the K-out-of-N gate as a single module that can be encoded by a BDD directly, which keeps the scale of the entire fault tree under control. We also take non-coherent and dynamic fault trees into account and find that their complicated logic gates can likewise be regarded as modules. We therefore conclude that our algorithm can be applied to all sorts of fault trees. A further improvement is that we start the algorithm only from the bottom events with more than one outgoing edge, so we need not traverse every node of a large fault tree, which saves computational resources. In the experimental section, we compare the computation time of the improved algorithm with that of our previous work and illustrate its superiority through several graphs covering various aspects.
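The point about K-out-of-N gates can be made concrete: evaluating the gate directly (or encoding it as one BDD module, as the abstract suggests) is trivial, whereas the naive AND/OR rewriting produces C(N, K) AND terms under a single OR gate. A minimal sketch, with hypothetical function names:

```python
from math import comb

def k_out_of_n(k, inputs):
    """A K-out-of-N voting gate is true iff at least k of its inputs
    are true. Direct evaluation, no gate expansion needed."""
    return sum(inputs) >= k

def expansion_terms(k, n):
    """Number of AND terms in the naive AND/OR rewriting of the gate,
    which is what blows up the tree size the abstract warns about."""
    return comb(n, k)
```

For a 3-out-of-10 gate the rewriting already needs 120 AND terms, so treating the gate as a single module is the cheaper choice.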
ISBN (Print): 9781509021413
Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas, such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only the cases where both operands are dense or both are sparse. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which, while suboptimal in the dense-dense and sparse-sparse cases, outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per-iteration costs in an iterative machine learning context. Experiments demonstrate speedups of up to 100x over a baseline 3D SUMMA implementation and show parallel scaling over 10 thousand cores.
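The local kernel each processor runs in such algorithms is an ordinary sparse-times-dense product; the paper's contribution is how the operands are partitioned (1D/1.5D/2D/3D) and communicated, not the kernel itself. A pure-Python sketch using the standard CSR representation:

```python
def csr_dense_matmul(indptr, indices, data, B):
    """Multiply a sparse matrix stored in CSR form (indptr, indices,
    data) by a dense matrix B given as a list of rows. Only the nonzero
    entries of the sparse operand are touched."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):   # nonzeros of row i
            j, v = indices[k], data[k]
            for c in range(n_cols):
                C[i][c] += v * B[j][c]
    return C
```

For example, the diagonal matrix diag(1, 2) in CSR form times a 2x2 dense matrix scales its rows by 1 and 2 respectively.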
In this paper, we present our Concurrent Systems class, where parallel programming and parallel and distributed computing (PDC) concepts have been taught for more than 20 years. Despite several rounds of changes in hardware, the class maintains its goals of allowing students to learn parallel computer organizations, study parallel algorithms, and write code able to run on parallel and distributed platforms. We discuss the benefits of such a class and describe the key elements of developing it and of securing funding to replace outdated hardware. We also share our activities aimed at attracting more students to PDC and related topics.
In this paper a parallel algorithm for branch and bound applications is proposed. The algorithm is general-purpose and can be used to parallelize, with little effort, any sequential branch-and-bound-style algorithm written in a certain format. It is a distributed dynamic scheduling algorithm in which each node schedules the load of its own cores; it can be used with different programming platforms and architectures, and it is a hybrid algorithm (OpenMP, MPI). To demonstrate its validity and efficiency, the proposed algorithm has been implemented and tested on numerous examples, which are described in detail in this paper. A speed-up of about 9 has been achieved for the tested examples on a cluster of three nodes with four cores each.
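The sequential "certain format" such frameworks expect can be sketched as a generic loop over user-supplied callbacks (the callback names here are hypothetical, not from the paper). The paper distributes exactly this kind of loop across MPI nodes and OpenMP threads; the sketch below is sequential.

```python
def branch_and_bound(root, branch, bound, value):
    """Generic branch-and-bound skeleton: `branch` expands a partial
    solution, `bound` returns an optimistic upper bound on its subtree,
    and `value` returns the objective of a complete solution or None
    for partial ones. Maximizes the objective."""
    best = float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        if bound(node) <= best:
            continue  # prune: this subtree cannot beat the incumbent
        v = value(node)
        if v is not None:
            best = max(best, v)
        else:
            stack.extend(branch(node))
    return best

# Toy 0/1 knapsack instance: maximize profit with total weight <= 5.
items = [(2, 3), (3, 4), (4, 5)]          # (weight, profit)
CAP = 5

def branch(n):                             # skip or take item i
    i, w, p = n
    kids = [(i + 1, w, p)]
    if w + items[i][0] <= CAP:
        kids.append((i + 1, w + items[i][0], p + items[i][1]))
    return kids

def bound(n):                              # relaxation: ignore capacity
    i, w, p = n
    return p + sum(pr for _, pr in items[i:])

def value(n):
    i, w, p = n
    return p if i == len(items) else None
```

On this instance the optimum takes the first two items for a profit of 7.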
ISBN (Print): 9781509057085
Numerical reproducibility failures arise in parallel computation because of the non-associativity of floating-point summation. Optimizations on massively parallel systems dynamically modify the order of floating-point operations; hence, numerical results may change from one run to another. We propose to ensure reproducibility by extending, as far as possible, the IEEE-754 correct-rounding property to larger sequences of operations. Our RARE-BLAS (Reproducible, Accurately Rounded and Efficient BLAS) benefits from recent accurate and efficient summation algorithms. Solutions for level 1 (asum, dot and nrm2) and level 2 (gemv) routines are provided. We compare their performance to the Intel MKL library and to other existing reproducible algorithms. For both shared and distributed memory parallel systems, we exhibit an extra cost of 2× in the worst-case scenario, which is satisfactory for a wide range of applications. For the Intel Xeon Phi accelerator a larger extra cost (4× to 6×) is observed, which is still helpful, at least for debugging and validation.
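The non-associativity the abstract refers to is easy to exhibit, as is the flavor of compensated summation that accurate algorithms build on. The sketch below shows Neumaier's compensated sum, one classical error-free-transformation technique; RARE-BLAS combines such building blocks differently to obtain correctly rounded, fully reproducible results in parallel, which this sequential sketch does not claim to do.

```python
def neumaier_sum(xs):
    """Compensated (Neumaier) summation: the rounding error of every
    addition is accumulated in a separate correction term c."""
    s, c = 0.0, 0.0
    for x in xs:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x   # low-order bits of x were lost
        else:
            c += (x - t) + s   # low-order bits of s were lost
        s = t
    return s + c

# Plain floating-point summation is order-dependent:
data = [1e16, -1e16, 1.0]
print(sum(data), sum(reversed(data)))   # prints: 1.0 0.0
```

Compensated summation returns 1.0 for both orders of this input, illustrating how tracking rounding errors restores an order-independent result here.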
ISBN (Print): 9781509054121
Using (a, b)-trees as an example, we show how to perform a parallel split with logarithmic latency, and parallel join, bulk updates, intersection, union (or merge), and (symmetric) set difference with logarithmic latency and information-theoretically optimal work. We present both asymptotically optimal solutions and simplified versions that perform well in practice: they are several times faster than previous implementations.
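To give a feel for the information-theoretic bound mentioned here, a union of a small sorted set (size m) into a large one (size n) can avoid touching every element of the large set by searching for each small element's position. The sketch below uses plain binary search, giving O(m log n) comparisons; the optimal O(m log(n/m)) work of the paper additionally needs exponential (galloping) search and tree joins/splits, which are omitted. Names are hypothetical; inputs are assumed sorted and duplicate-free.

```python
from bisect import bisect_left

def union_sorted(small, big):
    """Union of two sorted, duplicate-free lists, locating each element
    of the smaller list in the larger one by binary search."""
    out, lo = [], 0
    for x in small:
        pos = bisect_left(big, x, lo)   # search only past position lo
        out.extend(big[lo:pos])         # copy the untouched run of big
        if pos < len(big) and big[pos] == x:
            pos += 1                    # x occurs in both: keep one copy
        out.append(x)
        lo = pos
    out.extend(big[lo:])
    return out
```

When m is much smaller than n, most of the large list is copied in runs rather than compared element by element, which is the intuition behind the optimal-work bound.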