We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms perform a partitioning of the matri...
For many years, computer scientists have explored the computing power of so-called computing clusters to address performance requirements of computationally intensive tasks. Historically, computing clusters have been ...
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Because sequencers make errors, algorithms that can accommodate a small number of differences are of particular interest. Formally, let $D$ be a collection of $n$ sequences of total length $N$, $\phi$ be a length threshold, and $k$ be a mismatch threshold. The goal is to identify and report all $k$-mismatch maximal common substrings of length at least $\phi$ over all pairs of strings in $D$. Heuristics based on seed-and-extend style filtering techniques are often employed in such applications. However, such methods cannot provide any provably efficient run time guarantees. To this end, we present a sequential algorithm with an expected run time of $O(N \log^k N + occ)$, where $occ$ is the output size. We then present a distributed memory parallel algorithm with an expected run time of $O((\frac{N}{p} \log N + occ) \log^k N)$ using $O(\log^{k+1} N)$ expected rounds of global communications, under some realistic assumptions, where $p$ is the number of processors. Finally, we demonstrate the performance and scalability of our algorithms using experiments on large high throughput sequencing data. © 2020 Elsevier Inc. All rights reserved.
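To make the problem statement above concrete, the following is a brute-force sketch of the definition only, not the algorithm from the abstract. It assumes that a $k$-mismatch common substring of two strings is a pair of equal-length substrings at Hamming distance at most $k$, and that "maximal" means the pair cannot be extended by one character on either side without exceeding $k$ mismatches or leaving a string. The function names `k_mismatch_maximal_pairs` and `all_pairs` are hypothetical.

```python
from itertools import combinations


def k_mismatch_maximal_pairs(s, t, k, phi):
    """Return (i, j, length) triples such that s[i:i+length] and
    t[j:j+length] differ in at most k positions, cannot be extended
    on either side, and have length >= phi. Brute force over all starts."""
    hits = []
    for i in range(len(s)):
        for j in range(len(t)):
            length, mismatches = 0, 0
            # Extend to the right while the mismatch budget allows it.
            while i + length < len(s) and j + length < len(t):
                if s[i + length] != t[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            # Left-maximal: at a string boundary, or the preceding pair
            # mismatches and the budget is already fully spent.
            at_left_edge = i == 0 or j == 0
            left_blocked = (
                not at_left_edge and s[i - 1] != t[j - 1] and mismatches == k
            )
            if length >= phi and (at_left_edge or left_blocked):
                hits.append((i, j, length))
    return hits


def all_pairs(D, k, phi):
    """Apply the pairwise search to every unordered pair of strings in D."""
    return {
        (a, b): k_mismatch_maximal_pairs(D[a], D[b], k, phi)
        for a, b in combinations(range(len(D)), 2)
    }
```

This baseline takes time cubic in the string lengths per pair, which is the kind of cost the expected-time bounds in the abstract are designed to avoid.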
This paper presents an $O(\log \log \bar{d})$ round massively parallel algorithm for $1 + \epsilon$ approximation of maximum weighted b-matchings, using near-linear memory per machine. Here $\bar{d}$ denotes the average degree in th...
ISBN: 9781665456753 (print)
There are many questions about the statistical properties of random graphs, particularly those related to cyclic structures. However, theoretical advances have been made in the sparse connection regime. Recent results on the Kahn-Kalai conjecture show that there is a limiting connection probability beyond which a Hamiltonian cycle is very likely to exist. This probability is shown to be $P \sim \log(n)/n$, where $n$ is the number of nodes. We explore the neighborhood of this limit experimentally and report its empirical statistical behavior. These results are useful in configuring various engineering problems based on sparse graphs.
Evaluating how well a whole system or set of subsystems performs is one of the primary objectives of performance testing. We can tell via performance assessment if the architecture implementation meets the design obje...
The paper proposes dynamic parallel algorithms for connectivity and bipartiteness of undirected graphs that require constant time and $O(n^{1/2+\epsilon})$ work on the CRCW PRAM model. The work of these algorithms almost matches ...
ISBN: 9781728192192 (print)
Existing work-efficient parallel algorithms for floating-point prefix sums exhibit either good performance or good numerical accuracy, but not both. Consequently, prefix-sum algorithms cannot easily be used in scientific-computing applications that require both high performance and accuracy. We have designed and implemented two new algorithms, called CAST_BLK and PAIR_BLK, whose accuracy is significantly higher than that of the high-performing prefix-sum algorithm from the Problem Based Benchmark Suite (PBBS), while running with comparable performance on modern multicore machines. Specifically, the root mean squared error of the PBBS code on a large array of uniformly distributed 64-bit floating-point numbers is 8 times higher than that of CAST_BLK and 5.8 times higher than that of PAIR_BLK. These two codes employ the PBBS three-stage strategy for performance, but they are designed to achieve high accuracy, both theoretically and in practice. A vectorization enhancement to these two scalar codes trades off a small amount of accuracy to match or outperform the PBBS code while still maintaining lower error.
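As an illustration of the three-stage strategy mentioned above, here is a minimal serial sketch of a generic blocked inclusive prefix sum. It assumes the three stages are the usual ones (reduce each block, scan the block sums, then scan inside each block starting from its offset). It is not the PBBS, CAST_BLK, or PAIR_BLK code, whose internals the abstract does not describe, and the name `blocked_inclusive_scan` and the `block_size` parameter are hypothetical.

```python
def blocked_inclusive_scan(xs, block_size=4):
    """Inclusive prefix sum computed in three stages over fixed-size blocks.

    Stages 1 and 3 touch each block independently, so in a parallel
    setting they could run one block per worker; here they run serially.
    """
    blocks = [xs[i:i + block_size] for i in range(0, len(xs), block_size)]

    # Stage 1: reduce every block to a single sum.
    block_sums = [sum(block) for block in blocks]

    # Stage 2: exclusive scan of the block sums gives each block's offset.
    offsets = [0.0]
    for s in block_sums[:-1]:
        offsets.append(offsets[-1] + s)

    # Stage 3: scan inside each block, starting from the block's offset.
    out = []
    for block, offset in zip(blocks, offsets):
        running = offset
        for x in block:
            running += x
            out.append(running)
    return out
```

In floating point, the order in which partial sums are combined in each stage determines the rounding error, which is the aspect the abstract says the two new algorithms improve.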
ISBN: 9781450360944 (print)
Given a trace of a distributed computation and a desired predicate, the predicate detection problem is to find a consistent global state that satisfies the given predicate. The predicate detection problem has many applications in the testing and runtime verification of parallel and distributed systems. We show that many problems related to predicate detection are in the parallel complexity class NC, the set of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors. Given a computation on $n$ processes with at most $m$ local states per process, our parallel algorithm to detect a given conjunctive predicate takes $O(\log mn)$ time and $O(m^3 n^3 \log mn)$ work. The sequential algorithm takes $O(mn^2)$ time. For data race detection, we give a parallel algorithm that takes $O(\log mn \log n)$ time, also placing that problem in NC. This is the first work, to the best of our knowledge, that places the parallel complexity of such predicate detection problems in the class NC.
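For readers unfamiliar with the problem, the sketch below shows a simple sequential candidate-elimination approach to conjunctive predicate detection. It is not the parallel NC algorithm from the abstract, and it is not tuned to the $O(mn^2)$ sequential bound. It assumes each local state carries a vector clock, that one state happened before another exactly when its clock is componentwise less than or equal to the other's and not equal, and that a global state is consistent when its local states are pairwise concurrent. The trace layout and the names `happened_before` and `detect_conjunctive` are hypothetical.

```python
def happened_before(u, v):
    """True if vector clock u is strictly less than vector clock v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def detect_conjunctive(trace):
    """trace[i] lists (vector_clock, local_predicate_holds) for process i.

    Returns one state index per process forming a consistent global state
    in which every local predicate holds, or None if no such state exists.
    """
    n = len(trace)

    def next_true(i, start):
        # First state of process i at or after `start` whose predicate holds.
        for idx in range(start, len(trace[i])):
            if trace[i][idx][1]:
                return idx
        return None

    cut = [next_true(i, 0) for i in range(n)]
    while all(c is not None for c in cut):
        advanced = False
        for i in range(n):
            for j in range(n):
                if i != j and happened_before(
                    trace[i][cut[i]][0], trace[j][cut[j]][0]
                ):
                    # Candidate on i precedes the candidate on j (and every
                    # later state of j), so it can never be in the answer.
                    cut[i] = next_true(i, cut[i] + 1)
                    advanced = True
                    break
            if advanced:
                break
        if not advanced:
            return cut
    return None
```

Each elimination discards one local state for good, so the loop ends after at most $mn$ eliminations.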
ISBN: 9781450362955 (print)
We develop and evaluate parallel algorithms for a fundamental problem in numerical computing, namely the evaluation of a polynomial of a matrix. The algorithm consists of many building blocks that can be assembled in several ways. We investigate parallelism in individual building blocks, develop parallel implementations, and assemble them into an overall parallel algorithm. We analyze the effects of both the dimension of the matrix and the degree of the polynomial on both arithmetic complexity and parallelism, and we consequently propose which variants to use in different cases. Our theoretical results indicate that one variant of the algorithm, based on applying the Paterson-Stockmeyer method to the entire matrix, parallelizes very effectively on virtually any matrix dimension and polynomial degree. However, it is not the most efficient from the arithmetic complexity viewpoint. Another algorithm, based on the Davies-Higham block recurrence, is much more efficient from the arithmetic complexity viewpoint, but one of its building blocks is serial. Experimental results on a dual-socket 28-core server show that the first algorithm can effectively use all the cores, but that on high-degree polynomials the second algorithm is often faster, in spite of the sequential phase. This indicates that our parallel algorithms for the other phases are indeed effective.
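The Paterson-Stockmeyer method named above can be sketched as follows. This is a minimal serial version on plain Python lists, under the assumption that the polynomial is given by its coefficient list `coeffs` (lowest degree first) and that the block size is chosen near the square root of the number of coefficients. It says nothing about how the authors parallelize it, and the helper names are hypothetical.

```python
import math


def matmul(X, Y):
    """Dense matrix product of two square list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def horner(coeffs, A):
    """Reference evaluation of sum(coeffs[i] * A**i) by Horner's rule."""
    n = len(A)
    result = [[0] * n for _ in range(n)]
    for c in reversed(coeffs):
        result = matmul(result, A)
        for r in range(n):
            result[r][r] += c
    return result


def paterson_stockmeyer(coeffs, A):
    """Evaluate the same polynomial with about 2*sqrt(d) matrix products."""
    n = len(A)
    d = len(coeffs) - 1
    s = max(1, math.isqrt(d + 1))  # block size; any s >= 1 is valid

    # powers[i] = A**i for i = 0..s
    powers = [[[int(r == c) for c in range(n)] for r in range(n)]]
    for _ in range(s):
        powers.append(matmul(powers[-1], A))

    result = [[0] * n for _ in range(n)]
    for j in reversed(range(d // s + 1)):
        # Block polynomial B_j = sum_i coeffs[j*s + i] * A**i, i < s.
        # Only scalar-matrix work here, and blocks are mutually independent.
        block = [[0] * n for _ in range(n)]
        for i in range(s):
            if j * s + i <= d:
                c = coeffs[j * s + i]
                for r in range(n):
                    for q in range(n):
                        block[r][q] += c * powers[i][r][q]
        # Horner step in the variable A**s.
        result = matmul(result, powers[s])
        for r in range(n):
            for q in range(n):
                result[r][q] += block[r][q]
    return result
```

For example, `paterson_stockmeyer([1, 2, 3, 4, 5, 6], A)` uses block size 2, so the outer Horner loop multiplies by `A**2` three times instead of multiplying by `A` six times as `horner` does.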