Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed...
详细信息
ISBN:
(纸本)9780897919890
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S* that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS-3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and can achieve up to 11.04GFLOPS on 128 Cray T3E 450 MHz nodes, which is the highest performance reported in the literature.
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Block-wise access to data is a central the...
详细信息
ISBN:
(纸本)9780897919890
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Block-wise access to data is a central theme in the design of efficient EM algorithms. A similar requirement arises in the transmission of data between processors in certain parallel computation models such as BSP* and CGM, for which block-wise communication is a crucial issue. We consider multisearch problems, where a large number of queries are to be simultaneously processed and satisfied by navigating through large data structures on parallel computers. The examples used originate as BSP* algorithms and we adapt them to the EM situation where the queries and data structure are considered to be much larger than the size of the available internal memory. This paper presents techniques to achieve blocking for I/O as well as for communication in multisearch on the BSP* and EM-BSP* models. In the area of EM algorithms new algorithms for multisearch in balanced trees are described. For search trees up to size O(n log n) where n is the number of queries, we obtain work-optimal, parallel, EM multisearch algorithms whose I/O and communication time are the same, asymptotically, as the computation time. These algorithms are obtained via the simulation technique of [15]. For larger trees we describe a parallel, EM algorithm which is simultaneously c-optimal and I/O-optimal. We give a lower bound to the number of I/O operations required for filtering n queries through a binary or multiway search tree of size m when m ≥ n2+Ε, constant Ε > 0. In the domain of parallel, non-EM algorithms we describe a new 1-optimal algorithm for BSP* multisearch in optimal trees which communicates in a block-wise fashion.
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first non-trivial lower bounds on the communication volume required to sol...
详细信息
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first non-trivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of the input graph, and m the number of edges of the input graph. We show that in the worst case a total of Ω(k·min(m, pn)) bits need to be transmitted in order to solve the MST problem, where k is the number of bits required to represent a single edge weight. This implies that if each message contains k bits, any BSP algorithm for finding an MST requires communication time Ω(g · min(m/p, n)), where g is the gap parameter of the BSP model. In addition, we present two algorithms whose running times match the lower bounds in different situations. Both algorithms perform linear work and use a number of super-steps independent of the input size. The first algorithm is simple but can employ at most m/n processors efficiently. Hence, it should be applied in situations where the input graph is relatively dense. The second algorithm is a randomized algorithm that performs linear work with high probability, provided that m ≥ n · log p. This is the first linear work BSP algorithm for finding an MST in sparse graphs.
We consider the following scheduling problem: Our goal is to execute a given amount of arbitrarily decomposable work on a distributed machine as quickly as possible. The work is maintained by a central scheduler that ...
详细信息
ISBN:
(纸本)9780897919890
We consider the following scheduling problem: Our goal is to execute a given amount of arbitrarily decomposable work on a distributed machine as quickly as possible. The work is maintained by a central scheduler that can assign chunks of work of an arbitrary size to idle processors. The difficulty is that the processing time required for a chunk is not exactly predictable - usually the less, the larger the chunk - and that processors suffer a delay for each assignment. Our objective is to minimize the total wasted time of the schedule, that is, the sum of all delays plus the idle times of processors waiting for the last processor to finish. We introduce a new deterministic model for this setting, based on estimated ranges [α(w), β(w)] for processing times of chunks of size w. Depending on α, β, and a measure for the overall deviation from these estimates, we can prove matching upper and lower bounds on the wasted time, the former being achieved by our new balancing strategy. This is in sharp contrast with previous work that, even under the strong assumption of independent, approximately normally distributed chunk processing times, proposed only heuristic scheduling schemes supported merely by empirical evidence. Our model naturally subsumes this stochastic setting, and our generic analysis is valid for most of the existing schemes too, proving them to be non-optimal.
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes th...
详细信息
ISBN:
(纸本)9780897919890
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple 'determinacy' races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k n locks simultaneously, ALL-SETS runs in. O(nkT α(V, V)) time and O(nkV) space, where α is Tarjan's functional inverse of Ackermann's function. Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the 'umbrella' locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in O(kT α(V, V)) time using O(kV) space. We also prove that any 'abelian' Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is data-race free. Thus, the Nondeterminator-2's two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.
In this paper we consider the problem of deadlock-free routing in arbitrary parallel and distributed computers. We focus on asynchronous routing algorithms which continuously receive new packets to route and which do ...
详细信息
ISBN:
(纸本)9780897919890
In this paper we consider the problem of deadlock-free routing in arbitrary parallel and distributed computers. We focus on asynchronous routing algorithms which continuously receive new packets to route and which do not discard packets that encounter congestion. Specifically, we examine what we call the deadlock-free routing (DFR) problem. The input to the DFR problem consists of an arbitrary network and an arbitrary set of paths in the network. The output consists of a routing algorithm, which is a list of the buffers used along each of the paths. The routing algorithm is required to be free from deadlock and the goal is to minimize the number of buffers required in any one node. We study the DFR problem by converting it into an equivalent problem which we call the flattest common supersequence (FCS) problem. The input to the FCS problem consists of a set of sequences and the output consists of a single sequence that contains all of the input sequences as (possibly non-contiguous) subsequences. The goal of the FCS problem is to minimize the maximum frequency of any symbol in the output sequence. We present three main results. First, we prove that the decision version of the FCS problem is NP-complete, and has no polynomial-time approximation scheme unless P = NP. Next, we propose and experimentally evaluate a range of heuristics for FCS. Our experimental results show that one of these heuristics performs very well over a wide range of inputs. Lastly, we prove that this heuristic is in fact optimal for certain restricted classes of inputs.
We give an alphabet-independent deterministic parallel algorithm for finding all occurrences of a pattern array of size m(h) x m(w) in a text array of size n(h) x n(w) in the concurrent-read-concurrent-write-parallel-...
详细信息
We give an alphabet-independent deterministic parallel algorithm for finding all occurrences of a pattern array of size m(h) x m(w) in a text array of size n(h) x n(w) in the concurrent-read-concurrent-write-parallel-random-access-machine (CRCW-PRAM) model. Our algorithm runs in O(1) time performing optimal, that is, O(n(h) x n(w)) work, following preprocessing of the pattern. This improves the previous best bound of O(loglogm) time with optimal work [A. Amir, G. Benson, and M. Farach, proceedings 5th annualacmsymposium on parallelalgorithms and architectures, acm, New York, 1993, pp. 79-85], following preprocessing of the pattern, where m = max {m(h),m(w)}. The preprocessing required by our algorithm (and that due to Amir, Benson, and Farach) can be accomplished in O(loglogm) time and O(m(h) x m(w)) work [M. Crochemore et al., manuscript, 1993], [R. Cole et al., manuscript, 1993].
暂无评论