Multithreading aims to tolerate latency by overlapping communication with computation. This report explicates the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. Th...
详细信息
ISBN:
(纸本)9780897918909
Multithreading aims to tolerate latency by overlapping communication with computation. This report explicates the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware supports for fine-grain multithreading, including a by-passing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-sized communication packets. Bitonic sorting and Fast Fourier Transform are selected for experiments. Parameters that characterize the performance of multithreading are investigated, including the number of threads, the number of thread switches, the run length, and the number of remote reads. Experimental results indicate that the best communication performance occurs when the number of threads is two to four. FFT yielded over 95% overlapping due to a large amount of computation and communication parallelism across threads. Even in the absence of thread computation parallelism, multithreading helps overlap over 35% of the communication time for bitonic sorting.
We show that the basic semidefinite programming relaxation value of any constraint satisfaction problem can be computed in NC;that is, in parallel polylogarithmic time and polynomial work. As a complexity-theoretic co...
详细信息
ISBN:
(纸本)9781450345934
We show that the basic semidefinite programming relaxation value of any constraint satisfaction problem can be computed in NC;that is, in parallel polylogarithmic time and polynomial work. As a complexity-theoretic consequence we get that MIP1[k, c, s] subset of PSPACE provided s/c <= (.62 -o(1))k/2(k), resolving a question of Austrin, Hastad, and Pass. Here MIP1[k, c, s] is the class of languages decidable with completeness c and soundness s by an interactive proof system with k provers, each constrained to communicate just 1 bit.
Motivated by the significantly higher cost of writing than reading in emerging memory technologies, we consider parallel algorithm design under such asymmetric read-write costs, with the goal of reducing the number of...
详细信息
Previous implementations of out-of-core columnsort limit the problem size to N ≤ √(M/P)3/2, where N is the number of records to sort, P is the number of processors, and M is the total number of records that the enti...
详细信息
ISBN:
(纸本)9781581136616
Previous implementations of out-of-core columnsort limit the problem size to N ≤ √(M/P)3/2, where N is the number of records to sort, P is the number of processors, and M is the total number of records that the entire system can hold in its memory. We implemented two variations to out-of-core columnsort that relax this restriction. Sub-block columnsort is based on an algorithmic modification of the underlying columnsort algorithm, and it improves the problem-size bound to N ≤ (M/P)5/3/42/3 but at the cost of additional disk I/O. M-columnsort changes the notion of the column size in columnsort, improving the maximum problem size to N ≤ √/M3/2 but at the cost of additional computation and communication. Experimental results on a Beowulf cluster show that both subblock columnsort and M-columnsort run well but that M-columnsort is faster. A further advantage of M-columnsort is that it handles a wider range of problem sizes than subblock columnsort.
The overall efficiency of parallelalgorithms is most decisively effected by the strategy applied for the mapping of workload. Strategies for balancing dynamically generated workload on a processor network which are a...
详细信息
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed...
详细信息
ISBN:
(纸本)9780897919890
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S* that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS-3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and can achieve up to 11.04GFLOPS on 128 Cray T3E 450 MHz nodes, which is the highest performance reported in the literature.
We present the first work-optimal polylogarithmic-depth parallel algorithm for the minimum cut problem on non-sparse graphs. For ≥ n^1+ϵ for any constant ϵ>0, our algorithm requires O(m log n) work and O(log^3 n) ...
详细信息
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distribut...
详细信息
ISBN:
(纸本)089791483X
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this paper we present B-Flashsort, a new batched-routing variant of Flashsort designed to sort N > P values using P processors connected in a d-dimensional mesh and using constant space in addition to the input and output. The key advantage of the Flashsort approach over Samplesort is a decrease in memory requirements, by avoiding the broadcast of the splitter set to all processors. The practical advantage of B-Flashsort over Flashsort is that it replaces pipelined splitter-directed routing with a set of synchronous local communications and bounds recursion, while still being demonstrably efficient. The performance of B-Flashsort and Samplesort is compared using a parameterized analytic model in the style of [BLM+91] to show that on a d-dimensional toroidal mesh B-Flashsort improves on Samplesort when (N/P) 1log P + c2dP1/d + c3), for machine-dependent parameters c1, c2, and c3. Empirical confirmation of the analytical model is obtained through implementations on a MasPar MP-1 of Samplesort and two B-Flashsort variants.
We study the problem of storing an ordered E;et On an asynchronous shared memory parallel computer. We examine the case where we want to perform successor (least upper bound) queries efficiently on the set members tha...
详细信息
ISBN:
(纸本)9780897918091
We study the problem of storing an ordered E;et On an asynchronous shared memory parallel computer. We examine the case where we want to perform successor (least upper bound) queries efficiently on the set members that are stored. We also examine the case where processors insert and delete members of the set. Due to asynchrony, we require processors to perform queries and to maintain the structure independently. Although several such structures have been proposed, the analysis of these structures has been very limited. We here ut;e the recently proposed QRQW PRAM model to provide upper and lower bounds on the performance of such data structures. In the asynchronous QRQW PRAM, the problem of processors concurrently and independently searching a shared data structure is very similar to the problem of routing packets through a network. Using this as a guide, we introduce the Search-Butterfly, a search structure that combines the efficient packet routing properties of the butterfly graph with the efficient search structure properties of the B-Tree. We analyze the behavior of the Search-Butterfly when the following operations are performed: arbitrary searches, random searches,and random searches, insertions, and deletions. We also provide lower bounds that show that the results are within a factor of O (log n) of optimal where n is the number of keys;in the structure. When the searches are random, the results are within a constant factor of optimal. Many of the proofs are derived from closely related results for packet routing. Others are of independent interest, most notably a method of adding queues to any network belonging to a large class of queuing networks with non-Markovian routing in a manner that allows us to bound the delay experienced by packets in the augmented network.
This paper studies the problem of finding a (1+Ε)-approximate solution to positive semidefinite programs. These are semidefinite programs in which all matrices in the constraints and objective are positive semidefini...
详细信息
ISBN:
(纸本)9781450312134
This paper studies the problem of finding a (1+Ε)-approximate solution to positive semidefinite programs. These are semidefinite programs in which all matrices in the constraints and objective are positive semidefinite and all scalars are nonnegative. At FOCS'11, Jain and Yao gave an NC algorithm that requires O(1/Ε13 log13 mlog n) iterations on input n constraint matrices of dimension m-by-m, where each iteration performs at least Δ(mω) work since it involves computing the spectral decomposition. We present a simpler NC parallel algorithm that on input with n constraint matrices, requires O(1/Ε4 log4 n log(1/Ε )) iterations, each of which involves only simple matrix operations and computing the trace of the product of a matrix exponential and a positive semidefinite matrix. Further, given a positive SDP in a factorized form, the total work of our algorithm is nearly-linear in the number of non-zero entries in the factorization. Our algorithm can be viewed as a generalization of Young's algorithm and analysis techniques for positive linear programs (Young, FOCS'01 ) to the semidefinite programming setting. Copyright 2012 acm.
暂无评论