Diffracting trees are an effective and highly scalable distributed-parallel technique for shared counting and load balancing. This paper presents the first steady-state combinatorial model and analysis for diffracting trees, and uses it to answer several critical algorithmic design questions. Our model is simple and sufficiently high level to overcome many implementation-specific details, and yet, as we will show, it is rich enough to accurately predict empirically observed behaviors. As a result of our analysis we were able to identify starvation problems in the original diffracting tree algorithm and modify it to create a more stable version. We are also able to identify the range in which the diffracting tree performs most efficiently, and the ranges in which its performance degrades. We believe our model and modeling approach open the way to steady-state analysis of other distributed-parallel structures such as counting networks and elimination trees.
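The counting structure underlying a diffracting tree can be sketched sequentially as follows (class and method names are ours). Real diffracting trees add "prism" arrays at each balancer so pairs of concurrent tokens can diffract past the toggle bit, the hot spot this sketch still touches:

```python
# Minimal sequential sketch of a tree of toggle balancers used for
# shared counting (a sketch under our own naming, not the paper's code).

class ToggleNode:
    def __init__(self):
        self.toggle = 0  # next direction: 0 = go left, 1 = go right

class CountingTree:
    def __init__(self, depth):
        self.depth = depth
        self.width = 1 << depth                       # number of leaves
        self.nodes = [ToggleNode() for _ in range((1 << depth) - 1)]
        self.leaf_next = list(range(self.width))      # next value per leaf

    def fetch_and_increment(self):
        idx, label = 0, 0
        for level in range(self.depth):
            node = self.nodes[idx]
            d = node.toggle
            node.toggle ^= 1                 # flip toggle for the next token
            idx = 2 * idx + 1 + d            # descend the implicit binary tree
            label |= d << level              # earlier decisions = low-order bits
        value = self.leaf_next[label]
        self.leaf_next[label] += self.width  # each leaf hands out one residue class
        return value

tree = CountingTree(depth=3)
print([tree.fetch_and_increment() for _ in range(10)])  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Run sequentially, tokens receive consecutive values; under concurrency, the balancing (step) property still guarantees a gap-free set of values, which is what makes the tree usable as a shared counter.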
ISBN:
(Print) 9780897918091
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the forecasting technique of [Knu73], our algorithm is able to read in, at any time, the `right' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the `right' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. Our analysis technique involves a novel reduction to various maximum occupancy problems. We prove that the expected I/O performance of SRM is efficient under varying sizes of memory and that it compares favorably in practice to disk-striped mergesort (DSM). Our studies indicate that SRM outperforms DSM even when the number D of parallel disks is fairly small.
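The forecasting half of forecast-and-flush can be illustrated with a toy single-disk merge (function names ours): the run whose resident block has the smallest last key is the one whose block is consumed first, so its next block is the `right' one to prefetch. Flushing and the D-disk striping are omitted from this sketch.

```python
def forecast(resident):
    # The run whose resident block has the smallest LAST key runs dry
    # first, because the merge consumes keys in globally sorted order.
    candidates = [(blk[-1], i) for i, blk in enumerate(resident) if blk]
    return min(candidates)[1] if candidates else None

def merge_runs(runs, B):
    """Merge sorted runs block by block, one resident block per run,
    prefetching the block forecast() predicts will be needed next."""
    disk = [[run[i:i + B] for i in range(0, len(run), B)] for run in runs]
    resident = [blocks.pop(0) if blocks else [] for blocks in disk]
    prefetched = None               # (run index, block) fetched early
    out = []
    while any(resident):
        # Issue at most one outstanding prefetch, chosen by forecasting.
        want = forecast(resident)
        if want is not None and disk[want] and prefetched is None:
            prefetched = (want, disk[want].pop(0))
        # Consume the globally smallest resident key.
        j = min((blk[0], i) for i, blk in enumerate(resident) if blk)[1]
        out.append(resident[j].pop(0))
        if not resident[j]:         # block exhausted: refill run j
            if prefetched and prefetched[0] == j:
                resident[j] = prefetched[1]   # prefetch was the right block
                prefetched = None
            elif disk[j]:
                resident[j] = disk[j].pop(0)  # demand read
    return out

runs = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
print(merge_runs(runs, 2))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

The point of the prediction is that, with parallel disks, the read for the forecast block can be issued ahead of demand and overlapped with reads from the other disks.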
ISBN:
(Print) 9780897918091
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general and realistic model of inter-task communication, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
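A stripped-down latency-under-throughput-constraint DP can be sketched as follows. The cost model here (linear speedup for a module given k processors, zero communication cost) is our simplification, not the paper's; with zero communication the optimum degenerates to one large module, and it is precisely the real model's communication costs that make clustering nontrivial.

```python
def map_chain(costs, P, period):
    """Min latency for pipelining a chain of tasks on P processors so
    that every stage time is <= period; None if infeasible.
    (Toy model: a module of consecutive tasks given k processors runs
    with linear speedup; communication and replication are ignored.)"""
    n = len(costs)
    prefix = [0] * (n + 1)
    for i, c in enumerate(costs):
        prefix[i + 1] = prefix[i] + c
    INF = float("inf")
    # L[i][p] = best latency for the first i tasks using at most p processors
    L = [[INF] * (P + 1) for _ in range(n + 1)]
    for p in range(P + 1):
        L[0][p] = 0.0
    for i in range(1, n + 1):
        for p in range(1, P + 1):
            for j in range(i):                  # last stage = tasks j..i-1
                w = prefix[i] - prefix[j]
                for k in range(1, p + 1):       # processors given to it
                    stage = w / k               # linear-speedup assumption
                    if stage <= period and L[j][p - k] + stage < L[i][p]:
                        L[i][p] = L[j][p - k] + stage
    return None if L[n][P] == INF else L[n][P]

print(map_chain([4, 2, 6], P=4, period=6))  # → 3.0
```

The DP structure, optimizing over the boundary of the last module and its processor allotment, mirrors the paper's approach; its running time here is polynomial in the number of tasks and processors.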
In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER coherence algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), then the expected execution time of a `fully strict' multithreaded computation on P processors, each with an LRU cache of C pages, is O(T_1(C)/P + mCT_∞), where T_1(C) is the total work of the computation including page faults, T_∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number of page faults incurred by a computation executed on P processors, each with an LRU cache of C pages, is F_1(C) + O(CPT_∞), where F_1(C) is the number of serial page faults. Finally, we give simple bounds on the number of page faults and the space requirements for `regular' divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.
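The serial page-fault count that the corollary is stated in terms of can be measured with a minimal LRU cache simulator (a sketch of ours, not the paper's machinery):

```python
# Toy LRU simulator: count faults taken by a C-page cache on a trace.
from collections import OrderedDict

def serial_page_faults(accesses, C):
    """Number of faults an LRU cache of C pages takes on an access trace."""
    cache = OrderedDict()   # pages in LRU order (oldest first)
    faults = 0
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)          # mark most recently used
        else:
            faults += 1
            if len(cache) == C:
                cache.popitem(last=False)    # evict least recently used
            cache[page] = None
    return faults

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]
print(serial_page_faults(trace, C=3))  # → 6
```

The corollary says a parallel execution pays only an additive O(CPT∞) term on top of this serial count, in expectation.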
ISBN:
(Print) 9780897918091
Modern processors have many levels of parallelism arising from multiple functional units and pipeline stages. In this paper, we consider the interplay between instruction scheduling performed by a compiler and instruction lookahead performed by hardware. Anticipatory instruction scheduling is the process of rearranging instructions within each basic block so as to minimize the overall completion time of a set of basic blocks in the presence of hardware instruction lookahead, while preserving safety by not moving any instructions beyond basic block boundaries. Anticipatory instruction scheduling delivers many of the benefits of global instruction scheduling by accounting for instruction overlap across basic block boundaries arising from hardware lookahead, without compromising safety (as in some speculative scheduling techniques) or serviceability of the compiled program. We present the first provably optimal algorithm for a special case of anticipatory instruction scheduling for a trace of basic blocks on a machine with arbitrary-size lookahead windows. We extend this result for the version of the problem in which a trace of basic blocks is contained within a loop. In addition, we discuss how to modify these special-case optimal algorithms to obtain heuristics for the more general (but NP-hard) problems that occur in practice.
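To make "rearranging instructions within each basic block" concrete, here is the classic greedy critical-path list scheduler for one block (the standard heuristic, not the paper's lookahead-aware optimal algorithm; names are ours):

```python
# Critical-path list scheduling of one basic block's dependence DAG.
import heapq

def list_schedule(latency, deps):
    """latency[i] = result latency of instruction i; deps = list of
    (producer, consumer) edges with producer index < consumer index.
    Returns a single-issue order, longest-path-to-exit first."""
    n = len(latency)
    succs = [[] for _ in range(n)]
    indeg = [0] * n
    for a, b in deps:
        succs[a].append(b)
        indeg[b] += 1
    # Priority: length of the longest latency path from i to any exit.
    height = [0] * n
    for i in reversed(range(n)):    # valid since edges go low -> high index
        height[i] = latency[i] + max((height[s] for s in succs[i]), default=0)
    ready = [(-height[i], i) for i in range(n) if indeg[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, i = heapq.heappop(ready)
        order.append(i)
        for s in succs[i]:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (-height[s], s))
    return order

# The long-latency chain 1 -> 3 issues first, hiding its latency.
print(list_schedule([1, 3, 1, 1], [(1, 3), (0, 2)]))  # → [1, 0, 2, 3]
```

Anticipatory scheduling differs in its objective: it optimizes the completion time of a *sequence* of blocks given that hardware lookahead overlaps instructions across block boundaries, which this per-block heuristic ignores.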
ISBN:
(Print) 9780897918091
We consider the problem of asynchronous execution of parallel programs. The original program is assumed to be designed for a synchronous system, while the actual system may be asynchronous. We seek an automatic execution scheme, which allows the asynchronous system to execute the synchronous program. Previous solutions to this problem apply only to the case where the original program is deterministic. Here, we provide the first solution for the nondeterministic case (e.g. randomized programs). Our scheme is based on a novel agreement protocol for this setting. Our protocol allows n asynchronous processors to agree on n word-sized values in O(n log n log log n) total work. Total work is defined to be the summation of the number of steps performed by all processors (including busy waiting).
ISBN:
(Print) 9780897918091
We study the well known problem of throwing m balls into n bins. If each ball in the sequential game is allowed to select more than one bin, the maximum load of the bins can be exponentially reduced compared to the `classical balls into bins' game. We consider a static and a dynamic variant of a randomized parallel allocation where each ball can choose a constant number of bins. All results hold with high probability. In the static case all m balls arrive at the same time. We analyze for m = n a very simple optimal class of protocols achieving maximum load O((log n/log log n)^(1/r)) if r rounds of communication are allowed. This matches the lower bound of [ACMR95]. Furthermore, we generalize the protocols to the case of m > n balls. An optimal load of O(m/n) can be achieved using log log n/log(m/n) rounds of communication. Hence, for m = n log log n/log log log n balls this slackness allows us to hide the amount of communication. In the `classical balls into bins' game this optimal distribution can only be achieved for m = n log n. In the dynamic variant n of the m balls arrive at the same time and have to be allocated. Each of these initial n balls has a list of m/n successor-balls. As soon as a ball is allocated its successor will be processed. We present an optimal parallel process that allocates all m = n log n balls in O(m/n) rounds. Hence, the expected allocation time is constant. The main contribution of this process is that the maximum allocation time is additionally bounded by O(log log n).
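A quick sequential experiment shows the multiple-choice effect these protocols build on: letting each ball probe d random bins and join the least loaded one collapses the maximum load (a sketch of ours; the paper's contribution is doing this with few parallel communication rounds).

```python
# Sequential balls-into-bins with d random choices per ball.
import random

def max_load(n_balls, n_bins, d, rng):
    load = [0] * n_bins
    for _ in range(n_balls):
        candidates = [rng.randrange(n_bins) for _ in range(d)]
        best = min(candidates, key=lambda b: load[b])  # least loaded probe
        load[best] += 1
    return max(load)

n = 10_000
one = max_load(n, n, d=1, rng=random.Random(0))
two = max_load(n, n, d=2, rng=random.Random(0))
print(one, two)  # d = 2 yields a markedly smaller maximum load
```

For m = n, one choice gives maximum load Θ(log n/log log n) while two choices give Θ(log log n), the exponential gap the abstract refers to.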
ISBN:
(Print) 9780897918091
This paper analyzes the impact of virtual channels on the performance of wormhole routing algorithms. We show that in any network in which each physical channel can emulate up to Q virtual channels, it is possible to route any set of L-bit messages whose paths have congestion C and dilation D in (L+D)C(D log D)^(1/Q) 2^(O(log*(C/D))) bit steps. We also prove a nearly matching lower bound, i.e., for any values of C, D, Q, and L, where C, D ≥ Q+1 and L = (1+Ω(1))D, we show how to construct a network and a set of L-bit messages whose paths have congestion C and dilation D that require Ω(LCD^(1/Q)) bit steps to route. These upper and lower bounds imply that increasing the queuing capacity Q of each physical channel can speed up a wormhole routing algorithm by a superlinear factor. The results can be translated to the scenario in which each physical channel can transmit B bits simultaneously, and can queue bits from B different messages. In this case, the bounds are (L+D)C(D log D)^(1/B) 2^(O(log*(C/D)))/B and Ω(LCD^(1/B)/B), respectively. We also present a simple randomized wormhole routing algorithm for the butterfly network. The algorithm routes a q-relation on the inputs and outputs of an n-input butterfly in O(LQ(q+log n)(log^(1/Q) n) log log(qn)) bit-steps. We present a nearly-matching lower bound that holds for a broad class of algorithms.
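Congestion C and dilation D, the two path parameters all of these bounds are stated in, are computed directly from a set of message paths:

```python
# C = maximum number of paths crossing any single edge;
# D = maximum path length (number of edges).
from collections import Counter

def congestion_dilation(paths):
    """paths: each path is a list of nodes; edges are consecutive pairs."""
    edge_use = Counter()
    dilation = 0
    for p in paths:
        dilation = max(dilation, len(p) - 1)        # longest path
        for u, v in zip(p, p[1:]):
            edge_use[(u, v)] += 1                   # directed edge load
    congestion = max(edge_use.values(), default=0)  # busiest edge
    return congestion, dilation

paths = [[0, 1, 2, 3], [4, 1, 2, 5], [6, 2, 3]]
print(congestion_dilation(paths))  # → (2, 3)
```

Since every message must traverse its whole path and C messages may contend for one edge, Ω(C + D) bit steps are an immediate lower bound; the paper's results pin down how close to it wormhole routing can get as a function of Q.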