ISBN (print): 9780897918091
We present parallel algorithms that maintain a 2-3 tree under insertions and deletions. The algorithms are designed for BSP*, an extension of Valiant's BSP model that rewards blockwise communication and thereby reduces the overhead involved in communication. The BSP* model was introduced by Baumker et al. in [2]. Our analysis of the data structure goes beyond standard asymptotic analysis: we use Valiant's notion of c-optimality. Intuitively, c-optimal algorithms achieve speedup p/c with growing input size (p denotes the number of processors), where the communication time is asymptotically smaller than the computation time. Our first approach allows 1-optimal searching and amortized c-optimal insertion and deletion for a small constant c. The second allows 2-optimal searching and c-optimal insertion and deletion for a small constant c. Both results hold with probability 1-o(1) for wide ranges of BSP* parameters, where the ranges become larger with growing input size; the first approach allows much larger ranges. Further, both approaches are memory efficient: the total amount of memory used is proportional to the size m of the set being stored. Our results improve on previous work by supporting a fully dynamic search tree rather than a static one and by significantly reducing the communication time. Moreover, our algorithms use blockwise communication.
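As background for the data structure being maintained, the following is a minimal sequential 2-3 tree sketch (search, plus insertion with node splitting). It is an illustrative baseline only: the paper's parallel BSP* maintenance and its deletion algorithm are not reproduced, and all names are our own.

```python
import bisect

class Node:
    """2-3 tree node: 1-2 keys; internal nodes have len(keys)+1 children."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list => leaf

def search(node, key):
    while node is not None:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:            # reached a leaf without finding key
            return False
        node = node.children[i]
    return False

def insert(root, key):
    """Insert key, returning the (possibly new) root."""
    split = _insert(root, key)
    if split is not None:                # root overflowed: grow a new root
        mid, left, right = split
        return Node([mid], [left, right])
    return root

def _insert(node, key):
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return None                      # key already present
    if not node.children:                # leaf: insert in sorted position
        node.keys.insert(i, key)
    else:
        split = _insert(node.children[i], key)
        if split is not None:            # child overflowed: absorb middle key
            mid, left, right = split
            node.keys.insert(i, mid)
            node.children[i:i + 1] = [left, right]
    if len(node.keys) == 3:              # overflow: split around middle key
        left = Node(node.keys[:1], node.children[:2])
        right = Node(node.keys[2:], node.children[2:])
        return node.keys[1], left, right
    return None

# build a small tree and query it
root = Node([])
for k in range(1, 21):
    root = insert(root, k)
```

Because every leaf sits at the same depth and nodes hold at most two keys, both search and insertion take O(log m) time for m stored keys, which is the sequential cost the parallel algorithms distribute across p processors.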
ISBN (print): 9780897918091
Modern processors have many levels of parallelism arising from multiple functional units and pipeline stages. In this paper, we consider the interplay between instruction scheduling performed by a compiler and instruction lookahead performed by hardware. Anticipatory instruction scheduling is the process of rearranging instructions within each basic block so as to minimize the overall completion time of a set of basic blocks in the presence of hardware instruction lookahead, while preserving safety by not moving any instructions beyond basic block boundaries. Anticipatory instruction scheduling delivers many of the benefits of global instruction scheduling by accounting for instruction overlap across basic block boundaries arising from hardware lookahead, without compromising the safety (as some speculative scheduling techniques do) or the serviceability of the compiled program. We present the first provably optimal algorithm for a special case of anticipatory instruction scheduling for a trace of basic blocks on a machine with arbitrary-size lookahead windows. We extend this result to the version of the problem in which a trace of basic blocks is contained within a loop. In addition, we discuss how to modify these special-case optimal algorithms to obtain heuristics for the more general (but NP-hard) problems that occur in practice.
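For context, basic-block instruction scheduling is commonly approximated by greedy critical-path list scheduling. The sketch below is that standard baseline, not the paper's provably optimal lookahead algorithm; the instruction names, latencies, and uniform-functional-unit model are all illustrative assumptions.

```python
def list_schedule(instrs, num_units):
    """Greedy critical-path list scheduling of one basic block.
    instrs maps a name to (latency, [names it depends on]);
    returns name -> issue cycle on num_units identical units."""
    succs = {u: [] for u in instrs}
    for u, (_, deps) in instrs.items():
        for d in deps:
            succs[d].append(u)

    # priority: longest latency path from the instruction to any sink
    prio = {}
    def critical(u):
        if u not in prio:
            prio[u] = instrs[u][0] + max((critical(s) for s in succs[u]),
                                         default=0)
        return prio[u]

    schedule, finish = {}, {}
    remaining, cycle = list(instrs), 0
    while remaining:
        # ready: every dependence has completed by this cycle
        ready = [u for u in remaining
                 if all(d in finish and finish[d] <= cycle
                        for d in instrs[u][1])]
        ready.sort(key=critical, reverse=True)
        for u in ready[:num_units]:
            schedule[u] = cycle
            finish[u] = cycle + instrs[u][0]
            remaining.remove(u)
        cycle += 1
    return schedule

# illustrative 4-instruction block: c depends on a, d depends on b
instrs = {"a": (2, []), "b": (1, []), "c": (1, ["a"]), "d": (1, ["b"])}
sched = list_schedule(instrs, num_units=1)
```

An anticipatory scheduler would additionally weigh how the tail of this block overlaps, via the hardware lookahead window, with the head of the next block; the greedy heuristic above considers only one block in isolation.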
ISBN (print): 9780897918091
We consider the problem of asynchronous execution of parallel programs. The original program is assumed to be designed for a synchronous system, while the actual system may be asynchronous. We seek an automatic execution scheme that allows the asynchronous system to execute the synchronous program. Previous solutions to this problem handle only the case where the original program is deterministic. Here, we provide the first solution for the nondeterministic case (e.g., randomized programs). Our scheme is based on a novel agreement protocol for this setting. Our protocol allows n asynchronous processors to agree on n word-sized values in O(n log n log log n) total work, where total work is defined as the sum of the number of steps performed by all processors (including busy waiting).
ISBN (print): 9780897918091
We study the well-known problem of throwing m balls into n bins. If each ball in the sequential game is allowed to select more than one bin, the maximum load of the bins can be exponentially reduced compared to the classical balls-into-bins game. We consider a static and a dynamic variant of randomized parallel allocation where each ball can choose a constant number of bins. All results hold with high probability. In the static case, all m balls arrive at the same time. For m = n we analyze a very simple optimal class of protocols achieving maximum load O((log n/log log n)^{1/r}) if r rounds of communication are allowed. This matches the lower bound of [ACMR95]. Furthermore, we generalize the protocols to the case of m > n balls. An optimal load of O(m/n) can be achieved using log log n/log(m/n) rounds of communication. Hence, for m = n log log n/log log log n balls this slackness makes it possible to hide the amount of communication. In the classical balls-into-bins game this optimal distribution can only be achieved for m = n log n. In the dynamic variant, n of the m balls arrive at the same time and have to be allocated; each of these initial n balls has a list of m/n successor balls, and as soon as a ball is allocated, its successor is processed. We present an optimal parallel process that allocates all m = n log n balls in O(m/n) rounds. Hence, the expected allocation time is constant. The main contribution of this process is that the maximum allocation time is additionally bounded by O(log log n).
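The power of multiple choices is easy to observe in simulation. The sketch below plays the sequential d-choice game, not the paper's parallel round-limited protocol: each ball samples d bins uniformly at random and joins the least loaded one. Parameters and the fixed seed are illustrative.

```python
import random

def max_load(m, n, d, rng):
    """Throw m balls into n bins; each ball samples d bins uniformly at
    random and joins the least loaded one. Returns the maximum bin load."""
    loads = [0] * n
    for _ in range(m):
        best = min((rng.randrange(n) for _ in range(d)),
                   key=loads.__getitem__)
        loads[best] += 1
    return max(loads)

rng = random.Random(0)          # fixed seed for reproducibility
n = 10_000
single = max_load(n, n, 1, rng)  # classical game: ~ log n / log log n
double = max_load(n, n, 2, rng)  # two choices: ~ log log n
```

For n = 10,000 one typically sees a single-choice maximum load of 6 or 7 versus 3 or 4 with two choices, the exponential gap the abstract refers to.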
ISBN (print): 9780897918091
This paper analyzes the impact of virtual channels on the performance of wormhole routing algorithms. We show that in any network in which each physical channel can emulate up to Q virtual channels, it is possible to route any set of L-bit messages whose paths have congestion C and dilation D in (L+D)C(D log D)^{1/Q} 2^{O(log*(C/D))} bit steps. We also prove a nearly matching lower bound: for any values of C, D, Q, and L, where C, D ≥ Q+1 and L = (1+Ω(1))D, we show how to construct a network and a set of L-bit messages whose paths have congestion C and dilation D that require Ω(LCD^{1/Q}) bit steps to route. These upper and lower bounds imply that increasing the queuing capacity Q of each physical channel can speed up a wormhole routing algorithm by a superlinear factor. The results can be translated to the scenario in which each physical channel can transmit B bits simultaneously and can queue bits from B different messages. In this case, the bounds are (L+D)C(D log D)^{1/B} 2^{O(log*(C/D))}/B and Ω(LCD^{1/B}/B), respectively. We also present a simple randomized wormhole routing algorithm for the butterfly network. The algorithm routes a q-relation on the inputs and outputs of an n-input butterfly in O(LQ(q + log n)(log^{1/Q} n) log log(qn)) bit steps. We present a nearly matching lower bound that holds for a broad class of algorithms.
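The superlinear benefit of virtual channels can be seen directly in the D^{1/Q} factor of the lower bound: doubling Q from 1 to 2 shrinks this factor from D to √D, far more than the 2x increase in channel capacity would suggest. A toy computation (D = 256 is an illustrative dilation, not from the paper):

```python
def dilation_term(D, Q):
    """The D**(1/Q) factor from the Omega(L*C*D**(1/Q)) lower bound."""
    return D ** (1.0 / Q)

D = 256                          # illustrative dilation
for Q in (1, 2, 4):
    print(Q, dilation_term(D, Q))
```

Here the factor drops from 256 (Q=1) to 16 (Q=2) to 4 (Q=4): a 16x improvement for doubling Q, hence superlinear in the added channel capacity.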
ISBN (print): 9780818676413
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficiently schedule loops that contain conditional branches. We have developed an algorithm we call All Paths Pipelining (APP) that addresses this shortcoming of software pipelining. APP is designed to achieve optimal or near-optimal performance for any run of iterations while providing efficient code for transitioning between runs. A run is the execution of consecutive iterations that all follow the same path through a loop. APP accomplishes this by using techniques from modulo scheduling and kernel recognition algorithms, the two main approaches to software pipelining loops. We have implemented the APP algorithm in our research compiler and have evaluated its performance by executing its generated code on a VLIW instruction-set simulator. For a processor with five heterogeneous functional units, APP adds a further 1% to 23% performance improvement over basic software pipelining by effectively pipelining loops with conditional branches.
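One standard ingredient of modulo scheduling, which APP draws on, is the minimum initiation interval (MII): the number of cycles between the starts of successive iterations can be no smaller than the larger of a resource bound (ResMII) and a recurrence bound (RecMII). The sketch below computes these textbook bounds; the operation classes, unit counts, and dependence cycle are illustrative assumptions, not values from the paper.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: each initiation interval must accommodate every
    resource class's operations on its available units."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence bound; cycles is a list of (latency_sum, distance_sum)
    pairs, one per dependence cycle in the loop body."""
    return max(ceil(lat / dist) for lat, dist in cycles)

def mii(op_counts, unit_counts, cycles):
    return max(res_mii(op_counts, unit_counts),
               rec_mii(cycles) if cycles else 0)

# illustrative loop: 4 ALU ops on 2 ALUs, 2 memory ops on 1 port, and one
# dependence cycle with total latency 3 carried across 1 iteration
lower_bound = mii({"alu": 4, "mem": 2}, {"alu": 2, "mem": 1}, [(3, 1)])
```

In this example the recurrence bound (3 cycles) dominates the resource bound (2 cycles), so no modulo schedule can start iterations more often than every 3 cycles.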