This paper analyzes the impact of virtual channels on the performance of wormhole routing algorithms. We show that in any network in which each physical channel can emulate up to Q virtual channels, it is possible to ...
详细信息
ISBN:
(纸本)9780897918091
This paper analyzes the impact of virtual channels on the performance of wormhole routing algorithms. We show that in any network in which each physical channel can emulate up to Q virtual channels, it is possible to route any set of L-bit messages whose paths have congestion C and dilation D in (L+D)C(D log D)1/Q2O(log*(C/D)) bit steps. We also prove a nearly matching lower bound, i.e., for any values of C, D, Q, and L, where C, D≥Q+1 and L = (1+Ω(1))D, we show how to construct a network and a set of L-bit messages whose paths have congestion C and dilation D that require Ω(LCD1/Q) bit steps to route. These upper and lower bounds imply that increasing the queuing capacity Q of each physical channel can speed up a wormhole routing algorithm by a superlinear factor. The results can be translated to the scenario in which each physical channel can transmit B bits simultaneously, and can queue bits from B different messages. In this case, the bounds are (L+D)C(D log D)1/B2O(log* (C/D))/B and Ω(LCD1/B/B), respectively. We also present a simple randomized wormhole routing algorithm for the butterfly network. The algorithm routes a q-relation on the inputs and outputs of an n-input butterfly in O(LQ(q+log n)(log1/Q n) log log(qn)) bit-steps. We present a nearly-matching lower bound that holds for a broad class of algorithms.
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficien...
详细信息
ISBN:
(纸本)9780818676413
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficiently schedule loops that contain conditional branches. We have developed an algorithm we call All Paths Pipelining (APP) that addresses this shortcoming of software pipelining. APP is designed to achieve optimal or near-optimal performance for any run of iterations while providing efficient code for transitioning between runs. A run is the execution of consecutive iterations that all execute the same path through a loop. APP accomplishes this by using techniques from modulo scheduling and kernel recognition algorithms, the two main approaches for software pipelining loops. We have implemented the APP algorithm in our research compiler and have evaluated its performance by executing its generated code on a VLIW instruction-set simulator. For a processor with five heterogeneous functional units, APP is able to add another 1% to 23% increase in performance over basic software pipelining by effectively pipelining loops with conditional branches.
An upper bound for the trade-off between space and routing time and space needed to store routing information in the processors and the packets is proven. This result is a consequence of a new strategy for the simulat...
详细信息
An upper bound for the trade-off between space and routing time and space needed to store routing information in the processors and the packets is proven. This result is a consequence of a new strategy for the simulation between arbitrary networks. Finally, several techniques are applied for the simulation of networks to demonstrate space-efficiently routing strategies for vertex-symmetric networks.
We describe a new randomized parallel algorithm for computing the planar convex hull of n points on a p processor-memory pair machine communicating through a network for which O(log p) routing is possible. We show tha...
详细信息
We describe a new randomized parallel algorithm for computing the planar convex hull of n points on a p processor-memory pair machine communicating through a network for which O(log p) routing is possible. We show that the algorithm has optimal O(n log n/p) performance provided that p = o(n/log3n). We report extensive computational experience with the algorithm which supports its theoretical claims.
We implemented and measured several methods to perform BMMC permutations on the MasPar MP-2. Our results indicate that, except for certain types of permutations or very high virtual processor ratios, the best method o...
详细信息
We implemented and measured several methods to perform BMMC permutations on the MasPar MP-2. Our results indicate that, except for certain types of permutations or very high virtual processor ratios, the best method overall is the naive method but with virtual-processor numbers computed in Gray-code order. For some permutations, however, the naive method performs very poorly;the best method in these cases is an adaptation of the block BMMC algorithm for parallel disk systems in which the processor elements are treated as independent devices.
暂无评论