Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficien...
详细信息
ISBN:
(纸本)9780818676413
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficiently schedule loops that contain conditional branches. We have developed an algorithm we call All Paths Pipelining (APP) that addresses this shortcoming of software pipelining. APP is designed to achieve optimal or near-optimal performance for any run of iterations while providing efficient code for transitioning between runs. A run is the execution of consecutive iterations that all execute the same path through a loop. APP accomplishes this by using techniques from modulo scheduling and kernel recognition algorithms, the two main approaches for software pipelining loops. We have implemented the APP algorithm in our research compiler and have evaluated its performance by executing its generated code on a VLIW instruction-set simulator. For a processor with five heterogeneous functional units, APP is able to add another 1% to 23% increase in performance over basic software pipelining by effectively pipelining loops with conditional branches.
An upper bound for the trade-off between space and routing time and space needed to store routing information in the processors and the packets is proven. This result is a consequence of a new strategy for the simulat...
详细信息
An upper bound for the trade-off between space and routing time and space needed to store routing information in the processors and the packets is proven. This result is a consequence of a new strategy for the simulation between arbitrary networks. Finally, several techniques are applied for the simulation of networks to demonstrate space-efficiently routing strategies for vertex-symmetric networks.
We describe a new randomized parallel algorithm for computing the planar convex hull of n points on a p processor-memory pair machine communicating through a network for which O(log p) routing is possible. We show tha...
详细信息
We describe a new randomized parallel algorithm for computing the planar convex hull of n points on a p processor-memory pair machine communicating through a network for which O(log p) routing is possible. We show that the algorithm has optimal O(n log n/p) performance provided that p = o(n/log3n). We report extensive computational experience with the algorithm which supports its theoretical claims.
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implement...
详细信息
ISBN:
(纸本)9780897917179
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implementations. Since large data parallel programs often have several levels of nested barriers, these schemes provide significant speedups in the execution of such programs on MIMD computers. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. However, this scheme increases the code size. The second scheme uses a more expensive integer-tree to support an exponential number of nested barriers without increasing the code size. Using hardware already available on commercial MIMD computers, this scheme can support more than four billion levels of nesting.
We implemented and measured several methods to perform BMMC permutations on the MasPar MP-2. Our results indicate that, except for certain types of permutations or very high virtual processor ratios, the best method o...
详细信息
We implemented and measured several methods to perform BMMC permutations on the MasPar MP-2. Our results indicate that, except for certain types of permutations or very high virtual processor ratios, the best method overall is the naive method but with virtual-processor numbers computed in Gray-code order. For some permutations, however, the naive method performs very poorly;the best method in these cases is an adaptation of the block BMMC algorithm for parallel disk systems in which the processor elements are treated as independent devices.
Emerging applications in multi-media and the Human Genome Project require storage and searching of large databases of strings - a task for which parallelism seems the only hope. In this paper, we consider the parallel...
详细信息
Emerging applications in multi-media and the Human Genome Project require storage and searching of large databases of strings - a task for which parallelism seems the only hope. In this paper, we consider the parallelism in some of the fundamental problems in compressing strings and in matching large dictionaries of patterns against texts. We present the first work-optimal algorithms for these well-studied problems including the classical dictionary matching problem, optimal compression with a static dictionary and the universal data compression with dynamic dictionary of Lempel and Ziv. All our algorithms are randomized and they are of the Las Vegas type. Furthermore, they are fast, working in time logarithmic in the input size. Additionally, our algorithms seem suitable for a distributed implementation.
暂无评论