ISBN (print): 9780818679773
Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
This paper focuses on the interaction between software prefetching (both binding and nonbinding) and software pipelining for VLIW machines. First, it is shown that evaluating software pipelined schedules without considering memory effects can be rather inaccurate due to stalls caused by dependences with memory instructions (even if a lockup-free cache is considered). It is also shown that the penalty of the stalls is in general higher than the effect of spill code. Second, we show that in general binding schemes are more powerful than nonbinding ones for software pipelined schedules. Finally, the main contribution of this paper is a heuristic scheme that schedules some memory operations according to the locality estimated at compile time and other attributes of the dependence graph. The proposed scheme is shown to outperform other heuristic approaches since it achieves a better trade-off between compute and stall time than the others.
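As a rough illustration of the idea (not the paper's actual scheme), the sketch below assumes a compile-time locality estimate per load and uses it to pick the latency the modulo scheduler plans for, then shows how that choice changes the recurrence-constrained minimum initiation interval (RecMII) of a small loop. The latencies, the locality threshold, and the dependence-cycle representation are hypothetical.

```python
import math

# Toy illustration: choose the latency assumed for each load from a
# compile-time locality estimate, then compute the RecMII of the loop.
HIT_LATENCY, MISS_LATENCY = 2, 10          # assumed cache hit / miss latencies

def assumed_latency(op):
    """Loads estimated to have locality are scheduled with the hit latency;
    the rest are scheduled with the (much larger) miss latency."""
    if op["kind"] == "load":
        return HIT_LATENCY if op["locality"] > 0.5 else MISS_LATENCY
    return op["latency"]

def rec_mii(cycles, ops):
    """RecMII = max over dependence cycles of ceil(cycle latency / cycle distance)."""
    return max(math.ceil(sum(assumed_latency(ops[i]) for i, _ in cyc) /
                         sum(d for _, d in cyc))
               for cyc in cycles)

# One recurrence: load -> add -> store carried to the next iteration (distance 1).
ops = {0: {"kind": "load", "locality": 0.9},
       1: {"kind": "add", "latency": 1},
       2: {"kind": "store", "latency": 1}}
cycles = [[(0, 0), (1, 0), (2, 1)]]        # (op id, dependence distance) pairs
print(rec_mii(cycles, ops))                # 4 here; 12 if the load were assumed to miss
```

The point of the sketch is only that the latency assumed for a memory operation directly determines the achievable initiation interval, which is why scheduling by estimated locality matters.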
The proceedings contain 39 papers from the 8th Annual ACM Symposium on Parallel Algorithms and Architectures. Topics discussed include: parallel random access memory; optical parallel processing; release consistency; scope consistency; entry consistency; cache coherence protocols; thread management; weight factoring; rooted tree networks; butterfly networks; memory mapping; network routing tables; all-to-all personalized communication; sample sort; radix sort; blockwise sampling; minimum spanning forests; and virtual channels.
The problem of modeling load balancing is considered in a variety of distributed settings. A new direction in diffusive schedules was introduced by considering schedules modeled as w_1 = M w_0; w_{t+1} = β M w_t + (1 − β) w_{t-1}, for some appropriate β, called second-order schedules. In the idealized setting where the weights are real numbers, results indicate that β can be chosen so that the second-order schedule is significantly faster than the first-order method. This yields an algorithm that performs coarse load balancing rapidly and can be used in a number of applications.
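A minimal sketch of the second-order iteration itself is given below; the diffusion matrix, the 4-node ring topology, and the value of β are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def second_order_diffusion(M, w0, beta, steps):
    """Second-order diffusive schedule: w_1 = M w_0,
    w_{t+1} = beta * M w_t + (1 - beta) * w_{t-1}."""
    w_prev, w_curr = w0, M @ w0                    # first step is a plain first-order sweep
    for _ in range(steps - 1):
        w_prev, w_curr = w_curr, beta * (M @ w_curr) + (1.0 - beta) * w_prev
    return w_curr

# Illustrative 4-node ring with uniform diffusion weights.
M = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
w0 = np.array([100.0, 0.0, 0.0, 0.0])              # all load starts on node 0
print(second_order_diffusion(M, w0, beta=1.1, steps=30))   # approaches [25, 25, 25, 25]
```

With β = 1 the iteration degenerates to the first-order diffusion w_{t+1} = M w_t; choosing β > 1 appropriately accelerates convergence toward the balanced load vector.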
Two novel variations on sample sort, one using only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead and another using regular sampling for choosing splitters, were studied. The two were coded in Split-C and were run on a variety of platforms. Results were consistent with theoretical analysis and illustrated the scalability and efficiency of the algorithms.
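For illustration, here is a small sequential simulation of sorting by regular sampling, the splitter-selection scheme mentioned above; it is not the authors' Split-C code, and the block sizes and sampling stride are simplifying assumptions.

```python
from bisect import bisect_right
from heapq import merge

def psrs(data, p):
    """Sequential simulation of sorting by regular sampling: each of p
    "processors" sorts its block locally and contributes p regular samples,
    global splitters are drawn from the sorted samples, every block is
    partitioned by the splitters (the all-to-all exchange), and bucket i
    is merged by processor i."""
    n = len(data)
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Regular sampling: evenly spaced elements of each locally sorted block.
    samples = sorted(s for b in blocks for s in b[::max(1, len(b) // p)][:p])
    splitters = [samples[(i + 1) * len(samples) // p] for i in range(p - 1)]
    buckets = [[] for _ in range(p)]
    for b in blocks:
        cuts = [0] + [bisect_right(b, s) for s in splitters] + [len(b)]
        for i in range(p):
            buckets[i].append(b[cuts[i]:cuts[i + 1]])
    # Each processor merges its already-sorted pieces.
    return [x for i in range(p) for x in merge(*buckets[i])]

print(psrs([5, 3, 8, 1, 9, 2, 7, 6, 4, 0], p=2))   # [0, 1, 2, ..., 9]
```

Because the splitters are taken from regular samples of every locally sorted block, no bucket can become much larger than n/p, which is what gives the good load balance the abstract refers to.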
This paper experimentally validates performance-related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 processors). Our work consists of three parts. First, there is an evaluation part in which we investigate whether the models correctly predict the execution time of an algorithm implementation. Unlike previous work, which mostly demonstrated a close match between the measured and predicted running times, this paper shows that there are situations in which the models do not precisely predict the actual execution time of an algorithm implementation. Second, there is a comparison part in which the models are contrasted with each other in order to determine which model induces the fastest algorithms. Finally, there is an efficiency validation part in which the performance of the model-derived algorithms is compared with the performance of highly optimized library routines to show the effectiveness of deriving fast algorithms through the formalisms of the models.
A library, called PAD, of basic parallel algorithms and data structures for the PRAM is currently being implemented using the PRAM programming language Fork95. The main motivations of the PAD project are to study the PRAM as a practical programming model, and to provide an organized collection of basic PRAM algorithms for the SB-PRAM under construction at the University of Saarbruecken. We give a brief survey of Fork95 and describe the main components of PAD. Finally, we report on the status of the language and library and discuss further developments.
ISBN (print): 9780897918091
Scheduling problems that are critical and prevalent in practical parallel computing are considered. A polynomial-time makespan algorithm is presented to solve these problems; it produces a schedule of length O(V + Φ log T) and is therefore an O(log T) approximation. The makespan algorithm can be extended to minimize the weighted average completion time over all the jobs within the same O(log T) approximation factor.
ISBN (print): 9780897918091
We describe a randomized CRCW PRAM algorithm that finds a minimum spanning forest of an n-vertex graph in O(log n) time and linear work. This shaves a factor of 2^(log* n) off the best previous running time for a linear-work algorithm. The novelty in our approach is to divide the computation into two phases, the first of which finds only a partial solution. This idea has been used previously in parallel connected components algorithms.
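As a rough sketch of the kind of building block such algorithms contract around, the following sequential Boruvka step selects each vertex's minimum-weight incident edge (all of which belong to the MSF) and contracts the resulting components. It is illustrative only and not the randomized CRCW PRAM algorithm described in the paper.

```python
def boruvka_step(n, edges):
    """edges: list of (weight, u, v).  Returns the set of edges chosen as some
    vertex's minimum-weight incident edge (all belong to the MSF) and a label
    per vertex identifying its contracted component."""
    cheapest = [None] * n
    for e in edges:
        w, u, v = e
        for x in (u, v):
            if cheapest[x] is None or w < cheapest[x][0]:
                cheapest[x] = e
    chosen = {e for e in cheapest if e is not None}
    # Contract the chosen edges with union-find (pointer jumping on a PRAM).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _w, u, v in chosen:
        parent[find(u)] = find(v)
    return chosen, [find(v) for v in range(n)]

edges = [(4, 0, 1), (1, 1, 2), (3, 2, 3), (2, 3, 0)]
forest, comp = boruvka_step(4, edges)
print(sorted(forest), comp)   # [(1, 1, 2), (2, 3, 0)] plus two contracted components
```

Each such step at least halves the number of components, which is why a logarithmic number of (parallelized) steps suffices; the cited paper's contribution is doing this within O(log n) time and linear work overall.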
Methods for mitigating the performance degradation caused by high latencies in parallel and distributed networks were described. Most of the analysis was centered on the simulation of unit-delay rings on networks of workstations (NOWs) with arbitrary delays on the links. Emulations were also derived for a wide variety of other unit-delay network architectures on a NOW with high-latency links. Lower bounds were proven that establish limits on the degree to which the effects of high-latency links can be mitigated. These bounds demonstrate that latencies are easier to overcome in dataflow types of computations than in computations that require access to large local databases.