Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads c...
ISBN:
(纸本)9781581131857
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive in the total number of operations they perform with the fastest serial algorithm for the same problem?Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallelalgorithms; applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic *** putting things in context, we also discuss our “PRAM-on-chip” vision (actually a small update to it), presented at SPAA98.
parallel computing teaching has an important difficulty, there are few tools to directly learn the behavior of the parallelalgorithms and the parallelarchitectures. Normally the student is formed to think in sequent...
ISBN:
(纸本)9780897914680
parallel computing teaching has an important difficulty, there are few tools to directly learn the behavior of the parallelalgorithms and the parallelarchitectures. Normally the student is formed to think in sequential algorithms running in sequential machines. We present PSEE, a tool to reduce the gap between the basic concepts and its utilization. PSEE is an integrated and interactive graphic environment which allows to simulate and evaluate the performance of parallelalgorithms in parallelarchitectures. PSEE permits to manage the main characteristic parameters involved in the system in order to show the tuning grade of the algorithm/architecture couple. PSEE includes a graphic editor for algorithms and architectures in modelled form, an interactive simulator to run (simulate) the algorithm on the architecture and a performance evaluation instrument.
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficien...
详细信息
ISBN:
(纸本)9780818676413
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficiently schedule loops that contain conditional branches. We have developed an algorithm we call All Paths Pipelining (APP) that addresses this shortcoming of software pipelining. APP is designed to achieve optimal or near-optimal performance for any run of iterations while providing efficient code for transitioning between runs. A run is the execution of consecutive iterations that all execute the same path through a loop. APP accomplishes this by using techniques from modulo scheduling and kernel recognition algorithms, the two main approaches for software pipelining loops. We have implemented the APP algorithm in our research compiler and have evaluated its performance by executing its generated code on a VLIW instruction-set simulator. For a processor with five heterogeneous functional units, APP is able to add another 1% to 23% increase in performance over basic software pipelining by effectively pipelining loops with conditional branches.
The problem of online job scheduling on various parallelarchitectures is studied. An O((log log n)/sup 1/2/)-competitive algorithm for online dynamic scheduling on an n*n mesh is given. It is proved that this algorit...
详细信息
The problem of online job scheduling on various parallelarchitectures is studied. An O((log log n)/sup 1/2/)-competitive algorithm for online dynamic scheduling on an n*n mesh is given. It is proved that this algorithm is optimal up to a constant factor. The algorithm is not greedy, and the lower bound proof shows that no greedy-like algorithm can be very good. The upper bound result can be generalized to any fixed-dimensional meshes. Competitive scheduling algorithms for other architectures are given.< >
A number of deterministic parallel programming models with strong safety guarantees are emerging, but similar support for non-deterministic algorithms, such as branch and bound search, remains an open question. We pre...
详细信息
ISBN:
(纸本)9781450304900
A number of deterministic parallel programming models with strong safety guarantees are emerging, but similar support for non-deterministic algorithms, such as branch and bound search, remains an open question. We present a language together with a type and effect system that supports nondeterministic computations with a deterministic-by-default guarantee: nondeterminism must be explicitly requested via special parallel constructs (marked nd), and any deterministic construct that does not execute any nd construct has deterministic input-output behavior. Moreover, deterministic parallel constructs are always equivalent to a sequential composition of their constituent tasks, even if they enclose, or are enclosed by, nd constructs. Finally, in the execution of nd constructs, interference may occur only between pairs of accesses guarded by atomic statements, so then;are no data races, either between atomic statements and unguarded accesses (strong isolation) or between pairs of unguarded access (stronger than strong isolation alone). We enforce the guarantees at compile time with modular checking using novel,extensions to a previously described effect system. Our effect system extensions also enable the compiler to remove unnecessary transactional synchronization. We provide a static semantics, dynamic semantics, and a complete proof of soundness for the language;both with, and without the barrier removal feature. An experimental evaluation shows that our language can achieve good scalability for realistic parallelalgorithms, and that the barrier removal techniques provide significant performance gains.
This paper presents ParGeo, a multicore library for computational geometry. ParGeo contains modules for fundamental tasks including kd-tree based spatial search, spatial graph generation, and algorithms in computation...
详细信息
One obvious question is if there exists an online algorithm that is O(1)-competitive with speed less than two or not. To obtain such an algorithm (if exists), one must exploit the structure of cost functions. Our anal...
详细信息
ISBN:
(纸本)9781611972108
One obvious question is if there exists an online algorithm that is O(1)-competitive with speed less than two or not. To obtain such an algorithm (if exists), one must exploit the structure of cost functions. Our analysis can be extended to show that there exists an O(1)-speed O(1)-competitive algorithm on identical parallel machines.
We present a parallelized and pipelined architecture for a generalized Laguerre-Volterra MIMO system to identify the time-varying neural dynamics underlying spike activities. The proposed architecture consists of a fi...
详细信息
ISBN:
(纸本)9780769543017
We present a parallelized and pipelined architecture for a generalized Laguerre-Volterra MIMO system to identify the time-varying neural dynamics underlying spike activities. The proposed architecture consists of a first stage containing a vector convolution and MAC (Multiply and Accumulation) component;a second stage containing a pre-threshold potential updating unit with an error approximation function component;and a third stage consisting of a gradient calculation unit. A flexible and efficient architecture that can accommodate different design speed requirements are generated. Simulation results are rigorously analyzed. A hardware IP library for versatile models and applications is proposed. The design runs on a Xilinx Virtex-6 FPGA and the processing core produces data samples at a maximum clock rate of 357MHz, which is 4.37 x 10(5) times faster than the corresponding software model running on an AMD Pheono 9750 Quad Core Processor. It occupies 216,766 LUTs, maximum 12 block-RAMs, and 2016 DSP-blocks.
暂无评论