ISBN: 9780818676413 (print)
Software pipelining is an aggressive scheduling technique that generates efficient code for loops and is particularly effective for VLIW architectures. Few software pipelining algorithms, however, are able to efficiently schedule loops that contain conditional branches. We have developed an algorithm we call All Paths Pipelining (APP) that addresses this shortcoming of software pipelining. APP is designed to achieve optimal or near-optimal performance for any run of iterations while providing efficient code for transitioning between runs. A run is the execution of consecutive iterations that all execute the same path through a loop. APP accomplishes this by using techniques from modulo scheduling and kernel recognition algorithms, the two main approaches for software pipelining loops. We have implemented the APP algorithm in our research compiler and have evaluated its performance by executing its generated code on a VLIW instruction-set simulator. For a processor with five heterogeneous functional units, APP is able to add another 1% to 23% increase in performance over basic software pipelining by effectively pipelining loops with conditional branches.
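The notion of a run defined in this abstract can be illustrated with a short sketch. It assumes each loop iteration is summarized by a label naming the path it took through the loop body; the labels and the `runs` helper are hypothetical and are not part of APP itself.

```python
from itertools import groupby


def runs(paths):
    """Group consecutive iterations that took the same path into runs.

    `paths` holds one label per loop iteration (hypothetical labels such
    as "then" or "else"). Returns a list of (path, run_length) pairs.
    """
    return [(path, len(list(group))) for path, group in groupby(paths)]


# Five iterations: two on the "then" path, two on "else", one on "then".
trace = ["then", "then", "else", "else", "then"]
for path, length in runs(trace):
    print(f"run of {length} iteration(s) on path {path!r}")
```

Each boundary between two adjacent runs is a point where, according to the abstract, the generated code has to transition efficiently from one pipelined path to another.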
ISBN: 0818676833 (print)
The dimension exchange method (DEM) was initially proposed as a load-balancing algorithm for the hypercube structure. It has been generalized to k-ary n-cubes. However, the k-ary n-cube algorithm must take many iterations to converge to a balanced state. In this paper we propose a direct method to modify DEM. The new algorithm, the Direct Dimension Exchange (DDE) method, takes the load average in every dimension to eliminate unnecessary load exchange. It balances the load directly without iteratively exchanging the load. This global approach is able to balance the load more accurately and much faster.
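The contrast drawn in this abstract can be sketched as follows. The first function is the classic pairwise DEM sweep on a hypercube. The second is only one reading of "takes the load average in every dimension": it assumes real-valued loads and that each ring of k nodes along a dimension can compute its average in one step. It is not the paper's DDE algorithm, and all function names are hypothetical.

```python
from itertools import product


def dimension_exchange_hypercube(loads):
    """Classic DEM: one pairwise averaging sweep per hypercube dimension.

    `loads` is a list whose length is a power of two; index bits are the
    node coordinates.
    """
    n_nodes = len(loads)
    dims = n_nodes.bit_length() - 1
    assert 1 << dims == n_nodes, "node count must be a power of two"
    loads = list(loads)
    for d in range(dims):
        for node in range(n_nodes):
            partner = node ^ (1 << d)  # neighbour across dimension d
            if node < partner:  # handle each pair once
                avg = (loads[node] + loads[partner]) / 2
                loads[node] = loads[partner] = avg
    return loads


def average_per_dimension(loads, k, n):
    """Assumed direct variant for a k-ary n-cube.

    `loads` maps each n-tuple of coordinates in range(k) to a load.
    For every dimension, each ring of k nodes is set to the ring average.
    """
    loads = dict(loads)
    for d in range(n):
        for rest in product(range(k), repeat=n - 1):
            ring = [rest[:d] + (i,) + rest[d:] for i in range(k)]
            avg = sum(loads[c] for c in ring) / k
            for c in ring:
                loads[c] = avg
    return loads


if __name__ == "__main__":
    print(dimension_exchange_hypercube([8, 0, 0, 0, 0, 0, 0, 0]))
    grid = {c: 0 for c in product(range(3), repeat=2)}
    grid[(0, 0)] = 9
    print(average_per_dimension(grid, k=3, n=2))
```

With real-valued loads, both sketches reach the global average after a single pass over the dimensions. How DDE handles integer loads and the actual message exchanges is not described in the abstract and is not modelled here.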
ISBN: 9780818675829 (print)
The goal of this work is to simplify parallel application development, and thus ease the learning barriers faced by non-experts. It is especially useful where there is little data-parallelism to be recognized by a compiler. The applications programmer need learn the intricacies of only one primary subroutine in order to get the full benefits of the parallel interface. The applications programmer defines a high level concept, the task, that depends only on his application, and not on any particular parallel library. The task is defined by its three phases: (a) the task input, (b) sequential code to execute the task, and (c) any modifications of global variables that occur as a result of the task. In particular, side effects (which change global variable values) must not occur in phase (b). Forcing the user to re-organize his computation in these terms allows us to present the applications programmer with a single global environment visible to all processors (whether on an SMP or a NOW architecture), in the context of a master-slave architecture. Both a shared memory implementation (running on an SGI or SUN Solaris architecture) and a NOW memory implementation (running on top of MPI) are described. The implementations were tested by a naive program for integer factorization, and by a more sophisticated Todd-Coxeter coset enumeration. Integer factorization was chosen so as to exercise the major features of TOP-C in an unambiguous context.
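The three-phase task structure described in this abstract can be sketched with a naive divisor search. This is an illustration only: it uses Python's standard thread pool as a stand-in for the master-slave setup, it does not use or imitate the TOP-C interface, and every function name below is hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def task_inputs(n, chunk):
    """Phase (a): each task input is a half-open range of trial divisors."""
    limit = math.isqrt(n)
    for start in range(2, limit + 1, chunk):
        yield (start, min(start + chunk, limit + 1))


def do_task(n, bounds):
    """Phase (b): sequential work with no side effects on shared data."""
    lo, hi = bounds
    return [d for d in range(lo, hi) if n % d == 0]


def update_shared(found, result):
    """Phase (c): only the master modifies the shared (global) state."""
    found.update(result)


def find_divisors(n, chunk=100, workers=4):
    """Master loop: hand out inputs, collect results, apply the updates."""
    found = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(lambda b: do_task(n, b), task_inputs(n, chunk)):
            update_shared(found, result)
    return sorted(found)


if __name__ == "__main__":
    print(find_divisors(91))
```

Because `do_task` never touches `found`, the workers can run in any order; all changes to shared state are funnelled through `update_shared` on the master, which mirrors the restriction the abstract places on phase (b).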
In this paper, we study a hardware-supported, compiler-directed (HSCD) cache coherence scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors, such as the Cray T3D. It can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system-related issues, including critical sections, inter-thread communication, and task migration, have also been addressed. The cost of the required hardware support is small and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data-flow analysis, have been implemented on the Polaris compiler [17]. From our simulation study using the Perfect Club benchmarks, we found that, in spite of the conservative analysis made by the compiler, the performance of the proposed HSCD scheme can be comparable to that of a full-map hardware directory scheme. With its comparable performance and reduced hardware cost, the scheme can be a viable alternative for large-scale multiprocessors, such as the Cray T3D, that rely on users to maintain data coherence.
This paper presents neural and hybrid (symbolic and subsymbolic) applications downloaded on the distributed computer architecture ArMenX. This machine is articulated around a ring of FPGAs acting as routing resources as well as fine grain computing resources and thus giving great flexibility. More coarse grain computing resources (Transputer and DSP), tightly coupled via FPGAs, give a large application spectrum to the machine, making it possible to implement heterogeneous algorithms efficiently involving both low level (computing intensive) and high level (control intensive) tasks. We first introduce the ArMenX project and the main architecture features. Then, after giving details on the computing of propagation and back-propagation of the multi-layer perceptron on ArMenX, we will focus on a handwritten digit (issued from a zip code data base) recognition application. An original and efficient method, involving three neural networks, is developed. The first two neural networks deal with the 'reading process', and the last neural network, which learned to write, helps to make decisions on the first two network outputs, when they are not confident. Before concluding, the paper presents the work of integration of ArMenX into a high level programming environment, designed to make it easier to take advantage of the architecture flexibility.
The paper describes, from a software engineering perspective, a framework for the formal development of parallel algorithms on arbitrary architectures. The algorithms are synthesised in a transformational way, i.e. by a...
ISBN: 0780320182 (print)
Parallaxis is a machine-independent language for data-parallel programming, based on sequential Modula-2. Programming in Parallaxis is done on a level of abstraction with virtual processors and virtual connections, which may be defined by the application programmer. This paper describes Parallaxis-III, the current version of the language definition, together with a number of parallel sample algorithms.
The performance and cost-performance benefits of parallel systems make them attractive platforms for many applications. However, these benefits are unfortunately offset by the difficulties of programming parallel computers. Therefore, programming tools are the key to achieving greater success in developing applications for parallel architectures. This paper describes a new tool, VPEcons, for parallel programming development. It uses graphics to assist in the design of parallel programs. To facilitate the portability of the constructor, a VPEcons Builder has also been developed. It is a tool for creating basic component blocks and binding an existing language to the blocks created. The usefulness of the constructor is demonstrated with a parallel discrete-event simulation example and by comparing it with other visual parallel programming tools.
ISBN: 0780320182 (print)
The application of artificial neural networks (ANN) in real-time embedded systems demands high performance computers. Miniaturized massively parallel architectures are suitable computation platforms for this task. An important question which arises is how to establish an effective mapping from ANN algorithms to hardware. In this paper, we demonstrate how an effective mapping can be achieved with our programming environment in close combination with an optimized architecture design targeted for neuro-computing.
ISBN: 081867038X (print)
The proceedings contain 42 papers. The topics discussed include: improvement of duplication scheduling heuristic algorithm with nonstrict triggering of program graph nodes; cohesion: an efficient distributed shared memory system supporting multiple memory consistency models; supercompilers for massively parallel architectures; investigation of some hardware accelerators for relational algebra operations; implementing higher-order gamma on MasPar: a case study; a framework for visual parallel programming; parallelizing a PDE solver: experiences with PISCES-MP; efficient scalable mesh algorithms for merging, sorting and selection; and constructing parallel implementations with algebraic programming tools.