There are many algorithms for the space-time mapping of nested loops. Some of them even make the optimal choices within their framework. We propose a preprocessing phase for algorithms in the polytope model, which extends the model and yields space-time mappings whose schedule is, in some cases, orders of magnitude faster. These are cases in which the dependence graph has small irregularities. The basic idea is to split the index set of the loop nests into parts with a regular dependence structure and to apply the existing space-time mapping algorithms to these parts individually. This work is based on a seminal idea in the more limited context of loop parallelization at the code level. We elevate the idea to the model level (our model is the polytope model), which increases its applicability by providing a clearer and wider range of choices at an acceptable analysis cost. Index set splitting is one facet in the effort to extend the power of the polytope model and to enable the generation of competitive target code.
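A minimal sketch of the splitting idea on a toy loop (the loop, bounds, and split point are hand-picked for illustration and are not taken from the paper): iteration i reads a[N - i], so iterations up to N/2 read only elements written later in the loop, while iterations beyond N/2 read only elements written earlier. Splitting the index set at N/2 yields two parts, each with a regular dependence structure, and here each part is fully parallel on its own:

/* Index set splitting illustration (OpenMP); the original sequential
 * loop is  for (i = 1; i < N; i++) a[i] = a[N - i] + 1;              */
#include <omp.h>
#include <stdio.h>

#define N 100

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    /* part 1: i in [1, N/2] reads a[N-i] only from the untouched upper
     * half, so its iterations carry no dependences among themselves    */
    #pragma omp parallel for
    for (int i = 1; i <= N / 2; i++)
        a[i] = a[N - i] + 1;

    /* part 2: i in (N/2, N) reads only values produced by part 1, so
     * once part 1 has finished, these iterations are also independent  */
    #pragma omp parallel for
    for (int i = N / 2 + 1; i < N; i++)
        a[i] = a[N - i] + 1;

    printf("a[1]=%f a[N-1]=%f\n", a[1], a[N - 1]);
    return 0;
}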
The most promising technique for automatically parallelizing loops when the system cannot determine dependences at compile time is speculative parallelization. Also called thread-level speculation, this technique optimistically assumes that the system can execute all iterations of a given loop in parallel. A hardware or software monitor divides the iterations into blocks and assigns them to different threads, one per processor, with no prior dependence analysis. If the system discovers a dependence violation at runtime, it stops the incorrectly computed work and restarts it with correct values. Of course, the more parallel the loop, the greater the benefit this technique delivers. To understand how speculative parallelization works, it is necessary to distinguish between private and shared variables. Informally speaking, private variables are those that the program always modifies in each iteration before using them; values stored in shared variables, on the other hand, are used across different iterations.
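A much simplified software sketch of the speculate/validate/rollback cycle described above (in the spirit of inspector-style tests such as LRPD, not necessarily the mechanism the article has in mind; the indirection array idx and the work function are made up). Threads run all iterations in parallel, writing speculative state into a shadow copy and atomically counting writes per element; because each iteration here reads and writes only a[idx[i]], an element written by two different iterations signals a possible cross-iteration dependence, in which case the speculative state is discarded and the loop is re-executed sequentially:

/* Simplified software speculation: speculative writes go to shadow[],
 * a real TLS system would buffer and version this state per thread.   */
#include <omp.h>
#include <stdio.h>
#include <string.h>

#define N 1000

static double a[N], shadow[N];
static int idx[N], writes[N];

static double work(double x) { return 0.5 * x + 1.0; }

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; idx[i] = (i * 7) % N; }
    memcpy(shadow, a, sizeof a);

    #pragma omp parallel for          /* speculative parallel run     */
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        writes[idx[i]]++;             /* monitor records the write    */
        double v = work(a[idx[i]]);
        #pragma omp atomic write
        shadow[idx[i]] = v;
    }

    int violated = 0;                 /* validation phase             */
    for (int e = 0; e < N; e++)
        if (writes[e] > 1) violated = 1;

    if (!violated) {
        memcpy(a, shadow, sizeof a);  /* commit speculative state     */
    } else {                          /* rollback: a[] is untouched,  */
        for (int i = 0; i < N; i++)   /* so just re-run sequentially  */
            a[idx[i]] = work(a[idx[i]]);
    }
    printf("violated=%d, a[10]=%f\n", violated, a[10]);
    return 0;
}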
ISBN (print): 9783319393841
An approach to building free schedules for the RNA folding algorithm is proposed. Under a free schedule, statements are executed in parallel as soon as all their operands are available. This technique requires exact dependence analysis for automatic parallelization of the Nussinov algorithm. To describe and implement the algorithm, the dependence analysis by Pugh and Wonnacott was chosen, in which dependences are found in the form of tuple relations. The approach has been implemented and verified by means of the islpy and CLooG tools as part of the TRACO compiler. The experimental study presents the speed-up, scalability, and parallelism costs of the output code. Related work and future tasks are described.
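For illustration, a common legal parallel schedule for the Nussinov recurrence (a simple wavefront, not necessarily the exact free schedule the paper derives; the toy sequence and scoring function are made up): every cell on an anti-diagonal j - i = d depends only on shorter spans, so all cells of a diagonal can run in parallel once the previous diagonals are done:

/* Wavefront (anti-diagonal) parallelization of the Nussinov recurrence */
#include <omp.h>
#include <stdio.h>

#define N 8
static int S[N][N];                      /* S[i][j]: max pairs in i..j  */
static const char rna[N + 1] = "GCAUCUAG";

static int pair(char a, char b) {        /* 1 if bases can pair         */
    return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
           (a == 'G' && b == 'C') || (a == 'C' && b == 'G');
}

static int max(int a, int b) { return a > b ? a : b; }

int main(void) {
    for (int d = 1; d < N; d++) {        /* span length: sequential     */
        #pragma omp parallel for         /* cells on one diagonal are   */
        for (int i = 0; i < N - d; i++) {/* mutually independent        */
            int j = i + d;
            int best = S[i + 1][j - 1] + pair(rna[i], rna[j]);
            for (int k = i; k < j; k++)  /* best split into two spans   */
                best = max(best, S[i][k] + S[k + 1][j]);
            S[i][j] = best;
        }
    }
    printf("max pairs = %d\n", S[0][N - 1]);
    return 0;
}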
ISBN (print): 9780769535234
We present an algorithm for solving Diophantine equations that are linear in the variables but non-linear in one parameter. We are looking for the pointwise solutions, i.e., the solutions for the unknowns as a function of the value of the parameter. Solving Diophantine equations is central to computing the data dependences of certain codes (loops with certain array accesses) that occur often in scientific computing. Our algorithm enables the computation of data dependences in more general situations than is possible with current algorithms.
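As background, the purely linear case that the paper generalizes: a dependence between a write to A[2i+1] and a read of A[4j+3] requires an integer solution of 2i - 4j = 2, which the extended Euclidean algorithm decides and, if solvable, solves. The coefficients below are made-up illustration values, and the paper's actual contribution, handling a non-linear parameter, is not shown:

/* Classical GCD test via extended Euclid: a*x + b*y = gcd(a, b) */
#include <stdio.h>

static long ext_gcd(long a, long b, long *x, long *y) {
    if (b == 0) { *x = 1; *y = 0; return a; }
    long x1, y1, g = ext_gcd(b, a % b, &x1, &y1);
    *x = y1;
    *y = x1 - (a / b) * y1;
    return g;
}

int main(void) {
    /* A[2*i + 1] written, A[4*j + 3] read: solve 2*i - 4*j = 2 */
    long a = 2, c = -4, rhs = 2, i0, j0;
    long g = ext_gcd(a, c, &i0, &j0);
    if (rhs % g != 0) {
        puts("no integer solution: the accesses are independent");
    } else {
        i0 *= rhs / g;                /* scale the Bezout pair to one   */
        j0 *= rhs / g;                /* particular solution            */
        printf("possible dependence, e.g. i=%ld, j=%ld\n", i0, j0);
    }
    return 0;
}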
ISBN (print): 9783319071725; 9783319071732
Artificial neural networks (ANNs) are often used to solve a wide variety of problems with high-performance computing. The paper presents automatic loop parallelization for selected ANN programs by means of the TRACO compiler, which permits us to extract loop dependences and produce synchronization-free slices of loop statement instances. Coarse-grained parallelism of nested program loops is obtained by creating a thread of computations on each processor, to be executed independently. Program loops of recurrent and back-propagation networks are analysed. The speed-up and efficiency of parallel programs produced by means of TRACO are studied. Related compilers and ANN parallelization techniques are considered. Future work is outlined.
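An illustration of the kind of synchronization-free slicing opportunity such ANN loops expose (layer sizes and data are invented, and this is not TRACO output): in a fully connected layer, each output neuron reads the shared inputs but writes only its own result, so the outer loop decomposes into independent slices that can each run on a separate thread with no synchronization between them:

/* Synchronization-free slices in a fully connected layer (OpenMP) */
#include <omp.h>
#include <math.h>
#include <stdio.h>

#define IN 64
#define OUT 32

static double w[OUT][IN], x[IN], y[OUT];

int main(void) {
    for (int i = 0; i < IN; i++) x[i] = 0.01 * i;
    for (int o = 0; o < OUT; o++)
        for (int i = 0; i < IN; i++) w[o][i] = 0.001 * (o + i);

    /* each iteration o is one slice: no dependences cross o */
    #pragma omp parallel for
    for (int o = 0; o < OUT; o++) {
        double s = 0.0;
        for (int i = 0; i < IN; i++)
            s += w[o][i] * x[i];      /* shared reads, private sum   */
        y[o] = tanh(s);               /* single write, owned by o    */
    }
    printf("y[0]=%f y[%d]=%f\n", y[0], OUT - 1, y[OUT - 1]);
    return 0;
}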
In this paper, we develop an automatic compile-time computation and data decomposition technique for distributed-memory machines. Our method handles complex programs containing perfect and non-perfect loop nests with or without loop-carried dependences. Applying our algorithms, a program is divided into collections of loop nests (called clusters) such that data redistributions are allowed only between the clusters. Within each cluster of loop nests, decomposition and data locality constraints are formulated as a system of homogeneous linear equations, which is solved by polynomial-time algorithms. Our algorithm can selectively relax data locality constraints within a cluster to achieve a balance between parallelism and data locality. Such relaxations are guided by exploiting the hierarchical program nesting structure from outer to inner nesting levels, so as to keep communications at the outermost possible level.
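A small worked example of how such locality constraints become a homogeneous linear system (the loop, mappings, and notation are invented for illustration and are not taken from the paper). Suppose a cluster contains the nest "for i, for j: A[i][j] = f(B[j][i])" and each array is placed by an affine processor mapping
\[
  d_A(i,j) = \alpha_1 i + \alpha_2 j, \qquad
  d_B(i,j) = \beta_1 i + \beta_2 j .
\]
Communication-free locality requires that the element of B read in iteration (i,j) reside on the processor that owns A[i][j]:
\[
  d_A(i,j) = d_B(j,i) \;\;\forall i,j
  \iff
  (\alpha_1 - \beta_2)\, i + (\alpha_2 - \beta_1)\, j = 0 \;\;\forall i,j ,
\]
i.e., the homogeneous system
\[
  \alpha_1 - \beta_2 = 0, \qquad \alpha_2 - \beta_1 = 0 .
\]
A nontrivial solution such as \(\alpha_1 = \beta_2 = 1,\ \alpha_2 = \beta_1 = 0\) pairs a row distribution of A with a column distribution of B at zero communication cost; when only the trivial solution exists, a relaxation step of the kind the paper describes would trade some locality for parallelism.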