ISBN: 0780335295 (print)
For imperfect loop nests, some researchers have suggested first using Abu-Sufah's Non-Basic-to-Basic-loop transformation to convert them into perfect loop nests and then applying the unimodular transformation approach. This paper identifies some limitations of this approach, describes some improvements, and provides some insights into the problem of restructuring imperfect loop nests.
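As a rough illustration of what a unimodular transformation does once a nest is perfect, the sketch below checks whether a 2x2 integer matrix is a legal transformation for a two-deep perfect loop nest. It uses the standard textbook criterion (determinant of plus or minus one, and every transformed dependence distance vector lexicographically positive). The function names, the example matrices, and the example dependence are illustrative assumptions, not taken from the paper, and the sketch does not cover the imperfect-nest case the paper addresses.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def det2(t: Matrix) -> int:
    """Determinant of a 2x2 integer matrix."""
    return t[0][0] * t[1][1] - t[0][1] * t[1][0]


def apply(t: Matrix, v: Sequence[int]) -> Tuple[int, ...]:
    """Matrix-vector product: maps a distance vector into the new loop order."""
    return tuple(sum(row[k] * v[k] for k in range(len(v))) for row in t)


def lex_positive(v: Sequence[int]) -> bool:
    """True if the first non-zero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the all-zero vector is not lexicographically positive


def is_legal(t: Matrix, distances: Sequence[Sequence[int]]) -> bool:
    """Unimodular (|det| == 1) and preserves the order of every dependence."""
    return abs(det2(t)) == 1 and all(lex_positive(apply(t, d)) for d in distances)


# Hypothetical example matrices for a nest over (i, j).
INTERCHANGE = [[0, 1], [1, 0]]  # swap the i and j loops
SKEW = [[1, 0], [1, 1]]         # replace j by i + j

# Hypothetical loop body a[i][j] = a[i-1][j+1] + 1 has distance vector (1, -1).
deps = [(1, -1)]

print(is_legal(INTERCHANGE, deps))  # interchange maps (1, -1) to (-1, 1)
print(is_legal(SKEW, deps))         # skewing maps (1, -1) to (1, 0)
```

In this example, interchange alone reverses the dependence and is rejected, while skewing keeps it lexicographically positive.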
ISBN: 0818674601 (print)
Many current performance analysis systems offer little more than basic measurement and analysis facilities for locating the sources of poor performance, such as load imbalance, communication overhead, and synchronization loss. We believe that this is only part of the solution and that a system providing higher-level performance measurement and analysis is highly desirable. In this paper, we describe a new approach to designing performance tuning tools for parallel processing systems. A primary contribution of this work is to explore how the strategies and algorithms used in parallel programs contribute to poor performance. To detect these strategies and algorithms, a technique called Automatic Program Analysis is used. Our goal is to provide users with higher-level performance advice. We present a case study describing how a prototype implementation of our technique was able to identify a performance problem and provide tuning advice.
Various tridiagonal solvers have been proposed in recent years for different parallel platforms. This paper studies the performance of three tridiagonal solvers: the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm. These algorithms are designed for distributed-memory machines and are tested on an Intel Paragon and an IBM SP2. Measured results are reported in terms of execution time and speedup, and they match the analytical results closely. In addition to implementation issues, performance considerations such as problem sizes and models of speedup are also discussed.
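For readers unfamiliar with the underlying problem, the following is a minimal sketch of the standard sequential Thomas algorithm for a tridiagonal system. It is not one of the three parallel solvers studied in the paper; it is only the kind of sequential baseline against which the speedup of a parallel solver might be measured. The function name and the array layout (all four inputs of length n, with lower[0] and upper[n-1] ignored) are assumptions made for this example, and no pivoting is performed, so the sketch assumes a well-conditioned system such as a diagonally dominant one.

```python
from typing import List, Sequence


def thomas_solve(lower: Sequence[float], diag: Sequence[float],
                 upper: Sequence[float], rhs: Sequence[float]) -> List[float]:
    """Solve a tridiagonal system without pivoting.

    Row i of the system is: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[n-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Diagonally dominant 4x4 example whose exact solution is [1, 2, 3, 4].
solution = thomas_solve(
    lower=[0.0, 1.0, 1.0, 1.0],
    diag=[4.0, 4.0, 4.0, 4.0],
    upper=[1.0, 1.0, 1.0, 0.0],
    rhs=[6.0, 12.0, 18.0, 19.0],
)
print(solution)
```

The forward sweep carries a dependence from row i-1 to row i, which is why the sequential recurrence does not parallelize directly.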
We present a simple and general parallel sorting scheme, ZZ-sort, which can be used to derive a class of efficient in-place sorting algorithms on realistic parallel machine models. We prove a tight bound for the worst-case performance of ZZ-sort. We also demonstrate the average performance of ZZ-sort with experimental results obtained on a MasPar parallel computer. Our experiments indicate that ZZ-sort can be incorporated into a distributed-memory parallel computer system as a standard routine, and that this routine is useful in space-critical situations. Finally, by considering the problem of sorting an arbitrarily large input on fixed-size reconfigurable meshes, we show that ZZ-sort can be used to convert a non-adaptive parallel sorting algorithm into an in-place and adaptive one.
ISBN: 0818674601 (print)
Whole array operations and array section operations are important features of many data-parallel languages. Efficient implementation of these operations on distributed-memory multicomputers is critical to the scalability and high performance of data-parallel programs. We present an approach for analyzing the communication patterns induced by array operations and for scheduling message flow based on that information. Our scheduling algorithm guarantees contention-free data transfer and utilizes network resources optimally. It incurs little overhead and is suitable for use in compilers and runtime libraries. We also present simulation results that demonstrate the algorithm's superiority to the asynchronous transfer mode that is commonly used for this type of communication.
ISBN: 0818674601 (print)
In this paper, we present an implementation of an FEM solver on a heterogeneous parallel environment that consists of two different parallel computers (the Fujitsu AP1000 and the NEC Cenju-3). The computers were connected through the OLU (On Line University) Network, a wide-area ATM network linking over 20 universities and research facilities across Japan. The calculation algorithm is a substructure method applied at multiple levels. This algorithm has few data dependencies in the calculation phase and requires only a limited number of data transfers between the parallel computers, so its synchronization overhead can be smaller than that of an iterative method. This paper also describes a parallel triangular mesh generator based on Delaunay triangulation that is currently being implemented on the Fujitsu AP1000 parallel computer.
ISBN: 0818674601 (print)
Speedup is commonly used to reflect the effect of parallel processing systems. Existing speedup models, however, do not consider the effect of cache, so this paper analyses the effect of cache on several speedup models.
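The abstract does not name the speedup models it analyses. As background only, the sketch below evaluates two classical models, Amdahl's law (fixed problem size) and Gustafson's law (scaled problem size). Choosing these two is an assumption made for illustration; neither formula contains a cache term, which is the gap the paper sets out to examine. The function and parameter names are hypothetical.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Fixed-size speedup: S = 1 / (s + (1 - s) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Scaled speedup: S = s + (1 - s) * p."""
    return serial_fraction + (1.0 - serial_fraction) * processors


if __name__ == "__main__":
    s = 0.1  # assumed serial fraction, chosen only for the example
    for p in (1, 10, 100):
        print(p, amdahl_speedup(s, p), gustafson_speedup(s, p))
```

Both models predict a speedup of at most p on p processors, because each treats the per-operation cost as independent of where the data resides.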
Cellular automata (CA) are fully parallel computational models and are widely applied to the numerical modelling of complex or nonlinear systems, such as fluid dynamics. Such systems are often governed by nonlinear partial differential equations that are hard to solve with traditional numerical methods. In this paper, a general CA-based model for a class of evolutionary physical systems is proposed. As an example, a CA-like model for a nonlinear parabolic equation is built by using multi-scalar analysis. The model is applied to several typical problems and satisfactory results are achieved.
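To give a feel for what a CA-like model of a parabolic equation can look like, here is a minimal sketch of a synchronous, nearest-neighbour update rule on a one-dimensional ring of cells. It corresponds to the explicit finite-difference scheme for the linear diffusion equation u_t = D u_xx and is an assumption chosen for simplicity: it is not the model derived in the paper, which treats a nonlinear equation. The names ca_step and run and the parameter r (standing for D*dt/dx^2, kept at or below 0.5 for stability) are hypothetical.

```python
from typing import List, Sequence


def ca_step(cells: Sequence[float], r: float) -> List[float]:
    """One synchronous update: each cell looks only at its two neighbours.

    new[i] = old[i] + r * (old[i-1] - 2*old[i] + old[i+1]), periodic boundary.
    """
    n = len(cells)
    return [
        cells[i] + r * (cells[(i - 1) % n] - 2.0 * cells[i] + cells[(i + 1) % n])
        for i in range(n)
    ]


def run(cells: Sequence[float], r: float, steps: int) -> List[float]:
    """Apply the local rule for a given number of time steps."""
    state = list(cells)
    for _ in range(steps):
        state = ca_step(state, r)
    return state


if __name__ == "__main__":
    initial = [0.0, 0.0, 1.0, 0.0, 0.0]  # a single "hot" cell
    print(run(initial, r=0.25, steps=1))
    print(run(initial, r=0.25, steps=50))
```

Because every cell is updated from the previous state of its neighbours only, all cells can be updated in parallel.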
ISBN: 0818674601 (print)
Simulation in train traffic rescheduling is strongly characterized by its dynamic, ever-changing nature. Optimizing train traffic rescheduling requires modifications, such as reallocation of resources, during the simulation run. As a result, different simulation strategies may cause different propagation of delays and therefore different rescheduling results. Efficient rescheduling thus requires a simulation procedure that adapts well to different rescheduling strategies for various purposes. Although several simulation models have been proposed in the past, none has provided a satisfactory solution in this respect. This paper presents a method that accommodates multiple rescheduling strategies at different stages. The adaptability and efficiency of the algorithm have been examined experimentally. The experimental results also show that its execution time is at the same level as that of the previous network-based simulation method.