Recently, cluster computing has successfully provided a cost-effective solution for data-intensive applications. In order to make the programming on clusters easy, many programming toolkits such as MPICH, PVM, and DSM...
详细信息
Recently, cluster computing has successfully provided a cost-effective solution for data-intensive applications. In order to make the programming on clusters easy, many programming toolkits such as MPICH, PVM, and DSM have been proposed in past researches. However, these programming toolkits are not easy enough for common users to develop parallel applications. To address this problem, we have successfully implemented the OpenMP programming interface on a software distributed shared memory system called Teamster. On the other hand, we add a scheduling option called Profiled Multiprocessor Scheduling (PMS) into the OpenMP directive. When this option is selected in user programs, a load balance mechanism based on the PMS algorithm will be performed during the execution of the programs. This mechanism can automatically balance the workload among the execution nodes even when the CPU speed and the processor number of each node are not identical. In this paper, we will present the design and implementation of the OpenMP API on Teamster, and discuss the results of performance evaluation in the test bed.
software distributed shared memory (SDSM)systems usually have the large coherence granularity that is imposed by the underlying virtual memory page size. To alleviate the coherence overheads such as the net worktraffi...
详细信息
ISBN:
(纸本)9780897919845
software distributed shared memory (SDSM)systems usually have the large coherence granularity that is imposed by the underlying virtual memory page size. To alleviate the coherence overheads such as the net worktraffic to preserve the coherence, or page misses caused by false sharing, relaxed memory models are widely accepted for the SDSM systems. In the relaxed memory models, when a shared page is modified, in validation requests to other copies are deferred until a synchronization point and, in addition, the requests are transferred only to the processor acquiring the synchronization variable. On a barrier, however, the invalidation requests must be transferred to all the processors that participate in the barrier. As a result, it tends to induce heavy network traffic, and also may lead to useless page misses by false *** this paper, we propose a method to alleviate the coherence overheads of barrier synchronization in shared-memory parallel programs. It performs static analysis to examine data dependency between processors across global barriers, and then inserts special primitives into the program in order to exploit the dependency information at run time. The static analysis finds out coderegions where a processor modifies data that will be used only by some of the other processors. At run time, the coherence messages for the data are transferred only to the processors with the help of the inserted primitives. In particular, if the modified data will not be used by any other processors, the primitives enforce that the coherence messages are delivered only to master processor when the parallel execution of the program is *** evaluated the performance of this method in a 16-node software DSM system supporting AURC protocol. Program-driven simulation was performed with five benchmark programs: Jacobi, Red-black SOR, Expl, LU, and Water-nsquared. For the applications, the experimental results show that our method can reduce the coherence messages by
暂无评论