In software distributed shared memory (SDSM) systems, the large coherence granularity imposed by the virtual memory page size tends to induce false sharing, which may lead to heavy network traffic or useless page misses on barrier operations. In this paper, we propose a method to alleviate the coherence overhead of barrier synchronization in SDSM systems. It performs static analysis on a shared-memory program to examine data dependencies between processors across global barriers, and then inserts special primitives into the program to exploit the dependency information at run time. If data modified before a barrier will be accessed by only some of the other processors after the barrier, the inserted primitives transfer coherence messages only to those processors. Furthermore, if the modified data will not be used by any other processor, the primitives ensure that the coherence messages are delivered only to the master process, after the parallel execution of the program completes. We implemented the static analysis with the SUIF parallelizing compiler and evaluated the execution performance of the modified programs on a 16-node SDSM system supporting the AURC protocol. The experimental results show that our method is very effective at reducing useless coherence messages, and that it can also substantially improve execution time by reducing false sharing misses.
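The effect described in this abstract can be illustrated with a toy message count. The sketch below is not the paper's implementation: the function names, the page-level bookkeeping, and the hard-coded reader sets (which stand in for the output of the static analysis) are all assumptions made for illustration. It compares broadcasting a notice for every modified page to all other processors with sending notices only to the processors that will read the page after the barrier, deferring the rest to the master.

```python
from typing import Dict, List, Set, Tuple

Notice = Tuple[int, int, int]  # (source processor, destination processor, page)


def broadcast_notices(writes: Dict[int, Set[int]], n_procs: int) -> List[Notice]:
    """Baseline: each writer notifies every other processor at the barrier."""
    return [(w, dst, page)
            for w, pages in sorted(writes.items())
            for page in sorted(pages)
            for dst in range(n_procs) if dst != w]


def targeted_notices(writes: Dict[int, Set[int]],
                     readers_after: Dict[int, Set[int]],
                     master: int = 0) -> Tuple[List[Notice], List[Notice]]:
    """Return (notices sent at the barrier, notices deferred to program end).

    readers_after is a hypothetical stand-in for the dependency information
    that the abstract says is obtained by static analysis.
    """
    at_barrier: List[Notice] = []
    deferred: List[Notice] = []
    for w, pages in sorted(writes.items()):
        for page in sorted(pages):
            consumers = sorted(readers_after.get(page, set()) - {w})
            if consumers:
                # Only the processors that will read the page are notified.
                at_barrier.extend((w, dst, page) for dst in consumers)
            elif w != master:
                # Nobody else reads it: deliver to the master at the end.
                deferred.append((w, master, page))
    return at_barrier, deferred


# Four processors, each modifying one page before the barrier.
writes = {0: {10}, 1: {11}, 2: {12}, 3: {13}}
readers_after = {10: {1}, 11: {0, 2}, 12: set(), 13: {3}}

baseline = broadcast_notices(writes, n_procs=4)
at_barrier, deferred = targeted_notices(writes, readers_after)
print(len(baseline), len(at_barrier), len(deferred))  # prints "12 3 2"
```

In this made-up example the barrier traffic drops from 12 notices to 3, with 2 more delivered to the master after the parallel phase. The numbers only illustrate the mechanism and say nothing about the results reported in the paper.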
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions. (C) 2000 Academic Press.
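To make the two levels of parallelism concrete, the sketch below shows one way a parallel loop could be divided first among SMP nodes and then among the threads inside each node. The abstract does not describe the translator's scheduling, so the block partitioning and the names `block_range` and `two_level_schedule` are assumptions for illustration only; no TreadMarks or OpenMP call is used.

```python
from typing import Dict, Tuple


def block_range(n_iters: int, parts: int, idx: int) -> Tuple[int, int]:
    """Half-open range [start, end) of block idx when n_iters is split into parts."""
    base, extra = divmod(n_iters, parts)
    start = idx * base + min(idx, extra)
    return start, start + base + (1 if idx < extra else 0)


def two_level_schedule(n_iters: int, n_nodes: int,
                       threads_per_node: int) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map (node, thread) to the loop iterations it would execute."""
    schedule = {}
    for node in range(n_nodes):
        n_lo, n_hi = block_range(n_iters, n_nodes, node)
        for t in range(threads_per_node):
            # Threads on one node split that node's block; in the system the
            # abstract describes, they would share data through hardware
            # shared memory rather than through SDSM messages.
            t_lo, t_hi = block_range(n_hi - n_lo, threads_per_node, t)
            schedule[(node, t)] = (n_lo + t_lo, n_lo + t_hi)
    return schedule


# Four four-processor nodes, matching the configuration in the abstract.
sched = two_level_schedule(n_iters=100, n_nodes=4, threads_per_node=4)
print(sched[(0, 0)], sched[(3, 3)])  # prints "(0, 7) (94, 100)"
```

With this kind of layout, only the boundaries between node-level blocks involve inter-node communication, which is consistent with the abstract's observation that intranode threads reduce the data and messages sent between nodes.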
ISBN: 9780897919845 (print)
Software distributed shared memory (SDSM) systems usually have a large coherence granularity that is imposed by the underlying virtual memory page size. To alleviate coherence overheads such as the network traffic needed to preserve coherence, or page misses caused by false sharing, relaxed memory models are widely accepted for SDSM systems. In the relaxed memory models, when a shared page is modified, invalidation requests to other copies are deferred until a synchronization point and, in addition, the requests are transferred only to the processor acquiring the synchronization variable. On a barrier, however, the invalidation requests must be transferred to all the processors that participate in the barrier. As a result, a barrier tends to induce heavy network traffic and may also lead to useless page misses caused by false sharing. In this paper, we propose a method to alleviate the coherence overheads of barrier synchronization in shared-memory parallel programs. It performs static analysis to examine data dependencies between processors across global barriers, and then inserts special primitives into the program in order to exploit the dependency information at run time. The static analysis finds code regions where a processor modifies data that will be used only by some of the other processors. At run time, the coherence messages for the data are transferred only to those processors with the help of the inserted primitives. In particular, if the modified data will not be used by any other processor, the primitives ensure that the coherence messages are delivered only to the master processor when the parallel execution of the program is completed. We evaluated the performance of this method in a 16-node software DSM system supporting the AURC protocol. Program-driven simulation was performed with five benchmark programs: Jacobi, Red-black SOR, Expl, LU, and Water-nsquared. For the applications, the experimental results show that our method can reduce the coherence messages by