We present recent results in the application of distributed shared memory to image parallel ray tracing on clusters. Image parallel rendering is traditionally limited to scenes that are small enough to be replicated in the memory of each node, because any processor may require access to any piece of the scene. We solve this problem by making all of a cluster's memory available through software distributed shared memory layers. With gigabit ethernet connections, this mechanism is sufficiently fast for interactive rendering of multi-gigabyte datasets. Object- and page-based distributed shared memories are compared, and optimizations for efficient memory use are discussed. (c) 2005 Elsevier B.V. All rights reserved.
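As a rough illustration of the page-based idea, the sketch below shows a rendering node that keeps a small cache of scene pages and fetches a page from its owner only on a miss. It is a minimal sketch under stated assumptions, not the system described in the abstract: `PageCache`, `fetch_remote`, `PAGE_SIZE`, and the LRU eviction policy are all hypothetical, and the "network" is a plain Python list.

```python
from collections import OrderedDict

PAGE_SIZE = 4  # items per page (assumed; tiny so the example is easy to follow)


class PageCache:
    """Hypothetical LRU cache of remote scene pages held by one node."""

    def __init__(self, fetch_remote, capacity):
        self.fetch_remote = fetch_remote  # callable: page_id -> list of items
        self.capacity = capacity          # maximum number of cached pages
        self.pages = OrderedDict()        # page_id -> page contents, LRU first
        self.misses = 0

    def read(self, index):
        """Return the scene item at a global index, fetching its page on a miss."""
        page_id, offset = divmod(index, PAGE_SIZE)
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # mark as most recently used
        else:
            self.misses += 1
            self.pages[page_id] = self.fetch_remote(page_id)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used page
        return self.pages[page_id][offset]


# Stand-in for the cluster: the whole "scene" lives in one list here.
scene = list(range(100, 132))  # 32 items = 8 pages


def fetch_remote(page_id):
    """Pretend to fetch one page from the node that owns it."""
    start = page_id * PAGE_SIZE
    return scene[start:start + PAGE_SIZE]


cache = PageCache(fetch_remote, capacity=2)
values = [cache.read(i) for i in (0, 1, 5, 0, 9, 5)]
```

With a capacity of two pages, the access pattern above reuses page 0 while it is resident and re-fetches page 1 after it has been evicted, which is the trade-off between memory use and network traffic that such a layer has to manage.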
ISBN: 0769523803 (print)
This paper evaluates the performance of a novel View-Oriented Parallel Programming style for parallel programming on cluster computers. View-Oriented Parallel Programming is based on distributed shared memory, which is friendly and easy for programmers to use. It requires the programmer to divide shared data into views according to the memory access pattern of the parallel algorithm. One advantage of this programming style is that it gives the underlying distributed shared memory system the potential to optimize consistency maintenance. It also allows the programmer to participate in the performance optimization of a program through wise partitioning of the shared data into views. Experimental results demonstrate a significant performance gain for programs based on the View-Oriented Parallel Programming style.
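The idea of dividing shared data into views can be sketched as follows. This is a minimal single-process illustration using threads, assuming each view is a partition of shared data that must be acquired before use and released afterwards; the `View` class and `worker` function are hypothetical names, not the API evaluated in the paper, and a real system would run across cluster nodes on top of a distributed shared memory.

```python
import threading


class View:
    """Hypothetical view: a partition of shared data guarded by its own lock."""

    def __init__(self, data):
        self.data = data
        self._lock = threading.Lock()

    def __enter__(self):       # plays the role of acquiring the view
        self._lock.acquire()
        return self.data

    def __exit__(self, *exc):  # plays the role of releasing the view
        self._lock.release()
        return False


def worker(my_view, total_view, values):
    # Each worker writes only to its own view, so workers never contend here.
    with my_view as part:
        part.extend(v * v for v in values)
        subtotal = sum(part)
    # The shared total is a separate view, touched only briefly.
    with total_view as total:
        total[0] += subtotal


chunks = [[1, 2], [3, 4], [5, 6]]
part_views = [View([]) for _ in chunks]
total_view = View([0])

threads = [
    threading.Thread(target=worker, args=(pv, total_view, chunk))
    for pv, chunk in zip(part_views, chunks)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

result = total_view.data[0]
```

Because the partitioning follows the access pattern (one private view per worker, one small shared view for the total), the runtime knows exactly which data each acquire touches, which is the information a consistency protocol could exploit.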
ISBN: 0769524869 (print)
In this paper, we propose and analyze the behavior and performance of a reconfigurable algorithm for managing the consistency of shared objects in distributed systems. Object sharing allows nodes to access the same set of replicated objects concurrently. However, the nodes must know when and how to perform these accesses in order to avoid inconsistencies in the objects' state. The RCA (Reconfigurable Consistency Algorithm) is a reconfigurable algorithm that guarantees object consistency. It modifies its behavior and structure according to changes in the workload and in the parameters of the distributed system. The paper shows that using RCA adds flexibility and improves performance by 30% on average.
ISBN: 1842331159 (print)
This paper presents a reconfigurable computing environment for building hierarchical traffic telematics distributed systems based on a non-locking distributed shared memory algorithm. The algorithm aims mainly at minimising the total time for data retrieval in a network of workstations, from the point of view of distributed traffic modules. The framework presented in this paper adopts a non-locking model to achieve the required performance. It further develops the successful features of DINE (developed and designed at SOCI, NTU) while avoiding its shortcomings. The experimental results show that the new framework outperforms the old design of the system.
ISBN: 8955191235 (print)
This paper presents a computing environment for building hierarchical traffic telematics distributed systems based on a non-locking distributed shared memory algorithm. The algorithm aims mainly at minimising the total time for data retrieval in a network of workstations, from the point of view of distributed traffic modules. The framework presented in this paper adopts a non-locking model to achieve the required performance. It further develops the successful features of DIME (developed and designed at SOCI, NTU) while avoiding its shortcomings. The experimental results show that the new framework outperforms the old design of the system.
ISBN: 076952513X (print)
DSM systems provide an easy-to-use programming model for parallel and distributed systems, but it is sometimes difficult to reach the performance characteristics of low-level message-passing programs, in particular if these have been optimized towards a specific architecture. In this article, we propose a multi-layered realization of a DSM system which provides different programming abstractions, including a level which allows an explicit control of the data placement. The programmer can select an appropriate level of abstraction for his application and it is even possible to mix program parts realized at different abstraction levels. The article gives a description of the multi-layered model, describes a prototype realization of the system and presents some preliminary experimental results on a heterogeneous system.
The performance of three binding schemes for memory local to a node is evaluated. Because a large working set (relative to the cache size) can incur many cache misses, binding at page-fault time alone cannot efficiently exploit locality of reference in the local memory. With a small working set, binding an address to the local memory at node-miss time is not effective because cache miss rates are low. Our simulation shows that binding at cache-miss time achieves up to 3.1 times and 2.4 times the performance of binding at page-fault time and at node-miss time, respectively.
This paper describes Proteus, a distributed shared memory (DSM) system which supports runtime node reconfiguration. Proteus allows users to change the node set during the execution of a DSM program. The capability of node addition allows users to further shorten the execution time of their DSM programs by dynamically adding newly available nodes to the system. Furthermore, competition for resources between system users and computer owners can be avoided by dynamically deleting nodes from the system. To make the system adapt to the node configuration efficiently, Proteus employs several techniques, including adaptive workload redistribution, affinity page movement, and forced update. Proteus supports both sequential consistency and release consistency. It provides an object-oriented parallel programming environment. This paper describes the design and implementation of node reconfiguration in Proteus, and presents the performance of the system. Experimental results indicate that Proteus can further improve the performance of the tested programs by taking advantage of node reconfiguration. Our results further demonstrate that the techniques employed in Proteus minimize communication and overhead. (C) 2001 Elsevier Science Inc. All rights reserved.
Distributed shared memory (DSM) multiprocessors typically require disjoint networks for deadlock-free execution of cache coherence protocols. This is normally achieved by implementing virtual networks with the help of virtual channels or virtual lanes multiplexed on a single physical network. To keep the coherence protocol simple, messages are usually assigned to virtual lanes in a predefined static manner based on a cycle-free lane assignment dependence graph. However, this static split of virtual networks (such as request and reply networks) may lead to underutilization of certain virtual networks while saturating the other networks. In this paper, we explore different static and dynamic schemes to select the virtual lanes for outgoing messages and mix the load among them without restricting any particular type of message to be carried only by a particular virtual network. We achieve this by exposing the selection algorithms to the coherence protocol itself, so that it can inject messages into selected virtual lanes based on some local information, and still enjoy deadlock-freedom. Our execution-driven simulation on five applications from the SPLASH-2 suite shows that as the system scales, the virtual network selection algorithms play an important role. For 128-node systems, our dynamic selection algorithm speeds up parallel execution by as much as 22 percent over an optimized baseline system running a modified SGI Origin 2000 protocol. We also explore how network latency, the number of message buffers per virtual lane, and the depth of network interface output queues affect the relative performance of various virtual lane selection algorithms.
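The contrast between a fixed message-type-to-lane mapping and a load-aware choice can be sketched as below. This is only an illustration of the load-balancing intuition, assuming a "pick the least occupied permitted lane" rule based on local occupancy counts; the function names and the rule itself are hypothetical, the abstract does not specify the paper's algorithms, and the sketch does not model the constraints needed for deadlock freedom.

```python
def select_lane_static(msg_type, lane_of_type):
    """Static scheme: each message type is pinned to one virtual lane."""
    return lane_of_type[msg_type]


def select_lane_dynamic(occupancy, allowed):
    """Assumed dynamic rule: least occupied permitted lane, lowest id on ties."""
    return min(allowed, key=lambda lane: (occupancy[lane], lane))


lane_of_type = {"request": 0, "reply": 1}
messages = ["request"] * 6 + ["reply"] * 2  # a request-heavy burst

# Static assignment: the request lane saturates while the reply lane idles.
static_load = {0: 0, 1: 0}
for msg in messages:
    static_load[select_lane_static(msg, lane_of_type)] += 1

# Dynamic selection: any message may use either lane, so the load evens out.
dynamic_load = {0: 0, 1: 0}
for msg in messages:
    dynamic_load[select_lane_dynamic(dynamic_load, allowed=[0, 1])] += 1
```

For this burst the static mapping puts six messages on one lane and two on the other, while the load-aware rule splits them evenly, which is the kind of underutilization the abstract attributes to a static split.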