The performance of distributed shared memory depends on the memory coherence algorithms and the access characteristics of shared data. In this paper, we propose an efficient coherence scheme that uses multiple coherence algorithms with a self-adjusting feature. Our method dynamically chooses a more suitable coherence algorithm for each variable class, so an incorrect classification of shared variables does not affect performance. We show that, under each fixed classification, application programs suffer increases of 5.1%, 4.6%, and 48.9% in average execution time compared with the self-adjusting scheme. Experiments show that our approach achieves good performance.
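The self-adjusting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' scheme: the abstract does not name the coherence algorithms or the adjustment rule, so the two protocol labels (`write-invalidate`, `write-update`), the observation window, the write-ratio threshold, and the class name `SelfAdjustingClass` are all assumptions made for illustration.

```python
class SelfAdjustingClass:
    """Tracks one class of shared variables and picks a coherence protocol.

    Hypothetical model: the protocol names, window size, and threshold
    are illustrative assumptions, not taken from the abstract.
    """

    def __init__(self, initial_protocol, window=100, write_ratio_threshold=0.5):
        self.protocol = initial_protocol      # may come from a wrong classification
        self.window = window                  # accesses observed before re-evaluating
        self.threshold = write_ratio_threshold
        self.reads = 0
        self.writes = 0

    def record(self, is_write):
        """Count one access and re-evaluate the protocol once per window."""
        if is_write:
            self.writes += 1
        else:
            self.reads += 1
        if self.reads + self.writes >= self.window:
            self._adjust()

    def _adjust(self):
        write_ratio = self.writes / (self.reads + self.writes)
        # Assumed heuristic: write-heavy data invalidates remote copies,
        # read-heavy data pushes updates to them.
        if write_ratio > self.threshold:
            self.protocol = "write-invalidate"
        else:
            self.protocol = "write-update"
        self.reads = 0
        self.writes = 0


# A class that starts with an unsuitable protocol corrects itself
# after one window of mostly-read accesses.
shared = SelfAdjustingClass("write-invalidate", window=10)
for i in range(10):
    shared.record(is_write=(i == 0))
print(shared.protocol)
```

Under these assumptions, a wrong initial classification costs at most one observation window before the protocol is corrected, which is the property the abstract attributes to the self-adjusting scheme.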
An overview of distributed shared memory (DSM) issues is presented. Memory coherence, design choices, and implementation methods are included. The discussion of design choices covers structure and granularity, coherence semantics, scalability, and heterogeneity. Implementation issues concern data location and access, the coherence protocol, replacement strategy, and thrashing. Algorithms that support process synchronization and memory management are discussed.
We present a cluster-based volume rendering system for roaming very large volumes. This system allows a gigabyte-sized probe to be moved in real time inside a total volume of several tens or hundreds of gigabytes. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that aggregates both graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes, and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing, and rendering for optimal volume roaming.
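The cache-miss path described in this abstract (local cache first, then the other cluster nodes, then local disk) can be sketched as follows. This is a simplified single-process model under stated assumptions, not the system's implementation: the `Node` class, its `fetch` method, and the dictionary standing in for the disk are hypothetical, and network transfer is modelled as a direct lookup in a peer's cache.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One rendering node with an in-memory page cache (hypothetical model)."""
    name: str
    cache: dict = field(default_factory=dict)                # page_id -> bytes
    peers: list = field(default_factory=list, repr=False)    # other Node objects

    def fetch(self, page_id, disk):
        """Return (data, source), where source is 'local', 'peer', or 'disk'."""
        # 1. Hit in this node's own cache.
        if page_id in self.cache:
            return self.cache[page_id], "local"
        # 2. Miss: check page residency on the other nodes before the disk.
        for peer in self.peers:
            data = peer.cache.get(page_id)
            if data is not None:
                self.cache[page_id] = data
                return data, "peer"
        # 3. No node holds the page: fall back to the local disk.
        data = disk[page_id]
        self.cache[page_id] = data
        return data, "disk"


disk = {0: b"brick-0", 1: b"brick-1"}   # stand-in for on-disk volume bricks
a, b = Node("a"), Node("b")
a.peers, b.peers = [b], [a]

print(a.fetch(0, disk))   # first access anywhere: read from disk
print(b.fetch(0, disk))   # resident on node a: served by the peer
print(b.fetch(0, disk))   # now cached on node b: served locally
```

The ordering of the three steps is the design choice the abstract reports on: a page that is resident anywhere on the cluster is fetched over the network rather than reread from disk.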
ISBN: (print) 9781457702518
Distributed shared memory (DSM) is an important technology that provides programmers with the underlying execution mechanism for shared-memory programs. To improve the performance of DSM, recent studies have introduced compiler assistance, in which the compiler generates code for dependency analysis and communication. This paper proposes a high-performance DSM, called Offloaded-DSM, in which the processes of dependency analysis and communication are offloaded to the cluster network. In Offloaded-DSM, the host machine can concentrate on the computation of the application itself, while the network maintains coherency in parallel. In a preliminary evaluation, Offloaded-DSM reduces execution time by up to 32% on eight nodes and exhibits good scalability.
ISBN: (print) 9781509026197
Live virtual machine migration is an essential tool for dynamic resource management in current data centers. Many techniques have been developed to achieve this goal with minimum service interruption. In this paper, we propose a pre-copy live VM migration using a distributed shared memory (DSM) computing model. The setup uses two identical computation nodes to build the environment's service architecture, namely the virtualization infrastructure, the shared storage server, and the DSM and High Performance Computing (HPC) cluster. The custom DSM framework is based on Grappa for low-latency memory updates. The HPC cluster, with OPENMPI and MPI libraries, supports parallelized and auto-parallelized workloads on the CPUs of the computation nodes. The DSM lets the cluster CPUs access the same memory pages, resulting in fewer memory data updates based on locality attributes, which reduces the amount of data transferred through the network. This model improves the live VM migration metrics. Downtime is reduced by 50% for an idle Windows VM workload and by 66.6% for an idle Ubuntu Linux workload. In general, this model not only reduces the downtime and the total amount of data sent, but also does not degrade other metrics such as the total migration time and the application performance.
Coordinating mobile robots are widely used in commercial and industrial settings to fulfill various tasks. However, programming the coordination among mobile robots is challenging. A coordination framework is needed to shield the programmer from handling low-level details of robot control and communication, while supporting flexible and cost-effective coordination at the same time. The coordination framework should also coexist well with the underlying robot control. To this end, we propose the Coordination-enabled Behavior-Based Robotics (CBBR) framework. CBBR employs distributed shared memory (DSM) to support coordination. The shared-memory illusion built by the DSM greatly simplifies the coordination logic. Moreover, the flexible access patterns of the DSM and the rich consistency semantics of DSM reads and writes enable flexible and cost-effective coordination. With the coordination support from the DSM, CBBR naturally extends the classical Behavior-Based Robotics (BBR) for robot control. From the perspective of robot control using BBR, the shared variables in the DSM act as logical sensors capturing the status of coordination. The coordination algorithms are encapsulated into coordination behaviors. Thus, the physical environment status and the coordination status may trigger the physical and the coordination behaviors, respectively. The scheduling of both types of behaviors integrates coordination into robot control. We conduct a case study to demonstrate the use of CBBR. The performance measurements show the cost-effectiveness of coordinating mobile robots based on CBBR in terms of time, space, and energy consumption.
When threads are migrated from heavily loaded nodes to lightly loaded nodes for load balancing in software distributed shared memory systems, the communication cost of maintaining data consistency increases if the threads to migrate are selected carelessly. Program performance degrades when the loss from increased communication exceeds the benefit from load balancing. This study addresses the problem with a novel selection policy called reduction of inter-node sharing costs. The main characteristic of this policy is that it considers thread memory access types and global sharing simultaneously. The experimental results show that this policy can reduce the communication of benchmark applications by 50% during load balancing. (C) 2002 Elsevier Science B.V. All rights reserved.
The Cray X1 supercomputer's distributed shared memory presents a 64-bit global address space that is directly addressable from every MSP with an interconnect bandwidth per computation rate of 1 byte/flop. Our results show that this high bandwidth and low latency for remote memory accesses translate into improved application performance on important applications.
Recent advances in the development of optical technologies suggest the possible emergence of broadcast-based optical interconnects within cache-coherent distributed shared memory (DSM) multiprocessor architectures. It is well known that the cache-coherence protocol is a critical issue in designing such architectures because it directly affects memory latencies. In this paper, we evaluate via simulation the performance of three directory-based cache-coherence protocols (strict request-response, intervention forwarding, and reply forwarding) on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), which is a low-latency and high-bandwidth broadcast-based fiber-optic interconnection network supporting DSM. The simulated system contains 64 nodes, each of which has a processor, a cache controller, a directory controller, and an output channel. Simulations have been conducted for each protocol to measure average processor utilization, average network latency, and average number of packets transferred over the network for varying values of the important DSM parameters, such as the ratio of the mean channel service time to mean thread run time (T/R), the probability of a cache block being in modified state P(M), the fraction of write misses P(W), and the home node contention rate. The results reveal that for all cases except low values of P(M), intervention forwarding gives the worst performance (lowest processor utilization and highest latency). The performance of strict request-response and reply forwarding is comparable for several values of the DSM parameters and contention rate. For a contention rate of 0%, the increase of P(M) makes reply forwarding perform better than strict request-response. The performance of all protocols decreases with the increase of P(W) and contention rate. However, of the three protocols, strict request-response is the least affected by the negative impact of the increase of P(W) and contention rate. Therefore, for t