Distributed shared-memory (DSM) systems are shared-memory multiprocessor architectures in which each processor node contains a partition of the shared memory. In hybrid DSM systems, coherence among caches is maintained by a software-implemented coherence protocol that relies on some hardware support. The hardware satisfies every node hit (the common case), and software is invoked only for accesses to remote nodes. In this paper we compare the design and performance of four hybrid DSM organizations through detailed simulation of the same hardware platform. We have implemented the software protocol handlers for the four architectures. The handlers are written in C and assembly code, and coherence transactions are executed in trap and interrupt handlers. Together with the application, the handlers are executed in full detail in execution-driven simulations of six complete benchmarks with coarse-grain and fine-grain sharing. We relate our experience implementing and simulating the software protocols for the four architectures. Because the overhead of remote accesses is very high in hybrid systems, the system of choice differs from that for purely hardware systems. (c) 2008 Elsevier B.V. All rights reserved.
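As a rough illustration of what such a software handler does, the C sketch below resolves a read miss to a remote node by fetching the block and marking it locally readable. All names here (fetch_remote_block, the block table, the two-state bookkeeping) are our own simplifications, not the paper's handlers, which run in trap and interrupt context on real hardware.

/* Minimal sketch of a software coherence handler for a hybrid DSM node.
 * Hypothetical names throughout; a real handler runs in a trap/interrupt
 * context and talks to a directory at the block's home node. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 64
#define NUM_BLOCKS 1024

enum state { INVALID, SHARED, EXCLUSIVE };

struct block_entry {
    enum state st;
    uint8_t    data[BLOCK_SIZE];
};

static struct block_entry local_cache[NUM_BLOCKS];

/* Placeholder for the message-layer call that fetches a block from its
 * home node; a real handler would issue a network request here. */
static void fetch_remote_block(int home, uint64_t addr, uint8_t *buf)
{
    memset(buf, 0, BLOCK_SIZE);            /* stand-in for the network reply */
    printf("fetch block 0x%llx from node %d\n",
           (unsigned long long)addr, home);
}

/* Invoked (conceptually from a trap) when a load misses on the local node. */
void remote_read_miss_handler(int home, uint64_t addr)
{
    struct block_entry *e = &local_cache[(addr / BLOCK_SIZE) % NUM_BLOCKS];
    if (e->st == INVALID) {
        fetch_remote_block(home, addr, e->data);
        e->st = SHARED;                    /* block is now readable locally */
    }
}

int main(void)
{
    remote_read_miss_handler(2, 0x4000);
    return 0;
}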
ISBN (print): 9780897918169
The message-passing programs are executed with the Parallel Virtual Machine (PVM) library, and the shared-memory programs are executed using TreadMarks. The programs are Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, Integer Sort (IS), and Embarrassingly Parallel (EP) from the NAS benchmarks; ILINK, a widely used genetic linkage analysis program; and Successive Over-Relaxation (SOR), Traveling Salesman (TSP), and Quicksort (QSORT). Two different input data sets were used for Water (Water-288 and Water-1728), IS (IS-Small and IS-Large), and SOR (SOR-Zero and SOR-NonZero). Our execution environment is a set of eight HP735 workstations connected by a 100 Mbit/s FDDI network. For Water-1728, EP, ILINK, SOR-Zero, and SOR-NonZero, the performance of TreadMarks is within 10% of PVM. For IS-Small, Water-288, Barnes-Hut, 3-D FFT, TSP, and QSORT, differences are on the order of 10% to 30%. Finally, for IS-Large, PVM performs two times better than TreadMarks. More messages and more data are sent in TreadMarks, explaining the performance differences. This extra communication is caused by 1) the separation of synchronization and data transfer, 2) extra messages to request updates for data under the invalidate protocol used in TreadMarks, 3) false sharing, and 4) diff accumulation for migratory data in TreadMarks.
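Point 4) refers to the twin/diff technique used by page-based software DSMs such as TreadMarks: a copy (the "twin") of a page is saved before its first write, and at synchronization time only the words that differ from the twin are shipped. The C sketch below shows the diff step in its simplest form; the page size, flat (offset, value) encoding, and function names are our simplifications, not TreadMarks code.

/* Sketch of twin/diff creation in a page-based software DSM. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define PAGE_WORDS 1024

/* Emit (offset, value) pairs for every word changed since twinning.
 * Returns the number of modified words, i.e. the size of the diff. */
static size_t make_diff(const uint32_t *page, const uint32_t *twin,
                        uint32_t *off_out, uint32_t *val_out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (page[i] != twin[i]) {
            off_out[n] = (uint32_t)i;
            val_out[n] = page[i];
            n++;
        }
    }
    return n;
}

int main(void)
{
    static uint32_t twin[PAGE_WORDS], page[PAGE_WORDS];
    memcpy(page, twin, sizeof(twin));  /* twin taken before the first write */
    page[3] = 42; page[700] = 7;       /* two writes dirty the page */

    static uint32_t offs[PAGE_WORDS], vals[PAGE_WORDS];
    size_t n = make_diff(page, twin, offs, vals);
    printf("diff carries %zu of %d words\n", n, PAGE_WORDS);
    return 0;
}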
To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (Translation Lookaside Buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (Cache-Coherent Non-Uniform Memory Access architectures) and COMAs (Cache-Only Memory Access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Although the effectiveness of a TLB merged with the shared memory is very high, we also show that the TLB can be removed entirely in a system where address translation is done in memory, because the frequency of translations is very low.
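For reference, the sketch below shows the conventional baseline that the paper's five options depart from: a TLB consulted first, with a page-table walk on a miss. The software-managed, direct-mapped organization, the sizes, and the toy linear page table are illustrative assumptions of ours, not any of the studied designs.

/* Toy TLB lookup with a page-table walk on a miss. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define TLB_SIZE   64
#define NUM_PAGES  4096

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };

static struct tlb_entry tlb[TLB_SIZE];
static uint64_t page_table[NUM_PAGES];   /* vpn -> pfn, toy linear table */

static uint64_t translate(uint64_t vaddr, int *tlb_hit)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[vpn % TLB_SIZE];   /* direct-mapped slot */

    if (e->valid && e->vpn == vpn) {
        *tlb_hit = 1;
    } else {                                      /* miss: walk the table */
        *tlb_hit = 0;
        e->vpn = vpn;
        e->pfn = page_table[vpn % NUM_PAGES];
        e->valid = 1;
    }
    return (e->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void)
{
    for (uint64_t i = 0; i < NUM_PAGES; i++) page_table[i] = i + 100;
    int hit;
    uint64_t pa = translate(0x12345, &hit);      /* first access misses */
    printf("paddr=0x%llx tlb_hit=%d\n", (unsigned long long)pa, hit);
    pa = translate(0x12345, &hit);               /* second access hits */
    printf("paddr=0x%llx tlb_hit=%d\n", (unsigned long long)pa, hit);
    return 0;
}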
ISBN (print): 9781450350280
Next-generation non-volatile memories (NVMs) will provide byte addressability, persistence, high density, and DRAM-like performance. They have the potential to benefit many datacenter applications. However, most previous research on NVMs has focused on using them in a single-machine environment. It is still unclear how to best utilize them in distributed, datacenter environments. We introduce Distributed Shared Persistent Memory (DSPM), a new framework for using persistent memories in distributed datacenter environments. DSPM provides a new abstraction that allows applications to both perform traditional memory load and store instructions and to name, share, and persist their data. We built Hotpot, a kernel-level DSPM system that provides low-latency, transparent memory accesses, data persistence, data reliability, and high availability. The key ideas of Hotpot are to integrate distributed memory caching and data replication techniques and to exploit application hints. We implemented Hotpot in the Linux kernel and demonstrated its benefits by building a distributed graph engine on Hotpot and porting a NoSQL database to Hotpot. Our evaluation shows that Hotpot outperforms a recent distributed shared memory system by 1.3x to 3.2x and a recent distributed PM-based file system by 1.5x to 3.0x.
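To make the abstraction concrete, the sketch below shows the general shape of a DSPM-style interface: map a named persistent region, access it with ordinary loads and stores, then ask the system to persist and replicate the writes. The dspm_open/dspm_commit calls are invented for illustration and are not Hotpot's actual kernel API.

/* Hypothetical DSPM-style usage: name, load/store, persist. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Stand-in for mapping a named persistent region into the address space. */
static void *dspm_open(const char *name, size_t len)
{
    printf("map persistent region '%s' (%zu bytes)\n", name, len);
    return malloc(len);                      /* toy local stand-in */
}

/* Stand-in for making a range of writes durable and replicated. */
static void dspm_commit(void *addr, size_t len)
{
    (void)addr;
    printf("persist + replicate %zu bytes\n", len);
}

int main(void)
{
    char *graph = dspm_open("/hotpot/graph-edges", 4096);
    if (!graph) return 1;
    strcpy(graph, "edge: 1 -> 2");           /* ordinary store instructions */
    dspm_commit(graph, 4096);                /* make the update durable */
    free(graph);
    return 0;
}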
ISBN (print): 0769509886
We present a computation model to describe a clustered memory hierarchy of distributed shared memory machines. The computation model includes the access to shared data stored in different levels of the hierarchy as well as the transfer of entire blocks of data between different levels of the memory. Pure shared-memory machines and pure message-passing machines can be expressed within the model. As an example, we use the model to analyze a hierarchical matrix multiplication algorithm.
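To give a flavor of what such an analysis yields, here is a toy cost expression of our own (not the paper's actual model) for multiplying two n x n matrices split into b x b blocks, with t_c the cost per scalar multiply-add and t_B the cost of moving one block between adjacent memory levels; each of the (n/b)^3 block multiplications loads one block of A and one of B, while each block of C is written back once:

\[
  T(n,b) \;\approx\;
  \underbrace{2n^{3}\,t_c}_{\text{arithmetic}}
  \;+\;
  \underbrace{2\left(\tfrac{n}{b}\right)^{3} t_B}_{\text{loading blocks of } A \text{ and } B}
  \;+\;
  \underbrace{\left(\tfrac{n}{b}\right)^{2} t_B}_{\text{writing back } C},
  \qquad t_B \propto b^{2}.
\]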
ISBN (print): 9798400706202
In this paper, we propose a new library for storing arrays in a distributed fashion on distributed-memory systems. From a programmer's perspective, these arrays behave for arbitrary reads as if they were allocated in shared memory. When it comes to writes into these arrays, the programmer has to ensure that all writes are restricted to a fixed range of addresses that is "owned" by the node executing the writing operation. We show how this design, despite the owner-compute restriction, can aid programmer productivity by enabling straightforward parallelisations of typical array-manipulating codes. Furthermore, we delineate an open-source implementation of the proposed library named Shray. Using the programming interface of Shray, we compare possible hand-parallelised codes of example applications with implementations in other DSM/PGAS systems, demonstrating the programming style enabled by Shray and providing some initial performance figures.
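The C sketch below illustrates the owner-compute discipline described above, with a plain local array and a node loop standing in for the distributed arrays and nodes; the partitioning into contiguous slices is an assumption for illustration, and no actual Shray calls appear. Each simulated node reads neighboring elements freely but writes only inside its owned slice.

/* Owner-compute style on a 1-D stencil: read anywhere, write only owned. */
#include <stdio.h>

#define N     16
#define NODES 4

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (double)i;

    /* The node loop stands in for the distributed execution. */
    for (int node = 0; node < NODES; node++) {
        int lo = node * (N / NODES);           /* owned range of this node */
        int hi = lo + (N / NODES);
        for (int i = lo; i < hi; i++) {
            double left  = (i > 0)     ? a[i - 1] : 0.0;  /* remote reads ok */
            double right = (i < N - 1) ? a[i + 1] : 0.0;
            b[i] = 0.5 * (left + right);       /* write stays in owned range */
        }
    }
    printf("b[5] = %f\n", b[5]);
    return 0;
}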
ISBN (print): 9781450344876
Parallel programming paradigms are commonly characterized by the core metrics of scalability, memory use, ease of use, hardware requirements, and resiliency. Increasingly, support for heterogeneous environments, for example a mix of CPUs and accelerators, is of interest. Analysis of the semantics of different classes of parallel programming paradigms and their cost leads to DYCE (Distributed Yet Common Environment), a shared-memory, rich but hardware-friendly, race- and deadlock-free parallel programming paradigm that allows for resiliency without the need for explicit checkpointing code. Pointer-based structures that span the memory of multiple heterogeneous compute devices are possible. Importantly, data exchange is independent of the specific data structures and does not require serialization and deserialization code, even for data structures such as a dynamic linked radix tree of strings. The analysis shows that DYCE does not require coherence from the system and thus can be executed with near-minimal overhead and hardware requirements, including the page table cost for large unified address spaces that span many devices. We demonstrate efficacy with a prototype.
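One way to see why serialization-free exchange of pointer-based structures is possible is the classic trick of mapping a shared region at the same virtual base on every device, so raw pointers inside the region stay valid everywhere. The C sketch below illustrates that general idea only; it is our illustration, not DYCE's mechanism.

/* Pointers inside a commonly-mapped region need no encode/decode step. */
#include <stdio.h>
#include <stddef.h>

struct node { int key; struct node *next; };

#define REGION_NODES 8
static struct node region[REGION_NODES];  /* stand-in for the shared region */

int main(void)
{
    /* Build a linked list entirely inside the shared region. */
    for (int i = 0; i < REGION_NODES - 1; i++) {
        region[i].key  = i;
        region[i].next = &region[i + 1];   /* pointer into the same region */
    }
    region[REGION_NODES - 1].key  = REGION_NODES - 1;
    region[REGION_NODES - 1].next = NULL;

    /* Another device mapping the region at the same base could walk these
     * pointers directly; no serialization is needed for the handover. */
    for (struct node *p = &region[0]; p; p = p->next)
        printf("%d ", p->key);
    printf("\n");
    return 0;
}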
The performance of three schemes for binding remote data to memory local to a node is evaluated. When the working set is large relative to the cache size, many cache misses occur, so binding at page-fault time alone cannot efficiently exploit locality of reference in the local memory. When the working set is small, an address bound to the local memory at node-miss time is not effective because cache miss rates are low. Our simulation shows that binding at cache-miss time achieves up to 3.1 times the performance of binding at page-fault time and up to 2.4 times that of binding at node-miss time.
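The sketch below illustrates cache-miss-time binding, the best-performing scheme: every cache miss also installs the block into node-local memory, so subsequent misses to the same block are satisfied locally rather than remotely. The data structures and names are our simplifications, not the paper's simulator.

/* Cache-miss-time binding of remote blocks into node-local memory. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK 64
#define SLOTS 256

struct slot { uint64_t tag; int valid; uint8_t data[BLOCK]; };
static struct slot local_mem[SLOTS];          /* node-local binding store */

/* Stand-in for fetching a block from its remote home node. */
static void fetch_from_home(uint64_t addr, uint8_t *buf)
{
    memset(buf, (int)(addr & 0xff), BLOCK);
}

/* On a cache miss, bind the block locally so later misses stay local. */
void on_cache_miss(uint64_t addr, uint8_t *line_out)
{
    struct slot *s = &local_mem[(addr / BLOCK) % SLOTS];
    if (!(s->valid && s->tag == addr / BLOCK)) {
        fetch_from_home(addr, s->data);       /* remote only on first miss */
        s->tag = addr / BLOCK;
        s->valid = 1;
    }
    memcpy(line_out, s->data, BLOCK);         /* served locally from now on */
}

int main(void)
{
    uint8_t line[BLOCK];
    on_cache_miss(0x1000, line);   /* remote fetch + local binding */
    on_cache_miss(0x1000, line);   /* satisfied from local memory */
    printf("line[0]=%u\n", line[0]);
    return 0;
}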
ISBN (print): 9798400715389
As AI models grow exponentially in size, memory has emerged as a critical bottleneck for inference at scale. While hardware solutions like Compute Express Link (CXL) promise to solve the problem of memory capacity and sharing, they require capital investment and are not widely available. This paper presents RMAI, an in-kernel remote shared memory framework tailored for AI inference workloads, offering a transparent, scalable, and cost-effective software alternative to hardware-based memory expansion and sharing solutions. By leveraging the operating system's capabilities, RMAI introduces dynamic virtual memory regions that reduce page faults, minimize overheads associated with user-kernel transitions, and optimize data locality for inference workloads. In this paper, we particularly focus on Mixture-of-Experts (MoE) models. In this initial evaluation we demonstrate that RMAI achieves performance levels comparable to CXL-like architectures, with up to 10x faster expert switching and reduced memory management overhead across large-scale inference tasks compared to disk-based solutions. This work redefines the role of remote shared memory in AI systems, positioning it as a practical and high-performance solution for memory capacity and sharing in modern data centers.
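The C sketch below caricatures the expert-switching path this speedup comes from: with expert weights resident in a remote shared memory pool, activating a different expert is a pointer/mapping change rather than a disk read. All names (remote_pool, switch_expert) are ours for illustration, not RMAI's interface.

/* Toy MoE expert switching against a remote memory pool. */
#include <stdio.h>

#define EXPERTS       8
#define EXPERT_FLOATS 4

/* Stand-in for expert weights kept resident in a remote memory pool. */
static float remote_pool[EXPERTS][EXPERT_FLOATS];

static const float *active_expert;   /* slot used by the inference loop */

static void switch_expert(int id)
{
    /* With remote shared memory this is a remap, not a disk read. */
    active_expert = remote_pool[id];
    printf("switched to expert %d\n", id);
}

int main(void)
{
    for (int e = 0; e < EXPERTS; e++)
        for (int i = 0; i < EXPERT_FLOATS; i++)
            remote_pool[e][i] = (float)(e * 10 + i);

    switch_expert(3);                 /* router picked expert 3 */
    printf("w0 = %f\n", active_expert[0]);
    return 0;
}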