Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations over the past three decades. To ensure an acceptable level of statistical accuracy, relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular, we focus on fast algorithms for the performance bottleneck operation of particle-to-grid interpolation. (c) 2008 Elsevier Inc. All rights reserved.
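As an illustration of the interpolation step this abstract singles out, below is a minimal 1-D sketch of linear (cloud-in-cell) particle-to-grid charge deposition in plain Python; the uniform periodic grid, function and array names, and NumPy usage are illustrative assumptions, not details from the paper, and the paper's GPU algorithms are not reproduced here.

```python
import numpy as np

def deposit_charge_1d(positions, charges, n_cells, dx):
    """Linear (cloud-in-cell) particle-to-grid deposition on a uniform 1-D grid.

    Each particle's charge is split between the two grid points that bracket it,
    weighted by proximity. This scatter step is the particle-to-grid
    interpolation that typically dominates PIC run time.
    """
    rho = np.zeros(n_cells)
    for x, q in zip(positions, charges):
        s = x / dx                       # position in grid units
        i = int(np.floor(s)) % n_cells   # left grid point (periodic domain)
        w = s - np.floor(s)              # fractional distance from the left point
        rho[i] += q * (1.0 - w)
        rho[(i + 1) % n_cells] += q * w
    return rho

# Example: four particles deposited onto an 8-cell periodic grid of spacing 1.0
print(deposit_charge_1d([0.25, 1.5, 3.9, 7.6], [1.0, 1.0, -1.0, 1.0], 8, 1.0))
```

On a GPU, many particles scatter contributions into the same grid cell concurrently, which is why this step becomes the performance bottleneck the paper targets.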
ISBN (Print): 0769520863
A shared memory abstraction can be robustly emulated over an asynchronous message passing system where any process can fail by crashing and possibly recover (crash-recovery model), by having the processes (a) exchange messages to synchronize their read and write operations and (b) log key information on their local stable storage. This paper extends the existing atomicity consistency criterion, defined for multi-writer/multi-reader shared memory in the crash-stop model, by providing two new criteria for the crash-recovery model. We introduce lower bounds on the log-complexity for each of the two corresponding types of robust shared memory emulations. We demonstrate that our lower bounds are tight by providing algorithms that match them. Besides being optimal, these algorithms have the same message and time complexity as the most efficient counterparts we know of in the crash-stop model.
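The recipe in the abstract, (a) message exchange to synchronize reads and writes and (b) logging to stable storage, can be pictured with the toy single-writer register emulation below; the majority-quorum scheme, timestamps, JSON log files, and all names are illustrative assumptions and not the paper's optimal algorithms.

```python
import json, os

class Replica:
    """One server replica: holds (timestamp, value) and logs it to stable storage."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.ts, self.value = 0, None
        if os.path.exists(log_path):                  # recover state after a crash
            self.ts, self.value = json.load(open(log_path))

    def write(self, ts, value):
        if ts > self.ts:
            self.ts, self.value = ts, value
            json.dump([ts, value], open(self.log_path, "w"))  # log before acking
        return True                                   # ack

    def read(self):
        return self.ts, self.value

def write_register(replicas, ts, value):
    """A write completes once a majority of replicas have logged and acked it."""
    acks = sum(r.write(ts, value) for r in replicas)
    assert acks > len(replicas) // 2

def read_register(replicas):
    """A read returns the highest-timestamped value seen at a majority.
    (A full atomic emulation would also write that value back to a majority.)"""
    majority = replicas[: len(replicas) // 2 + 1]
    return max((r.read() for r in majority), key=lambda tv: tv[0])[1]

# Example: three replicas backed by local log files (paths are placeholders)
reps = [Replica(f"/tmp/reg_{i}.log") for i in range(3)]
write_register(reps, ts=1, value="hello")
print(read_register(reps))                            # -> "hello"
```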
ISBN (Print): 9781728170022
The computation- and memory-intensiveness of deep learning models have made deploying model inference on edge devices with limited resource and energy budgets challenging. Non-volatile memory (NVM) based in-memory computing has been proposed to reduce data movement as well as energy consumption, which could alleviate this challenge. Racetrack memory is a newly introduced memory technology; it allows high-density fabrication and is thus a good fit for in-memory computing. To facilitate the deployment of deep learning models on edge devices, we present a racetrack-memory-based in-memory integer multiplication, one of the core operations in compressed deep learning models. The presented multiplication can be constructed efficiently with the racetrack memory technique and performs the logical operations in the memory cells with partial reuse of the peripheral circuits. In addition to the multiplication architecture, we also propose and apply a novel write optimization method to the integer multiplication, which transforms the required write operations into shift operations for performance and energy efficiency. The resulting design achieves high area and energy efficiency while maintaining performance comparable to its CMOS counterpart.
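To make the write-to-shift idea concrete, here is a plain-software sketch of shift-and-add integer multiplication, the decomposition such in-memory designs exploit; the 8-bit width and function name are illustrative, and the racetrack cell operations and peripheral circuits themselves are not modeled.

```python
def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned shift-and-add multiplication.

    Each set bit of b contributes a shifted copy of a to the product, so the
    whole computation reduces to shifts and additions, the kind of
    decomposition that in-memory designs can map onto cheap shift operations.
    """
    product = 0
    for i in range(width):
        if (b >> i) & 1:
            product += a << i
    return product

assert shift_add_multiply(23, 41) == 23 * 41
```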
ISBN (Print): 9780769550886
Although the capacities of persistent storage devices have evolved rapidly in recent years, the bandwidth between memory and persistent storage devices is still the bottleneck. As loosely coupled data sharing applications running in cluster environments may need an enormous number of files, access to these files might become the bottleneck. With the rapid development of servers and high-speed networks, much work has been done on distributed memory caches that minimize data requests to the centralized filesystem. These systems have the drawback that nodes are coupled together to form a distributed cache statically, which is a difficult administrative task in changing environments like clusters. Current high performance computing resources support batch job submission using distributed resource management systems like TORQUE, but how to use the resource management system to set up a self-organizing distributed memory cache on demand has rarely been studied. In this paper, we design a framework for dynamically setting up a distributed memory cache for data sharing applications. Shared files are stored in the distributed memory cache, which can be accessed transparently and delivers data with high bandwidth. We describe the architecture of the framework and evaluate its performance for a use case scenario.
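A read-through cache of the general kind the framework provides can be sketched as follows; the hash-based placement, the in-process dictionaries standing in for memory servers, and all names are illustrative assumptions rather than the framework's actual design.

```python
import hashlib

class ReadThroughCache:
    """A toy read-through cache: look in the distributed memory cache first,
    fall back to the shared filesystem on a miss, then populate the cache."""
    def __init__(self, cache_nodes):
        self.cache_nodes = cache_nodes        # e.g. a list of dicts standing in for memory servers

    def _node_for(self, path):
        # Consistent placement: hash the file path onto one of the cache nodes.
        h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
        return self.cache_nodes[h % len(self.cache_nodes)]

    def read(self, path):
        node = self._node_for(path)
        if path in node:                      # cache hit: served from memory
            return node[path]
        with open(path, "rb") as f:           # cache miss: go to the filesystem
            data = f.read()
        node[path] = data                     # populate for later readers
        return data
```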
It is currently possible to build multiprocessor systems which will support the tightly coupled activity of hundreds to thousands of different instruction streams, or processes. This can be done by coupling many monoprocessors, or a smaller number of pipelined multiprocessors, through a high concurrency switching network. The switching network may couple processors to memory modules, resulting in a shared memory multiprocessor system, or it may couple processor/memory pairs, resulting in a distributed memory system.
The need to direct the activity of very many processes simultaneously places qualitatively different demands on a programming language than directing a single process does. In spite of the different requirements, most languages for multiprocessors have been simple extensions of conventional, single-stream programming languages. The extensions are often implemented by way of subroutine calls and have little impact on the basic structure of the language. This paper attempts to examine the underlying conceptual structure of parallel languages for large-scale multiprocessors on the basis of an existing language for shared memory multiprocessors, known as the Force, and to extend the concepts in this language to distributed memory systems.
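For readers unfamiliar with the style, the generic Python sketch below shows single-program, barrier-synchronized execution, the model that languages like the Force express with language-level constructs rather than library calls; the multiprocessing-based harness, the rank-based work split, and the shared array are illustrative and are not the Force's syntax or this paper's proposal.

```python
from multiprocessing import Array, Barrier, Process

def worker(rank, nprocs, barrier, data):
    """Every process runs the same program text (SPMD); work is split by rank,
    and a barrier keeps all processes in step before the result is used."""
    chunk = len(data) // nprocs
    for i in range(rank * chunk, (rank + 1) * chunk):
        data[i] = i * i                      # each process fills its own slice
    barrier.wait()                           # all slices complete before anyone continues
    if rank == 0:
        print(sum(data))

if __name__ == "__main__":
    nprocs = 4
    barrier = Barrier(nprocs)
    data = Array("d", 16)                    # shared array visible to all processes
    procs = [Process(target=worker, args=(r, nprocs, barrier, data)) for r in range(nprocs)]
    for p in procs: p.start()
    for p in procs: p.join()
```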
ISBN (Print): 9798350383782; 9798350383799
Multivariate time series (MTS) classification has been tackled using various methods, including reservoir computing (RC), which generates efficient vectorized representations such as the reservoir state (RS). RS shines when handling large numbers of classes or large training sets, but demands long processing times and substantial memory. To address this, we present the Parallel Reservoir Echo State Network (PR-ESN), an optimized parallel training and evaluation algorithm rooted in the ESN principle. It leverages both CPU shared memory and a parallel distributed memory architecture to efficiently capture the reservoir state's optimal model space representation, addressing the computational challenges in MTS analysis. Distinguishing itself from previous work, PR-ESN combines distributed parallel processing at the network level with shared memory multiprocessing at the node level, which reduces memory requirements and speeds up processing. Key features include PR-ESN's distributed training and evaluation, shared memory parallelization, and MSR concatenation for comprehensive analysis of distributed model space representations. Tests on real-world MTS and benchmark ECG data show that PR-ESN-based classifiers achieve superior accuracy and faster processing times with optimal memory usage.
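The ESN principle the abstract builds on can be summarized by the classic reservoir-state update x_t = tanh(W_in u_t + W x_{t-1}). The sketch below implements just this update for a single series; the reservoir size, weight scaling, and spectral-radius normalization are chosen for illustration, and none of PR-ESN's distributed or shared memory parallelization is shown.

```python
import numpy as np

def reservoir_states(inputs, n_reservoir=100, spectral_radius=0.9, seed=0):
    """Run a basic echo state network reservoir over one multivariate series.

    inputs: array of shape (T, n_features). Returns states of shape (T, n_reservoir).
    Only the classic ESN update x_t = tanh(W_in u_t + W x_{t-1}) is shown here.
    """
    rng = np.random.default_rng(seed)
    T, n_features = inputs.shape
    W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_features))
    W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))   # enforce the echo state property
    x = np.zeros(n_reservoir)
    states = np.empty((T, n_reservoir))
    for t in range(T):
        x = np.tanh(W_in @ inputs[t] + W @ x)
        states[t] = x
    return states

# Example: a 200-step, 3-channel series mapped to 100-dimensional reservoir states
print(reservoir_states(np.random.randn(200, 3)).shape)
```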
ISBN (Print): 9781665445139
Byte-addressable non-volatile memory (NVM) technologies promise higher density and lower cost than DRAM, and they have been increasingly employed for data center applications. Despite many previous studies on using NVM in a single machine, challenges remain in making the best use of it in a distributed data center environment. This paper presents Gengar, an RDMA-enabled distributed shared hybrid memory (DSHM) pool with simple programming APIs that expose remote NVM and DRAM as a global memory space. We propose to exploit the semantics of RDMA primitives to identify frequently accessed data in the hybrid memory pool and cache it in distributed DRAM buffers. We redesign the RDMA communication protocol to reduce the bottleneck of RDMA write latency by leveraging a proxy mechanism. Gengar also supports memory sharing among multiple users with data consistency guarantees. We evaluate Gengar on a real testbed equipped with Intel Optane DC Persistent Memory DIMMs. Experimental results show that Gengar significantly improves the performance of public benchmarks such as MapReduce and YCSB by up to 70% compared with state-of-the-art DSHM systems.
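The caching idea, keeping all data in NVM while promoting frequently accessed items into DRAM, can be pictured with the toy pool below; the access counters, the tiny eviction-free cache, and the in-process dictionaries are illustrative assumptions, not Gengar's RDMA-based mechanisms.

```python
from collections import Counter

class HybridMemoryPool:
    """A toy hybrid pool: all objects live in (slow) NVM; the most frequently
    read ones are also kept in a small (fast) DRAM cache, chosen by access count."""
    def __init__(self, dram_slots=2):
        self.nvm = {}                 # stands in for the remote NVM region
        self.dram = {}                # stands in for the distributed DRAM buffers
        self.hits = Counter()
        self.dram_slots = dram_slots

    def write(self, key, value):
        self.nvm[key] = value
        self.dram.pop(key, None)      # keep the cache from serving stale data

    def read(self, key):
        self.hits[key] += 1
        if key in self.dram:
            return self.dram[key]     # fast path: served from DRAM
        value = self.nvm[key]
        hottest = {k for k, _ in self.hits.most_common(self.dram_slots)}
        if key in hottest:            # promote frequently read data to DRAM
            self.dram[key] = value
        return value
```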
ISBN (Print): 9781479954964
Large graph analysis is one of the significant applications of distributed computing frameworks. Distributed computing applications are solved by developing programs over different types of established distributed computing frameworks. Since graph analysis and prediction is one of the new trends in data analytics, formulating such problems on an in-memory cluster framework that consumes graph datasets plays a significant role in distributed computing. Traditional disk-based distributed computing frameworks like Hadoop are confined to a specific group of problems in data analytics. Utilizing the memory of the cluster, in addition to disk-based storage, plays a significant role in reducing latency and increasing speedup. This work describes the significance of the Spark framework for solving graph-related problems in a distributed manner, using the PageRank algorithm and a proteome-protein annotation method implemented in Scala.
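For reference, the iterative PageRank computation applied in the work looks like the minimal single-machine sketch below; the example graph, damping factor, and iteration count are illustrative, and the paper's own implementation is in Scala on Spark rather than plain Python.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over an adjacency list {node: [outgoing neighbours]}."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        contrib = {n: 0.0 for n in nodes}
        for node, outs in links.items():
            for n in outs:                       # spread this node's rank over its out-links
                contrib[n] += rank[node] / len(outs)
        rank = {n: (1 - damping) / len(nodes) + damping * c for n, c in contrib.items()}
    return rank

# Example: scores for a small four-page link graph
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["a", "c"]}))
```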
ISBN (Print): 0818656808
A distributed shared memory system provides the abstraction of a shared address space on either a network of workstations or a distributed-memory multiprocessor. Although a distributed shared memory system can improve performance by relaxing the memory consistency model and maintaining memory coherence at a granularity specified by the programmer, the challenge is to offer ease of programming while maintaining high performance. Concord meets this challenge by carefully splitting responsibilities among the programmer, the compiler, and the runtime system. Concord has allowed a single programmer to port several real, large shared-memory parallel programs onto an Intel iPSC/2 in a few weeks and achieve reasonable speedup.
ISBN (Print): 9780769550220
Parallel programmers face the often irreconcilable goals of programmability and performance. HPC systems use distributed memory for scalability, thereby sacrificing the programmability advantages of shared memory programming models. Furthermore, the rapid adoption of heterogeneous architectures, often with non-cache-coherent memory systems, has further increased the challenge of supporting shared memory programming models. Our primary objective is to define a memory consistency model that presents the familiar thread-based shared memory programming model, but allows good application performance on non-cache-coherent systems, including distributed memory clusters and accelerator-based systems. We propose regional consistency (RegC), a new consistency model that achieves this objective. Results on up to 256 processors for representative benchmarks demonstrate the potential of RegC in the context of our prototype distributed shared memory system.