ISBN: 9781665476522 (print)
Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work on prefetching, decoupling, or pipelining can mitigate memory latency and improve core utilization, memory bottlenecks persist due to limited off-chip bandwidth. Approaches that do processing in-memory (PIM) with Hybrid Memory Cube (HMC) overcome bandwidth limitations but fail to achieve high core utilization due to poor task scheduling and synchronization overheads. Moreover, the high memory-per-core ratio available with HMC limits strong scaling. We introduce Dalorex, a hardware-software co-design that achieves high parallelism and energy efficiency, demonstrating strong scaling with >16,000 cores when processing graph and sparse linear algebra workloads. Compared with prior work in PIM, with both designs using 256 cores, Dalorex improves both performance and energy consumption by two orders of magnitude through (1) a tile-based distributed-memory architecture where each processing tile holds an equal amount of data and all memory operations are local; (2) a task-based parallel programming model where tasks are executed by the processing unit that is co-located with the target data; (3) a network design optimized for irregular traffic, where all communication is one-way and messages do not contain routing metadata; (4) novel traffic-aware task scheduling hardware that maintains high core utilization; and (5) a data-placement strategy that improves work balance. This work proposes architectural and software innovations to provide the greatest scalability to date for running graph algorithms while still being programmable for other domains.
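The data-local task model described in the abstract can be illustrated with a small simulation. The sketch below is not Dalorex's hardware: owner(), relax(), and the round-robin delivery loop are hypothetical stand-ins for the paper's data-placement strategy, task programming model, and on-chip network. It only shows the core invariant, namely that a task always runs on the tile that owns its target datum, so every memory access is local.
```python
# Minimal sketch (not Dalorex's actual design) of data-local task execution:
# each datum lives on exactly one tile, and a task always runs on the tile
# that owns its target datum, so all memory operations are local.
from collections import deque

NUM_TILES = 4

class Tile:
    def __init__(self, tile_id):
        self.tile_id = tile_id
        self.local_data = {}       # vertex id -> value (this tile's shard)
        self.task_queue = deque()  # incoming one-way messages

    def push(self, task_fn, vertex, arg):
        self.task_queue.append((task_fn, vertex, arg))

    def run(self, tiles):
        while self.task_queue:
            task_fn, vertex, arg = self.task_queue.popleft()
            task_fn(self, tiles, vertex, arg)

def owner(vertex):
    # Hypothetical placement function; the paper's data-placement
    # strategy is more elaborate, to balance work across tiles.
    return vertex % NUM_TILES

def relax(tile, tiles, vertex, new_dist):
    # SSSP-style relaxation that only touches tile-local memory.
    if new_dist < tile.local_data.get(vertex, float("inf")):
        tile.local_data[vertex] = new_dist
        for nbr, w in GRAPH.get(vertex, []):
            # Spawn a follow-up task on the tile that owns the neighbor.
            tiles[owner(nbr)].push(relax, nbr, new_dist + w)

GRAPH = {0: [(1, 3), (2, 1)], 1: [], 2: [(1, 1)]}
tiles = [Tile(t) for t in range(NUM_TILES)]
tiles[owner(0)].push(relax, 0, 0)
for _ in range(8):                 # naive round-robin "network" loop
    for t in tiles:
        t.run(tiles)
print({v: tiles[owner(v)].local_data[v] for v in GRAPH})  # distances 0:0, 1:2, 2:1
```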
ISBN: 9781665497473 (print)
We present a research-based course module to teach computer science students, software developers, and scientists the effects of non-determinism on high-performance applications. The course module uses the ANACIN-X software package, a suite of software modules developed by the authors; ANACIN-X provides test cases, analytic tools to run different scenarios (e.g., using different numbers of processes and different communication patterns), and visualization tools for beginner, intermediate, and advanced levels of understanding of non-determinism. Through our course module, students in computer science, software developers, and scientists gain an understanding of non-determinism, how to measure its occurrence in an execution, and how to identify its root causes within an application's code.
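To make the teaching target concrete, here is a toy example of the kind of message-race non-determinism such a module studies; it is not part of ANACIN-X and assumes mpi4py is installed. With MPI.ANY_SOURCE, the order in which rank 0 receives messages can vary between otherwise identical runs.
```python
# Toy illustration of message-race non-determinism (not ANACIN-X code).
# Run with e.g.: mpiexec -n 4 python race.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    order = []
    for _ in range(comm.Get_size() - 1):
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, status=status)
        order.append(status.Get_source())
    print("receive order:", order)  # may differ from run to run
else:
    comm.send(rank, dest=0)
```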
ISBN: 9781665481069 (print)
The GPU programming model is primarily aimed at the development of applications that run on one GPU. However, this limits the scalability of GPU code to the capabilities of a single GPU in terms of compute power and memory capacity. To scale GPU applications further, a great engineering effort is typically required: work and data must be divided over multiple GPUs by hand, possibly in multiple nodes, and data must be manually spilled from GPU memory to higher-level memories. We present Lightning: a framework that follows the common GPU programming paradigm but enables scaling to large problems with ease. Lightning supports multi-GPU execution of GPU kernels, even across multiple nodes, and seamlessly spills data to higher-level memories (main memory and disk). Existing CUDA kernels can easily be adapted for use in Lightning, with data access annotations on these kernels allowing Lightning to infer their data requirements and the dependencies between subsequent kernel launches. Lightning efficiently distributes the work/data across GPUs and maximizes efficiency by overlapping scheduling, data movement, and kernel execution when possible. We present the design and implementation of Lightning, as well as experimental results on up to 32 GPUs for eight benchmarks and one real-world application. The evaluation shows excellent performance and scalability, such as a speedup of 57.2x over the CPU when using Lightning with 16 GPUs over 4 nodes and 80 GB of data, far beyond the memory capacity of one GPU.
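The access-annotation idea can be sketched as follows. The decorator syntax and Runtime class below are hypothetical, not Lightning's actual API; the sketch only shows how per-argument read/write declarations let a runtime infer dependencies between kernel launches.
```python
# Hypothetical sketch (not Lightning's API): kernels declare how each
# argument is accessed, so a runtime can infer per-launch data
# requirements and the dependencies between subsequent launches.

def kernel(**access):                    # arg name -> "read" or "write"
    def wrap(fn):
        fn.access = access
        return fn
    return wrap

@kernel(a="read", b="read", out="write")
def vector_add(a, b, out, i):
    out[i] = a[i] + b[i]

class Runtime:
    """Orders launches by tracking which launch last wrote each buffer."""
    def __init__(self):
        self.launches = []               # (kernel name, dependency indices)
        self.last_writer = {}            # buffer id -> launch index

    def launch(self, fn, n, **bufs):
        deps = sorted({self.last_writer[id(b)] for b in bufs.values()
                       if id(b) in self.last_writer})
        self.launches.append((fn.__name__, deps))
        this = len(self.launches) - 1
        for name, mode in fn.access.items():
            if mode == "write":
                self.last_writer[id(bufs[name])] = this
        for i in range(n):               # executed serially here; a multi-GPU
            fn(i=i, **bufs)              # runtime would split i's range over GPUs

rt = Runtime()
a, b, c, d = [1, 2], [3, 4], [0, 0], [0, 0]
rt.launch(vector_add, 2, a=a, b=b, out=c)
rt.launch(vector_add, 2, a=c, b=b, out=d)  # reads c -> depends on launch 0
print(rt.launches)                         # [('vector_add', []), ('vector_add', [0])]
print(d)                                   # [7, 10]
```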
ISBN: 9781665481069 (print)
Processing-in-memory (PIM) is a promising way to solve the well-known data-movement challenge by performing in-situ computations near the data. Leveraging PIM features can substantially boost the energy efficiency of applications. Early studies mainly focus on improving the programmability of computation offloading on PIM architectures. They lack a comprehensive analysis of computation locality and hence fail to accelerate a wide variety of applications. In this paper, we present a general-purpose instruction-level offloading technique for near-DRAM PIM architectures, namely IOTPIM, to exploit PIM features comprehensively. IOTPIM is novel in two technical advances: 1) a new instruction offloading policy that fully considers the locality of the whole on-chip cache hierarchy, and 2) an offloading performance-benefit prediction model that directly predicts the offloading benefit of an instruction from the characteristics of the input dataset, keeping analysis overheads low. The evaluation demonstrates that IOTPIM can be applied to accelerate a wide variety of applications, including graph processing, machine learning, and image processing. IOTPIM outperforms state-of-the-art PIM offloading techniques by 1.28x-1.51x while ensuring offloading accuracy as high as 91.89% on average.
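A toy version of the offloading decision is sketched below. The latency numbers and hit probabilities are made up, and IOTPIM's actual predictor is more elaborate, but the sketch captures the idea: estimate an instruction's expected memory latency across the whole cache hierarchy and offload only when near-DRAM execution is predicted to be cheaper.
```python
# Hypothetical cost model (not IOTPIM's actual predictor) for the
# instruction-level offloading decision on a near-DRAM PIM architecture.

# Assumed latencies in cycles; real values are machine-specific.
LAT = {"L1": 4, "L2": 12, "LLC": 40, "DRAM": 200, "PIM": 60}

def host_latency(p_l1, p_l2, p_llc):
    """Expected latency on the host given per-level hit probabilities."""
    p_dram = 1 - p_l1 - p_l2 - p_llc
    return (p_l1 * LAT["L1"] + p_l2 * LAT["L2"]
            + p_llc * LAT["LLC"] + p_dram * LAT["DRAM"])

def should_offload(p_l1, p_l2, p_llc):
    # Offload when near-memory execution beats the expected host latency.
    return LAT["PIM"] < host_latency(p_l1, p_l2, p_llc)

# A pointer-chasing load that misses caches 90% of the time: offload it.
print(should_offload(0.05, 0.03, 0.02))   # True
# A hot loop with 95% L1 hits: keep it on the host.
print(should_offload(0.95, 0.03, 0.01))   # False
```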
ISBN: 9781665481069 (print)
Distributed shared memory (DSM) systems can handle data-intensive applications and have recently been receiving more attention. A majority of existing DSM implementations are based on write-invalidation (WI) protocols, which achieve sub-optimal performance when the cache size is small. Specifically, the vast majority of invalidation messages become useless when evictions are frequent. The problem is troublesome given the scarce memory resources in data centers. To this end, we propose Falcon, a self-invalidation protocol that eliminates invalidation messages. It relies on per-operation timestamps to achieve the global memory order required by sequential consistency (SC). Furthermore, we conduct a comprehensive discussion of the two protocols with an emphasis on the impact of cache size. We also implement both protocols atop a recent DSM system, Grappa. The evaluation shows that the optimal protocol can improve the performance of a KV database by 27% and a graph processing application by 71.4% over the vanilla cache-free scheme.
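Self-invalidation can be illustrated with a minimal lease-based sketch. This is in the spirit of Falcon rather than its actual protocol (Falcon's per-operation timestamping that enforces SC is more involved): cached copies expire on their own, so a write at the home node never sends invalidation messages.
```python
# Lease-based self-invalidation sketch (illustrative, not Falcon itself):
# readers drop stale cached copies on their own, so writes send no
# invalidation messages to sharers.
import itertools

clock = itertools.count(1)        # global logical clock (one tick per operation)
LEASE = 5                         # validity window, in operations

home = {}                         # home-node memory: addr -> (value, version ts)
cache = {}                        # local cache: addr -> (value, fetch ts)

def write(addr, value):
    ts = next(clock)
    home[addr] = (value, ts)      # no invalidations sent to sharers

def read(addr):
    ts = next(clock)
    if addr in cache:
        value, fetched = cache[addr]
        if ts - fetched < LEASE:  # still within the lease: local hit
            return value
        del cache[addr]           # lease expired: self-invalidate
    value, _ = home[addr]
    cache[addr] = (value, ts)
    return value

write("x", 1)
print(read("x"))                  # 1, fetched into the cache
write("x", 2)                     # no invalidation message is sent
print(read("x"))                  # may still return 1 (within the lease)
for _ in range(LEASE):
    next(clock)
print(read("x"))                  # 2, after self-invalidation
```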
ISBN: 9781665497473 (print)
The complexity of memory systems has increased considerably over the past decade. Supercomputers may now include several levels of heterogeneous and non-uniform memory, with significantly different properties in terms of performance, capacity, persistence, etc. Developers of scientific applications face a huge challenge: efficiently exploiting the memory system to improve performance while keeping productivity high by using portable solutions. In this work, we present a new API and a method to manage the complexity of modern memory systems. Our portable and abstracted API is designed to identify memory kinds and describe hardware characteristics using metrics, for example bandwidth, latency, and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the appropriate memory by expressing their needs for each allocation without having to modify the code for each platform. Furthermore, we present a survey of existing ways to determine the sensitivity of application buffers using static code analysis, profiling, and benchmarking. We show in a use case that combining these approaches with our API indeed enables a portable and productive method to match application requirements and hardware memory characteristics.
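A hypothetical sketch of such an API is shown below; the names, memory kinds, and metric values are illustrative, not the paper's actual interface. Allocations state their needs (bandwidth, latency, or capacity), and the library matches them against the memory kinds discovered on the platform, so the application code stays the same across machines.
```python
# Illustrative sketch of metric-driven memory selection (not the paper's API).
from dataclasses import dataclass

@dataclass
class MemoryKind:
    name: str
    bandwidth_gbs: float
    latency_ns: float
    capacity_gb: float

# Discovered at runtime on the actual platform; these values are made up.
KINDS = [
    MemoryKind("DDR", 180.0, 90.0, 512.0),
    MemoryKind("HBM", 800.0, 120.0, 64.0),
    MemoryKind("NVM", 40.0, 300.0, 2048.0),
]

def alloc(size_gb, prefer="bandwidth"):
    """Pick the best-fitting memory kind for a stated preference."""
    fits = [k for k in KINDS if k.capacity_gb >= size_gb]
    if prefer == "bandwidth":
        best = max(fits, key=lambda k: k.bandwidth_gbs)
    elif prefer == "latency":
        best = min(fits, key=lambda k: k.latency_ns)
    else:                              # "capacity"
        best = max(fits, key=lambda k: k.capacity_gb)
    return best.name                   # a real API would return a buffer

print(alloc(32, prefer="bandwidth"))   # HBM
print(alloc(32, prefer="latency"))     # DDR
print(alloc(1024, prefer="capacity"))  # NVM
```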
ISBN: 9798350364613; 9798350364606 (print)
The GraphBLAS C API is mature, with an updated specification (version 2.1) and a compliant implementation (SuiteSparse GraphBLAS). We are now focused on GraphBLAS 3.0, the next major GraphBLAS revision. Potential changes include: (1) a separate math spec to make new language bindings easier to write, (2) better support for user-defined types, rank promotion, and enhanced non-blocking execution, (3) an expanded scope of GraphBLAS to address a wider range of applications, and (4) support for complex heterogeneous and distributed systems.
ISBN: 9798350329223 (print)
The proceedings contain 153 papers. The topics discussed include: transaction data management optimization based on multi-partitioning in blockchain systems; semi-asynchronous federated learning optimized for non-IID data communication based on tensor decomposition; HKTGNN: hierarchical knowledge transferable graph neural network-based supply chain risk assessment; DQR-TTS: semi-supervised text-to-speech synthesis with dynamic quantized representation; deep reinforcement learning-based network moving target defense in DPDK; iNUMAlloc: towards intelligent memory allocation for AI accelerators with NUMA; and predictive queue-based low-latency congestion detection in data center networks.
ISBN: 9781665497473 (print)
We develop a family of parallel algorithms for the SpKAdd operation, which adds a collection of k sparse matrices. SpKAdd is a much-needed operation in many applications, including distributed-memory sparse matrix-matrix multiplication (SpGEMM), streaming accumulation of graphs, and algorithmic sparsification of the gradient updates in deep learning. While adding two sparse matrices is a common operation in MATLAB, Python, Intel MKL, and various GraphBLAS libraries, these implementations do not perform well when adding a large collection of sparse matrices. We develop a series of algorithms using tree merging, heap, sparse accumulator, hash table, and sliding hash table data structures. Among them, the hash-based algorithms attain the theoretical lower bounds on both the computational and I/O complexities and perform the best in practice. The newly developed hash SpKAdd makes the computation of a distributed-memory SpGEMM algorithm at least 2x faster than previous state-of-the-art algorithms.
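The hash-based approach can be sketched as a single-pass sparse accumulator. This toy version ignores the paper's per-row parallelization and cache-aware tuning, but it shows why hashing matches the computational lower bound: each stored nonzero is touched exactly once.
```python
# Hash-accumulator sketch of k-way sparse matrix addition (SpKAdd's core
# idea in miniature): one pass over all inputs, one hash update per nonzero.
from collections import defaultdict

def spkadd(matrices):
    """matrices: list of dicts mapping (row, col) -> value (COO-like)."""
    acc = defaultdict(float)           # hash table as sparse accumulator
    for m in matrices:
        for (i, j), v in m.items():
            acc[(i, j)] += v
    # Drop entries that cancelled out exactly (optional in practice).
    return {k: v for k, v in acc.items() if v != 0.0}

A = {(0, 0): 1.0, (0, 2): 2.0}
B = {(0, 0): -1.0, (1, 1): 3.0}
C = {(0, 2): 4.0}
print(spkadd([A, B, C]))               # {(0, 2): 6.0, (1, 1): 3.0}
```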
ISBN: 9781665481069 (print)
Disaggregated architecture brings new opportunities to memory-consuming applications like graph processing. It allows one to spread memory-access pressure from local to far memory, providing an attractive alternative to disk-based processing. Although existing work on general-purpose far-memory platforms shows great potential for application expansion, it is unclear how graph processing applications could benefit from disaggregated architecture, and how different optimization methods influence overall performance. In this paper, we take the first step toward analyzing the impact of graph processing workloads on disaggregated architecture by extending the GridGraph framework on top of an RDMA-based far-memory system. We design Fargraph, a far-memory coordination strategy for enhancing graph processing workloads. Specifically, Fargraph reduces overall data movement through a well-crafted, graph-aware data segment offloading mechanism. In addition, we use optimal data segment splitting and asynchronous data buffering to achieve graph-iteration-friendly far-memory access. We show that Fargraph achieves near-oracle performance relative to typical in-local-memory graph processing systems. Fargraph shows up to 8.3x speedup over Fastswap, the state-of-the-art general-purpose far-memory platform.
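The overlap of far-memory fetches with computation can be sketched as below. This is a conceptual illustration, not Fargraph's implementation: fetch_from_far_memory and process are hypothetical stand-ins for RDMA reads and one iteration's work over a segment, and a real system would also evict segments to respect the local-memory budget.
```python
# Conceptual sketch of segment pinning plus asynchronous prefetching
# over far memory (illustrative only, not Fargraph's code).
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_from_far_memory(segment_id):
    time.sleep(0.01)                   # stand-in for an RDMA read
    return f"segment-{segment_id}"

def process(segment):
    time.sleep(0.01)                   # stand-in for one iteration's compute

def iterate(segment_ids, local_budget=2):
    # Graph-aware placement: the most reused segments stay resident locally.
    local = {s: fetch_from_far_memory(s) for s in segment_ids[:local_budget]}
    with ThreadPoolExecutor(max_workers=1) as pool:
        future, nxt = None, None
        for idx, sid in enumerate(segment_ids):
            # Kick off the next segment's fetch before it is needed...
            if idx + 1 < len(segment_ids) and segment_ids[idx + 1] not in local:
                nxt = segment_ids[idx + 1]
                future = pool.submit(fetch_from_far_memory, nxt)
            seg = local[sid] if sid in local else fetch_from_far_memory(sid)
            process(seg)               # ...so compute overlaps the fetch.
            if future is not None:
                local[nxt] = future.result()  # buffered for the next step
                future = None

iterate([0, 1, 2, 3, 4])
```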