ISBN (print): 9781665481069
I/O-intensive applications are important workloads of public clouds. Multiple cloud applications co-run on the same physical machine in different virtual machines (VMs), and the shared resources (e.g., disk bandwidth) are often isolated for fairness. Our investigation shows that the performance of an I/O-intensive application is impacted by both the disk bandwidth allocation and the page cache settings in the guest operating system. However, no prior work considers adjusting the page cache settings for better performance when the disk bandwidth allocation is adjusted. We therefore propose CSC, a system that collaboratively identifies the appropriate disk bandwidth allocation and page cache settings in the guest operating system of each VM. CSC aims to improve the system-wide I/O throughput of the physical machine, while also improving the I/O throughput of each individual I/O-intensive application in the VMs. CSC comprises an online disk bandwidth allocator and an adaptive dirty page setting optimizer. The bandwidth allocator monitors disk bandwidth utilization and periodically re-allocates bandwidth from idle VMs to busy VMs. After the re-allocation, the optimizer identifies the appropriate dirty page settings in the guest operating system of the VMs using Bayesian Optimization. The experimental results show that CSC improves the performance of I/O-intensive applications by 9.5% on average (up to 17.29%) when 5 VMs are co-located, while fairness is guaranteed.
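To make the dirty-page tuning step concrete, the following minimal sketch searches over the standard Linux sysctls vm.dirty_background_ratio and vm.dirty_ratio with off-the-shelf Bayesian optimization (scikit-optimize). The choice of knobs, the search ranges, and the measure_throughput() helper are illustrative assumptions; this is not CSC's implementation, which tunes the settings inside each VM's guest OS.

    # Illustrative sketch only: Bayesian optimization over two standard
    # Linux dirty-page sysctls. The knob choice, ranges, and benchmark
    # hook are assumptions, not CSC's actual optimizer.
    import subprocess
    from skopt import gp_minimize
    from skopt.space import Integer

    def set_dirty_settings(background_ratio, dirty_ratio):
        subprocess.run(["sysctl", "-w", f"vm.dirty_background_ratio={background_ratio}"], check=True)
        subprocess.run(["sysctl", "-w", f"vm.dirty_ratio={dirty_ratio}"], check=True)

    def measure_throughput():
        # Placeholder: run the I/O workload of interest and return MB/s.
        raise NotImplementedError

    def objective(params):
        background_ratio, dirty_ratio = params
        if background_ratio >= dirty_ratio:   # keep the settings self-consistent
            return 1e9
        set_dirty_settings(background_ratio, dirty_ratio)
        return -measure_throughput()          # gp_minimize minimizes, so negate

    result = gp_minimize(objective, dimensions=[Integer(1, 40), Integer(5, 80)], n_calls=20)
    print("best settings:", result.x, "best throughput:", -result.fun)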
ISBN (print): 9781665497473
Linear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said kernels; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether or not they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks and provide end users with guidelines on how to achieve performance gains.
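The two missed optimizations named above are easy to see with a toy measurement. The sketch below uses NumPy for brevity (TF and PyT dispatch to the same BLAS kernels, so the operation-count argument is the same); the sizes are illustrative and this is not one of the paper's benchmarks.

    # Distributive law: A@B + A@C needs two GEMMs; A@(B + C) needs one
    # GEMM plus a cheap addition. Chain ordering: (A@B)@v costs O(n^3),
    # while A@(B@v) costs only O(n^2).
    import time
    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    A, B, C = (rng.standard_normal((n, n)) for _ in range(3))
    v = rng.standard_normal((n, 1))

    def bench(fn):
        t0 = time.perf_counter()
        fn()
        return time.perf_counter() - t0

    print("A@B + A@C:", bench(lambda: A @ B + A @ C))
    print("A@(B + C):", bench(lambda: A @ (B + C)))
    print("(A@B)@v  :", bench(lambda: (A @ B) @ v))
    print("A@(B@v)  :", bench(lambda: A @ (B @ v)))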
An optimized mathematical model has been developed to streamline gas turbine production by targeting the minimization of the overall completion time, known as makespan. The production process is broken down into a ser...
ISBN (print): 9781665497473
Coarse-Grained Reconfigurable Architecture (CGRA) is a promising platform for HPC systems in the post-Moore's era. A single-source programming model is essential for practical heterogeneous computing. However, there is no canonical programming model or frontend compiler for CGRAs. The diversity of existing CGRAs, with respect to their execution model, computational capability, and system structure, magnifies the difficulty of orchestrating compiler techniques. This forces CGRA designers to develop compilers from scratch that work only for their own architectures, an outdated approach compared with other successful accelerators such as GPUs and FPGAs. This paper presents a new CGRA compiler framework that reduces the development effort of CGRA applications. OpenMP-annotated code is fed into the proposed compiler, as recent versions of OpenMP support device offloading to accelerators; this improves the reusability of existing source code for HPC workloads. The design of the compiler is inspired by LLVM, the most widely used compiler framework, and the frontend is built to be architecture-independent. In this work, we demonstrate that the proposed compiler can handle different types of CGRAs without changing the source code. In addition, we discuss the effect of architecture-independent optimization algorithms. We also provide an open-source implementation of the compiler framework at https://***/hal-lab-u-tokyo/CGRAOmp.
ISBN (print): 9781665481069
Finding patterns in large highly connected datasets is critical for value discovery in business development and scientific research. This work focuses on the problem of subgraph matching on streaming graphs, which provides utility in a myriad of real-world applications ranging from social network analysis to cybersecurity. Each application poses a different set of control parameters, including the restrictions for a match, the type of data stream, and the search granularity. The problem-driven design of existing subgraph matching systems makes them challenging to apply to different problem domains. This paper presents Mnemonic, a programmable system that provides a high-level API and democratizes the development of a wide variety of subgraph matching solutions. Importantly, Mnemonic also delivers key data management capabilities and optimizations to support real-time processing on long-running, high-velocity multi-relational graph streams. The experiments demonstrate the versatility of Mnemonic, as it outperforms several state-of-the-art systems by up to two orders of magnitude.
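For readers unfamiliar with the base problem, the snippet below shows a one-shot (static) subgraph match with NetworkX; it is purely illustrative and is not Mnemonic's API. Mnemonic targets the harder streaming setting, where edges arrive continuously and matches must be reported incrementally under application-specific match restrictions.

    # Static subgraph matching for context (not Mnemonic's API):
    # find all occurrences of a triangle pattern inside the data graph.
    import networkx as nx
    from networkx.algorithms import isomorphism

    data_graph = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])
    pattern = nx.Graph([("a", "b"), ("b", "c"), ("c", "a")])   # a triangle

    matcher = isomorphism.GraphMatcher(data_graph, pattern)
    for mapping in matcher.subgraph_isomorphisms_iter():
        print(mapping)   # e.g. {1: 'a', 2: 'b', 3: 'c'} and its symmetries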
ISBN (print): 9781665481069
Finding the biconnected components of a graph has a large number of applications in many other graph problems, including planarity testing, computing centrality metrics, finding the (weighted) vertex cover, coloring, and the like. Recent years have seen the design of efficient algorithms for this problem across sequential and parallel computational models. However, current algorithms do not work in the setting where the underlying graph changes over time in a dynamic manner via the insertion or deletion of edges. Dynamic algorithms in the sequential setting that obtain the biconnected components of a graph upon insertion or deletion of a single edge have been known for over two decades. Parallel algorithms for this problem, however, are not heavily studied. In this paper, we design shared-memory parallel algorithms that obtain the biconnected components of a graph subsequent to the insertion or deletion of a batch of edges. Our algorithms can hence exploit the parallelism afforded by a batch of updates. We implement our algorithms on an AMD EPYC 7742 CPU with 128 cores. Our experiments on a collection of 10 real-world graphs from multiple classes indicate that our algorithms outperform parallel state-of-the-art static algorithms.
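A small static example (NetworkX, not the paper's parallel code) shows why edge updates are disruptive: a single inserted edge can merge every biconnected component of a path into one, which is exactly the kind of change the proposed batch-dynamic algorithms maintain in parallel.

    # Static illustration only; the paper maintains this information
    # dynamically and in parallel for batches of edge updates.
    import networkx as nx

    G = nx.path_graph(5)   # 0-1-2-3-4: each edge is its own biconnected component
    print(list(nx.biconnected_components(G)))

    G.add_edge(4, 0)       # closing the cycle merges all components into one
    print(list(nx.biconnected_components(G)))   # [{0, 1, 2, 3, 4}]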
ISBN (print): 9781665481069
Distributed data analysis frameworks are widely used for processing large datasets generated by instruments in scientific fields such as astronomy, genomics, and particle physics. Such frameworks partition petabyte-size datasets into chunks and execute many parallel tasks to search for common patterns, locate unusual signals, or compute aggregate properties. When well-configured, such frameworks make it easy to churn through large quantities of data on large clusters. However, configuring frameworks presents a challenge for end users, who must select a variety of parameters such as the blocking of the input data, the number of tasks, the resources allocated to each task, and the size of the nodes on which they run. If poorly configured, the result may perform many orders of magnitude worse than optimal, or the application may even fail to make progress at all. Even if a good configuration is found through painstaking observations, the performance may change drastically when the input data or analysis kernel changes. This paper considers the problem of automatically configuring a data analysis application for high energy physics (TopEFT) built upon standard frameworks for physics analysis (Coffea) and distributed tasking (Work Queue). We observe the inherent variability within the application, demonstrate the problems of poor configuration, and then develop several techniques for automatically sizing tasks to meet goals for resource consumption and overall application completion.
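The core idea behind automatic task sizing can be sketched in a few lines: choose the events per task so the estimated footprint fits a per-task memory budget, and split any task that still exhausts memory. The function names and numbers below are illustrative assumptions, not the paper's implementation or the Work Queue API.

    # Illustrative sketch of memory-driven task sizing (assumed names and values).
    def plan_tasks(total_events, bytes_per_event, mem_budget_bytes):
        events_per_task = max(1, mem_budget_bytes // bytes_per_event)
        n_tasks = -(-total_events // events_per_task)   # ceiling division
        return events_per_task, n_tasks

    def run_with_resizing(run_task, chunk):
        # Retry a failing (out-of-memory) task on two smaller halves.
        try:
            return [run_task(chunk)]
        except MemoryError:
            mid = len(chunk) // 2
            return (run_with_resizing(run_task, chunk[:mid]) +
                    run_with_resizing(run_task, chunk[mid:]))

    print(plan_tasks(total_events=10_000_000, bytes_per_event=2_000,
                     mem_budget_bytes=4 * 2**30))   # events per task, task count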
ISBN (print): 9781665481069
We present and evaluate TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control- and data-flowgraph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared-memory parallel environments; a few support distributed-memory environments, either by discovering the entire DAG of tasks on all processes or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability, by providing higher-level abstractions than conventionally provided by task-centric programming systems without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed-memory execution over two different task runtimes, PaRSEC and MADNESS. The performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integro-differential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to state-of-the-art implementations.
ISBN (print): 9781665476522
Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work with prefetching, decoupling, or pipelining can mitigate memory latency and improve core utilization, memory bottlenecks persist due to limited off-chip bandwidth. Approaches that do processing in memory (PIM) with the Hybrid Memory Cube (HMC) overcome bandwidth limitations but fail to achieve high core utilization due to poor task scheduling and synchronization overheads. Moreover, the high memory-per-core ratio available with HMC limits strong scaling. We introduce Dalorex, a hardware-software co-design that achieves high parallelism and energy efficiency, demonstrating strong scaling with more than 16,000 cores when processing graph and sparse linear algebra workloads. Compared with prior PIM work, both using 256 cores, Dalorex improves performance and reduces energy consumption by two orders of magnitude through (1) a tile-based distributed-memory architecture where each processing tile holds an equal amount of data and all memory operations are local; (2) a task-based parallel programming model where tasks are executed by the processing unit that is co-located with the target data; (3) a network design optimized for irregular traffic, where all communication is one-way and messages do not contain routing metadata; (4) novel traffic-aware task scheduling hardware that maintains high core utilization; and (5) a data-placement strategy that improves work balance. This work proposes architectural and software innovations to provide the greatest scalability to date for running graph algorithms while still being programmable for other domains.
ISBN (print): 9781665481069
The GPU programming model is primarily aimed at the development of applications that run on one GPU. However, this limits the scalability of GPU code to the capabilities of a single GPU in terms of compute power and memory capacity. To scale GPU applications further, a great engineering effort is typically required: work and data must be divided over multiple GPUs by hand, possibly in multiple nodes, and data must be manually spilled from GPU memory to higher-level memories. We present Lightning: a framework that follows the common GPU programming paradigm but enables scaling to large problems with ease. Lightning supports multi-GPU execution of GPU kernels, even across multiple nodes, and seamlessly spills data to higher-level memories (main memory and disk). Existing CUDA kernels can easily be adapted for use in Lightning, with data access annotations on these kernels allowing Lightning to infer their data requirements and the dependencies between subsequent kernel launches. Lightning efficiently distributes the work and data across GPUs and maximizes efficiency by overlapping scheduling, data movement, and kernel execution when possible. We present the design and implementation of Lightning, as well as experimental results on up to 32 GPUs for eight benchmarks and one real-world application. The evaluation shows excellent performance and scalability, such as a speedup of 57.2x over the CPU when using Lightning with 16 GPUs across 4 nodes and 80 GB of data, far beyond the memory capacity of one GPU.
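To illustrate the manual effort that Lightning removes, the following plain multi-GPU sketch (CuPy, not Lightning's API) partitions one array by hand and launches the same elementwise computation on each device; every placement decision, transfer, and gather step is the programmer's responsibility, and nothing overlaps or spills automatically.

    # Hand-rolled multi-GPU split with CuPy, shown only as the baseline
    # that Lightning automates; this is not Lightning code.
    import numpy as np
    import cupy as cp

    n_gpus = cp.cuda.runtime.getDeviceCount()
    x = np.arange(1_000_000, dtype=np.float32)
    chunks = np.array_split(x, n_gpus)

    parts = []
    for dev, chunk in enumerate(chunks):
        with cp.cuda.Device(dev):          # manual placement per GPU
            d = cp.asarray(chunk)          # manual host-to-device copy
            parts.append(cp.asnumpy(d * 2.0 + 1.0))   # manual copy back

    y = np.concatenate(parts)              # manual gather on the host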