ISBN: (Print) 9781509021413
The duality between graphs and matrices means that many common graph analyses can be expressed with primitives such as generalized sparse matrix-vector multiplication (SpMSpV) and sparse matrix-matrix multiplication (SpGEMM). Achieving high performance on these primitives is challenging due to limited arithmetic intensity, irregular memory accesses, and significant network communication requirements in the distributed setting. In this paper we implement four graph applications using GraphPad, our optimized multinode implementation of generalized linear algebra primitives such as SpMSpV and SpGEMM. GraphPad is highly flexible, accommodating multiple data layouts and partitioning strategies, and incorporates communication optimizations. Our performance at scale can exceed that of CombBLAS by up to 40×. In addition to its distributed performance, GraphPad is within 2× of the performance of GraphMat, a high-performance single-node graph framework, on four out of five benchmarks. We also show that our communication optimizations and flexibility are critical for good performance on both HPC clusters and commodity cloud platforms.
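A single-node sketch of the graph/matrix duality the abstract relies on (this is plain SciPy, not GraphPad's API, which the abstract does not show): one level of breadth-first search is a sparse matrix-vector product between the adjacency matrix and the current frontier vector.

# Minimal illustration of the graph/matrix duality: each BFS level is an
# adjacency-matrix / frontier-vector product (single node, SciPy only).
import numpy as np
from scipy.sparse import csr_matrix

# Tiny directed graph: 0->1, 0->2, 1->3, 2->3
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 3, 3])
A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

frontier = np.zeros(4)
frontier[0] = 1.0                        # start BFS from vertex 0
visited = frontier.astype(bool)

while frontier.any():
    print("frontier:", np.nonzero(frontier)[0])
    reached = A.T.dot(frontier)          # vertices reachable in one hop
    frontier = np.where(visited, 0.0, np.minimum(reached, 1.0))
    visited |= frontier.astype(bool)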
ISBN: (Print) 9781509036837
Hardware performance counters are used as effective proxies to estimate power consumption and runtime. In this paper we present a performance counter-based power and performance modeling and optimization method, and use the method to model four metrics: runtime, system power, CPU power and memory power. The performance counters that compose the models are used to explore some counter-guided optimizations with two large-scale scientific applications: an earthquake simulation and an aerospace application. We demonstrate the use of the method using two power-aware supercomputers, Mira at Argonne National Laboratory and SystemG at Virginia Tech. The counter-guided optimizations result in a reduction in energy by an average of 18.28% on up to 32,768 cores on Mira and 11.28% on up to 128 cores on SystemG for the aerospace application. For the earthquake simulation, the average energy reductions achieved are 48.65% on up to 4,096 cores on Mira and 30.67% on up to 256 cores on SystemG.
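The abstract does not give the model form, but performance-counter power models of this kind are often simple regressions over a few counters. Below is a hypothetical scikit-learn sketch; the counter names and all numbers are invented for illustration and are not the paper's data.

# Hypothetical counter-to-power regression sketch (illustrative data only;
# the paper's actual model form and counter set are not given in the abstract).
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows: profiled samples; columns: example counter rates
# [instructions retired, L2 misses, DRAM accesses] per second (made-up values).
counters = np.array([
    [2.1e9, 4.0e6, 1.2e7],
    [1.8e9, 9.5e6, 3.4e7],
    [2.6e9, 2.2e6, 0.8e7],
    [1.2e9, 1.4e7, 5.1e7],
])
measured_power_w = np.array([95.0, 110.0, 92.0, 123.0])   # made-up wattage

model = LinearRegression().fit(counters, measured_power_w)
new_sample = np.array([[2.0e9, 6.0e6, 2.0e7]])
print("predicted power (W):", model.predict(new_sample)[0])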
In this paper, we propose a hybrid CPU+GPU data structure that optimizes search operations for frequently accessed search keys. It is based on the working-set structure due to Badiu et al. [1]. The main idea is to m...
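Because the abstract is cut off, only the underlying working-set idea is sketched here: keys that were searched recently are promoted into a small fast structure that is consulted first on later searches. The hybrid CPU+GPU split from the paper is not reproduced, and all names below are invented for the illustration.

# Illustration of the working-set idea only: recently searched keys are
# promoted into a small "hot" dict that is consulted before the full store.
from collections import OrderedDict

class WorkingSetDict:
    def __init__(self, hot_capacity=4):
        self.hot = OrderedDict()       # small, fast, recently-searched keys
        self.cold = {}                 # full key/value store
        self.hot_capacity = hot_capacity

    def insert(self, key, value):
        self.cold[key] = value

    def search(self, key):
        if key in self.hot:            # fast path: recently searched
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold.get(key)     # slow path: full store
        if value is not None:
            self.hot[key] = value      # promote into the working set
            if len(self.hot) > self.hot_capacity:
                self.hot.popitem(last=False)
        return value

d = WorkingSetDict()
for k in range(100):
    d.insert(k, k * k)
print(d.search(7), d.search(7))        # second search hits the hot path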
ISBN: (Print) 9781509036837
This paper studies the effects on energy consumption, power draw, and runtime of a modern compute GPU when changing the core and memory clock frequencies, enabling or disabling ECC, using alternate implementations, and varying the program inputs. We evaluate 34 applications from 5 benchmark suites and measure their power draw over time on a K20c GPU. Our results show that changing the frequency or the program implementation can alter the energy, power, and performance by a factor of two or more. Interestingly, some changes affect these three aspects very unevenly. ECC can greatly increase the runtime and energy consumption, but only on memory-bound codes. Compute-bound codes tend to behave quite differently from memory-bound codes, in particular regarding their power draw. On irregular programs, a small change in frequency can result in a large change in runtime and energy consumption.
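Power-over-time measurements of this kind can be collected through NVIDIA's NVML interface; the sketch below samples board power with the pynvml bindings. It assumes pynvml is installed and GPU index 0 is the device of interest, and it is not the measurement setup used in the paper.

# Sample GPU board power with NVML (assumes pynvml is installed and GPU 0
# is the device of interest; not the paper's measurement setup).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
t_end = time.time() + 5.0                         # sample for ~5 seconds
while time.time() < t_end:
    mw = pynvml.nvmlDeviceGetPowerUsage(handle)   # milliwatts
    samples.append(mw / 1000.0)
    time.sleep(0.05)

pynvml.nvmlShutdown()
print("samples:", len(samples))
print("mean power (W): %.1f  peak power (W): %.1f" % (
    sum(samples) / len(samples), max(samples)))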
ISBN: (Print) 9781509036837
High-level tools for analyzing and predicting the performance of GPU-accelerated applications are scarce, at best. Although performance modeling approaches for GPUs exist, their complexity makes them virtually impossible to use to quickly analyze the performance of real life applications and obtain easy-to-use, readable feedback. This is why, although GPUs are significant performance boosters in many HPC domains, performance prediction is still based on extensive benchmarking, and performance bottleneck analysis remains a nonsystematic, experience-driven process. In this context, we propose a tool for bottleneck analysis and performance prediction for GPU-accelerated applications. Based on random forest modeling, and using hardware performance counter data, our method can be used to quickly and accurately evaluate application performance on GPU-based systems for different problem characteristics and different hardware generations. We illustrate the benefits of our approach with three detailed use cases: a simple step-by-step example on a parallel reduction kernel, and two classical benchmarks (matrix multiplication and sequence alignment). Our results so far indicate that our statistical modeling is a quick, easy-to-use method to grasp the performance characteristics of applications running on GPUs. Our current work focuses on tackling some of its applicability limitations (more applications, more platforms) and improving its usability (full automation from input to user feedback).
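The modeling ingredient named in the abstract (a random forest over hardware performance counter data) can be sketched with scikit-learn. The counter features and runtimes below are synthetic and purely illustrative; the paper's actual feature set and training procedure are not given in the abstract.

# Random-forest runtime prediction from counter features (synthetic data;
# the paper's actual features and training set are not shown in the abstract).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Example features per kernel run: [threads launched, global loads,
# shared-memory transactions, branch count] -- invented for this sketch.
X = rng.uniform(1e3, 1e7, size=(200, 4))
y = 1e-6 * X[:, 1] + 5e-7 * X[:, 2] + rng.normal(0, 0.5, 200)  # synthetic runtime (ms)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:150], y[:150])
pred = model.predict(X[150:])
print("mean abs. error (ms): %.3f" % np.mean(np.abs(pred - y[150:])))
print("feature importances:", model.feature_importances_)  # bottleneck hints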
ISBN: (Print) 9781509021413
Many high-performance distributed memory applications rely on point-to-point messaging using the Message Passing Interface (MPI). Due to the latency of the network, and other costs, this communication can limit the scalability of an application when run on high node counts of distributed memory supercomputers. Communication costs are further increased on modern multi- and many-core architectures, when using more than one MPI process per node, as each process sends and receives messages independently, inducing multiple latencies and contention for resources. In this paper, we use shared memory constructs available in the MPI 3.0 standard to implement an aggregated communication method to minimize the number of inter-node messages and reduce these costs. We compare the performance of this Minimal Aggregated SHared Memory (MASHM) messaging to the standard point-to-point implementation on large-scale supercomputers, where we see that MASHM leads to enhanced strong scalability of a weighted Jacobi relaxation. For this application, we also see that the use of shared memory parallelism through MASHM and MPI 3.0 can be more efficient than using Open Multi-Processing (OpenMP). We then present a model for the communication costs of MASHM which shows that this method achieves its goal of reducing latency costs while also reducing bandwidth costs. Finally, we present MASHM as an open source library to facilitate the integration of this efficient communication method into existing distributed memory applications.
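The MPI 3.0 shared-memory constructs the method builds on can be exercised from mpi4py as shown below: ranks on the same node allocate a shared window and write into it directly, which is the mechanism that allows one rank per node to send aggregated inter-node messages. This is a generic illustration of the MPI 3.0 feature, not the MASHM library's API.

# Generic MPI 3.0 shared-memory window example (mpi4py); illustrates the
# mechanism MASHM builds on, not the MASHM API itself.
# Run with e.g.: mpiexec -n 4 python shm_window.py
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
node = world.Split_type(MPI.COMM_TYPE_SHARED)       # ranks sharing a node

n = node.Get_size()
itemsize = MPI.DOUBLE.Get_size()
# Rank 0 on each node allocates one slot per node-local rank.
size = n * itemsize if node.Get_rank() == 0 else 0
win = MPI.Win.Allocate_shared(size, itemsize, comm=node)

buf, _ = win.Shared_query(0)                        # map rank 0's segment
shared = np.ndarray(buffer=buf, dtype='d', shape=(n,))

win.Fence()
shared[node.Get_rank()] = float(world.Get_rank())   # direct write, no messages
win.Fence()

if node.Get_rank() == 0:
    # This rank could now pack `shared` into a single inter-node message.
    print("node-local contributions:", shared)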
ISBN: (Print) 9781509036837
As the data-driven economy evolves, enterprises have come to realize a competitive advantage in being able to act on high volume, high velocity streams of data. Technologies such as distributed message queues and stream processing platforms, which can scale to thousands of data stream partitions on commodity hardware, have emerged in response. However, the programming API provided by these systems is often low-level, requiring substantial custom code that adds to the programmer learning curve and maintenance overhead. Additionally, these systems often lack the SQL querying capabilities that have proven popular on Big Data systems like Hive, Impala or Presto. We define a minimal set of extensions to standard SQL for data stream querying and manipulation. These extensions are prototyped in SamzaSQL, a new tool for streaming SQL that compiles streaming SQL into physical plans that are executed on Samza, an open-source distributed stream processing framework. We compare the performance of streaming SQL queries against native Samza applications and discuss usability improvements. SamzaSQL is a part of the open source Apache Samza project and will be available for general use.
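The SQL extensions themselves are not shown in the abstract, so the sketch below is only a plain-Python illustration of the kind of computation a streaming SQL dialect expresses declaratively: a per-key count over a tumbling time window. The event fields and the 10-second window are invented for the example.

# Plain-Python illustration of a per-key count over a tumbling window -- the
# kind of query a streaming SQL dialect expresses declaratively. Field names
# and the 10-second window are invented for this sketch.
from collections import Counter, defaultdict

events = [
    {"ts": 1, "page": "/home"}, {"ts": 3, "page": "/cart"},
    {"ts": 8, "page": "/home"}, {"ts": 12, "page": "/home"},
    {"ts": 14, "page": "/cart"}, {"ts": 21, "page": "/home"},
]
WINDOW = 10  # seconds

windows = defaultdict(Counter)
for e in events:                       # one pass, as a stream processor would
    windows[e["ts"] // WINDOW][e["page"]] += 1

for w, counts in sorted(windows.items()):
    print("window [%d, %d):" % (w * WINDOW, (w + 1) * WINDOW), dict(counts))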
ISBN: (Print) 9781509036837
Suzaku is a pattern programming framework that enables programmers to create pattern-based parallel MPI programs without writing the MPI message-passing code implicit in the patterns. The purpose of this framework is to simplify message-passing programming and create better structured programs based upon established parallel design patterns. The focus for developing Suzaku is on teaching parallel programming. This paper covers the main features of Suzaku and describes our experiences using it in parallel programming classes.
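Suzaku's own API is not shown in the abstract; the sketch below is a generic scatter/compute/gather pattern written in explicit mpi4py, i.e., the style of message-passing code that a pattern framework generates or hides from the student.

# Generic scatter/compute/gather pattern in explicit MPI (mpi4py) -- the kind
# of message-passing code a pattern framework such as Suzaku hides. Not Suzaku's API.
# Run with e.g.: mpiexec -n 4 python scatter_gather.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    tasks = list(range(20))
    chunks = [tasks[i::size] for i in range(size)]   # round-robin split
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)                 # explicit distribution
partial = sum(x * x for x in chunk)                  # local computation
results = comm.gather(partial, root=0)               # explicit collection

if rank == 0:
    print("sum of squares 0..19 =", sum(results))    # 2470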
This paper describes the use of domain decomposition methods for accelerating wave physics simulation. Numerical wave-based methods provide more accurate simulation than geometrical methods, but at a higher computational cost. In the context of virtual reality, the quality of the results is estimated according to human perception, which makes geometrical methods an interesting approach for achieving real-time physically-based rendering. Here, we investigate a geometrical method based on both beam and ray tracing, which we enhance with two levels of parallel processing. Techniques from domain decomposition methods are coupled with classical parallel computing on both shared and distributed memory. Experiments with both optical and acoustic rendering evaluate the acceleration impact of the domain decomposition scheme. Speedup measurements clearly show the efficiency of using domain decomposition methods for real-time simulation of wave physics. Copyright (C) 2015 John Wiley & Sons, Ltd.
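A heavily reduced toy of the parallel structure described (the beam tracing, the perceptual evaluation, and the shared-memory level are all omitted): a 1-D "scene" is split into per-rank subdomains, each MPI rank handles the rays originating in its subdomain, and the per-rank contributions are combined. The "propagation" formula and all names are placeholders invented for this sketch.

# Toy domain-decomposition sketch: split a 1-D "scene" into per-rank
# subdomains, process the rays originating in each subdomain locally, and
# combine the per-rank energy contributions. The physics is a placeholder.
# Run with e.g.: mpiexec -n 4 python dd_rays.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Scene spans [0, 1); rank r owns the subdomain [r/size, (r+1)/size).
rng = np.random.default_rng(42)                   # same ray set on every rank
origins = rng.uniform(0.0, 1.0, 10_000)
mine = origins[(origins * size).astype(int) == rank]

# Placeholder "propagation": received energy decays with distance to a
# listener at x = 0.5 (stands in for the real beam/ray tracing step).
local_energy = np.sum(1.0 / (1.0 + np.abs(mine - 0.5)))

total = comm.reduce(local_energy, op=MPI.SUM, root=0)
if rank == 0:
    print("rays: %d, combined energy: %.2f" % (len(origins), total))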