ISBN: (Print) 9781509036837
Hardware performance counters are used as effective proxies to estimate power consumption and runtime. In this paper we present a performance counter-based power and performance modeling and optimization method, and use the method to model four metrics: runtime, system power, CPU power and memory power. The performance counters that compose the models are used to explore some counter-guided optimizations with two large-scale scientific applications: an earthquake simulation and an aerospace application. We demonstrate the use of the method using two power-aware supercomputers, Mira at Argonne National Laboratory and SystemG at Virginia Tech. The counter-guided optimizations result in a reduction in energy by an average of 18.28% on up to 32,768 cores on Mira and 11.28% on up to 128 cores on SystemG for the aerospace application. For the earthquake simulation, the average energy reductions achieved are 48.65% on up to 4,096 cores on Mira and 30.67% on up to 256 cores on SystemG.
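The counter-based power modeling described above can be sketched as a linear fit of measured power against counter rates. The counters, sample values, and model form below are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

# Synthetic "runs": rows are executions, columns are hypothetical counter
# rates (e.g. instructions/s, cache misses/s) -- names are illustrative only.
counters = np.array([
    [1.0e9, 2.0e6],
    [2.0e9, 1.0e6],
    [1.5e9, 4.0e6],
    [0.5e9, 8.0e6],
])
measured_power = np.array([110.0, 150.0, 140.0, 90.0])  # watts

# Fit power ≈ w0 + w1*c1 + w2*c2 with ordinary least squares.
X = np.hstack([np.ones((counters.shape[0], 1)), counters])
coef, *_ = np.linalg.lstsq(X, measured_power, rcond=None)

predicted = X @ coef
print(np.round(predicted, 1))
```

Once fitted, such a model predicts power for new runs from their counters alone, which is what makes counter-guided optimization possible without per-run power instrumentation.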
ISBN: (Print) 9781509036837
This paper studies the effects on energy consumption, power draw, and runtime of a modern compute GPU when changing the core and memory clock frequencies, enabling or disabling ECC, using alternate implementations, and varying the program inputs. We evaluate 34 applications from 5 benchmark suites and measure their power draw over time on a K20c GPU. Our results show that changing the frequency or the program implementation can alter the energy, power, and performance by a factor of two or more. Interestingly, some changes affect these three aspects very unevenly. ECC can greatly increase the runtime and energy consumption, but only on memory-bound codes. Compute-bound codes tend to behave quite differently from memory-bound codes, in particular regarding their power draw. On irregular programs, a small change in frequency can result in a large change in runtime and energy consumption.
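The uneven effect of frequency scaling on energy, power, and runtime follows from energy being the power-runtime product. A toy model (the wattages and runtimes below are made up, not K20c measurements) shows why lowering the core clock can cut power yet raise energy on compute-bound code, while helping memory-bound code:

```python
def energy(power_w, runtime_s):
    # Energy in joules is the power-runtime product.
    return power_w * runtime_s

# Compute-bound kernel: runtime stretches roughly with 1/f_core.
compute_hi = energy(120.0, 10.0)   # full core clock
compute_lo = energy(70.0, 20.0)    # half core clock: less power, MORE energy

# Memory-bound kernel: runtime barely moves when only the core clock drops.
memory_hi = energy(110.0, 10.0)
memory_lo = energy(75.0, 10.5)     # near-flat runtime: lower power wins

print(compute_hi, compute_lo, memory_hi, memory_lo)
```

The asymmetry between the two kernel classes is exactly the kind of behavior the paper's measurements expose.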
ISBN: (Print) 9781509036837
High-level tools for analyzing and predicting the performance of GPU-accelerated applications are scarce, at best. Although performance modeling approaches for GPUs exist, their complexity makes them virtually impossible to use to quickly analyze the performance of real-life applications and obtain easy-to-use, readable feedback. This is why, although GPUs are significant performance boosters in many HPC domains, performance prediction is still based on extensive benchmarking, and performance bottleneck analysis remains a nonsystematic, experience-driven process. In this context, we propose a tool for bottleneck analysis and performance prediction for GPU-accelerated applications. Based on random forest modeling, and using hardware performance counter data, our method can be used to quickly and accurately evaluate application performance on GPU-based systems for different problem characteristics and different hardware generations. We illustrate the benefits of our approach with three detailed use cases: a simple step-by-step example on a parallel reduction kernel, and two classical benchmarks (matrix multiplication and sequence alignment). Our results so far indicate that our statistical modeling is a quick, easy-to-use method to grasp the performance characteristics of applications running on GPUs. Our current work focuses on tackling some of its applicability limitations (more applications, more platforms) and improving its usability (full automation from input to user feedback).
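A minimal sketch of the random-forest-on-counters idea, using scikit-learn on synthetic data (the counter names, the data, and the use of `RandomForestRegressor` are assumptions for illustration, not the paper's pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training set: each row is one kernel run described by three
# hypothetical GPU counters (occupancy, DRAM throughput, issued instructions);
# the target is runtime. Real inputs would come from a profiler.
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.01, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Feature importances hint at the bottleneck: here the DRAM-throughput
# column dominates, suggesting a memory-bound kernel.
print(np.round(model.feature_importances_, 2))
```

Reading feature importances as bottleneck hints is one way such a model can yield the "readable feedback" the abstract aims for.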
ISBN: (Print) 9781509021413
Many high-performance distributed memory applications rely on point-to-point messaging using the Message Passing Interface (MPI). Due to the latency of the network, and other costs, this communication can limit the scalability of an application when run on high node counts of distributed memory supercomputers. Communication costs are further increased on modern multi- and many-core architectures, when using more than one MPI process per node, as each process sends and receives messages independently, inducing multiple latencies and contention for resources. In this paper, we use shared memory constructs available in the MPI 3.0 standard to implement an aggregated communication method to minimize the number of inter-node messages to reduce these costs. We compare the performance of this Minimal Aggregated SHared Memory (MASHM) messaging to the standard point-to-point implementation on large-scale supercomputers, where we see that MASHM leads to enhanced strong scalability of a weighted Jacobi relaxation. For this application, we also see that the use of shared memory parallelism through MASHM and MPI 3.0 can be more efficient than using Open Multi-Processing (OpenMP). We then present a model for the communication costs of MASHM which shows that this method achieves its goal of reducing latency costs while also reducing bandwidth costs. Finally, we present MASHM as an open source library to facilitate the integration of this efficient communication method into existing distributed memory applications.
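The benefit of aggregation can be captured with a simple latency-bandwidth (alpha-beta) cost sketch. The constants and the model form below are illustrative assumptions, not the paper's MASHM cost model:

```python
# Toy alpha-beta model: ALPHA is per-message latency, BETA per-byte time.
# Values are made up for illustration.
ALPHA = 1e-6   # seconds per message
BETA = 1e-9    # seconds per byte

def p2p_cost(ppn, msg_bytes):
    # Every one of the ppn processes on a node sends its own inter-node
    # message, so the latency term is paid ppn times.
    return ppn * (ALPHA + BETA * msg_bytes)

def aggregated_cost(ppn, msg_bytes):
    # Processes pack into a shared-memory buffer; one aggregated message
    # leaves the node, paying the latency term once.
    return ALPHA + BETA * (ppn * msg_bytes)

ppn, nbytes = 16, 1024
print(p2p_cost(ppn, nbytes), aggregated_cost(ppn, nbytes))
```

In this toy form the bytes moved are identical and only latency shrinks; the paper's model additionally accounts for bandwidth savings (e.g. from fewer headers and less contention).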
ISBN: (Print) 9781509036837
As the data-driven economy evolves, enterprises have come to realize a competitive advantage in being able to act on high volume, high velocity streams of data. Technologies such as distributed message queues and stream processing platforms that can scale to thousands of data stream partitions on commodity hardware have emerged in response. However, the programming API provided by these systems is often low-level, requiring substantial custom code that adds to the programmer learning curve and maintenance overhead. Additionally, these systems often lack the SQL querying capabilities that have proven popular on Big Data systems like Hive, Impala or Presto. We define a minimal set of extensions to standard SQL for data stream querying and manipulation. These extensions are prototyped in SamzaSQL, a new tool for streaming SQL that compiles streaming SQL into physical plans that are executed on Samza, an open-source distributed stream processing framework. We compare the performance of streaming SQL queries against native Samza applications and discuss usability improvements. SamzaSQL is a part of the open source Apache Samza project and will be available for general use.
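A core streaming-SQL construct is the windowed aggregation. A minimal Python sketch of what a tumbling-window `GROUP BY` count computes over a stream (the event stream and window size are made up; this is the semantics, not SamzaSQL's implementation):

```python
from collections import Counter
from itertools import islice

def tumbling_counts(stream, window_size):
    # Consume the stream in fixed-size, non-overlapping windows and emit
    # per-key counts for each window -- the effect of a streaming
    # "SELECT key, COUNT(*) ... GROUP BY key" over tumbling windows.
    it = iter(stream)
    while True:
        window = list(islice(it, window_size))
        if not window:
            return
        yield dict(Counter(window))

events = ["click", "view", "click", "view", "view", "click"]
print(list(tumbling_counts(events, window_size=3)))
# → [{'click': 2, 'view': 1}, {'view': 2, 'click': 1}]
```

The point of a streaming SQL layer is to let users write the declarative query and have the system generate this kind of per-partition windowed plan for them.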
This paper describes the use of domain decomposition methods for accelerating wave physics simulation. Numerical wave-based methods provide more accurate simulation than geometrical methods, but at a higher computational cost as well. In the context of virtual reality, the quality of the results is judged by human perception, which makes geometrical methods an interesting approach for achieving real-time physically-based rendering. Here, we investigate a geometrical method based on both beam and ray tracing, which we enhance with two levels of parallel processing. Techniques from domain decomposition methods are coupled with classical parallel computing on both shared and distributed memory. Both optical and acoustic renderings are evaluated to assess the acceleration impact of the domain decomposition scheme. Speedup measurements clearly show the efficiency of using domain decomposition methods for real-time simulation of wave physics. Copyright (C) 2015 John Wiley & Sons, Ltd.
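The domain decomposition idea can be illustrated in miniature: partition the scene along one axis so each worker only tests rays against the primitives in its subdomain. The scene, partition count, and 1-D layout below are made-up simplifications, not the paper's scheme:

```python
def subdomain_index(x, xmin, xmax, nparts):
    # Map a position to one of nparts equal-width subdomains along x.
    i = int((x - xmin) / (xmax - xmin) * nparts)
    return min(max(i, 0), nparts - 1)

# Hypothetical scene primitives: (x position, identifier).
primitives = [(0.5, "wall"), (3.2, "panel"), (7.9, "screen")]
nparts = 4

# Bucket primitives by subdomain; each bucket could then go to one worker,
# so a ray only tests primitives in the subdomains it actually crosses.
buckets = {i: [] for i in range(nparts)}
for x, name in primitives:
    buckets[subdomain_index(x, 0.0, 8.0, nparts)].append(name)

print(buckets)
```

In the full method, this spatial partitioning is what allows the shared- and distributed-memory levels of parallelism to each work on a bounded piece of the scene.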
ISBN: (Print) 9781509036837
Suzaku is a pattern programming framework that enables programmers to create pattern-based parallel MPI programs without writing the MPI message-passing code implicit in the patterns. The purpose of this framework is to simplify message-passing programming and create better structured programs based upon established parallel design patterns. The focus for developing Suzaku is on teaching parallel programming. This paper covers the main features of Suzaku and describes our experiences using it in parallel programming classes.
ISBN: (Print) 9781509036837
In this paper we develop a theory of visualizing a parallel execution through the entropy of the phase space induced by its traces. This metric is then shown, both theoretically and practically, to be able to find program issues, such as starvation of one of its threads waiting on data.
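The intuition can be sketched with Shannon entropy over observed thread states: a healthy execution phase visits varied states, while a starving phase collapses toward a single waiting state, lowering the entropy. The trace states below are hypothetical, and this is only the entropy computation, not the paper's phase-space construction:

```python
import math
from collections import Counter

def shannon_entropy(states):
    # Entropy (in bits) of the empirical distribution of observed states.
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical per-timestep states of one thread from a trace.
busy_phase = ["compute", "send", "recv", "compute", "send", "recv"]
starved_phase = ["wait", "wait", "wait", "wait", "compute", "wait"]

print(shannon_entropy(busy_phase), shannon_entropy(starved_phase))
```

A sharp drop in this quantity over a window of the trace is the kind of signal that flags the starvation issue the abstract mentions.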
ISBN: (Print) 9781509036837
Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
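The algorithm-based fault tolerance ingredient rests on checksum invariants: maintain a checksum alongside the data, and a soft error breaks the invariant, revealing where it struck. A simplified row-checksum sketch (not the paper's Hessenberg-specific scheme; the matrix and injected error are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
checksum = A.sum(axis=1)          # one checksum per row, kept alongside A

A_faulty = A.copy()
A_faulty[2, 1] += 5.0             # emulate a bit-flip-like soft error

# Re-checking the invariant localizes the corrupted row.
residual = np.abs(A_faulty.sum(axis=1) - checksum)
bad_row = int(np.argmax(residual))
print(bad_row)  # → 2
```

Combining row and column checksums additionally pins down the column, which is what lets ABFT schemes correct, not just detect, single errors on the fly.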