ISBN (Print): 9781450347556
For many years GPUs have been components of HPC clusters (Titan and Piz Daint), while only in recent years has the Intel® Xeon Phi™ been included (Tianhe-2 and Stampede). For example, GPUs are in 14% of systems in the November 2015 Top500 list, while the Xeon Phi™ is in 6%. Intel® came out with the Xeon Phi™ to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive and easier-to-utilize processing power with less energy consumption. Maximum Xeon Phi™ execution-time performance requires that programs have high data parallelism and good scalability, and use parallel algorithms. Moreover, improved Phi™ power performance and throughput can be achieved by reducing the number of cores employed for application execution. Accordingly, the objectives of this paper are to: (1) Demonstrate that some applications can be executed with fewer cores than are available to users with a negligible impact on execution time: for 59.3% of the 27 application instances studied, doing this results in better performance, and for 37%, using less than half of the available cores results in a performance degradation of not more than 10% in the worst case. (2) Develop a tool that provides the user with the optimal number of cores to employ: we designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that for a given application provides the user with an estimate of this number. (3) Understand whether performance metrics can be used to identify applications that can be executed with fewer cores with a negligible impact on execution time: via statistical analysis, we identified the following three metrics that are indicative of this, at least for the application instances studied: a low L1 compute-to-data-access ratio, i.e., the average number of computations performed per byte of data loaded/stored in the L1 cache, high use of data bandwidth, and, to a lesser extent, ...
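The Periscope plugin itself is not shown in the abstract; as a minimal sketch of the idea it automates, the following C++/OpenMP program times a toy workload at several core counts and reports the smallest count whose runtime stays within 10% of the best observed. The workload, the 10% tolerance, and the power-of-two sweep are illustrative assumptions, not the paper's algorithm.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

// Time one run of the workload with a given thread count.
static double run_once(int nthreads, std::vector<double>& a) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(nthreads)
    for (long i = 0; i < (long)a.size(); ++i)
        a[i] = a[i] * 1.000001 + 0.5;           // stand-in for the real application
    return omp_get_wtime() - t0;
}

int main() {
    std::vector<double> a(1 << 24, 1.0);
    const int max_t = omp_get_max_threads();
    std::vector<int> counts;
    for (int t = 1; t < max_t; t *= 2) counts.push_back(t);
    counts.push_back(max_t);                     // always sample the full machine
    std::vector<double> times(counts.size());
    double best = 1e30;
    for (std::size_t k = 0; k < counts.size(); ++k) {
        times[k] = run_once(counts[k], a);
        if (times[k] < best) best = times[k];
    }
    // Report the smallest core count within 10% of the best time.
    for (std::size_t k = 0; k < counts.size(); ++k)
        if (times[k] <= 1.10 * best) {
            std::printf("suggested cores: %d (%.3f s vs. best %.3f s)\n",
                        counts[k], times[k], best);
            break;
        }
}
```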
ISBN (Print): 9781509052141
Chapel supports distributed computing with an underlying PGAS memory address space. While it provides abstractions for writing simple and elegant distributed code, the type system currently lacks a notion of locality, i.e., a description of an object's access behavior in relation to its actual location. This often necessitates programmer intervention to avoid redundant non-local data access. Moreover, due to insufficient locality information, the compiler ends up using "wide" pointers, which can point to non-local data, for objects referenced in an otherwise completely local manner, adding to the runtime overhead. In this work we describe CoMD-Chapel, our distributed Chapel implementation of the CoMD benchmark. We demonstrate that optimizing data access through replication and localization is crucial for achieving performance comparable to the reference implementation. We discuss limitations of existing scope-based locality optimizations and argue instead for a more general (and robust) type-based approach. Lastly, we also evaluate code performance and scaling characteristics. The fully optimized version of CoMD-Chapel performs to within 62%-87% of the reference implementation.
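The paper's fix is Chapel-specific (eliminating wide pointers from local accesses), but the replicate-then-compute pattern is general. A C++ sketch of the contrast, where `remote_fetch` is a hypothetical stand-in for a PGAS non-local read:

```cpp
#include <vector>
#include <cstdio>

// Hypothetical stand-in for a PGAS non-local read (a wide-pointer dereference
// in Chapel terms); imagine a network round-trip inside.
static double remote_fetch(const std::vector<double>& remote, std::size_t i) {
    return remote[i];
}

int main() {
    std::vector<double> remote(1 << 20, 1.0);

    // Unoptimized: every iteration goes through the possibly-remote accessor.
    double acc1 = 0.0;
    for (std::size_t i = 0; i < remote.size(); ++i)
        acc1 += remote_fetch(remote, i);

    // Replication/localization: one bulk transfer up front, then the hot loop
    // touches only local memory.
    std::vector<double> local(remote);           // stands in for a bulk GET
    double acc2 = 0.0;
    for (std::size_t i = 0; i < local.size(); ++i)
        acc2 += local[i];

    std::printf("%f %f\n", acc1, acc2);
}
```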
ISBN (Digital): 9789811001291
ISBN (Print): 9789811001291; 9789811001277
Over the past few decades, load flow algorithms for radial distribution networks have been an area of interest for researchers, leading to improvements in both the approach to and the results for the problem. Different procedures and algorithms have been pursued to enhance performance in terms of simplicity of implementation, execution time, and memory requirements. This paper discusses the implementation of a load flow algorithm for a radial distribution network using the CUDA parallel programming architecture. The computations involved in the serial algorithm, for load currents, branch impedances, etc., have been parallelized using the CUDA programming model. The end result is an improvement in the execution time of the algorithm compared to its running time on the CPU. Finally, a comparison is drawn between the serial and parallel approaches, showing an improvement in execution time for the functions involved in the computations.
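The paper maps per-bus computations to CUDA threads; as a sketch of the same data parallelism (one thread per bus), here is the load-current step I_i = conj(S_i / V_i) written with OpenMP in C++. The network data and sizes are made up for illustration.

```cpp
#include <complex>
#include <vector>
#include <cstdio>

int main() {
    const int nbus = 1 << 16;
    std::vector<std::complex<double>> S(nbus, {0.5, 0.2});   // bus power demands (p.u.)
    std::vector<std::complex<double>> V(nbus, {1.0, 0.0});   // bus voltages (p.u.)
    std::vector<std::complex<double>> I(nbus);
    // Each bus current depends only on its own S and V, so the loop is
    // embarrassingly parallel -- exactly what a CUDA kernel would exploit.
    #pragma omp parallel for
    for (int i = 0; i < nbus; ++i)
        I[i] = std::conj(S[i] / V[i]);
    std::printf("I[0] = (%f, %f)\n", I[0].real(), I[0].imag());
}
```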
ISBN (Print): 9781467387767
In this paper we present a novel approach for functional-style programming of distributed-memory clusters, targeting data-centric applications. The proposed programming model is purely sequential, SPMD-free, and based on the high-level functional features introduced since the C++11 specification. Additionally, we propose a novel cluster-as-accelerator design principle, in which cluster nodes act as general interpreters of user-defined functional tasks over node-local portions of distributed data structures. We envision coupling a simple yet powerful programming model with a lightweight, locality-aware distributed runtime as a promising step along the road towards high-performance data analytics, in particular under the perspective of the upcoming exascale era. We implemented the proposed approach in SkeDaTo, a prototype C++ library of data-parallel skeletons that exploits cluster-as-accelerator at the bottom layer of the runtime software stack.
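SkeDaTo's API is not given in the abstract; as a toy illustration of the C++11 functional style it builds on, here is a node-local `map` skeleton applied to one node's partition of a distributed structure. All names are assumptions.

```cpp
#include <algorithm>
#include <vector>
#include <cstdio>

// Apply a user-defined function element-wise to this node's local partition.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& partition, F f) {
    std::vector<T> out(partition.size());
    std::transform(partition.begin(), partition.end(), out.begin(), f);
    return out;
}

int main() {
    std::vector<int> local_partition{1, 2, 3, 4};            // this node's data
    auto doubled = map_skeleton(local_partition, [](int x) { return 2 * x; });
    for (int v : doubled) std::printf("%d ", v);
    std::printf("\n");
}
```

In the cluster-as-accelerator scheme, each node would run such an interpreter over its own portion of the distributed data, with the sequential driver program dispatching the functional tasks.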
ISBN (Print): 9781509012244
Over the last two decades, researchers have developed many software, hardware, and hybrid Transactional Memories (TMs) with various APIs and semantics. However, reduced performance under high-contention loads is still the major downside of all TMs. Although many strategies and methods have been proposed, contention management and transaction scheduling remain open areas of research. An important unsolved piece of the contention-management puzzle is plausible estimation of transaction execution times. In this paper we propose two methods for estimating transaction execution times: one based on the log-normal distribution and one based on the gamma distribution. Experimental results presented in this paper indicate that the log-normal method has better estimation accuracy than the gamma method. Even more importantly, the log-normal method uses sliding windows that are 10 times shorter, and its complexity is much lower than that of the gamma method, so it is faster and requires less electrical power.
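A minimal sketch of the log-normal idea (not the paper's exact method): keep a sliding window of observed transaction times, fit the mean and variance in the log domain, and predict the expected next time as exp(mu + sigma^2/2), the mean of a log-normal distribution. The window size and sample data below are illustrative.

```cpp
#include <cmath>
#include <deque>
#include <cstdio>

class LogNormalEstimator {
    std::deque<double> window_;   // log-domain samples
    std::size_t cap_;
public:
    explicit LogNormalEstimator(std::size_t cap) : cap_(cap) {}
    void observe(double t) {
        window_.push_back(std::log(t));
        if (window_.size() > cap_) window_.pop_front();
    }
    double estimate() const {
        double mu = 0.0, var = 0.0;
        for (double x : window_) mu += x;
        mu /= window_.size();
        for (double x : window_) var += (x - mu) * (x - mu);
        var /= window_.size();
        return std::exp(mu + 0.5 * var);   // mean of a log-normal distribution
    }
};

int main() {
    LogNormalEstimator est(16);            // short window, as the paper favors
    for (double t : {1.2, 0.9, 1.5, 1.1, 2.0}) est.observe(t);
    std::printf("estimated time: %f\n", est.estimate());
}
```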
ISBN (Print): 9781509013227
The article presents a method of measuring the energy consumption of an NVIDIA graphics processing unit, and relates energy consumption to the number of operating units. The architecture of the graphics processing unit is considered, as well as the method of measuring GPU energy consumption. The experiment is based on matrix multiplication. Brief results and the dependence of computation time on the number of computing elements are also demonstrated. A simple way to understand the difference between a CPU and a GPU is to compare how they process tasks: the CPU consists of a few cores optimized for sequential serial processing, while the GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.
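The article does not name its measurement interface; one common option is NVIDIA's NVML library, sketched here. Power (reported in milliwatts) is sampled while the GPU workload runs elsewhere, and energy is integrated as power times the sampling interval. The 10 ms period and 5 s window are assumptions; compile with -lnvidia-ml.

```cpp
#include <nvml.h>
#include <chrono>
#include <thread>
#include <cstdio>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    double joules = 0.0;
    const double dt = 0.010;                       // 10 ms sampling period
    for (int i = 0; i < 500; ++i) {                // ~5 s measurement window
        unsigned int mw = 0;
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            joules += (mw / 1000.0) * dt;          // mW -> W, then W x s = J
        std::this_thread::sleep_for(std::chrono::duration<double>(dt));
    }
    std::printf("estimated energy: %.2f J\n", joules);
    nvmlShutdown();
}
```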
ISBN (Print): 9783319321493; 9783319321486
In our previous paper [17], a parallel realization of Restricted Boltzmann Machines (RBMs) was discussed. That research confirmed the potential usefulness of the Intel MIC parallel architecture for implementing RBMs. In this work, we investigate how the Intel MIC and Intel CPU architectures can be applied to implement the complete learning process using Deep Belief Networks (DBNs), whose layers correspond to RBMs. The learning procedure is based on the matrix approach, where learning samples are grouped into packages and represented as matrices. This approach is applied to both the initial learning and fine-tuning stages. The influence of the package size on the accuracy of learning, as well as on the performance of the computations, is studied using conventional CPU and Intel Xeon Phi architectures.
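A sketch of the matrix ("package") idea: B learning samples are stacked into a B x V matrix so the hidden-layer activation of an RBM becomes one dense product, H = sigmoid(X * W), which vectorizes well on Xeon Phi. The sizes and the plain triple loop are illustrative; a real implementation would call an optimized GEMM.

```cpp
#include <cmath>
#include <vector>
#include <cstdio>

int main() {
    const int B = 4, V = 3, Hn = 2;                       // package, visible, hidden
    std::vector<double> X(B * V, 0.5);                    // one package of samples
    std::vector<double> W(V * Hn, 0.1);                   // RBM weights
    std::vector<double> H(B * Hn, 0.0);                   // hidden activations
    // H = sigmoid(X * W): the whole package is processed in one matrix product.
    for (int b = 0; b < B; ++b)
        for (int h = 0; h < Hn; ++h) {
            double s = 0.0;
            for (int v = 0; v < V; ++v) s += X[b * V + v] * W[v * Hn + h];
            H[b * Hn + h] = 1.0 / (1.0 + std::exp(-s));   // logistic activation
        }
    std::printf("H[0][0] = %f\n", H[0]);
}
```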
ISBN (Print): 9781450342056
In this paper, we propose a new technique to recommend to programmers high-quality parallel code that is similar to a given sequential code. This is done by transforming well-grounded parallel code A into its sequential equivalent B and storing the pair (A->B) in a database; then, given a sequential code C, we search the database for syntactically or semantically similar code B and retrieve its parallel version A, which can be used as a replacement or reference for the original code C. We also outline our solutions towards realizing this technique and present a preliminary study that shows promising results.
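A toy sketch of the A->B lookup described above: parallel snippets are keyed by a crude normalization of their sequential equivalents, and a query sequential snippet retrieves the stored parallel version on a match. Real syntactic/semantic similarity search would be far more involved; everything here is a stand-in.

```cpp
#include <map>
#include <string>
#include <cctype>
#include <cstdio>

// Collapse whitespace so trivially reformatted code still matches (a crude
// stand-in for real similarity matching).
static std::string normalize(const std::string& code) {
    std::string out;
    for (char c : code)
        if (!std::isspace(static_cast<unsigned char>(c))) out += c;
    return out;
}

int main() {
    std::map<std::string, std::string> db;   // sequential B -> parallel A
    db[normalize("for(i=0;i<n;i++) a[i]=b[i]+c[i];")] =
        "#pragma omp parallel for\nfor(i=0;i<n;i++) a[i]=b[i]+c[i];";
    // A query sequential code C, differing only in formatting.
    std::string query = "for (i = 0; i < n; i++) a[i] = b[i] + c[i];";
    auto it = db.find(normalize(query));
    std::printf("%s\n", it != db.end() ? it->second.c_str() : "no match");
}
```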
ISBN (Print): 9781467388153
Many real-world systems and networks are modeled and analyzed using various random graph models. These models must incorporate relevant properties such as degree distribution and clustering coefficient. Many models, such as the Chung-Lu (CL), stochastic Kronecker, stochastic block model (SBM), and block two-level Erdős–Rényi (BTER) models, have been devised to capture those properties. However, the generative algorithms for these models are mostly sequential and take a prohibitively long time to generate large-scale graphs. In this paper, we present a novel time- and space-efficient algorithmic method to generate random graphs using the CL, BTER, and SBM models. First, we present an efficient sequential algorithm and an efficient distributed-memory parallel algorithm for the CL model. Our sequential algorithm takes O(m) time and O(Λ) space, where m and Λ are the number of edges and distinct degrees, respectively, and our parallel algorithm takes O(m/P + Λ + P) time w.h.p. and O(Λ) space using P processors. These algorithms are almost time-optimal, since any sequential and parallel algorithm needs at least O(m) and O(m/P) time, respectively. Our algorithms outperform the best previously known algorithms by a significant margin in terms of both time and space. Experimental results on various large-scale networks show that both our sequential and parallel algorithms require 400-15000 times less memory than the existing sequential and parallel algorithms, respectively, making them suitable for generating very large-scale networks. Moreover, both of our algorithms are about 3-4 times faster than the existing sequential and parallel algorithms. Finally, we show how our algorithmic method also leads to efficient parallel and sequential algorithms for the SBM and BTER models.
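For reference only, here is the textbook O(n^2) Chung-Lu generator, where edge (u,v) appears with probability min(1, w_u * w_v / S) for expected degrees w and their sum S. The paper's contribution is an O(m)-time, O(Λ)-space algorithm; this naive loop merely pins down the model being generated, with made-up weights.

```cpp
#include <algorithm>
#include <random>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> w{3.0, 2.0, 2.0, 1.0};             // expected degrees
    double S = 0.0;
    for (double x : w) S += x;                             // sum of weights
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    int edges = 0;
    for (std::size_t u = 0; u < w.size(); ++u)
        for (std::size_t v = u + 1; v < w.size(); ++v) {
            double p = std::min(1.0, w[u] * w[v] / S);     // CL edge probability
            if (U(gen) < p) { std::printf("%zu %zu\n", u, v); ++edges; }
        }
    std::printf("edges: %d\n", edges);
}
```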
ISBN (Print): 9781509019878
Multi-core processors are very common in the form of dual-core and quad-core processors. To take advantage of multiple cores, parallel programs must be written. Existing legacy applications are sequential and, even when run on multi-core machines, utilize only one core. Such applications should be either rewritten or parallelized to make efficient use of multiple cores. Manual parallelization requires huge effort in terms of time and money, hence the need for automatic parallelization. The Automatic Code Parallelizer using OpenMP automates the insertion of compiler directives to facilitate parallel processing on multi-core shared-memory machines. The proposed tool converts an input sequential C source code into multi-threaded parallel C source code, and supports multi-level parallelization with the generation of nested OpenMP constructs. The proposed scheme statically decomposes a sequential C program into coarse-grained tasks, analyzes the dependencies among tasks, and generates OpenMP parallel code. The focus is on coarse-grained task parallelism to improve performance beyond the limits of loop parallelism. Owing to the broad support for the OpenMP standard, the generated OpenMP code can run on a wide range of SMP machines and may yield a performance improvement.
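An illustration of the kind of transformation such a tool performs: a dependence-free sequential loop and the OpenMP directive an automatic parallelizer would insert once its analysis shows the iterations are independent. The loop itself is a made-up example, not output of the paper's tool.

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    static double a[1000000], b[1000000];
    for (int i = 0; i < n; ++i) b[i] = i;      // sequential setup

    // After dependence analysis shows the iterations are independent, the
    // tool can emit the directive below; each thread processes a chunk.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i] + 1.0;

    std::printf("a[10] = %f (threads available: %d)\n",
                a[10], omp_get_max_threads());
}
```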