ISBN:
(Print) 9781728142227
New generations of high-performance computing applications depend on an increasing number of components to satisfy their growing demand for computation. On such large systems, the execution of long-running jobs is more likely to be affected by component failures. Failure classes vary from frequent transient memory faults to rather rare correlated node errors. Multilevel checkpoint/restart has been introduced to proactively cope with failures at different levels. Writing checkpoints on slower stable devices, which survive fatal failures, causes more overhead than writing them on fast devices (main memory or local SSD), which, however, only protect against light faults. Given a graph of the components of a particular storage hierarchy mapping their fault domains and their expected mean time to failure (MTTF), we optimize the checkpoint frequencies for each level of the storage hierarchy (multilevel checkpointing) to minimize the overhead and runtime of a given job. We reduce the checkpoint/restart overhead of large data-intensive jobs by up to 10 percent in the investigated cases compared to state-of-the-art multilevel checkpointing solutions. The improvement increases further with growing checkpoint sizes.
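The abstract's per-level frequency optimization builds on the classical single-level trade-off between checkpoint cost and MTTF. A minimal sketch of that baseline, using Young's first-order approximation rather than the paper's own multilevel algorithm (the example costs and MTTFs are illustrative placeholders):

```python
import math

def young_interval(checkpoint_cost, mttf):
    """Young's first-order approximation of the optimal checkpoint
    interval for a single storage level: sqrt(2 * C * MTTF)."""
    return math.sqrt(2 * checkpoint_cost * mttf)

# Fast levels (small write cost, frequent light faults) checkpoint often;
# slow stable levels (large write cost, rare fatal failures) rarely.
fast = young_interval(checkpoint_cost=10, mttf=3600)       # e.g. node RAM
stable = young_interval(checkpoint_cost=300, mttf=86400)   # e.g. parallel FS
print(round(fast), round(stable))  # 268 7200
```

Multilevel schemes generalize this by assigning each level of the storage hierarchy its own interval, which is exactly the quantity the paper optimizes over the fault-domain graph.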
ISBN:
(Print) 9781728150758
Nowadays, GPUs are known as one of the most important, most remarkable, and perhaps most popular computing platforms. In recent years, GPUs have increasingly been considered as co-processors and accelerators. As the technology advances, Graphics Processing Units (GPUs) with more advanced features and capabilities are manufactured and launched by the world's largest commercial companies. Unified memory is one of these new features introduced in the latest generations of Nvidia GPUs; it allows programmers to write a program against a uniform memory shared between CPU and GPU. This feature makes programming considerably easier. The present study introduces this new feature and its attributes. In addition, a model is proposed to predict the execution time of applications under unified-memory-style programming based on information from the non-unified implementation. The proposed model can predict the execution time of a kernel with an average accuracy of 87.60%.
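The abstract does not detail the model's form; a hypothetical first-order version of such a predictor charges the measured non-unified kernel time plus a page-migration penalty proportional to the data touched. Both the structure and the coefficient below are illustrative assumptions, not the paper's fitted model:

```python
def predict_unified_time(kernel_time, bytes_accessed,
                         migration_cost_per_byte=0.5e-9):
    """Hypothetical estimate of unified-memory run time: measured
    non-unified kernel time plus a linear page-migration penalty.
    The per-byte coefficient is an illustrative placeholder."""
    return kernel_time + bytes_accessed * migration_cost_per_byte

# A 10 ms kernel touching 64 MiB of managed memory.
print(predict_unified_time(0.010, 64 * 2**20))
```

A real model would be fitted to profiled runs; the point of the sketch is only that non-unified measurements plus data-volume information suffice as inputs.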
ISBN:
(Print) 9783030049188; 9783030049171
SHMEM has a long history as a parallel programming model. It has been extensively used since 1993, starting with Cray T3D systems. For the past two decades, the SHMEM library implementation in Cray systems has evolved through different generations. The current generation of the SHMEM implementation for Cray XC and XK systems is called Cray SHMEM. It is a proprietary SHMEM implementation from Cray Inc. In this work, we provide an in-depth analysis of the need for a new SHMEM implementation and then introduce the next evolution of the Cray SHMEM implementation for current and future generation Cray systems. We call this new implementation Cray OpenSHMEMX. We provide a brief design overview, along with a review of functional and performance differences in Cray OpenSHMEMX compared with the existing Cray SHMEM implementation.
ISBN:
(Print) 9781728129334; 9781728129327
We propose an algorithm for soft-body simulation that is fully parallel and has linear time complexity, addressing three principal issues: visual quality, performance, and ease of use. It works using precomputed collision-result look-up data and the basic approach of shape matching. Since the data-driven shape-matching approach only uses user-generated precomputed collision results, deformation results cannot be unexpected. This ensures visual quality and improves ease of use. The use of these look-up data also opens ways to improve performance. In our tests, we achieved speedup that scales linearly with the processor's core count.
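For readers unfamiliar with the underlying shape-matching idea, a minimal 2-D sketch follows: find the best rigid rotation and translation of the rest shape onto the current deformed particles and return the goal positions the simulation pulls towards. This is the classical Mueller-style formulation, not the paper's precomputed look-up variant, and every particle's goal is independent, which is what makes the method embarrassingly parallel:

```python
import math

def shape_match_goals(rest, deformed):
    """2-D shape matching sketch: best-fit rigid transform of the rest
    shape onto the deformed particles, returning per-particle goals."""
    n = len(rest)
    cx_r = sum(x for x, _ in rest) / n;  cy_r = sum(y for _, y in rest) / n
    cx_d = sum(x for x, _ in deformed) / n;  cy_d = sum(y for _, y in deformed) / n
    # 2x2 covariance A = sum (p_i - cm_d)(q_i - cm_r)^T
    a = b = c = d = 0.0
    for (qx, qy), (px, py) in zip(rest, deformed):
        rx, ry = qx - cx_r, qy - cy_r
        dx, dy = px - cx_d, py - cy_d
        a += dx * rx; b += dx * ry; c += dy * rx; d += dy * ry
    # Closest rotation to A in 2-D maximizes cos(t)*(a+d) + sin(t)*(c-b).
    theta = math.atan2(c - b, a + d)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cos_t * (qx - cx_r) - sin_t * (qy - cy_r) + cx_d,
             sin_t * (qx - cx_r) + cos_t * (qy - cy_r) + cy_d)
            for qx, qy in rest]

# A rigid 90-degree rotation plus translation of a unit square is
# matched exactly: the goals coincide with the deformed positions.
rest = [(0, 0), (1, 0), (1, 1), (0, 1)]
deformed = [(2, 0), (2, 1), (1, 1), (1, 0)]
print(shape_match_goals(rest, deformed))
```

The paper replaces parts of this runtime computation with user-generated look-up data, which is what bounds the deformation results.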
ISBN:
(Print) 9781450368131
Dataflow execution models are used to build highly scalable parallel systems. A programming model that targets parallel dataflow execution must answer the following question: How can parallelism between two dependent nodes in a dataflow graph be exploited? This is difficult when the dataflow language or programming model is implemented by a monad, as is common in the functional community, since expressing dependence between nodes by a monadic bind suggests sequential execution. Even in monadic constructs that explicitly separate state from computation, problems arise due to the need to reason about opaquely defined state. Specifically, when abstractions of the chosen programming model do not enable adequate reasoning about state, it is difficult to detect parallelism between composed stateful computations. In this paper, we propose a programming model that enables the composition of stateful computations and still exposes opportunities for parallelization. We also introduce smap, a higher-order function that can exploit parallelism in stateful computations. We present an implementation of our programming model and smap in Haskell and show that basic concepts from functional reactive programming can be built on top of our programming model with little effort. We compare these implementations to a state-of-the-art approach using monad-par and LVars to expose parallelism explicitly and reach the same level of performance, showing that our programming model successfully extracts parallelism that is present in an algorithm. Further evaluation shows that smap is expressive enough to implement parallel reductions, and our programming model resolves shortcomings of the stream-based programming model of current state-of-the-art big data processing systems.
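To make the smap idea concrete, here is a deliberately sequential sketch of a stateful map in Python rather than the paper's Haskell: the state transition is explicit data threaded through the stream, not hidden behind an opaque monadic bind, so a runtime could pipeline independent stages. This illustrates the interface only, not the paper's implementation:

```python
def smap(stateful_fn, state, xs):
    """Map a stateful computation over a stream, threading the state
    explicitly. Because each step's state transition is visible, a
    dataflow runtime can reason about it and pipeline the stages."""
    out = []
    for x in xs:
        y, state = stateful_fn(x, state)
        out.append(y)
    return out, state

# A running sum as the stateful computation: emits the total so far.
running_sum = lambda x, acc: (acc + x, acc + x)
ys, final = smap(running_sum, 0, [1, 2, 3, 4])
print(ys, final)  # [1, 3, 6, 10] 10
```

The point of the paper is precisely that such compositions, written monadically, look sequential; exposing the state flow is what recovers the parallelism.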
ISBN:
(Print) 9783030178727; 9783030178710
Taking advantage of the growing number of cores in supercomputers to increase the scalability of parallel programs is an increasing challenge. Many advanced profiling tools have been developed to assist programmers in the process of analyzing data related to the execution of their program. Programmers can act upon the information generated by these data and make their programs reach higher performance levels. However, the information provided by profiling tools is generally designed to optimize the program for a specific execution environment, with a target number of cores and a target problem size. A code optimization driven towards scalability rather than specific performance requires the analysis of many distinct execution environments instead of details about a single environment. With the goal of providing more useful information for the analysis and optimization of code for parallel scalability, this work introduces the PaScal Viewer tool. It presents a novel and productive way to visualize scalability trends of parallel programs. It consists of four diagrams that offer visual support to identify parallel efficiency trends of the whole program, or parts of it, when running in scaling parallel environments with scaling problem sizes.
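The cell values such scalability diagrams are built from are parallel efficiencies. A minimal sketch of the computation over a grid of core counts and problem sizes (the timings below are illustrative, not measurements from the paper):

```python
def efficiency(t_serial, t_parallel, cores):
    """Parallel efficiency E = T1 / (p * Tp)."""
    return t_serial / (cores * t_parallel)

# Efficiency grid over scaling core counts and problem sizes,
# the kind of data a four-diagram scalability view is drawn from.
t1 = {"small": 100.0, "large": 800.0}
tp = {"small": {2: 52.0, 4: 28.0}, "large": {2: 410.0, 4: 210.0}}
for size in t1:
    for p in (2, 4):
        print(size, p, round(efficiency(t1[size], tp[size][p], p), 2))
```

Trends across rows (growing core count) and columns (growing problem size) of such a grid are exactly what distinguishes scalability analysis from single-environment profiling.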
ISBN:
(Print) 9783030105495; 9783030105488
Stream processing applications have become a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers often face a trade-off between coding productivity and performance when introducing parallelism. SPar was created to balance this trade-off for application programmers by using the C++11 attributes' annotation mechanism. In SPar and other programming frameworks for stream processing applications, the manual definition of the number of replicas to be used for the stream operators is a challenge. In addition, low latency is required by several stream processing applications. We noted that explicit latency requirements are poorly considered in state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints. We experimentally evaluated the implemented strategy on a real-world application, demonstrating that our adaptive approach can provide higher abstraction levels while automatically managing the latency.
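The core idea, adjusting the replica count from observed latency against a target, can be sketched as a simple additive control loop. The thresholds and step size here are illustrative assumptions, not SPar's actual strategy:

```python
def adapt_replicas(replicas, latency, target, min_r=1, max_r=16):
    """One step of a simple control loop: add a replica when observed
    latency exceeds the target, remove one when there is ample slack."""
    if latency > target and replicas < max_r:
        return replicas + 1            # falling behind: scale out
    if latency < 0.5 * target and replicas > min_r:
        return replicas - 1            # ample slack: save resources
    return replicas

# Against a 100 ms latency target, starting from 2 replicas.
r = 2
for observed_ms in [120, 150, 90, 30, 28]:
    r = adapt_replicas(r, observed_ms, target=100)
print(r)  # 2
```

A production strategy would also damp oscillation and account for the cost of reconfiguring running operators, which is part of what makes the problem worth an autonomic solution.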
ISBN:
(Print) 9781728159751
Peachy parallel assignments are high-quality assignments for teaching parallel and distributed computing. They have been successfully used in class and are selected on the basis of their suitability for adoption and for being cool and inspirational for students. Here we present a fire fighting simulation, thread-to-core mapping on NUMA nodes, introductory cloud computing, interesting variations on prefix-sum, searching for a lost PIN, and Big Data analytics.
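Of the assignments listed, the prefix-sum variations build on a standard parallel scan scheme. A sketch of the Hillis-Steele doubling approach, written sequentially here since each round is embarrassingly parallel (this is the generic textbook algorithm, not one of the specific assignment variations):

```python
def inclusive_scan(xs):
    """Hillis-Steele inclusive prefix sum: ceil(log2(n)) rounds in which
    element i adds the value at distance 2^k to its left."""
    out = list(xs)
    step = 1
    while step < len(out):
        out = [out[i] + (out[i - step] if i >= step else 0)
               for i in range(len(out))]
        step *= 2
    return out

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [3, 4, 11, 11, 15, 16, 22, 25]
```

Typical assignment variations swap the operator (max, segmented sums) or compare this O(n log n)-work scheme against the work-efficient Blelloch scan.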
ISBN:
(Digital) 9783030239763
ISBN:
(Print) 9783030239763; 9783030239756
We propose a stock market software architecture extended by a graphics processing unit, which employs parallel programming paradigm techniques to optimize long-running tasks like computing daily trends and performing statistical analysis of stock market data in real time. The system uses the ability of Nvidia's CUDA parallel computing application programming interface (API) to integrate with traditional web development frameworks. The web application offers extensive statistics and stock information, which is periodically recomputed through scheduled batch jobs or calculated in real time. To illustrate the advantages of using many-core programming, we explore several use cases and evaluate the improvement in performance and speedup obtained in comparison to the traditional approach of executing long-running jobs on a central processing unit (CPU).
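A daily-trend computation of the kind such a system offloads is, for example, a sliding-window mean over closing prices. The CPU baseline below shows why the task maps well to a GPU: each output element is independent, so in CUDA each would become one thread. The window size and prices are illustrative, not from the paper:

```python
def simple_moving_average(prices, window):
    """Sliding-window mean over a price series; each output value is
    independent of the others, making the loop trivially parallel."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

print(simple_moving_average([10, 11, 12, 11, 13, 14], 3))
```

Speedup claims in such systems come from running many of these independent window computations, across thousands of symbols, concurrently on the device.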
ISBN:
(Print) 9781728116440
Powerlists are recursive data structures that, together with their associated algebraic theories, offer both a methodology to design parallel algorithms and parallel programming abstractions to ease the development of parallel applications. This has also been proved by a concrete development of such a framework that allows easy, efficient, and reliable implementation of Java parallel programs on shared-memory systems. The paper presents a highly scalable version of this framework, extending it to distributed-memory systems based on an MPI implementation. Through this extension we may use the framework to develop Java parallel programs also on distributed-memory systems such as clusters. The design of the framework enables flexibility in defining the appropriate execution type depending on the execution system and its characteristics. Therefore, it is possible to choose MPI execution (which can also be combined with multithreading) if the available system includes an MPI platform, or simple multithreaded execution. Examples are given and performance experiments are conducted. The performance analysis of these applications emphasises the utility and the efficiency of this framework extension.
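The powerlist structure the framework exploits is easy to state: a list of length 2^k is either a singleton or the tie u | v of two equal halves. A sketch of a reduction written against that structure, in Python rather than the framework's Java, with the two recursive calls being exactly what the framework evaluates in parallel (threads on shared memory, MPI ranks across nodes):

```python
def powerlist_sum(xs):
    """Powerlist-style reduction: recurse on the two equal halves of a
    length-2^k list; the halves are independent and thus parallel."""
    assert len(xs) & (len(xs) - 1) == 0 and xs, "powerlists have length 2^k"
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    # xs = u | v (tie); both halves can be reduced concurrently.
    return powerlist_sum(xs[:mid]) + powerlist_sum(xs[mid:])

print(powerlist_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

The same deconstruction (tie, or the interleaving zip) underlies the powerlist derivations of scans and the FFT, which is what makes the algebra a design methodology and not just a data type.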