ISBN (print): 9781538637906
On-demand hardware resource provisioning is an efficient way to save energy in traditional data centers. However, when workloads burst and exceed the capacity of the provisioned resources, a temporary capacity deficit arises, because increasing the quantity of resources takes time; the result is performance degradation. To alleviate this problem, this paper proposes a peak load regulation method that improves the QoS of workloads in traditional energy-efficient DCs. In this method, overloaded workloads (peak loads) are regulated to improve the response time of critical requests and increase the number of QoS-guaranteed requests. Experimental results show that with this method the energy consumption of the data center can be reduced by about 25% compared with the baseline. Moreover, the method significantly improves the QoS of workloads.
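The regulation idea can be sketched as a simple admission policy: when requests exceed the provisioned capacity, critical requests are admitted first and the overflow is deferred. This is a toy illustration; the paper's actual regulation method and its parameters are not reproduced here, so the priority scheme below is an assumption.

```python
def regulate(requests, capacity):
    """Admit at most `capacity` requests in the current slot, critical
    requests first; the overflow (the peak load) is deferred. A toy
    admission policy -- an assumption, not the paper's actual method."""
    admitted, deferred = [], []
    for req in sorted(requests, key=lambda r: r["priority"]):  # 0 = critical
        (admitted if len(admitted) < capacity else deferred).append(req)
    return admitted, deferred
```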
ISBN (print): 9781728174457
Many parallel scientific applications spend a significant amount of time reading and writing data files. Collective I/O operations make it possible to optimize the file access of a process group by redistributing data across processes to match the data layout on the file system. In most parallel I/O libraries, collective I/O operations are implemented with the two-phase I/O algorithm, which consists of a communication phase and a file access phase. This paper evaluates various design options for overlapping the two internal cycles of the two-phase I/O algorithm, and explores different data transfer primitives for the shuffle phase, including non-blocking two-sided communication and multiple versions of one-sided communication. The results indicate that overlap algorithms incorporating asynchronous I/O outperform overlapping approaches that rely only on non-blocking communication. In the vast majority of test cases, however, one-sided communication did not lead to performance improvements over two-sided communication.
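A minimal sketch of the two-phase idea, with Python dictionaries standing in for message buffers and a list standing in for the file (an illustrative simplification under assumed data layouts, not any library's actual implementation):

```python
def two_phase_write(local_data, nprocs, total_len):
    """Two-phase collective I/O sketch.
    Phase 1 (communication): each value is routed to the aggregator
    process that owns its contiguous file region.
    Phase 2 (file access): each aggregator writes its region with one
    sequential access instead of many small strided ones."""
    region = total_len // nprocs
    agg_bufs = [dict() for _ in range(nprocs)]
    for rank in range(nprocs):                       # shuffle phase
        for offset, value in local_data[rank].items():
            agg_bufs[offset // region][offset] = value
    out_file = [None] * total_len
    for buf in agg_bufs:                             # file access phase
        for offset in sorted(buf):
            out_file[offset] = buf[offset]
    return out_file
```

The paper's question is how much of phase 1 can be hidden behind phase 2 when both are made asynchronous.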
ISBN (print): 9780769561493
Today's supercomputers are moving towards the deployment of many-core processors such as the Intel Xeon Phi Knights Landing (KNL) to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high-bandwidth but low-capacity in-package high bandwidth memory (HBM) in addition to the high-capacity but low-bandwidth DDR4. Other architectures, such as Nvidia's Pascal GPU, expose similar stacked DRAM. On nodes with heterogeneous memory types, efficient allocation and data movement can yield improved performance and energy savings if data requests are served from the high bandwidth memory. In this paper, we propose a memory-heterogeneity-aware runtime system that guides data prefetch and eviction so that data can be accessed at high bandwidth even for applications whose entire working set does not fit within the high bandwidth memory and whose data must therefore be moved among memory types. We implement a runtime-managed data movement mechanism that allows applications to run efficiently on architectures with a heterogeneous memory hierarchy with trivial code changes. We show up to 2x improvement in execution time for Stencil3D and Matrix Multiplication, two important HPC kernels.
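The staging-and-eviction policy can be caricatured as a small software-managed cache. This is a hypothetical simplification: the real runtime system prefetches ahead of use and manages HBM versus DDR4 allocations, not Python dictionaries.

```python
from collections import OrderedDict

class FastMemManager:
    """Toy runtime that stages data blocks into a small 'fast memory'
    on access and evicts the least-recently-used block when capacity
    is exceeded (illustrative stand-in for HBM management)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.fast = OrderedDict()          # block_id -> data, LRU order
        self.evictions = 0

    def access(self, block_id, slow_mem):
        if block_id in self.fast:          # served from fast memory
            self.fast.move_to_end(block_id)
        else:                              # stage in from slow memory
            if len(self.fast) >= self.capacity:
                self.fast.popitem(last=False)   # evict LRU block
                self.evictions += 1
            self.fast[block_id] = slow_mem[block_id]
        return self.fast[block_id]
```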
ISBN (print): 9780769561493
The subgraph enumeration problem asks us to find all subgraphs of a target graph that are isomorphic to a given pattern graph. Determining whether even one such isomorphic subgraph exists is NP-complete, and therefore finding all such subgraphs (if they exist) is a time-consuming task. Subgraph enumeration has applications in many fields, including biochemistry and social networks; interestingly, the fastest algorithms for solving the problem on biochemical inputs are sequential. Since they depend on depth-first tree traversal, an efficient parallelization is far from trivial. Nevertheless, since important applications produce data sets of increasing difficulty, parallelism seems beneficial. We thus present a shared-memory parallelization of the state-of-the-art subgraph enumeration algorithms RI and RI-DS (a variant of RI for dense graphs) by Bonnici et al. [BMC Bioinformatics, 2013]. Our strategy uses work stealing, and our implementation demonstrates a significant speedup on real-world biochemical data despite a highly irregular data access pattern. We also improve RI-DS by pruning the search space better; this further improves the empirical running times compared with the already highly tuned RI-DS.
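The depth-first search underlying this family of algorithms can be sketched as plain backtracking (a simplified assumption-laden sketch: RI additionally orders pattern nodes to prune earlier, and the paper distributes the resulting irregular search tree with work stealing):

```python
def enumerate_subgraphs(pattern, target):
    """Enumerate all mappings of `pattern` onto subgraphs of `target`
    (graphs given as adjacency dicts: node -> set of neighbors)."""
    order = list(pattern)          # fixed matching order of pattern nodes
    results = []

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        u = order[len(mapping)]
        for v in target:
            if v in mapping.values():
                continue
            # every already-mapped pattern neighbor of u must map to a
            # target neighbor of v, otherwise this branch is pruned
            if all(mapping[w] in target[v] for w in pattern[u] if w in mapping):
                mapping[u] = v
                extend(mapping)
                del mapping[u]

    extend({})
    return results
```

The subtrees spawned at each `extend` call vary wildly in size, which is exactly why static partitioning fails and work stealing is needed.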
ISBN (print): 9781424416936
Significant performance gains have been reported by exploiting the specialized characteristics of hybrid computing architectures for a number of streaming applications. While it is straightforward to physically construct these hybrid systems, application development is often quite difficult. We have built an application development environment, Auto-Pipe, that targets streaming applications deployed on hybrid architectures. Here, we describe some of the current and future characteristics of the Auto-Pipe environment that facilitate an understanding of the performance of an application that is deployed on a hybrid system.
ISBN (print): 9781479986484
With the increased failure rate expected in future extreme-scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources used to execute the replicas of the application's processes. In this paper, we introduce intra-parallelization, a solution that avoids replicating all computation by introducing work-sharing between replicas. We show on a representative set of benchmarks that intra-parallelization achieves more than 50% efficiency without compromising fault tolerance.
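The work-sharing idea can be illustrated in a few lines: instead of both replicas computing every element, each computes half and the partial results are exchanged. This is an illustrative simplification under assumed even/odd splitting; the paper's mechanism operates on MPI processes, not Python lists.

```python
def intra_parallelized_map(f, xs):
    """Work-sharing between two replicas: replica 0 computes the even
    indices, replica 1 the odd indices, and the partial results are
    exchanged so both replicas end with the full output."""
    replica0 = {i: f(x) for i, x in enumerate(xs) if i % 2 == 0}
    replica1 = {i: f(x) for i, x in enumerate(xs) if i % 2 == 1}
    # exchange phase: each replica receives the half it did not compute
    full = [None] * len(xs)
    for partial in (replica0, replica1):
        for i, y in partial.items():
            full[i] = y
    return full
```

Each replica now performs roughly half the computation, which is what lifts efficiency above the 50% bound of plain replication.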
ISBN (print): 9780769561493
The majority of mainstream programming languages support parallel computing via extended libraries that require restructuring of sequential code. Library-based features are portable, but tend to be verbose and usually reduce the understandability and modifiability of code. In contrast, approaches based on language constructs promote simple code structures, hide the complexity of parallelization and avoid boilerplate code. However, language constructs normally impose additional development concepts and compilation requirements that may sacrifice ease of use and portability. Therefore, frameworks that offer simple and intuitive concepts, and constructs recognized by a language's standard compiler, can gain priority over other approaches. In this paper we discuss @PT (Annotation Parallel Task), a parallel programming framework that uses Java annotations, which are standard Java components, as its language constructs. @PT takes an intuitive object-oriented approach to the asynchronous execution of tasks, with a special focus on GUI-responsive applications. This paper presents the annotation-based programming interface of the framework and its fundamental parallelization concepts. Furthermore, it studies @PT in different parallel programming patterns, and evaluates its efficiency by comparing @PT with other Java parallelization approaches on a set of standard benchmarks.
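As a rough analog of the annotation-based style (sketched in Python, since @PT itself is a Java framework and its actual annotation names are not reproduced here), a decorator can mark a function for asynchronous execution while the call site keeps its sequential shape:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def task(fn):
    """Mark a function as an asynchronous task: calling it submits the
    work to a pool and immediately returns a future (a hypothetical
    Python stand-in for @PT's Java annotations)."""
    def submit(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return submit

@task
def render_thumbnail(size):
    # stand-in for long-running work kept off the GUI thread
    return f"thumbnail@{size}px"

future = render_thumbnail(64)       # returns immediately
result = future.result()            # block only when the value is needed
```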
ISBN (print): 9780769546766
Computers with hardware accelerators, also referred to as hybrid-core systems, speed up applications by offloading certain compute operations that can run faster on accelerators. Thus, it is not surprising that many Top500 supercomputers use accelerators. However, in addition to the procurement cost, significant programming and porting effort is required to realize the potential benefit of such accelerators. Hence, before building such a system, it is prudent to answer the question: what is the projected performance benefit from accelerators for the workloads of interest? We address this question by way of a performance-modeling framework that predicts realizable application performance on accelerators rapidly and accurately, without the considerable effort of porting and tuning. The framework first automatically identifies compute patterns commonly found in scientific applications, which we term idioms, that may benefit from accelerator technology. Next, the framework models the predicted speedup of those idioms if they were ported to and run on hardware accelerators. As a proof of concept, we characterize two kinds of accelerators: 1) the FPGA accelerators of a Convey HC-1 system, and 2) an NVIDIA Fermi GPU accelerator. We model the performance of the gather/scatter and stream idioms, and our predictions show that where these occur in two full-scale HPC applications, MILC and HYCOM, gather/scatter speeds up by as much as 15x and stream by as much as 14x, while the overall compute time of MILC improves by 3.4% and that of HYCOM by 20%.
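The two modeling steps can be caricatured in a few lines: classify a memory-access trace as a stream or gather/scatter idiom, then project whole-application speedup with an Amdahl-style formula. Both functions are illustrative assumptions, far simpler than the paper's framework.

```python
def classify_idiom(indices):
    """Label an access trace: unit-stride accesses form a 'stream'
    idiom; anything irregular is treated as 'gather/scatter'."""
    strides = {b - a for a, b in zip(indices, indices[1:])}
    return "stream" if strides <= {1} else "gather/scatter"

def predicted_app_speedup(idiom_fraction, idiom_speedup):
    """Amdahl-style projection: only the idiom's share of the runtime
    accelerates, which is why a 15x idiom speedup can translate into
    only a few percent at the application level."""
    return 1.0 / ((1.0 - idiom_fraction) + idiom_fraction / idiom_speedup)
```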
ISBN (print): 9781538643686
The sophisticated nature of parallel computing concepts makes parallel programming challenging. This has encouraged higher-level frameworks that conceal much of the complexity behind abstraction layers. Paradigms in this category are mostly performance-centric and pay less attention to the robustness of asynchronous executions, while current applications demand consistency in addition to fast performance. Therefore, programming environments that offer high-level support for asynchronous exception handling will have a higher chance of popularity. This paper discusses our latest enhancements to @PT, a parallel programming environment based on Java annotations. The proposed concept promotes the robustness of parallelized programs by adhering to the familiar exception-handling standards of sequential code and reducing asynchronous execution concerns at the API level. This study suggests that the concept simplifies the efficient management of asynchronous exceptions, which remains a challenge in parallel programming.
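The problem can be shown with a small Python sketch: an exception raised inside an asynchronous task stays hidden until its future is inspected. Registering a handler up front preserves the shape of sequential try/except code. Python futures stand in for @PT's Java tasks here; the handler-registration style is an assumption, not the framework's API.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # a task whose failure would normally surface far from the call site
    raise ConnectionError(f"cannot reach {url}")

def run_with_handler(fn, arg, on_error):
    """Run `fn` asynchronously, routing any exception to a handler
    registered at submission time instead of scattering checks over
    the program."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, arg)
        exc = future.exception()       # waits; None if the task succeeded
        return on_error(exc) if exc is not None else future.result()
```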
ISBN (print): 9781538639146
We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, requiring only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and to avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants.
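The kind of restructuring involved can be illustrated with a hand-written pipeline in which the next block's "communication" is launched before computing on the current block (a Python sketch with a thread pool standing in for MPI; Toucan derives such reordering automatically for C/C++ from dependence annotations):

```python
from concurrent.futures import ThreadPoolExecutor

def exchange(block):
    """Stand-in for an MPI communication step, e.g. a halo exchange."""
    return [x + 100 for x in block]

def compute(block):
    """Stand-in for the local computation on an exchanged block."""
    return sum(block)

def overlapped_pipeline(blocks):
    """Keep the next exchange in flight while computing on the current
    block, so communication is hidden behind computation."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(exchange, blocks[0])    # start first exchange
        for i in range(len(blocks)):
            ready = fut.result()                  # wait for this block
            if i + 1 < len(blocks):               # overlap: next comm runs
                fut = pool.submit(exchange, blocks[i + 1])
            results.append(compute(ready))        # ... during this compute
    return results
```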