ISBN (Print): 9781424437511
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although many automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint sizes by as much as 80% and enables asynchronous checkpointing.
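The incremental idea behind the abstract above can be sketched at runtime level: save only the state blocks that changed since the previous checkpoint. This is a minimal illustration, not the paper's compiler analysis; all names (`incremental_checkpoint`, the dict-based `store`) are hypothetical.

```python
import hashlib
import pickle

def incremental_checkpoint(state_blocks, previous_digests, store):
    """Save only blocks whose content changed since the last checkpoint.

    state_blocks: dict mapping block id -> application state for that block.
    previous_digests: dict mapping block id -> digest from the last checkpoint.
    store: dict standing in for the checkpoint destination (e.g. a parallel FS).
    Returns the new digest map and the number of blocks actually written.
    """
    new_digests, written = {}, 0
    for block_id, data in state_blocks.items():
        digest = hashlib.sha256(pickle.dumps(data)).hexdigest()
        new_digests[block_id] = digest
        if previous_digests.get(block_id) != digest:  # block is dirty
            store[block_id] = pickle.dumps(data)      # write only the delta
            written += 1
    return new_digests, written
```

A first call with an empty digest map writes every block; subsequent calls write only the modified ones, which is where the checkpoint-size reduction comes from.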
ISBN (Print): 9781509036820
The ever-increasing supercomputer architectural complexity emphasizes the need for high-level parallel programming paradigms. Among such paradigms, task-based programming manages to abstract away much of the architectural complexity while efficiently meeting the performance challenge, even at large scale. Dynamic run-time systems are typically used to execute task-based applications, to schedule computation resource usage and memory allocations. While computation scheduling has been well studied, the dynamic management of memory resource subscription inside such run-times has been little explored. This paper studies the cooperation between a task-based distributed application code and a run-time system engine to control the memory subscription levels throughout the execution. We show that the task paradigm makes it possible to control the memory footprint of the application by throttling the task submission flow rate, striking a compromise between the performance benefits of anticipative task submission and the resulting memory consumption. We illustrate the benefits of our contribution on a compressed dense linear algebra distributed application.
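The throttling mechanism described above can be sketched with a counting semaphore that caps the number of submitted-but-unfinished tasks, blocking the submission flow when the cap is reached. This is an illustrative sketch, not the paper's run-time engine; `ThrottledSubmitter` and the `runtime_submit` callback interface are hypothetical.

```python
import threading

class ThrottledSubmitter:
    """Cap the number of in-flight tasks so the memory reserved for their
    inputs and outputs stays below a budget implied by max_in_flight."""

    def __init__(self, max_in_flight):
        self._slots = threading.Semaphore(max_in_flight)

    def submit(self, runtime_submit, task):
        self._slots.acquire()             # block submission when the cap is hit

        def done_callback(_result):
            self._slots.release()         # free a slot when the task completes

        runtime_submit(task, done_callback)
```

Raising `max_in_flight` favors anticipative submission (more lookahead for the scheduler); lowering it bounds memory consumption, which is the compromise the abstract refers to.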
ISBN (Print): 9781424437511
The Petascale Cray XT5 system at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility (LCF) shares a number of system and software features with its predecessor, the Cray XT4 system, including the quad-core AMD processor and a multi-core aware MPI library. We analyze the performance of scalable scientific applications on the quad-core Cray XT4 system as part of early system access, using a combination of micro-benchmarks and Petascale-ready applications. In particular, we evaluate the impact of key changes that occurred during the dual-core to quad-core processor upgrade on application behavior, and provide projections for next-generation massively parallel platforms with multi-core processors, specifically for the proposed Petascale Cray XT5 system. We compare and contrast the quad-core XT4 system features with the upcoming XT5 system and discuss strategies for improving scaling and performance for our target applications.
ISBN (Print): 9781665435741
Tree-shaped task graphs have become a common paradigm on distributed platforms in various computational domains, such as electronic structure calculations and the factorization of sparse matrices. However, the scheduling of tree-shaped task graphs has rarely been studied for the more realistic heterogeneous multiprocessor platform (HEMP). This paper proposes an efficient algorithm named Partition-Allocation (PA) for parallel computing on HEMPs with limited memory. PA consists of two stages: partitioning and allocation. In the partitioning stage, a task tree is split into several subtrees. In the allocation stage, these subtrees are assigned to different processors for execution. PA reduces the makespan by prioritizing subtrees on the critical path, both in partitioning and in allocation. Experimental results on randomly generated trees and a real-world dataset show that the proposed PA is significantly better than the latest work in terms of average makespan, reducing it by up to 67.01% on the real-world dataset and 52.35% on randomly generated trees.
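The allocation stage described above can be illustrated with a simple list-scheduling heuristic on heterogeneous processors: place the heaviest subtrees first on the processor that would finish them earliest. This is a generic sketch under assumed inputs (subtree work estimates and relative processor speeds), not the paper's PA algorithm.

```python
def allocate_subtrees(subtree_weights, processor_speeds):
    """Greedy allocation: assign subtrees (heaviest first) to the processor
    that finishes them earliest, approximating a makespan-minimizing mapping.

    subtree_weights: dict mapping subtree id -> estimated work.
    processor_speeds: list of relative speeds, one per processor.
    Returns (assignment dict, estimated makespan).
    """
    finish = [0.0] * len(processor_speeds)
    assignment = {}
    for tree_id, work in sorted(subtree_weights.items(), key=lambda kv: -kv[1]):
        # completion time if this subtree were placed on processor p
        t, p = min((finish[p] + work / processor_speeds[p], p)
                   for p in range(len(processor_speeds)))
        finish[p] = t
        assignment[tree_id] = p
    return assignment, max(finish)
```

PA additionally prioritizes subtrees on the critical path and respects memory limits, which this sketch omits.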
ISBN (Print): 9781665497473
Memory caching has long been used to bridge the performance gap between processor and disk, reducing the data access time of data-intensive computations. Previous studies on caching mostly focus on optimizing the hit rate of a single machine. In this paper, we argue that the caching decisions of a distributed memory system should be made cooperatively for parallel data analytic applications, which are commonly used by emerging technologies such as Big Data and AI (Artificial Intelligence) to perform data mining and sophisticated analytics on larger data volumes in a shorter time. A parallel data analytic job consists of multiple parallel tasks; hence, the completion time of a job is bounded by its slowest task, meaning that the job cannot benefit from caching until all inputs of its tasks are cached. To address this problem, we propose a cooperative caching design that periodically rearranges the cache placement among nodes according to the data access pattern, while taking task dependency and network locality into account. Our approach is evaluated by a trace-driven simulator using both synthetic workloads and real-world traces. The results show that we can reduce average completion times by up to 33% compared to non-cooperative caching policies and 25% compared to other state-of-the-art cooperative caching policies.
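The all-or-nothing property above (a job only benefits once every task input is cached) suggests admitting whole input sets rather than individual blocks. The sketch below ranks jobs by access frequency per unit of cache space and admits complete input sets while capacity lasts; it is a simplified illustration, not the paper's placement algorithm, and ignores task dependency and network locality.

```python
def plan_cache_placement(jobs, capacity):
    """All-or-nothing cache admission for parallel jobs.

    jobs: list of (access_count, set_of_input_blocks), unit-size blocks.
    capacity: total number of blocks the distributed cache can hold.
    Returns the set of blocks to cache.
    """
    # Rank jobs by accesses per block of cache space they would occupy.
    ranked = sorted(jobs, key=lambda j: j[0] / len(j[1]), reverse=True)
    cached = set()
    for _accesses, inputs in ranked:
        extra = inputs - cached          # blocks not yet admitted
        if len(cached) + len(extra) <= capacity:
            cached |= extra              # admit the job's whole input set
    return cached
```

Admitting a partial input set would consume space without speeding up the job, which is why the loop either takes all of a job's missing blocks or none.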
ISBN (Print): 9781424437511
PDE solvers using Adaptive Mesh Refinement on block-structured grids are some of the most challenging applications to adapt to massively parallel computing environments. We describe optimizations to the Chombo AMR framework that enable it to scale efficiently to thousands of processors on the Cray XT4. The optimization process also uncovered OS-related performance variations that were not explained by conventional OS interference benchmarks. Ultimately, the variability was traced back to complex interactions between the application, system software, and the memory hierarchy. Once identified, software modifications to control the variability improved performance by 20% and decreased the variation in computation time across processors by a factor of 3. These newly identified sources of variation will impact many applications and suggest that new benchmarks for OS services be developed.
ISBN (Print): 9780769546759
As system sizes increase, the amount of time in which an application can run without experiencing a failure decreases. Exascale applications will need to address fault tolerance. In order to support algorithm-based fault tolerance, communication libraries will need to provide fault-tolerance features to the application. One important fault-tolerance operation is distributed consensus, which is used, for example, to collectively decide on a set of failed processes. This paper describes a scalable, distributed consensus algorithm that is used to support new MPI fault-tolerance features proposed by the MPI 3 Forum's fault-tolerance working group. The algorithm was implemented and evaluated on a 4,096-core Blue Gene/P. The implementation was able to perform a full-scale distributed consensus in 222 μs and scaled logarithmically.
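The logarithmic scaling mentioned above comes from tree-structured communication. The following simulation sketches a log-depth agreement on the union of per-process suspect sets; it is an idealized illustration and, unlike the paper's algorithm, does not handle processes failing during the protocol itself.

```python
def tree_consensus(local_suspect_sets):
    """Simulated log-depth agreement on the global set of failed processes.

    A binary-tree reduction computes the union of the local suspect sets in
    O(log P) rounds; a broadcast (implicit here) would then deliver the agreed
    set back to every process.
    """
    sets = [set(s) for s in local_suspect_sets]
    stride = 1
    while stride < len(sets):                  # one reduction round per level
        for i in range(0, len(sets), 2 * stride):
            if i + stride < len(sets):
                sets[i] |= sets[i + stride]    # "child" merges into "parent"
        stride *= 2
    return sets[0]                             # root holds the agreed set
```

With P processes the loop runs ceil(log2 P) rounds, which is the source of the logarithmic scaling observed in the abstract.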
ISBN (Print): 0818675829
In this paper we discuss the runtime support required for the parallelization of unstructured data-parallel applications in nonuniform and adaptive environments. The approach presented is reasonably general and is applicable to a wide variety of regular as well as irregular applications. We present performance results for the solution of an unstructured mesh on a cluster of heterogeneous workstations.
ISBN (Print): 9781538637906
Motivated by multi-particle entanglement, we propose a multi-party quantum parallel teleportation scheme in quantum wireless multi-hop networks (QWMNs), investigated with four-vertex graph-based entanglement. We present arbitrary single-qubit teleportation with a graph state between two directly connected parties and arbitrary two-qubit teleportation based on a four-vertex graph state, followed by multi-party teleportation based on the four-vertex graph state, in which the source node does not need to share any entangled pairs with the destination node; instead, entanglement swapping at intermediate nodes completes the teleportation. The source node and intermediate nodes perform von Neumann measurements and transmit their classical outcomes to the destination node independently, in a parallel manner. The destination node then applies the appropriate unitary operation to recover the source quantum state according to the received outcomes. This graph-based scheme extends the entanglement patterns usable in quantum teleportation and provides a complete multi-party teleportation model. The proposed scheme improves the flexibility of network construction and reduces communication delay, with wide application in quantum route selection.
ISBN (Print): 9781665435741
The Yin-He Global Spectral Model (YHGSM) embodies a parallel semi-Lagrangian solver with two communication schemes implemented: a maximum-wind-speed scheme and an on-demand scheme. The maximum-wind-speed scheme adopts a single, fixed data structure and incurs a large communication overhead. Although the on-demand scheme reduces this overhead, it remains substantial. In this paper, a novel adaptive approach is proposed in which a monthly maximum wind speed is used in the YHGSM. This approach reduces the difference between the actual wind speed and the maximum wind speed used in the model; in turn, the communication overhead in the trajectory computation is further reduced. Experiments show that with the adaptive maximum wind speed, the communication overheads of both the maximum-wind-speed and on-demand schemes are significantly reduced. In addition, in a ten-day forecast with the on-demand scheme, the total overhead of the semi-Lagrangian computation and the total parallel execution time are both reduced, and the reduction ratio increases as the number of nodes increases.
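The reason a tighter wind bound cuts communication can be sketched with the halo-depth calculation used in semi-Lagrangian solvers: the number of grid points a departure-point trajectory can cross in one time step bounds how much neighbor data must be exchanged. This is a generic illustration with hypothetical parameter values, not the YHGSM implementation.

```python
import math

def halo_depth(max_wind_ms, dt_s, grid_spacing_m):
    """Grid points a trajectory can traverse in one step: ceil(v_max * dt / dx).

    The semi-Lagrangian halo (and thus the message size) only needs to cover
    this distance, so a tighter maximum-wind bound shrinks the exchange.
    """
    return math.ceil(max_wind_ms * dt_s / grid_spacing_m)
```

For example, with a 600 s time step and 40 km grid spacing, lowering the assumed maximum wind from 120 m/s (a conservative global bound) to 60 m/s (a plausible monthly bound) halves the halo depth, and the exchanged data shrinks accordingly.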