Physical synthesis engines need to embrace all available parallelism to cope with the increasing complexity of modern designs while still offering high quality of results. To achieve this goal, the involved algorithms need to be expressed in a way that facilitates fast execution across a range of computing platforms. In this work, we introduce a task-based parallel programming template that can be used to speed up timing and power optimization. This approach utilizes all available parallelism and enables significant speedup relative to custom multithreaded approaches. Task-based parallelism is applied to all parts of the optimization engine, including parts that are traditionally executed serially to preserve maximum timing accuracy. Using Taskflow as the parallel programming and execution engine, we achieved a speedup of 1.7x to 2.8x for gate sizing optimizations on the ISPD13 benchmarks with marginal extra leakage power relative to state-of-the-art multithreaded gate sizers. This result was supported by two dynamic heuristics that restrict the number of examined gate sizes and simplify local timing updates. Both heuristics trade additional runtime reduction for marginal leakage power increases.
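For readers unfamiliar with Taskflow, the following minimal C++ sketch shows the kind of task-graph expression the abstract refers to; the pass names and dependency structure are illustrative placeholders, not the authors' actual optimization engine.

```cpp
#include <taskflow/taskflow.hpp>  // Taskflow: header-only C++ task-parallel library

int main() {
  tf::Executor executor;            // pool of worker threads with work stealing
  tf::Taskflow taskflow("resize");

  // Illustrative pipeline: candidate-size selection must finish before the
  // local timing update, which must finish before committing the new sizes.
  auto pick   = taskflow.emplace([] { /* choose candidate gate sizes */ });
  auto timing = taskflow.emplace([] { /* incremental local timing update */ });
  auto commit = taskflow.emplace([] { /* accept or reject the resize */ });

  pick.precede(timing);    // pick -> timing
  timing.precede(commit);  // timing -> commit

  executor.run(taskflow).wait();  // schedule the graph and block until done
}
```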
A popular approach to programming scalable irregular applications is Asynchronous Many-Task (AMT) programming. Here, programs define tasks according to task models such as dynamic independent tasks (DIT) or nested fork-join (NFJ). We consider cluster AMTs, in which a runtime system maps the tasks to worker threads in multiple processes. In this setting, dynamic load balancing can be achieved via cooperative work stealing, coordinated work stealing, or work sharing. A well-performing cooperative work stealing variant is the lifeline scheme. While previous implementations of this scheme are restricted to single-worker processes, a recent hybrid extension combines it with intra-process work sharing between multiple workers. The hybrid scheme, which was proposed for both DIT and NFJ, comes at the price of higher complexity. This paper investigates whether this complexity is indispensable for multi-worker processes by contrasting the hybrid scheme with a novel pure work stealing extension of the lifeline scheme to multiple workers. We independently implemented the extension for DIT and NFJ. In experiments based on four benchmarks, we observed the pure scheme to be on a par with or even outperform the hybrid one by up to 18% for DIT and up to 5% for NFJ. Building on this main result, we studied a modification of the pure scheme, which prefers local over global victims, and more heavily loaded over less loaded ones. The modification improves the performance of the pure scheme by up to 15%. Finally, we explored whether the lifeline scheme can profit from a change to coordinated work stealing. We developed a coordinated multi-worker implementation for DIT and observed a performance improvement over the cooperative scheme by up to 17%.
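The lifeline scheme mentioned above can be summarized by its victim-selection order: a few random steal attempts, then the preconfigured lifeline buddies, then idle waiting until a lifeline delivers work. The C++ sketch below illustrates only this control flow; try_steal, the attempt count, and the lifeline topology are hypothetical stand-ins.

```cpp
#include <cstdio>
#include <random>
#include <vector>

constexpr int kRandomAttempts = 2;  // lifeline schemes typically try a few random victims first

// Stub standing in for an actual (possibly remote) steal attempt.
bool try_steal(int victim) {
  std::printf("steal attempt at victim %d\n", victim);
  return false;  // stub: no work found
}

bool acquire_work(int self, int num_workers, const std::vector<int>& lifelines) {
  std::mt19937 rng(self);
  std::uniform_int_distribution<int> pick(0, num_workers - 1);

  for (int i = 0; i < kRandomAttempts; ++i) {  // 1) random victims
    int victim = pick(rng);
    if (victim != self && try_steal(victim)) return true;
  }
  for (int buddy : lifelines) {                // 2) lifeline buddies; a failed
    if (try_steal(buddy)) return true;         //    request is recorded, and the
  }                                            //    buddy later pushes work to us
  return false;                                // 3) caller goes idle
}

int main() {
  // Worker 0 of 8, with lifelines derived from a hypercube-like topology.
  std::vector<int> lifelines = {1, 2, 4};
  if (!acquire_work(0, 8, lifelines)) std::puts("going idle, waiting on lifelines");
}
```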
Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying programming and in exploiting the available hardware parallelism on shared memory systems. In this paper we focus on how to extend task-parallel programming to distributed memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of hardware. We test the proposed programming model on two different applications, a Cholesky factorization and a solver for the shallow water equations. We also compare the performance of our implementation with that of other frameworks for distributed task-parallel programming, and show that it is competitive.
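As a rough illustration of hierarchical decomposition (using Taskflow subflows, not the framework from the paper): a coarse task per distributed-level block dynamically spawns finer child tasks for the shared-memory level.

```cpp
#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  for (int block = 0; block < 4; ++block) {    // coarse, node-level decomposition
    taskflow.emplace([block](tf::Subflow& sf) {
      for (int tile = 0; tile < 8; ++tile) {   // fine, core-level decomposition
        sf.emplace([block, tile] { /* factorize/update one tile of this block */ });
      }
      sf.join();  // child tasks must complete before the parent task finishes
    });
  }
  executor.run(taskflow).wait();
}
```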
ISBN: 9781450384414 (Print)
Resource elasticity allows the resources of running jobs to be changed dynamically, which may significantly improve the throughput on supercomputers. Elasticity requires support from both job schedulers and user applications. Whereas the adaptation of traditional programs requires additional programmer effort, task-based programs can be made elastic in a transparent way. In this paper, we propose a corresponding technique for implementation in a runtime system. We refer to a work stealing-based runtime for clusters, which uses the lifeline scheme for victim selection, combines inter-node work stealing with intra-node work sharing, and handles dynamic independent tasks, i.e., tasks that may spawn child tasks but do not otherwise cooperate. We experimentally assess the elasticity overhead of our scheme and find that adding/releasing up to 64 nodes takes less than 0.5 seconds. This value is determined with the help of a new formula that estimates the overhead-free running time of work-stealing programs with a changing number of workers. Using this result, we then quantify the gain of deploying elastic jobs. For that, we simulate the execution of job sets that contain some percentage of elastic jobs on two hypothetical supercomputers. We use an existing elastic job scheduler, which we concretize, e.g., with a new heuristic to determine the minimum, maximum, and preferred number of nodes for a job. Results show that the makespan can be reduced by up to 20% if most jobs are elastic.
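The abstract does not give the formula, but one plausible way to estimate the overhead-free running time of a run whose worker count changes is to split the run into phases with a constant worker count and charge each phase its work divided by its workers. The sketch below implements exactly that assumption; it is an illustration, not necessarily the paper's formula.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical estimator: w[i] is the aggregate work (core-seconds) performed
// during phase i on p[i] workers; an overhead-free execution would have
// needed w[i] / p[i] wall-clock seconds for that phase.
struct Phase { double work_core_seconds; int workers; };

double overhead_free_time(const std::vector<Phase>& phases) {
  double t = 0.0;
  for (const auto& ph : phases) t += ph.work_core_seconds / ph.workers;
  return t;
}

int main() {
  // 64 workers, then the job is grown to 128 workers for the second phase.
  std::vector<Phase> phases = {{6400.0, 64}, {12800.0, 128}};
  std::printf("ideal time: %.1f s\n", overhead_free_time(phases));  // 100 + 100
}
```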
ISBN: 9781728196664 (Print)
Irregularity in parallel programs is often tackled by dynamic load balancing. Two well-known techniques, namely work sharing and work stealing, were recently combined into a hybrid scheme, which uses the former for shared-memory and the latter for distributed-memory load balancing. Nested fork-join programs are an important type of task-based parallel programs, in which tasks generate child tasks, giving rise to an execution tree. The hybrid scheme was originally designed for a different task model. We transfer the hybrid scheme to nested fork-join programs and compare different variants of the scheme, called load balancing policies. Experimental results show that the considered policies perform well on the benchmarks used, with no significant difference between them.
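A nested fork-join program in miniature, for readers unfamiliar with the task model: every task may fork children and joins their results, so the execution forms a tree. The sketch uses std::async as a stand-in for an AMT runtime's task spawn; the serial cutoff is a common idiom, not part of the paper.

```cpp
#include <cstdio>
#include <future>

long fib(int n) {
  if (n < 2) return n;
  if (n < 16) return fib(n - 1) + fib(n - 2);  // serial cutoff bounds thread count
  auto child = std::async(std::launch::async, fib, n - 1);  // fork a child task
  long right = fib(n - 2);                                  // compute locally
  return child.get() + right;                               // join
}

int main() { std::printf("fib(24) = %ld\n", fib(24)); }
```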
ISBN: 9781665435772 (Print)
With the advent of exascale computing, issues such as application irregularity and permanent hardware failure are growing in importance. Irregularity is often addressed by task-based parallel programming coupled with work stealing. At the task level, resilience can be provided by two principal approaches, namely checkpointing and supervision. For both, particular algorithms have been worked out recently. They perform local recovery and continue the program execution on a reduced set of resources. The checkpointing algorithms regularly save task descriptors explicitly, while the supervision algorithms exploit their natural duplication during work stealing and may be coupled with steal tracking to minimize the number of task re-executions. Thus far, the two groups of algorithms have been targeted at different task models: checkpointing algorithms at dynamic independent tasks, and supervision algorithms at nested fork-join programs. This paper transfers the most advanced supervision algorithm to the dynamic independent tasks model, thus enabling a comparison between checkpointing and supervision. Our comparison includes experiments and running time predictions. Results consistently show typical resilience overheads below 1% for both approaches. The overheads are lower for supervision in practically relevant cases, but checkpointing takes over at the scale of millions of processes.
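The following sketch illustrates the checkpointing idea only in outline: a worker periodically saves the descriptors of its still-open tasks so that another process can re-adopt them after a failure. The interval, data structures, and in-memory "store" are assumptions, not the published algorithm.

```cpp
#include <cstdio>
#include <deque>
#include <string>

struct TaskDescriptor { long id; std::string payload; };

class Worker {
 public:
  void push(TaskDescriptor t) { deque_.push_back(t); }

  void run() {
    while (!deque_.empty()) {
      TaskDescriptor t = deque_.front(); deque_.pop_front();
      process(t);
      if (++steps_ % kInterval == 0) checkpoint();  // periodic, local checkpoint
    }
  }

 private:
  static constexpr int kInterval = 100;

  void checkpoint() {
    // A real system would write to a resilient store (e.g., a replicated
    // in-memory key-value map), not stdout.
    std::printf("checkpoint: %zu open tasks\n", deque_.size());
  }
  void process(const TaskDescriptor&) { /* execute task; may push children */ }

  std::deque<TaskDescriptor> deque_;
  long steps_ = 0;
};

int main() {
  Worker w;
  for (long i = 0; i < 250; ++i) w.push({i, "work"});
  w.run();  // checkpoints after the 100th and 200th task
}
```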
ISBN: 9781538649756 (Print)
Today, large-scale simulation applications are becoming common in research and industry. A significant fraction of them run on multi-core clusters. Current parallel simulation kernels use multiple processes and threads to exploit inter-node and intra-node parallelism on multi-core clusters. We exploit task-based parallelism in parallel discrete event simulation (PDES) kernels, which is more fine-grained than thread-level and process-level parallelism. In our system, every simulation event is wrapped in a task. A work-stealing task scheduling scheme is applied to achieve dynamic load balancing among the cores, and a graph partitioning approach is applied to partition simulation entities among the cluster nodes. Experimental results show that our PDES kernel outperforms existing PDES kernels by fully exploiting task parallelism.
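To make the "event wrapped in a task" idea concrete, here is a minimal C++ sketch: events carry a timestamp and an action, and the kernel pops them in timestamp order. In the described system the popped event would be submitted to a work-stealing task pool rather than executed inline; all names here are illustrative.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Event {
  double time;
  std::function<void(double)> action;  // the task body wrapped around the event
  bool operator>(const Event& o) const { return time > o.time; }
};

int main() {
  // Future event list, ordered by smallest timestamp first.
  std::priority_queue<Event, std::vector<Event>, std::greater<Event>> fel;
  fel.push({2.0, [](double t) { std::printf("event at %.1f\n", t); }});
  fel.push({1.0, [](double t) { std::printf("event at %.1f\n", t); }});

  while (!fel.empty()) {
    Event e = fel.top(); fel.pop();
    e.action(e.time);  // in the PDES kernel: enqueue as a task for work stealing
  }
}
```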
ISBN: 9781118028124 (Digital), 9781118546031 (Print)
Expert guidance for those programming today's dual-core processor PCs. As PC processors explode from one or two to now eight cores, there is an urgent need for programmers to master concurrent programming. This book dives deep into the latest technologies available to programmers for creating professional parallel applications using C#, .NET 4, and Visual Studio 2010. The book covers task-based programming, coordination data structures, PLINQ, thread pools, the asynchronous programming model, and more. It also teaches other parallel programming techniques, such as SIMD and vectorization. Teaches programmers professional-level, task-based parallel programming with C#, .NET 4, and Visual Studio 2010. Covers concurrent collections, coordination data structures, PLINQ, thread pools, the asynchronous programming model, Visual Studio 2010 debugging, and parallel testing and tuning. Explores vectorization, SIMD instructions, and additional parallel libraries. Master the tools and technology you need to develop thread-safe concurrent applications for multi-core systems with Professional Parallel Programming with C#.