ISBN (print): 0769523129
Breakthrough-quality scientific discoveries in the new millennium (such as those expected in computational biology and elsewhere), along with optimal engineering designs, have created a demand for High-End Computing (HEC) systems with sustained performance requirements at the petaflop scale and beyond. Despite the very pessimistic (if not outright negative) views of parallel computing systems that prevailed in the 1990s, there appears to be no viable alternative for such HEC systems. In this talk, we present a fresh look at the problems facing the design of petascale parallel computing systems. We review several fundamental issues that such HEC parallel computing systems must resolve, including execution models that support dynamic and adaptive multithreading, fine-grain synchronization, and a global name space with memory consistency. Related issues in parallel programming, dynamic compilation models, and system software design will also be discussed. Present solutions and future directions will be discussed based on (1) application demand (e.g. computational biology and others), (2) the recent trend demonstrated by the HTMT, HPCS, and Blue-Gene Cyclops (e.g. Cyclops-64) architectures, and (3) a historical perspective on influential models such as dataflow, along with concepts learned from these models.
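To make the fine-grain synchronization issue concrete, the following is a minimal, illustrative C11 sketch of a full/empty flag attached to a single data word, in the spirit of dataflow-inspired designs such as HTMT and Cyclops. It is not any specific machine's primitive; the type and function names are assumptions for illustration.

/* Fine-grain synchronization on a single word: a consumer spins
 * until the producer marks the word full. Illustrative sketch only. */
#include <stdatomic.h>

typedef struct {
    double     value;
    atomic_int full;   /* 0 = empty, 1 = full */
} sync_word;

/* Producer: write the datum, then publish it with release ordering. */
void sync_write(sync_word *w, double v) {
    w->value = v;
    atomic_store_explicit(&w->full, 1, memory_order_release);
}

/* Consumer: spin until the word is full, then read with acquire
 * ordering so the datum is guaranteed visible. */
double sync_read(sync_word *w) {
    while (!atomic_load_explicit(&w->full, memory_order_acquire))
        ;  /* fine-grain wait on this single word */
    return w->value;
}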
ISBN (print): 9781467309745
The recent evolution of many-core architectures has produced chips where the number of processor elements (PEs) is in the hundreds and continues to grow. In addition, many-core processors are increasingly characterized by the diversity of their resources and by the way sharing of those resources is arbitrated. On such machines, task scheduling is of paramount importance for orchestrating a satisfactory distribution of tasks with efficient utilization of resources, especially when fine-grain parallelism is desired or required. In the past, the primary focus of scheduling techniques has been on achieving load balance and reducing overhead with the aim of increasing total performance. This focus has resulted in a scheduling paradigm where Static Scheduling (SS) is preferred to Dynamic Scheduling (DS) for highly regular and embarrassingly parallel applications running on homogeneous architectures. We have revisited the task scheduling problem for these types of applications under the scenario imposed by many-core architectures to investigate whether there exist situations where DS is better than SS. Our main contribution is the observation that, for highly regular and embarrassingly parallel applications, DS is preferable to SS in some situations commonly found on many-core architectures. We present experimental evidence showing how the performance of SS degrades in the new environment of many-core chips. We analyze three reasons that contribute to the superiority of DS over SS on many-core architectures under these conditions: 1) a uniform mapping of work to processors that ignores task granularity is not necessarily scalable under limited amounts of work; 2) shared resources (e.g. the crossbar switch) produce unexpected and stochastic variations in task duration that SS is unable to manage properly; and 3) hardware features, such as in-memory atomic operations, greatly contribute to decreasing the overhead of dynamic scheduling.
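As an illustration of the contrast the abstract draws, here is a minimal sketch (not the authors' code) of the two scheduling styles for a regular, embarrassingly parallel loop, using POSIX threads and C11 atomics; the names, thread count, and chunk size are illustrative assumptions. Under dynamic scheduling, a thread slowed by resource contention simply claims fewer chunks, whereas a static slice forces every thread to finish its fixed share regardless of stochastic task-duration variations.

#include <pthread.h>
#include <stdatomic.h>

#define N       (1 << 20)
#define THREADS 8
#define CHUNK   1024            /* iterations claimed per atomic operation */

static double data[N];
static atomic_long next_chunk;  /* shared work pointer for the dynamic case */

static void process(long i) { data[i] = data[i] * 2.0 + 1.0; }

/* Static Scheduling (SS): each thread gets one fixed, uniform slice. */
static void *run_static(void *arg) {
    long id = (long)arg, per = N / THREADS;
    for (long i = id * per; i < (id + 1) * per; i++)
        process(i);
    return NULL;
}

/* Dynamic Scheduling (DS): threads claim chunks with an atomic
 * fetch-and-add until the work runs out. */
static void *run_dynamic(void *arg) {
    (void)arg;
    for (;;) {
        long start = atomic_fetch_add(&next_chunk, CHUNK);
        if (start >= N) break;
        long end = start + CHUNK < N ? start + CHUNK : N;
        for (long i = start; i < end; i++)
            process(i);
    }
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (long i = 0; i < THREADS; i++)   /* SS run */
        pthread_create(&t[i], NULL, run_static, (void *)i);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    atomic_store(&next_chunk, 0);        /* DS run over the same work */
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, run_dynamic, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
    return 0;
}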
We present an automatic approach for prefetching data for linked-list data structures. The main idea is based on the observation that linked-list elements are frequently allocated at a constant distance from one another...
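Although only this preview of the abstract is available, the idea it describes can be sketched: if consecutive nodes tend to sit at a constant distance in memory, a traversal can prefetch the address that the observed stride predicts several nodes ahead, hiding memory latency behind the pointer chase. The sketch below is a hypothetical C rendering using the GCC/Clang __builtin_prefetch intrinsic; it is not the paper's implementation, and PREFETCH_AHEAD and the structure names are illustrative assumptions.

#include <stddef.h>

struct node { int payload; struct node *next; };

#define PREFETCH_AHEAD 4   /* run this many predicted nodes ahead */

void traverse_with_prefetch(struct node *head) {
    ptrdiff_t stride = 0;
    for (struct node *n = head; n != NULL; n = n->next) {
        if (n->next)   /* track the most recently observed allocation stride */
            stride = (char *)n->next - (char *)n;
        if (stride)    /* prefetch where the stride predicts a future node */
            __builtin_prefetch((char *)n + stride * PREFETCH_AHEAD, 0, 1);
        /* ... consume n->payload here ... */
    }
}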
Optimization of parallel applications on new many-core architectures is challenging even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures yield only incremental performance gains, and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantage of static techniques stems from the new scenario present in many-cores: plenty of thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important to application performance. In particular, we found that the scheduling of data-movement operations significantly impacts performance. To overcome these difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations that schedule data movement properly. In addition, we fused percolation operations with dynamic scheduling into a dynamic percolation approach. We used dense matrix multiplication on a modern many-core to illustrate how the proposed techniques increase performance in these new environments. In our study on the IBM Cyclops-64, we raised performance from 44 GFLOPS (out of a possible 80 GFLOPS) to 70.0 GFLOPS with operands in on-chip memory and 65.6 GFLOPS with operands in off-chip memory. The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory, respectively.
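A minimal sketch of the dynamic percolation idea follows, assuming an ordinary shared-memory machine with POSIX threads as a stand-in for Cyclops-64: each thread dynamically claims a result tile via an atomic fetch-and-add, percolates the operand tiles into a fast local buffer (standing in for on-chip memory), computes entirely out of that buffer, and writes the tile back. The tile size, thread count, and all names are illustrative assumptions, not the authors' code.

#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

#define N        256   /* matrix dimension */
#define T        32    /* tile edge; N must be divisible by T */
#define TILES    ((N / T) * (N / T))
#define NTHREADS 4

static double A[N][N], B[N][N], C[N][N];
static atomic_int next_tile;   /* dynamically scheduled tile counter */

static void *worker(void *arg) {
    (void)arg;
    /* per-thread staging buffers: the "on-chip" scratch area */
    static _Thread_local double a[T][T], b[T][T], c[T][T];
    for (;;) {
        int t = atomic_fetch_add(&next_tile, 1);   /* claim a tile of C */
        if (t >= TILES) break;
        int ti = (t / (N / T)) * T, tj = (t % (N / T)) * T;
        memset(c, 0, sizeof c);
        for (int k = 0; k < N; k += T) {
            /* percolation: stage both operand tiles before computing */
            for (int i = 0; i < T; i++) {
                memcpy(a[i], &A[ti + i][k], sizeof a[i]);
                memcpy(b[i], &B[k + i][tj], sizeof b[i]);
            }
            /* compute entirely out of the staged buffers */
            for (int i = 0; i < T; i++)
                for (int kk = 0; kk < T; kk++)
                    for (int j = 0; j < T; j++)
                        c[i][j] += a[i][kk] * b[kk][j];
        }
        for (int i = 0; i < T; i++)   /* write the finished tile back */
            memcpy(&C[ti + i][tj], c[i], sizeof c[i]);
    }
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}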