Search results - Inner Mongolia University Library

Power management of extreme-scale networks with on/off links in runtime systems

ACM Transactions on Parallel Computing, 2015, Vol. 1, No. 2, pp. 1–21

Authors: Totoni, Ehsan; Jain, Nikhil; Kale, Laxmikant V. (Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States)

Networks are among the major power consumers in large-scale parallel systems. During the execution of common parallel applications, a sizeable fraction of the links in high-radix interconnects are either never used or underutilized. We propose a runtime-system-based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware- and compiler-based approaches. We discuss why the runtime system is the best system component to accomplish this task, and test the effectiveness of our approach using real applications (including NAMD and MILC) and application benchmarks (including the NAS Parallel Benchmarks and Stencil). These codes are simulated on representative topologies such as a 6-D torus and a multilevel directly connected network (similar to IBM PERCS in Power 775 and Dragonfly in Cray Aries). For common applications with a near-neighbor communication pattern, our approach can save up to 20% of the machine's total power and energy without any performance penalty. © 2015 ACM.

Keywords: Topology

Preventing the explosion of exascale profile data with smart thread-level aggregation

4th Workshop on Extreme Scale Programming Tools, ESPT 2015

Authors: Lorenz, Daniel; Shudler, Sergei; Wolf, Felix (Laboratory for Parallel Programming, Technische Universität Darmstadt, Darmstadt, Germany)

ISBN: (print) 9781450339971

State-of-the-art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, for exascale systems the number of threads is expected to be on the order of a billion, which would result in extremely large performance profiles. In most cases the user almost never inspects the individual per-thread data. In this paper, we propose to aggregate per-thread performance data in each process to reduce its amount to a reasonable size. Our goal is to aggregate the threads such that thread-level performance issues are still visible and analyzable. Therefore, we implemented four aggregation strategies in Score-P: (i) SUM aggregates all threads of a process into a process profile; (ii) SET calculates statistical key data as well as the sum; (iii) KEY identifies three threads (i.e., key threads) of particular interest for performance analysis and aggregates the rest of the threads; (iv) CALLTREE clusters threads that have the same call-tree structure. For each of these strategies we evaluate the compression ratio and how well it maintains thread-level performance behavior information. The aggregation does not incur any additional performance overhead at application run-time. © 2015 ACM.

Keywords: Aggregates
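
The SUM and SET strategies named in this abstract can be illustrated with a minimal Python sketch. This is not Score-P code: the profile layout (one dictionary of metric values per thread), the function names, and the particular statistics kept by SET (min, max, mean, standard deviation) are all assumptions made for illustration, since the abstract only says that SET stores "statistical key data as well as the sum".

```python
from statistics import mean, pstdev


def aggregate_sum(per_thread):
    """SUM (illustrative): collapse all threads of a process into one total per metric."""
    totals = {}
    for profile in per_thread:
        for metric, value in profile.items():
            totals[metric] = totals.get(metric, 0.0) + value
    return totals


def aggregate_set(per_thread):
    """SET (illustrative): keep the sum plus a few statistics per metric.

    Which statistics the real tool keeps is an assumption here.
    """
    metrics = sorted({m for profile in per_thread for m in profile})
    result = {}
    for m in metrics:
        values = [profile.get(m, 0.0) for profile in per_thread]
        result[m] = {
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stddev": pstdev(values),
        }
    return result


# Hypothetical per-thread profiles for one process (three threads).
threads = [
    {"time": 4.0, "visits": 10},
    {"time": 6.0, "visits": 10},
    {"time": 8.0, "visits": 12},
]
process_profile = aggregate_sum(threads)
process_stats = aggregate_set(threads)
```

In this sketch the stored size no longer grows with the thread count: SUM keeps one number per metric and SET keeps five, while the min, max, and standard deviation still hint at imbalance between threads.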

PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Journal of Computer Science & Technology, 2012, Vol. 27, No. 2, pp. 240-255

Authors: 徐新海, 杨学军, 薛京灵, 林宇斐, 林一松 (National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology; Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales)

GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery, while incurring negligible performance overheads when no fault happens.

Keywords: GPGPU; partial recomputing; fault tolerance; CUDA; checkpointing

Adaptive MPI

16th International Workshop on Languages and Compilers for Parallel Computing, LCPC 2003

Authors: Huang, Chao; Lawlor, Orion; Kalé, L.V. (Parallel Programming Laboratory, University of Illinois at Urbana-Champaign, United States)

ISBN: (print) 9783540246442

Processor virtualization is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. Charm++ is an early language/system that supports processor virtualization. This paper describes Adaptive MPI, or AMPI, an MPI implementation and extension that supports processor virtualization. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. AMPI includes a powerful runtime support system that takes advantage of the degree of freedom afforded by allowing it to assign VPs onto processors. With this runtime system, AMPI supports such features as automatic adaptive overlap of communication and computation and automatic load balancing. It can also support other features such as checkpointing without additional user code, and the ability to shrink and expand the set of processors used by a job at runtime. This paper describes AMPI, its features, benchmarks that illustrate performance advantages and tradeoffs offered by AMPI, and application experiences. © Springer-Verlag Berlin Heidelberg 2004.

Keywords: Degrees of freedom (mechanics)

A parallel framework for explicit FEM

7th International Conference on High Performance Computing, HiPC 2000

Authors: Bhandarkar, Milind A.; Kalé, Laxmikant V. (Parallel Programming Laboratory, Department of Computer Science, University of Illinois Urbana-Champaign, United States)

ISBN: (print) 3540414290

As a part of an ongoing effort to develop a "standard library" for scientific and engineering parallel applications, we have developed a preliminary finite element framework. This framework allows an application scientist interested in modeling structural properties of materials, including dynamic behavior such as crack propagation, to develop codes that embody their modeling techniques without having to pay attention to the parallelization process. The resultant code modularly separates parallel implementation techniques from numerical algorithms. As the framework builds upon an object-based load balancing framework, it allows the resultant applications to automatically adapt to load imbalances resulting from the application or the environment (e.g. timeshared clusters). This paper presents results from the first version of the framework, and demonstrates results on a crack propagation application. © Springer-Verlag Berlin Heidelberg 2000.

Keywords: Crack propagation

Run-time support for adaptive load balancing

15 Workshops Held in Conjunction with the IEEE International Parallel and Distributed Processing Symposium, IPDPS 2000

Authors: Bhandarkar, Milind A.; Brunner, Robert K.; Kalé, Laxmikant V. (Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, United States)

ISBN: (print) 354067442X

Many parallel scientific applications have a dynamic and irregular computational structure. However, most such applications exhibit persistence of computational load and communication structure. This allows us to embed a measurement-based automatic load balancing framework in the run-time systems of the parallel languages that are used to build such applications. In this paper, we describe such a framework built for the Converse [4] interoperable runtime system. This framework is composed of mechanisms for recording application performance data, a mechanism for object migration, and interfaces for plug-in load balancing strategy objects. Interfaces for strategy objects allow easy implementation of novel load balancing strategies that could use application characteristics on the entire machine, or only a local neighborhood. We present the performance of a few strategies on a synthetic benchmark and also the impact of automatic load balancing on an actual application. © 2000 Springer-Verlag Berlin Heidelberg.

Keywords: Interoperability
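
To make the idea of a plug-in, measurement-based strategy concrete, here is a minimal Python sketch. It is not the Converse or Charm++ interface: the function name, the input format (a dictionary of measured per-object loads), and the choice of a greedy heuristic are assumptions for illustration. The abstract does not say which strategies were evaluated. The sketch relies only on the persistence assumption stated above, namely that loads measured in the recent past predict loads in the near future.

```python
import heapq


def greedy_rebalance(object_loads, num_processors):
    """Illustrative greedy strategy: place the heaviest objects first,
    each onto the currently least-loaded processor.

    object_loads: dict mapping object id -> measured load (hypothetical format).
    Returns a dict mapping object id -> processor index.
    """
    # Min-heap of (accumulated load, processor index).
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    placement = {}
    # Sort by descending load; break ties by object id for determinism.
    for obj, load in sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, proc = heapq.heappop(heap)
        placement[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    return placement


# Hypothetical measured loads for four migratable objects.
measured = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}
placement = greedy_rebalance(measured, 2)

# Resulting load per processor under this placement.
per_proc = [0.0, 0.0]
for obj, proc in placement.items():
    per_proc[proc] += measured[obj]
```

A strategy of this kind uses load information from the entire machine. A neighborhood-based strategy, which the abstract also mentions, would instead consult only the loads of nearby processors.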

Using shared arrays in message-driven parallel programs

25th IEEE International Parallel and Distributed Processing Symposium, Workshops and PhD Forum, IPDPSW 2011

Authors: Miller, Phil; Becker, Aaron; Kalé, Laxmikant (Parallel Programming Laboratory, Department of Computer Science, University of Illinois Urbana-Champaign, United States)

ISBN: (print) 9780769543857

This paper describes a safe and efficient combination of the object-based message-driven execution and shared array parallel programming models. In particular, we demonstrate how this combination engenders the composition of loosely coupled parallel modules safely accessing a common shared array. That loose coupling enables both better flexibility in parallel execution and greater ease of implementing multi-physics simulations. As a case study, we describe how the parallelization of a new method for molecular dynamics simulation benefits from both of these advantages. We also describe a system of typed handle objects that embed some of the determinacy constraints of the Multiphase Shared Array programming model in the C++ type system, to catch some violations at compile time. The combined programming model communicates in terms of these handles as a natural means of detecting and preventing errors. © 2011 IEEE.

Keywords: Molecular dynamics

Preserving the original MPI semantics in a virtualized processor environment

Authors: Rodrigues, Eduardo R.; Navaux, Philippe O.A.; Panetta, Jairo; Mendes, Celso L. (IBM Research, Brazil; Institute of Informatics, UFRGS, Brazil; Center for Weather Forecasts and Climate Studies, INPE, Brazil; Parallel Programming Laboratory, UIUC, United States)

Processor virtualization is a technique in which a programmer divides a computation into many entities, which are mapped to the available processors. The number of these entities, referred to as virtual processors, is typically larger than the number of physical processors. For an MPI program, the user decomposes the computation into more MPI tasks than physical processors. This approach allows overlapping computation and communication, and enables load balancing. User-level threads are often used to implement these virtual processors because they are generally faster to create, manage and migrate than heavy processes or kernel threads. However, these threads present issues concerning private data because they break the private address space assumption typically made by MPI programs. In this paper, we propose a new approach to privatize data in user-level threads. This approach is based on thread-local storage (TLS), which is often used by kernel threads. We apply this technique so that MPI programs can be executed in a virtualized environment while preserving their original semantics. We show that this alternative has a more efficient context switch and lower migration cost and is simpler to implement than other approaches. © 2012 Elsevier B.V. All rights reserved.

Keywords: Semantics

Accelerating Brain Simulations with the Fast Multipole Method

arXiv, 2023

Authors: Nöttgen, Hannah; Czappa, Fabian; Wolf, Felix (Laboratory for Parallel Programming, Technical University of Darmstadt, Darmstadt, Germany)

The brain is probably the most complex organ in the human body. To understand processes such as learning or healing after brain lesions, we need suitable tools for brain simulations. The Model of Structural Plasticity offers a solution to that problem. It provides a way to model the brain bottom-up by specifying the behavior of the neurons and using structural plasticity to form the synapses. However, its original formulation involves a pairwise evaluation of attraction kernels, which drastically limits scalability. While this complexity has recently been decreased to O(n · log² n) after reformulating the task as a variant of an n-body problem and solving it using an adapted version of the Barnes-Hut approximation, we propose an even faster approximation based on the fast multipole method (FMM). The fast multipole method was initially introduced to solve pairwise interactions in linear time. Our adaptation achieves this time complexity, and it is also faster in practice than the previous approximation. © 2023, CC BY-NC-ND.

Keywords: Scalability

A Dynamic Resource Management System for Network-Attached Accelerator Clusters

International Conference on Parallel Processing (ICPP)

Authors: Suraj Prabhakaran; Mohsin Iqbal; Sebastian Rinke; Felix Wolf (German Research School for Simulation Sciences, Laboratory for Parallel Programming, Aachen, Germany)

ISBN: (print) 9781479914487

Over the years, cluster systems have become increasingly heterogeneous by equipping cluster nodes with one or more accelerators such as graphics processing units (GPUs). These devices are typically attached to a compute node via PCI Express. As a consequence, batch systems such as TORQUE/Maui and SLURM have been extended to be aware of those additional resources tightly coupled with compute nodes. Recent advances in accelerator technology have given rise to the possibility of using network-attached accelerators in addition to node-attached accelerators. However, current batch systems do not support this new usage scenario of accelerators. This work focuses on batch-system support for allocating network-attached accelerators. The most important feature of the proposed batch system is its ability to dynamically allocate network-attached accelerators to jobs at application runtime. We discuss our extensions to the TORQUE and Maui batch system and elaborate on its features in the Dynamic Accelerator-Cluster Architecture, which describes an integration of network-attached accelerators into a cluster system. We also evaluate the dynamic allocation scenarios and show how batch systems can be designed to provide support for more flexible and dynamic cluster systems.

Keywords: Resource management; Computer architecture; Dynamic scheduling; Torque; Graphics processing units; Servers; Method of moments
