Search Results - Inner Mongolia University Library

14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Schneider, Scott; Yeom, Jae-Seung; Rose, Benjamin; Linford, John C.; Sandu, Adrian; Nikolopoulos, Dimitrios S. Affiliation: Virginia Tech, Dept Comp Sci, Blacksburg, VA 24060, USA

ISBN: (print) 9781605583976

On multiprocessors with explicitly managed memory hierarchies (EMM), software has the responsibility of moving data in and out of fast local memories. This task can be complex and error-prone even for expert programmers. Before we can allow compilers to handle the complexity for us, we must identify the abstractions that are general enough to allow us to write applications with reasonable effort, yet specific enough to exploit the vast on-chip memory bandwidth of EMM multi-processors. To this end, we compare two programming models against hand-tuned codes on the STI Cell, paying attention to programmability and performance. The first programming model, Sequoia, abstracts the memory hierarchy as private address spaces, each corresponding to a parallel task. The second, Cellgen, is a new framework which provides OpenMP-like semantics and the abstraction of a shared address space divided into private and shared data. We compare three applications programmed using these models against their hand-optimized counterparts in terms of abstractions, programming complexity, and performance.

Keywords: Design; Languages; Cell BE; Explicitly Managed Memory Hierarchies; Programming Models

Petascale Computing with Accelerators

14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Kistler, Michael; Gunnels, John; Brokenshire, Daniel; Benton, Brad. Affiliations: IBM Corp, Austin, TX 78758, USA; IBM Corp, Yorktown Hts, NY 10598, USA

ISBN: (print) 9781605583976

A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell (TM) 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.

Keywords: Algorithms; Performance; Design; Accelerators; Hybrid Programming Models

Effective Performance Measurement and Analysis of Multithreaded Applications

14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Tallent, Nathan R.; Mellor-Crummey, John M. Affiliation: Rice Univ, Houston, TX 77251, USA

ISBN: (print) 9781605583976

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead, that is, when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCTOOLKIT performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.

Keywords: Performance; Measurement; Algorithms; Performance Analysis; Call Path Profiling; Multithreaded Programming Models; HPCTOOLKIT

Formal Verification of Practical MPI Programs

14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Vo, Anh; Vakkalanka, Sarvani; DeLisi, Michael; Gopalakrishnan, Ganesh; Kirby, Robert M.; Thakur, Rajeev. Affiliations: Univ Utah, Sch Comp, Salt Lake City, UT 84112, USA; Argonne Natl Lab, Div Math & Comp Sci, Argonne, IL 60439, USA

ISBN: (print) 9781605583976

This paper considers the problem of formal verification of MPI programs operating under a fixed test harness for safety properties without building verification models. In our approach, we directly model-check the MPI/C source code, executing its interleavings with the help of a verification scheduler. Unfortunately, the total feasible number of interleavings is exponential, and impractical to examine even for our modest goals. Our earlier publications formalized and implemented a partial order reduction approach that avoided exploring equivalent interleavings, and presented a verification tool called ISP. This paper presents algorithmic and engineering innovations to ISP, including the use of OpenMP parallelization, that now enable it to handle practical MPI programs, including: (i) ParMETIS - a widely used hypergraph partitioner, and (ii) MADRE - a Memory Aware Data Re-distribution Engine, both developed outside our group. Over these benchmarks, ISP has automatically verified up to 14K lines of MPI/C code, producing error traces of deadlocks and assertion violations within seconds.

Keywords: Verification; MPI; Message Passing Interface; distributed programming; model checking; dynamic partial order reduction

PPoPP'08 - Proceedings of the 2008 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'08

ISBN: (print) 9781595939609

The proceedings contain 42 papers. The topics discussed include: automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories; type inference for locality analysis of distributed data structures; quasi-static scheduling for safe futures; scalable packet classification using interpreting: a cross-platform multi-core solution; FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue; matrix product on heterogeneous master-worker platforms; high performance dense linear algebra on a spatially distributed processor; optimization principles and application performance evaluation of a multithreaded GPU using CUDA; a case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding; programming with tiles; design and implementation of a high-performance MPI for C# and the common language infrastructure; and a portable runtime interface for multi-level memory hierarchies.

Architectural Support for Cilk Computations on Many-core Architectures

ACM SIGPLAN Notices, 2009, vol. 44, no. 4, pp. 285-286

Authors: Long, Guoping; Fan, Dongrui; Zhang, Junchao. Affiliation: Chinese Acad Sci, Inst Comp Technol, Key Lab Comp Syst & Architecture, Beijing 100864, Peoples R China

Exploiting Global Optimizations for OpenMP Programs in the OpenUH Compiler

ACM SIGPLAN Notices, 2009, vol. 44, no. 4, pp. 289-290

Authors: Huang, Lei; Eachempati, Deepak; Hervey, Marcus W.; Chapman, Barbara. Affiliation: Univ Houston, Dept Comp Sci, Houston, TX 77004, USA

The advent of new parallel architectures has increased the need for parallel optimizing compilers to assist developers in creating efficient code. OpenUH is a state-of-the-art optimizing compiler, but it only performs a limited set of optimizations for OpenMP programs due to its conservative assumptions of shared memory programming. These limitations may prevent some OpenMP applications from being optimized to the same extent as their sequential counterparts. This paper describes our design and implementation of a parallel data flow framework, consisting of a Parallel Control Flow Graph (PCFG) and a Parallel SSA (PSSA) representation in OpenUH, to model data flow for OpenMP programs. This framework enables the OpenUH compiler to perform all classical scalar optimizations for OpenMP programs, in addition to conducting OpenMP-specific optimizations.

Keywords: Language; Performance; Theory; Compiler Analysis; OpenMP; Parallel SSA

Stack-Based Parallel Recursion on Graphics Processors

ACM SIGPLAN Notices, 2009, vol. 44, no. 4, pp. 299-300

Authors: Yang, Ke; He, Bingsheng; Luo, Qiong; Sander, Pedro V.; Shi, Jiaoying. Affiliation: Zhejiang Univ, Hangzhou, Zhejiang, Peoples R China

Recent research has shown promising results on using graphics processing units (GPUs) to accelerate general-purpose computation. However, today's GPUs do not support recursive functions. As a result, for inherently recursive algorithms such as tree traversal, GPU programmers need to explicitly use stacks to emulate the recursion. Parallelizing such stack-based implementations on the GPU increases the programming difficulty; moreover, it is unclear how to improve the efficiency of such parallel implementations. As a first step to address both ease of programming and efficiency issues, we propose three parallel stack implementation alternatives that differ in the granularity of stack sharing. Taking tree traversals as an example, we study the performance tradeoffs between these alternatives and analyze their behaviors in various situations. Our results could be useful to both GPU programmers and GPU compiler writers.

Keywords: Algorithms; Languages; Stack; Parallel Recursion; Graphics Processors
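
The abstract above notes that recursive algorithms such as tree traversal must be emulated with an explicit stack when recursion is unavailable. The following is a minimal sequential sketch of that idea only, written in Python on the CPU; it is not the paper's GPU code and does not model any of its three stack-sharing alternatives. The `Node` type and both function names are hypothetical, introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Binary tree node (hypothetical type used only for this illustration)."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder_recursive(root: Optional[Node]) -> List[int]:
    """Reference version: ordinary recursive pre-order traversal."""
    if root is None:
        return []
    return [root.value] + preorder_recursive(root.left) + preorder_recursive(root.right)


def preorder_with_stack(root: Optional[Node]) -> List[int]:
    """Same traversal with the call stack replaced by an explicit stack."""
    visited: List[int] = []
    stack: List[Node] = [] if root is None else [root]
    while stack:
        node = stack.pop()
        visited.append(node.value)
        # Push the right child first so the left child is popped (visited) first.
        if node.right is not None:
            stack.append(node.right)
        if node.left is not None:
            stack.append(node.left)
    return visited


tree = Node(1, Node(2, Node(4), Node(5)), Node(3))
print(preorder_with_stack(tree))  # [1, 2, 4, 5, 3]
```

Both functions visit the nodes in the same order; the explicit version simply makes the pending work visible as data, which is the part a GPU implementation would then have to place and share among threads.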

Preliminary Results on NB-FEB, a Synchronization Primitive for Parallel Programming

ACM SIGPLAN Notices, 2009, vol. 44, no. 4, pp. 295-296

Authors: Ha, Phuong Hoai; Tsigas, Philippas; Anshus, Otto J. Affiliations: Univ Tromso, N-9001 Tromso, Norway; Chalmers Univ Technol, Gothenburg, Sweden

We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).

Keywords: Algorithms; Reliability; Theory; many-core architectures; non-blocking synchronization; full/empty bit; universal primitives; combinability

Application-Aware Management of Parallel Simulation Collections

ACM SIGPLAN Notices, 2009, vol. 44, no. 4, pp. 35-44

Authors: Yau, Siu-Man; Karamcheti, Vijay; Damevski, Kostadin; Parker, Steven G.; Zorin, Denis. Affiliations: NYU, Courant Inst Math Sci, New York, NY 10003, USA; Univ Utah, Dept Comp Sci, Salt Lake City, UT 84112, USA

This paper presents a system deployed on parallel clusters to manage a collection of parallel simulations that make up a computational study. It explores how such a system can extend traditional parallel job scheduling and resource allocation techniques to incorporate knowledge specific to the study. Using a UINTAH-based helium gas simulation code (ARCHES) and the SimX system for multi-experiment computational studies, this paper demonstrates that, by using application-specific knowledge in resource allocation and scheduling decisions, one can reduce the run time of a computational study from over 20 hours to under 4.5 hours on a 32-processor cluster, and from almost 11 hours to just over 3.5 hours on a 64-processor cluster.

Keywords: Design; Performance; Parallel System; High-throughput computing
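
As a rough reading of the run times quoted in the abstract above, the implied speedups can be worked out directly. This is a back-of-the-envelope sketch in Python: the inputs are the rounded figures from the abstract ("over 20", "under 4.5", "almost 11", "just over 3.5" hours), not measured values, and the `speedup` helper is a hypothetical name used only here.

```python
def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Ratio of the original run time to the improved run time."""
    return baseline_hours / improved_hours


# Rounded figures taken from the abstract; the true values are only bounded by these.
s32 = speedup(20.0, 4.5)   # 32-processor cluster
s64 = speedup(11.0, 3.5)   # 64-processor cluster

print(f"32 processors: about {s32:.1f}x")  # about 4.4x
print(f"64 processors: about {s64:.1f}x")  # about 3.1x
```

Because the 32-processor study took more than 20 hours before and less than 4.5 hours after, 4.4x is a lower bound there; for the 64-processor case the wording points the other way, so 3.1x is an upper bound.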
