Recent proposals for multithreaded architectures allow threads with unknown dependences to execute speculatively in parallel. These architectures use hardware speculative storage to buffer uncertain data, track data dependences, and roll back incorrect executions. Because all memory references access the speculative storage, current proposals implement this storage using small memory structures for fast access. The limited capacity of the speculative storage causes considerable performance loss due to speculative storage overflow whenever a thread's speculative state exceeds the storage capacity. Larger threads exacerbate the overflow problem but are preferable to smaller threads, as larger threads uncover more parallelism. In this paper, we identify a new program property called memory reference idempotency. Idempotent references need not be tracked in the speculative storage and can instead directly access non-speculative storage (i.e., the conventional memory hierarchy), reducing the demand for speculative storage space. We define a formal framework for reference idempotency and present a novel compiler-assisted speculative execution model. We prove the necessary and sufficient conditions for reference idempotency under our model. We present a compiler algorithm to label idempotent memory references for the hardware. Experimental results show that, for our benchmarks, over 60% of the references in non-parallelizable program sections are idempotent.
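A minimal sketch of the idea behind idempotency labeling, under a simplifying assumption that is ours, not the paper's: treat a read as idempotent when no speculative thread may write its address, so only the remaining reads consume speculative-buffer capacity. The function and address names are hypothetical.

```python
# Hypothetical sketch: classify read addresses so that idempotent ones
# bypass the speculative buffer, assuming (simplification) a reference
# is idempotent when no speculative thread may write its address.

def classify_references(reads, spec_writes):
    """Split reads into idempotent (go straight to non-speculative
    memory) and tracked (must occupy speculative storage)."""
    idempotent = [a for a in reads if a not in spec_writes]
    tracked = [a for a in reads if a in spec_writes]
    return idempotent, tracked

reads = ["a", "b", "c", "d"]
spec_writes = {"b"}          # addresses some speculative thread may write
idem, tracked = classify_references(reads, spec_writes)
print(idem)     # addresses that bypass speculative storage
print(tracked)  # addresses that consume buffer capacity
```

The paper's actual conditions are proved within its formal execution model; this sketch only illustrates why labeling a large idempotent fraction shrinks buffer demand.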
The design of non-trace-based performance instrumentation techniques for threaded programs is investigated to provide detailed performance data while keeping instrumentation costs under control. The design is based on low-contention data structures. Paradyn's dynamic instrumentation is extended to handle threaded programs. To associate data with individual threads, all threads share the same instrumentation code, and each thread is assigned its own private copy of performance counters or timers. The asynchrony in a threaded program poses a major challenge to dynamic instrumentation.
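The shared-code/private-counters pattern can be sketched as follows; this is a generic illustration of the low-contention idea, not Paradyn's implementation, and all names are hypothetical. Each thread increments its own counter dictionary, so the hot path takes no lock.

```python
import threading

# Sketch: shared instrumentation code, per-thread private counters.
# Only thread registration (once per thread) takes a lock; the
# per-event increment touches thread-local state only.

class PerThreadCounters:
    def __init__(self):
        self._local = threading.local()
        self._all = []
        self._reg_lock = threading.Lock()

    def incr(self, key):
        counters = getattr(self._local, "counters", None)
        if counters is None:                 # first event on this thread
            counters = self._local.counters = {}
            with self._reg_lock:
                self._all.append(counters)
        counters[key] = counters.get(key, 0) + 1   # contention-free

    def totals(self):
        out = {}
        with self._reg_lock:
            for c in self._all:
                for k, v in c.items():
                    out[k] = out.get(k, 0) + v
        return out

pc = PerThreadCounters()
threads = [threading.Thread(target=lambda: [pc.incr("calls") for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(pc.totals())   # {'calls': 4000}
```

Aggregation across threads happens only when the data is read, which is how the design keeps instrumentation cost off the measured code path.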
The efficiency of HPF with respect to irregular applications is still largely unproven. While recent work has shown that a highly irregular hierarchical n-body force calculation method can be implemented in HPF, we have found that the implementation contains inefficiencies which cause it to run up to a factor of three slower than our hand-coded, explicitly parallel implementation. Our work examines these inefficiencies, determines that most of the extra overhead is due to a single aspect of the communication strategy, and demonstrates that fixing the communication strategy brings the overheads of the HPF application to within 25% of those of the hand-coded version.
In comparison to automatic parallelization, which is thoroughly studied in the literature, classical analyses and optimizations of explicitly parallel programs have been more or less neglected. This may be due to the fact that naive adaptations of the sequential techniques fail, while their straightforward correct counterparts have unacceptable costs caused by the interleavings that manifest the possible executions of a parallel program. Recently, however, we showed that unidirectional bitvector analyses can be performed for parallel programs as easily and as efficiently as for sequential ones, a necessary condition for the successful transfer of the classical optimizations to the parallel setting. In this article we focus on possible subsequent code motion transformations, which turn out to require much more care than originally conjectured. Essentially, this is because interleaving semantics, although adequate for correctness considerations, fails when it comes to reasoning about the efficiency of parallel programs. This deficiency, however, can be overcome by strengthening the specific treatment of synchronization points.
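For readers unfamiliar with the sequential baseline the article builds on, a unidirectional bitvector analysis can be sketched as a fixed-point iteration over a control-flow graph; the example below is classical liveness on a toy CFG of our own making, not the article's parallel formulation.

```python
# Toy unidirectional bitvector analysis (liveness) over a small CFG.
# live_in[b] = gen[b] | (live_out[b] - kill[b]); live_out is the union
# of the successors' live_in sets. Iterate to a fixed point.

def liveness(cfg, gen, kill):
    live_in = {b: frozenset() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            out = (frozenset().union(*(live_in[s] for s in cfg[b]))
                   if cfg[b] else frozenset())
            new_in = gen[b] | (out - kill[b])
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in

cfg  = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
gen  = {"entry": set(), "loop": {"x"}, "exit": {"y"}}   # uses before defs
kill = {"entry": {"x"}, "loop": set(), "exit": set()}   # definitions
result = liveness(cfg, gen, kill)
print(result)
```

The article's contribution is precisely that this style of analysis carries over to explicitly parallel programs without enumerating interleavings, which the naive adaptation would require.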
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct-execution-driven, parallel simulator for performance prediction of programs, including communication- and I/O-intensive applications. The simulator has been used to predict the performance of such applications on both distributed-memory machines like the IBM SP and shared-memory machines like the SGI Origin 2000. The paper illustrates the usefulness of COMPASS as a versatile performance prediction tool. We use both real-world applications and synthetic benchmarks to study application scalability, sensitivity to communication latency, and the interplay between factors like communication pattern and parallel file system caching on application performance. We also show that the simulator is accurate in its predictions and efficient in its use of parallel simulation to reduce its own execution time, which in some cases has yielded near-linear speedup.
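The discrete-event core that a simulator of this kind rests on can be sketched in a few lines; this is a generic event loop, not COMPASS's actual API, and the latency values are illustrative.

```python
import heapq

# Minimal discrete-event simulation core: a priority queue of
# (time, seq, action) events, popped in simulated-time order.

class Simulator:
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0      # tie-breaker so actions are never compared

    def schedule(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()

sim = Simulator()
log = []
# Model a message arrival with 5 time-unit latency and a local
# compute step finishing at time 1; order follows simulated time.
sim.schedule(5.0, lambda: log.append(("recv", sim.now)))
sim.schedule(1.0, lambda: log.append(("compute", sim.now)))
sim.run()
print(log)
```

In a direct-execution simulator, the "actions" are segments of the real application run natively, with only the communication and I/O events routed through a queue like this one.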
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs, because traditional compilers for sequential programs do not account for updates to a shared variable by different threads. We present a concurrent static single assignment (CSSA) form for parallel programs containing cobegin/coend and parallel do constructs and post/wait synchronization primitives. Based on the CSSA form, we present copy propagation and dead code elimination techniques, as well as a global value numbering technique that detects equivalent variables in parallel programs. Using global value numbering and the CSSA form, we extend classical common subexpression elimination, redundant load/store elimination, and loop invariant detection to parallel programs without violating sequential consistency. These are among the most commonly used optimization techniques for sequential programs. By extending them to parallel programs, we can guarantee the correctness of the optimized program and maintain single-processor performance in a multiprocessor environment.
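As background for the extension described above, the sequential value-numbering technique can be sketched on straight-line three-address code; the CSSA-specific handling of concurrent definitions (π-functions) is omitted, and the instruction format here is our own.

```python
# Toy value numbering over straight-line three-address code: two
# expressions get the same value number iff they apply the same
# operator to operands with equal value numbers.

def value_number(instrs):
    """instrs: list of (dest, op, arg1, arg2). Returns dest -> value number."""
    numbers = {}      # variable -> value number
    table = {}        # (op, vn1, vn2) -> value number
    next_vn = [0]

    def vn_of(x):
        if x not in numbers:
            numbers[x] = next_vn[0]; next_vn[0] += 1
        return numbers[x]

    for dest, op, a, b in instrs:
        key = (op,) + tuple(sorted((vn_of(a), vn_of(b))))  # commutative ops
        if key not in table:
            table[key] = next_vn[0]; next_vn[0] += 1
        numbers[dest] = table[key]
    return numbers

code = [("t1", "+", "x", "y"),
        ("t2", "+", "y", "x"),   # same value as t1: common subexpression
        ("t3", "+", "x", "t1")]
vn = value_number(code)
print(vn["t1"] == vn["t2"])  # True: t2 is redundant
print(vn["t1"] == vn["t3"])  # False
```

The paper's point is that such equivalences remain valid in a parallel program only when the analysis also accounts for interleaved updates to shared variables, which is what the CSSA form makes explicit.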
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides programmers through the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of each program and privatizing a few variables.
An implementation scheme for fine-grain multithreading that requires no changes to current calling standards for sequential languages and only modest extensions to sequential compilers is described. Like previous similar systems, it performs an asynchronous call as if it were an ordinary procedure call, and detaches the callee from the caller when the callee suspends or either of them migrates to another processor. Unlike previous similar systems, it detaches and connects arbitrary frames generated by off-the-shelf sequential compilers obeying calling standards. As a consequence, it requires neither a front-end preprocessor nor a native code generator with a built-in notion of parallelism. The system works in practice with the unmodified GNU C compiler (GCC). Desirable extensions to sequential compilers for guaranteeing the portability and correctness of the scheme are clarified and argued to be modest. Experiments indicate that sequential performance is not sacrificed for practical applications and that both sequential and parallel performance are comparable to Cilk, whose current implementation requires a fairly sophisticated preprocessor for C. These results show that efficient asynchronous calls (a.k.a. future calls) can be integrated into current calling standards with very small impact on both sequential performance and compiler engineering.
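The future-call style the scheme implements at the stack-frame level can be illustrated with a high-level stand-in; Python's executor here replaces the compiler/runtime mechanism the paper describes, so this shows only the programming model, not the implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of a future call: the asynchronous call proceeds like an
# ordinary one, and the caller blocks only if it touches the result
# before the callee has finished.

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(fib, 20)     # asynchronous call
    local = fib(10)                # caller keeps working meanwhile
    total = local + fut.result()   # synchronize only at first use
print(total)                       # 6820
```

In the paper's scheme the common case pays only an ordinary call; the detach machinery runs only when the callee actually suspends or a frame migrates.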
A central problem in executing performance-critical parallel and distributed applications on shared networks is the selection of computation nodes and communication paths for execution. Automatic selection of nodes is complex, as the best choice depends on the application structure as well as the expected availability of computation and communication resources. This paper presents a solution to this problem for realistic application and network scenarios. A new algorithm to jointly analyze computation and communication resources for different application demands is introduced, and a framework for automatic node selection is developed on top of Remos, a query interface to network information. The paper reports results from a set of applications, including Airshed pollution modeling and magnetic resonance imaging, executing on a high-speed network testbed. The results demonstrate that node selection is effective in enhancing application performance in the presence of computation load as well as network traffic. Under the network conditions used for the experiments, the increase in execution time due to compute loads and network congestion was reduced by half with node selection. The node selection algorithms developed in this research are also applicable to dynamic migration of long-running jobs.
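A hypothetical sketch of the joint computation/communication reasoning, not the paper's algorithm: rank candidate nodes by a predicted time combining compute demand over available CPU with data volume over available bandwidth. All node names, numbers, and the cost formula are illustrative.

```python
# Hypothetical node-selection sketch: score each candidate by
# predicted time = work / available-CPU + data / available-bandwidth,
# then pick the k best. Resource numbers would come from a query
# interface such as Remos in the paper's framework.

def predicted_time(node, work, data, cpu_avail, bw_avail):
    return work / cpu_avail[node] + data / bw_avail[node]

def select_nodes(candidates, k, work, data, cpu_avail, bw_avail):
    ranked = sorted(candidates,
                    key=lambda n: predicted_time(n, work, data,
                                                 cpu_avail, bw_avail))
    return ranked[:k]

cpu_avail = {"n1": 1.0, "n2": 0.5, "n3": 0.9}     # fraction of CPU free
bw_avail  = {"n1": 10.0, "n2": 100.0, "n3": 2.0}  # MB/s to the cluster
chosen = select_nodes(["n1", "n2", "n3"], 2, work=4.0, data=20.0,
                      cpu_avail=cpu_avail, bw_avail=bw_avail)
print(chosen)   # ['n1', 'n2']: n3's thin link outweighs its free CPU
```

Even this crude model captures the paper's observation that the best node depends jointly on compute load and network traffic: a lightly loaded node behind a congested link can lose to a busier but well-connected one.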
This paper develops a highly accurate LogGP model of a complex wavefront application that uses MPI communication on the IBM SP/2. Key features of the model include: (1) elucidation of the principal wavefront synchronization structure, and (2) explicit high-fidelity models of the MPI-send and MPI-receive primitives. The MPI-send/receive models are used to derive L, o, and G from simple two-node micro-benchmarks. Other model parameters are obtained by measuring small application problem sizes on four SP nodes. Results show that the LogGP model predicts, in seconds and with a high degree of accuracy, measured application execution time for large problems running on 128 nodes. Detailed performance projections are provided for very large future processor configurations that are expected to be available to the application developers. These results indicate that scaling beyond one or two thousand nodes yields greatly diminished returns in execution time, and that synchronization delays are a principal factor limiting the scalability of the application.
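As a worked sketch of the standard LogGP cost of a single message, with L (latency), o (per-end overhead), and G (per-byte gap): an m-byte transfer costs roughly o + L + (m-1)G + o. The parameter values below are illustrative, not the paper's measured SP/2 values.

```python
# LogGP cost of one m-byte message: sender overhead, wire latency,
# per-byte gap for the remaining bytes, receiver overhead.

def loggp_msg_time(m, L, o, G):
    return o + L + (m - 1) * G + o

L, o, G = 10e-6, 5e-6, 0.01e-6   # seconds: latency, overhead, gap/byte
t = loggp_msg_time(1024, L, o, G)
print(round(t * 1e6, 2))          # message time in microseconds
```

The paper's contribution is fitting such parameters from two-node micro-benchmarks of the actual MPI-send/receive primitives, then composing them with the wavefront synchronization structure to predict whole-application run time.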