An implementation scheme of fine-grain multithreading that needs no changes to current calling standards for sequential languages and only modest extensions to sequential compilers is described. Like previous similar systems, it performs an asynchronous call as if it were an ordinary procedure call, and detaches the callee from the caller when the callee suspends or either of them migrates to another processor. Unlike previous similar systems, it detaches and connects arbitrary frames generated by off-the-shelf sequential compilers obeying calling standards. As a consequence, it requires neither a front-end preprocessor nor a native code generator with a built-in notion of parallelism; the system works in practice with the unmodified GNU C compiler (GCC). Desirable extensions to sequential compilers for guaranteeing the portability and correctness of the scheme are clarified and argued to be modest. Experiments indicate that sequential performance is not sacrificed for practical applications, and that both sequential and parallel performance are comparable to Cilk, whose current implementation requires a fairly sophisticated C preprocessor. These results show that efficient asynchronous calls (a.k.a. future calls) can be integrated into current calling standards with very small impact on both sequential performance and compiler engineering.
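As a concrete illustration of what an asynchronous (future) call buys, the sketch below expresses the idea with plain POSIX threads. This is not the paper's mechanism, which lazily detaches ordinary compiler-generated stack frames rather than creating a thread per call; the example only shows the caller continuing while the callee runs and later touching the result.

```c
/* Minimal future-call sketch using POSIX threads, for illustration only. */
#include <pthread.h>
#include <stdio.h>

typedef struct { long n; long result; } fib_task;

static long fib_seq(long n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

/* Worker executed asynchronously: the "callee" of a future call. */
static void *fib_async(void *arg) {
    fib_task *t = (fib_task *)arg;
    t->result = fib_seq(t->n);
    return NULL;
}

int main(void) {
    fib_task t = { .n = 30, .result = 0 };
    pthread_t child;

    /* Asynchronous call: the caller continues while the callee runs. */
    pthread_create(&child, NULL, fib_async, &t);
    long other = fib_seq(29);        /* caller's own work */
    pthread_join(child, NULL);       /* touch the future: wait for the result */

    printf("fib(30)+fib(29) = %ld\n", t.result + other);
    return 0;
}
```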
A central problem in executing performance-critical parallel and distributed applications on shared networks is the selection of computation nodes and communication paths for execution. Automatic selection of nodes is complex, as the best choice depends on the application structure as well as the expected availability of computation and communication resources. This paper presents a solution to this problem for realistic application and network scenarios. A new algorithm to jointly analyze computation and communication resources for different application demands is introduced, and a framework for automatic node selection is developed on top of Remos, a query interface to network information. The paper reports results from a set of applications, including Airshed pollution modeling and magnetic resonance imaging, executing on a high-speed network testbed. The results demonstrate that node selection is effective in enhancing application performance in the presence of computation load as well as network traffic. Under the network conditions used for the experiments, the increase in execution time due to compute loads and network congestion was reduced by half with node selection. The node selection algorithms developed in this research are also applicable to dynamic migration of long-running jobs.
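The flavor of jointly weighing compute availability and network bandwidth when ranking candidate nodes can be sketched as below. The struct fields, the per-node demand figures, and the simple additive cost formula are illustrative assumptions, not the paper's algorithm or the Remos API.

```c
/* Illustrative sketch: rank candidate nodes by a combined compute + communication estimate. */
#include <stdio.h>

typedef struct {
    const char *name;
    double cpu_available;    /* fraction of a CPU free, from resource queries */
    double bandwidth_mbps;   /* available bandwidth to the rest of the nodes  */
} node_info;

/* Predicted time = compute time scaled by load + data volume / bandwidth. */
static double predicted_time(const node_info *n, double work_cpu_sec,
                             double data_mbits) {
    return work_cpu_sec / n->cpu_available + data_mbits / n->bandwidth_mbps;
}

int main(void) {
    node_info nodes[] = {
        { "hostA", 0.9, 80.0 },
        { "hostB", 0.4, 95.0 },
        { "hostC", 1.0, 10.0 },
    };
    double work = 120.0, data = 400.0;   /* assumed per-node demand */
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (predicted_time(&nodes[i], work, data) <
            predicted_time(&nodes[best], work, data))
            best = i;
    printf("select %s (%.1f s predicted)\n", nodes[best].name,
           predicted_time(&nodes[best], work, data));
    return 0;
}
```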
This paper develops a highly accurate LogGP model of a complex wavefront application that uses MPI communication on the IBM SP/2. Key features of the model include: (1) elucidation of the principal wavefront synchronization structure, and (2) explicit high-fidelity models of the MPI-send and MPI-receive primitives. The MPI-send/receive models are used to derive L, o, and G from simple two-node micro-benchmarks. Other model parameters are obtained by measuring small application problem sizes on four SP nodes. Results show that the LogGP model predicts, in seconds and with a high degree of accuracy, measured application execution time for large problems running on 128 nodes. Detailed performance projections are provided for very large future processor configurations that are expected to be available to the application developers. These results indicate that scaling beyond one or two thousand nodes yields greatly diminished returns in execution time, and that synchronization delays are a principal factor limiting the scalability of the application.
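For reference, the textbook LogGP estimate for a single k-byte point-to-point message is roughly o_send + L + (k - 1)G + o_recv. The sketch below evaluates that expression with made-up parameter values; the paper's MPI-send/receive models are considerably higher fidelity than this.

```c
/* Back-of-the-envelope LogGP estimate for one k-byte message.
 * Parameter values below are invented for illustration only. */
#include <stdio.h>

int main(void) {
    double L = 40e-6;       /* network latency (s)                         */
    double o = 10e-6;       /* per-message CPU overhead (s), send and recv */
    double G = 0.02e-6;     /* gap per byte (s/byte), i.e. 1/bandwidth     */
    long   k = 64 * 1024;   /* message size in bytes                       */

    double t = o + L + (k - 1) * G + o;
    printf("predicted one-way time for %ld bytes: %.1f us\n", k, t * 1e6);
    return 0;
}
```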
Divide and conquer algorithms are a good match for modern parallel machines: they tend to have large amounts of inherent parallelism and they work well with caches and deep memory hierarchies. But these algorithms pose challenging problems for parallelizing compilers. They are usually coded as recursive procedures and often use pointers into dynamically allocated memory blocks and pointer arithmetic. All of these features are incompatible with the analysis algorithms in traditional parallelizing compilers. This paper presents the design and implementation of a compiler designed to parallelize divide and conquer algorithms whose subproblems access disjoint regions of dynamically allocated arrays. The foundation of the compiler is a flow-sensitive, context-sensitive, and interprocedural pointer analysis algorithm. A range of symbolic analysis algorithms build on the pointer analysis information to extract symbolic bounds for the memory regions accessed by (potentially recursive) procedures that use pointers and pointer arithmetic. The symbolic bounds information allows the compiler to find procedure calls that can execute in parallel without violating the data dependences. The compiler generates code that executes these calls in parallel. We have used the compiler to parallelize several programs that use divide and conquer algorithms. Our results show that the programs perform well and exhibit good speedup.
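The sketch below shows the kind of code such a compiler targets: a recursive procedure whose two subcalls touch disjoint halves of a dynamically allocated array through pointer arithmetic, which is the property the symbolic bounds analysis must establish before the calls can run in parallel. The example itself is sequential and illustrative, not output of the compiler.

```c
/* Recursive sum over disjoint halves of a malloc'd array, via pointer arithmetic. */
#include <stdio.h>
#include <stdlib.h>

static double sum(double *a, long n) {
    if (n <= 1024) {                      /* small base case stays sequential */
        double s = 0.0;
        for (long i = 0; i < n; i++) s += a[i];
        return s;
    }
    long half = n / 2;
    /* The regions [a, a+half) and [a+half, a+n) are disjoint; proving this
     * symbolically is what lets the two calls execute in parallel. */
    double left  = sum(a, half);
    double right = sum(a + half, n - half);
    return left + right;
}

int main(void) {
    long n = 1 << 20;
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++) a[i] = 1.0;
    printf("sum = %.0f\n", sum(a, n));
    free(a);
    return 0;
}
```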
This paper presents an evaluation of a new analysis for parallelizing compilers called predicated array data-flow analysis. This analysis extends array data-flow analysis for parallelization and privatization to associate predicates with data-flow values. These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation. As compared to previous work that combines predicates with array data-flow analysis, our approach is distinguished by two features: (1) it derives low-cost, run-time parallelization tests; and (2) it incorporates predicate embedding and predicate extraction, which translate between the domain of predicates and data-flow values to derive more precise analysis results. We present extensive experimental results across three benchmark suites and one additional program, demonstrating that predicated array data-flow analysis parallelizes more than 40% of the remaining inherently parallel loops left unparallelized by the SUIF compiler and that it yields improved speedups for five programs.
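A minimal illustration of such a run-time test, using OpenMP purely for exposition (the paper's compiler is SUIF-based and the derived predicates are more general): the loop below carries a dependence only when the written region a[m..m+n) overlaps the read region a[0..n), so the condition m >= n guards a safe parallel version while the original sequential loop remains the fallback.

```c
/* Run-time test guarding a parallelized version of a possibly dependent loop. */
#include <stdio.h>

void shift_add(double *a, long n, long m) {
    if (m >= n) {                     /* predicate derived at compile time */
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i + m] = a[i] + 1.0;
    } else {                          /* fall back to the sequential version */
        for (long i = 0; i < n; i++)
            a[i + m] = a[i] + 1.0;
    }
}

int main(void) {
    double a[16] = { 0 };
    shift_add(a, 8, 8);               /* regions disjoint: parallel branch runs */
    printf("a[8] = %.1f\n", a[8]);
    return 0;
}
```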
MPI is a message-passing standard widely used for developing high-performance parallel applications. Because of the restrictions of the MPI computation model, conventional implementations on shared memory machines map each MPI node to an OS process, which suffers serious performance degradation in the presence of multiprogramming, especially when a space/time sharing policy is employed in OS job scheduling. In this paper, we study compile-time and run-time support for MPI by using threads and demonstrate our optimization techniques for executing a large class of MPI programs written in C. The compile-time transformation adopts thread-specific data structures to eliminate the use of global and static variables in C code. The run-time support includes an efficient point-to-point communication protocol based on a novel lock-free queue management scheme. Our experiments on an SGI Origin 2000 show that our MPI prototype, called TMPI, using the proposed techniques is competitive with SGI's native MPI implementation in a dedicated environment, and that it has significant performance advantages, with up to a 23-fold improvement, in a multiprogrammed environment.
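The compile-time transformation can be pictured with the standard thread-specific-data idiom, sketched below with pthread keys; the variable name and wrapper function are hypothetical, and TMPI's actual source-to-source transformation and run-time system differ in detail. The point is that a global in the original MPI program becomes per-thread state once each MPI node is mapped to a thread.

```c
/* Sketch: turning a per-process global into per-thread state with pthread keys. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Original program:  int rank_calls = 0;  (one copy per MPI process) */
static pthread_key_t rank_calls_key;

static int *rank_calls(void) {          /* per-thread replacement for the global */
    int *p = pthread_getspecific(rank_calls_key);
    if (!p) {
        p = calloc(1, sizeof *p);
        pthread_setspecific(rank_calls_key, p);
    }
    return p;
}

static void *mpi_node(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++)
        (*rank_calls())++;              /* was: rank_calls++ */
    printf("this node made %d calls\n", *rank_calls());
    return NULL;
}

int main(void) {
    pthread_key_create(&rank_calls_key, free);
    pthread_t t[4];                     /* four "MPI nodes" as threads */
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, mpi_node, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}
```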
Java offers interesting opportunities for parallel computing. In particular, Java Remote Method Invocation provides an unusually flexible kind of Remote Procedure Call. Unlike RPC, RMI supports polymorphism, which requires the system to be able to download remote classes into a running application. Sun's RMI implementation achieves this kind of flexibility by passing around object type information and processing it at run time, which causes a major run-time overhead. Using Sun's JDK 1.1.4 on a Pentium Pro/Myrinet cluster, for example, the latency for a null RMI (without parameters or a return value) is 1228 μsec, which is about a factor of 40 higher than that of a user-level RPC. In this paper, we study an alternative approach for implementing RMI based on native compilation. This approach allows for better optimization, eliminates the need for processing type information at run time, and makes a lightweight communication protocol possible. We have built a Java system based on a native compiler, which supports both compile-time and run-time generation of marshallers. We find that almost all of the run-time overhead of RMI can be pushed to compile time. With this approach, the latency of a null RMI is reduced to 34 μsec, while still supporting polymorphic RMIs (and allowing interoperability with other JVMs).
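The core idea of generating marshallers at compile time rather than inspecting type information at run time is language-neutral. The C sketch below flattens one statically known argument type into a byte buffer, as an assumed, simplified stand-in for the Java marshallers the system actually generates; it is not the system's code or wire format.

```c
/* Simplified stand-in for a compile-time-generated marshaller:
 * no run-time type inspection, just a fixed layout for a known type. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef struct { int32_t id; double value; } sample;   /* remote-call argument */

static size_t marshal_sample(const sample *s, unsigned char *buf) {
    memcpy(buf, &s->id, sizeof s->id);
    memcpy(buf + sizeof s->id, &s->value, sizeof s->value);
    return sizeof s->id + sizeof s->value;
}

int main(void) {
    sample s = { 7, 3.14 };
    unsigned char wire[64];
    size_t len = marshal_sample(&s, wire);
    printf("marshalled %zu bytes\n", len);
    return 0;
}
```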
Realistic interactive multimedia involving vision, animation, and multimedia collaboration is likely to become an important aspect of future computer applications. The scalable parallelism inherent in such applications, coupled with their computational demands, makes them ideal candidates for SMPs and clusters of SMPs. These applications have novel requirements that offer new kinds of challenges for parallel system design. We have designed a programming system called Stampede that offers many of the facilities needed to simplify the development of such applications (such as high-level data sharing abstractions, dynamic cluster-wide threads, and multiple address spaces). We have built Stampede and it runs on clusters of SMPs. To date we have implemented two applications on Stampede, one of which is discussed herein. In this paper we describe a part of Stampede called Space-Time Memory (STM). It is a novel data sharing abstraction that enables interactive multimedia applications to manage a collection of time-sequenced data items simply, efficiently, and transparently across a cluster. STM relieves the application programmer from low-level synchronization and data communication by providing a high-level interface that subsumes buffer management, inter-thread synchronization, and location transparency for data produced and accessed anywhere in the cluster. STM also automatically handles garbage collection of data items that will no longer be accessed by any of the application threads. We discuss ease-of-use issues for developing applications using STM, and present preliminary performance results to show that STM's overhead is low.
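A toy, single-process sketch of a Space-Time-Memory-like channel appears below: items are indexed by virtual timestamp rather than by arrival order, so a consumer can ask for "the frame at time 42" regardless of when it was produced. The names stm_put/stm_get and the fixed-size slot array are hypothetical simplifications, not Stampede's real API, which additionally handles blocking, cluster-wide access, and garbage collection of unreachable items.

```c
/* Toy timestamp-indexed channel, illustrating the STM access pattern only. */
#include <stdio.h>

#define MAX_TS 64

typedef struct { int present; double payload; } stm_item;
typedef struct { stm_item slot[MAX_TS]; } stm_channel;

static void stm_put(stm_channel *c, int ts, double v) {
    c->slot[ts].present = 1;
    c->slot[ts].payload = v;
}

static int stm_get(const stm_channel *c, int ts, double *out) {
    if (!c->slot[ts].present) return 0;   /* a real STM would block or skip */
    *out = c->slot[ts].payload;
    return 1;
}

int main(void) {
    stm_channel frames = { 0 };
    stm_put(&frames, 42, 0.5);            /* producer tags item with time 42 */
    double v;
    if (stm_get(&frames, 42, &v))         /* consumer asks for time 42 */
        printf("frame 42 -> %.1f\n", v);
    return 0;
}
```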
We present a system that allows OpenMP programs to execute on a network of workstations with a variable number of nodes. The ability to adapt to a variable number of nodes allows a program to take advantage of additional nodes that become available after it starts execution, or to gracefully scale down when the number of available nodes is reduced. We demonstrate that the cost of adaptation is modest; the system allows a program to adapt at a moderate rate without much performance loss. Two ideas underlie the efficiency of our design. First, we recognize that OpenMP programs exhibit convenient adaptation points during their execution, points at which the cost of adaptation can be much reduced. Second, by allowing a process a certain grace period before it must leave a node, we ensure that most adaptations can occur at these adaptation points, and thus at low cost. Migration of a process, a much more expensive method for providing adaptivity, is used only as a back-up solution, when the process cannot reach an adaptation point within the grace period. Our implementation consists of an OpenMP pre-processor that generates TreadMarks distributed shared memory (DSM) programs, and a version of TreadMarks modified to adapt to a variable number of nodes. Using a DSM as the underlying substrate facilitates the data (re-)distribution necessary after an adaptation.
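The notion of an adaptation point can be pictured as below: between OpenMP parallel regions all shared data is in a consistent state, so a hook (the hypothetical check_adapt stub here) can add or drop nodes cheaply at those boundaries. In the real system this happens inside the TreadMarks-based run time rather than in user code.

```c
/* Sketch of adaptation points at parallel-region boundaries. */
#include <stdio.h>

/* Hypothetical hook: a real system would consult the node manager here and
 * redistribute loop iterations if nodes joined or left. */
static void check_adapt(int step) {
    printf("adaptation point after step %d\n", step);
}

int main(void) {
    double a[1000] = { 0 };
    for (int step = 0; step < 3; step++) {
        #pragma omp parallel for          /* parallel region */
        for (int i = 0; i < 1000; i++)
            a[i] += i * 0.5;
        check_adapt(step);                /* convenient adaptation point:
                                             between parallel regions */
    }
    printf("a[999] = %.1f\n", a[999]);
    return 0;
}
```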
The proceedings contain 25 papers. Topics discussed include data and task parallelism, irregular applications, coherence protocols, shared memory, compilers, and performance issues.