The SCOOPP (Scalable Object-Oriented Parallel Programming) system efficiently adapts, at run-time, an object-oriented parallel application to any distributed-memory system. It extracts as much parallelism as possible ...
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
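To illustrate the directive-rewriting idea, the following minimal C sketch shows a parallel region bracketed by performance-interface calls in the way a source-to-source tool such as OPARI transforms it; the hook names, the region-descriptor layout, and the printf-based stubs are illustrative assumptions rather than the exact API defined in the paper (a real library for Expert or TAU would implement the hooks instead of printing).

    #include <stdio.h>
    #include <omp.h>

    /* Illustrative region descriptor carrying source-code context. */
    typedef struct { const char *file; int line; } pomp_region;

    /* Stub hooks; the names are assumed placeholders for the proposed interface.
     * A real performance library would record events here instead of printing. */
    static void pomp_parallel_fork(pomp_region *r)  { printf("fork  %s:%d\n", r->file, r->line); }
    static void pomp_parallel_begin(pomp_region *r) { printf("begin %s:%d thread %d\n", r->file, r->line, omp_get_thread_num()); }
    static void pomp_parallel_end(pomp_region *r)   { printf("end   %s:%d thread %d\n", r->file, r->line, omp_get_thread_num()); }
    static void pomp_parallel_join(pomp_region *r)  { printf("join  %s:%d\n", r->file, r->line); }

    int main(void) {
        static pomp_region r = { "example.c", 42 };

        /* The original user code was simply:  #pragma omp parallel { work(); }
         * After source-level rewriting, the region is bracketed by hooks: */
        pomp_parallel_fork(&r);
        #pragma omp parallel
        {
            pomp_parallel_begin(&r);
            printf("work on thread %d\n", omp_get_thread_num());
            pomp_parallel_end(&r);
        }
        pomp_parallel_join(&r);
        return 0;
    }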
OpenMP has become the de facto standard for shared-memory parallel programming. The directive-based nature of OpenMP allows incremental and portable development of parallel applications for a wide range of platforms. ...
ISBN (print): 3540664432
This paper presents a system to produce efficient implementations of parallel array-based algorithms from high-level specifications. It is structured as a transformation through a series of progressively more detailed representations. This allows the use of high-level programming features without losing the fine control of low-level languages. During the transformation process, parallel implementation decisions are introduced. Finally, a representation is reached which can be translated to C+MPI.
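As a rough illustration of the kind of C+MPI code such a final representation could map to, the sketch below block-distributes an element-wise array operation across processes and gathers the result; the operation, the even block distribution, and all names are assumptions for illustration, not output of the described system.

    #include <mpi.h>
    #include <stdio.h>

    #define N 16   /* global array length; assumed divisible by the process count */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local_n = N / size;
        double local[N], result[N];

        /* Each process computes its own block of f(i) = 2*i. */
        for (int i = 0; i < local_n; i++)
            local[i] = 2.0 * (rank * local_n + i);

        /* Collect the distributed blocks on rank 0. */
        MPI_Gather(local, local_n, MPI_DOUBLE,
                   result, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            for (int i = 0; i < N; i++)
                printf("result[%d] = %g\n", i, result[i]);

        MPI_Finalize();
        return 0;
    }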
We address the challenging problem of algorithm and program design for the Computational Grid by providing the application user with a set of high-level, parameterised components called skeletons. We describe a Java-based Grid programming system in which algorithms are composed of skeletons and the computational resources for executing individual skeletons are chosen using performance prediction. The advantage of our approach is that skeletons are reusable across different applications and that skeleton implementations can be tuned to particular machines. The focus of this paper is on predicting performance for Grid applications constructed using skeletons.
We demonstrate that the run time of implicitly parallel programs can be statically predicted with considerable accuracy when expressed within the constraints of a skeletal, shapely parallel programming language. Our work constitutes the first completely static system to account for both computation and communication in such a context. We present details of our language and its BSP implementation strategy together with an account of the analysis mechanism. We examine the accuracy of our predictions against the performance of real parallel programs.
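For context, a static predictor built on a BSP implementation strategy typically charges each superstep using the standard BSP cost model (a textbook formula, not quoted from the paper): if superstep s performs at most w_s local operations on any processor, exchanges at most h_s words per processor, and the machine has per-word communication cost g and barrier synchronization cost l, the predicted time over S supersteps is

    T = \sum_{s=1}^{S} ( w_s + g \cdot h_s + l )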
In this paper, we describe an undergraduate parallel programming course based upon networked workstations. The course is offered on the North Carolina Research and Education Network (NC-REN), a private telecommunications network which interconnects universities in North Carolina and provides multiway, face-to-face video and audio communications. Course materials are described and made available in a new textbook. Topics are divided into basic techniques and applications. In addition, extensive home page materials are described.
Parallel computing on interconnected workstations is becoming a viable and attractive proposition due to the rapid growth in speeds of interconnection networks and processors. In the case of workstation clusters, there is always a considerable amount of unused computing capacity available in the network. However, heterogeneity in architectures and operating systems, load variations on machines, variations in machine availability, and failure susceptibility of networks and workstations complicate the situation for the programmer. In this context, new programming paradigms that reduce the burden involved in programming for distribution, load adaptability, heterogeneity, and fault tolerance gain importance. This paper identifies the issues involved in parallel computing on a network of workstations. The Anonymous Remote Computing (ARC) paradigm is proposed to address the issues specific to parallel programming on workstation systems. ARC differs from the conventional communicating process model by treating a program as one single entity consisting of several loosely coupled remote instruction blocks instead of treating it as a collection of processes. The ARC approach results in distribution transparency and heterogeneity transparency. At the same time, it provides fault tolerance and load adaptability to parallel programs on workstations. ARC is developed in a two-tiered architecture consisting of high-level language constructs and low-level ARC primitives. The paper describes an implementation of the ARC kernel supporting ARC primitives.
The performance of a parallel simulation system depends very much on partitioning the simulation workload evenly among the set of processors in the computing environment to ensure load balance between processors. Most parallel simulation systems employ user-defined static partitioning. However, static partitioning requires in-depth domain knowledge of the specific simulation model under study. It is not effective if the workload of a simulation model cannot be quantified accurately or changes over time during a simulation run. Dynamic load-balancing allows the simulation system to automatically balance the workload of different simulation models without the user's input. In this paper, the use of dynamic load-balancing in the context of the BSP Time Warp optimistic protocol is examined. Based on the BSP cost model, a dynamic load-balancing algorithm for the BSP Time Warp protocol is developed. Using different simulation models, the paper shows that to achieve consistent performance, the dynamic load-balancing algorithm for BSP Time Warp needs to consider both computation and communication workload, as well as lookaheads between processors.
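As a purely hypothetical sketch of how such an algorithm might fold the three factors into a single per-processor load score, the C fragment below combines computation, a BSP-style communication charge, and a penalty for small lookahead; the function, its weights, and the penalty form are assumptions for illustration only, not the algorithm developed in the paper.

    /* Hypothetical per-processor load score; all weights are illustrative
     * assumptions, not taken from the paper. */
    double load_score(double events_executed,   /* computation workload     */
                      double words_sent,        /* communication volume     */
                      double g,                 /* BSP per-word comm cost   */
                      double lookahead)         /* lookahead to other LPs   */
    {
        double comm = g * words_sent;
        double la_penalty = (lookahead > 0.0) ? 1.0 / lookahead : 1e6;
        return events_executed + comm + la_penalty;   /* higher = more loaded */
    }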
In recent years, cluster computing has been widely accepted as a parallel platform because of its high performance at an affordable cost. To make the best use of cluster computing resources, a resource monitoring program is needed. The information collected can be used by any parallel application, e.g., parallel motion estimation, for handling load variation on typical time-sharing computers. Therefore, the parallel workload can be distributed properly among the n processors. In this paper, we present the development of resource monitoring for cluster computing using the MPI programming model and its application to parallel motion estimation. Results show the effectiveness of our method, with which a faster parallel execution time can be achieved.
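A minimal sketch of the underlying idea, assuming each process measures a scalar "available capacity" (faked here by a placeholder probe) that the root gathers with MPI and turns into proportional work shares; the measurement itself and the share formula are assumptions for illustration, not the paper's monitoring system.

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder for a real capacity probe (e.g. derived from CPU load). */
    static double measure_capacity(int rank) { return 1.0 + rank % 3; }

    int main(int argc, char **argv) {
        int rank, size, total_work = 1200;   /* e.g. macroblocks to estimate */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double capacity = measure_capacity(rank);
        double caps[256];                    /* assume at most 256 processes */

        /* The root collects every process's capacity measurement. */
        MPI_Gather(&capacity, 1, MPI_DOUBLE, caps, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double sum = 0.0;
            for (int i = 0; i < size; i++) sum += caps[i];
            /* Work share proportional to measured capacity. */
            for (int i = 0; i < size; i++)
                printf("process %d gets %d units\n", i,
                       (int)(total_work * caps[i] / sum));
        }
        MPI_Finalize();
        return 0;
    }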