Java's support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obt...
This framework promises new classes of service, especially in terms of security, for policy-based development of distributed and collaborative applications.
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
Adaptive applications have computational workloads and communication patterns that change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementation of such adaptive applications is therefore a challenging task. In this paper, we compare the performance of and the programming effort required for two major classes of adaptive applications under three leading parallel programming models on an SGI Origin2000 system, a machine that supports all three models efficiently. Results indicate that the three models deliver comparable performance; however, the implementations differ significantly beyond merely using explicit messages versus implicit loads/stores, even though the basic parallel algorithms are similar. Compared with the message-passing (using MPI) and SHMEM programming models, the cache-coherent shared address space (CC-SAS) model provides substantial ease of programming at both the conceptual and program orchestration levels, often accompanied by performance gains. However, CC-SAS currently has portability limitations and may suffer from poor spatial locality of physically distributed shared data on large numbers of processors. (C) 2002 Elsevier Science (USA).
The optimized handling of reductions on parallel supercomputers or clusters of workstations is critical to high performance because reductions are common in scientific codes and a potential source of bottlenecks. Yet in many high-level languages, a mechanism for writing efficient reductions remains surprisingly absent. Further, when such mechanisms do exist, they often do not provide the flexibility a programmer needs to achieve a desirable level of performance. In this paper, we present a new language construct for arbitrary reductions that lets a programmer achieve a level of performance equal to that achievable with the highly flexible but low-level combination of Fortran and MPI. We have implemented this construct in the ZPL language and evaluate it in the context of the initialization of the NAS MG benchmark. We show a 45-fold speedup over the same code written in ZPL without this construct. In addition, performance on a large number of processors surpasses that achieved in the NAS implementation, showing that our mechanism provides programmers with the needed flexibility.
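As a rough illustration of what an arbitrary (user-defined) reduction involves, the sketch below uses Python rather than ZPL. It reduces each chunk locally and then combines the partial results with the same associative operator. Each chunk stands in for the data owned by one processor. The names `min_with_index` and `reduce_chunks` are hypothetical and are not part of ZPL, Fortran, or MPI.

```python
from functools import reduce


def min_with_index(a, b):
    """Associative combine operator on (value, index) pairs.

    Keeps the pair with the smaller value; tuple comparison breaks
    ties in favour of the lower index.
    """
    return a if a <= b else b


def reduce_chunks(chunks, combine):
    """Reduce every non-empty chunk locally, then combine the partial results."""
    partials = [reduce(combine, chunk) for chunk in chunks if chunk]
    return reduce(combine, partials)


data = [7.0, 3.5, 9.1, 3.5, 8.2, 1.4, 6.6, 2.0]
pairs = [(value, index) for index, value in enumerate(data)]

# Split the pairs into chunks of three, as if three processors owned them.
chunks = [pairs[i:i + 3] for i in range(0, len(pairs), 3)]

result = reduce_chunks(chunks, min_with_index)  # minimum value and its location
```

Because the operator is associative, the chunked result matches a plain left-to-right reduction over all pairs, which is what allows the local step to run independently on each processor.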
The Zoltan library is a collection of data management services for parallel, unstructured, adaptive, and dynamic applications that is available as open-source software. It simplifies the load-balancing, data movement, unstructured-communication, and memory usage difficulties that arise in dynamic applications such as adaptive finite-element methods, particle methods, and crash simulations. Zoltan's data-structure-neutral design also lets a wide range of applications use it without imposing restrictions on application data structures. Its object-based interface provides a simple and inexpensive way for application developers to use the library and for researchers to make new capabilities available under a common interface.
In this paper, a chaos-parallel evolutionary programming algorithm is presented to solve the flow-shop scheduling problem. First, the individuals of each sub-population in the parallel evolutionary programming are located in the search space by exploiting the ergodicity of chaotic states; then each sub-population evolves independently, and the best individuals are periodically exchanged among the sub-populations. Simulation results demonstrate that the new algorithm is efficient for optimizing large-scale manufacturing processes and that it achieves better results in both computation time and optimization rate.
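The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea under several assumptions. It seeds the sub-populations with a logistic map, turns each run of chaotic values into a job order by ranking them, uses a swap mutation, and migrates the best individuals around a ring. All function names and parameter values are hypothetical, and the sub-populations are evolved one after another here rather than in parallel.

```python
import random


def makespan(perm, proc):
    """Completion time of the last job on the last machine (permutation flow shop)."""
    n_machines = len(proc[0])
    finish = [0.0] * n_machines  # finish[m]: time at which machine m becomes free
    for job in perm:
        for m in range(n_machines):
            ready = finish[m - 1] if m > 0 else 0.0  # job done on previous machine
            finish[m] = max(finish[m], ready) + proc[job][m]
    return finish[-1]


def chaotic_permutation(x, n_jobs):
    """Iterate the logistic map x <- 4x(1-x) and rank the values into a job order."""
    keys = []
    for _ in range(n_jobs):
        x = 4.0 * x * (1.0 - x)
        keys.append(x)
    return sorted(range(n_jobs), key=lambda j: keys[j]), x


def mutate(perm, rng):
    """Swap two randomly chosen positions (assumed mutation operator)."""
    child = perm[:]
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(pop, proc, rng):
    """One generation: each parent yields one mutant; the best half survives."""
    merged = pop + [mutate(p, rng) for p in pop]
    merged.sort(key=lambda p: makespan(p, proc))
    return merged[:len(pop)]


def chaos_parallel_ep(proc, n_islands=3, pop_size=6, generations=40,
                      migrate_every=10, seed=0):
    rng = random.Random(seed)
    n_jobs = len(proc)
    x = 0.3141  # chaotic seed (assumed); avoid 0, 0.25, 0.5, 0.75 and 1
    islands = []
    for _ in range(n_islands):
        pop = []
        for _ in range(pop_size):
            perm, x = chaotic_permutation(x, n_jobs)
            pop.append(perm)
        islands.append(pop)
    for g in range(1, generations + 1):
        islands = [evolve(pop, proc, rng) for pop in islands]  # independent evolution
        if g % migrate_every == 0:
            bests = [pop[0] for pop in islands]
            for k, pop in enumerate(islands):
                pop[-1] = bests[(k - 1) % n_islands][:]  # neighbour's best replaces worst
    best = min((p for pop in islands for p in pop), key=lambda p: makespan(p, proc))
    return best, makespan(best, proc)


# Hypothetical instance: proc[job][machine] is a processing time.
proc = [[3, 2, 4], [1, 5, 2], [4, 1, 3], [2, 3, 1]]
best, span = chaos_parallel_ep(proc)
```

The ranking step is one common way to map continuous values to a permutation; the paper may use a different encoding.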
ISBN: 0769509363 (print)
In this paper, we describe the design and implementation of a portable run-time system for GOP, a graph-oriented programming framework that aims to provide high-level abstractions for configuring and programming cooperative parallel processes. The run-time system provides an interface, with a library of programming primitives, to the low-level facilities required to support graph-oriented communication and synchronization. The implementation is built on top of the Parallel Virtual Machine (PVM) in a local area network of Sun workstations. Issues related to the implementation of graph operations in a distributed environment are discussed. The performance of the run-time system is evaluated by estimating the overheads associated with using GOP primitives as opposed to PVM.
We present a synchronous parallel programming model designed for massively parallel fine-grained applications such as cellular automata, finite element methods or partial differential equations. In this model we assume that the number of parallel processes in a program is much larger than the number of processors of the machine on which it is run. We present the computational model and the communication model. We introduce the virtual cellular machine, an abstract machine implementing this programming model, which requires means to simulate efficiently the execution of many processes on a single processor and to use the available communication bandwidth efficiently. Finally, we show an example program written in a prototype language designed for programming the virtual machine.
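To make the idea of simulating many synchronous processes on one processor concrete, here is a small sketch. It is not the virtual cellular machine or its prototype language. It assumes a one-dimensional ring of cells, treats each cell as one virtual process, and uses a second buffer so that every process reads only the state from the previous step. The function names and the choice of rule 90 are illustrative assumptions.

```python
def rule90(left, me, right):
    """Elementary cellular automaton rule 90: new state is left XOR right."""
    return left ^ right


def step(cells, rule):
    """One synchronous step over all virtual processes.

    Every cell reads the old buffer and writes into a new one, so the
    order in which the single processor visits the cells does not matter.
    """
    n = len(cells)
    nxt = [0] * n
    for i in range(n):
        nxt[i] = rule(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])
    return nxt


def run(cells, rule, steps):
    """Advance the automaton by the given number of synchronous steps."""
    for _ in range(steps):
        cells = step(cells, rule)  # swapping buffers acts as the global barrier
    return cells


start = [0, 0, 0, 1, 0, 0, 0]
after_two = run(start, rule90, 2)
```

Returning a fresh list from `step` plays the role of the global synchronization point: no process sees a neighbour's new value before the whole step has finished.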
ISBN: 0769517609 (print)
Shared-object Distributed Shared Memory (DSM) minimizes the problem of false sharing by allowing the programmer to control the sharing size. This shared-object approach to distributed parallel programming works well for task parallelism but not for data parallelism. When the data of a shared object is being modified, a lock on that object must be enforced to exclude any concurrent access to the same object. If the shared data within an object is large, internal false sharing becomes a problem. We present a multi-locking mechanism for shared-object DSM that allows multiple locks to be applied to different data sets of a shared object and thus increases the concurrency available on that object.
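The contrast between one lock per object and several locks per object can be sketched with ordinary threads. This is a single-machine analogy only, not a DSM implementation and not the mechanism from the paper. The class `MultiLockObject`, its partitioning scheme, and the `worker` helper are hypothetical.

```python
import threading


class MultiLockObject:
    """A shared array split into partitions, each guarded by its own lock."""

    def __init__(self, size, n_parts):
        self.data = [0] * size
        self.part_len = -(-size // n_parts)  # ceiling division
        self.locks = [threading.Lock() for _ in range(n_parts)]

    def _lock_for(self, index):
        """Return the lock that guards the partition containing this index."""
        return self.locks[index // self.part_len]

    def add(self, index, amount):
        """Update one element while holding only its partition's lock."""
        with self._lock_for(index):
            self.data[index] += amount


def worker(obj, indices, repeats):
    """Increment each listed element the given number of times."""
    for _ in range(repeats):
        for i in indices:
            obj.add(i, 1)


obj = MultiLockObject(size=8, n_parts=4)
threads = [threading.Thread(target=worker, args=(obj, range(8), 500))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a single lock, writers to unrelated elements would serialize on the whole object; with one lock per partition, only writers that touch the same partition wait for each other.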