Clusters are attractive for executing sequential and parallel applications. However, a cluster distributed operating system is needed to provide a Single System Image. A cluster operating system providing both a DSM system and load balancing is attractive for efficiently executing a workload of sequential applications and shared memory parallel applications. Gobelins is a distributed operating system dedicated to clusters that provides both a DSM system and a process migration mechanism to support load balancing. In this paper, we present the implementation of the Gobelins process migration mechanism, which exploits the Gobelins kernel-level DSM system. We show that the Gobelins DSM makes it simple to implement an efficient migration mechanism that can be used to move processes or threads among cluster nodes. A prototype of Gobelins has been implemented, and some performance results are presented in this paper.
In many parallel programs, run-time data redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Research on data redistribution algorithms has recently become very mature, and the time required to generate data sets and processor sets is much less than before. As a result, packing and unpacking have become a relatively heavy cost in redistribution. In this paper we present methods to perform BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution using MPI user-defined types. This approach reduces the memory buffers required and avoids unnecessary data movement. Theoretical models are presented to determine the best method for redistribution. To evaluate the proposed methods, we implemented them on an IBM SP2 parallel machine. The experimental results show that this approach clearly improves the performance of redistribution in most cases.
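As a rough illustration of what a BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution has to move, the sketch below computes, for each (source, destination) processor pair, the global indices that change hands. It assumes the usual ownership rule, where element i belongs to processor (i // block) mod P. It is plain Python and does not use MPI or the paper's user-defined types; the function names `owner` and `redistribution_sets` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def owner(i, block, nprocs):
    """Processor that owns global index i under BLOCK-CYCLIC(block)."""
    return (i // block) % nprocs

def redistribution_sets(n, s, t, nprocs):
    """Map (source, destination) -> global indices sent from source to destination."""
    sets = defaultdict(list)
    for i in range(n):
        src = owner(i, s, nprocs)   # owner before redistribution
        dst = owner(i, t, nprocs)   # owner after redistribution
        sets[(src, dst)].append(i)
    return dict(sets)

# 12 elements on 2 processors, going from BLOCK-CYCLIC(2) to BLOCK-CYCLIC(3).
sets = redistribution_sets(n=12, s=2, t=3, nprocs=2)
for pair in sorted(sets):
    print(pair, sets[pair])
```

The index lists are generally non-contiguous (for example, processor 1 sends indices 2, 6, and 7 to processor 0), which is the situation in which describing the layout with a derived datatype can stand in for explicit packing into a send buffer.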
This report considers the conceptual problems of increasing the productivity and reliability of specialized computing devices (SCD) in intelligent systems, on the basis of data representation and processing with various problem-oriented machine arithmetics (POMA). Directions for developing methods of data representation and processing are considered; new methods are proposed, and known methods are substantially advanced.
ISBN: 0769514448 (print)
The following topics are dealt with: parallel, distributed and network-based processing; performance analysis; Web computing; failure handling; Java and Jini; parallel and distributed programming tools for grids; unorthodox computing architectures; systems and applications; message passing; scheduling; algorithms; and mobile ad hoc networks.
Parallel, multithreaded Java applications such as Web servers, database servers, and scientific applications are becoming increasingly prevalent. Most of them have high object instantiation rates through the new bytecode, which is typically implemented in the garbage collection subsystem. For these applications, traditional garbage collectors are often the bottleneck that limits program performance and processor utilization on multiprocessor systems. They suffer from long garbage collection pauses (the stop-the-world mark-sweep algorithm) or from the inability to collect cyclic garbage (the reference counting approach). Generational garbage collection, however, is based only on the weak generational hypothesis that most objects die young. In this paper, a new multithreaded concurrent generational garbage collector (MCGC) based on mark-sweep with the assistance of reference counting is proposed. The MCGC can take advantage of multiple CPUs in an SMP system and the merits of lightweight processes. Furthermore, the long garbage collection pause can be reduced and the garbage collection efficiency can be enhanced.
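The limitation of reference counting noted in the abstract above can be seen in a toy model. The sketch below is not the MCGC design; it is a minimal single-threaded illustration, with the hypothetical names `Obj`, `link`, and `unreachable`, of why a pure reference counter never reclaims a cycle while a mark phase started from the roots does identify it as garbage.

```python
class Obj:
    """A heap object with outgoing references and a reference count."""
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.rc = 0

def link(src, dst):
    """Store a reference from src to dst and increment dst's count."""
    src.refs.append(dst)
    dst.rc += 1

def unreachable(heap, roots):
    """Mark everything reachable from roots; return names a sweep would reclaim."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in marked:
            continue
        marked.add(id(obj))
        stack.extend(obj.refs)
    return [o.name for o in heap if id(o) not in marked]

a, b, c = Obj("a"), Obj("b"), Obj("c")
heap = [a, b, c]
a.rc += 1      # 'a' is held by a root, such as a stack variable
link(b, c)     # 'b' and 'c' reference each other,
link(c, b)     # but no root can reach either of them

rc_garbage = [o.name for o in heap if o.rc == 0]   # what reference counting frees
ms_garbage = unreachable(heap, roots=[a])          # what mark-sweep frees
print(rc_garbage, ms_garbage)
```

Here the cycle keeps both counts at one, so reference counting reclaims nothing, whereas the mark phase reports `b` and `c` as unreachable.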
Summary form only given. In April of 1992, a group of parallel computing vendors, computer science researchers, and application scientists met at a one-day workshop and agreed to cooperate on the development of a community standard for the message-passing model of parallel computing. The MPI Forum that eventually emerged from that workshop became a model of how a broad community could work together to improve an important component of the high performance computing environment. The Message Passing Interface (MPI) definition that resulted from this effort has been widely adopted and implemented, and is now virtually synonymous with the message-passing model itself. MPI not only standardized existing practice in the service of making applications portable in the rapidly changing world of parallel computing, but also consolidated research advances into novel features that extended existing practice and have proven useful in developing a new generation of applications. This talk will discuss some of the procedures and approaches of the MPI Forum that led to MPI's early adoption, and then describe some of the features that have led to its persistence as a reference model for parallel computing. Although clusters were only just emerging as a significant parallel computing production platform as MPI was being defined, MPI has proven to be a useful way of programming them for high performance, and we will discuss the current situation in MPI implementations for clusters. MPI was deliberately designed to grant considerable flexibility to implementors, and thus provides a useful framework for implementation research. Successful implementation techniques within the MPI standard can be utilized immediately by applications already using MPI, thus providing an unusually fast path from research results to their application. At Argonne National Laboratory we have been developing and distributing MPICH, a portable, high performance implementation of MPI, from the very beginning of t
ISBN: 1581135769 (print)
We present a system-level design and programming method for embedded multiprocessor systems. The aim of the method is to improve the design time and design quality by providing a structured approach for implementing process networks. We use process networks as re-usable and architecture-independent functional specifications. The method facilitates the cost-driven and constraint-driven source code transformation of process networks into architecture-specific implementations in the form of communicating tasks. We apply the method to implement a JPEG decoding process network in software on a set of MIPS processors. We apply three transformations to optimize synchronization rates and data transfers and to exploit data parallelism for this target architecture. We evaluate the impact of the source code transformations and the performance of the resulting implementations in terms of design time, execution time, and code size. The results show that process networks can be implemented quickly and efficiently on embedded multiprocessor systems.
In the past few years, cluster computing has been widely accepted as a parallel platform because of its high performance at an affordable cost. To maximize the use of available resources, resource monitoring for cluster computing is required. The resource information collected can be used by any parallel application, e.g. parallel motion estimation, to handle variation in available resources on typical time-sharing computers, so that the computing load can be distributed properly among n processors. In this paper, we present the development of resource monitoring for cluster computing using the MPI programming model to achieve efficient parallel motion estimation. Results show the effectiveness of our method, with which a faster parallel execution time can be achieved.
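One simple way monitored resource information could drive load distribution is to split the work in proportion to each node's currently available capacity. The sketch below is an assumption made for illustration, not the scheme described in the paper: the function name `partition`, the integer capacity scores, and the rule for handing out leftover items are all hypothetical.

```python
def partition(total_items, available):
    """Split total_items across nodes in proportion to their available capacity.

    `available` holds one positive integer score per node, for example the
    percentage of CPU currently idle as reported by a monitor.
    """
    capacity = sum(available)
    # Integer share for each node, rounded down.
    shares = [total_items * a // capacity for a in available]
    # Rounding down can leave a few items unassigned; give them to the
    # nodes with the most available capacity first.
    leftover = total_items - sum(shares)
    by_capacity = sorted(range(len(available)),
                         key=lambda k: available[k], reverse=True)
    for k in by_capacity[:leftover]:
        shares[k] += 1
    return shares

# 99 work items (say, rows of macroblocks) over three nodes that are
# 90%, 50%, and 10% idle.
shares = partition(99, [90, 50, 10])
print(shares)
```

In a time-sharing setting the scores would be refreshed periodically and the partition recomputed, so a node that becomes busy receives less work in the next round.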
ISBN: 9780769515243 (print)
UPC, or Unified Parallel C, is a parallel extension of ANSI C. UPC follows a distributed shared memory programming model aimed at leveraging the ease of programming of the shared memory paradigm, while enabling the exploitation of data locality. UPC incorporates constructs that allow placing data near the threads that manipulate them to minimize remote accesses. This paper gives an overview of the concepts and features of UPC and establishes, through extensive performance measurements of NPB workloads, the viability of the UPC programming language compared to the other popular paradigms. Further, through performance measurements we identify the challenges, the remaining steps and the priorities for UPC. It will be shown that with proper hand tuning and optimized collective operations libraries, UPC performance will be comparable to that of MPI. Furthermore, by incorporating such improvements into automatic compiler optimizations, UPC will compare quite favorably to message passing in ease of programming.