ISBN (print): 9780780384309
Advances in hardware technology enable the inclusion of SMP nodes in PC clusters, or even entire clusters of SMPs. These are becoming viable alternatives for high-performance computing. The challenge is exploiting the computational resources that these hardware platforms provide. A hybrid programming paradigm, which uses shared memory through multithreading within a node and a message-passing model for inter-node communication, is one alternative. However, programming in such a paradigm is very hard. This work presents CPAR-Cluster, a runtime system that provides a shared memory abstraction on top of a cluster composed of mono- and multiprocessor nodes. It is implemented at the library level and does not require special resources such as specific hardware or operating system modifications. Models, strategies, implementation aspects and some results are presented.
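The library-level approach described above can be pictured as an API that hides whether a datum lives in local memory or on a remote node. The sketch below is purely illustrative: the class and method names are hypothetical and are not CPAR-Cluster's actual interface; it only shows the shape of a shared-memory abstraction in which remote accesses hide behind ordinary method calls.

    // Hypothetical illustration of a library-level shared memory abstraction;
    // not the actual CPAR-Cluster API.
    import java.util.concurrent.ConcurrentHashMap;

    public class SharedArray {
        // In a real runtime, remote segments would be fetched over the network;
        // here a local map stands in for the cluster-wide address space.
        private final ConcurrentHashMap<Integer, Double> segments = new ConcurrentHashMap<>();

        public double get(int index) {
            // A real implementation would check ownership and, if the element is
            // remote, request it from the owning node before returning.
            return segments.getOrDefault(index, 0.0);
        }

        public void put(int index, double value) {
            // A real implementation would invalidate or update remote copies here.
            segments.put(index, value);
        }

        public static void main(String[] args) {
            SharedArray a = new SharedArray();
            a.put(42, 3.14);               // looks like a local write
            System.out.println(a.get(42)); // could transparently be a remote read
        }
    }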
ISBN (print): 9780780384309
The significant performance-to-cost ratio advantage of clusters, combined with recent advances in middleware (programming environment) and networking technologies, has made them the single most popular and fastest growing platform for high performance computing in recent years. While the message passing interface (MPI) still dominates as a means of parallel programming in clusters, it is nevertheless desirable for programmers to program in a single address space, not only across a cluster but also among multiple, likely heterogeneous, clusters, so as to significantly extend the computing power of a single cluster. In this paper we propose a distributed shared object (DSO) model based on a distributed hierarchical consistency model (DHCM) protocol for heterogeneous clusters. DHCM, inspired by, but significantly improved over, local consistency, is designed to help maintain coherence and consistency in a DSO programming environment and to adapt to different levels of consistency. The notion of adaptive consistency is proposed and partially implemented to improve the efficiency of consistency control, and scalability is addressed as well through the hierarchical structure of the protocol design. We implemented this model purely in Java for portability and heterogeneity. The performance of DHCM is evaluated by executing the LU application chosen from the SPLASH-2 benchmark suite on a 128-node Linux cluster. The experimental results show that the hierarchical protocol significantly outperforms a single-tier protocol in terms of execution time, indicating higher scalability.
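To make the DSO programming model concrete, the Java sketch below shows what a shared-object handle with acquire/release operations and a selectable consistency level might look like. The interface, the Consistency levels and the LocalObject stand-in are all hypothetical illustrations, not the paper's actual API; a real DHCM runtime would replicate the object across cluster nodes in a hierarchy.

    // Hypothetical sketch of a distributed shared object (DSO) handle with a
    // selectable consistency level; not the paper's actual interface.
    public class DsoSketch {

        enum Consistency { STRICT, RELAXED }   // illustrative levels only

        interface SharedObject<T> {
            T read(Consistency level);          // may return a cached copy under RELAXED
            void write(T value);                // update propagated by the runtime
            void acquire();                     // bring the local copy up to date
            void release();                     // make local updates visible to others
        }

        // A trivial single-JVM stand-in so the example runs; a real DSO runtime
        // would keep replicas consistent across nodes according to DHCM.
        static class LocalObject<T> implements SharedObject<T> {
            private volatile T value;
            LocalObject(T initial) { value = initial; }
            public T read(Consistency level) { return value; }
            public void write(T v) { value = v; }
            public void acquire() { /* no-op locally */ }
            public void release() { /* no-op locally */ }
        }

        public static void main(String[] args) {
            SharedObject<int[]> row = new LocalObject<>(new int[]{1, 2, 3});
            row.acquire();
            row.write(new int[]{4, 5, 6});
            row.release();                               // updates become visible
            System.out.println(row.read(Consistency.RELAXED)[0]);
        }
    }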
This paper describes the definition and implementation of an OpenMP-like set of directives and library routines for shared memory parallel programming in Java. A specification of the directives and routines is proposed and discussed. A prototype implementation, consisting of a compiler and a runtime library, both written entirely in Java, is presented, which implements most of the proposed specification. Some preliminary performance results are reported. Copyright (C) 2001 John Wiley & Sons, Ltd.
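The example below illustrates the general idea of comment-based, OpenMP-like directives in Java. The //omp syntax is a guess at what such an annotation could look like and is not taken from the paper; a source-to-source compiler would rewrite the annotated loop into multithreaded code, while without it the file remains ordinary sequential Java.

    // Illustrative only: the directive syntax below is an assumed example of an
    // OpenMP-like annotation for Java, not the paper's specified syntax.
    public class OmpLikeLoop {
        public static void main(String[] args) {
            double[] a = new double[1_000_000];

            //omp parallel for
            for (int i = 0; i < a.length; i++) {
                a[i] = Math.sqrt(i);   // each iteration is independent
            }

            // Because the directive lives in a comment, the file is still plain
            // Java: without the special compiler it simply runs sequentially.
            System.out.println(a[a.length - 1]);
        }
    }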
We present a framework for the parallelization of depth-first combinatorial search algorithms on a network of computers. Our architecture is intended for a distributed setting and uses a work stealing strategy coupled with a small number of primitives for the processors (which we call workers) to obtain new work and to communicate with other workers. These primitives are a minimal imposition and integrate easily with constraint programming systems. The main contribution is an adaptive architecture, which allows workers to join and leave incrementally and has good scaling properties as the number of workers increases. Our empirical results show that near-linear speedup for backtrack search is achieved for up to 61 workers. This suggests that near-linear speedup is possible with even more workers. The experiments also demonstrate where departures from linearity can occur: for small problems, and for problems where the parallelism can itself affect the search, as in branch and bound.
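The worker loop below sketches the general shape of such a work-stealing strategy: each worker pops subproblems from its own deque in depth-first order and, when idle, steals the oldest (and usually largest) subproblem from a random peer. The primitive names (steal, expand) and the termination test are illustrative assumptions, not the paper's actual primitives; real distributed termination detection is more involved.

    // Hypothetical work-stealing worker for depth-first search (illustrative only).
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.ThreadLocalRandom;

    public class StealingWorker implements Runnable {
        private final ConcurrentLinkedDeque<int[]> localWork = new ConcurrentLinkedDeque<>();
        private List<StealingWorker> peers;

        public void run() {
            while (true) {
                int[] node = localWork.pollLast();   // depth-first: newest work first
                if (node == null) node = steal();    // idle: ask another worker
                if (node == null) break;             // simplified termination test
                expand(node);
            }
        }

        private int[] steal() {
            StealingWorker victim = peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
            return victim == this ? null : victim.localWork.pollFirst();  // take the oldest subtree
        }

        private void expand(int[] node) {
            // Placeholder: a real search would branch here, pushing child subproblems
            // back onto localWork, or record a solution / prune the subtree.
        }

        public static void main(String[] args) throws InterruptedException {
            List<StealingWorker> workers = new ArrayList<>();
            for (int i = 0; i < 4; i++) workers.add(new StealingWorker());
            for (StealingWorker w : workers) w.peers = workers;
            workers.get(0).localWork.add(new int[0]);   // seed the root subproblem
            List<Thread> threads = new ArrayList<>();
            for (StealingWorker w : workers) { Thread t = new Thread(w); threads.add(t); t.start(); }
            for (Thread t : threads) t.join();
            System.out.println("search finished");
        }
    }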
Summary form only given. We describe the parallelization of the multizone code versions of the NAS parallel benchmarks employing multilevel OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues which affect the performance of multilevel parallel applications.
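The multizone codes exploit two levels of parallelism: across zones and, within each zone, across grid points. The Java sketch below only mirrors that nesting structure with nested parallel streams; it is not OpenMP, and the NanosCompiler's thread-grouping and load-balancing clauses have no counterpart here.

    // A Java analogue of two-level (multizone) parallelism; illustrative only.
    import java.util.stream.IntStream;

    public class MultizoneSketch {
        public static void main(String[] args) {
            int zones = 16;
            int pointsPerZone = 100_000;
            double[][] field = new double[zones][pointsPerZone];

            // Outer level: zones processed in parallel (coarse grain).
            IntStream.range(0, zones).parallel().forEach(z ->
                // Inner level: points within a zone processed in parallel (fine grain).
                IntStream.range(0, pointsPerZone).parallel().forEach(p ->
                    field[z][p] = Math.sin(z) * Math.cos(p)
                )
            );

            System.out.println(field[0][0] + " " + field[zones - 1][pointsPerZone - 1]);
        }
    }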
Local area networks are now widely used to run parallel applications. They are particularly suitable for I/O intensive applications, because each node usually includes disk space. However, programming these applications is rather difficult. The programmer must partition data across disk nodes and, at run time, transfer data from disk space into the memory of each node that uses the data and vice versa. Also, such partitioning gives data a fixed location on disk, which is usually not adequate for performance, because processors mostly access data not in their local memory or disk but in the disk space of remote nodes. This paper presents a distributed parallel file system that both eases the programming and improves the performance of parallel I/O intensive applications. Our file system eases programming by mapping files of up to hundreds of gigabytes into memory. It improves performance by automatically diffusing (that is, migrating and replicating) file data to the local memory or local disk of the processors that use it. Data diffusion occurs under a multiple-readers-single-writer protocol. On the applications tested, the performance gain can be up to 20% compared to versions using the MPI file system.
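The "program against memory rather than explicit I/O" idea can be illustrated with Java's standard memory-mapped files. This is only an analogue of the programming model described above, not the paper's file system, which additionally migrates and replicates data across cluster nodes; note also that a single Java mapping is limited to 2 GiB, so very large files would need several mappings.

    // Memory-mapped file access: the file looks like an in-memory buffer,
    // with no explicit read()/write() calls (illustrative analogue only).
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedFileSketch {
        public static void main(String[] args) throws Exception {
            long size = 1L << 20;   // 1 MiB here; larger files need multiple mappings
            try (RandomAccessFile f = new RandomAccessFile("data.bin", "rw");
                 FileChannel ch = f.getChannel()) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);

                buf.putDouble(0, 42.0);            // a plain memory write
                System.out.println(buf.getDouble(0));
            }
        }
    }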
ISBN (print): 0780386477
A 2K×8 bit EEPROM memory, which operates from a single 3.3 V power supply and is based on the SMIC 0.35 μm EEPROM process, has been developed. Several key design techniques are summarized. An improved readout circuit, consisting of a sense amplifier (SA), bit-line decoding and an optimized logic circuit to minimize the read access time, is described in detail, as are the approaches used to optimize the program operation and to generate the on-chip high voltage. A 40 ns typical read access time and a 2 ms page programming time are achieved. The active and standby currents are 10 mA and 100 μA respectively.
ISBN (print): 0780386477
The LMS algorithm is commonly used in the optimum design of adaptive filters, because it is simple and easy to realize. However, the convergence behavior and misadjustment of the LMS algorithm are strongly affected by the step size, and the optimum step-size parameter cannot be calculated easily. Evolutionary programming is an optimization algorithm whose search objects are N-dimensional real-valued vectors. In this paper, an FIR filter is taken as an example, and a fast evolutionary programming algorithm is used in the design of the adaptive filter: Cauchy mutation takes the place of Gaussian mutation to improve the speed of convergence. The algorithm is not dependent on any parameter; a good result is obtained by simulation, indicating the validity of the algorithm.
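The sketch below illustrates the two building blocks named in the abstract: a single LMS weight update (w becomes w + mu*e*x, where e is the output error) and a Cauchy variate as used by fast evolutionary programming in place of a Gaussian mutation. How the paper combines these ingredients is not spelled out in the abstract, so this is only a sketch of the pieces, not the paper's algorithm.

    // LMS update step and Cauchy mutation sample (building blocks only).
    import java.util.Random;

    public class LmsAndCauchy {
        // One LMS step: w <- w + mu * e * x, where e = d - w'x is the output error.
        static double lmsStep(double[] w, double[] x, double d, double mu) {
            double y = 0.0;
            for (int i = 0; i < w.length; i++) y += w[i] * x[i];
            double e = d - y;
            for (int i = 0; i < w.length; i++) w[i] += mu * e * x[i];
            return e;
        }

        // Standard Cauchy variate via the inverse transform; its heavy tails give
        // occasional long jumps, which is why Cauchy mutation tends to escape
        // poor regions faster than Gaussian mutation.
        static double cauchy(Random rng) {
            return Math.tan(Math.PI * (rng.nextDouble() - 0.5));
        }

        public static void main(String[] args) {
            Random rng = new Random(1);
            double[] w = new double[4];
            double[] x = {1.0, 0.5, -0.3, 0.2};
            System.out.println("error = " + lmsStep(w, x, 0.7, 0.05));
            System.out.println("cauchy sample = " + cauchy(rng));
        }
    }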
ISBN (print): 3540008527
Parallel programming effort can be reduced by using high-level constructs such as algorithmic skeletons. Within the Magda toolset, which supports the programming and execution of mobile-agent-based distributed applications, we provide a skeleton-based parallel programming environment based on the specialization of algorithmic skeleton Java interfaces and classes. Their implementation includes mobile agent features for execution on heterogeneous systems, such as clusters of workstations and PCs, and supports reliability and dynamic workload balancing. The user can thus develop a parallel, mobile-agent-based application simply by specializing a given set of classes and methods and using a set of added functionalities.
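The sketch below shows the general pattern of specializing a skeleton class: the user overrides a single worker method and the skeleton handles the parallel coordination. The class names (Farm, SquareFarm) and the parallel-stream back end are illustrative assumptions, not Magda's actual interfaces, which would dispatch work to mobile agents on cluster nodes.

    // Hypothetical skeleton specialization; not Magda's actual classes.
    import java.util.List;
    import java.util.stream.Collectors;

    public class FarmSketch {
        // A generic "farm" skeleton: apply a worker function to every task.
        static abstract class Farm<I, O> {
            protected abstract O worker(I task);          // the only method users override

            public List<O> compute(List<I> tasks) {
                // A real skeleton would hand tasks to agents on remote nodes and
                // rebalance the load; here a parallel stream stands in.
                return tasks.parallelStream().map(this::worker).collect(Collectors.toList());
            }
        }

        // User code: specialize the skeleton by overriding the worker method.
        static class SquareFarm extends Farm<Integer, Long> {
            protected Long worker(Integer task) { return (long) task * task; }
        }

        public static void main(String[] args) {
            System.out.println(new SquareFarm().compute(List.of(1, 2, 3, 4)));
        }
    }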
A Functional Abstract Notation (FAN) is proposed for the specification and design of parallel algorithms by means of skeletons: high-level patterns with parallel semantics. The main weakness of current programming systems based on skeletons is that the user is still responsible for finding the most appropriate skeleton composition for a given application and a given parallel architecture. We describe a transformational framework for the development of skeletal programs which is aimed at filling this gap. The framework makes use of transformation rules, which are semantic equivalences among skeleton compositions. For a given problem, an initial, possibly inefficient skeleton specification is refined by applying a sequence of transformations. Transformations are guided by a set of performance prediction models which forecast the behavior of each skeleton and the performance benefits of different rules. The design process is supported by a graphical tool which locates applicable transformations and provides performance estimates, thereby helping the programmer to navigate the program refinement space. We give an overview of the FAN framework and exemplify its use with performance-directed program derivations for simple case studies. Our experience can be viewed as a first feasibility study of methods and tools for transformational, performance-directed parallel programming using skeletons.
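One classic rule of the kind such a framework manipulates is map fusion: map(f) after map(g) is semantically equivalent to map(f compose g). The Java example below is generic and is not claimed to be one of FAN's own rules; it only shows that the two compositions are semantic equivalences whose choice (a two-stage pipeline versus a single fused stage) would be left to the performance models.

    // Map fusion as an example transformation rule (illustrative, not FAN's rule set).
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class MapFusion {
        public static void main(String[] args) {
            List<Integer> input = List.of(1, 2, 3, 4);
            Function<Integer, Integer> g = x -> x + 1;
            Function<Integer, Integer> f = x -> x * x;

            // Two-stage composition: could run as a two-stage pipeline.
            List<Integer> twoStages = input.stream().map(g).map(f).collect(Collectors.toList());

            // Fused composition: one stage, less intermediate traffic, less pipelining.
            List<Integer> fused = input.stream().map(f.compose(g)).collect(Collectors.toList());

            // Semantic equivalence: both forms produce the same result; a performance
            // model decides which form is better on a given machine.
            System.out.println(twoStages.equals(fused));  // true
        }
    }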