In this paper, we propose a new method for parallel system design based on an expanded logical coloured Petri net (LCPN). An LCPN is an extended Petri net that solves the system-description problems of the previously proposed place/transition nets and coloured Petri nets. This extension of Petri nets is suitable for designing complex control systems and for discussing methods of evaluating such systems realistically. To study the behaviour of a server system modelled with this net, we simulated it with a Java program. The simulation confirmed that this extended Petri net is an effective tool for modelling parallel systems.
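The logical and colour extensions of the LCPN are not spelled out here, so the sketch below covers only the plain place/transition-net firing rule that such extensions build on, written in Python rather than the authors' Java. The `PetriNet` class, the `build_server_net` helper and the server places (`requests`, `idle`, `busy`, `done`) are hypothetical illustrations, not the authors' model.

```python
from typing import Dict, Tuple

Arcs = Dict[str, int]  # place name -> arc weight


class PetriNet:
    """Minimal place/transition net (hypothetical sketch, not the LCPN itself)."""

    def __init__(self, marking: Arcs) -> None:
        self.marking: Arcs = dict(marking)  # current token count per place
        self.transitions: Dict[str, Tuple[Arcs, Arcs]] = {}

    def add_transition(self, name: str, inputs: Arcs, outputs: Arcs) -> None:
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name: str) -> bool:
        # Enabled when every input place holds at least as many tokens as the arc weight.
        inputs, _ = self.transitions[name]
        return all(self.marking.get(place, 0) >= w for place, w in inputs.items())

    def fire(self, name: str) -> None:
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place, w in inputs.items():  # consume tokens from input places
            self.marking[place] -= w
        for place, w in outputs.items():  # produce tokens in output places
            self.marking[place] = self.marking.get(place, 0) + w


def build_server_net(pending: int) -> PetriNet:
    """Hypothetical single-server model: a request is served only while the server is idle."""
    net = PetriNet({"requests": pending, "idle": 1, "busy": 0, "done": 0})
    net.add_transition("start", {"requests": 1, "idle": 1}, {"busy": 1})
    net.add_transition("finish", {"busy": 1}, {"idle": 1, "done": 1})
    return net
```

For example, with two pending requests, firing `start` disables `start` until `finish` returns the token to `idle`; this mutual exclusion is the kind of behaviour a simulation of the server model checks.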
In this paper, we examine how a network processor can be modeled using object-oriented techniques. We examine the Intel IXP 1200 network processor and discuss how the object-oriented language POOSL was used to evaluate a system before implementing it with hardware and software components. Using the IXP 1200 as a case study, we illustrate the suitability of object-oriented languages for system-level modeling and design exploration.
Nonuniform distance loop dependences are a known obstacle to finding parallel iterations. To find the outermost loop parallelism in these "irregular" loops, a novel method based on recurrence chains is presented. The scheme organizes nonuniformly dependent iterations into lexicographically ordered monotonic chains. While the initial and final iterations of monotonic chains form two parallel sets, the remaining iterations form an intermediate set that can be partitioned further. When there is only one pair of coupled array references, the nonuniform dependences are represented by a single recurrence equation. In that case, the chains in the intermediate set do not bifurcate, and each can be executed as a WHILE loop. The independent iterations and the initial iterations of monotonic dependence chains constitute the outermost parallelism. The proposed approach compares favorably with other treatments of nonuniform dependences in the literature. When there are multiple recurrence equations, the technique can be used to schedule a dataflow parallel execution that exploits maximum loop parallelism.
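The partition into independent, initial, intermediate and final iterations can be sketched for the single-recurrence case. This sketch assumes a one-dimensional loop over iterations 1..n in which the dependence sink of iteration i is given by a hypothetical recurrence `succ(i)` (here 2i + 1); it skips the dependence analysis that would derive such a recurrence from the coupled array subscripts. `partition_iterations` is an illustrative name, not the paper's code.

```python
from typing import Callable, Dict, List, Optional


def partition_iterations(n: int, succ: Callable[[int], Optional[int]]) -> Dict[str, List]:
    """Split iterations 1..n into the sets used by recurrence-chain scheduling."""
    iterations = set(range(1, n + 1))
    # Dependence edge i -> succ(i), kept only if the sink is a later in-bounds iteration.
    edges: Dict[int, int] = {}
    for i in sorted(iterations):
        j = succ(i)
        if j is not None and i < j <= n:
            edges[i] = j
    sources, sinks = set(edges), set(edges.values())
    initial = sorted(sources - sinks)  # chain heads: these may all start in parallel
    chains = []
    for head in initial:
        # succ is a function, so a chain never bifurcates; walk it like a WHILE loop.
        chain = [head]
        while chain[-1] in edges:
            chain.append(edges[chain[-1]])
        chains.append(chain)
    return {
        "independent": sorted(iterations - sources - sinks),
        "initial": initial,
        "intermediate": sorted(sources & sinks),
        "final": sorted(sinks - sources),
        "chains": chains,
    }


result = partition_iterations(20, lambda i: 2 * i + 1)
```

With n = 20, the independent iterations and the five chain heads form the outermost parallel set, and each chain (for example 1, 3, 7, 15) then runs sequentially. Because 2i + 1 is also injective, no two chains merge in this example.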
The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The X1's hierarchical design uses the multi-streaming processor (MSP) as its basic building block; each MSP is capable of 12.8 GF/s for 64-bit operations. The distributed shared memory (DSM) of the X1 presents a 64-bit global address space that is directly addressable from every MSP, with an interconnect bandwidth-to-computation ratio of one byte per floating-point operation. Our results show that this high bandwidth and the low latency of remote memory accesses translate into improved performance on important applications, such as an Eulerian gyrokinetic-Maxwell solver. Furthermore, this architecture naturally supports programming models such as the Cray shmem API, Unified Parallel C (UPC), and Co-array Fortran (CAF), and, as our benchmarks demonstrate, selecting the appropriate model is imperative to exploit these features.
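The one-byte-per-flop ratio fixes the interconnect bandwidth available to each MSP. The second line is an aggregate peak that additionally assumes the 4096 processors counted above are MSPs, which the abstract does not state explicitly.

```latex
\begin{aligned}
B_{\text{MSP}} &= 12.8\ \text{GF/s} \times 1\ \text{B/flop} = 12.8\ \text{GB/s} \\
R_{\text{peak}} &= 4096 \times 12.8\ \text{GF/s} = 52\,428.8\ \text{GF/s} \approx 52.4\ \text{TF/s}
\end{aligned}
```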
The availability of high-bandwidth wide area networks enables several computing resources (supercomputers or PC clusters) to be coupled together into a high-performance distributed system. The question is which programming model provides transparency, interoperability, reliability, scalability and performance. Since such systems appear as a combination of distributed and parallel systems, it is tempting to extend programming models that were associated with either distributed or parallel systems. Another option is to combine the two worlds into a single coherent one: a parallelism-oriented model appears more adequate for programming parallel codes, while a distribution-oriented model is more suitable for handling inter-code communications. This issue is addressed with the concept of a parallel object. We have applied it to Corba to define a parallel Corba object: a collection of identical Corba objects with a single program, multiple data (SPMD) execution model. This paper presents PACO++, a portable implementation of the parallel Corba object concept, and examines how the various design issues were tackled. For example, scalability between two parallel Corba objects is achieved by involving all members of both collections in the communication: an aggregated bandwidth of 874 Mbit/s was obtained on a 1 Gbit/s WAN. This performance is obtained while preserving Corba semantics, in particular interoperability with standard Corba objects.
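The abstract does not describe how PACO++ partitions data between collections. A minimal sketch, assuming both parallel objects hold a one-dimensional array in a balanced block distribution, computes the point-to-point messages that let every member of the sending collection transfer its share directly to the members of the receiving collection. `block_bounds` and `redistribution_schedule` are hypothetical helpers, not part of the PACO++ or Corba APIs.

```python
from typing import List, Tuple

Message = Tuple[int, int, int, int]  # (source rank, destination rank, start, end)


def block_bounds(length: int, parts: int, rank: int) -> Tuple[int, int]:
    """Half-open range [start, end) owned by `rank` in a balanced block distribution."""
    base, extra = divmod(length, parts)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def redistribution_schedule(length: int, n_src: int, n_dst: int) -> List[Message]:
    """Messages that move a block-distributed array from n_src members to n_dst members.

    Every overlapping (source block, destination block) pair becomes one direct message,
    so all members of both collections take part instead of funnelling through one node.
    """
    schedule: List[Message] = []
    for s in range(n_src):
        s0, s1 = block_bounds(length, n_src, s)
        for d in range(n_dst):
            d0, d1 = block_bounds(length, n_dst, d)
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:
                schedule.append((s, d, lo, hi))
    return schedule
```

For reference, the reported 874 Mbit/s on a 1 Gbit/s link corresponds to about 87% of the nominal link capacity.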
ISBN (print): 9780769522296
Co-array Fortran (CAF), a small set of extensions to Fortran 90, is an emerging model for scalable, global address space parallel programming. CAF's global address space programming model simplifies the development of single-program-multiple-data parallel programs by shifting the burden of managing the details of communication from developers to compilers. This paper describes CAFC, a prototype implementation of an open-source, multiplatform CAF compiler that generates code well suited to today's commodity clusters. The CAFC compiler translates CAF into Fortran 90 plus calls to one-sided communication primitives. The paper describes key details of CAFC's approach to generating efficient code for multiple platforms. Experiments compare the performance of CAF and MPI versions of several NAS parallel benchmarks on an Alpha cluster with a Quadrics interconnect, an Itanium 2 cluster with a Myrinet 2000 interconnect, and an Itanium 2 cluster with a Quadrics interconnect. These experiments show that CAFC compiles CAF programs into code whose performance is roughly equal to that of hand-optimized MPI programs.
The Message Passing Interface (MPI) is an effective programming technique for implementing parallel programs for distributed computation. As these applications run, various types of irregularities can occur, including those resulting from intrusions, user misbehavior, corrupted data, deadlocks or failure of cluster components. We compare different artificial intelligence (AI) techniques that can be used to implement a lightweight monitoring and detection system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when observing the flow of library function calls and OS system calls of parallel programs written with MPI. We demonstrate that MPI programs can be monitored in real time with high accuracy, in some cases with a 0% false positive rate, and we show that the added computational load on each node is small. Finally, we demonstrate that simple deterministic methods perform poorly when the program flow grows in size and variety, and that more complex methods are required.
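The abstract does not name the algorithms compared. One widely used deterministic baseline for call-flow monitoring is a sliding-window (n-gram) database of the call sequences seen in normal runs; the sketch below shows that idea under that assumption. `NgramDetector`, the window size of 3 and the example traces are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Gram = Tuple[str, ...]


def ngrams(trace: Sequence[str], n: int) -> List[Gram]:
    """All length-n windows of a call trace, in order."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


class NgramDetector:
    """Deterministic detector: flags any window never seen during normal training runs."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.normal: Set[Gram] = set()

    def train(self, traces: Iterable[Sequence[str]]) -> None:
        for trace in traces:
            self.normal.update(ngrams(trace, self.n))

    def anomalies(self, trace: Sequence[str]) -> List[Gram]:
        return [g for g in ngrams(trace, self.n) if g not in self.normal]


# Hypothetical traces: the suspect run contains an unexpected execve system call.
normal_run = ["MPI_Init", "MPI_Comm_rank", "MPI_Send", "MPI_Recv", "MPI_Finalize"]
suspect_run = ["MPI_Init", "MPI_Comm_rank", "MPI_Send", "execve", "MPI_Recv", "MPI_Finalize"]

detector = NgramDetector(n=3)
detector.train([normal_run])
```

Because every legitimate but unseen window is also flagged, a detector of this kind tends to raise more false positives as the program flow grows in size and variety, which is one plausible reason simple deterministic methods degrade in that regime.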
This work presents an interval optimization technique to compute the global minimum of the objective cost function arising in system redundancy and unit redundancy optimization problems. The new technique is compared with the Lagrange multiplier method, which is commonly used for this type of optimization problem. Some illustrative examples are considered.
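Interval methods enclose the global minimum by bounding the objective over whole boxes and discarding boxes that cannot contain it. The sketch below is a generic one-dimensional interval branch-and-bound under stated assumptions: `Interval`, `minimize` and the toy objective `(x - 2) * (x - 2) + 1` are hypothetical, it ignores outward rounding (so its bounds are not rigorous in floating point), and it handles neither the integer variables nor the reliability constraints of real redundancy problems.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Operand") -> "Interval":
        o = as_interval(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    __radd__ = __add__

    def __sub__(self, other: "Operand") -> "Interval":
        o = as_interval(other)
        return Interval(self.lo - o.hi, self.hi - o.lo)

    def __rsub__(self, other: "Operand") -> "Interval":
        return as_interval(other) - self

    def __mul__(self, other: "Operand") -> "Interval":
        o = as_interval(other)
        products = (self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi)
        return Interval(min(products), max(products))

    __rmul__ = __mul__

    def width(self) -> float:
        return self.hi - self.lo

    def mid(self) -> float:
        return 0.5 * (self.lo + self.hi)


Operand = Union[Interval, int, float]


def as_interval(x: Operand) -> Interval:
    return x if isinstance(x, Interval) else Interval(float(x), float(x))


def minimize(f: Callable, box: Interval, tol: float = 1e-6) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the minimum of f over box (interval branch-and-bound)."""
    upper = math.inf  # best objective value found at a sample point so far
    lower = math.inf  # smallest enclosure bound among fully refined boxes
    work = [box]
    while work:
        x = work.pop()
        fx = f(x)  # interval enclosure of f over the whole box
        if fx.lo > upper:
            continue  # this box cannot contain the global minimum
        upper = min(upper, f(x.mid()))  # any point value bounds the minimum from above
        if x.width() < tol:
            lower = min(lower, fx.lo)
            continue
        m = x.mid()
        work.extend([Interval(x.lo, m), Interval(m, x.hi)])
    return lower, upper


lower, upper = minimize(lambda x: (x - 2) * (x - 2) + 1, Interval(0.0, 5.0))
```

The same objective expression works on floats (point evaluation for the upper bound) and on intervals (enclosure for pruning), which is why the operators accept both.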
This paper presents an algorithm that generates fixed-polarity Reed-Muller (FPRM) spectral coefficients for five-valued functions. The algorithm takes the disjoint-cube reduced representation of the input function and operates directly on it to obtain its FPRM spectral coefficients. It is simple and can be implemented with a small amount of storage space. Experimental results of the algorithm for several five-valued test files are shown at the end of the paper, and the generation of these test files from MCNC binary benchmarks is also described.
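The cube-based procedure itself is not reproduced here. As a reference point, the sketch below computes FPRM coefficients from a full truth vector over GF(5), assuming one common definition of the multiple-valued fixed-polarity expansion, f(x) = sum over k of a_k * prod_v ((x_v + c_v) mod 5)^(k_v); the paper's exact definition may differ. It takes the dense 5^n truth-vector route that a disjoint-cube algorithm is meant to avoid, so it is only practical for small n. `fprm_spectrum`, `evaluate` and `inverse_mod_p` are hypothetical names.

```python
from itertools import product
from typing import List, Sequence

P = 5  # five-valued logic: all arithmetic is over the field GF(5)


def inverse_mod_p(matrix: List[List[int]], p: int = P) -> List[List[int]]:
    """Invert a square matrix over GF(p) with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % p)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)  # modular inverse (Python 3.8+)
        aug[col] = [v * inv % p for v in aug[col]]
        for r in range(n):
            factor = aug[r][col]
            if r != col and factor:
                aug[r] = [(a - factor * b) % p for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fprm_spectrum(truth: Sequence[int], n: int, polarity: Sequence[int]) -> List[int]:
    """FPRM coefficients of an n-variable five-valued function.

    truth[i] is f at the point whose base-5 digits (most significant first) are the
    variable values; polarity[v] is the constant c_v in the literal (x_v + c_v) mod 5.
    """
    # Inverse of the Vandermonde matrix V[y][k] = y^k (with 0^0 = 1).
    vinv = inverse_mod_p([[pow(y, k, P) for k in range(P)] for y in range(P)])
    coeffs = [t % P for t in truth]
    for axis, c in enumerate(polarity):
        stride = P ** (n - 1 - axis)
        for base in range(P ** n):
            if (base // stride) % P:
                continue  # visit each 1-D fiber once, from its digit-0 entry
            fiber = [coeffs[base + d * stride] for d in range(P)]
            shifted = [fiber[(y - c) % P] for y in range(P)]  # reindex by y = (x + c) mod 5
            for k in range(P):
                coeffs[base + k * stride] = sum(vinv[k][y] * shifted[y] for y in range(P)) % P
    return coeffs


def evaluate(coeffs: Sequence[int], n: int, polarity: Sequence[int], point: Sequence[int]) -> int:
    """Evaluate the expansion sum_k a_k * prod_v ((x_v + c_v) mod 5)^(k_v) over GF(5)."""
    total = 0
    for a, ks in zip(coeffs, product(range(P), repeat=n)):
        term = a
        for x, c, k in zip(point, polarity, ks):
            term = term * pow((x + c) % P, k, P) % P
        total = (total + term) % P
    return total
```

Because the transform is separable, it is applied one variable at a time; evaluating the resulting expansion at every point recovers the original truth vector, which is a convenient correctness check.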
This paper presents research on the synthesis and implementation of a parallel-pipelined hardware genetic algorithm (PPHGA) using very high speed integrated circuit hardware description language (VHDL) to program field programmable gate arrays (FPGAs). The design is divided into several modules, which operate autonomously once the system starts to run and communicate with each other using a handshaking protocol. The PPHGA is then tested on three applications to evaluate its optimization power: linear interpolation, thermistor data processing, and vehicle acceleration computation.
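The PPHGA itself is VHDL hardware; the Python sketch below only mirrors its modular structure as a software model, under stated assumptions. Selection, crossover and mutation are separate stages, and each stage pulls data from its upstream neighbour only when it is ready, loosely analogous to a handshake between modules. The operators (binary tournament, single-point crossover, bit-flip mutation), the elitism, the one-count `fitness` and the names `run_ga`, `selection_stage`, `crossover_stage` and `mutation_stage` are hypothetical choices, not taken from the paper.

```python
import random
from typing import Iterator, List, Tuple

Individual = Tuple[int, ...]


def fitness(ind: Individual) -> int:
    """Hypothetical fitness: number of 1 bits (stands in for an application objective)."""
    return sum(ind)


def selection_stage(pop: List[Individual], rng: random.Random) -> Iterator[Individual]:
    while True:  # binary tournament: emit the fitter of two random members
        a, b = rng.choice(pop), rng.choice(pop)
        yield a if fitness(a) >= fitness(b) else b


def crossover_stage(parents: Iterator[Individual], rng: random.Random) -> Iterator[Individual]:
    while True:  # single-point crossover on each pair pulled from upstream
        p1, p2 = next(parents), next(parents)
        cut = rng.randrange(1, len(p1))
        yield p1[:cut] + p2[cut:]
        yield p2[:cut] + p1[cut:]


def mutation_stage(children: Iterator[Individual], rng: random.Random,
                   rate: float) -> Iterator[Individual]:
    for child in children:  # flip each bit independently with probability `rate`
        yield tuple(bit ^ int(rng.random() < rate) for bit in child)


def run_ga(bits: int = 16, pop_size: int = 20, generations: int = 30,
           rate: float = 0.02, seed: int = 1) -> Tuple[Individual, List[int]]:
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(bits)) for _ in range(pop_size)]
    history = [max(map(fitness, pop))]
    for _ in range(generations):
        elite = max(pop, key=fitness)  # elitism: the best individual always survives
        # Each stage produces only when the next stage requests data (pull model).
        stream = mutation_stage(crossover_stage(selection_stage(pop, rng), rng), rng, rate)
        pop = [elite] + [next(stream) for _ in range(pop_size - 1)]
        history.append(max(map(fitness, pop)))
    return max(pop, key=fitness), history


best, history = run_ga()
```

Elitism makes the best fitness per generation non-decreasing; in hardware, the pull-based chaining would correspond to request/acknowledge signalling between pipeline modules.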