Performance tuning of parallel programs, considering the current status and future developments in parallel programming paradigms and parallel system architectures, remains an important topic even if the single CPU pe...
This paper discusses an extension of Haskell by support for nested data-parallel programming in the style of the special-purpose language Nesl. The extension consists of a parallel array type, array comprehensions, an...
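The parallel array comprehensions this abstract describes can be illustrated with a short sketch. The sketch below is plain Python, not the Haskell extension itself: the nested list comprehension stands in for a NESL-style nested "apply-to-each", here computing a sparse matrix-vector product (the classic nested data-parallel example, since inner rows have irregular lengths). All names are illustrative.

```python
# Hypothetical Python analogue of a NESL-style nested data-parallel program.
# Each sparse row is a list of (column, value) pairs; the outer comprehension
# is the apply-to-each over rows, the inner one over a row's non-zeros.
def sparse_matvec(rows, x):
    return [sum(v * x[c] for (c, v) in row) for row in rows]

rows = [[(0, 2.0), (2, 1.0)],   # row 0: 2*x0 + 1*x2
        [(1, 3.0)]]             # row 1: 3*x1
x = [1.0, 2.0, 5.0]
print(sparse_matvec(rows, x))   # [7.0, 6.0]
```

In the Haskell extension, a flattening transformation would turn such nested comprehensions into flat data-parallel operations; the Python version only shows the programming model, not the implementation.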
A relatively new trend in parallel programming scheduling is the so-called mixed task and data scheduling. It has been shown that mixing task and data parallelism to solve large computational applications often yields...
The architecture, system software, and programming technology of a neurocluster based on NM6403 neuroprocessors are discussed. Special attention was paid to operating system structure, data and control flow between subsystems, ...
We describe and evaluate a new approach to object replication in Java, aimed at improving the performance of parallel programs. Our programming model allows the programmer to define groups of objects that can be replicated and updated as a whole, using reliable, totally-ordered broadcast to send update methods to all machines containing a copy. The model has been implemented in the Manta high-performance Java system. We evaluate system performance both with microbenchmarks and with a set of five parallel applications. For the applications, we also evaluate ease of programming compared to RMI implementations. We present performance results for a Myrinet-based workstation cluster as well as for a wide-area distributed system consisting of four such clusters. The microbenchmarks show that updating a replicated object on 64 machines takes only about three times the RMI latency in Manta. Applications using Manta's object replication mechanism perform at least as fast as manually optimized versions based on RMI, while keeping the application code as simple as that of naive versions that use shared objects without taking locality into account. Using a replication mechanism in Manta's runtime system enables several unmodified applications to run efficiently even on the wide-area system. Copyright (C) 2001 John Wiley & Sons, Ltd.
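The programming model described above can be sketched compactly. The following is a minimal Python sketch, not the Manta API: all class and method names are hypothetical. The key idea is that update methods are applied to every replica in one total order (standing in for reliable, totally-ordered broadcast), while reads are served from the local copy without communication.

```python
# Sketch of the replicated-object model (hypothetical names, not Manta's API).
class Counter:
    def __init__(self):
        self.value = 0
    def add(self, n):          # update method: must reach all replicas
        self.value += n

class ReplicatedGroup:
    """A group of objects replicated and updated as a whole."""
    def __init__(self, factory, machines):
        self.replicas = {m: factory() for m in machines}
    def update(self, method, *args):
        # Stand-in for reliable, totally-ordered broadcast: every replica
        # applies every update in the same fixed order.
        for m in sorted(self.replicas):
            getattr(self.replicas[m], method)(*args)
    def read(self, machine):
        return self.replicas[machine]   # local read, no communication

group = ReplicatedGroup(Counter, ["node0", "node1", "node2"])
group.update("add", 5)
group.update("add", 2)
print(group.read("node2").value)   # 7, identical on every replica
```

Because all replicas see the same updates in the same order, reads never need remote communication, which is where the speedup over naive RMI shared objects comes from.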
This paper contributes to the solution of several open problems with parallel programming tools and their integration with performance evaluation environments. First, we propose interactive compilation scenarios instead of the usual black-box-oriented use of compiler tools. In such scenarios, information gathered by the compiler and the compiler's reasoning are presented to the user in meaningful ways and on-demand. Second, a tight integration of compilation and performance analysis tools is advocated. Many of the existing, advanced instruments for gathering performance results are being used in the presented environment and their results are combined in integrated views with compiler information and data from other tools. Initial instruments that assist users in "data mining" this information are presented and the need for much stronger facilities is explained. The URSA Family provides two tools addressing these issues. URSA MINOR supports a group of users at a specific site, such as a research or development project. URSA MAJOR complements this tool by making available the gathered results to the user community at large via the World-wide Web. This paper presents objectives, functionality, experience, and next development steps of the URSA tool family. Two case studies are presented that illustrate the use of the tools for developing and studying parallel applications and for evaluating parallelizing compilers.
The aim of this paper is to search for techniques to accelerate simulations by exploiting the parallelism available in current multicomputers, and to use these techniques to study a class of Petri nets called high-level algebraic nets. These nets exploit the rich theory of algebraic specifications for high-level Petri nets. They also gain a great deal of modelling power by representing dynamically changing items as structured tokens, while algebraic specifications provide an adequate and flexible instrument for handling structured items. We focus on ECATNets (Extended Concurrent Algebraic Term Nets), a kind of high-level algebraic Petri net with limited-capacity places. Three distributed simulation techniques have been considered: asynchronous conservative, asynchronous optimistic, and synchronous. These algorithms have been implemented on a network of workstations with MPI (Message Passing Interface). The influence that factors such as the characteristics of the simulated models, the organisation of the simulators, and the characteristics of the target multicomputer have on the performance of the simulations has been measured and characterized. It is concluded that distributed simulation of ECATNets on a multicomputer system can in fact achieve speedup over sequential simulation, even for small-scale simulation models.
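Two of the modelling features named above, structured tokens and limited-capacity places, can be illustrated with a toy sequential sketch. This is plain Python with illustrative names; the paper's distributed conservative, optimistic, and synchronous algorithms are not reproduced here.

```python
# Toy ECATNet-flavoured net: places hold structured tokens (tuples), and a
# transition fires only if an input token is available and the output place
# is below its capacity.
from collections import deque

places = {"in": deque([("job", 1), ("job", 2)]), "out": deque()}
CAPACITY = {"out": 1}   # "out" is a limited-capacity place

def fire(src, dst, transform):
    """Move one token from src to dst through transform, if capacity allows."""
    if places[src] and len(places[dst]) < CAPACITY.get(dst, float("inf")):
        places[dst].append(transform(places[src].popleft()))
        return True
    return False

fired_first = fire("in", "out", lambda t: ("done", t[1]))
fired_second = fire("in", "out", lambda t: ("done", t[1]))  # blocked: "out" full
print(fired_first, fired_second, list(places["out"]))       # True False [('done', 1)]
```

In a distributed simulation, each simulator would own a subset of such places and exchange token events as timestamped messages, which is where the conservative/optimistic synchronisation question arises.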
Experiments on large data sets are computationally expensive; signal processing analysis on a single CPU leads to unacceptably long execution times. The paper presents initial experiments on calculating the time-frequency power spectrum using a coarse-grained parallel programming technique. Experimental speedup factors are given and discussed. The measured speedup factor of the parallel time-frequency power spectrum calculation is sublinear, which indicates that the time-frequency power spectrum is a suitable application for parallel programming. The parallel efficiency is acceptable, with the lowest value of 75.1% occurring at N = 10. The maximum speedup factor of 9.1 is obtained at N = 12, at 75.3% efficiency.
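The figures quoted above relate through the standard definitions: speedup S(N) = T1/TN and parallel efficiency E(N) = S(N)/N. A quick check in Python confirms they are consistent; the quoted speedup of 9.1 on N = 12 processors gives an efficiency of about 75.8%, matching the reported 75.3% up to rounding of the measured speedup.

```python
# Parallel efficiency E(N) = S(N) / N, with S(N) = T1 / TN the speedup.
def efficiency(speedup, n):
    return speedup / n

print(round(efficiency(9.1, 12), 3))   # 0.758, i.e. about 75.8%
```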
ISBN (print): 0769511538
Presents a distributed implementation of the Structured Gamma programming language, a language based on the Gamma multi-set rewriting paradigm. In addition to the advantages introduced by Gamma, Structured Gamma offers implicit concurrent behavior and a type system that not only defines types but also automatically verifies user-defined types at compile time. The problems and mechanisms involved in an MPI-based implementation of Structured Gamma using a type-checking engine based on the most general unifier (MGU) are investigated.
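The Gamma paradigm underlying this work can be illustrated in a few lines. The sketch below is plain Python, not Structured Gamma: a program is a reaction rule applied to pairs of multiset elements until no pair satisfies the reaction condition. The rule x, y -> max(x, y) reduces a multiset to its maximum, a classic Gamma example. In Gamma, independent reactions may proceed concurrently (the source of its implicit parallelism); this version applies them sequentially.

```python
# Gamma-style multiset rewriting, sequential sketch with illustrative names.
def gamma(multiset, react, condition):
    """Apply the reaction rule to pairs until no pair satisfies condition."""
    ms = list(multiset)
    changed = True
    while changed:
        changed = False
        for i in range(len(ms)):
            for j in range(len(ms)):
                if i != j and condition(ms[i], ms[j]):
                    x, y = ms[i], ms[j]
                    ms = [e for k, e in enumerate(ms) if k not in (i, j)]
                    ms.extend(react(x, y))   # consumed elements -> products
                    changed = True
                    break
            if changed:
                break
    return ms

# Rule: any pair x, y reacts into max(x, y); the multiset converges to its max.
print(gamma([3, 1, 4, 1, 5], lambda x, y: [max(x, y)], lambda x, y: True))  # [5]
```

A distributed implementation must additionally partition the multiset across MPI processes and detect global termination, which is where the problems investigated in the paper arise.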
We consider program modules, e.g. procedures, functions, and methods, as the basic unit for exploiting speculative parallelism in existing codes. We analyze how much inherent and exploitable parallelism exists in a set of C and Java programs on a set of chip-multiprocessor architecture models, and identify which inherent program features, as well as architectural deficiencies, limit the speedup. Our data complement previous limit studies by indicating that the programming style (object-oriented versus imperative) does not seem to have any noticeable impact on the achievable speedup. Further, we show that as few as eight processors are enough to exploit all of the inherent parallelism. However, the memory-level data dependence resolution and thread management mechanisms of recent CMP proposals may impose overheads that severely limit the speedup obtained.
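Module-level speculation, as studied above, can be sketched at a toy level. The following is plain Python with hypothetical names and is not any of the studied CMP proposals: a call g() is executed speculatively against the pre-f() memory state while f() runs; if g() read a location that f() wrote, a memory-level data dependence was violated, so the speculation is squashed and g() re-executes with f()'s writes visible.

```python
# Toy sketch of module-level speculative execution with squash-and-reexecute.
def run_speculative(f, g, memory):
    snapshot = dict(memory)      # state g() speculates against (pre-f)
    writes = f(memory)           # non-speculative call; returns written locations
    reads = g(snapshot)          # speculative call; returns read locations
    if writes & reads:           # f wrote something g read: violation
        g(memory)                # squash: re-execute g with f's writes visible
        return "squashed"
    memory.update(snapshot)      # no conflict: commit g's speculative writes
    return "committed"

def f(mem):
    mem["x"] = 1
    return {"x"}                 # locations written

def g(mem):
    mem["y"] = mem.get("x", 0) + 1
    return {"x"}                 # locations read

mem = {}
outcome = run_speculative(f, g, mem)
print(outcome, mem["y"])         # squashed 2
```

The cost of detecting such violations and managing the speculative threads is exactly the overhead the abstract identifies as limiting the speedup in practice.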