ISBN: (print) 0780372239
Presents a visualization technique based on particle tracking. The technique consists of defining a set of points distributed on a closed surface and following the deformations of that surface as the velocity field changes in time. Deformations of the surface contain information about the dynamics of the flow; in particular, it is possible to identify zones where stretching and folding of the flow occur. Because the points on the surface are independent of each other, the trajectory of each point can be calculated concurrently. Two parallel algorithms are studied: the first for a shared-memory Origin 2000 supercomputer and the second for a distributed-memory PC cluster. The technique is applied to a fluid moving by natural convection inside a cubic container.
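The independence of the surface points can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the analytic `velocity` field (a rotation plus a time-varying shear, standing in for the convection flow), the midpoint integrator, the Fibonacci-lattice seeding of a sphere, and the function names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def velocity(p, t):
    """Hypothetical velocity field: rotation about z plus a time-varying shear."""
    x, y, z = p
    return (-y, x, 0.1 * math.sin(t) * x)


def advect(p, t0=0.0, dt=0.01, steps=100):
    """Track a single point with midpoint (second-order Runge-Kutta) steps."""
    t = t0
    for _ in range(steps):
        k1 = velocity(p, t)
        mid = tuple(pi + 0.5 * dt * ki for pi, ki in zip(p, k1))
        k2 = velocity(mid, t + 0.5 * dt)
        p = tuple(pi + dt * ki for pi, ki in zip(p, k2))
        t += dt
    return p


def sphere_points(n, radius=1.0):
    """Seed n roughly evenly spaced points on a sphere (Fibonacci lattice)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((radius * r * math.cos(golden * i),
                       radius * r * math.sin(golden * i),
                       radius * z))
    return points


def track_surface(points, workers=4):
    """Advect every point independently; no point reads another point's state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(advect, points))
```

Because `advect` touches only its own point, the same function could be handed to a process pool or split across cluster nodes without changing the result. In CPython a thread pool mainly demonstrates the structure; CPU-bound speedup would need processes.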
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of software DSM failures increases as the system size grows. This paper presents a new, efficient logging protocol for adaptive software DSM (ADSM), called adaptive logging (AL). It is suitable for both coordinated and independent checkpointing, since it speeds up the recovery process and eliminates the unbounded rollback problem associated with independent checkpointing. By leveraging the existing coherence data maintained by ADSM, our AL protocol adapts to log only the unrecoverable data (data that cannot be recreated or retrieved after a failure) necessary for correct recovery, reducing both the number of messages logged and the amount of logged data. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our AL protocol against the previous message logging (ML) protocol by implementing both protocols in TreadMarks-based ADSM. The experimental results show that our AL protocol consistently outperforms the ML protocol: our protocol increases execution time only slightly, by 2% to 10%, during failure-free execution, while the ML protocol lengthens execution time many-fold because of its larger log size and higher number of logged messages. Our AL-based recovery also outperforms ML-based recovery by 9% to 17% for the parallel applications examined.
The VIPER tool visualises the execution of a parallel program. VIPER focuses on the class of parallel programs constructed around the Mona Lisa parallel programming paradigm. Mona Lisa is a typed paradigm that provides the user with a small set of high-level primitives for data exchange. The information provided by VIPER is directly related to the execution of these primitives, which makes the tool more suitable for behavioural analysis and debugging than paradigm-independent tools such as ParaGraph. VIPER supplies five graphical views. The most important are an animation view, showing the parallel program as a collection of interacting modules, and a space-time view, displaying the module interaction over time. These views are constructed from trace messages produced by the parallel program during execution. The trace messages have to be correctly ordered to allow a consistent observation of the distributed computation. VIPER performs this run construction on the fly (allowing on-line visualisation), with minimal latency and maximum efficiency in terms of trace message generation, size and processing.
Exploiting clusters of workstations as a single computational resource is an attractive alternative to conventional multiprocessor technologies. However, the class of parallel applications that can benefit from clusters is restricted by their relatively high latency and low throughput, both consequences of conventional networking. LANs offer the best performance but also limit the scope for effective clustering to a single room or building. Another major difference remains: multiprocessors can reasonably be programmed under the "error-free" assumption, but applications cannot be run on distributed clusters without programming against the potential for remote faults. Emergent high-speed switched networks such as ATM have the potential to reduce latency and increase bandwidth in the distributed scenario, and therefore to extend the class of applications suitable for running on clusters. In addition, the virtual network capability of ATM removes some of the geographical constraints on clustering. But can ATM guarantee the kind of application-level connection reliability that is taken for granted in multiprocessor environments? This paper reviews the capabilities of modern high-speed networks, as exemplified by ATM, and their relevance to parallel and distributed systems. In particular, it asks whether Quality of Service (QoS) can benefit parallel programming on distributed platforms.
Data parallelism is a powerful approach to parallel computation, particularly when it is used with complex data types. Categorical data types are extensions of abstract data types that structure computations in a way that is useful for parallel implementation. In particular, they decompose the search for good algorithms on a data type into subproblems; all homomorphisms on them can be implemented by a single recursive, and often parallel, schema; and they are equipped with an equational system that can be used for software development by transformation.
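The claim that every homomorphism fits one recursive schema can be made concrete for the simplest case, lists under concatenation. The Python sketch below is an illustrative assumption rather than the paper's formulation: the name `homomorphism` and the three example instances are hypothetical, and `op` is assumed to be associative with `identity` as its unit.

```python
def homomorphism(f, op, identity):
    """Build a list homomorphism h from its three components.

    h([])      = identity
    h([x])     = f(x)
    h(a ++ b)  = op(h(a), h(b))    # op must be associative
    """
    def h(xs):
        if len(xs) == 0:
            return identity
        if len(xs) == 1:
            return f(xs[0])
        mid = len(xs) // 2
        # The two halves are independent, so they could be evaluated in parallel.
        return op(h(xs[:mid]), h(xs[mid:]))
    return h


# Three different functions obtained from the same schema.
length = homomorphism(lambda _: 1, lambda a, b: a + b, 0)
total = homomorphism(lambda x: x, lambda a, b: a + b, 0)
maximum = homomorphism(lambda x: x, max, float("-inf"))
```

Only `f`, `op`, and `identity` change between `length`, `total`, and `maximum`; the divide-and-combine skeleton, and therefore any parallel implementation of it, is shared.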
This paper presents reduction recognition and parallel code generation strategies for distributed-memory multiprocessors. We describe techniques to recognize a broad range of implicit reduction operations, including those that involve statements at multiple loop nesting levels and those intermixed with conditional control flow. We introduce two new optimizations: factoring, which increases data locality for SUM and PRODUCT reductions, and index encoding, which enables a single global communication to accomplish both an extreme value reduction and an extreme value location reduction. We have implemented these techniques in the dHPF compiler for High Performance Fortran (HPF). We evaluate their effectiveness experimentally by compiling several reduction benchmarks with dHPF and two commercial HPF compilers, and comparing the performance of the generated code on an IBM SP2. Our results show that our recognition techniques are more powerful and that our index encoding and factoring optimizations can improve performance by a factor of two where they apply.
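The idea of obtaining an extreme value and its location from one combined reduction can be sketched in Python by reducing over (value, index) pairs. This is an illustration only: the pair representation, the tie-breaking rule, and the names `local_maxloc`, `combine`, and `maxloc` are assumptions, and the encoding actually used by dHPF may differ. The list of partial results stands in for the per-processor values that a single global communication would combine.

```python
from functools import reduce


def local_maxloc(chunk, offset):
    """Per-processor step: the local maximum and its global index."""
    best = max(range(len(chunk)), key=lambda i: chunk[i])  # first maximum wins
    return (chunk[best], offset + best)


def combine(a, b):
    """Associative operator on (value, index) pairs.

    The larger value wins; on a tie the smaller index wins, so the result
    does not depend on how the data were partitioned.
    """
    if a[0] > b[0] or (a[0] == b[0] and a[1] < b[1]):
        return a
    return b


def maxloc(data, nprocs=4):
    """Simulate nprocs processors on a non-empty list with one final reduction."""
    size = (len(data) + nprocs - 1) // nprocs
    partials = [local_maxloc(data[s:s + size], s)
                for s in range(0, len(data), size)]
    return reduce(combine, partials)
```

For example, `maxloc([3, 9, 2, 9, 1, 7])` yields `(9, 1)`: the value and the position come out of the same reduction instead of a value reduction followed by a separate location reduction.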
MICA (Mapped Interconnection-Cached Architecture) is a novel architecture that combines large reconfigurable networks with small, fast, on-line routing crossbar switches. It offers a good match for parallel applications that exhibit switching locality. Switching locality means that the need to "switch" or route information to or from each PE is limited to a small set of sources or destinations. A parallel programming paradigm is introduced that attempts to minimize the movement of information by reconfiguring the relative proximity of the PEs. We aim to complete most communication requests with only two levels of routing decisions among a small set of channels. Multi-hop routing is needed less often, resulting in better performance.
Exploiting thread-level parallelism (TLP) is a promising way to improve the performance of applications with the advent of general-purpose, cost-effective uni-processor and shared-memory multiprocessor systems. In this paper, we describe the OpenMP implementation in the Intel® C++ and Fortran compilers for Intel platforms. We present our major design considerations and decisions in the Intel compiler for generating efficient multithreaded code guided by OpenMP directives and pragmas. We describe several transformation phases in the compiler for OpenMP parallelization. In addition to compiler support, the OpenMP runtime library is a critical part of the Intel compiler. We present runtime techniques developed in the Intel OpenMP runtime library for exploiting thread-level parallelism, as well as for integrating the OpenMP support with other forms of threading, termed sibling parallelism. The performance results for a set of benchmarks show good speedups over well-optimized serial code on Intel® Pentium- and Itanium-processor based systems.
Due to the attractive properties of the wavelet transform, wavelet filter banks are frequently used in areas such as signal processing and communication systems. Furthermore, the increasing computational power of microprocessors has led to a leap in the use of techniques such as parallel processing, concurrent programming, and VHDL design. However, the inherently sequential tree structure of traditional wavelet theory does not merge efficiently with these techniques. This work presents an algorithm to generate uniform and non-uniform filter banks in a parallel structure. The algorithm generalizes the à trous and Mallat algorithms for parallelized filter bank design, which is efficient for parallel processing, concurrent programming, and VHDL design. The algorithm generates a set of parallelized perfect-reconstruction filter banks for an arbitrary number of end-nodes of a traditional tree structure. The algorithm encompasses both the decimated and the undecimated cases. Examples of image and speech signal applications are presented.
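How a sequential tree can be flattened into parallel branches is easiest to see in the undecimated (à trous) case, where a cascade of filters equals a single filter obtained by convolving them. The Python sketch below shows only that standard equivalence and is not the paper's algorithm; the function names and the unnormalized integer Haar-like filters in the usage note are assumptions chosen so the arithmetic is exact.

```python
def convolve(a, b):
    """Full linear convolution of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def upsample(h, factor):
    """Insert factor - 1 zeros between taps (the 'holes' of a trous)."""
    out = [0] * ((len(h) - 1) * factor + 1)
    out[::factor] = h
    return out


def equivalent_filters(low, high, levels):
    """One filter per end-node of an undecimated dyadic tree.

    Returns the detail (high-pass) branch of each level followed by the
    final approximation (low-pass) branch. Each filter can be applied to
    the input directly, independently of the others.
    """
    filters = []
    prefix = [1]  # cascade of low-pass stages applied so far
    for level in range(levels):
        step = 2 ** level
        filters.append(convolve(prefix, upsample(high, step)))
        prefix = convolve(prefix, upsample(low, step))
    filters.append(prefix)
    return filters
```

With `low = [1, 1]` and `high = [1, -1]`, `equivalent_filters(low, high, 2)` gives `[[1, -1], [1, 1, -1, -1], [1, 1, 1, 1]]`. Filtering a signal with the second entry produces the same output as filtering with `low` and then with the upsampled `high`, but without waiting for the first stage.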
Streaming applications often require a parallel Model of Computation (MoC) to specify their behavior and to facilitate mapping onto Multi-Processor System-on-Chip (MPSoC) platforms. The varied performance requirements and resource budgets of embedded systems call for an efficient design space exploration (DSE) approach to select the best design from a design space consisting of a large number of design choices. However, existing DSE approaches explore a design space that includes only architecture and mapping alternatives for an initial application specification given by the application designer. In this article, we first show that a design often might not be optimal if alternative specifications of a given application are not taken into account. We further argue that the best alternative specification consists only of independent and load-balanced application tasks. Based on the Polyhedral Process Network (PPN) MoC, we present an approach to analyze an initial PPN and transform it into an alternative one that contains only independent processes, if possible. Finally, by prototyping real-life applications on both FPGA-based MPSoCs and desktop multi-core platforms, we demonstrate that mapping the alternative application specification results in a large performance gain compared to approaches in which alternative application specifications are not taken into account.