The implementation of a parallel algorithm for estimating non-rigid motion vectors using a semi-fluid motion model applied to time-varying satellite imagery is described. Deformable motion tracking of non-rigid biolog...
详细信息
The implementation of a parallel algorithm for estimating non-rigid motion vectors using a semi-fluid motion model applied to time-varying satellite imagery is described. Deformable motion tracking of non-rigid biological objects and remotely sensed objects such as clouds, atmospheric aerosols and gases, polar sea ice, or ocean currents are important application domains for the Semi-fluid Motion Analysis (SMA) algorithm. The focus of this paper is on the parallelization of the SMA algorithm for the MasPar MP-2 architecture. Implementation issues that were evaluated in order to make it feasible to explore dense semi-fluid motion estimates of rapid-scan time-varying geostationary satellite imagery of clouds and weather patterns are described. Cloud motion vectors from the SMA algorithm can be used to estimate the wind field that would be useful in a variety of meteorological applications. Comparisons between the parallel and sequential implementations of the SMA algorithm, and with manual results are briefly discussed.
Homing sequences play an important role in the testing of finite state systems and have been used in a number of applications such as hardware fault detection, protocol verification, and learning algorithms etc. Recen...
详细信息
Homing sequences play an important role in the testing of finite state systems and have been used in a number of applications such as hardware fault detection, protocol verification, and learning algorithms etc. Recent applications of homing sequences involve large DFAs with thousands of states. Such applications motivate the design of a parallel algorithm for this problem. The author present a deterministic parallel algorithm of time complexity O(/spl radic/nlog/sup 2/n) using a polynomial number of processors on the CREW PRAM model. No faster deterministic parallel algorithm is known for this problem. The author also discusses the parallel complexity of some related problems.
We present new algorithmic techniques for a classical research problem, runtime redistribution of an array from one block-cyclic layout to another. Our methodology for reducing communication overheads is based on a ge...
详细信息
We present new algorithmic techniques for a classical research problem, runtime redistribution of an array from one block-cyclic layout to another. Our methodology for reducing communication overheads is based on a generalized circulant matrix formalism. Using this formalism, we derive direct, indirect, and hybrid communication schedules for the cyclic redistribution problem when the block size changes by an integer factor K. We have also developed formulae to estimate the timing performance of each of these schedules for a given parallel machine and redistribution problem. In our indirect communication schedule, blocks are moved from a source processor to a destination processor through intermediate "relay" processors. This reduces the number of communication steps by an order of magnitude, in comparison with previous approaches. This algorithm performs cyclic(x) to cyclic(Kx) redistribution on P processors in [log/sub 2/K]+2 steps. Implementations of these algorithms on the Cray T3D and on the IBM SP-2 show superior performance over previous approaches. Since our algorithms are developed using MPI, they can be easily ported to different application environments. Our techniques can be used in the design of scalable redistribution libraries, in efficient implementations of the REDISTRIBUTE directive of HPF and in developing parallel algorithms for various HPC applications.
Many irregular scientific computing problems can be modeled by directed acyclic task graphs (DAGs). We present an efficient run-time system for executing general asynchronous DAG schedules on distributed memory machin...
详细信息
ISBN:
(纸本)0818672552
Many irregular scientific computing problems can be modeled by directed acyclic task graphs (DAGs). We present an efficient run-time system for executing general asynchronous DAG schedules on distributed memory machines. Our solution tightly integrates the run-time scheme with a fast communication mechanism to eliminate unnecessary overhead in message buffering and copying, and takes advantage of task dependence properties to ensure the correctness of execution. We demonstrate the applications of this scheme in sparse LU and Cholesky factorizations for which actual speedups have been hard to obtain in the literature because parallelism in these problems is irregular and limited. Our experiments on Meiko CS-2 show the promising results of our approach in exploiting irregular task parallelism with mixed granularities.
Introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented DAG consistency in software for the C...
详细信息
Introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented DAG consistency in software for the Cilk multithreaded runtime system running on a CM5 Connection Machine. Our implementation includes a DAG-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's (1969) matrix multiplication algorithm and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature. We also prove that the number F/sub P/ of page faults incurred by a user program running an P processors can be related to the number F/sub 1/ of page faults running serially by the formula F/sub P//spl les/F/sub 1/+2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.
Metacomputing is the aggregation of distributed and high-performance resources on coordinated networks. With careful scheduling, resource-intensive applications can be implemented efficiently on metacomputing systems ...
详细信息
ISBN:
(纸本)9780818675829
Metacomputing is the aggregation of distributed and high-performance resources on coordinated networks. With careful scheduling, resource-intensive applications can be implemented efficiently on metacomputing systems at the sizes of interest to developers and users. In this paper, we focus on the problem of scheduling applications on metacomputing systems. We introduce the concept of application-centric scheduling in which everything about the system is evaluated in terms of its impact on the application. Application-centric scheduling is used by virtually all metacomputer programmers to achieve performance on metacomputing systems. We describe two successful metacomputing applications to illustrate this approach, and describe AppLeS (Application-Level Scheduling) agents which generalize the application-centric scheduling approach. Finally, we show preliminary results which compare AppLeS-derived schedules with conventional strip and blocked schedules for a 2D Jacobi code.
In this paper we address the design problem of high speed recursive digital filters for real time applications. In contrast to the nonrecursive case, there exist feedback loops in recursive filters which often become ...
详细信息
In this paper we address the design problem of high speed recursive digital filters for real time applications. In contrast to the nonrecursive case, there exist feedback loops in recursive filters which often become the performance bottleneck and prevent the circuits from high speed operation. Design tactics such as pipelining andparallelprocessing offer little help in this case. To achieve more parallelism, a look-ahead transformation applied at the algorithm level may be employed. This however, often causes a drastic increase in the hardware complexity. In this paper we propose a new design approach based on distributed Arithmetic (DA) for the recursive filters. It can outperform the traditional bit parallel design in both pipelining period and initiation interval. To illustrate this new approach, we present 2 different systolic array designs for the ARMA filter. Finally, both designs are implemented by 0.8 /spl mu/m SPDM CMOS technology. For the 4-tap 8-bit wide designs, the simulation results show they can operate at 142.6 and 142.8 MHz, respectively.
A three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems is presented. The description includes a rigorous derivation of a geometric conservation l...
详细信息
A three-field arbitrary Lagrangian-Eulerian (ALE) finite element/volume formulation for coupled transient aeroelastic problems is presented. The description includes a rigorous derivation of a geometric conservation law for flow problems with moving boundaries and unstructured deformable meshes. The solution of the coupled governing equations with a mixed explicit (fluid)/implicit (structure) staggered procedure is discussed with particular reference to accuracy, stability, distributed computing, I/O transfers, subcycling andparallelprocessing. A general and flexible framework for implementing partitioned solution procedures for coupled aeroelastic problems on heterogeneous and/or parallel computational platforms is described. This framework and the explicit/implicit partitioned procedures are demonstrated with the numerical investigation on an iPSC-860 massively parallel processor of the instability of flat panels with infinite aspect ratio in supersonic airstreams.
parallel computers with distributed memory are gaining popularity on account of their optimal scalability. However, their efficient use requires a locality-preserving mapping of the application's underlying graph ...
详细信息
parallel computers with distributed memory are gaining popularity on account of their optimal scalability. However, their efficient use requires a locality-preserving mapping of the application's underlying graph structure onto the physical topology of the target platform. PROMOTER is a parallel programming model which supports an automatic mapping by the compiler by making the graph structures explicit and thus processable by the implementation. This article describes how this is done for applications with irregular and dynamic spatial structures.
The paper proposes and analyzes a scalable model of an associative distributed shared memory for massively parallel architectures. The proposed model is hierarchical and fits the modern style of structured parallel pr...
详细信息
The paper proposes and analyzes a scalable model of an associative distributed shared memory for massively parallel architectures. The proposed model is hierarchical and fits the modern style of structured parallel programming. If parallelapplications are composed of a set of modules with a well-defined scope of interaction, the proposed model can induce a memory access latency time that only logarithmically increases with the number of nodes. Experimental results show the effectiveness of the model with a transputer-based implementation.
暂无评论