作者:
A. BodeInstitut für Informatik
Lehrstuhl für Rechnertechnik und Rechnerorganisation Technische Universität München Munchen Germany
This article covers research at Technische Universitat Munchen on distributed and parallelarchitectures and applications. First, an overview on the parallel processing research organization is given. The second main ...
详细信息
This article covers research at Technische Universitat Munchen on distributed and parallelarchitectures and applications. First, an overview on the parallel processing research organization is given. The second main topic covers an integrated hierarchical programming environment TOPSYS for parallel and distributed systems developed as part of the research grant.< >
The paper describes-from a software engineering perspective-a framework for the formal development of parallelalgorithms on arbitrary architectures. The algorithms are synthesised in a transformational way, i.e. by a...
详细信息
The paper describes-from a software engineering perspective-a framework for the formal development of parallelalgorithms on arbitrary architectures. The algorithms are synthesised in a transformational way, i.e. by applying correctness preserving rewrite rules to a formal specification. The architectures are modelled by skeletons-higher order functions that represent elementary computations on a certain architecture. It is shown that the combination of transformational programming and skeletons stimulates the reuse of program derivations. Furthermore, interskeleton transformations will provide the means for architecture independent program development.< >
We present algorithms for optimally scheduling computations that comprise a sequence of complete up- and/or down-sweeps on a complete binary tree, on a parallel architecture in which the communication delay between an...
详细信息
We present algorithms for optimally scheduling computations that comprise a sequence of complete up- and/or down-sweeps on a complete binary tree, on a parallel architecture in which the communication delay between any two processors is uniform. Such computations include, for instance, those that implement broadcast, accumulation, and the parallel-prefix operator; such architectures include, for instance, networks of workstations. Our schedules are optimal in the sense of having the actual minimum time-to-completion-not just on approximation thereof-considering the time for both computation and communication. We concentrate on schedules for fine-grain tree-sweep computations-wherein communication costs are rather large relative to per-task computation cost.
Evaluation of predicates on the state of a distributed system is complicated by the lack of either a common clock or common memory. In these systems, message passing is often used to order local events globally. This ...
详细信息
Evaluation of predicates on the state of a distributed system is complicated by the lack of either a common clock or common memory. In these systems, message passing is often used to order local events globally. This leads to a partial, causal ordering of system events. Predicate evaluation algorithms based on this causal ordering generally cannot determine if an unstable predicate was true at some instant of real time without freezing the underlying application. They only determine whether or not the predicate could have occurred. This ordering is sufficient for evaluating stable predicates, but algorithms based on it require a good deal of message passing. We present two algorithms which conclusively evaluate both stable and a restricted class of unstable predicates for a given execution. The algorithms are based on the use of roughly synchronized clocks. We present an algorithm for scheduled evaluation as well as a centralized detection algorithm.
We first consider the symbolic scheduling and performance prediction of a partitioned single loop on message passing architectures with non zero communication and a sufficient number of processors. The loop body conta...
详细信息
We first consider the symbolic scheduling and performance prediction of a partitioned single loop on message passing architectures with non zero communication and a sufficient number of processors. The loop body contains a set of coarse grain tasks whose computational weights change during the course of the iterations. Using the macro dataflow task: model and software pipelining techniques, we develop an algorithm for computing a heuristic schedule, and provide analytic and experimental results on the correctness and asymptotic performance of this schedule. Using this result, we can show the impact of partitioning on the performance of parallelization. While this result is effective for a class of loops, there are many other methods proposed for loop parallelization. An interesting fundamental question is whether every instance of a nested loop can be efficiently executed, We present some positive and negative results on this issue.
We propose an algorithm for scheduling and allocation of parallel programs to message-passing architectures. The algorithm considers arbitrary computation and communication costs, arbitrary network topology, link cont...
详细信息
We propose an algorithm for scheduling and allocation of parallel programs to message-passing architectures. The algorithm considers arbitrary computation and communication costs, arbitrary network topology, link contention and underlying communication routing strategy. While our technique is static, the algorithm is quasi dynamic because it is not specific to any particular system topology and thus can be used at run-time for the processor configuration available at that time. The proposed algorithm, called Bubble Scheduling and Allocation (BSA) algorithm, works by first serializing the task graph and "injecting" all the tasks to one processor. The parallel tasks are then "bubbled up" to other processors and are inserted at appropriate time slots. The edges among the tasks are also scheduled by treating communication links between the processors as resources. The scheduling of messages on the links depends on the routing strategy, such as circuit switching and wormhole routing, of the underlying network. The proposed algorithm has admissible time complexity and is suitable for regular as well as irregular task graph structures.
Monte Carlo (MC) and molecular dynamics (MD) simulations are powerful tools for understanding the properties of systems of interacting electrons and phonons in a solid. When mobile electrons are studied, these simulat...
详细信息
Monte Carlo (MC) and molecular dynamics (MD) simulations are powerful tools for understanding the properties of systems of interacting electrons and phonons in a solid. When mobile electrons are studied, these simulations are limited to a few hundred particles. More powerful machines and algorithms must be used to address many of the most important issues in the field. We present results from using the p4 parallelprogramming system on a variety of parallelarchitectures to conduct MC and MD simulations.< >
Many computer vision and image processing (CVIP) operations can be represented as a sequence of tasks with nested loops, specified by the visual programming language Khoros. This paper addresses the automatic partitio...
详细信息
ISBN:
(纸本)0818671955
Many computer vision and image processing (CVIP) operations can be represented as a sequence of tasks with nested loops, specified by the visual programming language Khoros. This paper addresses the automatic partitioning and scheduling of such operations on distributed memory multiprocessors. The major difficulties in determining the optimal image data distribution for each task are that the number of processors available and the size of the input image may vary at the run time, and the cost of some image processing operations may be data-dependent. This paper proposes a compile-time processor assignment and data partitioning scheme that optimizes the average run-time performance of task chains with nested loops by considering the data redistribution overheads and possible run-time parameter variations. This paper presents the theoretical analysis and experimental results on a Meiko CS-2 distributed memory machine.
We present an iteration and data partitioning approach for DOALL loops on distributed memory systems. The method first examines the highest amount of parallelism (available parallelism) which could be potentially expl...
详细信息
We present an iteration and data partitioning approach for DOALL loops on distributed memory systems. The method first examines the highest amount of parallelism (available parallelism) which could be potentially exploited in a loop nest. It then examines the amount of communication overhead which can potentially nullify the benefits due to parallelism and attempts to maximally eliminate the communication to minimize the loop completion time by trading parallelism to a minimal extent. This is achieved by determining the directions of iteration space partitioning which result in minimum communication. Finally, in order to generate a load balanced partition with respect to computation+communication, the method uses a new larger partition owns rule to distribute the underlying data. Necessary theoretical framework has been developed and the merit of the method is shown through a performance evaluation on Cray T3D.
Decoupled pre-fetching is a technique for reducing the page miss overheads in Distributed Shared Memory systems by separating out those instructions responsible for data fetching from the main instruction stream and r...
详细信息
Decoupled pre-fetching is a technique for reducing the page miss overheads in Distributed Shared Memory systems by separating out those instructions responsible for data fetching from the main instruction stream and running them on a separate CPU whose function is to predict store accesses ahead of time. This approach differs from other pre-fetching approaches in that the predictions of data usage are obtained dynamically from partial evaluation of the program and this promises to produce considerably better performance in circumstances where the access patterns are non-regular and cannot be extracted by static analysis of the program. This paper reviews the techniques of decoupled pre-fetching with particular emphasis on Cache only Memory architectures (COMA). It then presents a more thorough evaluation of the ideas than has previously been attempted using some of the SPLASH benchmarks. It is shown that the techniques perform well on some programs but that, as expected, the benefits of pre-fetching are negated when there is a high rate of data invalidation caused by global updating.
暂无评论