The computation of a one-dimensional FFT on a c-dimensional torus multicomputer is analyzed. Different approaches are proposed which differ in the way they use the interconnection network. The first approach is based ...
详细信息
The computation of a one-dimensional FFT on a c-dimensional torus multicomputer is analyzed. Different approaches are proposed which differ in the way they use the interconnection network. The first approach is based on the multidimensional index mapping technique for the FFT computation. The second approach starts from a hypercube algorithm and then embeds the hypercube onto the torus. The third approach reduces the communication cost of the hypercube algorithm by pipelining the communication operations. A novel methodology to pipeline the communication operations on a torus is proposed. Analytical models are presented to compare the different approaches. This comparison study shows that the best approach depends on the number of dimensions of the torus and the communication start-up and transfer times. The analytical models allow us to select the most efficient approach for the available machine.< >
The maximum speedup of a multiprocessor system is limited by the sequential part of an algorithm and in loosely coupled processor systems, a large part of this sequentiality is caused by the communication between proc...
详细信息
The maximum speedup of a multiprocessor system is limited by the sequential part of an algorithm and in loosely coupled processor systems, a large part of this sequentiality is caused by the communication between processors. As this communication is dependent on the distribution of data the data distribution must be optimised in order to achieve the maximum speedup. In this paper we present a new method of achieving this optimisation for loosely coupled multiprocessors using a branch and bound technique based on the Moore-Skelboe interval arithmetic. Two key issues of this algorithm have been addressed, namely, the branch selection criterion, and an efficient data structure to keep track of the branches. A blocked data structure is shown to have the desirable properties of dynamic list length, and minimal disk thrashing. When this method is applied to an FFT algorithm running on a pipeline of transputers, the optimised data distribution is shown to yield an improved speedup over the equal and random distributions, which do not take into account communication overheads.< >
In this paper we study the problem of scheduling parallel loops at compile-time for a heterogeneous network of machines. We consider heterogeneity in three aspects of parallelprogramming: program, processor and netwo...
详细信息
ISBN:
(纸本)9780818670886
In this paper we study the problem of scheduling parallel loops at compile-time for a heterogeneous network of machines. We consider heterogeneity in three aspects of parallelprogramming: program, processor and network. A heterogeneous program has parallel loops with different amount of work in each iteration; heterogeneous processors have different speeds; and a heterogeneous network has different cost of communication between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms for generating optimal and sub-optimal schedules of loops for load balancing, communication optimizations and network contention. Experiments show that a significant improvement of performance is achieved using our techniques.
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages work load migrations within its neighborhood. This paper compares a couple of fairly we...
详细信息
With nearest neighbor load balancing algorithms, a processor makes balancing decisions based on its local information and manages work load migrations within its neighborhood. This paper compares a couple of fairly well-known nearest neighbor algorithms, the dimension exchange and the diffusion methods and their variants in terms of their performances in both one-port and all-port communication architectures. It turns out that the dimension exchange method outperforms the diffusion method in the one-port communication model, and that the strength of the diffusion method is in asynchronous implementations in the all-port communication model. The underlying communication networks considered assume the most popular topologies, the mesh and the torus and their special cases: the hypercube and the k-ary n-cube.< >
This paper focus on the design of adaptive mixed-signal fuzzy chips. These chips have parallel architecture and feature electrically-controllable surface maps. The design methodology is based on the use of composite t...
详细信息
This paper focus on the design of adaptive mixed-signal fuzzy chips. These chips have parallel architecture and feature electrically-controllable surface maps. The design methodology is based on the use of composite transistors-modular and well suited for design automation. This methodology is supported by dedicated, hardware-compatible learning algorithms that combine weight-perturbation and outstar.
We present algorithms for optimally scheduling computations that comprise a sequence of complete up- and/or down-sweeps on a complete binary tree, on a parallel architecture in which the communication delay between an...
详细信息
We present algorithms for optimally scheduling computations that comprise a sequence of complete up- and/or down-sweeps on a complete binary tree, on a parallel architecture in which the communication delay between any two processors is uniform. Such computations include, for instance, those that implement broadcast, accumulation, and the parallel-prefix operator; such architectures include, for instance, networks of workstations. Our schedules are optimal in the sense of having the actual minimum time-to-completion-not just on approximation thereof-considering the time for both computation and communication. We concentrate on schedules for fine-grain tree-sweep computations-wherein communication costs are rather large relative to per-task computation cost.
We define transformable computing systems as those machines that use the reconfigurable aspects of field programmable gate arrays (FPGA) to implement an algorithm. Researchers throughout the world have shown that comp...
详细信息
We define transformable computing systems as those machines that use the reconfigurable aspects of field programmable gate arrays (FPGA) to implement an algorithm. Researchers throughout the world have shown that computationally intensive software algorithms can be transposed directly into hardware design for extreme performance gain. The on-the-fly use of digital designs in a reconfigurable computer can easily be utilized from within existing software applications. Hardware objects are converted digital designs called from within a software program. This H.O.T. (Hardware Object Technology) method of programming transformable computers is the subject of this paper.< >
This paper presents highly parallel VLSI structures for linear convolution. Our methodology implements Toom's algorithm and is based on mapping a modified version of the tensor product factorization proposed by Gr...
详细信息
This paper presents highly parallel VLSI structures for linear convolution. Our methodology implements Toom's algorithm and is based on mapping a modified version of the tensor product factorization proposed by Granata et al. (1991). The resulting networks have very simple structure, highly regular topology, and use simple bit-serial devices. Additionally, the proposed networks have very small depth and contain only a single stage of multipliers, while all other stages contain adders only.
Active messages provide a low latency communication architecture which on modern parallel machines achieves more than an order of magnitude performance improvement over more traditional communication libraries. This p...
详细信息
Active messages provide a low latency communication architecture which on modern parallel machines achieves more than an order of magnitude performance improvement over more traditional communication libraries. This paper discusses the experience we gained while implementing active messages on the Meiko CS-2, and discusses implementations for similar architectures. During our work we have identified that architectures which only support efficient remote write operations (or DMA transfers as in the case of the CS-2) make it difficult to transfer both data and control as required by active messages. Traditional network interfaces avoid this problem because they have a single point of entry which essentially acts as a queue. To efficiently support active messages on modern network communication co-processors, hardware primitives are required which support this queue behavior The overcame this problem by producing specialized code which runs on the communications co-processor and supports the active messages protocol. Our implementation of active messages results in a one-way latency of 12.3 /spl mu/s and achieves up to 39 MB/s for bulk transfers. Both numbers are close to optimal for the current Meiko hardware and are competitive with performance of active messages on other hardware platforms.< >
Evaluation of predicates on the state of a distributed system is complicated by the lack of either a common clock or common memory. In these systems, message passing is often used to order local events globally. This ...
详细信息
Evaluation of predicates on the state of a distributed system is complicated by the lack of either a common clock or common memory. In these systems, message passing is often used to order local events globally. This leads to a partial, causal ordering of system events. Predicate evaluation algorithms based on this causal ordering generally cannot determine if an unstable predicate was true at some instant of real time without freezing the underlying application. They only determine whether or not the predicate could have occurred. This ordering is sufficient for evaluating stable predicates, but algorithms based on it require a good deal of message passing. We present two algorithms which conclusively evaluate both stable and a restricted class of unstable predicates for a given execution. The algorithms are based on the use of roughly synchronized clocks. We present an algorithm for scheduled evaluation as well as a centralized detection algorithm.
暂无评论