ISBN: 0769509363 (print)
This paper proposes an advanced reconfiguration scheme using row-column bypassing and direct replacement for two-dimensional mesh-connected processing-node arrays. The scheme lets an array be divided efficiently between massively parallel computing and stand-alone computing. It uses an array that provides a switching circuit in every node for row-column bypassing, plus a simple tree-structured bypass network for direct replacement, allocated to the array by graph-node coloring with a minimum inter-node distance of three. It can reconfigure a subarray with a regular matrix of free nodes usable for parallel computing, allowing a small delay in the mesh connections while maintaining a communication path from every busy node used for stand-alone computing to the outside of the array. Direct replacement substitutes busy nodes that are not covered by row-column bypassing with free nodes located in the rows or columns to be bypassed, which helps enlarge the reconfigured subarray. Bypass allocation with a minimum distance of three enables distributed communications and simple routing in the array while attaining a high success probability for direct replacement. The proposed scheme is advantageous for constructing fault-tolerant massively parallel systems that use personal computers or workstations as processing nodes and Ethernet devices for interconnections.
We demonstrate a methodology for implementing sequential and distributed simulations in Java, and propose two specific algorithms based on Java threads (a single-channel and a multi-channel algorithm). In this approach, events are ordered by time in event lists and controlled by threads with respect to clock cycles. Each thread owns one event list. In the sequential case the threads are timed globally by a single clock, whereas in the distributed case each thread is clocked locally. The main target application of this work is the simulation of hardware/software systems, in which the different components are described by threads and obey a multi-clocked system.
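The sequential case described above can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python rather than Java, it runs in a single thread instead of one Java thread per component, and the names `Component`, `schedule`, and `run_sequential` are hypothetical and do not come from the paper. It shows one time-ordered event list per component, all driven by one global clock.

```python
import heapq


class Component:
    """A simulated component that owns its own time-ordered event list (hypothetical name)."""

    def __init__(self, name):
        self.name = name
        self.events = []  # min-heap of (time, sequence, label)
        self._seq = 0     # tie-breaker so equal-time events keep insertion order

    def schedule(self, time, label):
        """Insert an event; the heap keeps the earliest event at the front."""
        heapq.heappush(self.events, (time, self._seq, label))
        self._seq += 1


def run_sequential(components, max_cycles):
    """Advance one global clock and fire every event that is due at each cycle."""
    trace = []
    for clock in range(max_cycles):
        for comp in components:
            while comp.events and comp.events[0][0] <= clock:
                _, _, label = heapq.heappop(comp.events)
                trace.append((clock, comp.name, label))
    return trace


cpu = Component("cpu")
bus = Component("bus")
cpu.schedule(2, "fetch")
cpu.schedule(0, "reset")
bus.schedule(1, "grant")
trace = run_sequential([cpu, bus], max_cycles=4)
print(trace)
```

In a distributed variant, each component would presumably advance its own local clock instead of sharing `clock`; that synchronization is not attempted here.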
In this paper, we report on experiences with P³T+, a performance estimator for distributed and parallel programs that is used to examine, at compile time, the performance outcome of changes in code, problem and machine sizes, and target architectures. P³T+ computes a variety of performance parameters, including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. It is unique in that it models programs, code transformations, and parallel and distributed architectures, and derives a performance prediction based on all three of these elements. P³T+ is the successor of P³T, which computed a similar set of performance parameters, but only for parallel programs. P³T+ has been redesigned and reimplemented from scratch and goes beyond P³T by extending the class of programs that can be handled and by employing several novel estimation methods (symbolic analysis, simulation, pre-measured kernel codes, etc.). The core of this paper reports on the evaluation of P³T+, demonstrating both the accuracy and the usefulness of the tool for realistic kernel codes taken from real-world applications (pricing of financial derivatives and quantum mechanical calculations of solids).
Various augmenting mechanisms have been proposed to enhance the communication efficiency of mesh-connected computers (MCCs). One major approach is to add nonconfigurable buses for improved broadcasting. A typical example is the mesh-connected computer with multiple buses (MMB). In this paper, we propose a new class of generalized MMBs, the improved generalized MMBs (IMMBs). Each processor in an IMMB is connected to exactly two buses. We show the power of IMMBs by considering semigroup and prefix computations. Specifically, we show that semigroup and prefix computations on N operands, and data broadcasting all take O(log N) time on IMMBs. This is the first O(log N) time algorithm for these problems on arrays with fixed broadcasting buses.
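For readers unfamiliar with prefix and semigroup computations, the sketch below shows what is being computed and why a logarithmic number of rounds can suffice when all operands are processed in parallel. It is a generic doubling scan (Hillis-Steele style), not the bus-based IMMB algorithm of the abstract; the function name `prefix_scan` is hypothetical, and the inner loop only simulates what N processors would do simultaneously.

```python
import operator


def prefix_scan(values, op):
    """Inclusive prefix computation with an associative operator `op`.

    Returns the prefix results and the number of synchronous rounds used,
    which is ceil(log2 N) for N >= 1 operands.
    """
    result = list(values)
    n = len(result)
    rounds = 0
    step = 1
    while step < n:
        # Each position i >= step combines with the value `step` places to its left.
        # A parallel machine would update all positions at once; here we loop.
        previous = result[:]
        for i in range(step, n):
            result[i] = op(previous[i - step], previous[i])
        step *= 2
        rounds += 1
    return result, rounds


data = [3, 1, 4, 1, 5, 9, 2, 6]
sums, rounds = prefix_scan(data, operator.add)
maxima, _ = prefix_scan(data, max)
print(sums, rounds)
print(maxima)
```

The semigroup result (for example the total sum or the overall maximum) is simply the last element of the prefix output.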
The MPEG-4 audio standard provides a toolset for audio synthesis and audio processing called structured audio (SA). SA lets one describe algorithms in its programming language, the Structured Audio Orchestra Language (SAOL). Unlike some other languages of the same type, SAOL has a sample-by-sample execution structure, which makes the computational overhead particularly significant in an interpreted decoder implementation. This paper describes the design of an efficient virtual architecture that exploits the data-level parallelism contained in many audio synthesis and processing algorithms and consistently reduces the implementation overhead through block-by-block execution.
We examine the results of major previous attempts to apply genetic and evolutionary computation (GEC) to image processing. In many problems, the accuracy (quality) of solutions obtained by GEC-based methods is better than that obtained by other methods such as neural networks and simulated annealing. However, the computation time required is satisfactory for some problems and unsatisfactory for others. We consider the current problems of GEC-based methods and present the following measures to achieve still better performance: (1) utilizing competent GEC, (2) incorporating other search algorithms such as local hill-climbing algorithms, (3) hybridizing with conventional image processing algorithms, (4) modeling the given problem with as few parameters as possible, and (5) using parallel processors to evaluate the fitness function.
ISBN: 9781581131857 (print)
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a dataflow model based on the producer-consumer style of programming and can be adapted to different architectural parameters to achieve high performance. The proposed algorithms have been implemented on the EARTH (Efficient Architecture for Running THreads) platform. For both algorithms, we analyze the ratio of remote to local threads and study its impact on the experimental results. Our implementation results show that, for certain block sizes with fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For a large number of processors, both algorithms perform well, yielding execution times of only 10 ms for an input of 16 K data points on a 64-processor machine, assuming each processor runs at a 140 MHz clock speed.
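As background for the computation being parallelized, here is a minimal sequential sketch of a radix-2 Cooley-Tukey FFT. It is neither the receiver-initiated nor the sender-initiated algorithm from the abstract and does not use EARTH; the function name `fft` is chosen only for illustration. The two recursive calls are independent of each other, which is the kind of parallelism a multithreaded formulation could exploit.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # The two half-size transforms do not depend on each other.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the k-th outputs of the two half-size transforms.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


impulse = fft([1, 0, 0, 0])   # an impulse has a flat spectrum
constant = fft([1, 1, 1, 1])  # a constant signal has only a DC component
print(impulse)
print(constant)
```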
Efficient transposition of large-scale matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in state-of-the-art architectures, data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers all these costs and reduces the overall execution time. The reduction is achieved by two techniques: (1) writing the data onto disk in predefined patterns and (2) balancing the numbers of disk read and write operations. Even though our approach may increase the number of I/O operations in some cases, it results in an overall reduction in execution time. The index computation, an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into two buffers. The expensive in-processor permutation is replaced by data collection operations. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. Experimental results on a Sun Enterprise and a DEC Alpha show that our algorithm reduces the execution time by about 50% compared with the best known algorithms in the literature.
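To make the cost of index computation concrete, the sketch below shows the usual per-element index mapping for transposing a row-major matrix stored as a flat array. It is an assumption that this is the computation the abstract refers to; the mapping uses one integer division, one remainder (itself a division), and one multiplication, which matches the stated count. The function names are hypothetical, and the paper's two-buffer scheme for avoiding this work is not reproduced here.

```python
def transposed_index(i, rows, cols):
    """Map linear index i in a rows x cols row-major matrix to its index in the transpose."""
    row = i // cols           # first division
    col = i % cols            # second division (remainder)
    return col * rows + row   # one multiplication


def transpose_flat(data, rows, cols):
    """Transpose a flat row-major matrix by computing the target index of every element."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    out = [None] * len(data)
    for i, value in enumerate(data):
        out[transposed_index(i, rows, cols)] = value
    return out


flat = [1, 2, 3,
        4, 5, 6]                       # a 2 x 3 matrix
transposed = transpose_flat(flat, 2, 3)  # the 3 x 2 transpose, still flat
print(transposed)
```

Paying this arithmetic for every element is what makes index computation a significant share of the total time for large matrices.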
Software for qualitative analysis of the kinetic equations of atmospheric chemistry, based on parallel algorithms, is developed and tested. It includes a dispatcher for managing parallel tasks and parallel algorithms for solving algebraic and differential equations on computer clusters. Tests have shown the efficiency of computing with this architecture and parallel programming. The data format for translating symbolic kinetic equations into procedures of the computation block is discussed as well.