ISBN: 0819425885 (print)
The two-dimensional (2D) Discrete Fourier Transform (DFT) is frequently needed in digital image processing. Although the 2D Fast Fourier Transform (FFT) dramatically reduces the computing time of the 2D DFT, processing a very large array is still intolerably slow. The development of parallel processing systems promotes the application of the 2D FFT. In this paper, we present the implementation of the 2D FFT as a general procedure, using the row-column method and the vector-radix method, on DAWN 1000, a general-purpose massively parallel processing system developed in China. Even though the 2D FFT is parallel in nature, the requirement for corner-turning and the need for data communication complicate its implementation. We analyze the impact of machine capacity and computing complexity on algorithm efficiency, and evaluate the implementation in terms of arithmetic operations as well as data transfer. The comparison shows that each method has its own advantages and disadvantages. Combining their traits, we design a new implementation algorithm that takes into account flexibility, efficiency and communication complexity. As an example, we apply the new approach to spaceborne SAR image processing.
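The row-column method named in this abstract can be illustrated with a minimal serial sketch. It is not the paper's DAWN 1000 implementation: the function names are hypothetical, a naive O(n^2) DFT stands in for a real 1D FFT, and the transpose plays the role of the corner turn that would require inter-processor communication on a parallel machine.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) 1D DFT; a stand-in for a real 1D FFT routine."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(a):
    """Corner turn: swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def dft_2d_row_column(a):
    """2D DFT via the row-column method (hypothetical helper name)."""
    rows_done = [dft_1d(row) for row in a]        # pass 1: transform every row
    turned = transpose(rows_done)                 # corner turn
    cols_done = [dft_1d(row) for row in turned]   # pass 2: transform former columns
    return transpose(cols_done)                   # restore the original orientation

def dft_2d_direct(a):
    """Direct evaluation of the 2D DFT definition, used only as a reference."""
    m, n = len(a), len(a[0])
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (u * r / m + v * c / n))
                 for r in range(m) for c in range(n))
             for v in range(n)]
            for u in range(m)]
```

In a parallel setting each processor could own a band of rows, so both 1D passes are local and only the corner turn moves data between nodes, which is why the abstract singles it out as the complicating step.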
Object dataflow is a popular approach used in parallel rendering. The data representing the 3D scene is statically distributed among processors, and objects are fetched and cached only on demand. Most previous object dataflow methods were implemented on shared memory architectures and exploited spatial coherency to reduce hardware cache misses. In this paper, we propose an efficient model for object dataflow parallel volume rendering on message passing machines. The algorithm is introduced, and its ray storage mechanism is used to support latency hiding by postponing computation on inactive rays. Memory usage is optimized by letting objects migrate and replicate at different processors rather than using the common static assignments. Our cache-only-memory approach uses a distributed-directory scheme to trace the location of objects at other nodes. A mechanism that minimizes network congestion and optimizes channel utilization was implemented. Unlike previous methods, our approach can benefit from temporal coherence and effectively minimizes communication costs during animation in limited-bandwidth multiprocessing environments. We report results from implementing the algorithm on several platforms, including the Cray T3D, the Convex SPP and a cluster of DEC Alpha workstations (COW), where it achieved higher efficiency and scalability than existing algorithms.
In parallel and distributed systems, an important issue in managing a decentralized task queue is load balancing among multiple processors. In this paper, we propose a scheme for this problem that uses a symmetric broadcast network (SBN), which provides an efficient and robust communication pattern between processors. We compare the performance of the SBN-based load balancing algorithm with a randomization-based algorithm, the gradient algorithm, and the extended gradient algorithm on a broad range of computing and communication platforms. All four algorithms were first implemented on an 8-processor Intel iPSC-2, a hypercube-based multicomputer. The programs were then ported to Parallel Virtual Machine (PVM). Using PVM, we compared all four algorithms on (i) an 8-processor bus-based Silicon Graphics multiprocessor (SGI), (ii) two DEC Alpha workstations connected by a Local Area Network, and (iii) the SGI and the two DEC Alphas connected over the Internet. We found that our SBN-based algorithm performed well over a wide range of workloads and of computer and communication configurations.
ISBN: 0818681306 (print)
This paper considers whether the seemingly disparate fields of Computational Intelligence (CI) and computer architecture can profit from each other's principles, results and experience. In the process, we identify important common issues, such as parallelism, distribution of data and control, granularity and regularity. We present two novel computer architectures which have profited from principles found in CI, and identify two constraints on CI that eliminate the hidden influence of the von Neumann model of computation.
ISBN: 0819425885 (print)
Contract-Linda is a novel programming architecture for heterogeneous parallel machines, particularly suited to image processing. Previous research has concentrated on static and pre-determined scheduling of computation and on fine grain parallelism. Pre-determined scheduling is satisfactory in cases where the computational process is fully deterministic. However, with many image interpretation schemes the flow of control and the nature of the computational procedures can only be determined at run-time. In this paper we describe a programming paradigm for coarse grain and task level parallelism. Task management is based on the Contract Net protocol and utilises the Linda Coordination Language to provide run-time scheduling. This paradigm accommodates a closely coupled network of heterogeneous processing modules which differ greatly in computational capability; modules based on neural networks, semantic networks, vector and scalar processors are accommodated. Contract-Linda allows specialised heterogeneous machines to be exploited using a straightforward generic programming model. It does this by providing an internal task management mechanism which ensures that the heterogeneous processing elements are used by the tasks most suited to them and which exploits dynamic parallelism within the problem as it is solved. By separating the task of describing the problem from that of describing how the work is carried out on the machine (and providing a solution for the latter), we allow applications to be developed quickly which can effectively utilise specialised machines without the need for specialised programming. We report an experiment to re-implement a cell image interpretation system using Contract-Linda.
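The bidding idea behind the Contract Net protocol mentioned in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not Contract-Linda itself: the `award` function, the worker names and the cost tables are hypothetical, and the Linda tuple space that would carry announcements and bids is omitted.

```python
def award(task_kind, workers):
    """Pick a worker for one announced task, Contract Net style.

    workers maps a worker name to a dict of {task kind: estimated cost}.
    A worker bids only on task kinds it lists; the lowest bid wins.
    Returns the winning worker name, or None if nobody bids.
    """
    # Collect bids from every worker able to handle this kind of task.
    bids = {name: costs[task_kind]
            for name, costs in workers.items()
            if task_kind in costs}
    if not bids:
        return None
    # Award the contract to the cheapest bidder.
    return min(bids, key=bids.get)

# Hypothetical heterogeneous modules with made-up cost estimates.
WORKERS = {
    "vector": {"convolve": 1.0, "classify": 9.0},
    "neural": {"classify": 2.0},
    "scalar": {"convolve": 5.0, "classify": 6.0, "label": 3.0},
}
```

Calling `award("classify", WORKERS)` selects the module with the lowest estimated cost for that task, which mirrors the abstract's goal of routing each task to the processing element most suited to it at run-time.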
The objective of this paper is to develop a parallel overlapping mesh technique for the solution of the compressible Euler equations. The overlapping mesh technique facilitates grid generation in complex geometries and is also used as a grid-partitioning algorithm for parallelisation. Blending functions are introduced to allow for multiple overlaps between subdomains. The parallel implementation is built on PVM. Numerical tests were performed on a 16-processor CRAY CS6400 and on a 2-processor SPARCstation 20 Model 612.
The relationships between configurable computing, ASICs, and microprocessors have several important implications. First, sequential programming languages and related compilation approaches are not likely to be a good match for highly parallel configurable-computing applications. While it may be possible to achieve moderate speedup, significant speedup will only be achieved by directly exploiting massive amounts of parallelism. This is currently done using low-level circuit design tools. Second, the architectural organization will be much more distributed than is commonly found in existing computer systems. Finally, hybrid systems of microprocessors and FPGAs are best coupled flexibly to fully exploit the best features of each device.
ISBN: 3540625739 (print)
The proceedings contain 19 papers. The special focus in this conference is on Communication and Architectural Support for Network-Based Parallel Computing. The topics include: efficient communication mechanisms for cluster based parallel computing; stream sockets on SHRIMP; a simple and efficient process and communication abstraction for network operating systems; efficient adaptive routing in networks of workstations with irregular topology; a deadlock avoidance method for computer networks; extending ATM networks for efficient reliable multicast; a single-chip ATM switch for NOWs; a portable threads library supporting migrant threads on heterogeneous network farms; transparent treatment of remote pointers using IPC primitive in RPC systems; an operating system support to low-overhead communications in NOW clusters; distributed hardware support for process synchronization in NSM workstation clusters; synchronization support in I/O adapter based SCI clusters; load balancing for regular data-parallel applications on workstation network; a comparison of three high speed networks for parallel cluster computing; understanding the performance of DSM applications; performance metrics and measurement techniques of collective communication services; connection-less, lightweight, and multiway communication support for distributed computing; network-wide cooperative computing architecture (NCCA); and data movement and control substrate for parallel scientific computing.
Cache-coherent multiprocessors with distributed shared memory are becoming increasingly popular for parallel computing. However, obtaining high performance on these machines requires that an application execute with good data locality. In addition to making effective use of caches, it is often necessary to distribute data structures across the local memories of the processing nodes, thereby reducing the latency of cache misses. We have designed a set of abstractions for performing data distribution in the context of explicitly parallel programs and implemented them within the SGI MIPSpro compiler system. Our system incorporates many unique features to enhance both programmability and performance. We address the former by providing a very simple programming model with extensive support for error detection. Regarding performance, we carefully design the user abstractions with the underlying compiler optimizations in mind, incorporate several optimization techniques to generate efficient code for accessing distributed data, and tightly integrate these techniques with other optimizations within the compiler. Our initial experience suggests that the directives are easy to use and can yield substantial performance gains, in some cases by as much as a factor of 3 over the same codes without distribution.
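To make the idea of distributing a data structure across node memories concrete, here is a minimal sketch of a block distribution of a one-dimensional array. It is an assumption-laden illustration, not the MIPSpro directives or their generated code: the function names are hypothetical, and a block layout is just one common choice of distribution.

```python
def block_size(n_elements, n_nodes):
    """Elements per node under a block distribution (ceiling division)."""
    return -(-n_elements // n_nodes)

def owner_and_offset(index, n_elements, n_nodes):
    """Map a global array index to (owning node, offset in that node's block)."""
    if not 0 <= index < n_elements:
        raise IndexError("global index out of range")
    size = block_size(n_elements, n_nodes)
    return index // size, index % size

def local_indices(node, n_elements, n_nodes):
    """Global indices stored on one node; the last node may hold fewer."""
    size = block_size(n_elements, n_nodes)
    start = node * size
    return list(range(start, min(start + size, n_elements)))
```

The division and modulo in `owner_and_offset` are the kind of address arithmetic a compiler would want to simplify or hoist out of loops when generating code for accesses to distributed data.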