In this paper we show how to obtain optimal speedup in a master-slave model for solving data-parallel problems. Given the number of homogeneous workstations, their computation time for solving a basic sub-task of the ...
详细信息
ISBN:
(纸本)0818682596
In this paper we show how to obtain optimal speedup in a master-slave model for solving data-parallel problems. Given the number of homogeneous workstations, their computation time for solving a basic sub-task of the problem, network transmission bandwidth and data volume per basic sub-task, the per-distribution number of basic sub-tasks sent to a slave for attaining the optimal speedup can be decided. The effectiveness of the proposed theory has been tested using a parallel computing experiment involving the Hough transformation.
String matching problem received much attention over the years due to its importance in various applications such as text/file comparison, DNA sequencing, search engines, and spelling correction. Especially with the i...
详细信息
ISBN:
(纸本)0818682596
String matching problem received much attention over the years due to its importance in various applications such as text/file comparison, DNA sequencing, search engines, and spelling correction. Especially with the introduction of search engines dealing with tremendous amount of textual information presented on the world wide web and the research on DNA sequencing, this problem deserves special attention and any algorithmic or hardware improvements to speed up the process will benefit these important applications. In this paper, we present three algorithms for string matching on reconfigurable mesh architectures. Given a text T of length n and a pattern P of length m, the first algorithm finds the exact matching between T and P in O(1) time on a 2-dimensional RMESH of size (n - m + 1) x m. The second algorithm finds the approximate matching between T and P in O(k) time on a 2D RMESH, where k is the maximum edit distance between T and P. The third algorithm allows only the replacement operation in the calculation of the edit distance and finds an approximate matching between T and P in constant-time on a 3D RMESH.
In this paper a new method is proposed to synchronize massively parallel processes in distributed multiprocessor systems. The method is an extension of that used in arbitration systems like Futurebus+. It also uses th...
详细信息
ISBN:
(纸本)0818682596
In this paper a new method is proposed to synchronize massively parallel processes in distributed multiprocessor systems. The method is an extension of that used in arbitration systems like Futurebus+. It also uses three global synchronization lines and a distributed synchronizer, and can be applied without changes to the existing hardware. The method allows to carry out two alternative synchronization protocols where the end of the operation is either forced by any processor or identified as a moment when all processors completed their individual parts. Application of this method, e.g., to the arbitration process allows to reduce the arbitration time in average by a factor of 2.
We propose a pipelined asynchronous time division multiplexing optical bus. Such a bus can use one of the two hardwared priority schemes, the linear priority scheme and the round-robin priority scheme. Our simulation ...
详细信息
ISBN:
(纸本)0818682596
We propose a pipelined asynchronous time division multiplexing optical bus. Such a bus can use one of the two hardwared priority schemes, the linear priority scheme and the round-robin priority scheme. Our simulation results show that the performance of the proposed bus is significantly better than the performances of known pipelined synchronous time division multiplexing optical buses.
We present a package of parallel algorithms for multiple precision arithmetic, implemented by using a message-passing model of computation. These algorithms are organized in an object oriented library and perform para...
详细信息
We present a package of parallel algorithms for multiple precision arithmetic, implemented by using a message-passing model of computation. These algorithms are organized in an object oriented library and perform parallel arithmetic in Z, Q, and Zp. The library has a layered structure which provides portability and the possibility of extending the code easily. In the bottom layer we implement fundamental parallel algorithms which are machine dependent. Starting from integer multiplication, we developed different parallel Karatsuba-type and FFT-based schemes, including the integer 3-primes and floating-point FFT algorithms. These multiplication algorithms allowed us to design a parallel Newton method for division. parallel algorithms for modular arithmetic result by using both Montgomery's and classical arithmetic. We designed these algorithms under a message-passing model of computation. These are implemented on different parallel platforms, targeting both distributed memory machines, such as massively parallel processing systems and networks of workstations, and shared memory architectures. We developed both architecture dependent and architecture independent algorithms by using standard interfaces such as the Message Passing Interface, MPI. On top of this layer we provide a convenient interface to the user which allows a sequential programming style, while each algorithm is executed in parallel. At this level it is possible for the user to implement high level parallel algorithms, such as the GCD, which are machine independent. Finally we integrated our package in a computer algebra system, which uses a sequential frontend for interactive use, and a parallel backend for computation. We have called this package CALYPSO, a sort of acronym for Computer Algebra Library for parallel Symbolic Computation.
A simulation study of dynamic load balancing for parallel processing on network of workstations (NOW) is presented in this paper. A simulation model is constructed. It includes a representative CPU scheduling policy, ...
详细信息
ISBN:
(纸本)0818682596
A simulation study of dynamic load balancing for parallel processing on network of workstations (NOW) is presented in this paper. A simulation model is constructed. It includes a representative CPU scheduling policy, and also considers the message exchange, task transfer and migration costs explicitly. A global dynamic load balancing algorithm is simulated. Both task transfer and task migration are considered. The performance of the algorithm under both homogeneous and heterogeneous environments is analyzed. In addition, the interaction of parallel and sequential workloads on an NOW is also examined. Our results show that dynamic load balancing can achieve better performance improvement for heterogeneous systems than for homogeneous systems;it is especially effective in a system where both parallel and sequential tasks concurrently exist;and the use of task migration generally does not further improve performance.
In an earlier work on the design of fine-grain, scalable classifiers for massively parallel computers, the technique of unifying cascaded networks has been demonstrated. This paper further examines the method adopted ...
详细信息
ISBN:
(纸本)0818682596
In an earlier work on the design of fine-grain, scalable classifiers for massively parallel computers, the technique of unifying cascaded networks has been demonstrated. This paper further examines the method adopted using a highly parallel processing architecture, entitled Unified Hierarchical Classifiers (UHC), based on the principles of Generalised Regression Neural Networks (GRNN). As with the GRNN, it has been shown that the resulting classification network can be implemented efficiently on general-purpose multiprocessor platforms without dedicated hardware for processor interconnections. Adding to this the structural simplicity, and the demonstrable potential for an effective distributed realisation on the Cray T3D, will make UHC an attractive classifier architecture in practical applications.
With increasing on-chip hardware, concurrency is a way to bridge the gap between the computational power demanded by the applications and that afforded by the computer platforms. Although parallel systems are increasi...
详细信息
ISBN:
(纸本)0818682596
With increasing on-chip hardware, concurrency is a way to bridge the gap between the computational power demanded by the applications and that afforded by the computer platforms. Although parallel systems are increasingly popular, they remain very difficult to program. In fact, most compilers require the programmer to specify how to partition data or map program code to the system's processors. To ensure an effective program, cache locality is important because of the large speed gap between microprocessors and memory systems. It is also important to make use of local communication whenever possible, since it is cheaper, faster, and less power hungry than global communication. In order to exploit these locality properties, we present a systematic operation placement and scheduling scheme for fine-grain parallelarchitectures. The key advantages are two folds: (1) This multiprojection method, which deals with multidimensional parallelism systematically, can alleviate the burden of the programmer in coding and data partitioning. (2) It addresses the memory/communication bandwidth bottleneck, and can lead to faster program execution. On a special design example of the motion estimation block-matching algorithm, which requires the most intensive computation and memory accesses in video coding, our method leads to a reduction of external memory accesses by two to three orders of magnitude.
With the move towards parallel processing on Clusters of Workstations (COWs) the ability to fully utilise the computational resources of all workstations through the initial placement and movement of workload is desir...
详细信息
ISBN:
(纸本)0818682596
With the move towards parallel processing on Clusters of Workstations (COWs) the ability to fully utilise the computational resources of all workstations through the initial placement and movement of workload is desirable by the user of the system, to improve the overall performance. This is a critical task when many users run their parallel programs simultaneously. The ability to have an even spread of load over an entire COW can be achieved through the employment of Global Scheduling. This paper introduces the concept of global scheduling exploiting static allocation and dynamic load balancing along with the implementation of such a facility and support services, remote process creation and process migration, respectively on the RHODOS distributed operating system. The suitability of the RHODOS implementation is demonstrated by results obtained from initial test runs of a SPMD parallel application.
The main contribution of this work is to fathom the power and flexibility of the Mesh with Hybrid Buses via simulation. We propose two algorithms that perform an O(1) time stepwise simulation of an N-processor dynamic...
详细信息
ISBN:
(纸本)0818682596
The main contribution of this work is to fathom the power and flexibility of the Mesh with Hybrid Buses via simulation. We propose two algorithms that perform an O(1) time stepwise simulation of an N-processor dynamic Priority CRCW-PRAM endowed with M memory cells. Our fist algorithm uses a Mesh with Hybrid Buses of size max{N, M N-epsilon/2} x M N-epsilon/2, for some fixed constant epsilon, 0 < epsilon less than or equal to 1. Our second algorithm uses a Mesh with Hybrid Buses of size N x max{N, M}. The first algorithm is suited for small values of M, while the second is best suited for larger M.
暂无评论