This paper describes a software environment devised to support parallel and sequential discrete-event simulation. It provides assistance to the user in issues such as selection of the synchronization protocol to be used in the execution of the simulation of the model. The software framework has been built upon the bulk-synchronous model of parallel computing. The well-defined structure of this model allowed us to predict the running time cost of synchronization protocols in accordance with the particular work-load generated by the execution of the simulation model. We exploit this feature to automatically generate the simulation program.
The problem of tree pattern matching for object recognition in images is computationally intensive in nature. In two-dimensional images, the objects can be represented through multiscale decomposition as tree structures. The pattern tree representing an object can be matched with a subject tree representing an image in order to detect the objects within the image. Several sequential, parallel and hardware algorithms exist in the literature for tree pattern matching. In this paper, we describe a new parallel algorithm and its realization as a VLSI chip for tree pattern matching. The hardware algorithm is based on a linear array of processing elements (PEs) in which the pattern matching is done in a pipelined fashion, relying on nearest-neighbor communication between the PEs. Subject and pattern trees of arbitrary length can be processed using a fixed-size PE array. The algorithm has an improved execution time of $O(\lceil m/a \rceil n)$, where $m$, $a$ and $n$ are the sizes of the pattern tree, processor array and subject tree, respectively. A prototype CMOS VLSI chip implementing the proposed algorithm has been designed and verified. It is shown that the hardware algorithm proposed in this work represents a significant improvement in terms of computational complexity, data flow, and architecture over the ones previously proposed for this problem.
We propose an efficient parallel algorithm with simple static and dynamic scheduling for generating combinations. It can use any number of processors ($NP \le n-m+1$) in order to generate the set of all combinations C(n,m). The main characteristic of this algorithm is that it requires no integer larger than n during the whole computation. The performance results show that even without a perfect load balance, this algorithm performs very well, especially when n is large. In addition, the dynamic algorithm performs well on heterogeneous parallel platforms.
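The bound $NP \le n-m+1$ matches the number of possible smallest elements of an m-combination of 1..n. The following is a minimal sketch of one way to exploit that, assuming a partition by smallest element and a round-robin static schedule. The abstract does not describe the authors' actual scheme, so the function names and the scheduling rule here are hypothetical.

```python
from itertools import combinations


def combos_with_first(n, m, first):
    """All m-combinations of 1..n whose smallest element is `first`."""
    rest = range(first + 1, n + 1)
    return [(first,) + tail for tail in combinations(rest, m - 1)]


def static_schedule(n, m, num_workers):
    """Round-robin assignment of smallest elements to workers (assumed scheme)."""
    if not 1 <= num_workers <= n - m + 1:
        raise ValueError("num_workers must satisfy 1 <= NP <= n - m + 1")
    firsts = range(1, n - m + 2)  # possible smallest elements: 1 .. n-m+1
    return [list(firsts[w::num_workers]) for w in range(num_workers)]


def generate_all(n, m, num_workers):
    """Simulate the workers sequentially and merge their independent outputs."""
    out = []
    for worker_firsts in static_schedule(n, m, num_workers):
        for first in worker_firsts:
            out.extend(combos_with_first(n, m, first))
    return sorted(out)
```

Each worker's share is independent of the others, so the loop over workers could be replaced by real parallel tasks. The shares are uneven (smaller leading elements own more combinations), which illustrates why load balance is imperfect under a static schedule.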
In beam-beam macroparticle simulations for collider rings, the accurate determination of the incoherent spectrum and potentially unstable coherent modes requires (1) large numbers of collisions, and (2) accurate electric field solutions at each collision. On a single processor, a self-consistent simulation typically uses a 2D model of the beam-beam interaction in order to achieve a reasonable computation time. For the long (~0.3 m) bunches in the LHC, however, we wish to include the third dimension in order to account for effects such as longitudinal motion, crossing angle, and beam size and density variations. We describe here a parallel algorithm, developed with MPI on a small commodity Linux cluster, to extend our simulation code BeamX from 2D to 3D using longitudinal subdivision (slicing) of the bunches. Although this paper concentrates on the computing methods, some performance trials and example results are also shown.
We have previously introduced the massively parallel global cellular automata (GCA) model. Parallel algorithms derived from applications can be mapped straightforwardly onto this model. In this model, a cell in the cell field is dynamically connected (access pattern, dynamic neighbourhood) to other cells. The model can be implemented by pointers stored in the cell state. Via these pointers, each cell has read access to any other cell in the cell field, and the pointers may be changed from generation to generation. We have investigated different types of the model in order to minimize hardware/software implementation cost, and have classified the GCA into types with respect to the space, time or data dependency of the access pattern. We have investigated a number of different GCA algorithms and found that in most cases a time-dependent access pattern is sufficient. To assess the usefulness of the data-dependent access pattern, we constructed a sophisticated merge sort algorithm in which the target addresses are computed, in contrast to classical algorithms where the data elements are moved. It turned out that we could not achieve the speed-up we expected compared to an algorithm implemented on the simpler time-dependent model. This is another confirmation that it is sufficient to implement only the time- and space-dependent model and thus reduce the hardware/software implementation cost.
Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. We address a number of algorithmic issues in parallel data cube construction. First, we present an aggregation tree for sequential (and parallel) data cube construction, which has minimally bounded memory requirements. An aggregation tree is parameterized by the ordering of dimensions. We present a parallel algorithm based upon the aggregation tree. We analyze the interprocessor communication volume and construct a closed-form expression for it. We prove that the same ordering of the dimensions minimizes both the computational and communication requirements. We also describe a method for partitioning the initial array and prove that it minimizes the communication volume. Experimental results from an implementation of our algorithms on a cluster of workstations validate our theoretical results.
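To make the operation concrete, here is a minimal sequential sketch of data cube construction in which every group-by is derived from a parent group-by with one more dimension, and the choice of parent depends on the ordering of dimensions. It is not the aggregation tree from the paper, whose construction the abstract does not give. The representation (a dict from coordinate tuples to values) and the parent-selection rule are assumptions made for illustration.

```python
from itertools import combinations


def aggregate(parent, parent_dims, drop):
    """Sum out dimension `drop` from a group-by stored as {coords: value}."""
    idx = parent_dims.index(drop)
    child = {}
    for coords, value in parent.items():
        key = coords[:idx] + coords[idx + 1:]
        child[key] = child.get(key, 0) + value
    return child


def data_cube(base, dims):
    """Compute all 2^d group-bys of `base`, largest first.

    Assumed rule: the parent of a group-by is that group-by plus the first
    dimension (in the given ordering) that it is missing.
    """
    dims = tuple(dims)
    cube = {dims: base}
    for size in range(len(dims) - 1, -1, -1):
        for kept in combinations(dims, size):
            missing = next(d for d in dims if d not in kept)
            parent_dims = tuple(d for d in dims if d in kept or d == missing)
            cube[kept] = aggregate(cube[parent_dims], parent_dims, missing)
    return cube
```

For a 2x2 array with dimensions A and B, `data_cube` returns the base array, the two one-dimensional group-bys, and the grand total under the empty tuple of dimensions. Changing the order of `dims` changes which parent each group-by is computed from, which is the kind of choice the ordering parameter controls.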
ISBN: (print) 0780378040
We present a parallel elite-subspace evolutionary algorithm for solving systems of nonlinear equations. On the basis of numerical experiments, we find that our algorithm exhibits outstanding universality as well as superior convergence. All of the solutions can be obtained within a short period of time.
In this paper we present a 3-D parallel filtering algorithm. This algorithm is highly parallel and efficient as it eliminates the overhead associated with the overlapping segments in the block-filtering approach. It also lifts the restrictions on the input size for high efficiency in the block-filtering algorithm, as both the 3-D input data and the impulse response of the system are decimated into eight subsections each. These subsections can be processed simultaneously and independently. The results of implementing the 3-D parallel filtering algorithm on a multi-DSP platform are presented and discussed, showing high performance that reflects the highly parallel architecture and good memory distribution of the 3-D parallel algorithm.
Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. We have developed a set of parallel algorithms for data cube construction using a new data structure called the aggregation tree. Our experience has shown that a number of performance trade-offs arise in developing a parallel data cube implementation. We focus on three important issues: (1) data distribution, i.e., how the original array is distributed among the processors; (2) level of parallelism, i.e., which parts of the computation are parallelized and which are sequentialized; and (3) frequency of communication, i.e., whether the implementation requires frequent interprocessor communication (and less memory) or less frequent communication (and more memory). We present a detailed experimental study evaluating the above trade-offs. We consider parallel data cube construction with different cube sizes and sparsity levels. Our experimental results show the following: (1) In all cases, reducing the frequency of communication and using more memory gave better performance, though the difference was relatively small. (2) Choosing the data distribution to minimize communication volume made a substantial difference in performance in most of the cases. (3) Finally, using parallelism at all levels gave better performance, even though it increased the total communication volume.
This paper discusses the parallel propagation algorithm applied to the decoding of convolutional codes. The parallel algorithm surpasses the BCJR algorithm in parallel computational complexity and performance when it is applied to tail-biting codes. The complexity of iterative decoding algorithms depends on the calculation of posterior probability and bit error probability by using clique and number of iterations. The computational complexity of the parallel algorithm for tail-biting codes is almost the same as that for zero-tail convolutional codes.