The presented paper deals with possible approaches to parallel implementation of solution of a hyper-dimensional dynamical particle system. The proposed implementation approaches are generally applicable for similar particle systems of interest in various research and engineering fields. The original motivation for the present work was a simulation of particles that represent a space-filling design to be optimized for further use in design of experiments. Due to the underlying purpose of this particle system, the dimension of the particle system of interest is considered to be entirely arbitrary. Such a hyper-dimensional space is further folded into a periodically repeated domain. The theoretical background of the proposed particle system is provided along with the derivation of equations of motion of the dynamical system. As the complexity of the system is not limited by the number of particles nor the number of dimensions, the possibilities of utilizing the GPGPU platform are more restricted in comparison with today's fast parallel implementations of common particle systems. Two distinct approaches to parallel implementation are presented, one aiming at a generalized usage of the fast on-chip resources, the other entirely relying on the GPU's on-board global memory. Despite unambiguous mutual differences in their performance, both parallel implementations deliver major speedup compared to the single-thread CPU solution as well as a better scaling of execution time when increasing both the number of particles and dimensions.
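The periodically repeated domain mentioned above can be illustrated with a minimum-image distance in an arbitrary number of dimensions. This is only a sketch: the abstract does not state the box size or the metric, so the unit period, the Euclidean norm, and the name `periodic_distance` are assumptions made for illustration.

```python
import math


def periodic_distance(a, b, period=1.0):
    """Shortest Euclidean distance between a and b in a periodic box.

    Assumption: every dimension repeats with the same period, so along
    each axis the shorter of the direct and the wrapped-around offset
    is used (minimum-image convention).
    """
    total = 0.0
    for x, y in zip(a, b):
        d = abs(x - y) % period      # offset folded into [0, period)
        d = min(d, period - d)       # take the shorter way around
        total += d * d
    return math.sqrt(total)


# Works for any dimension: here a 3-dimensional pair of points.
print(periodic_distance([0.1, 0.5, 0.9], [0.9, 0.5, 0.1]))
```

Because the loop runs over the coordinates of the two points, the same function handles any number of dimensions, which matches the abstract's emphasis on an arbitrary dimension.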
Associative techniques are useful in computer vision because they can make a recognition system notably more robust. The noise-like coding model of associative memory has already been applied successfully to image classification. This paper describes the implementation of the associative system on transputer-based architectures. After explaining the model's basic formalism, the paper outlines the key-generation mechanism, the data-mapping strategy, and the hierarchical processor organization. The basic result of this research is a general methodology for efficient HW configuration of real-time associative visual systems. The system's efficiency can be predicted by theoretical derivations, in which both the FFT-computation speed and the data-transmission speed play a crucial role. Experimental results covering different HW configurations and different image sizes always confirmed theoretical expectations.
Simulated annealing is an effective method for solving large combinatorial optimisation problems. Because of its iterative nature, the annealing process requires a substantial amount of computation time. A new parallel implementation based on the concurrency control theory of database systems is presented; the parallelised annealing process is serialisable. Concurrent updates to the base solution are allowed provided that they do not have data conflicts. Using the travelling salesman problem as the example application, the parallel simulated annealing algorithm is implemented on a Motorola Delta 3000 shared-memory multiprocessor system with eight processors. With a moderate problem size of 400 cities, a speedup efficiency of over 90% is achieved at a high annealing temperature and close to 100% at a low annealing temperature.
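The idea of allowing concurrent updates only when they have no data conflict can be sketched for the travelling salesman problem as follows. The abstract does not give the move type or the conflict rule, so the 2-opt segment reversal, the Metropolis acceptance test, and the conservative interval-overlap check below are assumptions, and all function names are hypothetical.

```python
import math
import random


def tour_length(tour, dist):
    """Total length of a closed tour under a symmetric distance matrix."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))


def two_opt_delta(tour, dist, i, j):
    """Length change from reversing tour[i..j] (inclusive), 0 < i <= j < n - 1."""
    a, b = tour[i - 1], tour[i]
    c, d = tour[j], tour[j + 1]
    return dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]


def accept(delta, temperature, rng):
    """Metropolis rule: always accept improvements, sometimes accept uphill moves."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def moves_conflict(move_a, move_b):
    """Assumed conflict rule: a move (i, j) touches positions i-1 .. j+1,
    and two moves conflict when those position ranges overlap."""
    (i1, j1), (i2, j2) = move_a, move_b
    return not (j1 + 1 < i2 - 1 or j2 + 1 < i1 - 1)


rng = random.Random(0)
print(moves_conflict((1, 2), (5, 6)), accept(-1.0, 1.0, rng))
```

Two workers holding non-conflicting moves could evaluate and apply them in either order with the same result, which is the serialisability property the abstract refers to.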
SCAN is a special purpose context-free language which describes and generates a wide range of array accessing algorithms from a short set of simple ones. These algorithms may represent scan techniques for image processing, but at the same time they stand as generic data accessing strategies. In this paper we present two schemes (one sequential and one parallel) which implement the SCAN language and compare their memory requirements and execution time.
The proposed spectral element method implementation is based on sparse matrix storage of local shape function derivatives calculated at Gauss-Lobatto-Legendre points. The algorithm utilizes two basic operations: multiplication of a sparse matrix by a vector and element-by-element vector multiplication. Compute-intensive operations are performed for a part of the equation of motion derived at the degree-of-freedom level of 3D isoparametric spectral elements. The assembly is performed at the force vector in such a way that atomic operations are minimized. This is achieved by a new mesh coloring technique. The proposed parallel implementation of the spectral element method on GPU is applied for the first time to Lamb wave simulations. It has been found that computation on a multicore GPU is up to 14 times faster than on a single CPU. Copyright (C) 2015 John Wiley & Sons, Ltd.
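The two basic operations named above can be sketched in a few lines. The abstract does not say which sparse storage scheme is used, so the compressed sparse row (CSR) layout and the names `csr_matvec` and `elementwise_product` are assumptions for illustration, and this sequential version only shows what each GPU thread group would compute.

```python
def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR-stored sparse matrix by a dense vector x.

    data    : nonzero values, row by row
    indices : column index of each nonzero
    indptr  : indptr[r]..indptr[r+1] delimits the nonzeros of row r
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def elementwise_product(u, v):
    """Element-by-element multiplication of two equally long vectors."""
    return [a * b for a, b in zip(u, v)]


# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
print(csr_matvec(data, indices, indptr, [1.0, 1.0, 1.0]))
```

Each output row depends only on its own slice of `data`, so rows can be processed independently, which is what makes the operation a natural fit for a GPU.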
Nearly bandlimited signals play an important role in the biomedical signal processing community. The common method to analyze these signals is the empirical mode decomposition approach, which decomposes non-stationary signals into sums of intrinsic mode functions. However, this method is computationally demanding. A natural idea to reduce the computational cost is block processing. However, a severe boundary effect occurs due to the discontinuities between two consecutive blocks. To solve this problem, this paper proposes to realize the parallel implementation via a polyphase representation. That is, the empirical mode decomposition is applied to each polyphase component of the original signal. The sub-signals are then combined after upsampling. The simulation results show that the approximate intrinsic mode functions obtained by the proposed method are, both qualitatively and quantitatively, very close to the true intrinsic mode functions. Moreover, unlike the conventional block processing method, which suffers significantly from the boundary effect, the proposed method does not have this issue.
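The polyphase split and recombination described above can be sketched as follows. The empirical mode decomposition itself is not implemented here; the sketch only shows how a signal is divided into M polyphase components that could be processed independently and then interleaved back, which is equivalent to upsampling each component, delaying it, and summing. The function names are hypothetical.

```python
def polyphase_split(signal, m):
    """Return the m polyphase components: component k holds samples k, k+m, k+2m, ..."""
    return [signal[k::m] for k in range(m)]


def polyphase_merge(components):
    """Interleave the components back into one signal (inverse of polyphase_split)."""
    m = len(components)
    n = sum(len(c) for c in components)
    out = [0.0] * n
    for k, comp in enumerate(components):
        out[k::m] = comp  # place component k at positions k, k+m, k+2m, ...
    return out


x = list(range(10))
parts = polyphase_split(x, 3)
# Each entry of `parts` could be handed to a separate worker at this point.
print(parts, polyphase_merge(parts) == x)
```

Unlike consecutive blocks, every polyphase component spans the whole duration of the signal, which is why no block boundaries are introduced inside the record.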
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of a symmetric matrix for real and complex data types. These programs are constructed based on the fast BLAS library of Dawning-1000 under NX *** comparison results under different parallel environments and implementing methods are also given for Cholesky factorization. The execution time, measured performance and speedup for each problem on Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached, respectively.
The NIST (National Institute of Standards and Technology) statistical test suite, recognized as the most authoritative, is widely used in verifying the randomness of binary sequences. The Non-overlapping Template Matching Test, the 7th test of the NIST Test Suite, is remarkably time consuming, and its slow performance is one of the major hurdles in the testing process. In this paper, we present an efficient bit-parallel matching algorithm and a segmented scan-based strategy for execution on a Graphics Processing Unit (GPU) using NVIDIA Compute Unified Device Architecture (CUDA). Experimental results show a significant performance improvement for the parallelized Non-overlapping Template Matching Test: the running speed is 483 times faster than the original NIST implementation, without degrading the accuracy of the test result.
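For reference, the counting step that the test performs can be written sequentially as below. This is a plain baseline sketch, not the bit-parallel GPU algorithm of the paper: it scans each block, jumps past a template when it matches, and forms the chi-square statistic from the block mean and variance given in NIST SP 800-22. The bit-string representation and the function names are assumptions for illustration, and the final p-value (an incomplete gamma function) is left out.

```python
def count_non_overlapping(block, template):
    """Count non-overlapping occurrences of template in block (both bit strings)."""
    m = len(template)
    count = 0
    i = 0
    while i <= len(block) - m:
        if block[i:i + m] == template:
            count += 1
            i += m      # skip past the match so occurrences cannot overlap
        else:
            i += 1
    return count


def template_test_statistic(bits, template, num_blocks):
    """Return the per-block match counts and the chi-square statistic."""
    m = len(template)
    block_len = len(bits) // num_blocks
    counts = [
        count_non_overlapping(bits[k * block_len:(k + 1) * block_len], template)
        for k in range(num_blocks)
    ]
    mean = (block_len - m + 1) / 2 ** m
    variance = block_len * (1 / 2 ** m - (2 * m - 1) / 2 ** (2 * m))
    chi2 = sum((w - mean) ** 2 for w in counts) / variance
    return counts, chi2


print(template_test_statistic("10100100101110010110", "001", 2))
```

The blocks are independent of one another, which is the property a parallel implementation can exploit.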
Human vision has been studied in depth in recent years, and several different models have been proposed to simulate it on a computer. Some of these models concern visual saliency, which is potentially of great interest in many applications such as robotics, image analysis, compression, and video indexing. Unfortunately, they are compute-intensive and have tight real-time requirements. Among all the existing models, we have chosen a spatio-temporal one combining static and dynamic information. We propose in this paper a very efficient multi-GPU implementation of this model that reaches real-time performance. We present the algorithms of the model as well as several parallel optimizations on GPU with higher precision and execution time results. The real-time execution of this multi-path model on multi-GPU makes it a powerful tool to facilitate many vision-related applications.
The conjugate gradient algorithm is well-suited for vector computation but, because of its many synchronization points and relatively short message packets, is more difficult to implement for parallel computation. In this work we introduce a parallel implementation of the block conjugate gradient algorithm. In this algorithm, we carry a block of vectors along at each iteration, reducing the number of iterations and increasing the length of each message. On machines with relatively costly message passing, this algorithm is a significant improvement over the standard conjugate gradient algorithm.