This paper discusses the parallel solution of symmetric eigenproblems, including both standard and generalized eigenvalue problems. For standard eigenvalue problem and tridiagonal eigenvalue problem is not the key point...
Window-based parallel architectures are considered here as target structures for the computation of low- and medium-level image processing algorithms. Their definition stems from a general reformulation of algorithms, ...
ISBN: 0769515126 (print)
This paper concerns parallel computing and its application to computing the full set of Lyapunov exponents of general nonlinear, parameter-dependent, continuous ordinary differential equations. Based on the standard serial algorithm developed by Wolf et al. [1], we present a parallel algorithm using the block-cyclic decomposition method and apply it to compute the Lyapunov exponents of a continuous differential equation. Performance tests on the DAWNING-2000II supercomputer show that the parallel algorithm has a high degree of parallelism, needs little or no message passing (little communication cost), and performs little I/O. In addition, the algorithm can be extended to high-dimensional ordinary differential equations of any dimension.
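As a point of reference for the serial computation that the abstract parallelizes, the following is a minimal sketch of the standard re-orthonormalization approach to the full Lyapunov spectrum: the state and a set of tangent vectors are integrated together, the tangent vectors are re-orthonormalized at every step (here with a QR factorization, which is equivalent to Gram-Schmidt), and the logarithms of the stretching factors are averaged. It is not the paper's parallel algorithm and does not show the block-cyclic decomposition. The function name `lyapunov_spectrum`, the RK4 integrator, and the linear test system are assumptions chosen for illustration.

```python
import numpy as np

def lyapunov_spectrum(f, jac, x0, dt, n_steps):
    """Estimate all Lyapunov exponents of x' = f(x).

    f(x) returns the vector field and jac(x) its Jacobian matrix.
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    Q = np.eye(n)            # orthonormal tangent vectors
    log_sums = np.zeros(n)   # accumulated log stretching per direction

    def rhs(state, tangent):
        # Coupled system: state equation and linearized (tangent) equation.
        return f(state), jac(state) @ tangent

    for _ in range(n_steps):
        # Classical RK4 step for the state and the tangent vectors together.
        k1x, k1y = rhs(x, Q)
        k2x, k2y = rhs(x + 0.5 * dt * k1x, Q + 0.5 * dt * k1y)
        k3x, k3y = rhs(x + 0.5 * dt * k2x, Q + 0.5 * dt * k2y)
        k4x, k4y = rhs(x + dt * k3x, Q + dt * k3y)
        x = x + dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        Y = Q + dt / 6.0 * (k1y + 2 * k2y + 2 * k3y + k4y)

        # Re-orthonormalize; |diag(R)| holds the per-direction growth factors.
        Q, R = np.linalg.qr(Y)
        log_sums += np.log(np.abs(np.diag(R)))

    return log_sums / (n_steps * dt)

# Illustrative check on a linear system x' = A x (an assumption, not from
# the paper): its exponents are the real parts of the eigenvalues, -1 and -2.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
exps = lyapunov_spectrum(lambda x: A @ x, lambda x: A, [1.0, 0.0], 0.01, 5000)
```

For this test system the estimates approach -1 and -2 as the integration time grows, and their sum stays close to the trace of `A`.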
As a classical method of image segmentation in mathematical morphology, the watershed transform has been applied successfully in fields such as remote sensing image processing, biomedical and computer vision appli...
In this paper, we introduce a formal approach for synthesis of parallel architectures. Four different forms are used to express the given algorithms: simultaneous recursion, recursion with respect to different variabl...
ISBN: 9780769548982; 9781467345668 (print)
In this study, we parallelize the D&C algorithm with CUDA. Instead of using recursive programming for D&C, the recursive stack is implemented on the host side (CPU) and the merge operation is executed on the GPU in parallel. Since the recursive stack is a full binary tree in this algorithm, the merge operations on the nodes in each layer of the binary tree can be performed synchronously. In this data-parallel computation, with careful management of the data structure, the data of each node can be arranged in the same block and no data needs to be shared between threads, so the parallelism is not broken.
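The layer-by-layer structure described above can be illustrated with a small sketch. The abstract does not say which problem the D&C algorithm solves, so sorting is used here purely as a stand-in, and plain Python replaces CUDA. The host-side loop walks the binary tree one layer at a time, and the merges inside a layer are independent of each other, which is the property that would let each one be assigned to its own GPU thread block. The name `layered_merge_sort` is hypothetical.

```python
from heapq import merge

def layered_merge_sort(values):
    """Sort by merging the nodes of a binary tree one layer at a time.

    Returns the sorted list and the number of layers processed.
    """
    runs = [[v] for v in values]   # leaves of the tree
    layers = 0
    while len(runs) > 1:
        # Every pair in this layer touches only its own data, so all of
        # these merges could run at the same time (one per thread block).
        pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
        merged = [list(merge(a, b)) for a, b in pairs]
        if len(runs) % 2:
            merged.append(runs[-1])   # odd node is carried up unchanged
        runs = merged
        layers += 1
    return (runs[0] if runs else []), layers

result, depth = layered_merge_sort([5, 2, 9, 1, 7, 3, 8, 4])
```

With eight inputs the tree has three merge layers, so the host issues three rounds of independent merges.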
ISBN: 9781467375894 (print)
Image processing algorithms are widely used in the automotive field for ADAS (Advanced Driver Assistance Systems) purposes. To embed these algorithms, semiconductor companies offer heterogeneous architectures composed of different processing units, often including a massively parallel computing unit. However, embedding complex algorithms on these SoCs (Systems on Chip) remains a difficult task: because of the heterogeneity, it is not easy to decide how to allocate the parts of a given algorithm to the processing units of a given SoC. To help the automotive industry embed algorithms on heterogeneous architectures, we propose a novel approach to predict the performance of image processing algorithms on the different computing units of a given heterogeneous SoC. Our methodology predicts an execution-time interval of varying width, with an associated degree of confidence, using only a high-level description of the algorithms to embed and a few characteristics of the computing units.
In this paper some implicit domain decomposition procedures for solving parabolic problems are proposed. In these methods, the classic implicit scheme is used in each sub-domain, and Dirichlet boundary values at the (...
ISBN: 0769515126 (print)
In this paper, an improved version of the BiCGStab method (IBiCGStab) for the solution of large, sparse linear systems of equations with unsymmetric coefficient matrices is proposed. The method combines elements of numerical stability and parallel algorithm design without increasing the computational cost. The algorithm is derived such that all inner products of a single iteration step are independent, and the communication time required for the inner products can be overlapped efficiently with the computation time of the vector updates. Therefore, the cost of global communication, which is the bottleneck of parallel performance, can be significantly reduced. The resulting IBiCGStab algorithm maintains the favorable properties of the original method while not increasing computational costs. A data distribution suitable for both irregularly and regularly structured matrices, based on an analysis of the non-zero matrix elements, is presented. The communication scheme is supported by overlapping the execution of computation and communication to reduce waiting times. The efficiency of the method is demonstrated by numerical experiments carried out on a massively parallel distributed-memory system.
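To make the communication bottleneck concrete, the sketch below shows the standard (unimproved) BiCGStab iteration in serial NumPy, with comments marking where inner products occur. On a distributed-memory machine each marked point is a global reduction that must complete before the next step can start, which is the dependency the abstract says IBiCGStab removes. This is the baseline method only; the restructured IBiCGStab recurrences are not given in the abstract and are not reproduced here. The test matrix and tolerance are assumptions for illustration.

```python
import numpy as np

def bicgstab(A, b, tol=1e-10, max_iter=200):
    """Standard BiCGStab for A x = b; returns (x, iterations used)."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b - A @ x
    r_hat = r.copy()                 # fixed shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    b_norm = np.linalg.norm(b) or 1.0

    for k in range(1, max_iter + 1):
        rho_new = r_hat @ r                      # global reduction 1
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho_new / (r_hat @ v)            # global reduction 2
        s = r - alpha * v
        if np.linalg.norm(s) <= tol * b_norm:    # converged at the half step
            return x + alpha * p, k
        t = A @ s
        omega = (t @ s) / (t @ t)                # global reduction 3 (two dots)
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) <= tol * b_norm:    # norm is a reduction as well
            return x, k
    return x, max_iter

# Small unsymmetric, diagonally dominant system (illustrative values).
A = np.array([[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
x, iters = bicgstab(A, b)
```

Each reduction above depends on a vector produced just before it, so in this form the reductions cannot be merged or hidden behind the vector updates.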
ISBN: 0769515126 (print)
We propose an improved version of the CGS method for the solution of large, sparse linear systems of equations with unsymmetric coefficient matrices. The proposed method combines elements of numerical stability and parallel algorithm design without increasing computational costs. The algorithm is derived such that all matrix-vector multiplications, inner products, and vector updates of a single iteration step are independent, and the communication time required for the inner products can be overlapped efficiently with the computation time of the vector updates. Therefore, the cost of global communication, which is the bottleneck of performance, can be significantly reduced. In this paper, the Bulk Synchronous Parallel (BSP) model is used to design a fully efficient, scalable, and portable parallel version of the proposed algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures, including the Cray T3D, the Parsytec, and a cluster of workstations connected by Ethernet. This performance model uses only a few system-dependent parameters, based on simple and accurate cost modelling, to provide useful insight into the time complexity of the method. The theoretical performance predictions are compared with some preliminary measured timing results for a numerical application from ocean flow simulation.
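The kind of prediction the BSP model supports can be sketched with its standard cost formula: a superstep with maximum local work w and maximum per-processor communication volume h costs w + h·g + l, where g is the machine's per-word communication cost and l its synchronization latency, and a program's cost is the sum over its supersteps. The function name `bsp_cost` and all numbers below are hypothetical; they are not parameters or measurements from the paper.

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost of a program.

    supersteps: iterable of (w, h) pairs, where w is the maximum local
    computation and h the maximum number of words sent or received by
    any processor in that superstep.
    g: communication cost per word; l: barrier synchronization cost.
    """
    return sum(w + h * g + l for w, h in supersteps)

# Hypothetical iteration: one compute-heavy superstep with a halo exchange,
# followed by a superstep dominated by a single-word reduction.
steps = [(1000, 10), (500, 1)]
total = bsp_cost(steps, g=4, l=100)
```

Because only g and l change between machines, the same list of supersteps can be re-evaluated for each target architecture, and merging supersteps reduces the number of times l is paid.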