ISBN: 9783642131189 (print)
In this paper we report on recent progress in computing bivariate polynomial resultants on Graphics Processing Units (GPUs). Given two polynomials in Z[x, y], our algorithm first maps the polynomials to a prime field. Then, each modular image is processed individually. The GPU evaluates the polynomials at a number of points and computes univariate modular resultants in parallel. The remaining "combine" stage of the algorithm is executed sequentially on the host machine. Porting this stage to the graphics hardware is an object of ongoing research. Our algorithm is based on an efficient modular arithmetic from [1]. With the theory of displacement structure we have been able to parallelize the resultant algorithm up to a very fine scale suitable for realization on the GPU. Our benchmarks show a substantial speed-up over a host-based resultant algorithm [2] from CGAL (***).
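The evaluate, solve and combine pipeline described in this abstract can be illustrated with a small serial sketch for a single prime. This is not the authors' method: it swaps in a plain Sylvester-matrix determinant for the displacement-structure algorithm, runs on the CPU, and omits the Chinese-remainder step across several primes. The prime `P`, the data layout and all function names are assumptions made for illustration. It needs Python 3.8 or later for `pow(x, -1, p)`.

```python
P = 2**31 - 1  # assumed prime modulus (a Mersenne prime), chosen only for illustration

def eval_in_y(f, y0, p):
    """f[i] holds the coefficients in y (low degree first) of x**i; returns coefficients in x."""
    out = []
    for coeff in f:
        acc = 0
        for c in reversed(coeff):  # Horner's rule in y
            acc = (acc * y0 + c) % p
        out.append(acc)
    return out

def sylvester(a, b):
    """Sylvester matrix of a and b (low degree first), sized by their formal degrees."""
    m, n = len(a) - 1, len(b) - 1
    hi_a, hi_b = a[::-1], b[::-1]
    rows = [[0] * i + hi_a + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + hi_b + [0] * (m - 1 - i) for i in range(m)]
    return rows

def det_mod(mat, p):
    """Determinant modulo a prime p by Gaussian elimination."""
    mat = [[v % p for v in row] for row in mat]
    n, det = len(mat), 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if mat[r][col]), None)
        if pivot is None:
            return 0
        if pivot != col:
            mat[col], mat[pivot] = mat[pivot], mat[col]
            det = -det
        det = det * mat[col][col] % p
        inv = pow(mat[col][col], -1, p)
        for r in range(col + 1, n):
            factor = mat[r][col] * inv % p
            for c in range(col, n):
                mat[r][c] = (mat[r][c] - factor * mat[col][c]) % p
    return det % p

def interpolate(xs, ys, p):
    """Lagrange interpolation mod p; returns coefficients, low degree first."""
    n = len(xs)
    coeffs = [0] * n
    for i in range(n):
        num, denom = [1], 1
        for j in range(n):
            if j == i:
                continue
            new = [0] * (len(num) + 1)  # multiply num by (y - xs[j])
            for k, c in enumerate(num):
                new[k] = (new[k] - c * xs[j]) % p
                new[k + 1] = (new[k + 1] + c) % p
            num = new
            denom = denom * (xs[i] - xs[j]) % p
        scale = ys[i] * pow(denom, -1, p) % p
        for k, c in enumerate(num):
            coeffs[k] = (coeffs[k] + scale * c) % p
    return coeffs

def resultant_mod_p(f, g, p=P):
    """Resultant of f and g with respect to x, as a polynomial in y with coefficients mod p."""
    m, n = len(f) - 1, len(g) - 1
    deg_y_f = max(len(c) - 1 for c in f)
    deg_y_g = max(len(c) - 1 for c in g)
    bound = n * deg_y_f + m * deg_y_g  # degree bound of the resultant in y
    points = list(range(bound + 1))
    # Each evaluation point is independent; this is the part a GPU would run in parallel.
    values = [det_mod(sylvester(eval_in_y(f, y0, p), eval_in_y(g, y0, p)), p)
              for y0 in points]
    return interpolate(points, values, p)  # the sequential "combine" stage

# f = x^2 + y^2 - 1 and g = x - y; the resultant in x is 2*y^2 - 1.
f = [[-1, 0, 1], [0], [1]]
g = [[0, -1], [1]]
```

Because the Sylvester matrix keeps the formal degrees in x, the determinant at each point equals the resultant evaluated at that point, so no evaluation point has to be discarded in this sketch.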
ISBN: 9783642131356 (print)
The article is concerned with a parallel iterative domain decomposition algorithm for high-order finite element solutions of the Helmholtz wave equation. The iteration is performed in a block-Jacobi manner. For the interface operator, a Robin interface boundary condition is employed in a modified form which allows possible discontinuities of the discrete normal flux on the subdomain interfaces. The convergence of the algorithm is analyzed using energy estimates. Numerical results are given to show the effectiveness and parallel efficiency of the algorithm for the simulation of high-frequency waves in heterogeneous media in two-dimensional space. The algorithm is carried out on a 16-node Linux cluster; a parallel efficiency of more than 97% has been observed for all tested problems.
ISBN: 0818672552 (print)
In this paper we present our experience in implementing several irregular problems using a high-level actor language. The problems studied require dynamic computation of object placement and may result in load imbalance as the computation proceeds, thereby requiring dynamic load balancing. The algorithms are expressed as fine-grained computations, providing maximal flexibility in adapting the computation load to arbitrary parallel architectures. Such an algorithm may be composed with different partitioning and distribution strategies (PDSs) to yield different performance characteristics. The PDSs are implemented for specific data structures or algorithms and are reusable for different parallel algorithms. We demonstrate how our methodology provides portability of algorithm specification, reusability and ease of expressibility.
ISBN: 9780769533520 (print)
Clusters built from single-core systems are cost-effective in terms of performance improvement and availability. However, hardware constraints limit the performance of single-core systems. Hence, it is difficult to meet the increasingly high performance requirements of diversified applications at different levels of general-purpose computing. A promising and feasible solution is the novel multi-core systems, which extend parallelism to the CPU level by integrating multiple processing units on a single die. This paper uses the Finite-Difference Time-Domain (FDTD) algorithm as a case study, designing suitable parallel FDTD algorithms for three architectures: distributed-memory machines with single-core processors, shared-memory machines with dual-core processors, and the Cell Broadband Engine (Cell/B.E.) processor with nine heterogeneous cores. The experimental results show that the Cell/B.E. processor using 8 SPEs achieves significant speedups: it is 7.05 times faster than an AMD single-core Opteron processor and 3.37 times faster than an AMD dual-core Opteron processor at the processor level.
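To make the structure of the FDTD update concrete, here is a minimal serial sketch of a one-dimensional scheme in normalized units. The grid size, the Courant factor of 0.5, the Gaussian source and the function name are all assumptions for illustration; the abstract does not describe the authors' discretization or any of their three parallel designs.

```python
import math

def fdtd_1d(n_cells, n_steps, source_cell):
    """Leapfrog update of the electric field ez and magnetic field hy on a 1D grid."""
    ez = [0.0] * n_cells
    hy = [0.0] * n_cells
    for t in range(n_steps):
        # Magnetic half-step: each cell reads only ez, so cells are independent.
        for k in range(n_cells - 1):
            hy[k] += 0.5 * (ez[k + 1] - ez[k])
        # Electric half-step: each cell reads only hy, so cells are independent.
        for k in range(1, n_cells):
            ez[k] += 0.5 * (hy[k] - hy[k - 1])
        # Assumed additive Gaussian pulse source.
        ez[source_cell] += math.exp(-((t - 30.0) / 10.0) ** 2)
    return ez

field = fdtd_1d(n_cells=100, n_steps=20, source_cell=50)
```

Within each half-step a cell depends only on its nearest neighbours in the other field, which is why the grid can be split into contiguous chunks across nodes, cores or SPEs, with only the chunk boundaries exchanged between steps.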
Author: Huang, H
Chinese Acad Sci, Supercomp Ctr Comp Network Informat Ctr, Beijing 100080, Peoples R China
ISBN: 0769515126 (print)
In this paper, we use a new language, TPL (Tensor Product Language), to compute the Fast Fourier Transform. It can provide good performance and portability. We detail the method and the application of TPL to the FFT, and extend it to the Sande-Tukey FFT algorithm.
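For readers unfamiliar with the Sande-Tukey (decimation-in-frequency) FFT, the following is a minimal recursive radix-2 sketch in Python. It is not TPL and does not reproduce the paper's formulation; the function names are hypothetical, and the comments only indicate which tensor-product factor each step corresponds to. The input length is assumed to be a power of two.

```python
import cmath

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterflies across the two halves: the (F_2 tensor I_{n/2}) factor.
    top = [x[k] + x[k + half] for k in range(half)]
    # Twiddle factors applied to the lower half.
    bottom = [(x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
              for k in range(half)]
    # Two independent half-size transforms: the (I_2 tensor F_{n/2}) factor.
    even = fft_dif(top)      # yields X[0], X[2], X[4], ...
    odd = fft_dif(bottom)    # yields X[1], X[3], X[5], ...
    # Stride permutation: interleave the even- and odd-indexed outputs.
    out = [0] * n
    out[0::2] = even
    out[1::2] = odd
    return out

def dft_naive(x):
    """Direct O(n^2) DFT, used only to check the fast version."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

signal = [1, 2, 3, 4, 0, -1, 2.5, 7]
spectrum = fft_dif(signal)
```

The two recursive calls share no data, which is the property a tensor-product formulation exposes for parallel or vector code generation.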
Author: El Baz, D.
CNRS, LAAS, 7 Ave Colonel Roche, F-31077 Toulouse 4, France
ISBN: 9780769527840 (print)
The implementation of parallel asynchronous iterative algorithms on message passing architectures is considered. Several issues related to communication via message passing interfaces or libraries such as MPI-1, MPI-2, PVM or SHMEM are discussed in this survey paper. Practical implementations are proposed.
ISBN: 9781509028238 (print)
In this paper, we compare a wide range of accelerator architectures (GPUs from AMD and NVIDIA, the Xeon Phi, and a DSP), by means of a signal-processing pipeline that processes radio-telescope data. We discuss the mapping of the algorithms from this pipeline to the accelerators, and analyze performance. We also analyze energy efficiency, using custom-built, microcontroller-based power sensors that measure the instantaneous power consumption of the accelerators, at millisecond time scale. We show that the GPUs are the fastest and most energy efficient accelerators, and that the differences in performance and energy efficiency are large.
ISBN: 9781538672242 (print)
This paper is devoted to research on bitmap image processing based on wavelet functions. The Daubechies wavelet function was used as a mathematical model for filtering, compression and smoothing of two-dimensional signals, because an analysis of existing wavelet functions showed that the Daubechies wavelet family is the most effective for image processing. OpenMP parallel programming in C/C++ was used to parallelize the computations in image processing problems.
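As an illustration of the kind of transform involved, here is a minimal sketch of one level of the Daubechies D4 wavelet transform on a one-dimensional signal with periodic wrap-around. The abstract does not say which member of the Daubechies family was used, nor how boundaries were handled, so both choices are assumptions, as are the function names. The sketch is serial Python rather than the OpenMP C/C++ code the paper describes.

```python
import math

_S3 = math.sqrt(3.0)
_NORM = 4.0 * math.sqrt(2.0)
# D4 scaling (low-pass) filter and the matching wavelet (high-pass) filter.
H = [(1 + _S3) / _NORM, (3 + _S3) / _NORM, (3 - _S3) / _NORM, (1 - _S3) / _NORM]
G = [H[3], -H[2], H[1], -H[0]]

def d4_forward(x):
    """One transform level; len(x) must be even and at least 4."""
    n = len(x)
    half = n // 2
    # Every output index i is independent of the others, so this loop is the
    # natural candidate for a parallel-for style split.
    approx = [sum(H[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(half)]
    detail = [sum(G[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(half)]
    return approx, detail

def d4_inverse(approx, detail):
    """Inverse of d4_forward; relies on the orthogonality of the periodized filters."""
    half = len(approx)
    n = 2 * half
    x = [0.0] * n
    for i in range(half):
        for k in range(4):
            x[(2 * i + k) % n] += H[k] * approx[i] + G[k] * detail[i]
    return x

row = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = d4_forward(row)
restored = d4_inverse(approx, detail)
```

For a two-dimensional image, the same one-dimensional step would typically be applied to every row and then to every column. Smoothing or compression then amounts to shrinking or discarding small detail coefficients before inverting.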
ISBN: 0769515126 (print)
For a Markov cipher, the transition probability matrix is doubly stochastic. The eigenvalue of this matrix with maximum magnitude less than one plays an important role in designing Markov ciphers. This paper provides a parallel algorithm for computing this eigenvalue of the doubly stochastic matrix A of size 65535x65535, which comes from a shrunken model of a Markov cipher with 16-bit plaintext and ciphertext. An analysis of the complexity of the parallel algorithm is also given.
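One simple way to estimate the largest eigenvalue magnitude below one is power iteration restricted to vectors whose entries sum to zero, since a doubly stochastic matrix maps that subspace to itself and the eigenvalue 1 belongs to the all-ones vector. The sketch below uses that technique; it is not the paper's algorithm, which the abstract does not describe. It assumes the sought eigenvalue is real and strictly dominant in magnitude on the zero-sum subspace, which holds for the symmetric example matrix. The function name and the matrix are hypothetical.

```python
import math

def second_eigenvalue_magnitude(a, iterations=200):
    """Estimate the largest eigenvalue magnitude of a on the zero-sum subspace."""
    n = len(a)
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    estimate = 0.0
    for _ in range(iterations):
        # Remove the component along the all-ones vector (eigenvalue 1).
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
        # Matrix-vector product: each row is independent of the others,
        # so rows can be distributed across processors.
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        estimate = math.sqrt(sum(x * x for x in w))
        v = w
    return estimate

# Symmetric doubly stochastic example with eigenvalues 1, 0.5, 0.5 and 0.
A = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
lam = second_eigenvalue_magnitude(A)
```

For a matrix of the size quoted in the abstract, the dense product costs on the order of n squared operations per iteration, so the row-wise product is where parallel work would concentrate in this sketch.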
ISBN: 9781450347860 (print)
In order to reduce the real-time processing gap of distributed algorithms on smart camera networks, we have evaluated a parallel processing method which exploits the processing capabilities of free cameras as auxiliaries for busy ones. The communication infrastructure and camera processing capabilities are not included in traditional evaluation methods such as datasets or even virtual reality tools, so these methods are not suitable for analyzing the complexity of distributed algorithms. For this purpose, we have developed a modular framework named CAM-DIST, based on the OMNET++ simulation environment, in which the velocity model of soccer players is emulated. Simulation results show that parallel processing improves overall efficiency but could have side effects on individual camera estimations, so choosing an optimal sharing value and destinations will enhance performance.