Complex number multiplication is one of the most important arithmetic operations in signal processing. The paper proposes a design for a high-speed 8-bit complex number multiplier where the multiplication process is carr...
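The abstract is truncated before the design details, but the underlying arithmetic is fixed: (a + jb)(c + jd) = (ac - bd) + j(ad + bc). A minimal Python sketch of Gauss's three-multiplication reduction, a common trick in complex-multiplier hardware (whether the proposed design uses this reduction is not stated in the visible text):

```python
def cmul_gauss(a, b, c, d):
    """(a + jb) * (c + jd) using 3 multiplications instead of 4."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2        # (real part, imaginary part)

print(cmul_gauss(1, 2, 3, 4))  # (1+2j)(3+4j) = -5+10j → (-5, 10)
```

The trade of one multiplication for three extra additions matters in hardware, where multipliers dominate area and delay.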
ISBN (print): 9781450326070
The proceedings contain 7 papers. The topics discussed include: an interactive tool based on Polly for detection and parallelization of loops; on expressing strategies for directive-driven multicore programming models; effective platform-level exploration for heterogeneous multicores exploiting simulation-induced slacks; a cycle-accurate synthesizable MIPS simulator in Simulink; extending a run-time resource management framework to support OpenCL and heterogeneous systems; exploiting performance counters for energy efficient co-scheduling of mixed workloads on multi-core platforms; and fine-grained link locking within power and latency transaction level modelling in wormhole switching non-preemptive networks on chip.
Author:
Jeljeli, Hamza, Univ Lorraine
CARAMEL Project Team, LORIA, INRIA/CNRS, Campus Scientifique, BP 239, F-54506 Vandoeuvre-lès-Nancy, France
ISBN (digital): 9783319098739
ISBN (print): 9783319098739; 9783319098722
In cryptanalysis, solving the discrete logarithm problem (DLP) is key to assessing the security of many public-key cryptosystems. The index-calculus methods, which attack the DLP in multiplicative subgroups of finite fields, require solving large sparse systems of linear equations modulo large primes. This article deals with how we can run this computation on GPU- and multi-core-based clusters featuring InfiniBand networking. More specifically, we present the sparse linear algebra algorithms that are proposed in the literature, in particular the block Wiedemann algorithm. We discuss the parallelization of the central matrix-vector product operation from both algorithmic and practical points of view, and illustrate how our approach has contributed to the recent record-sized DLP computation in GF(2^809).
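The central operation named above, iterated inside the block Wiedemann algorithm, is a sparse matrix-vector product modulo a large prime. A minimal serial sketch in Python (not the paper's GPU/InfiniBand implementation; the CSR layout and names are illustrative):

```python
def spmv_mod(row_ptr, col_idx, values, x, p):
    """y = A @ x (mod p), with A stored in CSR (compressed sparse row) form."""
    n = len(row_ptr) - 1
    y = [0] * n
    for i in range(n):
        acc = 0
        # nonzeros of row i live in values[row_ptr[i] : row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc % p
    return y

# A = [[2, 0], [0, 3]] in CSR, x = [1, 1], p = 5
print(spmv_mod([0, 1, 2], [0, 1], [2, 3], [1, 1], 5))  # → [2, 3]
```

The rows are independent, which is what makes the product the natural unit of parallelization across GPU threads and cluster nodes.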
ISBN (digital): 9783319111940
ISBN (print): 9783319111940; 9783319111933
Matrix eigenvalue theory has become an important analysis tool in scientific computing. Sometimes one does not need all eigenvalues but only the maximum eigenvalue. Existing algorithms for finding the maximum eigenvalue of a matrix are implemented sequentially. As matrix orders increase, the computational workload grows, so traditional sequential methods cannot meet the need for fast calculation on large matrices. This paper proposes a parallel algorithm named PA-ST that finds the maximum eigenvalue of positive matrices using similarity transformations, implemented in CUDA (Compute Unified Device Architecture) on a GPU (Graphics Processing Unit). To the best of our knowledge, this is the first CUDA-based parallel algorithm for calculating the maximum eigenvalue of a matrix. To improve performance, several optimization techniques are applied: using shared memory rather than global memory to speed up computation, avoiding bank conflicts by setting a span index, satisfying the principle of coalesced memory access, and using single-precision floating-point arithmetic and pinned memory to reduce copy operations and obtain higher data-transfer bandwidth between the host and the GPU device. The experimental results show that the similarity-transformation technique significantly shortens the running time compared to the sequential algorithm, and that the speedup ratio is nearly stable as the number of iterations increases. As the matrix order increases, the running times of both the sequential algorithm and PA-ST increase correspondingly. Experiments also show that the speedup ratio of PA-ST is between 2.85 and 35.028.
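The abstract does not detail PA-ST's similarity transformation; the classical sequential baseline it competes with is power iteration, which for a positive matrix converges to the maximum (Perron) eigenvalue. A minimal, purely illustrative sketch:

```python
def power_iteration(A, iters=200):
    """Estimate the dominant eigenvalue of square matrix A (list of rows)."""
    n = len(A)
    x = [1.0] * n                  # positive start vector
    lam = 0.0
    for _ in range(iters):
        # y = A @ x
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(abs(v) for v in y)   # infinity-norm eigenvalue estimate
        x = [v / lam for v in y]       # renormalize to avoid overflow
    return lam

A = [[2.0, 1.0], [1.0, 2.0]]           # eigenvalues 3 and 1
print(power_iteration(A))  # → 3.0
```

The matrix-vector product inside the loop is the step that a GPU implementation would parallelize across threads.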
ISBN (print): 9781880843970
Separable 2-D transforms (such as the 2-D Fourier transform) are widely used in fields such as data analysis and image processing. Fast processors and divide-and-conquer algorithms have made these 2-D transforms accessible on desktop computers. Nonetheless, the widespread use of multi-core architectures makes significant efficiency improvements possible. In the past, parallel processing in C++ was restricted to external libraries, but the recent release of C++11 introduces concurrency constructs into the language itself, providing obvious benefits for software development, optimization, and portability. This paper examines the high-level concurrency interface in C++11 and demonstrates that significant efficiency gains are achievable for the 2-D Fourier transform on standard multi-core processors. This approach is readily extensible to other separable 2-D transforms such as wavelet transforms and the discrete cosine transform, and is equally applicable to other separable 2-D operations such as convolution and correlation. It promises to scale well to future multi-core processors as additional CPUs become available. Copyright ISCA, CAINE 2014.
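Separability means the 2-D transform factors into independent 1-D transforms over rows, then over columns, and each batch can run in parallel. A sketch of that row-column decomposition, using Python's thread pool in place of the paper's C++11 constructs and a naive 1-D DFT for brevity:

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft1d(v):
    """Naive O(n^2) 1-D discrete Fourier transform."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft2d(a):
    """Separable 2-D DFT: 1-D DFT of every row, then of every column."""
    with ThreadPoolExecutor() as pool:
        rows = list(pool.map(dft1d, a))           # rows are independent
        cols = list(pool.map(dft1d, zip(*rows)))  # columns are independent
    return [list(r) for r in zip(*cols)]          # transpose back

out = dft2d([[1.0, 2.0], [3.0, 4.0]])
print(out[0][0].real)  # DC term = sum of all samples → 10.0
```

The same two-pass structure applies unchanged to the DCT, wavelet transforms, and separable convolution; only `dft1d` is swapped out.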
ISBN (print): 9781577356608
Planners need to become faster as we seek to tackle increasingly complicated problems. Much of the recent improvement in computer speed is due to multi-core processors. For planners to take advantage of such architectures, we must adapt algorithms for parallel processing. There are a number of planning domains where state expansions are slow; one example is robot motion planning, where most of the time is devoted to collision checking. In this work, we present PA*SE, a novel parallel version of A* (and weighted A*) that parallelizes state expansions by taking advantage of this property. While getting close to a linear speedup in the number of cores, we still preserve the completeness and optimality of A* (bounded sub-optimality of weighted A*). PA*SE applies to any planning problem in which significant time is spent on generating successor states and computing transition costs. We present experimental results on a robot navigation domain (x, y, heading) that requires expensive 3D collision checking for the PR2 robot. We also provide an in-depth analysis of the algorithm's performance on a 2D navigation problem as we vary the number of cores (up to 32) as well as the time it takes to collision-check successors during state expansions.
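For reference, plain sequential A* on a 4-connected grid looks as follows; PA*SE's contribution (not shown here) is expanding several states concurrently while preserving these optimality guarantees. The grid domain and Manhattan heuristic are illustrative assumptions:

```python
import heapq

def astar(start, goal, passable):
    """Sequential A* with unit edge costs on a 4-connected grid."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # admissible heuristic
    open_list = [(h(start), 0, start)]   # (f, g, state)
    g = {start: 0}
    while open_list:
        _, cost, s = heapq.heappop(open_list)
        if s == goal:
            return cost
        # successor generation: the expensive step PA*SE parallelizes
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            t = (s[0] + dx, s[1] + dy)
            if passable(t) and cost + 1 < g.get(t, float("inf")):
                g[t] = cost + 1
                heapq.heappush(open_list, (g[t] + h(t), g[t], t))
    return None

free = lambda p: 0 <= p[0] < 5 and 0 <= p[1] < 5   # empty 5x5 grid
print(astar((0, 0), (4, 4), free))  # → 8
```

When each successor check is expensive (e.g. 3D collision checking), the inner loop dominates, which is exactly the regime where parallel expansion pays off.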
ISBN (print): 9783319016214
The proceedings contain 52 papers. The special focus in this conference is on Image Processing and Communications. The topics include: Adaptive windowed threshold for box counting algorithm in cytoscreening applications; corneal endothelial grid structure factor based on coefficient of variation of the cell sides lengths; a distributed approach for development of deformable model-based segmentation methods; enhancement of low-dose CT brain scans using graph-based anisotropic interpolation; using of EM algorithm to image reconstruction problem with tomography noises; the perception of humanoid robot by human; a fast histogram estimation based on the Monte Carlo method for image binarization; adaptation of the combined image similarity index for video sequences; time-frequency analysis of image based on Stockwell transform; 2DKLT-based color image watermarking; knowledge transformations applied in image classification task; problems of infrared and visible-light images automatic registration; semantics driven table understanding in born-digital documents; contextual possibilistic knowledge diffusion for images classification; image feature extraction using compressive sensing; a modification of the parallel spline interpolation algorithms; local eigen background subtraction; polarization imaging in 3D shape reconstruction; layer image components geometry optimization; extraction of data from limnigraf chart images; a simplified visual cortex model for efficient image coding and object recognition; fuzzy rule-based systems for optimizing power consumption in data centers; an algorithm for finding shortest path tree using ant colony optimization metaheuristic; and a sensor network gateway for IP network and SQL database.
ISBN (print): 9780735412125
The computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both, the goal is a particular field configuration: one yielding the design performance in synthesis, or one matching exterior measurements in NDE. The geometry of the design, or of the postulated interior defect, is then computed. Several optimization methods are available for this. The most efficient, like conjugate gradients, are very complex to program because of the required derivative information; the least efficient, zeroth-order algorithms like the genetic algorithm, take much computational time but little programming effort. This paper reports launching a genetic algorithm kernel on thousands of compute unified device architecture (CUDA) threads, exploiting the NVIDIA graphics processing unit (GPU) architecture. The efficiency of parallelization, although below that on shared-memory supercomputer architectures, is quite effective in cutting solution time down into the realm of the practicable. We carry this further into multi-physics electro-heat problems, where the parameters of description are in the electrical problem and the object function is in the thermal problem. Indeed, this is where the derivative of the object function in the heat problem with respect to the parameters in the electrical problem is most difficult to compute for gradient methods, and where the genetic algorithm is most easily implemented.
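The paper's CUDA kernel is not reproduced in the abstract. The serial sketch below shows why the zeroth-order approach parallelizes so easily: the per-individual fitness evaluations are independent and derivative-free, and it is exactly those evaluations that get mapped onto GPU threads. All names, operators, and parameters here are illustrative:

```python
import random

def ga_minimize(fitness, dim, pop=40, gens=60, seed=1):
    """Tiny elitist genetic algorithm minimizing `fitness` over R^dim."""
    rng = random.Random(seed)
    P = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=fitness)            # evaluation: the embarrassingly parallel step
        elite = P[: pop // 4]          # selection: keep the best quarter
        P = elite + [
            [g + rng.gauss(0, 0.3) for g in rng.choice(elite)]  # Gaussian mutation
            for _ in range(pop - len(elite))
        ]
    return min(P, key=fitness)

best = ga_minimize(lambda x: sum(v * v for v in x), dim=2)
print(best)  # a point near the optimum [0, 0]
```

Note that no gradient of `fitness` is ever required, which is the property the paper exploits for the electro-heat problem where derivatives are hardest to obtain.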
The necessity for thread-safe experiment software has recently become very evident, largely driven by the evolution of CPU architectures towards exploiting increasing levels of parallelism. For high-energy physics this represents a real paradigm shift, as concurrent programming was previously limited to special, well-defined domains like control software or software-framework internals. This paradigm shift, however, falls in the middle of the successful LHC programme, and many millions of lines of code have already been written without parallel execution in mind. In this paper we take a closer look at the offline processing applications of the LHC experiments and their readiness for the many-core era. We review how previous design choices impact the move to concurrent programming. We present our findings on transforming parts of the LHC experiment reconstruction software to thread-safe code, and the main design patterns that have emerged during the process. A plethora of parallel-programming patterns are well known outside the HEP community, but only a few have turned out to be straightforward enough to suit non-expert physics programmers. Finally, we propose a potential strategy for migrating existing HEP experiment software to the many-core era.
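One pattern that such migrations commonly rely on (a generic illustration, not code from any LHC framework) keeps shared data immutable and gives each thread its own mutable scratch space, so no locking is needed on the hot path:

```python
import threading

CALIB = (1.0, 2.0, 3.0)       # immutable shared data: safe to read from any thread

_scratch = threading.local()   # mutable state: one private copy per thread

def process_event(samples):
    """Apply calibration to one event using thread-local scratch space."""
    if not hasattr(_scratch, "buf"):
        _scratch.buf = []
    _scratch.buf.clear()       # reuse without a lock: no other thread sees this list
    for s, c in zip(samples, CALIB):
        _scratch.buf.append(s * c)
    return sum(_scratch.buf)

# concurrent calls never share mutable state
threads = [threading.Thread(target=process_event, args=([1, 1, 1],)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(process_event([1, 1, 1]))  # → 6.0
```

The same split, read-only configuration plus per-thread (or per-event) mutable state, is what makes legacy reconstruction code safe to run on many cores without pervasive locking.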
ISBN (print): 9781479939756
Multi-processor systems are advantageous in that they allow concurrent execution of a given workload. The workload can be thought of as computation units, which can be either processes or tasks; these may be independent programs or partitioned modules of a single program. This paper presents an algorithm named "Successive Stage Multi Round Scheduling" that allocates and balances the given workload among the connected processing units of a multi-processor system in order to improve the efficiency of the system. Simulation results are obtained using the hypercube architecture, due to its simple design and high interconnectivity, and are compared with two existing schemes, namely "Minimum Distance Scheduling" and "Two Round Scheduling."
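The abstract does not give the internals of Successive Stage Multi Round Scheduling; as a point of reference, greedy least-loaded assignment (longest task first) is a standard baseline for balancing independent tasks across processors. Task costs here are hypothetical:

```python
import heapq

def balance(tasks, n_procs):
    """Assign each task to the currently least-loaded processor."""
    heap = [(0, p) for p in range(n_procs)]        # (current load, processor id)
    assignment = {p: [] for p in range(n_procs)}
    for t in sorted(tasks, reverse=True):          # longest-processing-time first
        load, p = heapq.heappop(heap)              # least-loaded processor
        assignment[p].append(t)
        heapq.heappush(heap, (load + t, p))
    return assignment

a = balance([7, 5, 4, 3, 1], n_procs=2)
print(sorted(sum(v) for v in a.values()))  # → [10, 10]
```

A hypercube-aware scheduler would additionally weigh the hop distance between communicating tasks, which this sketch ignores.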