ISBN: 0769520804 (print)
Driven by the ever-increasing algorithmic complexity in the field of mobile communication systems, SIMD DSP architectures have emerged as an approach that offers the necessary processing power at reasonable levels of die size and power consumption. However, this kind of DSP architecture poses new challenges for programmers, since algorithms have to be designed to exploit the parallelism available on the processor. Taking as a starting point an algebraic framework that captures the SIMD computational model, we report in this paper on our efforts to design and automatically generate object code for our family of DSP architectures, independent of the available SIMD parallelism. We show how these algebraic structures can be used as a high-level programming language that offers a unified approach to designing and describing algorithms using SIMD parallelism. Moreover, we show how these algebraic structures offer concise rules for automatic code generation.
ISBN: 9781450359863 (print)
Graph-specific computing with the support of dedicated accelerators has greatly improved graph processing in both efficiency and energy use. Nevertheless, their data conflict management is still sequential when a vertex needs a large number of conflicting updates at the same time, leading to prohibitive performance degradation. This is particularly true and serious when processing natural graphs. In this paper, we have the insight that the atomic operations for vertex updating in many graph algorithms (e.g., BFS, PageRank, and WCC) are typically incremental and simplex. This allows us to parallelize the conflicting vertex updates in an accumulative manner. We architect AccuGraph, a novel graph-specific accelerator that can simultaneously process atomic vertex updates for massive parallelism while ensuring correctness. A parallel accumulator is designed to remove the serialization in atomic protections for conflicting vertex updates by merging their results in parallel. Our implementation on a Xilinx FPGA with a wide variety of typical graph algorithms shows that our accelerator achieves an average throughput of 2.36 GTEPS as well as up to 3.14x speedup in comparison with the state-of-the-art ForeGraph (in its single-chip version).
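The accumulative idea in this abstract can be illustrated in software. The sketch below is not the AccuGraph hardware design; it only assumes that the update operator is associative and commutative (sum for PageRank-style accumulation, min for BFS/WCC-style relaxation), so that all updates aimed at one vertex can be merged pairwise in a reduction tree and written back once. The names `tree_reduce` and `apply_updates` are hypothetical.

```python
from collections import defaultdict
from operator import add


def tree_reduce(values, op):
    """Combine values pairwise, level by level, as a reduction tree would."""
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # the odd element is carried to the next level
        level = nxt
    return level[0]


def apply_updates(state, updates, op):
    """Merge all updates aimed at the same vertex, then write each vertex once."""
    by_vertex = defaultdict(list)
    for vertex, value in updates:
        by_vertex[vertex].append(value)
    for vertex, values in by_vertex.items():
        state[vertex] = op(state[vertex], tree_reduce(values, op))
    return state


# PageRank-style accumulation (sum): three conflicting updates hit vertex 1.
ranks = apply_updates([0.0, 0.0, 0.0],
                      [(1, 0.25), (1, 0.5), (2, 0.125), (1, 0.25)], add)

# BFS/WCC-style relaxation (min): three conflicting updates hit vertex 1.
levels = apply_updates([0, 9, 9], [(1, 3), (1, 1), (2, 2), (1, 2)], min)
```

Each level of the reduction tree contains only independent pairs, which is what would allow the merges to run concurrently instead of one atomic update at a time.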
ISBN: 9781479965694 (print)
Evaluating the randomness quality of a random number generator requires an efficient suite of statistical tests that takes advantage of the processing power of today's multi-core processors in order to cope with the large amount of data to be processed. While, in theory, most complex processing algorithms can be tuned for concurrent execution, the solution will eventually reach a state in which a compromise needs to be made between overall performance and the configurability and usability of the application. Our solution is based on completely redesigning the TestU01 architecture to include parallel computing as part of the general requirements, and not as a tool used for increasing performance. The design is implemented using concepts from the object-oriented paradigm and uses the .NET Task Parallel Library. Experimental results show that the parallel OOP-based implementation of the TestU01 library not only obtains results similar to those of the previous parallel version, but in some cases achieves a better speedup.
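The structure described here, independent statistical tests submitted as tasks to a shared pool, can be sketched as follows. This is a Python analogue using `concurrent.futures`, not the .NET Task Parallel Library code from the paper, and the two tests are simplified stand-ins rather than TestU01 tests; `monobit_statistic`, `runs_count` and `run_battery` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def monobit_statistic(bits):
    """Normalised excess of ones over zeros; close to 0 for a balanced stream."""
    s = sum(1 if b else -1 for b in bits)
    return abs(s) / len(bits) ** 0.5


def runs_count(bits):
    """Number of maximal runs of identical consecutive bits."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def run_battery(bits, tests):
    """Submit every test as its own task and collect the results by name."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, bits) for name, fn in tests.items()}
        return {name: fut.result() for name, fut in futures.items()}


TESTS = {"monobit": monobit_statistic, "runs": runs_count}
results = run_battery([1, 0, 1, 1, 0, 0, 1, 0], TESTS)
```

Because each test is an object in a dictionary rather than a hard-wired call, tests can be added or removed without touching the scheduling code. In standard CPython, threads show the task structure only; a process pool would be the usual choice for CPU-bound speedup.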
Two real-valued signal models based on selective spanning with fast enumeration (SSFE) and layered orthogonal lattice detector (LORD) algorithms are implemented on a Nvidia graphics processing unit (GPU). A 2 x 2 mult...
ISBN: 9783642246692 (digital)
ISBN: 9783642246685 (print)
This two-volume set, LNCS 7016 and LNCS 7017, constitutes the refereed proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2011, held in Melbourne, Australia, in October 2011. The second volume includes 37 papers from one symposium and three workshops held together with the ICA3PP 2011 main conference: 16 papers from the 2011 International Symposium on Advances of Distributed Computing and Networking (ADCN 2011), 10 papers from the 4th IEEE International Workshop on Internet and Distributed Computing Systems (IDCS 2011), 7 papers from the III International Workshop on Multicore and Multithreaded Architectures and Algorithms (M2A2 2011), and 4 papers from the 1st IEEE International Workshop on Parallel Architectures for Bioinformatics Systems (HardBio 2011).
Reconfigurable computing has been evolving as a new platform for satisfying the simultaneous demands for application performance and flexibility placed on the present-day DSP market. Since signal processing algorithm...
ISBN: 9783642246685 (print)
This paper compares three parallel point-multiplication algorithms on conic curves over the ring Zn. We propose an algorithm that parallelizes point multiplication by using the Chinese Remainder Theorem to divide a point multiplication over the ring Zn into two different point multiplications over finite fields, which are then computed separately. The time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research on the basic parallel algorithms in the conic curve cryptosystem. A quantitative performance analysis compares this algorithm with two other algorithms we designed before. The comparison demonstrates that the algorithm presented in this paper can reduce the time complexity of point multiplication on conic curves over the ring Zn and is more efficient than the preceding ones.
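The Chinese Remainder Theorem split can be illustrated with a simpler operation. The sketch below assumes n = p * q with distinct primes p and q, and uses modular exponentiation as a stand-in for point multiplication, since the abstract does not spell out the conic-curve group law. The function name `crt_pow` is hypothetical.

```python
def crt_pow(x, k, p, q):
    """Compute x**k mod (p*q) from two independent computations mod p and mod q."""
    r_p = pow(x, k, p)  # sub-problem over the field of p elements
    r_q = pow(x, k, q)  # sub-problem over the field of q elements, independent of r_p
    # Recombine: find r with r = r_p (mod p) and r = r_q (mod q).
    # pow(q, -1, p) is the modular inverse of q mod p (Python 3.8 or later).
    h = ((r_p - r_q) * pow(q, -1, p)) % p
    return r_q + q * h


# Small illustrative parameters; real moduli would be far larger.
p, q = 61, 53
split_result = crt_pow(65, 17, p, q)
direct_result = pow(65, 17, p * q)
```

The two sub-problems share no data until the final recombination, so they can be assigned to different processors, and each works with operands roughly half the size of n.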
ISBN: 9780769530895 (print)
This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data and, likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK, but expressed in a way that allows operations on the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices, but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors and an SMP architecture with 16 cores.
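The source of parallelism in a Givens-based approach can be seen in a scalar, unblocked sketch: each rotation touches only two rows, so rotations acting on disjoint row pairs do not depend on each other. The code below is a plain textbook Givens QR in pure Python, not the FLAME/FLASH implementation or the algorithm-by-blocks of the paper; `givens_qr` and `matmul` are hypothetical helper names.

```python
import math


def givens_qr(a):
    """Return (q, r) with q orthogonal, r upper triangular and q * r == a."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[float(i == j) for j in range(m)] for i in range(m)]
    for j in range(n):
        for i in range(m - 1, j, -1):  # zero column j from the bottom up
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue
            h = math.hypot(x, y)
            c, s = x / h, y / h
            # Rotate rows i-1 and i of r; no other row is read or written.
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of q.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r


def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


A = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
Q, R = givens_qr(A)
QR = matmul(Q, R)
```

Multiplying `Q` by `R` reproduces `A` up to rounding, and the entries of `R` below the diagonal are zero up to rounding.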
ISBN: 0818621338 (print)
It is shown how a parallel object model can be used as a support environment for massively parallel architectures based on transputer technology. The intention is to verify that parallelism integrates well with such properties of the object paradigm as abstraction, uniformity, and dynamicity. The authors also present guidelines for building prototypes with an approach based on primitives. In particular, the implemented primitives make possible the creation and communication of objects on a massively parallel architecture. Finally, trends in future work (static and dynamic allocation, replication, and persistency of objects) are outlined.
ISBN: 9781538674796 (print)
Sequence alignment is the most widely used operation in bioinformatics. With the exponential growth of biological sequence databases, searching a database to find the optimal alignment for a query sequence (which can be on the order of hundreds of millions of characters long) would require excessive processing power and memory bandwidth. Sequence alignment algorithms can potentially benefit from the processing power of massively parallel processors due to their simple arithmetic operations, coupled with the inherent fine-grained and coarse-grained parallelism that they exhibit. However, the limited memory bandwidth of conventional computing systems prevents them from reaching the maximum achievable speedup. In this paper, we propose a processing-in-memory architecture as a viable solution to the excessive memory bandwidth demand of bioinformatics applications. The design is composed of a set of simple and lightweight processing elements, customized to the sequence alignment algorithm and integrated at the logic layer of an emerging 3D DRAM architecture. Experimental results show that the proposed architecture achieves up to 2.4x speedup and a 41% reduction in power consumption compared to a processor-side parallel implementation.
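The fine-grained parallelism mentioned in this abstract can be made concrete with a small sketch. It assumes a Smith-Waterman-style local alignment recurrence with a linear gap penalty, which the abstract does not name, and the scoring values are illustrative defaults. The table is filled one anti-diagonal at a time, because cells on the same anti-diagonal do not depend on each other. The function name `sw_score_wavefront` is hypothetical.

```python
def sw_score_wavefront(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, filling the table one anti-diagonal at a time."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):  # d = i + j indexes the anti-diagonal
        # Cells with the same d depend only on diagonals d-1 and d-2, so this
        # inner loop could be distributed across simple processing elements.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          h[i - 1][j] + gap,      # gap in b
                          h[i][j - 1] + gap)      # gap in a
            best = max(best, h[i][j])
    return best


identical = sw_score_wavefront("ACGT", "ACGT")
unrelated = sw_score_wavefront("AAAA", "TTTT")
```

Each cell needs only a comparison, a few additions and a maximum, but reads three neighbouring cells, which is consistent with the abstract's point that memory bandwidth rather than arithmetic limits the speedup. Coarse-grained parallelism would come from aligning the query against many database sequences at once.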