As threads of execution in a multi-programmed computing environment have different characteristics and hardware resource requirements, heterogeneous multi-core processors can achieve higher performance as well as powe...
Computational challenges for the one-to-many and many-to-many protein structure comparison (PSC) problem are a result of several factors: constantly expanding large-size structural proteomics databases, high computati...
Graphics processing units (GPUs) have been explored as a new computing paradigm for accelerating computation-intensive applications. In particular, the combination of GPUs and CPUs has proved to be an effective so...
ISBN (print): 9781467344258
This paper presents Fast Fourier Transform (FFT) benchmark results that measure and compare the performance of various DSP and Intel processors for underwater signal processing applications. It aims to show the performance improvement of Intel processors over DSP processors when parallel programming is used to implement signal processing functions in real time. The results show a significant decrease in FFT execution time on an Intel-based multicore processor using parallel programming. The comparative analysis of different processor architectures presented here should therefore help system designers select an optimal processor for underwater signal processing applications.
Manycore architectures are gaining attention as a means to meet the performance and power demands of high-performance embedded systems. However, their widespread adoption is sometimes constrained by the need for maste...
Proximity queries consist of retrieving objects near a given query. To avoid a brute-force scan over a large database, an index can be used. However, for some problems, indexes are mostly useless (their running times...
In this paper we concentrate on embedded parallel architectures with heterogeneous memory management systems combining shared and local memories; more precisely, we focus on efficient data communications between the various parts of the architecture. We formulate explicit data transfers in a polyhedral context and give several strategies for managing efficient communications for redundantly stored/read data. This allows automatic DMA-style code generation for a variety of data mappings onto parallel processing elements. Our approach is validated on a wide series of data redistribution examples linked with SpearDE, a domain-specific parallelisation framework developed at Thales. We give the solution for efficient data transfers mathematically as well as in the form of generated C code.
Graph500 is a benchmark suite for big data analysis. Matrices used for Graph500 inherit the properties of graph analysis such as breadth-first search for SNS and PageRank for web search engines. Especially power sav...
The transfer-matrix technique is a convenient way to study strip lattices in the Potts model, since the computational cost depends only on the periodic part of the lattice and not on the whole. However, even with this reduced cost, the computation remains NP-hard, since the time T(|V|, |E|) needed to compute the matrix grows exponentially as a function of the graph width. In this work, we present a parallel transfer-matrix implementation that scales on multi-core architectures. The construction of the matrix is based on several repetitions of the deletion-contraction technique, which allows parallelism suited to multi-core machines. Our experimental results show that the multi-core implementation achieves speedups of 3.7X with p = 4 processors and 5.7X with p = 8. The efficiency of the implementation lies between 60% and 95%, with the best balance of speedup and efficiency at p = 4 processors on current multi-core architectures. The algorithm also takes advantage of the lattice symmetry, making the transfer-matrix computation run up to 2X faster than its non-symmetric counterpart and use up to a quarter of the original space.
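The deletion-contraction technique named in this abstract can be illustrated with a minimal sketch. It is not the paper's transfer-matrix construction or its parallel scheme: it only evaluates the Potts partition function Z(G; q, v) of a small graph directly, using the standard recursion Z(G) = Z(G - e) + v Z(G / e). The function name `potts_partition` and the edge-list representation are assumptions made for this illustration.

```python
def potts_partition(num_vertices, edges, q, v):
    """Potts partition function Z(G; q, v) by deletion-contraction.

    Z is the sum over edge subsets A of q**k(A) * v**len(A), where k(A)
    is the number of connected components of the graph (V, A).
    Illustrative sketch only: the running time is exponential in len(edges).
    """
    if not edges:
        # No edges left: every vertex is its own component.
        return q ** num_vertices
    (a, b), rest = edges[0], edges[1:]
    # Deletion: drop the edge, keep both endpoints.
    deleted = potts_partition(num_vertices, rest, q, v)
    if a == b:
        # A loop: contracting it is the same as deleting it.
        return (1 + v) * deleted
    # Contraction: merge vertex b into vertex a in the remaining edges.
    merged = [(a if x == b else x, a if y == b else y) for x, y in rest]
    contracted = potts_partition(num_vertices - 1, merged, q, v)
    return deleted + v * contracted


triangle = [(0, 1), (1, 2), (0, 2)]
print(potts_partition(3, triangle, q=2, v=1))   # 28
print(potts_partition(3, triangle, q=3, v=-1))  # 6 proper 3-colourings
```

The two recursive calls are independent of each other, which is the kind of structure a multi-core implementation can exploit; how the paper actually distributes the work is not stated in the abstract.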
ISBN (print): 9783642408199; 9783642408205
The LU decomposition is a widely used method for solving dense linear algebra problems in many scientific computation applications. In recent years, single instruction multiple data (SIMD) technology has become a popular way to accelerate the LU decomposition. However, pipeline parallelism and memory bandwidth utilization are low when the LU decomposition is mapped onto SIMD processors. This paper proposes a fine-grained pipelined implementation of LU decomposition on SIMD processors. The fine-grained algorithm exploits the data dependences of the native algorithm to expose fine-grained parallelism among all the computation resources. By transforming non-coalesced memory accesses into coalesced ones, the proposed algorithm achieves high pipeline parallelism and highly efficient memory access. Experimental results show that the proposed technique achieves a speedup of 1.04x to 1.82x over the native algorithm and reaches about 89% of the peak performance of the SIMD processor.
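For readers unfamiliar with the baseline, the following is a minimal sketch of a textbook LU decomposition (Doolittle form, no pivoting). It is not the fine-grained pipelined algorithm proposed in the paper, and the function name `lu_decompose` is an assumption made for this illustration. The innermost loop is the row update that SIMD hardware would typically process several elements at a time.

```python
def lu_decompose(a):
    """Factor a square matrix as a = lower * upper (Doolittle, no pivoting).

    Illustrative sketch only: it assumes every pivot is non-zero and
    raises ZeroDivisionError otherwise.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(x) for x in row] for row in a]
    for k in range(n):
        pivot = upper[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        for i in range(k + 1, n):
            factor = upper[i][k] / pivot
            lower[i][k] = factor
            # Row update: the same operation applied across a whole row,
            # which is the part that maps naturally onto SIMD lanes.
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


lower, upper = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
print(lower)  # [[1.0, 0.0], [1.5, 1.0]]
print(upper)  # [[4.0, 3.0], [0.0, -1.5]]
```

Each elimination step k depends on the rows updated in step k - 1, which is the data dependence the abstract refers to when it discusses pipelining.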