ISBN: 9783540693833 (print)
Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonistic criteria should be optimized, such as throughput/period and latency (or a combination). Typical applications include digital image processing, where images are processed in steady-state mode. In this paper, we study the bi-criteria mapping (minimizing period and latency) of the JPEG encoding on a cluster of workstations. We present an integer linear programming formulation for this NP-hard problem, and we present an in-depth performance evaluation of several polynomial heuristics.
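The two criteria named in this abstract can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from the paper: each processor runs one interval of consecutive pipeline stages, communication costs are ignored, and the names `work`, `speed`, and `intervals` are hypothetical. Under those assumptions the period is the cycle time of the slowest processor, and the latency is the sum of all cycle times.

```python
from typing import List, Tuple


def period_and_latency(work: List[float], speed: List[float],
                       intervals: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Evaluate an interval mapping of a linear pipeline.

    work[i]      -- computation cost of stage i (hypothetical units)
    speed[p]     -- speed of processor p
    intervals[p] -- (first, last) inclusive stage indices run on processor p
    Communication costs are ignored in this simplified model.
    """
    times = [sum(work[first:last + 1]) / speed[p]
             for p, (first, last) in enumerate(intervals)]
    period = max(times)   # one result leaves the pipeline every `period` time units
    latency = sum(times)  # time for a single data set to traverse every stage
    return period, latency


# Four stages mapped onto three processors.
print(period_and_latency([4, 2, 6, 8], [2, 1, 4], [(0, 1), (2, 2), (3, 3)]))
```

Shortening the slowest interval lowers the period but may raise the latency, which is why the two criteria are treated as antagonistic.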
ISBN: 9783540705840 (print)
Multimedia is a key element in human-computer interaction systems. Multimedia applications, however, are among the most dominant computing workloads driving innovations in high-performance and low-power imaging systems. Parallel implementations of multimedia applications mostly focus on the use of parallel computers. Modern general-purpose processors, however, have added multimedia extensions (e.g., MMX, VIS, MAX, AltiVec) or subword parallel instructions to their instruction set architectures to improve multimedia performance. This paper quantitatively evaluates the impact of multimedia extensions on multiprocessor systems to exploit subword level parallelism (SLP) in addition to data level parallelism (DLP). Experimental results for a set of multimedia applications on a representative multiprocessor array show that MMX (a representative Intel multimedia extension) achieves an average speedup ranging from 3x to 5x over the same baseline multiprocessor array. MMX also outperforms the baseline in both area efficiency (a 13% increase) and energy consumption (a 73% decrease), resulting in better component utilization and sustainable battery life. These results demonstrate that MMX is a suitable candidate for mobile multimedia computing systems.
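Subword parallelism can be illustrated in software with a small sketch. It emulates, in plain Python integers, a wraparound add over eight 8-bit lanes packed into one 64-bit word, which is the kind of operation the MMX `PADDB` instruction performs in hardware. The helper names `pack`, `unpack`, and `packed_add_u8` are hypothetical, and the code is an illustration of the idea rather than anything evaluated in the paper.

```python
HIGH = 0x8080808080808080  # top bit of each 8-bit lane
LOW = 0x7F7F7F7F7F7F7F7F   # low seven bits of each lane


def pack(lanes):
    """Pack eight values in 0..255 into one 64-bit word, lane 0 lowest."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_u8(a, b):
    """Add eight 8-bit lanes at once, each lane wrapping modulo 256.

    The low seven bits are added with an ordinary addition; a carry can
    reach bit 7 of its own lane but never the neighbouring lane. The top
    bit of each lane is then fixed up with XOR, which adds without carry.
    """
    return ((a & LOW) + (b & LOW)) ^ ((a ^ b) & HIGH)


a = pack([250, 1, 2, 3, 4, 5, 6, 255])
b = pack([10, 1, 1, 1, 1, 1, 1, 1])
print(unpack(packed_add_u8(a, b)))
```

One word-wide operation replaces eight scalar additions, which is the source of the speedup that subword instructions offer on pixel-like data.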
ISBN: 9783540784746 (digital), 9783540784722 (print)
STL dictionaries like map and set are commonly used in C++ programs. We consider parallelizing two of their bulk operations, namely the construction from many elements, and the insertion of many elements at a time. Practical algorithms are proposed for these tasks. The implementation is completely generic and engineered to provide the best performance for the variety of possible input characteristics. It features transparent integration into the STL. This lets programs profit easily from multi-core processing power. The performance measurements show the practical usefulness on real-world multi-core machines with up to eight cores.
ISBN: 9783540928584 (print)
This work proposes a load-balancing algorithm for parallel processing based on a variation of the classical knapsack problem. The problem considers the distribution of a set of partitions, defined by the number of clusters, over a set of processors, attempting to achieve a minimal overall processing cost. The work is an optimization of the parallel fuzzy c-means (FCM) clustering analysis algorithm proposed in a previous work. It is composed of two distinct parts: the cluster analysis proper, which uses the FCM algorithm to calculate cluster centers and the PBM index to evaluate partitions, and the load balance, which is modeled by the multiple knapsack problem and implemented through a heuristic that incorporates the restrictions related to cluster analysis in order to make the parallel process more efficient.
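The abstract does not specify the heuristic, so the following is only a stand-in sketch of the general idea of spreading partitions over processors. It uses the well-known greedy "largest cost first" rule rather than the authors' multiple-knapsack heuristic, and it assumes, hypothetically, that the cost of evaluating a partition is proportional to its number of clusters. The function name `assign_partitions` is illustrative.

```python
import heapq


def assign_partitions(costs, n_procs):
    """Greedy load balancing: give the next most expensive partition
    to the processor that currently has the smallest load.

    costs[i] -- assumed processing cost of partition i (hypothetical model)
    Returns (assignment, loads), where assignment[p] lists the partition
    indices given to processor p.
    """
    heap = [(0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_procs)]
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, p = heapq.heappop(heap)        # least loaded processor
        assignment[p].append(idx)
        heapq.heappush(heap, (load + costs[idx], p))
    loads = [sum(costs[i] for i in part) for part in assignment]
    return assignment, loads


# Partitions with 2..9 clusters, cost assumed equal to the cluster count.
assignment, loads = assign_partitions([2, 3, 4, 5, 6, 7, 8, 9], 3)
print(assignment, loads)
```

The overall processing time is governed by the most loaded processor, so the quantity a balancing heuristic tries to keep small is `max(loads)`.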
ISBN: 9783540693833 (print)
We present a performance analysis of a scalable parallel data clustering algorithm with deterministic annealing for multicore systems that compares MPI and a new C# messaging runtime library CCR (Concurrency and Coordination Runtime) on Windows and Linux, using both threads and processes. We investigate the effects of memory bandwidth and fluctuations in the run times of loosely synchronized threads. We give results on message latency and bandwidth for two-processor multicore systems based on AMD and Intel architectures with a total of four and eight cores. We compare our C# results with C using MPICH2 and Nemesis and Java with both mpiJava and MPJ Express. We show initial speedup results from Geographical Information Systems and Cheminformatics clustering problems. We abstract the key features of the algorithm and multicore systems that lead to the observed scalable parallel performance.
ISBN: 9783540928584 (print)
Originally developed by the Sony-Toshiba-IBM consortium for the Playstation 3 game console, the Cell Broadband Engine processor has been increasingly used in a much wider range of applications like HDTV sets and multimedia devices. Conforming to the new Cell Broadband Engine Architecture that extends the PowerPC architecture, this processor can deliver high computational power by embedding nine cores in a single chip: one general-purpose PowerPC core and eight vector cores optimized for compute-intensive tasks. The processor's performance is enhanced by single-instruction-multiple-data (SIMD) instructions that allow up to four floating-point operations to be executed in one clock cycle. This multi-level parallel environment is highly suited to applications processing data streams: encryption/decryption, multimedia, image and signal processing, among others. This paper discusses the use of the Cell BE to solve engineering problems and the practical aspects of implementing numerical method codes on this new architecture. To demonstrate the Cell BE programming techniques and the efficient porting of existing scalar algorithms to run on a multi-level parallel processor, the authors present the techniques applied to a well-known program for the solution of two-dimensional elastostatic problems with the Boundary Element Method. The programming guidelines provided here may also be extended to other numerical methods. Numerical experiments show the effectiveness of the proposed approach.
In this paper we propose an architecture design methodology to optimize the throughput of MD4-based hash algorithms. The proposed methodology includes an iteration bound analysis of hash algorithms, which gives the theoretical delay limit, and Data Flow Graph transformations to achieve the iteration bound. We applied the methodology to some MD4-based hash algorithms such as SHA1, MD5 and RIPEMD-160. Since SHA1 is the algorithm which requires all the techniques we show, we also synthesized the transformed SHA1 algorithm in a 0.18 µm CMOS technology in order to verify its correctness and its achievement of high throughput. To the best of our knowledge, the proposed SHA1 architecture is the first to achieve the theoretical throughput optimum, beating all previously published results. Though we demonstrate a limited number of examples, this design methodology can be applied to any other MD4-based hash algorithm.
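For readers unfamiliar with the term, the iteration bound of a data flow graph is commonly defined as the maximum, over all loops, of the loop's total computation time divided by the number of delay elements (registers) in that loop. The sketch below applies that textbook definition to a hand-written list of loops. The loop values are made up for illustration and do not describe SHA1 or any other algorithm from the paper.

```python
from fractions import Fraction


def iteration_bound(loops):
    """Return the iteration bound of a data flow graph.

    loops -- iterable of (computation_time, delay_count) pairs, one per
             loop of the graph; delay_count must be at least 1.
    The bound is max(computation_time / delay_count) over all loops.
    """
    return max(Fraction(t, d) for t, d in loops)


# Three hypothetical loops: 4 time units over 2 delays, 5 over 1, 9 over 3.
print(iteration_bound([(4, 2), (5, 1), (9, 3)]))
```

No retiming or unfolding can make one iteration faster than this value, which is why it serves as the theoretical delay limit that graph transformations aim to reach.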
ISBN: 9783540928584 (print)
The use of scientific computing centers becomes more and more difficult on modern parallel architectures. Users must face a large variety of batch systems (with their own specific syntax) and have to set many parameters to tune their applications (e.g., processors and/or threads mapping, memory resource constraints). Moreover, finding the optimal performance is not the only criterion when a pool of jobs is submitted on the Grid (for numerical parametric analysis, for instance), and one must focus on the wall-time completion. In this work we tackle the problem by using the DIET Grid middleware that integrates an adaptable PASTIX service to solve a set of experiments arising from the simulations of the ASTER project.
This paper presents a novel feature-aware rendering system that automatically abstracts videos and images with the goal of improving the effectiveness of imagery for visual communication tasks. We integrate the bilateral grid to simplify regions of low contrast, which is faster than the separable approximation to the bilateral filter, and use a feature flow-guided anisotropic edge detection filter to enhance regions of high contrast. The edges detected in this paper are smoother, more coherent and more stylistic than those of the isotropic difference-of-Gaussian filter. The presented algorithms are highly parallel, allowing real-time performance on modern GPUs. The implementation of our approach is straightforward. Several experimental examples are given at the end of the paper to demonstrate the effectiveness of our approach.
ISBN: 9783540786092 (print)
As one of the most widely used bio-sequence searching tools, BLAST adopts an index-based approach to detect the matches between two substrings by looking up a large table and processing one match per query. In this paper, we propose a systolic array approach to detect string matches without using lookup tables. The pipelined systolic array is implemented as a multi-seed detection and parallel extension pipeline engine to accelerate the first two stages of the NCBI BLAST algorithm. Unlike the index-based approach, our implementation consumes few memory resources and eliminates redundant string extensions by merging multiple adjoining seeds into a valid seed. Our FPGA implementation achieves superior results in both processing element count and clock frequency over related work in the area of FPGA BLAST accelerators. The experimental results also show that the speedup can reach about 17 and 48 compared to the NCBI BLASTp and TBLASTn programs, respectively, for 3072-residue queries on an Intel P4 CPU. Furthermore, the idea of multi-seed detection can also be adopted in other seed-based heuristic searching applications.
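The idea of detecting seeds without a lookup table and merging adjoining seeds can be sketched in software. The code below is a sequential illustration under simplifying assumptions, not the paper's systolic array: it uses exact character matches only (BLAST's scored neighborhood words are not modeled), and the function name `find_seeds` and the word length parameter `w` are hypothetical. It walks each diagonal of the query/subject comparison and reports every maximal run of matches of length at least `w`, so overlapping or adjoining word hits on the same diagonal come out as a single merged seed.

```python
def find_seeds(query, subject, w):
    """Return merged exact-match seeds as (query_start, subject_start, length).

    A seed is a maximal run of identical characters along one diagonal
    (constant subject_index - query_index) whose length is at least w.
    """
    seeds = []
    n, m = len(query), len(subject)
    for d in range(-(n - 1), m):          # d = subject index - query index
        i = max(0, -d)                    # first query position on this diagonal
        j = i + d
        run = 0                           # length of the current match run
        while i < n and j < m:
            if query[i] == subject[j]:
                run += 1
            else:
                if run >= w:
                    seeds.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= w:                      # run that reaches the end of the diagonal
            seeds.append((i - run, j - run, run))
    return seeds


print(find_seeds("ACGTAC", "TTACGTACGG", 3))
```

Reporting one merged seed per run, instead of one hit per word, is what avoids extending the same region several times in the later stages.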