ISBN: 9781467358057 (print)
Parallel processors such as graphics processing units (GPUs) have emerged as co-processing units for central processing units (CPUs) to accelerate different applications. Open Computing Language (OpenCL) is a framework for multiprocessing on heterogeneous platforms. In this paper we focus on motion estimation, which is the most time-consuming task in video coding. We study two motion estimation algorithms in terms of parallel execution. We implemented the full search algorithm and the hierarchical search algorithm with OpenCL and with C code. Our measurements show that the OpenCL-based implementations of the algorithms on the GPU can achieve a nearly 10-fold speedup compared to the corresponding C implementation on a single CPU.
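As an illustration of the full search algorithm named in the abstract above, the following is a minimal sequential sketch in plain Python. It is not the paper's OpenCL or C code. The function names `sad` and `full_search`, the sum-of-absolute-differences cost, and the list-of-lists frame layout are assumptions made for this example. Each candidate displacement is scored independently of the others, which is the property that makes exhaustive search a natural fit for data-parallel hardware.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Score every displacement within +/- search_range and return
    (best_dx, best_dy, best_cost). Frames are lists of rows (assumption)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose block would leave the reference frame.
            if bx + dx < 0 or bx + dx + n > width:
                continue
            if by + dy < 0 or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            # Strict comparison keeps the first minimum in scan order.
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

In a GPU version, the two displacement loops (and the loop over blocks of the frame) would typically be mapped to parallel work-items. How the paper does this mapping is not stated in the abstract.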
Author:
Lenke, M.
LRR-TUM, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München, 80290 München, Germany
Typical applications of the so-called Grand Challenges need massively parallel computer system architectures. Tools like parallel debuggers, performance analysers and visualizers help the code designer to develop efficient parallel algorithms. Such tools merely support the development cycle. But technical and scientific engineers who make use of parallel high-performance computing applications, e.g. numerical simulation algorithms in computational fluid dynamics (CFD), must be supported in their engineering work by another kind of tool. A tool for the application cycle is required because old, conventional suggestions regarding the arrangement of the application cycle rely on strictly sequential procedures. They are the heritage of traditional work on former vector computers. That formative influence is still felt in today's arrangements for the application cycle, prevents more efficient engineering work and, therefore, must be overcome. New tool concepts have to be introduced to enable on-line interaction between the technical and scientific engineers and their running parallel simulation. VIPER stands for VIsualization of Parallel numerical simulation algorithms for Extended Research and offers physical parameters of the mathematical model and parameters of the numerical method as objects of a graphical user tool interface for online observation and online modification. A special client-server-client process architecture implementation enables technical and scientific engineers sitting at their graphic workstations to interact with their parallel simulation algorithms running on a remote parallel computer system. The VIPER prototype is applied to ParNsflex, a parallel Navier-Stokes solver for real-world aerodynamic problems. A Paragon XP/S was selected as the test parallel computer system. A first evaluation indicates the superiority of the VIPER concept over conventional procedures. Copyright (C) 1996 Published by Elsevier Science L
ISBN: 9783319499567; 9783319499550 (print)
Relational database management systems (RDBMS) are still widely required by numerous business applications. Boosting performance without compromising functionality represents a big challenge. To achieve this goal, we propose to boost an existing RDBMS by enabling it to use hardware architectures with high memory bandwidth, such as GPUs. In this paper we present a solution named CuDB. We compare the performance and energy efficiency of our approach across different GPU ranges. We focus on the technical specificities of GPUs that are most relevant for designing highly energy-efficient solutions for database processing.
ISBN: 9781467308595 (print)
This paper presents a parallel architecture for analog-to-information converters. The converter's architecture shifts most of the computational burden to the digital domain to take full advantage of deep-submicron IC fabrication technologies. The analog part of the converter is composed of a single multibit first-order Delta-Sigma modulator. The proposed architecture is validated through numerical simulations. The results show that signal recovery improves significantly when 2-bit and 3-bit Delta-Sigma modulators are employed.
ISBN: 0769522297 (print)
By converting thread-level parallelism to instruction-level parallelism, Simultaneous Multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached, as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread's expected resource utilization but do not take into account variance in thread behavior. As such, we introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grain memory system activity information. The proposed memory monitoring framework centers on the concept of a cache activity vector, which exposes runtime cache resource information to the operating system to improve job scheduling. Using this scheduling technique, we experimentally evaluate the overall performance improvement of workloads on an SMT machine compared against the most recent Linux job scheduler. This work is first motivated with experiments in a simulated environment, then validated on a Hyperthreading-enabled Intel Pentium-4 Xeon microprocessor running a modified version of the latest Linux kernel.
ISBN: 9783030389611; 9783030389604 (print)
Sliding window sums are widely used for string indexing, hashing, time series analysis and machine learning. New vector algorithms which utilize the advanced vector extension (AVX) instructions available on modern processors, or the parallel compute units on GPUs and FPGAs, would provide a significant performance boost. We develop a generic vectorized sliding sum algorithm whose speedup, for window size w and P processors, is O(P/w). For a sum with a commutative operator, the speedup improves to O(P/log(w)). Implementing the algorithm for the bioinformatics application of minimizer-based k-mer table generation using AVX instructions, we obtain a speedup of over 5x.
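The abstract above does not describe how its logarithmic bound is reached. One well-known way to compute all sliding reductions in O(log w) elementwise passes is doubling: build windows of width 1, 2, 4, ... by combining each array with a shifted copy of itself, then assemble width w from its binary digits. The sketch below illustrates that general idea only and is not the paper's algorithm. The function name `sliding_reduce` is hypothetical, and Python list comprehensions stand in for the vector operations that AVX or a GPU would execute in one step.

```python
import operator


def sliding_reduce(xs, w, op=operator.add):
    """Return [reduce(op, xs[i:i+w]) for each full window] using
    O(log w) elementwise passes. `op` must be associative."""
    n = len(xs)
    if w < 1 or w > n:
        raise ValueError("window size must satisfy 1 <= w <= len(xs)")
    power = list(xs)   # power[i] reduces xs[i:i+size]
    size = 1
    result = None      # result[i] reduces xs[i:i+covered]
    covered = 0
    remaining = w
    while remaining:
        if remaining & 1:
            if result is None:
                result = power[:]
            else:
                # Extend each partial window on the right by `size` items.
                m = len(power) - covered
                result = [op(result[i], power[i + covered]) for i in range(m)]
            covered += size
        remaining >>= 1
        if remaining:
            # Doubling step: one elementwise pass over a shifted copy.
            power = [op(power[i], power[i + size])
                     for i in range(len(power) - size)]
            size *= 2
    return result


print(sliding_reduce([1, 2, 3, 4, 5, 6, 7], 3))  # [6, 9, 12, 15, 18]
```

Because operands are always combined left to right, this sketch needs only associativity. Whether and how the paper exploits commutativity is not something the abstract specifies.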
ISBN: 9783642131196 (digital)
ISBN: 9783642131189 (print)
It is our great pleasure to welcome you to the proceedings of the 10th annual event of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP). ICA3PP is recognized as the main regular event covering the many dimensions of parallel algorithms and architectures, encompassing fundamental theoretical approaches, practical experimental projects, and commercial components and systems. As applications of computing systems have permeated every aspect of daily life, the power of computing systems has become increasingly critical. Therefore, ICA3PP 2010 aimed to permit researchers and practitioners from industry to exchange information regarding advancements in the state of the art and practice of IT-driven services and applications, as well as to identify emerging research topics and define the future directions of parallel processing. We received a total of 157 submissions this year, showing by both quantity and quality that ICA3PP is a premier conference on parallel processing. In the first stage, all papers submitted were screened for their relevance and general submission requirements. These manuscripts then underwent a rigorous peer-review process with at least three reviewers per paper. In the end, 47 papers were accepted for presentation and included in the main proceedings, comprising a 30% acceptance rate.
ISBN: 0818677554 (print)
We propose a new approach to parallelizing fault simulation in which the test set is partitioned among the available processors. The approach can be used with any of the commonly used sequential circuit fault simulation algorithms, and it can be implemented on a variety of parallel architectures. This approach overcomes, for the first time, the limitations of serial logic simulation. In addition, the excessive redundant computations required in the traditional fault-partitioning approach are considerably reduced. Significant improvements in speedup were observed compared to previous approaches. An average speedup of 5.7 was obtained for test set partitioning over 10 processors for the benchmark circuits studied. Although pessimistic fault coverage may be reported in some cases, the proposed approach was found to be very accurate for the circuits studied.
ISBN: 0769526365 (print)
This paper discusses fast parallel algorithms for evaluating several centrality indices frequently used in complex network analysis. These algorithms have been optimized to exploit properties typically observed in real-world large-scale networks, such as low average distance, high local density, and heavy-tailed power-law degree distributions. We test our implementations on real datasets such as the web graph, protein-interaction networks, and movie-actor and citation networks, and report impressive parallel performance for evaluation of the computationally intensive centrality metrics (betweenness and closeness centrality) on high-end shared memory symmetric multiprocessor and multithreaded architectures. To our knowledge, these are the first parallel implementations of these widely used social network analysis metrics. We demonstrate that it is possible to rigorously analyze networks three orders of magnitude larger than instances that can be handled by existing social network analysis (SNA) software packages. For instance, we compute the exact betweenness centrality value for each vertex in a large US patent citation network (3 million patents, 16 million citations) in 42 minutes on 16 processors, utilizing 20 GB of RAM on the IBM p5 570. Current SNA packages, on the other hand, cannot handle graphs with more than a hundred thousand edges.
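To make the betweenness centrality metric in the abstract above concrete, here is a minimal sequential sketch of Brandes' algorithm for unweighted graphs. Brandes' algorithm is a standard method and may differ from what the paper implements. The function name `betweenness` and the adjacency-dictionary input format are assumptions for this example. Scores count ordered source-target pairs, so for an undirected graph each value would normally be halved.

```python
from collections import deque


def betweenness(adj):
    """Exact betweenness centrality for an unweighted graph given as
    {vertex: [neighbours]}. Counts ordered (source, target) pairs."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))  # {'a': 0.0, 'b': 2.0, 'c': 0.0}
```

The outer loop over sources `s` performs independent traversals, so one common coarse-grained parallelization assigns different sources to different processors and sums the per-source contributions. The abstract does not say which decomposition the authors use.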