ISBN: 9781467358057 (print)
Parallel processors such as graphics processing units (GPUs) have emerged as co-processors for central processing units (CPUs) to accelerate a range of applications. Open Computing Language (OpenCL) is a framework for multiprocessing on heterogeneous platforms. In this paper we focus on motion estimation, which is the most time-consuming task in video coding. We study two motion estimation algorithms in terms of parallel execution. We implemented the full search algorithm and the hierarchical search algorithm both in OpenCL and in C. Our measurements show that the OpenCL-based implementations of the algorithms on the GPU can achieve a nearly 10-fold speedup over the corresponding C implementations on a single CPU.
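As a rough illustration of the full search algorithm named in this abstract, the sketch below exhaustively tests every displacement in a search window and keeps the one with the lowest sum of absolute differences (SAD). It is not the authors' OpenCL or C code: the function names, the SAD cost, and the frame layout (lists of rows of integer pixel values) are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Return (dx, dy, cost) of the best match within +/- search_range.

    Hypothetical helper: every candidate displacement is evaluated, which is
    what makes full search expensive and also easy to run in parallel.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose block would fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Each candidate displacement is scored independently of the others, so on a GPU the two outer loops could plausibly be mapped to work-items, with a reduction to pick the minimum cost.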
ISBN: 9781479933433 (print)
Data sorting is used in many fields and plays an important role in determining overall speed and performance. There are many categories of sorting algorithms. This study deals with two of them, bitonic sort and radix sort. We have designed and developed radix sort and bitonic sort algorithms for many-core graphics processing units (GPUs). Bitonic sort is a concurrent sorting algorithm and radix sort is a distribution sorting algorithm; that is, neither is a conventional sorting algorithm. They can be parallelized easily on GPUs to obtain better performance than other sorting algorithms. We parallelized these sorting algorithms on many-core GPUs using the Compute Unified Device Architecture (CUDA) platform, developed by NVIDIA Corporation, and obtained performance measurements.
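To show why bitonic sort parallelizes easily, here is a minimal sequential sketch of the bitonic sorting network in Python. It is not the CUDA implementation described in the abstract; the function name and the power-of-two length restriction are assumptions made for the example.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.

    Sequential sketch: for fixed k and j, all iterations of the inner
    `for i` loop touch disjoint pairs and could run as parallel threads.
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The sequence of comparisons depends only on the input length, not on the data, which is the property that lets each (k, j) stage be issued as one batch of independent compare-exchange operations.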
ISBN: 9783642382352 (print)
In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n) (where n is the size of the input) and that if in practice the actual number was smaller, this could be resolved using Brent's Lemma to simulate the highly parallel solution on a lower-degree parallel architecture. In this paper, however, we argue that design and implementation issues of algorithms and architectures are significantly different, both in theory and in practice, between computational models with high and low degrees of parallelism. We report an observed gap in the behavior of a parallel architecture depending on the number of processors. This gap appears repeatedly, both in empirical cases, when studying practical aspects of architecture design and program implementation, and in theoretical instances, when studying the behavior of various parallel algorithms. It separates the performance, design and analysis of systems with a sub-linear number of processors from those of systems with linearly many processors. More specifically, we observe that systems with either logarithmically many cores or with O(n^α) cores (with α
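For reference, the bound usually meant by Brent's Lemma can be written as follows. The symbols are assumptions introduced here, not notation from the abstract: W is the total work (number of operations), D is the depth (longest chain of dependent operations), p is the number of processors actually available, and T_p is the running time on p processors.

```latex
% General form: a computation with work W and depth D runs on p processors in
\[
  T_p \;\le\; \frac{W}{p} + D .
\]
% Special case: an algorithm taking T steps on n processors can be simulated
% on p < n processors by executing each step in \lceil n/p \rceil rounds:
\[
  T_p \;\le\; \left\lceil \frac{n}{p} \right\rceil T .
\]
```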
ISBN: 9783642400476 (print)
For a long period in the development of computers and computing, efficient applications were characterized only by computational and memory complexity or, in more practical terms, by elapsed computing time and required main memory capacity. The history of Euro-Par and its predecessor organizations stands for research on the development of ever more powerful computer architectures that shorten the compute time both by faster clocking and by parallel execution, as well as the development of algorithms that can exploit these parallel architectural features. The success of enhancing architectures and algorithms is best described by exponential curves for the peak computing power of architectures and the efficiency of algorithms. As microprocessor parts get more and more power-hungry and electricity gets more and more expensive, "energy to solution" is a new optimization criterion for large applications. This calls for energy-aware solutions.
ISBN: 9783642330643 (print)
The proceedings contain 73 papers. The topics discussed include: accelerating the dynamic programming for the optimal polygon triangulation on the GPU; security computing for the resiliency of protecting from internal attacks in distributed wireless sensor networks; optimization of a short-range proximity effect correction algorithm in e-beam lithography using GPGPUs; vectorized algorithms for Quadtree construction and descent; an optimal parallel prefix-sums algorithm on the memory machine models for GPUs; enhancing the performance of a distributed mobile computing environment by topology construction; maintaining consistency in software transactional memory through dynamic versioning tuning; a new low latency parallel turbo decoder employing parallel phase decoding method; high-performance matrix multiply on a massively multithreaded Fiteng1000 processor; and on construction of Cloud IaaS for VM live migration using KVM and OpenNebula.
ISBN: 9781479948970 (print)
This paper presents the implementation of ray-tracing-based algorithms for multi-objective geospatial optimization targeting various many-core processing technologies such as graphics processing units, x86 multi-cores, and ARM processors. High performance is achieved through highly parallel core algorithms, executed on multiple compute devices across a heterogeneous architecture using low-level OpenCL kernels. Algorithms for calculating line-of-sight ballistic threat, visual observability, ground plane extraction, and Markov chain Monte Carlo optimization provide augmented geospatial intelligence and situational awareness in three-dimensional urban environments.
ISBN: 9783642330773 (print)
The proceedings contain 73 papers. The topics discussed include: accelerating the dynamic programming for the optimal polygon triangulation on the GPU; security computing for the resiliency of protecting from internal attacks in distributed wireless sensor networks; optimization of a short-range proximity effect correction algorithm in e-beam lithography using GPGPUs; vectorized algorithms for Quadtree construction and descent; an optimal parallel prefix-sums algorithm on the memory machine models for GPUs; enhancing the performance of a distributed mobile computing environment by topology construction; maintaining consistency in software transactional memory through dynamic versioning tuning; a new low latency parallel turbo decoder employing parallel phase decoding method; high-performance matrix multiply on a massively multithreaded Fiteng1000 processor; and on construction of Cloud IaaS for VM live migration using KVM and OpenNebula.
ISBN: 9781479907298 (print)
For high-performance computation, memory access is a major issue. Whether it is a supercomputer, a GPGPU device, or an Application Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that, from a computational point of view, can be implemented efficiently on parallel architectures fail to achieve significant speed-ups. The reason is very often that the available execution units are poorly utilized due to inefficient data access. This paper shows a method for improving the access time for sequences of data that are completely static, at the cost of extra memory. This is done by resolving memory conflicts using padding. The method can be applied automatically, and it is shown to significantly reduce the data access time for sorting and FFTs. The execution time for the FFT is improved by up to a factor of 3.4 and for sorting by a factor of up to 8.
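The general idea of resolving memory conflicts with padding can be illustrated with a small bank-conflict model. This is a generic row-padding example, not the automatic method the paper proposes; the interleaved bank mapping (address modulo number of banks) and both helper names are assumptions made for the sketch.

```python
def max_bank_conflict(addresses, num_banks):
    """Largest number of simultaneous accesses that map to a single bank.

    Assumes word-interleaved banks: address a lives in bank a % num_banks.
    A result of 1 means the accesses are conflict-free.
    """
    counts = [0] * num_banks
    for addr in addresses:
        counts[addr % num_banks] += 1
    return max(counts)


def column_addresses(column, rows, row_stride):
    """Linear addresses of one column of a row-major 2D array."""
    return [r * row_stride + column for r in range(rows)]


# An 8 x 8 array on 8 banks: reading a column hits one bank 8 times.
unpadded = max_bank_conflict(column_addresses(0, 8, 8), 8)
# Padding each row by one unused word spreads the column over all banks.
padded = max_bank_conflict(column_addresses(0, 8, 9), 8)
```

In this model the padded layout spends 8 extra words of memory to make the column access conflict-free, which mirrors the trade-off of access time against extra memory described in the abstract.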
ISBN: 9781479924813 (print)
Fractal organizations are a class of bio-inspired distributed hierarchical architectures in which control and feedback information are allowed to flow independently of the position the participating nodes have in the system hierarchy. In this paper we discuss the adoption of a fractal organization in a class of socio-technical systems characterized by a centralized architecture. We present the key architectural traits of the resulting Fractal Social Organization and put forward our conjecture that services based on the presented solution may exhibit significant improvements, e.g., in terms of scalability and performance. In order to provide elements to justify our conjecture we describe how we envision the use of the new organization in two different cases: a framework for semantic service description-and-matching and a low-cost telemonitoring service.
ISBN: 9780769549392; 9781467353212 (print)
Decision Support System (DSS) workloads are known to be among the most time-consuming database workloads that process large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we exploit the benefits of future many-core architectures, more specifically on-chip clustered many-core architectures. To achieve this goal we propose different representative data-parallel versions of the original database scan and join algorithms. We also study the impact on performance when on-chip memory, shared among all cores, is used as a prefetching buffer. For our experiments we study the behavior of three queries from the standard DSS benchmark TPC-H executing on the Intel Single-chip Cloud Computer experimental processor (Intel SCC). Our results show that parallelism can be well exploited by such architectures and how important it is to balance computation and data intensity. Moreover, our experimental results show performance improvements of 5x and 10x over the corresponding query implementations without data prefetching. Finally, we show how the system can be used efficiently to achieve high power-performance efficiency with the proposed prefetching buffer.
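A data-parallel scan of the kind mentioned in this abstract can be sketched by splitting the table into contiguous chunks and filtering each chunk independently. This is only an illustration of the partitioning pattern, not the authors' SCC implementation; the function name, the row representation, and the use of a thread pool are assumptions, and in CPython threads will not speed up a CPU-bound predicate.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_scan(rows, predicate, workers=4):
    """Return the rows that satisfy `predicate`, preserving input order.

    Hypothetical helper: the table is cut into at most `workers` contiguous
    chunks, each chunk is filtered independently, and the partial results
    are concatenated in chunk order.
    """
    if not rows:
        return []
    chunk = (len(rows) + workers - 1) // workers  # ceiling division
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]

    def scan_part(part):
        return [row for row in part if predicate(row)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as `parts`.
        results = list(pool.map(scan_part, parts))
    return [row for part in results for row in part]
```

Because the chunks share no state, the only coordination needed is the final concatenation, which is what makes the scan a natural fit for many-core execution.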