ISBN (print): 9781538620748
Modern high-performance computing and cloud computing infrastructures often leverage Graphics Processing Units (GPUs) to provide accelerated, massively parallel computational power. This performance gain, however, may also introduce higher energy consumption, and the energy challenge becomes more pronounced as the system scales. To address this challenge, we propose Archon, a framework for supporting energy-efficient computing on CPU-GPU heterogeneous architectures. Specifically, Archon takes users' programs as input, automatically distributes the workload between the CPU and the GPU, and dynamically tunes the distribution ratio at runtime for energy-efficient execution. Experiments carried out to evaluate the effectiveness of Archon show that it achieves considerable energy savings at runtime without significant effort from programmers.
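The abstract does not describe Archon's tuning policy, so the sketch below only illustrates, under assumptions, what a runtime-tuned CPU/GPU split ratio could look like; run_on_cpu, run_on_gpu, and measure_energy are hypothetical placeholders, not Archon APIs.

```python
# Hypothetical sketch of a runtime-tuned CPU/GPU workload split, in the spirit of
# the dynamic distribution ratio described above. run_on_cpu, run_on_gpu, and
# measure_energy are placeholder callables, not part of Archon.
import concurrent.futures

def process_chunks(chunks, ratio, run_on_cpu, run_on_gpu):
    """Run the first `ratio` fraction of chunks on the GPU, the rest on the CPU."""
    split = int(len(chunks) * ratio)
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        gpu_job = pool.submit(run_on_gpu, chunks[:split])
        cpu_job = pool.submit(run_on_cpu, chunks[split:])
        return gpu_job.result() + cpu_job.result()   # both assumed to return lists

def tune_ratio(batches, run_on_cpu, run_on_gpu, measure_energy,
               ratio=0.5, step=0.05):
    """Naive greedy feedback loop (not Archon's policy): nudge the GPU share
    toward lower measured energy, back off otherwise."""
    best_energy = float("inf")
    for batch in batches:
        energy, _ = measure_energy(lambda: process_chunks(batch, ratio,
                                                          run_on_cpu, run_on_gpu))
        if energy < best_energy:
            best_energy, ratio = energy, min(1.0, ratio + step)
        else:
            ratio = max(0.0, ratio - step)
    return ratio
```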
ISBN (print): 9781457711800
Spherical harmonics serve as basis functions on the unit sphere, and the spherical harmonic transform is required for the analysis and processing of signals in the spectral domain. We investigate the possibility of parallel computation of the spherical harmonic transform using the Compute Unified Device Architecture (CUDA) with no communication between parallel kernels. We identify the parallel components in the widely used spherical harmonic transform method proposed by Driscoll and Healy. We provide the implementation details and compare the computational complexity with the sequential algorithm. For a given bandlimited signal with maximum spherical harmonic degree L, using O(L) parallel processing kernels, we show that the spherical harmonic coefficients can be calculated in O(L log² L) time, compared to O(L² log² L) for the sequential algorithm. For corroboration, we provide simulation results using CUDA which indicate the reduction in computational complexity.
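A rough accounting consistent with the figures quoted above (assumed here, not taken from the paper's derivation): if the sequential work is spread evenly over O(L) independent kernels, each kernel is left with the stated per-kernel cost.

```latex
% Assumed back-of-the-envelope split of the sequential cost over O(L) kernels.
\[
  \underbrace{O\!\left(L^{2}\log^{2}L\right)}_{\text{sequential cost}}
  \;=\;
  O(L)\ \text{kernels}\;\times\;
  \underbrace{O\!\left(L\log^{2}L\right)}_{\text{work per kernel}}
  \;\;\Longrightarrow\;\;
  T_{\text{parallel}} = O\!\left(L\log^{2}L\right).
\]
```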
A key requirement for the effective use of multiprocessor systems in real-world applications is an ability to accurately predict the performance of a specific algorithm on a specific architecture. Such performance pre...
ISBN (print): 9781538610428
In parallel computing, a valid graph coloring yields lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, coloring is not free and the overhead can be significant. In particular, for the bipartite-graph partial coloring (BGPC) and distance-2 graph coloring (D2GC) problems, which have various use cases within the scientific computing and numerical optimization domains, the coloring overhead can be on the order of minutes with a single thread for many real-life graphs. In this work, we propose parallel algorithms for bipartite-graph partial coloring on shared-memory architectures. Compared to the existing shared-memory BGPC algorithms, the proposed ones employ greedier and more optimistic techniques that yield better parallel coloring performance. In particular, on 16 cores, the proposed algorithms are more than 4x faster than their counterparts in the ColPack library, which is, to the best of our knowledge, the only publicly available coloring library for multicore architectures. In addition to BGPC, the proposed techniques are employed to devise parallel distance-2 graph coloring algorithms, and similar performance improvements have been observed. Finally, we propose two costless balancing heuristics for BGPC that can reduce the skewness and imbalance in the cardinality of the color sets (almost) for free. The heuristics can also be used for the D2GC problem and, in general, are likely to yield better color-based parallelization performance, especially on many-core architectures.
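The abstract does not give the algorithmic details, so the following is only a sketch of the general speculative "color, detect conflicts, recolor" pattern that optimistic parallel coloring codes typically follow, written sequentially for clarity and shown for distance-2 coloring; it is not the paper's BGPC/D2GC algorithm, and all names below are placeholders.

```python
# Speculative coloring sketch: each round, worklist vertices pick colors against
# a snapshot (mimicking concurrent reads), then conflicting higher-indexed
# vertices are recolored in the next round.
def smallest_free_color(banned):
    c = 0
    while c in banned:
        c += 1
    return c

def distance2_neighbors(adj, v):
    """Vertices within two hops of v (their colors must be avoided)."""
    out = set()
    for u in adj[v]:
        out.add(u)
        out.update(adj[u])
    out.discard(v)
    return out

def speculative_d2_coloring(adj):
    color = {}
    worklist = sorted(adj)
    while worklist:
        snapshot = dict(color)  # colors visible at the start of the round
        for v in worklist:      # speculation phase (parallel in a real code)
            banned = {snapshot[u] for u in distance2_neighbors(adj, v)
                      if u in snapshot}
            color[v] = smallest_free_color(banned)
        # Conflict resolution: the higher-indexed vertex of each clash retries.
        worklist = [v for v in worklist
                    if any(u < v and color.get(u) == color[v]
                           for u in distance2_neighbors(adj, v))]
    return color

if __name__ == "__main__":
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path of four vertices
    print(speculative_d2_coloring(adj))            # e.g. {0: 0, 1: 1, 2: 2, 3: 0}
```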
ISBN (print): 078037889X
The Embedded Block Coding with Optimized Truncation (EBCOT) algorithm plays a basic and crucial part in the JPEG2000 still-image compression system. This paper proposes a VLSI architecture for EBCOT in which a Dynamic Memory Control (DMC) strategy is used to reduce the scale of the on-chip wavelet-coefficient storage by 60%. A parallel architecture is proposed to speed up the coding process. This architecture can be used as a compact and efficient IP core for JPEG2000 VLSI implementations and various real-time image and video applications.
Data mining tools may be computationally demanding, so there is increasing interest in parallel computing strategies to improve their performance. The popularization of Graphics Processing Units (GPUs) has increased the computing power of current desktop computers, but desktop-based data mining tools do not usually take full advantage of these architectures. This paper explores an approach to improve the performance of Weka, a popular data mining tool, through parallelization on GPU-accelerated machines. From the profiling of Weka's object-oriented code, we chose to parallelize a matrix multiplication method using state-of-the-art tools. The implementation was merged into Weka so that we could analyze the impact of parallel execution on its performance. The results show a significant speedup on the target parallel architectures compared to the original, sequential Weka code. (C) 2014 The Authors. Published by Elsevier B.V.
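The paper's changes live inside Weka's Java code; purely as a language-agnostic illustration of the "profile, then offload the matrix-multiplication hotspot" idea, here is a minimal Python sketch that dispatches the multiply to a GPU backend (CuPy, used here as an assumed stand-in) and falls back to the CPU otherwise.

```python
# Illustration of offloading a profiled hotspot (matrix multiplication) to a GPU,
# with a CPU fallback. CuPy is an assumption for the sketch, not Weka's tooling.
import numpy as np

try:
    import cupy as cp
    _HAS_GPU = True
except ImportError:
    _HAS_GPU = False

def matmul(a, b):
    """Matrix multiplication, dispatched to the GPU when one is available."""
    if _HAS_GPU:
        result = cp.matmul(cp.asarray(a), cp.asarray(b))
        return cp.asnumpy(result)   # copy back so callers keep seeing NumPy arrays
    return a @ b                    # CPU fallback

if __name__ == "__main__":
    a = np.random.rand(1024, 1024)
    b = np.random.rand(1024, 1024)
    c = matmul(a, b)
    print(c.shape, "GPU used:", _HAS_GPU)
```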
ISBN (print): 0852967918
This paper develops and evaluates a low-cost parallel computing platform for the implementation of parallel algorithms in Power Engineering applications. The proposed approach utilises an existing local area network without incurring any additional hardware costs. The application of computational intelligence techniques based on the developed computing platform to the economic dispatch problem is outlined. The performance of genetic algorithms in parallel and cluster structures, and their ability to cope with time-constrained applications, is also demonstrated. It is found that when the workload is large, a parallel computing structure should be exploited for cost-effectiveness.
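As a rough illustration of applying a genetic algorithm to economic dispatch with parallel fitness evaluation: the sketch below uses local multiprocessing rather than the paper's LAN cluster, and the fuel-cost coefficients, demand, and GA parameters are invented.

```python
# Toy parallel GA for economic dispatch (minimize fuel cost subject to demand).
# Coefficients and demand are made-up values; the paper's setup is not shown.
import random
from multiprocessing import Pool

COST_COEFFS = [(0.004, 5.3, 500.0), (0.006, 5.5, 400.0), (0.009, 5.8, 200.0)]
DEMAND = 800.0  # MW that the three generating units must jointly supply (assumed)

def fitness(schedule):
    """Total fuel cost plus a penalty for missing the demand (lower is better)."""
    cost = sum(a * p * p + b * p + c for (a, b, c), p in zip(COST_COEFFS, schedule))
    return cost + 1000.0 * abs(sum(schedule) - DEMAND)

def random_schedule():
    return [random.uniform(100.0, 400.0) for _ in COST_COEFFS]

def evolve(generations=50, pop_size=40):
    population = [random_schedule() for _ in range(pop_size)]
    with Pool() as pool:
        for _ in range(generations):
            scores = pool.map(fitness, population)          # parallel evaluation
            ranked = [s for _, s in sorted(zip(scores, population),
                                           key=lambda t: t[0])]
            parents = ranked[: pop_size // 2]
            children = []
            while len(parents) + len(children) < pop_size:  # crossover + mutation
                a, b = random.sample(parents, 2)
                children.append([(x + y) / 2 + random.gauss(0, 5)
                                 for x, y in zip(a, b)])
            population = parents + children
    return min(population, key=fitness)

if __name__ == "__main__":
    print(evolve())
```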
ISBN (print): 9781479989379
ADAS (Advanced Driver Assistance Systems) algorithms increasingly use heavy image processing operations. To embed this type of algorithm, semiconductor companies offer many heterogeneous architectures. These SoCs (Systems on Chip) are composed of different processing units with different capabilities, often including massively parallel computing units. Due to the complexity of these SoCs, predicting whether a given algorithm can be executed in real time on a given architecture is not trivial. In fact, it is not a simple task for automotive industry actors to choose the most suitable heterogeneous SoC for a given application. Moreover, embedding complex algorithms on these systems remains difficult due to their heterogeneity: it is not easy to decide how to allocate parts of a given algorithm to the different computing units of a given SoC. To help the automotive industry embed algorithms on heterogeneous architectures, we propose a novel approach to predict the performance of image processing algorithms on different types of computing units. Our methodology is able to predict an execution-time interval, of greater or lesser width, with a degree of confidence, using only a high-level description of the algorithm and a few characteristics of the computing units.
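The paper's prediction model is not described in the abstract; as a toy illustration of producing an execution-time interval from a high-level operation count and coarse compute-unit characteristics, here is a sketch in which every number (throughput, efficiency bounds, op counts) is invented.

```python
# Toy interval predictor: bound the per-frame time between an optimistic and a
# pessimistic fraction of a unit's peak throughput. All figures are assumptions.
from dataclasses import dataclass

@dataclass
class ComputeUnit:
    name: str
    peak_gops: float   # peak billions of operations per second (assumed)
    eff_low: float     # pessimistic fraction of peak actually achieved
    eff_high: float    # optimistic fraction of peak actually achieved

def predict_interval(ops_per_frame: float, unit: ComputeUnit):
    """Return (best_case_ms, worst_case_ms) for one frame on the given unit."""
    best = ops_per_frame / (unit.peak_gops * 1e9 * unit.eff_high) * 1e3
    worst = ops_per_frame / (unit.peak_gops * 1e9 * unit.eff_low) * 1e3
    return best, worst

if __name__ == "__main__":
    sobel_ops = 1280 * 720 * 18   # rough op count for a 720p Sobel filter (assumed)
    gpu = ComputeUnit("embedded GPU", peak_gops=300.0, eff_low=0.05, eff_high=0.3)
    lo, hi = predict_interval(sobel_ops, gpu)
    print(f"{gpu.name}: {lo:.2f}-{hi:.2f} ms per frame")
```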
ISBN (print): 9781538669792
Signal, image, and Synthetic Aperture Radar (SAR) imagery algorithms are nowadays used routinely. Due to huge data volumes and complexity, processing them in real time is often nearly impossible. Image processing algorithms are often inherently parallel in nature, so they fit nicely onto parallel architectures such as multicore Central Processing Units (CPUs) and Graphics Processing Units (GPUs). In this paper, image processing algorithms capable of executing in a parallel manner on several platforms (CPU and GPU) were evaluated. All algorithms were tested in TensorFlow, a novel framework intended for deep learning but also suitable for image processing. Relative speedups compared to the CPU are given for all algorithms. The TensorFlow GPU implementation can outperform multi-core CPUs for the tested algorithms, with speedups ranging from 3.6 to 15 times.
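The exact benchmark code is not given in the abstract; the sketch below only shows, under assumed operation and image sizes, how one might time a single TensorFlow image-processing op (a Sobel filter) on CPU and GPU and report the speedup.

```python
# Minimal CPU-vs-GPU timing sketch for one TensorFlow image-processing operation.
# The operation, image size, and repeat count are assumptions for illustration.
import time
import tensorflow as tf

def time_sobel(device: str, image: tf.Tensor, repeats: int = 10) -> float:
    with tf.device(device):
        tf.image.sobel_edges(image)            # warm-up (also triggers placement)
        start = time.perf_counter()
        for _ in range(repeats):
            edges = tf.image.sobel_edges(image)
        _ = edges.numpy()                      # force execution to finish
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    img = tf.random.uniform([1, 2048, 2048, 1])   # batch of one grayscale image
    cpu_t = time_sobel("/CPU:0", img)
    print(f"CPU: {cpu_t * 1e3:.1f} ms")
    if tf.config.list_physical_devices("GPU"):
        gpu_t = time_sobel("/GPU:0", img)
        print(f"GPU: {gpu_t * 1e3:.1f} ms, speedup {cpu_t / gpu_t:.1f}x")
```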
ISBN (print): 9781467375887
In-place data manipulation is very desirable in many-core architectures with limited on-board memory. This paper deals with the in-place implementation of a class of primitives that perform data movements in one direction. We call these primitives Data Sliding (DS) algorithms. Notable among them are relational algebra primitives (such as select and unique), padding to insert empty elements in a data structure, and stream compaction to reduce memory requirements. Their in-place implementation in a bulk synchronous parallel model, such as on GPUs, is especially challenging due to the difficulty of synchronizing threads executing on different compute units. Using a novel adjacent work-group synchronization technique, we propose two algorithmic schemes for regular and irregular DS algorithms. With a set of 5 benchmarks, we validate our approaches and compare them to state-of-the-art implementations of these benchmarks. Our regular DS algorithms achieve up to 9.11x and 73.25x the throughput of their competitors on NVIDIA and AMD GPUs, respectively. Our irregular DS algorithms outperform the NVIDIA Thrust library by up to 3.24x on the three most recent generations of NVIDIA GPUs.
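To make concrete what one of the named DS primitives computes, here is a sequential CPU analogue of in-place stream compaction; it is only an illustration of the data movement in one direction with O(1) extra space, not the paper's adjacent work-group GPU scheme.

```python
# Sequential analogue of in-place stream compaction: keep the elements that
# satisfy a predicate and slide them toward the front of the same buffer.
def compact_in_place(buf, keep):
    """Move elements with keep(x) True to buf[:count] and return count."""
    write = 0
    for read in range(len(buf)):
        if keep(buf[read]):
            buf[write] = buf[read]   # data only ever slides in one direction
            write += 1
    return write

if __name__ == "__main__":
    data = [3, 0, 7, 0, 0, 5, 1]
    n = compact_in_place(data, lambda x: x != 0)
    print(data[:n])   # [3, 7, 5, 1]
```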