ISBN (print): 9781467375894
Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation, the code that transforms the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Moreover, many published scheduling algorithms are never tested on a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools used to prepare scheduling papers. We propose Schedeval, a tool that integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud Computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools for studying static scheduling and for observing the behavior of streaming applications running on manycore architectures.
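Schedeval's internals are not given in the abstract; as a minimal sketch of the kind of check such a tool automates (the task costs, the core mapping, and the throughput constraint below are made-up inputs, not from the paper), one can compute the steady-state period implied by a static mapping of streaming tasks to cores and compare it with the required period:

```cpp
// Minimal sketch (not Schedeval itself): check whether a static mapping of
// streaming tasks to cores meets a throughput constraint. Task costs, core
// assignments, and the constraint are hypothetical inputs.
#include <cstdio>
#include <vector>
#include <algorithm>

struct Task { int core; double cost_ms; };  // static mapping + per-activation cost

// Steady-state period of a software-pipelined schedule: the most loaded core
// bounds how often the whole pipeline can produce an output.
double steady_state_period_ms(const std::vector<Task>& tasks, int num_cores) {
    std::vector<double> load(num_cores, 0.0);
    for (const Task& t : tasks) load[t.core] += t.cost_ms;
    return *std::max_element(load.begin(), load.end());
}

int main() {
    std::vector<Task> tasks = {{0, 2.0}, {0, 1.5}, {1, 3.0}, {2, 2.5}};  // toy instance
    double period = steady_state_period_ms(tasks, 3);
    double required_period = 4.0;  // throughput constraint: one output every 4 ms
    std::printf("period = %.2f ms -> %s\n", period,
                period <= required_period ? "constraint met" : "constraint violated");
    return 0;
}
```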
ISBN (print): 9781479975624
The proceedings contain 10 papers. The topics discussed include: scaling parallel 3-D FFT with non-blocking MPI collectives; exploiting data representation for fault tolerance; VCube: a provably scalable distributed diagnosis algorithm; TX: algorithmic energy saving for distributed dense matrix factorizations; CholeskyQR2: a simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system; deflation strategies to improve the convergence of communication-avoiding GMRES; a framework for parallel genetic algorithms for distributed memory architectures; the anatomy of Mr. Scan: a dissection of performance of an extreme scale GPU-based clustering algorithm; performance and portability with OpenCL for throughput-oriented HPC workloads across accelerators, coprocessors, and multicore processors; and a hierarchical tridiagonal system solver for heterogeneous supercomputers.
ISBN (print): 9781467395243
Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. This work presents an approach to transparently (without programmer intervention) allow fine-grained graph algorithms to benefit from algorithmic communication-reduction optimizations. In many graph algorithms, the same information is communicated by a vertex to its neighbors, a property we term algorithmic redundancy. Our approach exploits algorithmic redundancy to reduce communication between vertices located on different processing elements. We employ algorithm-aware coarsening of the messages sent during vertex visitation, reducing both the number of messages and the absolute amount of communication in the system. To achieve this, the system structure is represented by a hierarchical graph, facilitating communication optimizations that can take the machine's memory hierarchy into consideration. We also present an optimization for small-world, scale-free graphs wherein hub vertices (i.e., vertices of very large degree) are represented in a similarly hierarchical manner, which is exploited to increase parallelism and reduce communication. Finally, we present a framework that transparently allows fine-grained graph algorithms to utilize our hierarchical approach without programmer intervention, while improving scalability and performance. Experimental results of our proposed approach on more than 131,000 cores show improvements of up to a factor of 8 over the non-hierarchical version for various graph mining and graph analytics algorithms.
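As an illustration of the coarsening idea only (this is not the paper's framework; all types and values below are hypothetical), the host-side sketch groups the identical payload a vertex would send to each neighbor by destination processing element (PE), so one coarsened message per PE replaces one message per edge:

```cpp
// Minimal sketch of algorithm-aware message coarsening: a vertex that sends the
// same payload to many neighbors on the same remote PE emits one coarsened
// message per PE instead of one message per edge. All types are illustrative.
#include <cstdio>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct CoarsenedMsg {
    double payload = 0.0;            // the redundant value (e.g., a PageRank contribution)
    std::vector<uint64_t> targets;   // remote vertex ids that should receive it
};

int main() {
    // Edge list of one source vertex: (destination PE, destination vertex id).
    std::vector<std::pair<int, uint64_t>> neighbors =
        {{1, 10}, {1, 11}, {1, 12}, {2, 20}, {2, 21}};
    double value = 0.25;  // identical for every neighbor -> algorithmic redundancy

    std::unordered_map<int, CoarsenedMsg> outbox;  // one entry per destination PE
    for (auto& [pe, vid] : neighbors) {
        outbox[pe].payload = value;
        outbox[pe].targets.push_back(vid);
    }
    for (auto& [pe, msg] : outbox)
        std::printf("PE %d: 1 payload, %zu targets (instead of %zu full messages)\n",
                    pe, msg.targets.size(), msg.targets.size());
    return 0;
}
```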
More and more computers use hybrid architectures that combine multi-core processors and hardware accelerators such as graphics processing units (GPUs). We present in this paper a new method for efficiently scheduling parallel applications on m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU. The objective is to minimize the maximum completion time (makespan). The corresponding scheduling problem is NP-hard. Copyright (c) 2014 John Wiley & Sons, Ltd.
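The paper's algorithm is not reproduced here; as a minimal baseline under the same problem setting (assuming independent tasks and made-up per-task CPU and GPU processing times), the sketch below greedily places each task on the CPU core or GPU that would finish it earliest and reports the resulting makespan:

```cpp
// Minimal greedy sketch (not the paper's algorithm): place each task on the
// CPU core or GPU that finishes it earliest, given per-task CPU and GPU times.
#include <cstdio>
#include <vector>
#include <utility>
#include <algorithm>

int main() {
    const int m = 4, k = 1;                       // m CPU cores, k GPUs (toy values)
    std::vector<double> cpu_free(m, 0.0), gpu_free(k, 0.0);
    // Each task: {time on a CPU core, time on a GPU}.
    std::vector<std::pair<double, double>> tasks = {{8, 2}, {6, 3}, {4, 4}, {5, 1}, {7, 2}};

    for (auto& [tc, tg] : tasks) {
        auto c = std::min_element(cpu_free.begin(), cpu_free.end());
        auto g = std::min_element(gpu_free.begin(), gpu_free.end());
        if (*c + tc <= *g + tg) *c += tc; else *g += tg;   // earliest finish time wins
    }
    double makespan = std::max(*std::max_element(cpu_free.begin(), cpu_free.end()),
                               *std::max_element(gpu_free.begin(), gpu_free.end()));
    std::printf("greedy makespan = %.1f\n", makespan);
    return 0;
}
```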
ISBN (print): 9781479982523
Recently, the OpenCL hardware-software co-design methodology has gained traction in realizing effective parallel architecture designs on heterogeneous FPGA platforms. In fact, the portability of OpenCL to hardware-ready platforms such as GPUs or multicore CPUs eases design verification, especially for parallel algorithms, before implementing them using cumbersome HDL-based RTL design. In this paper, we employ an OpenCL programming platform based on the Altera SDK for OpenCL (AOCL) to implement a Sobel filter algorithm as an image-processing test case on a Cyclone V FPGA board. Using the portability of this platform, the performance of the kernel code is benchmarked against that of GPU and multicore CPU implementations for different image and kernel sizes. Different optimization strategies are also applied for each platform. We found that increasing the Sobel filter kernel size from 3×3 to 5×5 results in only an 11.3% increase in computation time on the FPGA, while the effect is much more significant on the CPU and GPU, where the execution time increases by as much as 23.6% and 85.7%, respectively.
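The paper's kernel targets the AOCL flow and is not shown in the abstract; as a rough CUDA analogue (image dimensions and launch configuration below are illustrative, not the paper's setup), a 3×3 Sobel filter can be sketched with one thread producing the gradient magnitude of one interior pixel:

```cuda
// CUDA analogue (not the paper's AOCL code) of a 3x3 Sobel kernel: one thread
// computes the gradient magnitude of one interior pixel of a grayscale image.
#include <cuda_runtime.h>
#include <cmath>

__global__ void sobel3x3(const unsigned char* in, unsigned char* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;   // skip the border

    // Horizontal and vertical Sobel responses.
    int gx = -in[(y-1)*w + (x-1)] + in[(y-1)*w + (x+1)]
             - 2*in[y*w + (x-1)]  + 2*in[y*w + (x+1)]
             - in[(y+1)*w + (x-1)] + in[(y+1)*w + (x+1)];
    int gy = -in[(y-1)*w + (x-1)] - 2*in[(y-1)*w + x] - in[(y-1)*w + (x+1)]
             + in[(y+1)*w + (x-1)] + 2*in[(y+1)*w + x] + in[(y+1)*w + (x+1)];
    int mag = (int)sqrtf((float)(gx*gx + gy*gy));
    out[y*w + x] = (unsigned char)(mag > 255 ? 255 : mag);
}

int main() {
    const int w = 512, h = 512;                 // placeholder image size
    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in, w * h);
    cudaMalloc(&d_out, w * h);
    cudaMemset(d_in, 0, w * h);                 // placeholder image data
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    sobel3x3<<<grid, block>>>(d_in, d_out, w, h);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

On the AOCL flow the same stencil is typically rewritten as a single-work-item kernel with a sliding-window line buffer to obtain an efficient pipeline; the CUDA form above only mirrors the arithmetic.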
During the last decade, processor architectures have emerged with hundreds and thousands of high-speed processing cores in a single chip. These cores can work in parallel to share a workload for faster execution. This paper presents performance evaluations on such multicore and many-core devices by mapping a computationally expensive correlation kernel of a template-matching process using various programming models. The work establishes a baseline by mapping the algorithm sequentially on an Intel processor. In the second step, the performance of the algorithm is enhanced by a parallel mapping of the kernel on a shared-memory multicore machine using the OpenMP programming model. Finally, the Normalized Cross-Correlation (NCC) kernel is scaled to a many-core K20 GPU using the CUDA programming model. In all steps, the correctness of the implementation is verified by comparing the computed data with reference results from a high-level implementation in MATLAB. Performance results are presented with various optimization techniques for the MATLAB, sequential, OpenMP, and CUDA-based implementations. The results show that the GPU-based implementation achieves 32x and 5x speed-ups over the baseline and multicore implementations, respectively. Moreover, using inter-block sub-sampling on an 8-bit 4000×4000 reference gray-scale image brings the execution time down to 2.8 s, with an error growth of less than 20% for the selected 96×96 templates.
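The paper's CUDA implementation and its sub-sampling optimization are not listed in the abstract; the sketch below is a simplified NCC kernel under assumed sizes and a zero-mean template convention, in which one thread scores one candidate template position against the reference image:

```cuda
// Simplified CUDA sketch of a normalized cross-correlation (NCC) kernel:
// one thread scores one candidate position (u, v) of a zero-mean template
// against the reference image. Sizes and host setup are illustrative only.
#include <cuda_runtime.h>
#include <cmath>

__global__ void ncc(const float* img, int iw, int ih,
                    const float* tmpl, int tw, int th,   // template already made zero-mean
                    float tmpl_norm,                     // sqrt of summed squared template values
                    float* score) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u > iw - tw || v > ih - th) return;

    // Mean of the image window under the template.
    float mean = 0.0f;
    for (int y = 0; y < th; ++y)
        for (int x = 0; x < tw; ++x)
            mean += img[(v + y) * iw + (u + x)];
    mean /= (float)(tw * th);

    // Zero-mean correlation and window energy.
    float corr = 0.0f, energy = 0.0f;
    for (int y = 0; y < th; ++y)
        for (int x = 0; x < tw; ++x) {
            float d = img[(v + y) * iw + (u + x)] - mean;
            corr += d * tmpl[y * tw + x];
            energy += d * d;
        }
    score[v * (iw - tw + 1) + u] = corr / (sqrtf(energy) * tmpl_norm + 1e-12f);
}

int main() {
    const int iw = 256, ih = 256, tw = 32, th = 32;      // toy sizes
    float *d_img, *d_tmpl, *d_score;
    cudaMalloc(&d_img, iw * ih * sizeof(float));
    cudaMalloc(&d_tmpl, tw * th * sizeof(float));
    cudaMalloc(&d_score, (iw - tw + 1) * (ih - th + 1) * sizeof(float));
    cudaMemset(d_img, 0, iw * ih * sizeof(float));       // placeholder data
    cudaMemset(d_tmpl, 0, tw * th * sizeof(float));
    dim3 block(16, 16), grid((iw - tw + 16) / 16, (ih - th + 16) / 16);
    ncc<<<grid, block>>>(d_img, iw, ih, d_tmpl, tw, th, 1.0f, d_score);
    cudaDeviceSynchronize();
    cudaFree(d_img); cudaFree(d_tmpl); cudaFree(d_score);
    return 0;
}
```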
A large portion of image processing applications comes with stringent requirements regarding performance, energy efficiency, and power. FPGAs have proven to be among the most suitable architectures for algorithms that can be processed in a streaming pipeline. Yet, designing imaging systems for FPGAs remains a very time-consuming task. High-Level Synthesis, which has significantly improved due to recent advancements, promises to overcome this obstacle. In particular, Altera OpenCL is a handy solution for employing an FPGA in a heterogeneous system, as it covers all device communication. However, to obtain efficient hardware implementations, extreme code modifications that contradict OpenCL's data-parallel programming paradigm are necessary. In this work, we explore the programming methodology that yields significantly better hardware implementations for the Altera Offline Compiler. We furthermore designed a compiler back end for a domain-specific source-to-source compiler in order to raise the algorithm description to a higher level and generate highly optimized OpenCL code. Moreover, we extended the compiler to support arbitrary bit-width operations, which are fundamental to hardware designs. We evaluate our approach by discussing the resulting implementations across an extensive application set and comparing them with example designs provided by Altera. In addition, since we can derive multiple implementations for completely different target platforms from the same domain-specific language source code, we also compare the achieved implementations with GPU implementations.
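As a small illustration of why arbitrary bit widths matter (this is not the paper's compiler support, only an assumed software emulation of the semantics), an N-bit unsigned value can be modeled by masking after every update, mirroring the storage a width-N hardware register would provide:

```cpp
// Minimal sketch (not the paper's compiler) of arbitrary-bit-width semantics:
// an N-bit unsigned value emulated in software by masking after every update.
#include <cstdio>
#include <cstdint>

template <unsigned N>
struct UIntN {
    static_assert(N >= 1 && N <= 32, "width must fit in 32 bits");
    static constexpr uint32_t mask = (N == 32) ? 0xFFFFFFFFu : ((1u << N) - 1u);
    uint32_t v = 0;
    UIntN& operator+=(uint32_t x) { v = (v + x) & mask; return *this; }  // wrap at 2^N
};

int main() {
    UIntN<9> acc;            // e.g. a 9-bit accumulator for sums of 8-bit pixels
    acc += 300; acc += 300;  // 600 wraps modulo 512
    std::printf("9-bit accumulator holds %u\n", acc.v);
    return 0;
}
```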
The embedded and high-performance computing (HPC) sectors, which in the past were completely separate, are now converging under the pressure of two driving forces: the release of less power-hungry server processors and the increased performance of the new low-power Systems-on-Chip (SoCs) developed to meet the requirements of the demanding mobile market. This convergence allows applications that were originally confined to traditional HPC systems to be ported to low-power embedded architectures. In this paper, we present our experience of porting the Filtered Back-projection algorithm to a low-power, low-cost system-on-chip, the NVIDIA Tegra K1, which is based on a quad-core ARM CPU and an NVIDIA Kepler GPU. This algorithm is heavily used in 3D tomography reconstruction software. The porting was done using several programming models (i.e., OpenMP and CUDA), and multiple versions of the application were developed to exploit both the SoC CPU and GPU. Performance was measured in terms of 2D slices (of a 3D volume) reconstructed per time unit and per energy unit. The results obtained with all the developed versions are reported and compared with those obtained on a typical x86 HPC node accelerated with a recent NVIDIA GPU. The best performance is achieved by combining the OpenMP and CUDA versions of the algorithm. In particular, we found that only three Jetson TK1 boards, interconnected via Gigabit Ethernet, can reconstruct as many images per time unit as a traditional server while using one order of magnitude less energy. The results of this work can be applied, for instance, to the construction of an energy-efficient computing system for a portable tomographic apparatus.
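The paper's CUDA version is not listed in the abstract; the sketch below shows only the back-projection step that dominates the algorithm, assuming the sinogram has already been ramp-filtered and using an illustrative parallel-beam geometry, with one thread accumulating one pixel of the reconstructed slice:

```cuda
// CUDA sketch of the back-projection step (projections are assumed to be
// already ramp-filtered): each thread accumulates one pixel of the 2D slice
// over all projection angles. Geometry and sizes are illustrative.
#include <cuda_runtime.h>
#include <cmath>

__global__ void backproject(const float* sino,       // filtered sinogram [n_angles][n_det]
                            int n_angles, int n_det,
                            float* slice, int n) {    // n x n output slice
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;    // row
    if (i >= n || j >= n) return;

    const float PI = 3.14159265f;
    float x = i - n / 2.0f, y = j - n / 2.0f;         // pixel centered on the rotation axis
    float acc = 0.0f;
    for (int a = 0; a < n_angles; ++a) {
        float theta = a * PI / n_angles;
        float t = x * cosf(theta) + y * sinf(theta) + n_det / 2.0f;  // detector coordinate
        int t0 = (int)floorf(t);
        if (t0 < 0 || t0 + 1 >= n_det) continue;
        float w = t - t0;                             // linear interpolation between bins
        acc += (1.0f - w) * sino[a * n_det + t0] + w * sino[a * n_det + t0 + 1];
    }
    slice[j * n + i] = acc * PI / n_angles;           // standard FBP scaling
}

int main() {
    const int n_angles = 180, n_det = 367, n = 256;   // toy geometry
    float *d_sino, *d_slice;
    cudaMalloc(&d_sino, n_angles * n_det * sizeof(float));
    cudaMalloc(&d_slice, n * n * sizeof(float));
    cudaMemset(d_sino, 0, n_angles * n_det * sizeof(float));  // placeholder sinogram
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    backproject<<<grid, block>>>(d_sino, n_angles, n_det, d_slice, n);
    cudaDeviceSynchronize();
    cudaFree(d_sino); cudaFree(d_slice);
    return 0;
}
```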
ISBN (print): 9781479989379
Tasking is a prominent parallel programming model. In this paper, we conduct a first study of the feasibility of task-parallel execution at the CUDA grid level, rather than the stream/kernel level, for regular task graphs with fixed in-out dependencies, similar to those found in wavefront computational patterns, making the findings broadly applicable. We propose and evaluate three CUDA task-progression algorithms, in which thread blocks cooperatively process the task graph, and analyze their performance in terms of tasking throughput and the overheads of atomics and memory I/O. Our initial results demonstrate a throughput of 38 million tasks/second on a Kepler K20X architecture.
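The three progression algorithms themselves are not given in the abstract; the sketch below shows the simplest scheme in that spirit (dependencies are deliberately omitted, and the task count and launch shape are arbitrary): thread blocks repeatedly claim task indices from a global atomic counter until the pool is drained.

```cuda
// Minimal sketch of grid-level tasking (not one of the paper's three algorithms):
// thread blocks repeatedly grab task indices from a global atomic counter until
// the task pool is exhausted. Dependencies are omitted to keep the sketch safe.
#include <cuda_runtime.h>

__global__ void task_loop(int* next_task, int num_tasks, float* results) {
    __shared__ int task;                       // the task this block is working on
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(next_task, 1);    // block leader claims the next task
        __syncthreads();
        if (task >= num_tasks) return;         // pool drained: the whole block exits
        // "Process" the task: placeholder work done by the block leader.
        if (threadIdx.x == 0) results[task] = (float)task * 2.0f;
        __syncthreads();                       // all threads done before claiming again
    }
}

int main() {
    const int num_tasks = 1 << 20;             // arbitrary task pool size
    int* d_next; float* d_res;
    cudaMalloc(&d_next, sizeof(int));
    cudaMalloc(&d_res, num_tasks * sizeof(float));
    cudaMemset(d_next, 0, sizeof(int));
    task_loop<<<64, 128>>>(d_next, num_tasks, d_res);   // 64 worker blocks
    cudaDeviceSynchronize();
    cudaFree(d_next); cudaFree(d_res);
    return 0;
}
```

Keeping the claimed index in shared memory lets a single atomicAdd serve the whole thread block, so the atomics overhead grows with the number of blocks rather than the number of threads.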