ISBN (print): 9781467375894
Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation, the code that transforms the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Moreover, many published scheduling algorithms are never tested on a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools used to prepare scheduling papers. We propose Schedeval, a tool that integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud Computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools for studying static scheduling and for observing the behavior of streaming applications running on manycore architectures.
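Schedeval's internals are not given in the abstract; as a minimal sketch of the kind of check such a tool automates (the task costs, the core mapping, and the throughput constraint below are made-up inputs, not from the paper), one can compute the steady-state period implied by a static mapping of streaming tasks to cores and compare it with the required period:

```cpp
// Minimal sketch (not Schedeval itself): check whether a static mapping of
// streaming tasks to cores meets a throughput constraint. Task costs, core
// assignments, and the constraint are hypothetical inputs.
#include <cstdio>
#include <vector>
#include <algorithm>

struct Task { int core; double cost_ms; };  // static mapping + per-activation cost

// Steady-state period of a software-pipelined schedule: the most loaded core
// bounds how often the whole pipeline can produce an output.
double steady_state_period_ms(const std::vector<Task>& tasks, int num_cores) {
    std::vector<double> load(num_cores, 0.0);
    for (const Task& t : tasks) load[t.core] += t.cost_ms;
    return *std::max_element(load.begin(), load.end());
}

int main() {
    std::vector<Task> tasks = {{0, 2.0}, {0, 1.5}, {1, 3.0}, {2, 2.5}};  // toy instance
    double period = steady_state_period_ms(tasks, 3);
    double required_period = 4.0;  // throughput constraint: one output every 4 ms
    std::printf("period = %.2f ms -> %s\n", period,
                period <= required_period ? "constraint met" : "constraint violated");
    return 0;
}
```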
ISBN (print): 9781479975624
The proceedings contain 10 papers. The topics discussed include: scaling parallel 3-D FFT with non-blocking MPI collectives; exploiting data representation for fault tolerance; VCube: a provably scalable distributed diagnosis algorithm; TX: algorithmic energy saving for distributed dense matrix factorizations; CholeskyQR2: a simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system; deflation strategies to improve the convergence of communication-avoiding GMRES; a framework for parallel genetic algorithms for distributed memory architectures; the anatomy of Mr. Scan: a dissection of performance of an extreme scale GPU-based clustering algorithm; performance and portability with OpenCL for throughput-oriented HPC workloads across accelerators, coprocessors, and multicore processors; and a hierarchical tridiagonal system solver for heterogeneous supercomputers.
ISBN (print): 9781467395243
Graph algorithms on distributed-memory systems typically perform heavy communication, often limiting their scalability and performance. This work presents an approach to transparently (without programmer intervention) allow fine-grained graph algorithms to benefit from algorithmic communication-reduction optimizations. In many graph algorithms, the same information is communicated by a vertex to its neighbors, a property we term algorithmic redundancy. Our approach exploits algorithmic redundancy to reduce communication between vertices located on different processing elements. We employ algorithm-aware coarsening of the messages sent during vertex visitation, reducing both the number of messages and the absolute amount of communication in the system. To achieve this, the system structure is represented by a hierarchical graph, facilitating communication optimizations that can take the machine's memory hierarchy into consideration. We also present an optimization for small-world, scale-free graphs wherein hub vertices (i.e., vertices of very large degree) are represented in a similarly hierarchical manner, which is exploited to increase parallelism and reduce communication. Finally, we present a framework that transparently allows fine-grained graph algorithms to utilize our hierarchical approach without programmer intervention, while improving scalability and performance. Experimental results of our proposed approach on more than 131,000 cores show improvements of up to a factor of 8 over the non-hierarchical version for various graph mining and graph analytics algorithms.
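As an illustration of the coarsening idea only (this is not the paper's framework; all types and values below are hypothetical), the host-side sketch groups the identical payload a vertex would send to each neighbor by destination processing element (PE), so one coarsened message per PE replaces one message per edge:

```cpp
// Minimal sketch of algorithm-aware message coarsening: a vertex that sends the
// same payload to many neighbors on the same remote PE emits one coarsened
// message per PE instead of one message per edge. All types are illustrative.
#include <cstdio>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct CoarsenedMsg {
    double payload = 0.0;            // the redundant value (e.g., a PageRank contribution)
    std::vector<uint64_t> targets;   // remote vertex ids that should receive it
};

int main() {
    // Edge list of one source vertex: (destination PE, destination vertex id).
    std::vector<std::pair<int, uint64_t>> neighbors =
        {{1, 10}, {1, 11}, {1, 12}, {2, 20}, {2, 21}};
    double value = 0.25;  // identical for every neighbor -> algorithmic redundancy

    std::unordered_map<int, CoarsenedMsg> outbox;  // one entry per destination PE
    for (auto& [pe, vid] : neighbors) {
        outbox[pe].payload = value;
        outbox[pe].targets.push_back(vid);
    }
    for (auto& [pe, msg] : outbox)
        std::printf("PE %d: 1 payload, %zu targets (instead of %zu full messages)\n",
                    pe, msg.targets.size(), msg.targets.size());
    return 0;
}
```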
More and more computers use hybrid architectures that combine multi-core processors and hardware accelerators such as graphics processing units (GPUs). We present in this paper a new method for efficiently scheduling parallel applications on m CPUs and k GPUs, where each task of the application can be processed either on a core (CPU) or on a GPU. The objective is to minimize the maximum completion time (makespan). The corresponding scheduling problem is NP-hard. Copyright (c) 2014 John Wiley & Sons, Ltd.
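The paper's algorithm is not reproduced here; as a minimal baseline under the same problem setting (assuming independent tasks and made-up per-task CPU and GPU processing times), the sketch below greedily places each task on the CPU core or GPU that would finish it earliest and reports the resulting makespan:

```cpp
// Minimal greedy sketch (not the paper's algorithm): place each task on the
// CPU core or GPU that finishes it earliest, given per-task CPU and GPU times.
#include <cstdio>
#include <vector>
#include <utility>
#include <algorithm>

int main() {
    const int m = 4, k = 1;                       // m CPU cores, k GPUs (toy values)
    std::vector<double> cpu_free(m, 0.0), gpu_free(k, 0.0);
    // Each task: {time on a CPU core, time on a GPU}.
    std::vector<std::pair<double, double>> tasks = {{8, 2}, {6, 3}, {4, 4}, {5, 1}, {7, 2}};

    for (auto& [tc, tg] : tasks) {
        auto c = std::min_element(cpu_free.begin(), cpu_free.end());
        auto g = std::min_element(gpu_free.begin(), gpu_free.end());
        if (*c + tc <= *g + tg) *c += tc; else *g += tg;   // earliest finish time wins
    }
    double makespan = std::max(*std::max_element(cpu_free.begin(), cpu_free.end()),
                               *std::max_element(gpu_free.begin(), gpu_free.end()));
    std::printf("greedy makespan = %.1f\n", makespan);
    return 0;
}
```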
ISBN (print): 9781479982523
Recently, the OpenCL hardware-software co-design methodology has gained traction in realizing effective parallel architecture designs on heterogeneous FPGA platforms. In fact, the portability of OpenCL to hardware-ready platforms such as GPUs or multicore CPUs eases design verification, especially for parallel algorithms, before implementing them using cumbersome HDL-based RTL design. In this paper, we employ an OpenCL programming platform based on the Altera SDK for OpenCL (AOCL) to implement a Sobel filter algorithm as an image-processing test case on a Cyclone V FPGA board. Using the portability of this platform, the performance of the kernel code is benchmarked against that of GPU and multicore CPU implementations for different image and kernel sizes. Different optimization strategies are also applied for each platform. We found that increasing the Sobel filter kernel size from 3×3 to 5×5 results in only an 11.3% increase in computation time on the FPGA, while the effect is much more significant on the CPU and GPU, where the execution time increases by as much as 23.6% and 85.7%, respectively.
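The paper's kernel targets the AOCL flow and is not shown in the abstract; as a rough CUDA analogue (image dimensions and launch configuration below are illustrative, not the paper's setup), a 3×3 Sobel filter can be sketched with one thread producing the gradient magnitude of one interior pixel:

```cuda
// CUDA analogue (not the paper's AOCL code) of a 3x3 Sobel kernel: one thread
// computes the gradient magnitude of one interior pixel of a grayscale image.
#include <cuda_runtime.h>
#include <cmath>

__global__ void sobel3x3(const unsigned char* in, unsigned char* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;   // skip the border

    // Horizontal and vertical Sobel responses.
    int gx = -in[(y-1)*w + (x-1)] + in[(y-1)*w + (x+1)]
             - 2*in[y*w + (x-1)]  + 2*in[y*w + (x+1)]
             - in[(y+1)*w + (x-1)] + in[(y+1)*w + (x+1)];
    int gy = -in[(y-1)*w + (x-1)] - 2*in[(y-1)*w + x] - in[(y-1)*w + (x+1)]
             + in[(y+1)*w + (x-1)] + 2*in[(y+1)*w + x] + in[(y+1)*w + (x+1)];
    int mag = (int)sqrtf((float)(gx*gx + gy*gy));
    out[y*w + x] = (unsigned char)(mag > 255 ? 255 : mag);
}

int main() {
    const int w = 512, h = 512;                 // placeholder image size
    unsigned char *d_in, *d_out;
    cudaMalloc(&d_in, w * h);
    cudaMalloc(&d_out, w * h);
    cudaMemset(d_in, 0, w * h);                 // placeholder image data
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    sobel3x3<<<grid, block>>>(d_in, d_out, w, h);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

On the AOCL flow the same stencil is typically rewritten as a single-work-item kernel with a sliding-window line buffer to obtain an efficient pipeline; the CUDA form above only mirrors the arithmetic.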
During the last decade, processor architectures have emerged with hundreds and thousands of high-speed processing cores in a single chip. These cores can work in parallel to share a workload for faster execution. This paper presents performance evaluations on such multicore and many-core devices by mapping a computationally expensive correlation kernel of a template-matching process using various programming models. The work establishes a baseline by mapping the algorithm sequentially on an Intel processor. In the second step, the performance of the algorithm is enhanced by a parallel mapping of the kernel on a shared-memory multicore machine using the OpenMP programming model. Finally, the Normalized Cross-Correlation (NCC) kernel is scaled to a many-core K20 GPU using the CUDA programming model. In all steps, the correctness of the implementation is verified by comparing the computed data with reference results from a high-level implementation in MATLAB. Performance results are presented with various optimization techniques for the MATLAB, sequential, OpenMP, and CUDA-based implementations. The results show that the GPU-based implementation achieves 32x and 5x speed-ups over the baseline and multicore implementations, respectively. Moreover, using inter-block sub-sampling on an 8-bit 4000×4000 reference gray-scale image brings the execution time down to 2.8 s, with an error growth of less than 20% for the selected 96×96 templates.
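The paper's CUDA implementation and its sub-sampling optimization are not listed in the abstract; the sketch below is a simplified NCC kernel under assumed sizes and a zero-mean template convention, in which one thread scores one candidate template position against the reference image:

```cuda
// Simplified CUDA sketch of a normalized cross-correlation (NCC) kernel:
// one thread scores one candidate position (u, v) of a zero-mean template
// against the reference image. Sizes and host setup are illustrative only.
#include <cuda_runtime.h>
#include <cmath>

__global__ void ncc(const float* img, int iw, int ih,
                    const float* tmpl, int tw, int th,   // template already made zero-mean
                    float tmpl_norm,                     // sqrt of summed squared template values
                    float* score) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u > iw - tw || v > ih - th) return;

    // Mean of the image window under the template.
    float mean = 0.0f;
    for (int y = 0; y < th; ++y)
        for (int x = 0; x < tw; ++x)
            mean += img[(v + y) * iw + (u + x)];
    mean /= (float)(tw * th);

    // Zero-mean correlation and window energy.
    float corr = 0.0f, energy = 0.0f;
    for (int y = 0; y < th; ++y)
        for (int x = 0; x < tw; ++x) {
            float d = img[(v + y) * iw + (u + x)] - mean;
            corr += d * tmpl[y * tw + x];
            energy += d * d;
        }
    score[v * (iw - tw + 1) + u] = corr / (sqrtf(energy) * tmpl_norm + 1e-12f);
}

int main() {
    const int iw = 256, ih = 256, tw = 32, th = 32;      // toy sizes
    float *d_img, *d_tmpl, *d_score;
    cudaMalloc(&d_img, iw * ih * sizeof(float));
    cudaMalloc(&d_tmpl, tw * th * sizeof(float));
    cudaMalloc(&d_score, (iw - tw + 1) * (ih - th + 1) * sizeof(float));
    cudaMemset(d_img, 0, iw * ih * sizeof(float));       // placeholder data
    cudaMemset(d_tmpl, 0, tw * th * sizeof(float));
    dim3 block(16, 16), grid((iw - tw + 16) / 16, (ih - th + 16) / 16);
    ncc<<<grid, block>>>(d_img, iw, ih, d_tmpl, tw, th, 1.0f, d_score);
    cudaDeviceSynchronize();
    cudaFree(d_img); cudaFree(d_tmpl); cudaFree(d_score);
    return 0;
}
```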
A large portion of image processing applications comes with stringent requirements regarding performance, energy efficiency, and power. FPGAs have proven to be among the most suitable architectures for algorithms that can be processed in a streaming pipeline. Yet, designing imaging systems for FPGAs remains a very time-consuming task. High-Level Synthesis, which has significantly improved due to recent advancements, promises to overcome this obstacle. In particular, Altera OpenCL is a handy solution for employing an FPGA in a heterogeneous system, as it covers all device communication. However, to obtain efficient hardware implementations, extreme code modifications that contradict OpenCL's data-parallel programming paradigm are necessary. In this work, we explore the programming methodology that yields significantly better hardware implementations for the Altera Offline Compiler. We furthermore designed a compiler back end for a domain-specific source-to-source compiler in order to raise the algorithm description to a higher level and generate highly optimized OpenCL code. Moreover, we extended the compiler to support arbitrary bit-width operations, which are fundamental to hardware designs. We evaluate our approach by discussing the resulting implementations across an extensive application set and comparing them with example designs provided by Altera. In addition, since we can derive multiple implementations for completely different target platforms from the same domain-specific language source code, we also compare the achieved implementations with GPU implementations.
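As a small illustration of why arbitrary bit widths matter (this is not the paper's compiler support, only an assumed software emulation of the semantics), an N-bit unsigned value can be modeled by masking after every update, mirroring the storage a width-N hardware register would provide:

```cpp
// Minimal sketch (not the paper's compiler) of arbitrary-bit-width semantics:
// an N-bit unsigned value emulated in software by masking after every update.
#include <cstdio>
#include <cstdint>

template <unsigned N>
struct UIntN {
    static_assert(N >= 1 && N <= 32, "width must fit in 32 bits");
    static constexpr uint32_t mask = (N == 32) ? 0xFFFFFFFFu : ((1u << N) - 1u);
    uint32_t v = 0;
    UIntN& operator+=(uint32_t x) { v = (v + x) & mask; return *this; }  // wrap at 2^N
};

int main() {
    UIntN<9> acc;            // e.g. a 9-bit accumulator for sums of 8-bit pixels
    acc += 300; acc += 300;  // 600 wraps modulo 512
    std::printf("9-bit accumulator holds %u\n", acc.v);
    return 0;
}
```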
The embedded and high-performance computing (HPC) sectors, which in the past were completely separate, are now converging under the pressure of two driving forces: the release of less power-hungry server processors and the increased performance of the new low-power Systems-on-Chip (SoCs) developed to meet the requirements of the demanding mobile market. This convergence allows applications that were originally confined to traditional HPC systems to be ported to low-power embedded architectures. In this paper, we present our experience of porting the Filtered Back-projection algorithm to a low-power, low-cost system-on-chip, the NVIDIA Tegra K1, which is based on a quad-core ARM CPU and an NVIDIA Kepler GPU. This algorithm is heavily used in 3D tomography reconstruction software. The porting was done using several programming models (i.e., OpenMP and CUDA), and multiple versions of the application were developed to exploit both the SoC CPU and GPU. Performance was measured in terms of 2D slices (of a 3D volume) reconstructed per time unit and per energy unit. The results obtained with all the developed versions are reported and compared with those obtained on a typical x86 HPC node accelerated with a recent NVIDIA GPU. The best performance is achieved by combining the OpenMP and CUDA versions of the algorithm. In particular, we found that only three Jetson TK1 boards, interconnected via Gigabit Ethernet, can reconstruct as many images per time unit as a traditional server while using one order of magnitude less energy. The results of this work can be applied, for instance, to the construction of an energy-efficient computing system for a portable tomographic apparatus.
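The paper's CUDA version is not listed in the abstract; the sketch below shows only the back-projection step that dominates the algorithm, assuming the sinogram has already been ramp-filtered and using an illustrative parallel-beam geometry, with one thread accumulating one pixel of the reconstructed slice:

```cuda
// CUDA sketch of the back-projection step (projections are assumed to be
// already ramp-filtered): each thread accumulates one pixel of the 2D slice
// over all projection angles. Geometry and sizes are illustrative.
#include <cuda_runtime.h>
#include <cmath>

__global__ void backproject(const float* sino,       // filtered sinogram [n_angles][n_det]
                            int n_angles, int n_det,
                            float* slice, int n) {    // n x n output slice
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;    // row
    if (i >= n || j >= n) return;

    const float PI = 3.14159265f;
    float x = i - n / 2.0f, y = j - n / 2.0f;         // pixel centered on the rotation axis
    float acc = 0.0f;
    for (int a = 0; a < n_angles; ++a) {
        float theta = a * PI / n_angles;
        float t = x * cosf(theta) + y * sinf(theta) + n_det / 2.0f;  // detector coordinate
        int t0 = (int)floorf(t);
        if (t0 < 0 || t0 + 1 >= n_det) continue;
        float w = t - t0;                             // linear interpolation between bins
        acc += (1.0f - w) * sino[a * n_det + t0] + w * sino[a * n_det + t0 + 1];
    }
    slice[j * n + i] = acc * PI / n_angles;           // standard FBP scaling
}

int main() {
    const int n_angles = 180, n_det = 367, n = 256;   // toy geometry
    float *d_sino, *d_slice;
    cudaMalloc(&d_sino, n_angles * n_det * sizeof(float));
    cudaMalloc(&d_slice, n * n * sizeof(float));
    cudaMemset(d_sino, 0, n_angles * n_det * sizeof(float));  // placeholder sinogram
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    backproject<<<grid, block>>>(d_sino, n_angles, n_det, d_slice, n);
    cudaDeviceSynchronize();
    cudaFree(d_sino); cudaFree(d_slice);
    return 0;
}
```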
ISBN (print): 9781479989379
Tasking is a prominent parallel programming model. In this paper, we conduct a first study of the feasibility of task-parallel execution at the CUDA grid level, rather than the stream/kernel level, for regular task graphs with fixed in-out dependencies, similar to those found in wavefront computational patterns, making the findings broadly applicable. We propose and evaluate three CUDA task-progression algorithms, in which thread blocks cooperatively process the task graph, and analyze their performance in terms of tasking throughput and the overheads of atomics and memory I/O. Our initial results demonstrate a throughput of 38 million tasks/second on a Kepler K20X architecture.
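The three progression algorithms themselves are not given in the abstract; the sketch below shows the simplest scheme in that spirit (dependencies are deliberately omitted, and the task count and launch shape are arbitrary): thread blocks repeatedly claim task indices from a global atomic counter until the pool is drained.

```cuda
// Minimal sketch of grid-level tasking (not one of the paper's three algorithms):
// thread blocks repeatedly grab task indices from a global atomic counter until
// the task pool is exhausted. Dependencies are omitted to keep the sketch safe.
#include <cuda_runtime.h>

__global__ void task_loop(int* next_task, int num_tasks, float* results) {
    __shared__ int task;                       // the task this block is working on
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(next_task, 1);    // block leader claims the next task
        __syncthreads();
        if (task >= num_tasks) return;         // pool drained: the whole block exits
        // "Process" the task: placeholder work done by the block leader.
        if (threadIdx.x == 0) results[task] = (float)task * 2.0f;
        __syncthreads();                       // all threads done before claiming again
    }
}

int main() {
    const int num_tasks = 1 << 20;             // arbitrary task pool size
    int* d_next; float* d_res;
    cudaMalloc(&d_next, sizeof(int));
    cudaMalloc(&d_res, num_tasks * sizeof(float));
    cudaMemset(d_next, 0, sizeof(int));
    task_loop<<<64, 128>>>(d_next, num_tasks, d_res);   // 64 worker blocks
    cudaDeviceSynchronize();
    cudaFree(d_next); cudaFree(d_res);
    return 0;
}
```

Keeping the claimed index in shared memory lets a single atomicAdd serve the whole thread block, so the atomics overhead grows with the number of blocks rather than the number of threads.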