Search results - Inner Mongolia University Library

Toward a BLAS library truly portable across different accelerator types

JOURNAL OF SUPERCOMPUTING, 2019, vol. 75, no. 11, pp. 7101-7124

Authors: Rodriguez-Gutiez, Eduardo; Moreton-Fernandez, Ana; Gonzalez-Escribano, Arturo; Llanos, Diego R. (Univ Valladolid, Dept Informat, Paseo Belen, E-47011 Valladolid, Spain)

Scientific applications are some of the most computationally demanding software pieces. Their core is usually a set of linear algebra operations, which may represent a significant part of the overall run-time of the application. BLAS libraries aim to solve this problem by exposing a set of highly optimized, reusable routines. There are several implementations specifically tuned for different types of computing platforms, including coprocessors. Some examples include the one bundled with the Intel MKL library, which targets Intel CPUs or Xeon Phi coprocessors, or the cuBLAS library, which is specifically designed for NVIDIA GPUs. Nowadays, computing nodes in many supercomputing clusters include one or more different coprocessor types. Fully exploiting these platforms might require programs that can adapt at run-time to the chosen device type, hardwiring in the program the code needed to use a different library for each device type that can be selected. This also forces the programmer to deal with different interface particularities and mechanisms to manage the memory transfers of the data structures used as parameters. This paper presents a unified, performance-oriented, and portable interface for BLAS. This interface has been integrated into a heterogeneous programming model (Controllers) which supports groups of CPU cores, Xeon Phi accelerators, or NVIDIA GPUs in a transparent way. The contributions of this paper include: an abstraction layer to hide programming differences between diverse BLAS libraries; new types of kernel classes to support the context manipulation of different external BLAS libraries; a new kernel selection policy that considers both programmer kernels and different external libraries; a complete new Controller library interface for the whole collection of BLAS routines. This proposal enables the creation of BLAS-based portable codes that can execute on top of different types of accelerators by changing a single initialization parameter. Our softw

Keywords: BLAS; Parallel programming; Scientific libraries; heterogeneous programming; Accelerators; Coprocessors; GPU; Xeon Phi; MIC; CUDA
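
The idea of selecting a BLAS backend through a single initialization parameter can be illustrated with a minimal Python sketch. This is not the Controllers API: the class `PortableBlas`, the device names `"cpu"` and `"accel"`, and both stand-in backends are hypothetical, and a real layer would dispatch to libraries such as MKL or cuBLAS and manage memory transfers.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def _gemm_naive(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in backend 1: plain triple-loop matrix product."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for row in a
    ]


def _gemm_transposed(a: Matrix, b: Matrix) -> Matrix:
    """Stand-in backend 2: same product, computed against the transpose of b."""
    b_t = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


# Hypothetical registry: device name -> backend implementation.
_BACKENDS: Dict[str, Callable[[Matrix, Matrix], Matrix]] = {
    "cpu": _gemm_naive,
    "accel": _gemm_transposed,
}


class PortableBlas:
    """Hypothetical facade: the backend is fixed once, at construction time."""

    def __init__(self, device: str) -> None:
        if device not in _BACKENDS:
            raise ValueError(f"unknown device type: {device!r}")
        self._gemm = _BACKENDS[device]

    def gemm(self, a: Matrix, b: Matrix) -> Matrix:
        # Caller code is identical whichever backend was selected.
        return self._gemm(a, b)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result_cpu = PortableBlas("cpu").gemm(a, b)
result_accel = PortableBlas("accel").gemm(a, b)
```

Only the string passed to the constructor changes between the two calls, which mirrors the abstract's claim at a very small scale.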

Controllers: An abstraction to ease the use of hardware accelerators

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2018, vol. 32, no. 6, pp. 838-853

Authors: Moreton-Fernandez, Ana; Ortega-Arranz, Hector; Gonzalez-Escribano, Arturo (Univ Valladolid, Valladolid, Spain)

Nowadays the use of hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, is key in solving computationally costly problems that require high performance computing. However, programming solutions for an efficient deployment on this kind of device is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to carry out a deep study of the particular data that needs to be computed at each moment, across different computing platforms, also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage the communications and kernel launching details on hardware accelerators in a transparent way. This model also provides the possibility of defining and launching central processing unit kernels in multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model simplifies the selection of values for several configuration parameters that can be set when a kernel is launched. This is done through a qualitative characterization process of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in the development and porting costs, with significantly low overheads in the execution times when compared to manually programmed and optimized solutions which directly use CUDA and OpenMP.

Keywords: Parallel programming; GPUs; CUDA; heterogeneous programming

An OpenCL framework for high performance extraction of image features

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2017, vol. 109, pp. 75-88

Authors: de Andrade, Douglas Coimbra; Trabasso, Luis Gonzaga (Petroleo Brasileiro SA, Sao Paulo, Brazil; Aeronaut Inst Technol, Mech Engn Div, Sao Jose Dos Campos, Brazil)

Image features are widely used for object identification in many situations, including interpretation of data containing natural scenes captured by unmanned aerial vehicles. This paper presents a parallel framework to extract additive features (such as color features and histogram of oriented gradients) using the processing power of GPUs and multicore CPUs to accelerate the algorithms with the OpenCL language. The resulting features are available in device memory and then can be fed into classifiers such as SVM, logistic regression and boosting methods for object recognition. It is possible to extract multiple features with better performance. The GPU accelerated image integral algorithm speeds up computations up to 35x when compared to the single-thread CPU implementation in a test bed hardware. The proposed framework allows real-time extraction of a very large number of image features from full-HD images (better than 30 fps) and makes them available for access in coalesced order by GPU classification algorithms. (C) 2017 Elsevier Inc. All rights reserved.

Keywords: OpenCL; heterogeneous programming; Image descriptors; Additive features; Haar features; Histogram of oriented gradients; Parallel processing
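
The image integral (summed-area table) mentioned in the abstract is what makes additive features cheap: once it is built, the sum over any rectangle costs four lookups. The sketch below is a sequential Python illustration of that general technique, not the paper's OpenCL implementation; the function names and the tiny test image are made up for the example.

```python
from typing import List

Grid = List[List[int]]


def integral_image(img: Grid) -> Grid:
    """Build a summed-area table with one extra zero row and column.

    sat[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    """
    height, width = len(img), len(img[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0  # running sum of the current row
        for x in range(width):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat: Grid, top: int, left: int, bottom: int, right: int) -> int:
    """Sum of the image over rows top..bottom-1 and columns left..right-1."""
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
sat = integral_image(image)
total = box_sum(sat, 0, 0, 3, 3)         # whole image
lower_right = box_sum(sat, 1, 1, 3, 3)   # the 2x2 block 5, 6, 8, 9
```

Because a box sum is a fixed combination of four table entries, many rectangular features can be evaluated independently, which is the property a parallel implementation would exploit.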

OneGraph: a cross-architecture framework for large-scale graph computing on GPUs based on oneAPI

CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2024, vol. 6, no. 2, pp. 179-191

Authors: Li, Shiyang; Zhu, Jingyu; Han, Jiaxun; Peng, Yuting; Wang, Zhuoran; Gong, Xiaoli; Wang, Gang; Zhang, Jin; Wang, Xuqiang (Nankai Univ, College Comp Sci, Tianjin 300350, Peoples R China; State Grid Tianjin Informat & Commun Co, Tianjin, Peoples R China)

The explosive growth of graph data sets has led to an increase in the computing power and storage resources required for graph computing. To handle large-scale graph processing, heterogeneous platforms have become necessary to provide sufficient computing power and storage. The most popular scheme for this is the CPU-GPU architecture. However, the steep learning curve and complex concurrency control for heterogeneous platforms pose a challenge for developers. Additionally, GPUs from different vendors have varying software stacks, making cross-platform porting and verification challenging. Recently, Intel proposed a unified programming model to manage multiple heterogeneous devices at the same time, named oneAPI. It provides a more friendly programming model for simple C++ developers and a convenient concurrency control scheme, allowing devices from different vendors to be managed at the same time. Hence there is an opportunity to utilize oneAPI to design a general cross-architecture framework for large-scale graph computing. In this paper, we propose a large-scale graph computing framework for multiple types of accelerators with Intel oneAPI, which we name OneGraph. Our approach significantly reduces the data transfer between GPU and CPU and masks the latency by asynchronous transfer, which significantly improves performance. We conducted rigorous performance tests on the framework using four classical graph algorithms. The experiment results show that our approach achieves an average speedup of 3.3x over the state-of-the-art partitioning-based approaches. Moreover, thanks to the cross-architecture model of Intel oneAPI, the framework can be deployed on different GPU platforms without code modification. Our evaluation shows that OneGraph has less than 1% performance loss compared to the dedicated programming model on GPUs in large-scale graph computing.

Keywords: heterogeneous programming; Graph computing; Out-of-memory process; Cross-architecture portability; OneAPI

Supporting Unified Shader Specialization by Co-opting C++ Features

PROCEEDINGS OF THE ACM ON COMPUTER GRAPHICS AND INTERACTIVE TECHNIQUES, 2022, vol. 5, no. 3, pp. 1–17

Authors: Seitz, Kerry A., Jr.; Foley, Theresa; Porumbescu, Serban D.; Owens, John D. (Univ Calif Davis, Dept Comp Sci, One Shields Ave, Davis, CA 95616, USA; NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA; Univ Calif Davis, Dept Elect & Comp Engn, One Shields Ave, Davis, CA 95616, USA)

Modern unified programming models (such as CUDA and SYCL) that combine host (CPU) code and GPU code into the same programming language, same file, and same lexical scope lack adequate support for GPU code specialization, which is a key optimization in real-time graphics. Furthermore, current methods used to implement specialization do not translate to a unified environment. In this paper, we create a unified shader programming environment in C++ that provides first-class support for specialization by co-opting C++'s attribute and virtual function features and reimplementing them with alternate semantics to express the services required. By co-opting existing features, we enable programmers to use familiar C++ programming techniques to write host and GPU code together, while still achieving efficient generated C++ and HLSL code via our source-to-source translator.

Keywords: Shaders; Shading Languages; Real-Time Rendering; heterogeneous programming; Unified programming

RenderKernel: High-level programming for real-time rendering systems

Visual Informatics, 2024, vol. 8, no. 3, pp. 82-95

Authors: Jinyuan Yang; Soumyabrata Dev; Abraham G. Campbell (University College Dublin, Ireland)

Real-time rendering applications leverage heterogeneous computing to optimize ***, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural ***, the creation of such systems requires verbose and unsafe programming *** developments in domain-specific and unified shading languages aim to mitigate these ***, current programming models primarily address data layout consistency, neglecting other persistent *** this paper, we introduce RenderKernel, a programming model designed to simplify the development of real-time rendering *** the need for a high-level approach, RenderKernel addresses the specific challenges of real-time rendering, enabling development on heterogeneous systems as if they were *** model allows for early detection and prevention of errors due to system heterogeneity at ***, RenderKernel enables the use of common programming patterns from homogeneous environments, freeing developers from the complexities of underlying heterogeneous *** can focus on coding unique application features, thereby enhancing productivity and reducing the cognitive load associated with real-time rendering system development.

Keywords: heterogeneous programming; High-level programming; Real-time rendering; Rendering systems

OmpSs@cloudFPGA: An FPGA Task-Based Programming Model with Message Passing

36th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS)

Authors: Miguel de Haro, Juan; Cano, Ruben; Alvarez, Carlos; Jimenez-Gonzalez, Daniel; Martorell, Xavier; Ayguade, Eduard; Labarta, Jesus; Abel, Francois; Ringlein, Burkhard; Weiss, Beat (Barcelona Supercomp Ctr, Barcelona, Spain; Univ Politecn Cataluna, Barcelona, Spain; IBM Res Europe, Zurich, Switzerland)

ISBN: (print) 9781665481069

Nowadays, a new parallel paradigm for energy-efficient heterogeneous hardware infrastructures is required to achieve better performance at a reasonable cost on high-performance computing applications. Under this new paradigm, some application parts are offloaded to specialized accelerators that run faster or are more energy-efficient than CPUs. Field-Programmable Gate Arrays (FPGA) are one of those types of accelerators that are becoming widely available in data centers. This paper proposes OmpSs@cloudFPGA, which includes novel extensions to parallel task-based programming models that enable easy and efficient programming of heterogeneous clusters with FPGAs. The programmer only needs to annotate, with OpenMP-like pragmas, the tasks of the application that should be accelerated in the cluster of FPGAs. Next, the proposed programming model framework automatically extracts parts annotated with High-Level Synthesis (HLS) pragmas and synthesizes them into hardware accelerator cores for FPGAs. Additionally, our extensions include and support two novel features: 1) FPGA-to-FPGA direct communication using a Message Passing Interface (MPI)-like Application Programming Interface (API) with one-to-one and collective communications to alleviate the host communication channel bottleneck, and 2) creating and spawning work from inside the FPGAs to their own accelerator cores based on an MPI rank-like identification. These features break the classical host-accelerator model, where the host (typically the CPU) generates all the work and distributes it to each accelerator. We also present an evaluation of OmpSs@cloudFPGA for different parallel strategies of the N-Body application on the IBM cloudFPGA research platform. Results show that for cluster sizes up to 56 FPGAs, the performance scales linearly. To the best of our knowledge, this is the best performance obtained for N-body over FPGA platforms, reaching 344 Gpairs/s with 56 FPGAs. Finally, we compare the performance and power consu

Keywords: FPGA; MPI; OpenMP; programming models; network-attached FPGA; stand-alone FPGA; High-Level Synthesis; heterogeneous programming; High-performance computing

Using hStreams Programming Library for Accelerating a Real-Life Application on Intel MIC

16th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP)

Authors: Szustak, Lukasz; Halbiniak, Kamil; Kulawik, Adam; Wyrzykowski, Roman; Uminski, Piotr; Sasinowski, Marcin (Czestochowa Tech Univ, Czestochowa, Poland; Intel Corp, Santa Clara, CA, USA)

ISBN: (print) 9783319499567; 9783319499550

The main goal of this paper is the suitability assessment of the hStreams programming library for porting a real-life scientific application to heterogeneous platforms with Intel Xeon Phi coprocessors. This emerging library offers a higher level of abstraction to provide effective concurrency among tasks, and control over the overall performance. In our study, we focus on applying the FIFO streaming model for a parallel application which implements the numerical model of alloy solidification. In the paper, we show how scientific applications can benefit from multiple streams. To take full advantage of hStreams, we propose a decomposition of the studied application that allows us to distribute tasks belonging to the computational core of the application among two logical streams within two logical/physical domains. Effectively overlapping computations with data transfers is another goal achieved in this way. The proposed approach allows us to execute the whole application 3.5 times faster than the original parallel version running on two CPUs.

Keywords: Intel MIC; Hybrid architecture; Numerical modeling of solidification; heterogeneous programming; Hstreams library; Task and data parallelism

Overhauling SC Atomics in C11 and OpenCL

43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16)

Authors: Batty, Mark; Donaldson, Alastair F.; Wickerson, John (Univ Kent, Canterbury CT2 7NZ, Kent, England; Univ London Imperial Coll Sci Technol & Med, London SW7 2AZ, England)

ISBN: (print) 9781450335492

Despite the conceptual simplicity of sequential consistency (SC), the semantics of SC atomic operations and fences in the C11 and OpenCL memory models is subtle, leading to convoluted prose descriptions that translate to complex axiomatic formalisations. We conduct an overhaul of SC atomics in C11, reducing the associated axioms in both number and complexity. A consequence of our simplification is that the SC operations in an execution no longer need to be totally ordered. This relaxation enables, for the first time, efficient and exhaustive simulation of litmus tests that use SC atomics. We extend our improved C11 model to obtain the first rigorous memory model formalisation for OpenCL (which extends C11 with support for heterogeneous many-core programming). In the OpenCL setting, we refine the SC axioms still further to give a sensible semantics to SC operations that employ a 'memory scope' to restrict their visibility to specific threads. Our overhaul requires slight strengthenings of both the C11 and the OpenCL memory models, causing some behaviours to become disallowed. We argue that these strengthenings are natural, and that all of the formalised C11 and OpenCL compilation schemes of which we are aware (Power and x86 CPUs for C11, AMD GPUs for OpenCL) remain valid in our revised models. Using the HERD memory model simulator, we show that our overhaul leads to an exponential improvement in simulation time for C11 litmus tests compared with the original model, making exhaustive simulation competitive, time-wise, with the non-exhaustive CDSChecker tool.

Keywords: Formal methods; graphics processing unit (GPU); heterogeneous programming; HOL theorem prover; language design; program simulation; weak memory models
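
For readers unfamiliar with litmus tests, the following Python sketch exhaustively enumerates the classic store-buffering test under textbook sequential consistency, modelled as all interleavings that preserve each thread's program order. It is only an illustration of what "exhaustive simulation of a litmus test" means: it is neither the paper's axiomatic model nor the HERD simulator, and the instruction encoding and names are invented for the example.

```python
from typing import Dict, Iterator, List, Set, Tuple

# An instruction is ("store", variable, value) or ("load", variable, register).
Instr = Tuple[str, str, object]

# Store-buffering litmus test: each thread writes one variable, then reads the other.
THREAD0: List[Instr] = [("store", "x", 1), ("load", "y", "r0")]
THREAD1: List[Instr] = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(t0: List[Instr], t1: List[Instr]) -> Iterator[List[Instr]]:
    """Yield every merge of t0 and t1 that keeps each thread's program order."""
    if not t0:
        yield list(t1)
        return
    if not t1:
        yield list(t0)
        return
    for rest in interleavings(t0[1:], t1):
        yield [t0[0]] + rest
    for rest in interleavings(t0, t1[1:]):
        yield [t1[0]] + rest


def run(schedule: List[Instr]) -> Tuple[int, int]:
    """Execute one interleaving against a single shared memory."""
    memory: Dict[str, int] = {"x": 0, "y": 0}
    registers: Dict[str, int] = {}
    for op, var, arg in schedule:
        if op == "store":
            memory[var] = arg
        else:
            registers[arg] = memory[var]
    return registers["r0"], registers["r1"]


schedules = list(interleavings(THREAD0, THREAD1))
outcomes: Set[Tuple[int, int]] = {run(s) for s in schedules}
forbidden_seen = (0, 0) in outcomes  # both loads reading the initial value
```

Under this interleaving model the outcome `r0 == 0 and r1 == 0` never appears; weaker memory models allow it, which is why tests of this shape are used to distinguish models.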

CLOP: A Multi-stage Compiler to Seamlessly Embed Heterogeneous Code

14th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences (GPCE), co-located with the SPLASH Conference, 2015

Authors: Makarov, Dmitri; Hauswirth, Matthias (Univ Svizzera Italiana, Lugano, Switzerland)

ISBN: (print) 9781450336871

Heterogeneous programming complicates software development. We present CLOP, a platform that embeds code targeting heterogeneous compute devices in a convenient and clean way, allowing unobstructed data flow between the host code and the devices, reducing the amount of source code by an order of magnitude. The CLOP compiler uses the standard facilities of the D programming language to generate code strictly at compile-time. In this paper we describe the CLOP language and the CLOP compiler implementation.

Keywords: heterogeneous programming; Embedded Languages; Staging
