To achieve high performance, modern HPC systems take advantage of heterogeneous GPU architectures. Often these GPUs are programmed using a vendor-preferred parallel programming model. Unfortunately, this often results...
ISBN: (Print) 9781479941155
This paper presents experience using a research-infused teaching approach in an undergraduate parallel programming course. The research-teaching nexus is applied at various levels: first through research-led teaching of core parallel programming concepts, as well as teaching the latest developments from the affiliated research group. The bulk of the course, however, focuses on student-driven research-based and research-tutored teaching approaches, in which students actively participate in group research projects, are fully immersed in the learning activities of their respective projects, and at the same time take part in discussions of wider parallel programming topics across the other groups. This close affiliation between the undergraduate course and the research group yields a wide range of benefits for all involved.
Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task-graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in implementing parallel and heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a very general class of task graph parallelism with in-graph control flow, enabling end-to-end parallel optimization. To support our model with high performance, we design an efficient system runtime that solves many of the new scheduling challenges arising from our model and optimizes performance across latency, energy efficiency, and throughput. We have demonstrated the promising performance of Taskflow in real-world applications. As an example, Taskflow solves a large-scale machine learning workload up to 29% faster, with 1.5x lower memory usage and 1.9x higher throughput, than the industrial system oneTBB on a machine with 40 CPUs and 4 GPUs. We have open-sourced Taskflow and deployed it to a large number of users in the open-source community.
ISBN: (Print) 9798350311990
Experimental research in physics can be a costly and time-consuming venture, requiring simulation-based approaches to effectively narrow the scope of experiments to only the most promising cases. Our multidisciplinary research in this paper demonstrates how the simulation of light-driven nanoparticles can benefit substantially from GPU-based parallelism. We develop a novel ray-tracing strategy, implement it in C++/CUDA, and extend it with a parallel differential equation solver. Our implementation relies on a custom memory layout optimization to tackle the computational challenges in the field and provide accurate solutions in near real time. We evaluate our approach on a variety of popular GPU architectures, including advanced data-center GPUs such as the Nvidia V100, as well as consumer-grade hardware such as the Nvidia RTX 2080 Ti and Nvidia GTX 1080. Our GPU-based approach achieves a speedup of up to 20x compared to a parallel CPU-based prototype implementation.
Driven by the demands of deep learning, many hardware accelerators, including GPUs, have begun to include specialized tensor processing units to accelerate matrix operations. However, general-purpose GPU applications with few or no large dense matrix operations cannot benefit from these tensor units. This article proposes Tensorox, a framework that exploits the half-precision tensor cores available on recent GPUs for approximable, non-deep-learning applications. In essence, a shallow neural network is trained on the input-output mapping of the function to be approximated. The key innovation in our implementation is the use of the small, dimension-restricted tensor operations in Nvidia GPUs to run multiple instances of the approximation neural network in parallel. With the proper scaling and training methods, our approximation yielded an overall accuracy higher than naively running the original programs in half precision. Furthermore, Tensorox allows the degree of approximation to be adjusted at runtime. For the 10 benchmarks we tested, we achieved speedups from 2x to 112x compared to the original single-precision floating-point implementations, while keeping the error caused by the approximation below 10 percent in most applications.
ISBN: (Print) 9798400702532
Scientists, engineers, and researchers leverage high-performance computing (HPC) systems to perform complex computations and process large amounts of data. Designing, developing, and operating HPC systems have a steep learning curve, thus making it crucial to train a highly skilled and knowledgeable workforce in order to keep up with the rapidly evolving field, drive innovation, and meet the increasing demand for HPC across various sectors. Limited access to HPC educational resources is the main deterrent to training HPC talent. This paper addresses two primary culprits for the limited access: the high cost of production systems and the lack of realistic full-stack HPC training. Cutting-edge hardware is usually expensive and requires specialized facilities. Moreover, large HPC facilities typically discourage experimenting with the systems since they run production computation workloads and require minimal disturbance. Furthermore, HPC training often does not reflect the scale or complexity of production systems. This lack of realistic training support makes education in this area particularly difficult and ineffective. This paper proposes an educational framework for HPC that includes the development of a low-cost and flexible platform design for users in diverse fields. It allows study and experimentation with multiple realistic elements involved in a production HPC ecosystem. DEMAC, the Delaware Modular Assembly Cluster, is a set of 3D-printable frames designed to house embedded systems and auxiliary systems in a way that emulates HPC platforms. The teaching framework focuses on practical training as an education model in which learners reinforce theoretical knowledge with hands-on experience. If successful, this effort will contribute fundamentally to scientific research, technological advancements, HPC workforce development, and economic growth.
ISBN: (Print) 9798350332865
Sensor technology has impacted and transformed several commercial and military industries, from automotive to defense, to personal security applications. Adopting Computer Vision techniques such as Object Detection helps boost the system's capabilities in tracking and classification. A previous study looked at optimizing a Radar System that relied on the processing power of the Raspberry Pi 4. Since the Raspberry Pi 4 is not an ideal device for a real-time environment, this experiment aims to improve on that aspect by utilizing the processing power of the Jetson TX2 to boost the system's speed and reduce latency between the Object Detection and Velocity calculation. The experiment will be repeated using the Jetson hardware to mimic the same scenarios in common tracking and surveillance systems.
ISBN: (Print) 9781665476522
Hardware Transactional Memory (HTM) is a high-performance instantiation of the powerful programming abstraction of transactional memory, which simplifies the daunting, yet critically important, task of parallel programming. While many HTM implementations of varying complexity exist in the literature, commercially available HTMs impose rigid restrictions on transaction and system behavior, limiting their practical use. A key constraint is the limited size of supported transactions, implicitly capped by hardware buffering capacity. We identify the opportunity to expand the effective capacity of these limited hardware structures by being more selective about which memory accesses need to be tracked. We leverage compiler and virtual memory support to identify safe memory accesses, which can never cause a transaction abort, and pass them as safety hints to the underlying HTM. With minor extensions over a conventional HTM implementation, HinTM uses these hints to allocate transactional state tracking resources to unsafe accesses only, thus expanding the HTM's effective capacity and, conversely, reducing capacity aborts. We demonstrate that HinTM effectively augments the performance of a range of baseline HTM configurations. When coupled with a POWER8 HTM implementation, HinTM eliminates 64% of transactional capacity aborts, achieving a 1.4x average speedup, and up to 8.7x.
ISBN: (Print) 9783031407437; 9783031407444
As the supercomputing landscape diversifies, solutions such as Kokkos for writing vendor-agnostic applications and libraries have risen in popularity. Kokkos provides a programming model designed for performance portability, which allows developers to write a single-source implementation that can run efficiently on various architectures. At its heart, Kokkos maps parallel algorithms to architecture- and vendor-specific backends written in lower-level programming models such as CUDA and HIP. Another approach to writing vendor-agnostic parallel code is OpenMP's directive-based approach, which lets developers annotate code to express parallelism. It is implemented at the compiler level and is supported by all major high-performance computing vendors, as well as the primary open-source toolchains GNU and LLVM. Since its inception, Kokkos has used OpenMP to parallelize on CPU architectures. In this paper, we explore leveraging OpenMP for a GPU backend and discuss the challenges we encountered when mapping the Kokkos APIs and semantics to OpenMP target constructs. As an exemplar workload we chose a simple conjugate gradient solver for sparse matrices. We find that performance on NVIDIA and AMD GPUs varies widely based on details of the implementation strategy and the chosen compiler. Furthermore, the performance of the OpenMP implementations decreases with increasing complexity of the investigated algorithms.
ISBN: (Print) 9798400702969
We present a dependent type system for enforcing array-size consistency in an ML-style functional array language. Our goal is to enforce shape-consistency at compile time and allow nontrivial transformations on array shapes, without the complexity such features tend to introduce in dependently typed languages. Sizes can be arbitrary expressions and size equality is purely syntactical, which fits naturally within a scheme that interprets size-polymorphic functions as having implicit arguments. When non-syntactical equalities are needed, we provide dynamic checking. In contrast to other dependently typed languages, we automate the book-keeping involved in tracking existential sizes, such as when filtering arrays. We formalise a large subset of the presented type system and prove it sound. We also discuss how to adapt the type system for a real implementation, including type inference, within the Futhark programming language.