Graphics Processing Units (GPUs) offer tremendous computational power by following a throughput-oriented paradigm in which many thousands of computational units operate in parallel. Programming such massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU programming languages, such as CUDA and OpenCL, are based on C/C++ and inherit their fundamentally unsafe ways of accessing memory via raw pointers. This facilitates easy-to-make but hard-to-detect bugs such as data races and deadlocks. In this paper, we present Descend: a safe GPU programming language. In contrast to prior safe high-level GPU programming approaches, Descend is an imperative GPU systems programming language in the spirit of Rust, enforcing safe CPU and GPU memory management in the type system by tracking ownership and lifetimes. Descend introduces a new holistic GPU programming model in which computations are hierarchically scheduled over the GPU's execution resources: grid, blocks, warps, and threads. Descend's extended borrow checking ensures that execution resources safely access memory regions without data races. For this, we introduce views, which describe safe parallel access patterns over memory regions, as well as atomic variables. For memory accesses that cannot be checked by our type system, users can annotate limited code sections as unsafe. We discuss the memory safety guarantees offered by Descend and evaluate our implementation on multiple benchmarks, demonstrating that Descend can express real-world GPU programs with performance competitive with manually written CUDA programs that lack Descend's safety guarantees.
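The abstract does not show Descend's view syntax, but the core idea behind views can be sketched in plain Python: partition an index range into disjoint per-thread slices, so that a thread writing only through its own slice cannot race with another. The function name `block_view` and the partitioning scheme are illustrative assumptions, not Descend's actual API.

```python
def block_view(buffer_len, num_threads):
    """Partition the index range [0, buffer_len) into disjoint,
    contiguous per-thread slices. If every thread writes only through
    its own slice, no two threads can write the same element, which is
    the safety property a borrow checker can verify statically."""
    base, rem = divmod(buffer_len, num_threads)
    views, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < rem else 0)  # spread the remainder
        views.append(range(start, start + size))
        start += size
    return views

views = block_view(10, 4)
# Disjoint and covering: every index appears in exactly one view.
assert sorted(i for v in views for i in v) == list(range(10))
```

The point of the sketch is that disjointness is checkable from the slice bounds alone, without inspecting the code that runs inside each thread.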
ISBN:
(print) 9798350364613; 9798350364606
While parallel programming, particularly on graphics processing units (GPUs), and numerical optimization hold immense potential to tackle real-world computational challenges across disciplines, their inherent complexity and technical demands often act as daunting barriers to entry. This, unfortunately, limits accessibility and diversity within these crucial areas of computer science. To combat this challenge and ignite excitement among undergraduate learners, we developed an application-driven course that harnesses robotics as a lens to demystify the intricacies of these topics, making them tangible and engaging. Our course's prerequisites are limited to the required undergraduate introductory core curriculum, opening doors for a wider range of students. The course also features a large final-project component to connect theoretical learning to applied practice. In our first offering of the course, we attracted 27 students without prior experience in these topics and found that an overwhelming majority felt that they learned both technical and soft skills and were prepared for future study in these fields.
SYCL is an open standard for targeting heterogeneous hardware from C++. In this work, we evaluate a SYCL implementation for a discontinuous Galerkin discretization of the 2D shallow water equations targeting CPUs, gpus, and also FPGAs. The discretization uses polynomial orders zero to two on unstructured triangular meshes. Separating memory accesses from the numerical code allow us to optimize data accesses for the target architecture. A performance analysis shows good portability across x86 and ARM CPUs, gpus from different vendors, and even two variants of Intel Stratix 10 FPGAs. Measuring the energy to solution shows that gpus yield an up to 10x higher energy efficiency in terms of degrees of freedom per joule compared to CPUs. With custom designed caches, FPGAs offer a meaningful complement to the other architectures with particularly good computational performance on smaller meshes. FPGAs with High Bandwidth Memory are less affected by bandwidth issues and have similar energy efficiency as latest generation CPUs.
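The energy-to-solution metric used above is simply the total degrees of freedom updated over a run divided by the energy consumed. A minimal sketch, with made-up numbers chosen only to reproduce the 10x ratio the abstract reports:

```python
def dof_per_joule(dofs, timesteps, energy_joules):
    """Energy-efficiency metric from the abstract: total degrees of
    freedom updated over the whole run, per joule of energy consumed.
    The inputs below are illustrative, not measured values."""
    return dofs * timesteps / energy_joules

gpu = dof_per_joule(1_000_000, 1000, 50.0)   # hypothetical GPU run
cpu = dof_per_joule(1_000_000, 1000, 500.0)  # hypothetical CPU run
assert gpu / cpu == 10.0
```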
Objective: To perform the first known investigation of differences between real-time and offline B-mode and short-lag spatial coherence (SLSC) images when evaluating fluid or solid content in 60 hypoechoic breast masses. Methods: Real-time and retrospective (i.e., offline) reader studies were conducted with three board-certified breast radiologists, followed by objective, reader-independent discrimination using the generalized contrast-to-noise ratio (gCNR). Results: The content of 12 fluid, solid, and mixed (i.e., containing fluid and solid components) masses was uncertain when reading real-time B-mode images. With real-time and offline SLSC images, the content of 15 and 5 aggregated solid and mixed masses, respectively (and no fluid masses), was uncertain. Therefore, with real-time SLSC imaging, uncertainty about solid masses increased relative to offline SLSC imaging, while uncertainty about fluid masses decreased relative to real-time B-mode imaging. When assessing real-time SLSC reader results, 100% (11/11) of solid masses with uncertain content were correctly classified with a gCNR<0.73 threshold applied to real-time SLSC images. The areas under receiver operating characteristic curves characterizing gCNR as an objective metric to discriminate complicated cysts from solid masses were 0.963 and 0.998 with real-time and offline SLSC images, respectively, both of which are considered excellent for diagnostic testing.  Conclusion: Results are promising to support real-time SLSC imaging and gCNR application to real-time SLSC images to enhance sensitivity and specificity, reduce reader variability, and mitigate uncertainty about fluid or solid content, particularly when distinguishing complicated cysts (which are benign) from hypoechoic solid masses (which could be cancerous).
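The gCNR used as the discrimination metric above is commonly defined as one minus the overlap of the normalized amplitude histograms of two image regions (e.g., inside versus outside a mass), so 0 means indistinguishable distributions and 1 means perfectly separable. A minimal pure-Python sketch of that definition (the bin count and value range are illustrative assumptions):

```python
def gcnr(region_a, region_b, bins=256, lo=0.0, hi=1.0):
    """Generalized contrast-to-noise ratio: 1 minus the histogram
    overlap of two pixel-amplitude distributions."""
    def hist(vals):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in vals:
            k = min(int((v - lo) / width), bins - 1)
            counts[k] += 1
        n = len(vals)
        return [c / n for c in counts]  # normalize to a probability mass
    p, q = hist(region_a), hist(region_b)
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))

# Fully separated amplitude distributions -> gCNR of 1.0.
assert gcnr([0.1] * 50, [0.9] * 50) == 1.0
```

A decision rule of the kind the study applies would then classify a mass as solid when `gcnr(...) < 0.73`.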
Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization with stream processing applications often requires micro-batching techniques, i.e., the continuous processing of data batches to expose data parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides different techniques to address such a challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms have already been evaluated using a smaller workload with the same application. We propose two new algorithms to address the shortcomings detected in the former four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they could not meet the strictest latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating more effectiveness in highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.
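The adaptation loop the abstract describes can be sketched as a simple feedback controller that shrinks the micro-batch when observed latency overshoots the target and grows it when there is headroom. This is not one of the six algorithms from the study; the gain and clamping scheme are illustrative assumptions showing only the shape of such a controller.

```python
def adapt_batch(batch, observed_latency, target_latency,
                min_batch=1, max_batch=4096, gain=0.5):
    """One step of a hypothetical proportional controller for
    micro-batch sizing: positive error (latency headroom) grows the
    batch, negative error (latency violation) shrinks it."""
    error = (target_latency - observed_latency) / target_latency
    new_batch = int(batch * (1.0 + gain * error))
    return max(min_batch, min(max_batch, new_batch))

# Overshooting the target halves the batch; headroom grows it.
assert adapt_batch(100, observed_latency=200, target_latency=100) == 50
assert adapt_batch(100, observed_latency=50, target_latency=100) == 125
```

The difficulty the paper highlights is visible even in this toy: a fixed gain that is stable in smooth workload regions reacts too slowly to the abrupt low/high-latency transitions mentioned above.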
Reinforcement Learning (RL) is employed to develop control techniques for manipulating acoustic cavitation bubbles. This paper presents a proof of concept in which an RL agent is trained to discover a policy that allows precise control of bubble positions within a dual-frequency standing acoustic wave field by adjusting the pressure amplitude values. The agent is rewarded for driving the bubble to a target position in the shortest possible time. The results demonstrate that the agent exploits the nonlinear behaviour of the bubble and, in specific cases, identifies solutions that cannot be addressed using the linear theory of the primary Bjerknes force. The RL agent performs well under domain randomization, indicating that the RL approach generalizes effectively and produces models robust against noise, which could arise in real-world applications.
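The abstract states only that the agent is rewarded for driving the bubble to a target position in the shortest possible time; one plausible shaping of such a reward is a per-step time penalty plus a terminal bonus within a position tolerance. The constants and function below are hypothetical, not taken from the paper.

```python
def reward(position, target, dt, tolerance=1e-3, bonus=10.0):
    """Hypothetical reward for time-optimal positioning: every step
    costs dt (so faster trajectories accumulate less penalty), and
    arriving within tolerance of the target pays a terminal bonus."""
    at_target = abs(position - target) < tolerance
    return (bonus if at_target else 0.0) - dt

# Reaching the target dominates the per-step time penalty.
assert reward(0.5, 0.5, dt=0.01) > reward(0.2, 0.5, dt=0.01)
```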
The creation of maps through the application of techniques such as X-Ray Fluorescence (XRF), X-Ray Diffraction (XRD), Raman spectroscopy, multispectral analysis, ultraviolet (UV), and infrared (IR) radiation has become critical in the domain of Cultural Heritage for the analysis of materials. Limitations in data acquisition, particularly with more cost-effective and accessible devices, often restrict measurements to a sparse number of locations. This necessitates the employment of interpolation methods to estimate values at the unmeasured positions within the image. This paper introduces XMapsLab, a software solution designed to facilitate the generation and examination of maps. XMapsLab utilizes GPU-optimized versions of interpolation methods, achieving speed enhancements ranging from one to two orders of magnitude. Such improvements markedly expand the capabilities of professionals to engage with and analyze data in real-time. Additionally, the software incorporates various interpolation methods, enhancing the robustness of the results and bolstering the confidence of experts in their conclusions. The real-time functionality of XMapsLab enables the development of new procedures for exploring hypotheses regarding the presence and distribution of pigments. This includes the integration of Boolean and numerical operations for map combination, allowing users to investigate hypotheses and conditions in a direct, potent, and intuitive manner. The software developed is freely available and open-source, underscoring our commitment to supporting the broader Cultural Heritage community. (c) 2025 The Author(s). Published by Elsevier Masson SAS. This is an open access article under the CC BY license (http://***/licenses/by/4.0/)
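The abstract does not name which interpolation methods XMapsLab implements; inverse-distance weighting (IDW) is one common choice for filling a dense map from sparse point measurements, and its structure shows why such methods parallelize well on GPUs (every output pixel is an independent weighted sum). A minimal sketch:

```python
def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from sparse point
    measurements. samples: list of (sx, sy, value) tuples. An
    illustrative method, not necessarily the one XMapsLab uses."""
    num = den = 0.0
    for sx, sy, v in samples:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:
            return v  # exact hit on a measured point
        w = 1.0 / d2 ** (power / 2.0)  # closer samples weigh more
        num += w * v
        den += w
    return num / den

pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0)]
assert idw(pts, 0.5, 0.0) == 2.0  # equidistant -> simple average
```

On a GPU, the loop over output pixels (one `idw` call per pixel) maps naturally onto one thread per pixel, which is where the reported one-to-two-orders-of-magnitude speedups come from.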
ISBN:
(print) 9798400714436
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be practically supported with close to maximum (4x) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8x) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
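Why the speedup fades with batch size can be captured by a toy roofline-style model (not MARLIN's actual analysis): a quantized matrix-vector layer stays memory-bound, and thus enjoys the full 16/4 = 4x bandwidth saving, only while arithmetic intensity (proportional to batch size) remains below the hardware's compute/bandwidth crossover. The crossover value below is an illustrative assumption.

```python
def quant_speedup(batch, bits=4, full_bits=16, crossover=64):
    """Toy model of quantized-kernel speedup vs. batch size. Below
    crossover/max_gain the kernel is memory-bound even at reduced
    precision (full gain); above crossover it is compute-bound
    (no gain); in between the gain decays as crossover/batch."""
    max_gain = full_bits / bits
    if batch <= crossover / max_gain:
        return max_gain  # memory-bound at b bits: full 4x
    if batch >= crossover:
        return 1.0       # compute-bound: quantization no longer helps
    return crossover / batch

# Consistent with the abstract's shape: ~4x up to batch 16,
# gradually decreasing gains for larger batches.
assert quant_speedup(16) == 4.0
assert quant_speedup(128) == 1.0
```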
ISBN:
(print) 9781450372367
Currently, more than 25% of supercomputers employ GPUs due to their massively parallel and power-efficient architectures. However, programming GPUs efficiently in a large-scale system is a demanding task not only for computational scientists but also for programming experts, as multi-GPU programming requires managing distinct address spaces, generating GPU-specific code, and handling inter-device communication. To ease the programming effort, we propose a tiling-based high-level GPU programming model for structured grid problems. The model abstracts data decomposition, memory management, and generation of GPU-specific code, and hides all types of data transfer overheads. We demonstrate the effectiveness of the programming model on a heat simulation and a real-life cardiac model on a single GPU, on a single node with multiple GPUs, and on multiple nodes with multiple GPUs. We also present performance comparisons under different hardware and software configurations. The results show that the programming model successfully overlaps communication and provides good speedup on 192 GPUs.
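The tiling the model automates can be sketched on a 1-D grid: each tile computes its interior from a copy that includes one halo cell on each side, so tiles are independent and their halo transfers (the part the paper's model overlaps with computation across GPUs) are explicit. The decomposition below is an illustrative serial sketch, not the paper's implementation.

```python
def jacobi_step_tiled(u, num_tiles):
    """One Jacobi relaxation step on a 1-D grid, computed tile by tile
    with one-cell halos; boundary values are held fixed. Splitting into
    more tiles must not change the result -- that is the correctness
    condition a tiling-based model maintains for the user."""
    n = len(u)
    out = u[:]
    tile = (n - 2 + num_tiles - 1) // num_tiles  # interior cells per tile
    for t in range(num_tiles):
        lo = 1 + t * tile
        hi = min(lo + tile, n - 1)
        halo = u[lo - 1:hi + 1]  # tile plus one halo cell on each side
        for i in range(lo, hi):
            out[i] = 0.5 * (halo[i - lo] + halo[i - lo + 2])
    return out

u = [0.0, 0.0, 4.0, 0.0, 0.0]
# Tiling is invisible to the numerics: 1 tile and 2 tiles agree.
assert jacobi_step_tiled(u, 2) == jacobi_step_tiled(u, 1)
```

In the multi-GPU setting, each tile lives on a different device and the halo copy becomes an inter-device transfer, which is exactly the overhead the proposed model hides by overlapping it with the interior computation.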
ISBN:
(print) 9781728180991
Heterogeneous processing platforms, combining CPUs, GPUs, and programmable logic in various architectures, are continuously evolving, providing higher theoretical levels of computing performance at each generation. However, how to efficiently specify and explore the design space of applications executing on the different components of heterogeneous platforms remains an open problem and is the subject of many research efforts. This paper describes a dataflow-based approach for the synthesis of applications to be executed on mixed CPU and GPU architectures. The new high-level approach consists of partitioning the application dataflow program, written in RVC-CAL, into CPU and GPU components, and then generating by automatic synthesis the C++ and CUDA programs that together implement the application executable. The design approach provides portability of applications on CPUs, GPUs, and mixed CPU/GPU architectures, as well as the possibility of exploring the design space of all partitioning options without the need to rewrite the application code. The paper describes the essential methodology features at the base of the synthesis of CPU/GPU code and reports some example design cases validating the correctness of the approach.