Graphics Processing Units (GPUs) offer tremendous computational power by following a throughput-oriented paradigm in which many thousands of computational units operate in parallel. Programming such massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU programming languages, such as CUDA and OpenCL, are based on C/C++ and inherit their fundamentally unsafe ways of accessing memory via raw pointers. This facilitates easy-to-make but hard-to-detect bugs such as data races and deadlocks. In this paper, we present Descend: a safe GPU programming language. In contrast to prior safe high-level GPU programming approaches, Descend is an imperative GPU systems programming language in the spirit of Rust, enforcing safe CPU and GPU memory management in the type system by tracking ownership and lifetimes. Descend introduces a new holistic GPU programming model in which computations are hierarchically scheduled over the GPU's execution resources: grid, blocks, warps, and threads. Descend's extended borrow checking ensures that execution resources safely access memory regions without data races. For this, we introduce views, which describe safe parallel access patterns over memory regions, as well as atomic variables. For memory accesses that cannot be checked by our type system, users can annotate limited code sections as unsafe. We discuss the memory safety guarantees offered by Descend and evaluate our implementation on multiple benchmarks, demonstrating that Descend can express real-world GPU programs with performance competitive with manually written CUDA programs that lack Descend's safety guarantees.
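The abstract does not show Descend's view syntax, but the core idea behind views can be sketched in plain Python: partition an index range into disjoint per-thread slices, so that a thread writing only through its own slice cannot race with another. The function name `block_view` and the partitioning scheme are illustrative assumptions, not Descend's actual API.

```python
def block_view(buffer_len, num_threads):
    """Partition the index range [0, buffer_len) into disjoint,
    contiguous per-thread slices. If every thread writes only through
    its own slice, no two threads can write the same element, which is
    the safety property a borrow checker can verify statically."""
    base, rem = divmod(buffer_len, num_threads)
    views, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < rem else 0)  # spread the remainder
        views.append(range(start, start + size))
        start += size
    return views

views = block_view(10, 4)
# Disjoint and covering: every index appears in exactly one view.
assert sorted(i for v in views for i in v) == list(range(10))
```

The point of the sketch is that disjointness is checkable from the slice bounds alone, without inspecting the code that runs inside each thread.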
ISBN:
(print) 9798350364613; 9798350364606
While parallel programming, particularly on graphics processing units (GPUs), and numerical optimization hold immense potential to tackle real-world computational challenges across disciplines, their inherent complexity and technical demands often act as daunting barriers to entry. This, unfortunately, limits accessibility and diversity within these crucial areas of computer science. To combat this challenge and ignite excitement among undergraduate learners, we developed an application-driven course that harnesses robotics as a lens to demystify the intricacies of these topics, making them tangible and engaging. Our course's prerequisites are limited to the required undergraduate introductory core curriculum, opening doors for a wider range of students. The course also features a large final-project component to connect theoretical learning to applied practice. In our first offering of the course, we attracted 27 students without prior experience in these topics and found that an overwhelming majority felt that they learned both technical and soft skills and were prepared for future study in these fields.
SYCL is an open standard for targeting heterogeneous hardware from C++. In this work, we evaluate a SYCL implementation for a discontinuous Galerkin discretization of the 2D shallow water equations targeting CPUs, gpus, and also FPGAs. The discretization uses polynomial orders zero to two on unstructured triangular meshes. Separating memory accesses from the numerical code allow us to optimize data accesses for the target architecture. A performance analysis shows good portability across x86 and ARM CPUs, gpus from different vendors, and even two variants of Intel Stratix 10 FPGAs. Measuring the energy to solution shows that gpus yield an up to 10x higher energy efficiency in terms of degrees of freedom per joule compared to CPUs. With custom designed caches, FPGAs offer a meaningful complement to the other architectures with particularly good computational performance on smaller meshes. FPGAs with High Bandwidth Memory are less affected by bandwidth issues and have similar energy efficiency as latest generation CPUs.
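The energy-to-solution metric used above is simply the total degrees of freedom updated over a run divided by the energy consumed. A minimal sketch, with made-up numbers chosen only to reproduce the 10x ratio the abstract reports:

```python
def dof_per_joule(dofs, timesteps, energy_joules):
    """Energy-efficiency metric from the abstract: total degrees of
    freedom updated over the whole run, per joule of energy consumed.
    The inputs below are illustrative, not measured values."""
    return dofs * timesteps / energy_joules

gpu = dof_per_joule(1_000_000, 1000, 50.0)   # hypothetical GPU run
cpu = dof_per_joule(1_000_000, 1000, 500.0)  # hypothetical CPU run
assert gpu / cpu == 10.0
```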
Objective: To perform the first known investigation of differences between real-time and offline B-mode and short-lag spatial coherence (SLSC) images when evaluating fluid or solid content in 60 hypoechoic breast masses. Methods: Real-time and retrospective (i.e., offline) reader studies were conducted with three board-certified breast radiologists, followed by objective, reader-independent discrimination using the generalized contrast-to-noise ratio (gCNR). Results: The content of 12 fluid, solid, and mixed (i.e., containing fluid and solid components) masses was uncertain when reading real-time B-mode images. With real-time and offline SLSC images, the content of 15 and 5 aggregated solid and mixed masses, respectively (and no fluid masses), was uncertain. Therefore, with real-time SLSC imaging, uncertainty about solid masses increased relative to offline SLSC imaging, while uncertainty about fluid masses decreased relative to real-time B-mode imaging. When assessing real-time SLSC reader results, 100% (11/11) of solid masses with uncertain content were correctly classified with a gCNR<0.73 threshold applied to real-time SLSC images. The areas under receiver operating characteristic curves characterizing gCNR as an objective metric to discriminate complicated cysts from solid masses were 0.963 and 0.998 with real-time and offline SLSC images, respectively, both of which are considered excellent for diagnostic testing.  Conclusion: Results are promising to support real-time SLSC imaging and gCNR application to real-time SLSC images to enhance sensitivity and specificity, reduce reader variability, and mitigate uncertainty about fluid or solid content, particularly when distinguishing complicated cysts (which are benign) from hypoechoic solid masses (which could be cancerous).
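The gCNR used as the discrimination metric above is commonly defined as one minus the overlap of the normalized amplitude histograms of two image regions (e.g., inside versus outside a mass), so 0 means indistinguishable distributions and 1 means perfectly separable. A minimal pure-Python sketch of that definition (the bin count and value range are illustrative assumptions):

```python
def gcnr(region_a, region_b, bins=256, lo=0.0, hi=1.0):
    """Generalized contrast-to-noise ratio: 1 minus the histogram
    overlap of two pixel-amplitude distributions."""
    def hist(vals):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in vals:
            k = min(int((v - lo) / width), bins - 1)
            counts[k] += 1
        n = len(vals)
        return [c / n for c in counts]  # normalize to a probability mass
    p, q = hist(region_a), hist(region_b)
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))

# Fully separated amplitude distributions -> gCNR of 1.0.
assert gcnr([0.1] * 50, [0.9] * 50) == 1.0
```

A decision rule of the kind the study applies would then classify a mass as solid when `gcnr(...) < 0.73`.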
Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization with stream processing applications often requires micro-batching techniques, i.e., the continuous processing of data batches to expose data parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides different techniques to address such a challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms have already been evaluated using a smaller workload with the same application. We propose two new algorithms to address the shortcomings detected in the former four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they could not meet the strictest latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating more effectiveness in highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.
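The adaptation loop the abstract describes can be sketched as a simple feedback controller that shrinks the micro-batch when observed latency overshoots the target and grows it when there is headroom. This is not one of the six algorithms from the study; the gain and clamping scheme are illustrative assumptions showing only the shape of such a controller.

```python
def adapt_batch(batch, observed_latency, target_latency,
                min_batch=1, max_batch=4096, gain=0.5):
    """One step of a hypothetical proportional controller for
    micro-batch sizing: positive error (latency headroom) grows the
    batch, negative error (latency violation) shrinks it."""
    error = (target_latency - observed_latency) / target_latency
    new_batch = int(batch * (1.0 + gain * error))
    return max(min_batch, min(max_batch, new_batch))

# Overshooting the target halves the batch; headroom grows it.
assert adapt_batch(100, observed_latency=200, target_latency=100) == 50
assert adapt_batch(100, observed_latency=50, target_latency=100) == 125
```

The difficulty the paper highlights is visible even in this toy: a fixed gain that is stable in smooth workload regions reacts too slowly to the abrupt low/high-latency transitions mentioned above.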
Reinforcement Learning (RL) is employed to develop control techniques for manipulating acoustic cavitation bubbles. This paper presents a proof of concept in which an RL agent is trained to discover a policy that allows precise control of bubble positions within a dual-frequency standing acoustic wave field by adjusting the pressure amplitude values. The agent is rewarded for driving the bubble to a target position in the shortest possible time. The results demonstrate that the agent exploits the nonlinear behaviour of the bubble and, in specific cases, identifies solutions that cannot be addressed using the linear theory of the primary Bjerknes force. The RL agent performs well under domain randomization, indicating that the RL approach generalizes effectively and produces models robust against noise, which could arise in real-world applications.
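The abstract states only that the agent is rewarded for driving the bubble to a target position in the shortest possible time; one plausible shaping of such a reward is a per-step time penalty plus a terminal bonus within a position tolerance. The constants and function below are hypothetical, not taken from the paper.

```python
def reward(position, target, dt, tolerance=1e-3, bonus=10.0):
    """Hypothetical reward for time-optimal positioning: every step
    costs dt (so faster trajectories accumulate less penalty), and
    arriving within tolerance of the target pays a terminal bonus."""
    at_target = abs(position - target) < tolerance
    return (bonus if at_target else 0.0) - dt

# Reaching the target dominates the per-step time penalty.
assert reward(0.5, 0.5, dt=0.01) > reward(0.2, 0.5, dt=0.01)
```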
The creation of maps through the application of techniques such as X-Ray Fluorescence (XRF), X-Ray Diffraction (XRD), Raman spectroscopy, multispectral analysis, ultraviolet (UV), and infrared (IR) radiation has become critical in the domain of Cultural Heritage for the analysis of materials. Limitations in data acquisition, particularly with more cost-effective and accessible devices, often restrict measurements to a sparse number of locations. This necessitates the employment of interpolation methods to estimate values at the unmeasured positions within the image. This paper introduces XMapsLab, a software solution designed to facilitate the generation and examination of maps. XMapsLab utilizes GPU-optimized versions of interpolation methods, achieving speed enhancements ranging from one to two orders of magnitude. Such improvements markedly expand the capabilities of professionals to engage with and analyze data in real-time. Additionally, the software incorporates various interpolation methods, enhancing the robustness of the results and bolstering the confidence of experts in their conclusions. The real-time functionality of XMapsLab enables the development of new procedures for exploring hypotheses regarding the presence and distribution of pigments. This includes the integration of Boolean and numerical operations for map combination, allowing users to investigate hypotheses and conditions in a direct, potent, and intuitive manner. The software developed is freely available and open-source, underscoring our commitment to supporting the broader Cultural Heritage community. (c) 2025 The Author(s). Published by Elsevier Masson SAS. This is an open access article under the CC BY license (http://***/licenses/by/4.0/)
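The abstract does not name which interpolation methods XMapsLab implements; inverse-distance weighting (IDW) is one common choice for filling a dense map from sparse point measurements, and its structure shows why such methods parallelize well on GPUs (every output pixel is an independent weighted sum). A minimal sketch:

```python
def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from sparse point
    measurements. samples: list of (sx, sy, value) tuples. An
    illustrative method, not necessarily the one XMapsLab uses."""
    num = den = 0.0
    for sx, sy, v in samples:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:
            return v  # exact hit on a measured point
        w = 1.0 / d2 ** (power / 2.0)  # closer samples weigh more
        num += w * v
        den += w
    return num / den

pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0)]
assert idw(pts, 0.5, 0.0) == 2.0  # equidistant -> simple average
```

On a GPU, the loop over output pixels (one `idw` call per pixel) maps naturally onto one thread per pixel, which is where the reported one-to-two-orders-of-magnitude speedups come from.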
ISBN:
(print) 9798400714436
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be practically supported with close to maximum (4x) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8x) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
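Why the speedup fades with batch size can be captured by a toy roofline-style model (not MARLIN's actual analysis): a quantized matrix-vector layer stays memory-bound, and thus enjoys the full 16/4 = 4x bandwidth saving, only while arithmetic intensity (proportional to batch size) remains below the hardware's compute/bandwidth crossover. The crossover value below is an illustrative assumption.

```python
def quant_speedup(batch, bits=4, full_bits=16, crossover=64):
    """Toy model of quantized-kernel speedup vs. batch size. Below
    crossover/max_gain the kernel is memory-bound even at reduced
    precision (full gain); above crossover it is compute-bound
    (no gain); in between the gain decays as crossover/batch."""
    max_gain = full_bits / bits
    if batch <= crossover / max_gain:
        return max_gain  # memory-bound at b bits: full 4x
    if batch >= crossover:
        return 1.0       # compute-bound: quantization no longer helps
    return crossover / batch

# Consistent with the abstract's shape: ~4x up to batch 16,
# gradually decreasing gains for larger batches.
assert quant_speedup(16) == 4.0
assert quant_speedup(128) == 1.0
```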
ISBN:
(print) 9781450372367
Currently, more than 25% of supercomputers employ GPUs due to their massively parallel and power-efficient architectures. However, programming GPUs efficiently in a large-scale system is a demanding task not only for computational scientists but also for programming experts, as multi-GPU programming requires managing distinct address spaces, generating GPU-specific code, and handling inter-device communication. To ease the programming effort, we propose a tiling-based high-level GPU programming model for structured grid problems. The model abstracts data decomposition, memory management, and generation of GPU-specific code, and hides all types of data transfer overheads. We demonstrate the effectiveness of the programming model on a heat simulation and a real-life cardiac model on a single GPU, on a single node with multiple GPUs, and on multiple nodes with multiple GPUs. We also present performance comparisons under different hardware and software configurations. The results show that the programming model successfully overlaps communication and provides good speedup on 192 GPUs.
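The tiling the model automates can be sketched on a 1-D grid: each tile computes its interior from a copy that includes one halo cell on each side, so tiles are independent and their halo transfers (the part the paper's model overlaps with computation across GPUs) are explicit. The decomposition below is an illustrative serial sketch, not the paper's implementation.

```python
def jacobi_step_tiled(u, num_tiles):
    """One Jacobi relaxation step on a 1-D grid, computed tile by tile
    with one-cell halos; boundary values are held fixed. Splitting into
    more tiles must not change the result -- that is the correctness
    condition a tiling-based model maintains for the user."""
    n = len(u)
    out = u[:]
    tile = (n - 2 + num_tiles - 1) // num_tiles  # interior cells per tile
    for t in range(num_tiles):
        lo = 1 + t * tile
        hi = min(lo + tile, n - 1)
        halo = u[lo - 1:hi + 1]  # tile plus one halo cell on each side
        for i in range(lo, hi):
            out[i] = 0.5 * (halo[i - lo] + halo[i - lo + 2])
    return out

u = [0.0, 0.0, 4.0, 0.0, 0.0]
# Tiling is invisible to the numerics: 1 tile and 2 tiles agree.
assert jacobi_step_tiled(u, 2) == jacobi_step_tiled(u, 1)
```

In the multi-GPU setting, each tile lives on a different device and the halo copy becomes an inter-device transfer, which is exactly the overhead the proposed model hides by overlapping it with the interior computation.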
ISBN:
(print) 9781728180991
Heterogeneous processing platforms, combining CPUs, GPUs, and programmable logic in various architectures, are continuously evolving, providing higher theoretical levels of computing performance at each generation. However, how to efficiently specify and explore the design space of applications executing on the different components of heterogeneous platforms remains an open problem and is the subject of many research efforts. This paper describes a dataflow-based approach for the synthesis of applications to be executed on mixed CPU and GPU architectures. The new high-level approach consists of partitioning the application dataflow program, written in RVC-CAL, into CPU and GPU components, and then generating by automatic synthesis the C++ and CUDA programs that together implement the application executable. The design approach provides portability of applications on CPUs, GPUs, and mixed CPU/GPU architectures, as well as the possibility of exploring the design space of all partitioning options without the need to rewrite the application code. The paper describes the essential methodology features at the base of the synthesis of CPU/GPU code and reports some example design cases validating the correctness of the approach.