This paper presents the Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs and focused on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. Key to BACQ's design is its replacement of conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and allows the head and tail to grow monotonically without bound over its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue's state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking shows that BACQ outperforms the Broker Queue Work Distributor (BWD), the fastest known GPU queue, by more than 2x while preserving FIFO semantics. The paper demonstrates BACQ's superior performance through real-world empirical evaluations.
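To make the warp-level coordination concrete, here is a minimal CUDA sketch of the general technique the abstract describes (leader election, a single ticket fetch per warp, and shuffle-based offset broadcast). It is not BACQ's actual implementation; names such as ring, tail, and CAPACITY are illustrative assumptions.

#include <cuda_runtime.h>
#include <cstdint>

constexpr uint32_t CAPACITY = 1u << 20;      // ring-buffer size (illustrative, power of two)

__device__ uint32_t ring[CAPACITY];          // hypothetical ring buffer
__device__ unsigned long long tail;          // monotonically growing ticket counter

// Warp-cooperative enqueue: one atomic per warp instead of one per thread.
__device__ void warp_enqueue(uint32_t value) {
    unsigned mask = __activemask();          // lanes participating in this call
    int leader = __ffs(mask) - 1;            // lowest active lane acts as leader
    int lane = threadIdx.x & 31;

    unsigned long long base = 0;
    if (lane == leader)                      // leader reserves one slot per active lane
        base = atomicAdd(&tail, (unsigned long long)__popc(mask));
    base = __shfl_sync(mask, base, leader);  // broadcast the base ticket to the warp

    // Each lane derives a private slot from the shared base ticket.
    unsigned rank = __popc(mask & ((1u << lane) - 1));
    ring[(base + rank) % CAPACITY] = value;  // real designs also need slot-ready
                                             // flags and full-queue back-pressure
}

__global__ void producer(int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) warp_enqueue(id);
}

int main() {
    producer<<<4, 128>>>(512);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

The single atomicAdd is exactly the replacement the abstract mentions: up to 32 conflicting accesses to the shared tail collapse into one, with the remaining per-lane work shifted to private rank computation.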
ISBN: (print) 9781450328937
We describe a successful addition of high performance computing (HPC) to a traditional computer science curriculum at a liberal arts university. The approach incorporated a three-semester sequence of courses emphasizing parallel programming techniques, with the final course focusing on a research-level mathematical project executed on a TOP500 supercomputer. A group of students with varied programming backgrounds participated in the program. Emphasis was placed on the Open MPI and CUDA libraries, along with parallel algorithm and file I/O analysis.
Introduction: Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel General-Purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support and an incomplete set of mathematical functions. To address these limitations, GPU manufacturers have developed specialized programming models, such as CUDA and HIP, which enable developers to leverage the full computational power of modern GPUs. This article introduces an open-source tool designed to bridge the gap between Unity and CUDA, allowing developers to integrate CUDA's capabilities within Unity-based applications. The proposed solution establishes an interoperability framework that facilitates communication between Unity and CUDA. The tool is designed to efficiently transfer data, execute CUDA kernels, and retrieve results, ensuring seamless integration into Unity's rendering and computation pipeline. The tool extends Unity's capabilities by enabling CUDA-based computations, overcoming the inherent limitations of Unity's compute shader language. This integration allows developers to exploit multi-GPU architectures, leverage advanced mathematical functions, and enhance computational performance for real-time applications.
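As a sketch of what the native side of such an interoperability layer can look like, the following CUDA code exports a C-linkage entry point that Unity could bind via P/Invoke ([DllImport]). The function name scale_buffer and the kernel are illustrative assumptions, not the tool's actual API.

#include <cuda_runtime.h>

// Hypothetical kernel: scales a float buffer in place.
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// C linkage so the C# side can bind it, e.g.
// [DllImport("unity_cuda_plugin")] static extern int scale_buffer(...);
extern "C" int scale_buffer(const float* host_in, float* host_out, int n, float factor) {
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;

    // Transfer data in, execute the kernel, retrieve results: the
    // transfer/execute/retrieve flow described above.
    cudaMemcpy(dev, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256;
    scale_kernel<<<(n + threads - 1) / threads, threads>>>(dev, n, factor);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(host_out, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}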
ISBN: (print) 9788897999324
Through an environment of parallel objects, combining structured parallel programming with the object-orientation paradigm, this article presents a programming methodology based on High Level Parallel Compositions (HLPCs). Applying the method, commonly used inter-process communication patterns are parallelized, initially through the HLPCs Farm, Pipe, and TreeDV, which represent, respectively, the Farm, Pipeline, and Binary Tree communication patterns, the last of which is used within a parallel version of the design technique known as Divide and Conquer.
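As an illustration of the kind of pattern the Farm HLPC encapsulates, here is a minimal host-side C++ sketch (assumed for exposition; it is not the paper's parallel-object environment): a farm hands independent work items from a shared counter to a pool of workers.

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int n_items = 1000, n_workers = 4;
    std::vector<int> results(n_items);
    std::atomic<int> next{0};                    // shared work counter

    // Farm pattern: each worker repeatedly claims the next unprocessed item.
    auto worker = [&] {
        for (int i; (i = next.fetch_add(1)) < n_items; )
            results[i] = i * i;                  // stand-in for real work
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < n_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    std::printf("results[999] = %d\n", results[999]);
    return 0;
}

Pipe and TreeDV follow the same idea with different process topologies: a linear chain of stages for Pipeline, and a binary tree of split/merge processes for Divide and Conquer.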
Network availability is an essential feature of an optical telecommunication network. Should a failure of a network component occur, be it a link or a component inside a node, the network control plane must be able to det...
Deterministic-by-construction parallel programming models offer the advantages of parallel speedup while avoiding the nondeterministic, hard-to-reproduce bugs that plague fully concurrent code. A principled approach to deterministic-by-construction parallel programming with shared state is offered by LVars: shared memory locations whose semantics are defined in terms of an application-specific lattice. Writes to an LVar take the least upper bound of the old and new values with respect to the lattice, while reads from an LVar can observe only that its contents have crossed a specified threshold in the lattice. Although it guarantees determinism, this interface is quite limited. We extend LVars in two ways. First, we add the ability to freeze and then read the contents of an LVar directly. Second, we add the ability to attach event handlers to an LVar, triggering a callback when the LVar's value changes. Together, handlers and freezing enable an expressive and useful style of parallel programming. We prove that in a language where communication takes place through these extended LVars, programs are at worst quasi-deterministic: on every run, they either produce the same answer or raise an error. We demonstrate the viability of our approach by implementing a library for Haskell supporting a variety of LVar-based data structures, together with a case study that illustrates the programming model and yields promising parallel speedup.
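A minimal C++ sketch of the LVar idea under one simple lattice (natural numbers ordered by <=, so the least upper bound is max) may help; this is an illustration of the semantics, not the paper's Haskell library. Writes take the lub of old and new values, threshold reads block until the value crosses a bound, and freezing makes the exact contents readable while turning later lub-increasing writes into errors.

#include <atomic>
#include <stdexcept>
#include <thread>
#include <cstdio>

// LVar sketch over the max lattice on unsigned ints: lub(a, b) = max(a, b).
class MaxLVar {
    std::atomic<unsigned> value{0};
    std::atomic<bool> frozen{false};
public:
    // put: monotonic write taking the lub of old and new contents.
    void put(unsigned v) {
        if (frozen.load() && v > value.load())
            throw std::runtime_error("put after freeze");  // the quasi-determinism error:
                                                           // whether it fires can depend
                                                           // on scheduling
        unsigned cur = value.load();
        while (cur < v && !value.compare_exchange_weak(cur, v)) {}
    }
    // Threshold read: blocks until contents cross t; reveals only that fact.
    unsigned get(unsigned t) {
        while (value.load() < t) std::this_thread::yield();
        return t;
    }
    // Freeze, then read the exact contents (the paper's first extension).
    unsigned freeze_and_read() { frozen.store(true); return value.load(); }
};

int main() {
    MaxLVar x;
    std::thread writer([&] { x.put(3); x.put(7); });
    unsigned seen = x.get(5);          // waits until the value is at least 5
    writer.join();
    std::printf("crossed %u, final %u\n", seen, x.freeze_and_read());
    return 0;
}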
ISBN: (print) 9781450332088
Recent advances in hardware architectures, particularly multicore and manycore systems, implicitly require programmers to write concurrent programs. However, writing correct and efficient concurrent programs is challenging. We envision a system in which concurrent programs can be self-adaptive when executing on different hardware. We have developed two tuning policies that enable users' programs to adjust their level of concurrency at compile-time and at run-time, respectively.
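A minimal C++ sketch of what a run-time tuning policy of this kind might look like (an assumption for illustration; the paper does not publish this code): a controller measures throughput at the current concurrency level and hill-climbs toward the best thread count.

#include <cstdio>

// Hypothetical hill-climbing run-time tuner: try a new concurrency level,
// keep moving in the same direction while throughput improves, else reverse.
struct ConcurrencyTuner {
    int level = 2, step = 1;
    double best = 0.0;
    int next(double throughput) {
        if (throughput < best) step = -step;   // got worse: search the other way
        else best = throughput;                // improved: remember and continue
        level += step;
        if (level < 1) level = 1;
        return level;
    }
};

int main() {
    ConcurrencyTuner tuner;
    // Stand-in workload: a synthetic throughput curve with a single peak.
    auto measure = [](int t) { return 10.0 * t - 3.0 * (t - 4) * (t - 4); };
    int threads = 2;
    for (int epoch = 0; epoch < 8; ++epoch) {
        double tput = measure(threads);
        std::printf("epoch %d: %d threads, throughput %.1f\n", epoch, threads, tput);
        threads = tuner.next(tput);
    }
    return 0;
}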
Dynamic task scheduling is an enticing programming model aiming to ease the development of parallel programs with intrinsically irregular or data-dependent parallelism. The performance of such solutions relies on the ability of the task-scheduling HW/SW stack to efficiently evaluate dependencies at runtime and schedule work to available cores. Traditional SW-only systems incur scheduling overheads of around 30K processor cycles per task, which severely limits the (core count, task granularity) combinations they can adequately handle. Previous work on HW-accelerated task scheduling has shown that such systems can support high-performance scheduling on processors with up to eight cores, but questions remained about the viability of such solutions at the greater core counts now frequently found in high-end SMP systems. This work presents an FPGA-proven, tightly integrated, Linux-capable, 30-core RISC-V system with hardware-accelerated task scheduling. We use this implementation to show that HW task scheduling can still offer competitive performance at such a high core count, and describe hardware and software optimizations that make this organization even more scalable than previous solutions. Finally, we outline ways in which this architecture could be augmented to overcome inter-core communication bottlenecks, mitigating the cache-degradation effects usually involved in parallelizing highly optimized serial code.
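For readers unfamiliar with the programming model being accelerated, here is a minimal C++ example of dynamic task scheduling with runtime-evaluated dependences, written with standard OpenMP task depend clauses (the paper's system uses its own HW/SW stack; this is only an illustration of the model):

#include <cstdio>

// Three tasks with a data-dependence chain: the runtime (or, in the paper,
// a hardware accelerator) discovers that t3 must wait for t1 and t2, while
// t1 and t2 may run concurrently.
int main() {
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)      // t1: produces a
        a = 1;
        #pragma omp task depend(out: b)      // t2: produces b, independent of t1
        b = 2;
        #pragma omp task depend(in: a, b)    // t3: consumes both, so it runs last
        std::printf("a + b = %d\n", a + b);
    }
    return 0;
}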
Mixed Integer Linear Programming (MILP) is used in behavioral synthesis as a mathematical model for designing efficient hardware. However, solving large MILP models poses significant computational challenges due to their NP-hard nature. Parallelization can tackle this challenge by amortizing the execution time, yet unbalanced loads can hinder its effectiveness. In this paper, we address the load-balance issue of parallel Branch and Bound (B&B) algorithms, particularly sub-tree parallelism, which is efficient at solving MILP models derived from behavioral synthesis. The proposed algorithm strategically partitions the original problem into sub-problems by selecting the decision variables that appear in the highest number of constraints, prioritizing load balance and enhancing solver performance. We evaluate the effectiveness of our method using MILP models derived from Mediabench data flow graphs of various sizes. The experimental results indicate that the proposed algorithm achieves speedups ranging from approximately 1x to 13x, highlighting its efficacy in improving the scalability and efficiency of MILP solving for behavioral synthesis.
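A minimal C++ sketch of the variable-selection heuristic described above (an illustration under assumed data structures, not the paper's code): count how many constraints each decision variable appears in, then branch on the most frequent one so the resulting sub-trees are better balanced.

#include <cstdio>
#include <vector>

// A constraint is represented here simply by the indices of the variables
// it involves (illustrative encoding).
using Constraint = std::vector<int>;

// Pick the decision variable appearing in the most constraints.
int pick_branching_var(const std::vector<Constraint>& cons, int n_vars) {
    std::vector<int> count(n_vars, 0);
    for (const auto& c : cons)
        for (int v : c) ++count[v];
    int best = 0;
    for (int v = 1; v < n_vars; ++v)
        if (count[v] > count[best]) best = v;
    return best;
}

int main() {
    // Variable 2 appears in three constraints, so it is chosen for branching;
    // fixing it to 0 in one sub-problem and 1 in the other yields the
    // sub-trees that workers then solve in parallel.
    std::vector<Constraint> cons = {{0, 2}, {1, 2}, {2, 3}, {0, 1}};
    std::printf("branch on x%d\n", pick_branching_var(cons, 4));
    return 0;
}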
Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization in stream processing applications often requires micro-batching, i.e., the continuous processing of data batches to expose data-parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides several techniques to address this challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms had already been evaluated using a smaller workload with the same application; we propose two new algorithms to address the shortcomings detected in those four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they failed to meet the strictest latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating greater effectiveness on highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.
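A minimal C++ sketch of the feedback idea underlying such self-adaptive micro-batching (an illustration only; the six evaluated algorithms are more elaborate): after each batch completes, compare the observed latency to the target and grow or shrink the next micro-batch size accordingly.

#include <algorithm>
#include <cstdio>

// Simple batch-size controller: shrink fast when latency is violated,
// grow gradually when there is headroom.
struct BatchSizer {
    int size = 64;                            // current micro-batch size
    int adapt(double observed_ms, double target_ms) {
        if (observed_ms > target_ms)
            size = std::max(1, size / 2);     // too slow: halve to cut latency
        else
            size = std::min(4096, size + size / 4);  // headroom: grow for throughput
        return size;
    }
};

int main() {
    BatchSizer sizer;
    // Stand-in latency model: processing time grows with batch size.
    auto latency_ms = [](int batch) { return 0.05 * batch; };
    const double target = 10.0;               // 10 ms latency requirement
    for (int i = 0; i < 6; ++i) {
        double obs = latency_ms(sizer.size);
        std::printf("batch %d -> %.1f ms (target %.1f)\n", sizer.size, obs, target);
        sizer.adapt(obs, target);
    }
    return 0;
}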