This paper presents the Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs and focused on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. Key to BACQ's design is its replacement of conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and allows the head and tail to grow monotonically without bound over its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue's state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking shows that BACQ outperforms the Broker Queue Work Distributor (BWD), the fastest known GPU queue, by more than 2x while preserving FIFO semantics. The paper demonstrates BACQ's superior performance through real-world empirical evaluations.
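To make the warp-level coordination concrete, here is a minimal CUDA sketch of the general technique the abstract describes (leader election, a single ticket fetch per warp, and shuffle-based offset broadcast). It is not BACQ's actual implementation; names such as ring, tail, and CAPACITY are illustrative assumptions.

#include <cuda_runtime.h>
#include <cstdint>

constexpr uint32_t CAPACITY = 1u << 20;      // ring-buffer size (illustrative, power of two)

__device__ uint32_t ring[CAPACITY];          // hypothetical ring buffer
__device__ unsigned long long tail;          // monotonically growing ticket counter

// Warp-cooperative enqueue: one atomic per warp instead of one per thread.
__device__ void warp_enqueue(uint32_t value) {
    unsigned mask = __activemask();          // lanes participating in this call
    int leader = __ffs(mask) - 1;            // lowest active lane acts as leader
    int lane = threadIdx.x & 31;

    unsigned long long base = 0;
    if (lane == leader)                      // leader reserves one slot per active lane
        base = atomicAdd(&tail, (unsigned long long)__popc(mask));
    base = __shfl_sync(mask, base, leader);  // broadcast the base ticket to the warp

    // Each lane derives a private slot from the shared base ticket.
    unsigned rank = __popc(mask & ((1u << lane) - 1));
    ring[(base + rank) % CAPACITY] = value;  // real designs also need slot-ready
                                             // flags and full-queue back-pressure
}

__global__ void producer(int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) warp_enqueue(id);
}

int main() {
    producer<<<4, 128>>>(512);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

The single atomicAdd is exactly the replacement the abstract mentions: up to 32 conflicting accesses to the shared tail collapse into one, with the remaining per-lane work shifted to private rank computation.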
ISBN: (print) 9781450328937
We describe a successful addition of high performance computing (HPC) to a traditional computer science curriculum at a liberal arts university. The approach incorporated a three-semester sequence of courses emphasizing parallel programming techniques, with the final course focusing on a research-level mathematical project executed on a TOP500 supercomputer. A group of students with varied programming backgrounds participated in the program. Emphasis was placed on the Open MPI and CUDA libraries, along with parallel algorithm and file I/O analysis.
Introduction: Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel General-Purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support and an incomplete set of mathematical functions. To address these limitations, GPU manufacturers have developed specialized programming models, such as CUDA and HIP, which enable developers to leverage the full computational power of modern GPUs. This article introduces an open-source tool designed to bridge the gap between Unity and CUDA, allowing developers to integrate CUDA's capabilities within Unity-based applications. The proposed solution establishes an interoperability framework that facilitates communication between Unity and CUDA. The tool is designed to efficiently transfer data, execute CUDA kernels, and retrieve results, ensuring seamless integration into Unity's rendering and computation pipeline. The tool extends Unity's capabilities by enabling CUDA-based computations, overcoming the inherent limitations of Unity's compute shader language. This integration allows developers to exploit multi-GPU architectures, leverage advanced mathematical functions, and enhance computational performance for real-time applications.
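As a sketch of what the native side of such an interoperability layer can look like, the following CUDA code exports a C-linkage entry point that Unity could bind via P/Invoke ([DllImport]). The function name scale_buffer and the kernel are illustrative assumptions, not the tool's actual API.

#include <cuda_runtime.h>

// Hypothetical kernel: scales a float buffer in place.
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// C linkage so the C# side can bind it, e.g.
// [DllImport("unity_cuda_plugin")] static extern int scale_buffer(...);
extern "C" int scale_buffer(const float* host_in, float* host_out, int n, float factor) {
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1;

    // Transfer data in, execute the kernel, retrieve results: the
    // transfer/execute/retrieve flow described above.
    cudaMemcpy(dev, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256;
    scale_kernel<<<(n + threads - 1) / threads, threads>>>(dev, n, factor);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(host_out, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}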
ISBN: (print) 9788897999324
Through an environment of parallel objects, combining structured parallel programming with the object-orientation paradigm, this article presents a programming methodology based on High Level Parallel Compositions (HLPCs). Applying the method, commonly used inter-process communication patterns are parallelized, initially through the HLPCs Farm, Pipe, and TreeDV, which represent, respectively, the Farm, Pipeline, and Binary Tree communication patterns, the last of which is used within a parallel version of the design technique known as Divide and Conquer.
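As an illustration of the kind of pattern the Farm HLPC encapsulates, here is a minimal host-side C++ sketch (assumed for exposition; it is not the paper's parallel-object environment): a farm hands independent work items from a shared counter to a pool of workers.

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int n_items = 1000, n_workers = 4;
    std::vector<int> results(n_items);
    std::atomic<int> next{0};                    // shared work counter

    // Farm pattern: each worker repeatedly claims the next unprocessed item.
    auto worker = [&] {
        for (int i; (i = next.fetch_add(1)) < n_items; )
            results[i] = i * i;                  // stand-in for real work
    };

    std::vector<std::thread> pool;
    for (int w = 0; w < n_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    std::printf("results[999] = %d\n", results[999]);
    return 0;
}

Pipe and TreeDV follow the same idea with different process topologies: a linear chain of stages for Pipeline, and a binary tree of split/merge processes for Divide and Conquer.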
Network availability is an essential feature of an optical telecommunication network. Should a failure of a network component occur, be it a link or a component inside a node, the network control plane must be able to det...
Deterministic-by-construction parallel programming models offer the advantages of parallel speedup while avoiding the nondeterministic, hard-to-reproduce bugs that plague fully concurrent code. A principled approach to deterministic-by-construction parallel programming with shared state is offered by LVars: shared memory locations whose semantics are defined in terms of an application-specific lattice. Writes to an LVar take the least upper bound of the old and new values with respect to the lattice, while reads from an LVar can observe only that its contents have crossed a specified threshold in the lattice. Although it guarantees determinism, this interface is quite limited. We extend LVars in two ways. First, we add the ability to freeze and then read the contents of an LVar directly. Second, we add the ability to attach event handlers to an LVar, triggering a callback when the LVar's value changes. Together, handlers and freezing enable an expressive and useful style of parallel programming. We prove that in a language where communication takes place through these extended LVars, programs are at worst quasi-deterministic: on every run, they either produce the same answer or raise an error. We demonstrate the viability of our approach by implementing a library for Haskell supporting a variety of LVar-based data structures, together with a case study that illustrates the programming model and yields promising parallel speedup.
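A minimal C++ sketch of the LVar idea under one simple lattice (natural numbers ordered by <=, so the least upper bound is max) may help; this is an illustration of the semantics, not the paper's Haskell library. Writes take the lub of old and new values, threshold reads block until the value crosses a bound, and freezing makes the exact contents readable while turning later lub-increasing writes into errors.

#include <atomic>
#include <stdexcept>
#include <thread>
#include <cstdio>

// LVar sketch over the max lattice on unsigned ints: lub(a, b) = max(a, b).
class MaxLVar {
    std::atomic<unsigned> value{0};
    std::atomic<bool> frozen{false};
public:
    // put: monotonic write taking the lub of old and new contents.
    void put(unsigned v) {
        if (frozen.load() && v > value.load())
            throw std::runtime_error("put after freeze");  // the quasi-determinism error:
                                                           // whether it fires can depend
                                                           // on scheduling
        unsigned cur = value.load();
        while (cur < v && !value.compare_exchange_weak(cur, v)) {}
    }
    // Threshold read: blocks until contents cross t; reveals only that fact.
    unsigned get(unsigned t) {
        while (value.load() < t) std::this_thread::yield();
        return t;
    }
    // Freeze, then read the exact contents (the paper's first extension).
    unsigned freeze_and_read() { frozen.store(true); return value.load(); }
};

int main() {
    MaxLVar x;
    std::thread writer([&] { x.put(3); x.put(7); });
    unsigned seen = x.get(5);          // waits until the value is at least 5
    writer.join();
    std::printf("crossed %u, final %u\n", seen, x.freeze_and_read());
    return 0;
}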
ISBN: (print) 9781450332088
Recent advances in hardware architectures, particularly multicore and manycore systems, implicitly require programmers to write concurrent programs. However, writing correct and efficient concurrent programs is challenging. We envision a system in which concurrent programs can be self-adaptive when executing on different hardware. We have developed two tuning policies that enable users' programs to adjust their level of concurrency at compile-time and at run-time, respectively.
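A minimal C++ sketch of what a run-time tuning policy of this kind might look like (an assumption for illustration; the paper does not publish this code): a controller measures throughput at the current concurrency level and hill-climbs toward the best thread count.

#include <cstdio>

// Hypothetical hill-climbing run-time tuner: try a new concurrency level,
// keep moving in the same direction while throughput improves, else reverse.
struct ConcurrencyTuner {
    int level = 2, step = 1;
    double best = 0.0;
    int next(double throughput) {
        if (throughput < best) step = -step;   // got worse: search the other way
        else best = throughput;                // improved: remember and continue
        level += step;
        if (level < 1) level = 1;
        return level;
    }
};

int main() {
    ConcurrencyTuner tuner;
    // Stand-in workload: a synthetic throughput curve with a single peak.
    auto measure = [](int t) { return 10.0 * t - 3.0 * (t - 4) * (t - 4); };
    int threads = 2;
    for (int epoch = 0; epoch < 8; ++epoch) {
        double tput = measure(threads);
        std::printf("epoch %d: %d threads, throughput %.1f\n", epoch, threads, tput);
        threads = tuner.next(tput);
    }
    return 0;
}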
Dynamic task scheduling is an enticing programming model aiming to ease the development of parallel programs with intrinsically irregular or data-dependent parallelism. The performance of such solutions relies on the ability of the task-scheduling HW/SW stack to efficiently evaluate dependencies at runtime and schedule work to available cores. Traditional SW-only systems incur scheduling overheads of around 30K processor cycles per task, which severely limits the (core count, task granularity) combinations they can adequately handle. Previous work on HW-accelerated task scheduling has shown that such systems can support high-performance scheduling on processors with up to eight cores, but questions remained about the viability of such solutions at the greater core counts now frequently found in high-end SMP systems. This work presents an FPGA-proven, tightly integrated, Linux-capable, 30-core RISC-V system with hardware-accelerated task scheduling. We use this implementation to show that HW task scheduling can still offer competitive performance at such a high core count, and describe hardware and software optimizations that make this organization even more scalable than previous solutions. Finally, we outline ways in which this architecture could be augmented to overcome inter-core communication bottlenecks, mitigating the cache-degradation effects usually involved in parallelizing highly optimized serial code.
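For readers unfamiliar with the programming model being accelerated, here is a minimal C++ example of dynamic task scheduling with runtime-evaluated dependences, written with standard OpenMP task depend clauses (the paper's system uses its own HW/SW stack; this is only an illustration of the model):

#include <cstdio>

// Three tasks with a data-dependence chain: the runtime (or, in the paper,
// a hardware accelerator) discovers that t3 must wait for t1 and t2, while
// t1 and t2 may run concurrently.
int main() {
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)      // t1: produces a
        a = 1;
        #pragma omp task depend(out: b)      // t2: produces b, independent of t1
        b = 2;
        #pragma omp task depend(in: a, b)    // t3: consumes both, so it runs last
        std::printf("a + b = %d\n", a + b);
    }
    return 0;
}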
Mixed Integer Linear Programming (MILP) is used in behavioral synthesis as a mathematical model for designing efficient hardware. However, solving large MILP models poses significant computational challenges due to their NP-hard nature. Parallelization can tackle this challenge by amortizing the execution time, yet unbalanced loads can hinder its effectiveness. In this paper, we address the load-balance issue of parallel Branch and Bound (B&B) algorithms, particularly sub-tree parallelism, which is efficient at solving MILP models derived from behavioral synthesis. The proposed algorithm strategically partitions the original problem into sub-problems by selecting the decision variables that appear in the highest number of constraints, prioritizing load balance and enhancing solver performance. We evaluate the effectiveness of our method using MILP models derived from Mediabench data flow graphs of various sizes. The experimental results indicate that the proposed algorithm achieves speedups ranging from approximately 1x to 13x, highlighting its efficacy in improving the scalability and efficiency of MILP solving for behavioral synthesis.
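A minimal C++ sketch of the variable-selection heuristic described above (an illustration under assumed data structures, not the paper's code): count how many constraints each decision variable appears in, then branch on the most frequent one so the resulting sub-trees are better balanced.

#include <cstdio>
#include <vector>

// A constraint is represented here simply by the indices of the variables
// it involves (illustrative encoding).
using Constraint = std::vector<int>;

// Pick the decision variable appearing in the most constraints.
int pick_branching_var(const std::vector<Constraint>& cons, int n_vars) {
    std::vector<int> count(n_vars, 0);
    for (const auto& c : cons)
        for (int v : c) ++count[v];
    int best = 0;
    for (int v = 1; v < n_vars; ++v)
        if (count[v] > count[best]) best = v;
    return best;
}

int main() {
    // Variable 2 appears in three constraints, so it is chosen for branching;
    // fixing it to 0 in one sub-problem and 1 in the other yields the
    // sub-trees that workers then solve in parallel.
    std::vector<Constraint> cons = {{0, 2}, {1, 2}, {2, 3}, {0, 1}};
    std::printf("branch on x%d\n", pick_branching_var(cons, 4));
    return 0;
}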
Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization in stream processing applications often requires micro-batching, i.e., the continuous processing of data batches to expose data-parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides several techniques to address this challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms had already been evaluated using a smaller workload with the same application; we propose two new algorithms to address the shortcomings detected in those four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they failed to meet the strictest latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating greater effectiveness on highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.
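A minimal C++ sketch of the feedback idea underlying such self-adaptive micro-batching (an illustration only; the six evaluated algorithms are more elaborate): after each batch completes, compare the observed latency to the target and grow or shrink the next micro-batch size accordingly.

#include <algorithm>
#include <cstdio>

// Simple batch-size controller: shrink fast when latency is violated,
// grow gradually when there is headroom.
struct BatchSizer {
    int size = 64;                            // current micro-batch size
    int adapt(double observed_ms, double target_ms) {
        if (observed_ms > target_ms)
            size = std::max(1, size / 2);     // too slow: halve to cut latency
        else
            size = std::min(4096, size + size / 4);  // headroom: grow for throughput
        return size;
    }
};

int main() {
    BatchSizer sizer;
    // Stand-in latency model: processing time grows with batch size.
    auto latency_ms = [](int batch) { return 0.05 * batch; };
    const double target = 10.0;               // 10 ms latency requirement
    for (int i = 0; i < 6; ++i) {
        double obs = latency_ms(sizer.size);
        std::printf("batch %d -> %.1f ms (target %.1f)\n", sizer.size, obs, target);
        sizer.adapt(obs, target);
    }
    return 0;
}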