We develop a fully asynchronous proximal bundle method for solving non-smooth, convex optimization problems. The algorithm can be used as a drop-in replacement for classic bundle methods, i.e., the function must be given by a first-order oracle for computing function values and subgradients. The algorithm allows for an arbitrary number of master problem processes computing new candidate points and oracle processes evaluating functions at those candidate points. These processes share information by communication with a single supervisor process that resembles the main loop of a classic bundle method. All processes run in parallel and no explicit synchronization step is required. Instead, the asynchronous and possibly outdated results of the oracle computations can be seen as an inexact function oracle. Hence, we show the convergence of our method under weak assumptions very similar to inexact and incremental bundle methods. In particular, we show how the algorithm learns important structural properties of the functions to control the inaccuracy induced by the asynchronicity automatically such that overall convergence can be guaranteed.
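For orientation, the classic (synchronous) proximal bundle iteration that the abstract takes as its starting point can be written as below. This is a textbook sketch, not the paper's asynchronous scheme; the symbols $J_k$, $u_k$, $\hat x_k$ and $\varrho$ are conventional notation assumed here, not taken from the abstract.

```latex
% Cutting-plane model built from oracle answers (x_j, f(x_j), g_j), with g_j a subgradient of f at x_j
\hat f_k(x) = \max_{j \in J_k} \bigl\{ f(x_j) + \langle g_j,\, x - x_j \rangle \bigr\}

% Master problem: new candidate point near the current center \hat x_k, with proximal weight u_k > 0
x_{k+1} = \operatorname*{arg\,min}_{x} \; \hat f_k(x) + \frac{u_k}{2}\, \lVert x - \hat x_k \rVert^2

% Descent test with parameter \varrho \in (0,1): move the center to x_{k+1} only if this holds
f(x_{k+1}) \le f(\hat x_k) - \varrho \bigl( f(\hat x_k) - \hat f_k(x_{k+1}) \bigr)
```

In the asynchronous setting described above, the oracle values entering the model may belong to outdated candidate points, which is why they are treated as inexact oracle information.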
ISBN: 9781665423243 (print)
This paper introduces Taskflow to address the critical question of "How can we make it easier to implement and deploy parallel computer-aided design (CAD) algorithms on large heterogeneous nodes with high performance and simultaneous high productivity?" Parallelizing CAD is an extremely challenging job. Modern CAD applications exhibit unique computational patterns and user requirements that need very strategic decomposition to benefit from parallelism. Taskflow helps researchers and developers manage the implementation complexity of parallel algorithms by introducing a new high-level programming model supported by an efficient run-time. By capitalizing on emerging parallelism comprising many-core central processing units (CPUs), graphics processing units (GPUs), and custom accelerators, Taskflow enables CAD to achieve new performance and productivity milestones that were previously out of reach.
Multi-core shared memory architectures have become ubiquitous in computing hardware nowadays. As a result, there is a growing need to fully utilize these architectures by introducing appropriate parallelization schemes, such as OpenMP worksharing-loop constructs, to applications. However, most developers find introducing OpenMP directives to their code hard due to pervasive pitfalls in managing parallel shared memory. To assist developers in this process, many compilers, as well as source-to-source (S2S) translation tools, have been developed over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. Recently, many data-driven AI-based code completion (CC) tools, such as GitHub CoPilot, have been developed to ease and improve programming productivity. Leveraging the insights from existing AI-based programming-assistance tools, this work presents a novel AI model that can serve as a parallel-programming assistant. Specifically, our model, named PragFormer, is tasked with identifying for loops that can benefit from conversion to parallel worksharing-loop construct (OpenMP directive) and even predict the need for specific data-sharing attributes clauses on the fly. We created a unique database, named Open-OMP, specifically for this goal. Open-OMP contains over 32,000 unique code snippets from different domains, half of which contain OpenMP directives, while the other half do not. We experimented with different model design parameters for these tasks and showed that our best-performing model outperforms a statistically-trained baseline as well as a state-of-the-art S2S compiler. In fact, it even outperforms the popular generative AI model of ChatGPT. In the spirit of advancing research on this topic, we have already released source code for Pra
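As an illustration of the kind of annotation the abstract refers to, the following minimal sketch shows a for loop converted to an OpenMP worksharing-loop construct with a data-sharing clause. The function name `sum_of_squares` is hypothetical and the example is not drawn from Open-OMP or from PragFormer's output.

```cpp
#include <vector>

// Sum of squares over a vector. The iterations are independent except for the
// accumulation into `sum`, which the reduction clause takes care of.
double sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    const long n = static_cast<long>(v.size());
    // `reduction(+:sum)` gives every thread a private copy of `sum` and adds
    // the copies together at the end of the loop. Without the clause, the
    // shared update of `sum` would be a data race.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

Built with an OpenMP flag such as `-fopenmp` (GCC or Clang), the loop is divided among threads; without the flag the pragma is ignored and the same code runs serially. Deciding whether a loop deserves the directive, and whether it needs a clause such as `reduction` or `private`, is the task the abstract assigns to the model.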
ISBN: 9781450328937 (print)
We describe a successful addition of high performance computing (HPC) into a traditional computer science curriculum at a liberal arts university. The approach incorporated a three-semester sequence of courses emphasizing parallel programming techniques, with the final course focusing on a research-level mathematical project that was executed on a TOP500 supercomputer. A group of students with varied programming backgrounds participated in the program. Emphasis was placed on utilizing the Open MPI and CUDA libraries along with parallel algorithm and file I/O analysis. Copyright 2014 ACM.
This paper presents Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs, which focuses on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. A key to BACQ's design is its ability to replace conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and supports infinite growth of the head and tail across its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue's state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking results show that BACQ outperforms the BWD (Broker Queue Work Distributor), the fastest known GPU queue, by more than 2x while preserving FIFO semantics. The paper demonstrates BACQ's superior performance through real-world empirical evaluations.
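The general idea of combining tickets with a ring buffer whose head and tail counters grow without wrapping can be sketched on the CPU as follows. This is not BACQ's warp-level design: it is the widely used bounded-queue pattern with one sequence number per slot, and the class name `TicketRing` and all member names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bounded FIFO queue. `head_` and `tail_` are 64-bit tickets that only grow;
// a ticket is mapped to a slot with `ticket & mask_`, so the capacity must be
// a power of two. Each slot's `seq` tells which ticket may use it next.
class TicketRing {
public:
    explicit TicketRing(std::size_t capacity_pow2)
        : slots_(capacity_pow2), mask_(capacity_pow2 - 1) {
        for (std::size_t i = 0; i < capacity_pow2; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    bool try_enqueue(int value) {
        std::uint64_t t = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[t & mask_];
            const std::uint64_t seq = s.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::int64_t>(seq) - static_cast<std::int64_t>(t);
            if (diff == 0) {
                // Slot is free for ticket t: try to claim the ticket.
                if (tail_.compare_exchange_weak(t, t + 1, std::memory_order_relaxed)) {
                    s.value = value;
                    s.seq.store(t + 1, std::memory_order_release);  // publish to consumers
                    return true;
                }
            } else if (diff < 0) {
                return false;  // slot still holds an unconsumed item: queue is full
            } else {
                t = tail_.load(std::memory_order_relaxed);  // another producer took t
            }
        }
    }

    bool try_dequeue(int& out) {
        std::uint64_t h = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[h & mask_];
            const std::uint64_t seq = s.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::int64_t>(seq) - static_cast<std::int64_t>(h + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(h, h + 1, std::memory_order_relaxed)) {
                    out = s.value;
                    // Hand the slot to the producer one lap ahead.
                    s.seq.store(h + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // nothing published for ticket h yet: queue is empty
            } else {
                h = head_.load(std::memory_order_relaxed);  // another consumer took h
            }
        }
    }

private:
    struct Slot {
        std::atomic<std::uint64_t> seq{0};
        int value = 0;
    };
    std::vector<Slot> slots_;
    std::size_t mask_;
    std::atomic<std::uint64_t> head_{0};
    std::atomic<std::uint64_t> tail_{0};
};
```

Because tickets are handed out in increasing order and each slot accepts exactly one ticket per lap, items leave in the order their tickets were claimed, which is the FIFO property the abstract emphasizes. The warp-level coordination, priority adjustment near the buffer boundary, and caching layer described above are outside the scope of this sketch.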
ISBN: 9788897999324 (print)
Using an environment of parallel objects, this article presents a programming methodology based on High Level Parallel Compositions (HLPC), an approach that combines structured parallel programming with the object-oriented paradigm. The method is applied to parallelize communication patterns commonly used among processes. The initial set consists of the HLPCs Farm, Pipe and TreeDV, which represent, respectively, the Farm, Pipeline and Binary Tree communication patterns; the last of these is used in a parallel version of the Divide and Conquer design technique.
Network availability is an essential feature of an optical telecommunication network. Should a failure of a network component occur, be it a link or a component inside a node, network control plane must be able to det...
Introduction: Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel General-Purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support and incomplete mathematical functions. To address these limitations, GPU manufacturers have developed specialized programming models, such as CUDA and HIP, which enable developers to leverage the full computational power of modern GPUs. This article introduces an open-source tool designed to bridge the gap between Unity and CUDA, allowing developers to integrate CUDA's capabilities within Unity-based *** The proposed solution establishes an interoperability framework that facilitates communication between Unity and CUDA. The tool is designed to efficiently transfer data, execute CUDA kernels, and retrieve results, ensuring seamless integration into Unity's rendering and computation *** The tool extends Unity's capabilities by enabling CUDA-based computations, overcoming the inherent limitations of Unity's compute shader language. This integration allows developers to exploit multi-GPU architectures, leverage advanced mathematical functions, and enhance computational performance for real-time applications.
Deterministic-by-construction parallel programming models offer the advantages of parallel speedup while avoiding the nondeterministic, hard-to-reproduce bugs that plague fully concurrent code. A principled approach to deterministic-by-construction parallel programming with shared state is offered by LVars: shared memory locations whose semantics are defined in terms of an application-specific lattice. Writes to an LVar take the least upper bound of the old and new values with respect to the lattice, while reads from an LVar can observe only that its contents have crossed a specified threshold in the lattice. Although it guarantees determinism, this interface is quite limited. We extend LVars in two ways. First, we add the ability to freeze and then read the contents of an LVar directly. Second, we add the ability to attach event handlers to an LVar, triggering a callback when the LVar's value changes. Together, handlers and freezing enable an expressive and useful style of parallel programming. We prove that in a language where communication takes place through these extended LVars, programs are at worst quasi-deterministic: on every run, they either produce the same answer or raise an error. We demonstrate the viability of our approach by implementing a library for Haskell supporting a variety of LVar-based data structures, together with a case study that illustrates the programming model and yields promising parallel speedup.
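The write, threshold-read and freeze operations described above can be illustrated with a small sketch for one particular lattice, the integers ordered by `<=` with `max` as least upper bound. This is a C++ illustration of the idea only, not the Haskell library from the paper; the class `MaxLVar` and its method names are hypothetical, and raising an error on a state-changing write after a freeze is an assumption about how the "raise an error" case arises.

```cpp
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// An LVar over the lattice (int, <=), where the least upper bound is max.
class MaxLVar {
public:
    // Write: the new state is the least upper bound of the old state and v,
    // so writes commute and the final state does not depend on their order.
    void put(int v) {
        std::lock_guard<std::mutex> lock(m_);
        const int joined = std::max(value_, v);
        if (frozen_ && joined != value_)
            throw std::logic_error("put changed a frozen LVar");
        value_ = joined;
        cv_.notify_all();
    }

    // Threshold read: blocks until the state has reached `threshold`, then
    // returns the threshold itself, never the exact current state. Every
    // schedule in which the read returns therefore sees the same result.
    int get_threshold(int threshold) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return value_ >= threshold; });
        return threshold;
    }

    // Freeze: forbids further changes and exposes the exact state.
    int freeze() {
        std::lock_guard<std::mutex> lock(m_);
        frozen_ = true;
        return value_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int value_ = 0;      // bottom element of this sketch's lattice
    bool frozen_ = false;
};
```

A run in which all writes happen before the freeze yields the same frozen value every time, while a run in which a larger write arrives afterwards fails with an exception instead of silently producing a different answer. That is the "same answer or error" behaviour the abstract calls quasi-determinism. Event handlers are left out of this sketch.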
ISBN: 9781450332088 (print)
Recent advances in hardware architectures, particularly multicore and manycore systems, implicitly require programmers to write concurrent programs. However, writing correct and efficient concurrent programs is challenging. We envision a system where concurrent programs can be self-adaptive when executing on different hardware. We have developed two different tuning policies, which enable users' programs to adjust their level of concurrency at compile-time and run-time, respectively. Copyright is held by the owner/author(s).