Chase and Lev's concurrent deque is a key data structure in shared-memory parallel programming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implementation of Chase and Lev's deque on top of the POWER and ARM architectures: these provide very relaxed memory models, which we exploit to improve performance but considerably complicate the reasoning. We also study an optimized x86 and a portable C11 implementation, conducting systematic experiments to evaluate the impact of memory barrier optimizations. Our results demonstrate the benefits of hand tuning the deque code when running on top of relaxed memory models.
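For concreteness, the following is a minimal sketch of the Chase-Lev deque expressed with C11 atomics, in the spirit of the relaxed-memory implementations discussed above. It is not the authors' proven code: the buffer has a fixed, hypothetical capacity and the dynamic resizing of the original algorithm is omitted.

```c
/* Minimal sketch of a fixed-capacity Chase-Lev work-stealing deque with C11
 * atomics. Illustrative only: no resizing; the owner must not overflow. */
#include <stdatomic.h>
#include <stdint.h>

#define CAP   1024              /* fixed capacity, power of two (assumption) */
#define EMPTY ((intptr_t)-1)
#define ABORT ((intptr_t)-2)

typedef struct {
    _Atomic int64_t  top;       /* next index to steal from */
    _Atomic int64_t  bottom;    /* next index the owner pushes to */
    _Atomic intptr_t buf[CAP];
} deque;

/* Owner thread: push a task at the bottom (assumes the deque is not full). */
void dq_push(deque *q, intptr_t task) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    atomic_store_explicit(&q->buf[b % CAP], task, memory_order_relaxed);
    /* Release fence: a thief that observes the new bottom also sees the task. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
}

/* Owner thread: take a task from the bottom; EMPTY if none. */
intptr_t dq_take(deque *q) {
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* order bottom write vs. top read */
    int64_t t = atomic_load_explicit(&q->top, memory_order_relaxed);
    intptr_t x = EMPTY;
    if (t <= b) {
        x = atomic_load_explicit(&q->buf[b % CAP], memory_order_relaxed);
        if (t == b) {            /* last element: race against concurrent steals */
            if (!atomic_compare_exchange_strong_explicit(&q->top, &t, t + 1,
                    memory_order_seq_cst, memory_order_relaxed))
                x = EMPTY;       /* a thief won the race */
            atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        }
    } else {                     /* deque was already empty */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
    }
    return x;
}

/* Any thread: steal a task from the top; EMPTY if none, ABORT on a lost race. */
intptr_t dq_steal(deque *q) {
    int64_t t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    int64_t b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b)
        return EMPTY;
    intptr_t x = atomic_load_explicit(&q->buf[t % CAP], memory_order_relaxed);
    if (!atomic_compare_exchange_strong_explicit(&q->top, &t, t + 1,
            memory_order_seq_cst, memory_order_relaxed))
        return ABORT;            /* another thief or the owner raced us */
    return x;
}
```

The sequentially consistent fences and compare-and-swaps above are exactly the expensive synchronization points whose placement and cost the barrier optimizations on POWER, ARM, and x86 target.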
ISBN (print): 9781479929276
Bounded single-producer single-consumer FIFO queues are among the simplest concurrent data structures, and they do not require more than sequential consistency for correct operation. Still, sequential consistency is an unrealistic hypothesis on shared-memory multiprocessors, and enforcing it through memory barriers induces significant performance and energy overhead. This paper revisits the optimization and correctness proof of bounded FIFO queues in the context of weak memory consistency, building upon the recent axiomatic formalization of the C11 memory model. We validate the portability and performance of our proven implementation over three processor architectures with diverse hardware memory models, including ARM and PowerPC. Comparison with state-of-the-art implementations demonstrates consistent improvements for a wide range of buffer and batch sizes.
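To illustrate the kind of implementation at stake, here is a minimal bounded single-producer/single-consumer ring buffer written with C11 acquire/release atomics instead of sequentially consistent accesses. It is a baseline sketch, not the paper's proven and optimized code; the names and the fixed capacity are assumptions.

```c
/* Minimal bounded SPSC FIFO using C11 acquire/release atomics. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 256                        /* capacity, power of two (assumption) */

typedef struct {
    _Atomic size_t head;                /* advanced by consumer, read by producer */
    _Atomic size_t tail;                /* advanced by producer, read by consumer */
    int            slots[QCAP];
} spsc_queue;

/* Producer only. Returns false when the queue is full. */
bool spsc_push(spsc_queue *q, int v) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)
        return false;                   /* full */
    q->slots[t % QCAP] = v;
    /* Release: a consumer that sees the new tail also sees the slot contents. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

/* Consumer only. Returns false when the queue is empty. */
bool spsc_pop(spsc_queue *q, int *out) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h)
        return false;                   /* empty */
    *out = q->slots[h % QCAP];
    /* Release: a producer that sees the new head may safely reuse the slot. */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

The batching optimizations evaluated in the paper go further, for instance by caching the other party's index locally so that each push or pop does not need to reload it on every call.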
Lock-free algorithms have been developed to avoid various problems associated with using locks to control access to shared data structures. Instead of preventing interference between processes using mutual exclusion, lock-free algorithms must ensure correct behaviour in the presence of interference. While this avoids the problems with locks, the resulting algorithms are typically more intricate than lock-based algorithms, and allow more complex interactions between processes. The result is that even when the basic idea is easy to understand, the code implementing lock-free algorithms is typically very subtle, hard to understand, and hard to get right. In this paper, we consider the well-known lock-free queue implementation due to Michael and Scott, and show how a slightly simplified version of this algorithm can be derived from an abstract specification via a series of verifiable refinement steps. Reconstructing a design history in this way allows us to examine the kinds of design decisions that underlie the algorithm as described by Michael and Scott, and to explore the consequences of some alternative design choices. Our derivation is based on a refinement calculus with concurrent composition, combined with a reduction approach, based on that proposed by Lipton, Lamport, Cohen, and others, which we have previously used to derive a scalable stack algorithm. The derivation of Michael and Scott's queue algorithm introduces some additional challenges because it uses a "helper" mechanism, which means that part of an enqueue operation can be performed by any process; moreover, in a simulation proof, the treatment of dequeue on an empty queue requires the use of backward simulation.
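For readers unfamiliar with the algorithm, the sketch below shows a simplified Michael-Scott queue in C11 and highlights the "helper" step mentioned above, where any thread can finish advancing the tail on behalf of a stalled enqueuer. It deliberately ignores safe memory reclamation (and hence the ABA problem), which a production implementation must handle.

```c
/* Simplified Michael-Scott lock-free queue; no memory reclamation. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct node {
    int value;
    _Atomic(struct node *) next;
} node;

typedef struct {
    _Atomic(node *) head;               /* points at a dummy node */
    _Atomic(node *) tail;
} msqueue;

void ms_init(msqueue *q) {
    node *dummy = malloc(sizeof *dummy);
    atomic_init(&dummy->next, NULL);
    atomic_init(&q->head, dummy);
    atomic_init(&q->tail, dummy);
}

void ms_enqueue(msqueue *q, int v) {
    node *n = malloc(sizeof *n);
    n->value = v;
    atomic_init(&n->next, NULL);
    for (;;) {
        node *tail = atomic_load(&q->tail);
        node *next = atomic_load(&tail->next);
        if (next != NULL) {
            /* Helper step: another enqueue linked its node but has not yet
             * advanced the tail; finish its work for it and retry. */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        node *expected = NULL;
        if (atomic_compare_exchange_weak(&tail->next, &expected, n)) {
            /* Linked; try to advance the tail (any thread may do this). */
            atomic_compare_exchange_weak(&q->tail, &tail, n);
            return;
        }
    }
}

bool ms_dequeue(msqueue *q, int *out) {
    for (;;) {
        node *head = atomic_load(&q->head);
        node *tail = atomic_load(&q->tail);
        node *next = atomic_load(&head->next);
        if (next == NULL)
            return false;               /* empty */
        if (head == tail) {
            /* Tail is lagging: help advance it and retry. */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        *out = next->value;
        if (atomic_compare_exchange_weak(&q->head, &head, next))
            return true;                /* note: the old dummy node is leaked */
    }
}
```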
ISBN (print): 0769520219
Shared memory multiprocessor systems typically provide a set of hardware primitives in order to support synchronization. Generally, they provide single-word read-modify-write hardware primitives such as compare-and-swap, load-linked/store-conditional and fetch-and-op, from which the higher-level synchronization operations are then implemented in software. Although the single-word hardware primitives are conceptually powerful enough to support higher-level synchronization, from the programmer's point of view they are not as useful as their generalizations to multi-word objects. This paper presents two fast and reactive lock-free multi-word compare-and-swap algorithms. The algorithms dynamically measure the level of contention as well as the memory conflicts of the multi-word compare-and-swap operations, and in response they react accordingly in order to guarantee good performance in a wide range of system conditions. The algorithms are non-blocking (lock-free), which allows fast dynamic behavior. Experiments on thirty processors of an SGI Origin2000 multiprocessor show that both our algorithms react quickly to contention variations and perform from two to nine times faster than the best known alternatives in all contention conditions.
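To make the operation's semantics concrete, the sketch below gives a lock-based reference specification of multi-word compare-and-swap (MCAS). It only pins down what the operation must accomplish atomically; it is not the paper's reactive lock-free algorithm, which achieves the same effect without a global lock.

```c
/* Lock-based reference specification of MCAS semantics (not lock-free). */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static pthread_mutex_t mcas_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically: if every *addrs[i] currently holds expected[i], write
 * new_vals[i] to every *addrs[i] and return true; otherwise change
 * nothing and return false. */
bool mcas(size_t n, intptr_t *addrs[], const intptr_t expected[],
          const intptr_t new_vals[]) {
    pthread_mutex_lock(&mcas_lock);
    bool ok = true;
    for (size_t i = 0; i < n; i++) {
        if (*addrs[i] != expected[i]) { ok = false; break; }
    }
    if (ok) {
        for (size_t i = 0; i < n; i++)
            *addrs[i] = new_vals[i];
    }
    pthread_mutex_unlock(&mcas_lock);
    return ok;
}
```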