ISBN (print): 9798400703638
Parallel Discrete Event Simulation (PDES) is a modelling technique that takes advantage of concurrent computing resources. However, its asynchronous nature can present challenges for efficient execution. This paper proposes a new non-blocking management system for handling messages and anti-messages in Time Warp simulations. This approach exploits the benefits of non-blocking algorithms to surpass the limitations of existing blocking mechanisms, resulting in more efficient and scalable simulations. Specifically, the approach relies on efficient atomic fetch-and-add operations provided by modern computer architectures for evaluating and updating the status of the event.
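As a rough illustration of the kind of mechanism the abstract describes (and not the paper's actual algorithm), the C sketch below uses a single atomic fetch-and-add on a per-event status word to decide which thread annihilates a message/anti-message pair; the MSG/ANTI encoding and the deliver() helper are hypothetical.

```c
/* Hedged sketch (not the paper's scheme): a per-event status word updated
 * with atomic fetch-and-add.  A message arrival adds MSG, the matching
 * anti-message adds ANTI; whichever update observes the other flag already
 * set wins the right to annihilate the pair. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MSG   0x1U   /* positive message has been delivered */
#define ANTI  0x2U   /* anti-message has been delivered     */

typedef struct {
    atomic_uint status;   /* bitwise sum of MSG / ANTI contributions */
} event_t;

/* Returns true if the caller must annihilate the message/anti-message pair. */
static bool deliver(event_t *ev, unsigned contribution)
{
    unsigned prev = atomic_fetch_add(&ev->status, contribution);
    /* The thread that completes the pair (sees the other flag already
     * present) is responsible for reclaiming both copies. */
    return (prev | contribution) == (MSG | ANTI) && prev != (MSG | ANTI);
}

int main(void)
{
    event_t ev = { .status = 0 };
    printf("message arrival annihilates: %d\n", deliver(&ev, MSG));   /* 0 */
    printf("anti-message annihilates:    %d\n", deliver(&ev, ANTI));  /* 1 */
    return 0;
}
```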
Common implementations of core memory allocation components handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach does not scale well, a problem that has been addressed in the literature by introducing layered allocation services or by replicating the core allocators, i.e., the bottom-most ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article, we explore an alternative approach to scalable memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system, where threads performing concurrent allocations/releases do not undergo any spin-lock based synchronization. Our solution allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling the allocator metadata; memory fragmentation and coalescing are also carried out in a fully non-blocking manner. In our solution, conflict detection relies on atomic Read-Modify-Write (RMW) machine instructions, which the processor guarantees to execute atomically. We also provide a proof of correctness of our non-blocking buddy system and report an experimental study that outlines the effectiveness of our solution.
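A minimal C sketch of the underlying idea, under strong simplifying assumptions (one allocation size, no splitting): block metadata is claimed with an atomic compare-and-swap instead of a spin-lock, so a failed claim simply means another thread won that block. The blk_* names and the flat status array are illustrative, not the paper's data structure.

```c
/* Minimal sketch under simplifying assumptions (single allocation size, no
 * splitting): each block has a status byte that threads claim with CAS
 * instead of taking a spin-lock.  The real non-blocking buddy system also
 * handles splitting, coalescing and fragmentation through RMW updates on a
 * tree of such status words. */
#include <stdatomic.h>
#include <stdio.h>

#define NBLOCKS 64
enum { BLK_FREE = 0, BLK_BUSY = 1 };

static _Atomic unsigned char blk_status[NBLOCKS];   /* zero-initialized = free */

/* Try to allocate one block; returns its index or -1 if none is free. */
static int blk_alloc(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        unsigned char expected = BLK_FREE;
        /* A failed CAS means another thread won this block; we simply
         * move on to the next candidate without ever blocking. */
        if (atomic_compare_exchange_strong(&blk_status[i], &expected, BLK_BUSY))
            return i;
    }
    return -1;
}

static void blk_free(int i)
{
    atomic_store(&blk_status[i], BLK_FREE);
    /* Coalescing with the buddy block (index i ^ 1) would be attempted
     * here with a further RMW on the parent's metadata. */
}

int main(void)
{
    int a = blk_alloc(), b = blk_alloc();
    printf("allocated blocks %d and %d\n", a, b);
    blk_free(a);
    blk_free(b);
    return 0;
}
```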
ISBN (print): 9781450392044
To maximize the performance of concurrent data structures, researchers have often turned to highly complex fine-grained techniques, resulting in efficient and elegant algorithms, which can, however, often be difficult to understand and prove correct. While simpler techniques exist, such as transactional memory, they can have limited performance or portability relative to their fine-grained counterparts. Approaches at both ends of this complexity-performance spectrum have been extensively explored, but relatively less is known about the middle ground: approaches that are willing to sacrifice some performance for simplicity, while remaining competitive with state-of-the-art handcrafted designs. In this paper, we explore this middle ground and present PathCAS, a primitive that combines ideas from multi-word CAS (KCAS) and transactional memory approaches while carefully avoiding overhead. We show how PathCAS can be used to implement efficient search data structures relatively simply, using an internal binary search tree as an example and then extending it to an AVL tree. Our best implementations outperform many handcrafted search trees: in search-heavy workloads, they rival the BCCO tree [5], the fastest known concurrent binary tree in terms of search performance [3]. Our results suggest that PathCAS can yield concurrent data structures that are relatively easy to build and prove correct, while offering surprisingly high performance.
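The sketch below is not the PathCAS API; it only illustrates the general multi-word-CAS usage pattern the abstract refers to, with a hypothetical kcas() call that is backed here by a trivial lock purely so the example compiles (real KCAS/PathCAS designs are non-blocking). An insert is expressed as one atomic two-word commit over a child pointer and a version counter.

```c
/* Illustrative only: a hypothetical kcas() interface, NOT PathCAS. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct { volatile uintptr_t *addr; uintptr_t expect, desired; } kcas_entry_t;

static pthread_mutex_t kcas_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lock-based stand-in: real multi-word CAS commits all words atomically
 * without any lock. */
static bool kcas(kcas_entry_t *e, int n)
{
    pthread_mutex_lock(&kcas_lock);
    for (int i = 0; i < n; i++)
        if (*e[i].addr != e[i].expect) { pthread_mutex_unlock(&kcas_lock); return false; }
    for (int i = 0; i < n; i++)
        *e[i].addr = e[i].desired;
    pthread_mutex_unlock(&kcas_lock);
    return true;
}

typedef struct node {
    int key;
    volatile uintptr_t left, right;   /* child pointers stored as words     */
    volatile uintptr_t version;       /* bumped on every structural change  */
} node_t;

/* Attach key as the left child of parent: one atomic 2-word commit
 * (child pointer + parent version), retried on conflict. */
static bool insert_left(node_t *parent, int key)
{
    for (;;) {
        uintptr_t child = parent->left, ver = parent->version;
        if (child != 0) return false;            /* slot already occupied */
        node_t *leaf = calloc(1, sizeof *leaf);
        leaf->key = key;
        kcas_entry_t e[2] = {
            { &parent->left,    child, (uintptr_t)leaf },
            { &parent->version, ver,   ver + 1          },
        };
        if (kcas(e, 2)) return true;
        free(leaf);                               /* lost the race: retry */
    }
}

int main(void)
{
    node_t root = { .key = 10 };
    printf("insert 5: %d\n", insert_left(&root, 5));
    printf("insert 7: %d\n", insert_left(&root, 7));  /* 0: slot taken */
    return 0;
}
```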
Weak memory models implemented on modern multicore processors are known to affect the correctness of concurrent code. They can also affect whether or not the concurrent code is secure. This is particularly the case in programs where the security levels of variables are value-dependent, i.e., depend on the values of other variables. In this paper, we illustrate how instruction reordering allowed by ARM and POWER multicore processors leads to vulnerabilities in such programs, and present a compositional, flow-sensitive information-flow logic which can be used to detect such vulnerabilities. The logic allows step-local reasoning (one instruction at a time) about a thread's security by tracking information about dependencies between instructions which guarantee their order of occurrence. Program security can then be established from individual thread security using rely/guarantee reasoning. The logic has been proved sound with respect to existing operational semantics using Isabelle/HOL, and implemented in an automatic symbolic execution tool.
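A schematic C11 example of the kind of value-dependent leak the paper targets (constructed for illustration, not taken from the paper): declassification scrubs the secret and then lowers the level, which is secure under sequential consistency but can leak if the relaxed stores or loads are reordered on ARM/POWER.

```c
/* Schematic illustration: the security level of `data` is value-dependent,
 * given by `level`.  The writer declassifies by scrubbing the secret and
 * then lowering the level; with relaxed accesses, ARM/POWER may make the
 * level change visible before the scrub, so the reader can output the
 * secret.  The release/acquire pair noted in comments restores the order. */
#include <stdatomic.h>
#include <stdio.h>

enum { PUBLIC = 0, SECRET = 1 };

static _Atomic int level = SECRET;  /* current classification of data   */
static _Atomic int data  = 1234;    /* currently holds a secret value   */

void declassify(void)                       /* writer thread */
{
    atomic_store_explicit(&data, 0, memory_order_relaxed);       /* scrub */
    atomic_store_explicit(&level, PUBLIC, memory_order_relaxed); /* lower */
    /* secure variant: memory_order_release on the `level` store */
}

void output_if_public(void)                 /* reader thread, public sink */
{
    if (atomic_load_explicit(&level, memory_order_relaxed) == PUBLIC)
        /* may print 1234 if the writer's stores or these loads reorder */
        /* secure variant: memory_order_acquire on the `level` load     */
        printf("%d\n", atomic_load_explicit(&data, memory_order_relaxed));
}

/* Single-threaded demo driver; the hazard requires the two functions to
 * run concurrently on different cores. */
int main(void) { declassify(); output_if_public(); return 0; }
```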
ISBN (print): 9781450368186
Balanced search trees typically use key comparisons to guide their operations, and achieve logarithmic running time. By relying on numerical properties of the keys, interpolation search achieves lower search complexity and better performance. Although interpolation-based data structures were investigated in the past, their non-blocking concurrent variants have received very little attention so far. In this paper, we propose the first non-blocking implementation of the classic interpolation search tree (IST) data structure. For arbitrary key distributions, the data structure ensures worst-case O(log n + p) amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected O(log log n + p) time, and insertion and deletion run in expected amortized O(log log n + p) time, where p is a bound on the number of threads. To improve the scalability of concurrent insertion and deletion, we propose a novel parallel rebuilding technique, which should be of independent interest. We evaluate whether the theoretical improvements translate to practice by implementing the concurrent interpolation search tree and benchmarking it on uniform and non-uniform key distributions, for dataset sizes in the millions to billions of keys. Relative to the state-of-the-art concurrent data structures, the concurrent interpolation search tree achieves performance improvements of up to 15% under high update rates and of up to 50% under moderate update rates. Further, ISTs exhibit up to 2x fewer cache misses and consume 1.2-2.6x less memory compared to the next best alternative on typical dataset sizes. We find that the results are surprisingly robust to distributional skew, which suggests that our data structure can be a promising alternative to classic concurrent search structures.
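For readers unfamiliar with interpolation search, the sketch below shows only the classic sequential array version of the idea the tree builds on; it is not the concurrent IST itself.

```c
/* Interpolation search: instead of halving the range as binary search does,
 * it estimates the likely position of the key from the key values at the
 * range boundaries.  Sequential sketch only. */
#include <stdio.h>

/* Returns the index of key in the sorted array a[0..n-1], or -1. */
static int interpolation_search(const long *a, int n, long key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[hi] == a[lo])                       /* avoid division by zero */
            return a[lo] == key ? lo : -1;
        /* Estimate the position assuming keys are roughly uniform. */
        int pos = lo + (int)((double)(key - a[lo]) / (a[hi] - a[lo]) * (hi - lo));
        if (a[pos] == key)      return pos;
        else if (a[pos] < key)  lo = pos + 1;
        else                    hi = pos - 1;
    }
    return -1;
}

int main(void)
{
    long a[] = { 2, 4, 8, 16, 32, 64, 128, 256 };
    printf("index of 32: %d\n", interpolation_search(a, 8, 32));   /* 4  */
    printf("index of 33: %d\n", interpolation_search(a, 8, 33));   /* -1 */
    return 0;
}
```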
Parallelizing (compute-intensive) discrete event simulation (DES) applications is a classical approach for speeding up their execution and for making very large/complex simulation models tractable. This has historically been achieved via parallel DES (PDES) techniques, which are based on partitioning the simulation model into distinct simulation objects (somehow resembling objects in classical object-oriented programming) whose states are disjoint, which are executed concurrently, and which rely on explicit event-exchange (or event-scheduling) primitives as the means to support mutual dependencies and notification of their state updates. With this approach, the application developer is necessarily forced to reason about state separation across the objects, and is thus not allowed to rely on shared information, such as global variables, within the application code. This implicitly shifts the user-exposed programming model to one where sequential-style global variable accesses within the application code are not allowed. In this article we remove this limitation by providing support for managing global variables in the context of DES code developed in ANSI-C, which gets automatically parallelized. Particularly, we focus on speculative (also termed optimistic) PDES systems that run on top of multi-core machines, where simulation objects can concurrently process their events with no guarantee of causal consistency, and actual violations of causality rules are recovered through rollback/recovery schemes. In compliance with the nature of speculative processing, in our proposal global variables are transparently mapped to multi-versions, so as to avoid any form of safety predicate verification upon their updates. Consistency is ensured via the introduction of a new rollback/recovery scheme based on detecting global variables' reads on non-correct versions. At the same time, efficiency in the execution is guaranteed by managing multi-version variables' lists via non-blocking algorithms.
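A much-simplified, single-threaded C sketch of the multi-version mapping the abstract describes: each write installs a (timestamp, value) version, and a read at logical time t returns the newest version not in its future. The mv_* names are illustrative; the paper additionally manages these lists concurrently and in a non-blocking way, and couples them with the rollback/recovery scheme.

```c
/* Much-simplified sketch of the multi-version idea.  The real system
 * installs and prunes versions concurrently without locks, and records
 * read timestamps to detect reads of versions later invalidated by
 * rollback; none of that is shown here. */
#include <stdio.h>
#include <stdlib.h>

typedef struct version {
    double ts;                 /* simulation time of the write         */
    long   value;
    struct version *older;     /* next (older) version in the list     */
} version_t;

typedef struct { version_t *newest; } mv_global_t;

static void mv_write(mv_global_t *g, double ts, long value)
{
    version_t *v = malloc(sizeof *v);
    v->ts = ts; v->value = value; v->older = g->newest;
    g->newest = v;             /* in the real system: a non-blocking install */
}

static long mv_read(const mv_global_t *g, double ts)
{
    for (version_t *v = g->newest; v != NULL; v = v->older)
        if (v->ts <= ts)
            return v->value;   /* newest version not in the reader's future */
    return 0;                  /* no version yet: default initial value     */
}

int main(void)
{
    mv_global_t g = { NULL };
    mv_write(&g, 1.0, 10);
    mv_write(&g, 5.0, 50);
    printf("read at t=3.0 -> %ld\n", mv_read(&g, 3.0));  /* 10 */
    printf("read at t=6.0 -> %ld\n", mv_read(&g, 6.0));  /* 50 */
    return 0;
}
```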
ISBN (print): 9781509035052
The large diffusion of highly parallel shared-memory multi-core machines has led Parallel Discrete Event Simulation (PDES) platforms to shift towards a share-everything model. This model is based on loose coupling between simulation objects and threads, lasting (at the extreme) no more than the lifetime of individual events. Concurrent threads can therefore CPU-dispatch events destined to any object at any point in time, thus fully sharing the workload of events to be processed on a fine-grain basis. This calls for efficient mechanisms to share the overall pool of pending events by enabling parallelism in insertion and extraction operations. In this article we present a lock-free event pool which also provides amortized O(1) time complexity for both insertions and extractions. It can sustain highly concurrent accesses without noticeable performance degradation when scaling up the thread count. Experimental results demonstrate that our solution stands as a core facility capable of further raising the practical impact of the emerging share-everything PDES paradigm.
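As a hint of the flavor of such a structure (not the paper's event pool), the sketch below shows only a lock-free insertion path into a bucketed pending-event set, using a CAS retry loop on a bucket's list head; the paper's structure also supports amortized O(1) extraction, which is the hard part and is not shown.

```c
/* Sketch of a lock-free insertion path only: events are hashed to a time
 * bucket and pushed onto that bucket's list with a CAS retry loop, so
 * concurrent producers never block each other. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS     1024
#define BUCKET_WIDTH 1.0    /* simulated-time span covered by one bucket */

typedef struct event {
    double ts;                          /* event timestamp */
    void (*handler)(void *);
    struct event *_Atomic next;
} event_t;

static event_t *_Atomic buckets[NBUCKETS];

static void pool_insert(event_t *ev)
{
    size_t b = (size_t)(ev->ts / BUCKET_WIDTH) % NBUCKETS;
    event_t *head = atomic_load(&buckets[b]);
    do {
        ev->next = head;                /* link to the current head */
        /* CAS publishes the event; on failure `head` is refreshed and we retry. */
    } while (!atomic_compare_exchange_weak(&buckets[b], &head, ev));
}

int main(void)
{
    event_t *e = calloc(1, sizeof *e);
    e->ts = 3.7;
    pool_insert(e);
    printf("inserted event with ts=%.1f\n", e->ts);
    return 0;
}
```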
ISBN (print): 9781450306638
Security enforcement inlined into user threads often delays the protected programs; inlined resource reclamation may interrupt program execution and defer resource release. We propose software cruising, a novel technique that migrates security enforcement and resource reclamation from user threads to a concurrent monitor thread. The technique leverages the increasingly popular multicore and multiprocessor architectures and uses lock-free data structures to achieve non-blocking and efficient synchronization between the monitor and user threads. As a case study, software cruising is applied to the heap buffer overflow problem. Previous mitigation and detection techniques for this problem suffer from high performance overhead, poor legacy-code compatibility, loss of semantic fidelity, or tedious manual program transformation. We present a concurrent heap buffer overflow detector, CRUISER, in which a concurrent thread is added to the user program to monitor heap integrity, and custom lock-free data structures and algorithms are designed to achieve high efficiency and scalability. The experiments show that our approach is practical: it imposes an average of 5% performance overhead on SPEC CPU2006, and the throughput slowdown on Apache is negligible on average.
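A hedged sketch of the software-cruising idea, not CRUISER's actual design: the allocating thread surrounds each heap buffer with canary words and publishes it into a lock-free registry; a separate monitor thread sweeps the registry and checks the canaries off the application's critical path. The names and the fixed-size registry are illustrative.

```c
/* Illustrative only: canary-guarded buffers checked by a cruiser thread. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CANARY   0xDEADC0DEUL
#define MAX_BUFS 1024

typedef struct { unsigned long front; char payload[64]; unsigned long back; } guarded_buf_t;

static guarded_buf_t *_Atomic registry[MAX_BUFS];

static guarded_buf_t *guarded_alloc(void)
{
    guarded_buf_t *b = malloc(sizeof *b);
    b->front = b->back = CANARY;
    for (size_t i = 0; i < MAX_BUFS; i++) {            /* publish lock-free */
        guarded_buf_t *expected = NULL;
        if (atomic_compare_exchange_strong(&registry[i], &expected, b))
            return b;
    }
    free(b);
    return NULL;                                       /* registry full */
}

static void monitor_sweep(void)                        /* run by cruiser thread */
{
    for (size_t i = 0; i < MAX_BUFS; i++) {
        guarded_buf_t *b = atomic_load(&registry[i]);
        if (b && (b->front != CANARY || b->back != CANARY))
            fprintf(stderr, "heap overflow detected in buffer %zu\n", i);
    }
}

int main(void)                                         /* single-threaded demo */
{
    guarded_buf_t *b = guarded_alloc();
    memset(b->payload, 'A', sizeof b->payload + 1);    /* deliberate 1-byte overflow */
    monitor_sweep();                                   /* reports the corruption */
    return 0;
}
```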
ISBN (print): 9781595939609
We describe a methodology for transforming a large class of highly-concurrent linearizable objects into highly-concurrent transactional objects. As long as the linearizable implementation satisfies certain regularity properties (informally, that every method has an inverse), we define a simple wrapper for the linearizable implementation that guarantees that concurrent transactions without inherent conflicts can synchronize at the same granularity as the original linearizable implementation.
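A schematic C illustration of the key ingredient (inverses), not the paper's full construction: a transaction calls the linearizable object directly and logs the inverse of each successful call, so aborting just replays the log backwards; the abstract locks that keep inherently conflicting transactions apart are omitted, and the set implementation is a trivial stand-in.

```c
/* Illustration of method inverses used as an undo log. */
#include <stdbool.h>
#include <stdio.h>

/* Stand-in linearizable set over small integers (details irrelevant here). */
static bool set_contains(int k);
static bool set_add(int k);
static bool set_remove(int k);

typedef struct { int key; bool was_add; } undo_entry_t;
typedef struct { undo_entry_t log[64]; int n; } txn_t;   /* no overflow check: sketch */

static bool txn_add(txn_t *t, int k)
{
    if (!set_add(k)) return false;                  /* no-op: nothing to undo */
    t->log[t->n++] = (undo_entry_t){ k, true };     /* inverse is remove(k)   */
    return true;
}

static void txn_abort(txn_t *t)
{
    while (t->n > 0) {                              /* replay inverses in reverse */
        undo_entry_t e = t->log[--t->n];
        if (e.was_add) set_remove(e.key);
        else           set_add(e.key);
    }
}

/* Trivial set implementation so the sketch is self-contained. */
static bool present[128];
static bool set_contains(int k) { return present[k]; }
static bool set_add(int k)      { bool p = present[k]; present[k] = true;  return !p; }
static bool set_remove(int k)   { bool p = present[k]; present[k] = false; return  p; }

int main(void)
{
    txn_t t = { .n = 0 };
    txn_add(&t, 7);
    printf("before abort: %d\n", set_contains(7));  /* 1 */
    txn_abort(&t);
    printf("after abort:  %d\n", set_contains(7));  /* 0 */
    return 0;
}
```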
ISBN (print): 9780769527659
Enabling message communication among concurrent computing threads without relying on mutual exclusion (i.e., locking) is highly desirable in real-time computing systems. This paper presents a refined version of the Non-Blocking Buffer (NBB), a lock-free interaction mechanism that enables efficient event-message communication between a single producer thread and a single consumer thread. The NBB scheme presented here improves on the previous version in two aspects. First, application designers now have the flexibility of choosing the consumer's retry strategy for the case when the buffer is empty but the producer is in the middle of inserting an item. Second, in the refined version the producer inserts pointers to data items into the buffer, whereas the consumer obtains copies of the items. This design is consistent with the fact that shared heap management must be avoided to enable fully lock-free interaction between the producer and the consumer. This paper also discusses approaches based on the NBB mechanism for supporting all conceivable producer-consumer scenarios.
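For context, the sketch below is a generic lock-free single-producer/single-consumer ring buffer, shown only to illustrate the class of mechanism the NBB belongs to; the NBB itself uses a different scheme (notably, it can report that the buffer is empty while an insertion is in progress, which plain head/tail counters cannot express).

```c
/* Generic SPSC lock-free ring buffer; not the NBB algorithm. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define RING_SIZE 8                     /* must be a power of two */

typedef struct {
    void *slots[RING_SIZE];
    _Atomic unsigned head;              /* written only by the consumer */
    _Atomic unsigned tail;              /* written only by the producer */
} spsc_ring_t;

static bool ring_put(spsc_ring_t *r, void *item)   /* producer side */
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE) return false;    /* full */
    r->slots[tail % RING_SIZE] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool ring_get(spsc_ring_t *r, void **item)  /* consumer side */
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (tail == head) return false;                /* empty */
    *item = r->slots[head % RING_SIZE];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

int main(void)
{
    static spsc_ring_t ring;            /* zero-initialized head/tail */
    int payload = 42;
    void *out = NULL;
    ring_put(&ring, &payload);
    if (ring_get(&ring, &out))
        printf("received %d\n", *(int *)out);
    return 0;
}
```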