Mature research advances in scheduling theory show that carefully crafted concurrent computational models permit static analysis of real-time behavior. This evidence enables designers to consider using suitable forms of explicit concurrency to model the inherent concurrency of real-time systems. The Ravenscar Profile, a specifically tailored subset of the Ada 95 tasking model, defines a compact and efficient concurrent computational model, especially suited to the development of high-integrity, high-efficiency real-time systems. Ravenscar runtimes can be implemented by small, efficient, reliable and certifiable kernels. At least two such implementations already exist and are being industrially deployed. The simplicity and intrinsic determinism of Ravenscar kernels facilitate the definition of metrics that allow a very accurate characterization of the dynamic behavior of the runtime and of the execution time of its primitives. Accurate runtime metrics enable forms of response-time analysis that minimize the pessimism in the prediction of the runtime's influence on the application. This is especially useful for concurrent systems that depend significantly on runtime support services. This paper recalls the motivations of the Ravenscar Profile, outlines its definition and formulates a precise characterization of the associated runtime metrics.
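As an illustration of the kind of analysis such runtime metrics feed, the classic fixed-priority response-time recurrence is sketched below; this is the textbook formulation, not the paper's refined one, which additionally charges the measured kernel overheads.

```latex
% Textbook fixed-priority response-time recurrence (illustrative only;
% the paper's analysis further accounts for measured runtime overheads).
% R_i: response time of task i, C_i: worst-case execution time,
% B_i: blocking time, hp(i): higher-priority tasks, T_j: period of task j.
\[
  R_i^{(0)} = C_i, \qquad
  R_i^{(n+1)} = C_i + B_i
    + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(n)}}{T_j} \right\rceil C_j ,
\]
% iterated until R_i^{(n+1)} = R_i^{(n)} or the value exceeds the deadline D_i.
```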
Once started, existing hash tables cannot change their pre-defined hash functions, even if the incoming data cannot be evenly distributed across the hash table buckets. In this paper, we present DHash, a hash table for shared-memory systems that can change its hash function and rebuild the hash table on the fly, without noticeably degrading its service. The major technical novelty of DHash stems from an efficient distributing mechanism that can atomically distribute every node when rebuilding, without locking the corresponding hash table buckets. This not only enables non-blocking lookup, insert, and delete operations but, more importantly, makes DHash independent of the implementation of the hash table buckets, so that programmers can select the set algorithms that best meet their requirements from a variety of existing lock-free and wait-free set algorithms. Evaluations show that DHash can efficiently change its hash function on the fly. Moreover, when rebuilding, DHash consistently outperforms state-of-the-art hash tables in terms of throughput and response time of concurrent operations, at different concurrency levels and with different operation mixes and average load factors.
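The abstract does not spell out DHash's lock-free distributing mechanism; the C sketch below only illustrates the basic idea of rebuilding a chained hash table under a new hash function at runtime. The names (dh_rebuild, hash_mod, hash_mix) are hypothetical, and the single global lock stands in for the paper's non-blocking node distribution.

```c
/* Simplified illustration only: a chained hash table whose hash function
 * can be replaced at runtime by redistributing every node into a fresh
 * bucket array.  Unlike DHash, this sketch serializes the rebuild with
 * one lock instead of the paper's atomic, non-blocking distribution. */
#include <pthread.h>
#include <stdlib.h>

struct node { long key; struct node *next; };

struct table {
    struct node   **buckets;
    size_t          nbuckets;
    size_t        (*hash)(long key, size_t nbuckets);
    pthread_mutex_t lock;
};

size_t hash_mod(long k, size_t n) { return (size_t)k % n; }
size_t hash_mix(long k, size_t n) { return ((size_t)k * 2654435761u) % n; }

/* Rebuild with a new hash function: unlink every node from the old
 * bucket array and re-distribute it into a new one, then swap. */
void dh_rebuild(struct table *t, size_t (*newhash)(long, size_t))
{
    pthread_mutex_lock(&t->lock);
    struct node **nb = calloc(t->nbuckets, sizeof *nb);
    for (size_t i = 0; i < t->nbuckets; i++) {
        struct node *n = t->buckets[i];
        while (n) {
            struct node *next = n->next;
            size_t b = newhash(n->key, t->nbuckets);
            n->next = nb[b];            /* push node onto its new bucket */
            nb[b]   = n;
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->hash    = newhash;
    pthread_mutex_unlock(&t->lock);
}
```

Switching, say, from hash_mod to hash_mix then amounts to calling dh_rebuild(&tbl, hash_mix) while the table keeps serving requests; DHash achieves the same effect without blocking concurrent operations.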
We show how to specify components of concurrent systems. The specification of a system is the conjunction of its components' specifications. Properties of the system are proved by reasoning about its components. We consider both the decomposition of a given system into parts, and the composition of given parts to form a system.
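A minimal sketch of the compositional idea stated above, in generic temporal-logic notation (not necessarily the paper's own):

```latex
% S_1, S_2: component specifications.  The system is specified by their
% conjunction, and a system property P is proved from the components:
\[
  S \;\triangleq\; S_1 \wedge S_2 ,
  \qquad
  S_1 \wedge S_2 \;\Rightarrow\; P
  \quad\text{hence}\quad
  S \;\Rightarrow\; P .
\]
% Decomposition goes the other way: a given specification S is split into
% parts S_1, S_2 such that S_1 \wedge S_2 \Rightarrow S.
```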
One problem with debugging (committed choice) concurrent logic programs is that their behaviour may be non-deterministic, in that successive executions of the same program may produce different results. We describe a scheme, based on the 'Instant Replay' scheme developed for more conventional parallel languages, that allows us to reproduce the execution behaviour of a concurrent logic program on subsequent executions, so that the execution may be examined for debugging purposes. The properties of concurrent logic programming languages allow us to simplify our scheme greatly. We have demonstrated our scheme with KLIC, and KL1 on the PIM multiprocessors, but it can also be applied to other committed choice concurrent logic programming languages.
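The scheme itself targets committed-choice concurrent logic languages; the C sketch below only illustrates the underlying 'Instant Replay' idea it builds on (record the order in which shared events are observed, then enforce that order on re-execution). All names are hypothetical, and the single per-object version counter is a simplification of the original scheme.

```c
/* Illustration of the Instant Replay idea: during recording, log the
 * version of a shared object observed by each access; during replay,
 * delay each access until the object reaches its logged version, so the
 * original interleaving is reproduced.  A concurrent logic system would
 * log commitments/bindings rather than explicit object versions. */
#include <stdatomic.h>
#include <stdio.h>

struct shared { atomic_ulong version; /* ... object data ... */ };

/* Record mode: note the version this access observed, then advance it. */
unsigned long record_access(struct shared *s, FILE *log)
{
    unsigned long v = atomic_fetch_add(&s->version, 1);
    fprintf(log, "%lu\n", v);          /* persist the observed order      */
    return v;
}

/* Replay mode: wait for our turn in the recorded order, then take it. */
void replay_access(struct shared *s, unsigned long logged_version)
{
    while (atomic_load(&s->version) != logged_version)
        ;                              /* spin (or yield) until our turn  */
    atomic_fetch_add(&s->version, 1);  /* release the next recorded access */
}
```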
This paper presents synchronizing interoperable resources (SIR). SIR extends the concurrent communication mechanisms of the SR concurrent programming language to multi-program environments. The paper discusses design and implementation issues, including implicit binding, a mechanism for providing seamless concurrent communication. It also examines some performance results for SIR and presents a qualitative analysis. The paper further compares SIR with CORBA and other systems that provide interoperability. (C) 2002 Elsevier Science Ltd. All rights reserved.
This article presents an algorithm for detecting deadlocks in concurrent finite-state systems without incurring most of the state explosion due to the modeling of concurrency by interleaving. For systems that have a high level of concurrency, our algorithm can be much more efficient than the classical exploration of the whole state space. Finally, we show that our algorithm can also be used for verifying arbitrary safety properties.
This paper presents a new approach to programming multiway rendezvous problems in the SR language. The approach uses SR's concurrent invocation statement and rendezvous mechanism to coordinate the interacting processes. This approach is compared with one that suggested an extension to SR's rendezvous mechanism; the two approaches result in differing program structure. The new approach is shown to lead to simpler and cleaner interfaces between the main process and the worker processes, and it uses only existing language mechanisms. The results are of importance to both programmers and designers of concurrent programming languages.
Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimum contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.
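MPI-ACC's own interface is not given in the abstract; the C sketch below only shows the general end-to-end style that GPU-integrated MPI implementations (such as the CUDA-aware MVAPICH mentioned above) enable, where a device buffer is handed to MPI directly instead of being staged through host memory by the application.

```c
/* Generic sketch of end-to-end GPU data movement in the CUDA-aware MPI
 * style: the device pointer is passed straight to MPI, with no explicit
 * host staging copy in the application.  This illustrates the concept,
 * not MPI-ACC's actual interface.  Run with two ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;
    double *d_buf;                       /* buffer in GPU memory */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```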
Even though Software Transactional Memory (STM) is one of the most promising approaches to simplify concurrent programming, current STM implementations incur significant overheads that render them impractical for many real-sized programs. The key insight of this work is that we do not need to use the same costly barriers for all the memory managed by a real-sized application: if only a small fraction of the memory is under contention, lightweight barriers may be used for the rest. In this work, we propose a new solution based on an approach of adaptive object metadata (AOM) to promote the use of a fast path to access objects that are not under contention. We show that this approach is able to make the performance of an STM competitive with the best fine-grained lock-based approaches in some of the more challenging benchmarks. (C) 2015 Elsevier Inc. All rights reserved.
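The abstract does not detail the AOM mechanism, so the C sketch below is only a hypothetical illustration of the underlying idea: serve uncontended objects through a lightweight fast-path barrier, and inflate an object to full STM metadata, with the costly barrier, once contention on it is detected.

```c
/* Hypothetical sketch of an adaptive-object-metadata style read barrier:
 * objects start "compact" and use a cheap fast path; once contention is
 * detected, the object is inflated to full per-object STM metadata and
 * routed through the costly barrier.  Not the paper's algorithm. */
#include <stdatomic.h>
#include <stdlib.h>

enum obj_state { OBJ_COMPACT, OBJ_INFLATED };

struct stm_object {
    _Atomic int  state;     /* OBJ_COMPACT until contention is detected  */
    void        *payload;   /* user data                                 */
    void        *metadata;  /* full per-object STM metadata, lazily set  */
};

/* Placeholder barriers: a real STM would validate and log here. */
static void *read_fast(struct stm_object *o) { return o->payload; }
static void *read_full(struct stm_object *o) { /* consult o->metadata */
                                               return o->payload; }

/* Adaptive read barrier: cheap path for uncontended objects,
 * full barrier once the object has been inflated. */
void *stm_read(struct stm_object *o)
{
    if (atomic_load_explicit(&o->state, memory_order_acquire) == OBJ_COMPACT)
        return read_fast(o);
    return read_full(o);
}

/* Called when a conflict on 'o' is detected: install full metadata and
 * route subsequent accesses through the costly barrier. */
void stm_mark_contended(struct stm_object *o)
{
    o->metadata = malloc(64);   /* stand-in for real conflict metadata */
    atomic_store_explicit(&o->state, OBJ_INFLATED, memory_order_release);
}
```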
The single-instruction multiple-thread (SIMT) architecture found in some of the latest graphics processing units (GPUs) builds on conventional single-instruction multiple-data (SIMD) parallelism while adopting the thread programming model. The architecture suffers from degraded performance caused by inefficient divergence handling, a problem hidden by the programmer's view of independent threads. A loop optimization technique with the potential to increase the efficiency of the core SIMD block while processing embedded divergences is investigated here. Concurrent loops are generally not bound to iterate in lock-step, allowing better alignment of thread flows via iteration scheduling. The efficiency of the concept is analyzed for fixed and flow-adapting scheduling policies. The proposed payoff model captures loop overhead implications, allowing one to assess the trade-offs of applying the technique to a specific loop instance. Processing speedups can generally be observed in the total running time if kernels are compute-bound, as demonstrated by several examples. The studied iteration scheduling policies do not impose alterations to the core SIMD concept and design, thus preserving the benefits of data-level parallelism.
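As a back-of-the-envelope illustration of the efficiency gap the technique targets (not the paper's payoff model), the C snippet below compares lock-step execution of a warp with divergent per-lane trip counts against an idealized packing of the same iterations; the warp width and trip counts are made up.

```c
/* Compare SIMD utilization of a divergent loop under lock-step execution
 * (inactive lanes merely masked off) with an idealized rescheduling that
 * packs iterations densely onto lanes.  Illustrative numbers only; loop
 * overheads, which the paper's payoff model accounts for, are ignored. */
#include <stdio.h>

#define WARP 8

int main(void)
{
    int trips[WARP] = { 1, 2, 2, 3, 8, 1, 2, 1 };  /* per-lane trip counts */
    int total = 0, max = 0;
    for (int i = 0; i < WARP; i++) {
        total += trips[i];
        if (trips[i] > max) max = trips[i];
    }
    int lockstep_slots = WARP * max;                /* every lane held for max iters */
    int packed_passes  = (total + WARP - 1) / WARP; /* ideal dense packing           */
    int packed_slots   = WARP * packed_passes;

    printf("useful iterations : %d\n", total);
    printf("lock-step         : %d slots, efficiency %.2f\n",
           lockstep_slots, (double)total / lockstep_slots);
    printf("ideal packing     : %d slots, efficiency %.2f\n",
           packed_slots, (double)total / packed_slots);
    return 0;
}
```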