Run-time errors in concurrent programs are generally due to the wrong usage of synchronization primitives such as monitors. Conventional validation techniques such as testing become ineffective for concurrent programs...
详细信息
ISBN:
(纸本)9781581135626
Run-time errors in concurrent programs are generally due to the wrong usage of synchronization primitives such as monitors. Conventional validation techniques such as testing become ineffective for concurrent programs since the state space increases exponentially with the number of concurrent processes. In this paper, we propose an approach in which 1) the concurrency control component of a concurrent program is formally specified, 2) it is verified automatically using model checking, and 3) the code for concurrency control component is automatically generated. We use monitors as the synchronization primitive to control access to a shared resource by multipleconcurrent processes. Since our approach decouples the concurrency control component from the rest of the implementation it is scalable. We demonstrate the usefulness of our approach by applying it to a case study on Airport Ground Traffic *** use the Action Language to specify the concurrency control component of a system. Action Language is a specification language for reactive software systems. It is supported by an infinite-state model checker that can verify systems with boolean, enumerated and udbounded integer variables. Our code generation tool automatically translates the verified Action Language specification into a Java monitor. Our translation algorithm employs symbolic manipulation techniques and the specific notification pattern to generate an optimized monitor class by eliminating the context switch overhead introduced as a result of unnecessary thread notification. Using counting abstraction, we show that we can automatically verify the monitor specifications for arbitrary number of threads.
Distributed shared memory (DSM) machines provide the shared memory paradigm and achieve high performance by the caching of shared data. However, they suffer from cache miss and remote access latency with coarse-grain ...
详细信息
ISBN:
(纸本)0769505892
Distributed shared memory (DSM) machines provide the shared memory paradigm and achieve high performance by the caching of shared data. However, they suffer from cache miss and remote access latency with coarse-grain patterns. In this paper we suggest the combination of bulk transfer and prefetching as a new latency hiding technique in DSM machines. The purpose of bulk transfer is to replicate remote data into local memory and thus reduce remote accesses. Adaptive granularity was used for bulk transfer. Prefetching is added to fetch replicated data to the cache at the right time. We could apply simple prefetch scheduling as in uniprocessors since bulk transfer converts remote accesses into local ones. Simulation results show the reduced latency and the potential of AG as a preferable architecture for prefetching in DSM machines.
The ldquomain-streamrdquo inter-process communication models (share-memory and message-passing) require the programmers responsible for the construction of a very complex state machine for parallel processing. This ha...
详细信息
The ldquomain-streamrdquo inter-process communication models (share-memory and message-passing) require the programmers responsible for the construction of a very complex state machine for parallel processing. This has resulted multiple difficulties including programming, performance tuning, debugging, job scheduling and fault tolerance. The most troubling is the degree of difficulties. It increases exponentially as the multiprocessor grows in size. Inspired by the successes of packet switching protocols, this paper reports our preliminary findings in using decoupling technologies for parallel applications with high reliability, high performance and programmability.
The behavioral correctness of parallel programs has a pivotal role in computational sciences and engineering applications as researchers draw scientific conclusions from the results generated by parallel applications....
详细信息
ISBN:
(纸本)9781424437184
The behavioral correctness of parallel programs has a pivotal role in computational sciences and engineering applications as researchers draw scientific conclusions from the results generated by parallel applications. Moreover, with the advent of multicore processors, the development of parallel programs should be facilitated for the mainstream developers. While numerous programming models and APIs exist for parallel programming, we pose the view that more emphasis should be placed on designing the synchronization mechanisms of parallel programs independent from the design of their functional behaviors. More importantly, programs behaviors evolve (due to new requirements and change of configuration), thereby creating a need for techniques and tools that enable developers to reason about the behavioral evolution of parallel programs. With such motivations, we introduce a framework for automated design/evolution of the synchronization mechanisms of parallel programs.
Rendering, in particular the computation of global illumination, uses computationally very demanding algorithms. As a consequence many researchers have looked into speeding up the computation by distributing it over a...
详细信息
Rendering, in particular the computation of global illumination, uses computationally very demanding algorithms. As a consequence many researchers have looked into speeding up the computation by distributing it over a number of computational units. However, in almost all cases did they completely redesign the relevant algorithms in order to achieve high efficiency for the particular distributed or parallel environment. At the same time global illumination algorithms have become more and more sophisticated and complex. Often several basic algorithms are combined in multi-pass arrangements to achieve the desired lighting effects. As a result, it is becoming increasingly difficult to analyze and adapt the algorithms for optimal parallel execution at the lower levels. Furthermore, these bottom-up approaches destroy the basic design of an algorithm by polluting it with distribution logic and thus easily making it unmaintainable. We present a top-down approach for designing distributed applications based on their existing object-oriented decomposition. Distribution logic, in our case based on the CORBA middleware standard, is introduced transparently to the existing application logic. The design approach is demonstrated using several examples of multi-pass global illumination computation and ray tracing. The results show that a good speedup can usually be obtained even with minimal intervention into existing applications.
programming a distributed memory parallel machine generally entails a high degree of complexity. Load balancing in particular is a demanding task. If high efficiency is to be maintained, this task cannot be solved by ...
详细信息
programming a distributed memory parallel machine generally entails a high degree of complexity. Load balancing in particular is a demanding task. If high efficiency is to be maintained, this task cannot be solved by a distributed operating system alone, but must involve the application programmer. Instead of the underlying message passing architecture being shielded from the programmer, it should be explicitly modeled. Three key concepts of a parallel operating system-dual, mobile and reactive objects-are presented. They provide simple but efficient mechanisms that can be easily utilized for such complex tasks as load balancing, i.e., initial placement and migration of application entities. To illustrate the applicability of these concepts, a simple VR application-geoview-was implemented on a message passing architecture, and serves as an example throughout the paper.
We present a case for automating the selection of MPI-IO performance optimizations, with an ultimate goal to relieve the application programmer from these details, thereby improving their productivity. Programmers pro...
详细信息
We present a case for automating the selection of MPI-IO performance optimizations, with an ultimate goal to relieve the application programmer from these details, thereby improving their productivity. Programmers productivity has always been overlooked as compared to the performance optimizations in high performance computing community. In this paper we present RFSA, a reduced function set abstraction based on an existing parallel programming interface (MPI-IO) for I/O. MPI-IO provides high performance I/O function calls to the scientists/engineers writing parallel programs; who are required to use the most appropriate optimization of a specific function, hence limits the programmer productivity. Therefore, we propose a set of reduced functions with an automatic selection algorithm to decide what specific MPI-IO function to use. We implement a selection algorithm for I/O functions like read, write, etc. RFSA replaces 6 different flavors of read and write functions by one read and write function. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show that RFSA functions impose minimal performance penalties.
A current limitation of compilers for shared memory parallel languages is their restricted use of traditional code-improving transformations, such as constant propagation and dead code elimination. A major problem lie...
详细信息
A current limitation of compilers for shared memory parallel languages is their restricted use of traditional code-improving transformations, such as constant propagation and dead code elimination. A major problem lies in the lack of data flow analysis techniques for programs with user-specified parallelism. The authors demonstrate how data flow analysis remains quite viable in a compiler for shared memory parallel programs in a structured distributed shared memory environment, in which a shared space of tuples is accessed by properly synchronized methods. They demonstrate standard intraprocess data flow analysis performed in the midst of tuplespace communication statements, and present improvements to the precision of the analysis in the presence of these statements. They present a data flow system to compute reaching definitions across process boundaries, and a technique to improve the precision of this interprocess analysis. Lastly, some transformations enabled by this analysis are presented.
The difficulty of multithreaded programming remains a major obstacle for programmers to fully exploit multicore chips. Transactional memory has been proposed as an abstraction capable of ameliorating the challenges of...
详细信息
The difficulty of multithreaded programming remains a major obstacle for programmers to fully exploit multicore chips. Transactional memory has been proposed as an abstraction capable of ameliorating the challenges of traditional lock-based parallel programming. Hardware transactional memory (HTM) systems implement the necessary mechanisms to provide transactional semantics efficiently. In order to keep hardware simple, current HTM designs apply fixed policies that aim at optimizing the most expected application behaviour, and many of these proposals explicitly assume that commits will be clearly more frequent than aborts in future transactional workloads. This paper shows that some applications developed under the TM programming model are by nature prone to experience many conflicts. As a result, aborted transactions can get to be common and may seriously hurt performance. Our characterization, performed with truly transactional benchmarks on the LogTM system, shows that certain programs composed by large transactions suffer indeed very high abort rates. Thus, if TM is to unburden developers from the programmability-performance trade-off, HTM systems must obtain good performance levels in the presence of frequent aborts, requiring more flexible policies of data versioning as well as more sophisticated recovery schemes.
暂无评论