The difficulty of multithreaded programming remains a major obstacle that keeps programmers from fully exploiting multicore chips. Transactional memory has been proposed as an abstraction capable of ameliorating the challenges of traditional lock-based parallel programming. Hardware transactional memory (HTM) systems implement the necessary mechanisms to provide transactional semantics efficiently. In order to keep hardware simple, current HTM designs apply fixed policies that aim to optimize the most commonly expected application behaviour, and many of these proposals explicitly assume that commits will be clearly more frequent than aborts in future transactional workloads. This paper shows that some applications developed under the TM programming model are by nature prone to many conflicts. As a result, aborted transactions can become common and may seriously hurt performance. Our characterization, performed with truly transactional benchmarks on the LogTM system, shows that certain programs composed of large transactions do indeed suffer very high abort rates. Thus, if TM is to unburden developers from the programmability-performance trade-off, HTM systems must perform well in the presence of frequent aborts, which requires more flexible data versioning policies as well as more sophisticated recovery schemes.
We present several heterogeneous partitioning algorithms for parallel numerical applications. The goal is to adapt the partitioning to dynamic and unpredictable load changes on the nodes. The methods are based on existing homogeneous algorithms such as orthogonal recursive bisection, parallel strips, and scattering. We apply these algorithms to a parallel numerical application in a network of heterogeneous workstations. The behavior of the individual methods in a system with dynamic load changes and heterogeneous nodes is investigated. In addition, the new methods are compared with the conventional methods for homogeneous partitioning.
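As a rough illustration of how a homogeneous strip partitioning might be made heterogeneous, the sketch below splits a range of grid rows into contiguous strips whose sizes are proportional to a per-node speed estimate. This is not the authors' algorithm; the function name `weighted_strips` and the `speeds` parameter are hypothetical, and the abstract does not say how node capacity is measured.

```python
def weighted_strips(n_rows, speeds):
    """Split rows 0..n_rows-1 into contiguous strips sized by relative speed.

    `speeds` holds one positive number per node (an assumed capacity estimate).
    Returns a list of half-open (start, end) row ranges, one per node.
    """
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    total = sum(speeds)
    bounds = []
    start = 0
    acc = 0.0
    for i, s in enumerate(speeds):
        acc += s
        # The last strip always ends at n_rows so that no row is lost to rounding.
        end = n_rows if i == len(speeds) - 1 else round(n_rows * acc / total)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A node twice as fast as the others receives about twice as many rows.
    print(weighted_strips(100, [1, 1, 2]))
```

To react to load changes at run time, one could call the function again with updated speed estimates and migrate only the rows whose owner changed.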
ISBN: 9780818670886 (print)
In this paper we study the problem of scheduling parallel loops at compile time for a heterogeneous network of machines. We consider heterogeneity in three aspects of parallel programming: program, processor and network. A heterogeneous program has parallel loops with a different amount of work in each iteration; heterogeneous processors have different speeds; and a heterogeneous network has different communication costs between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms that generate optimal and sub-optimal loop schedules for load balancing, communication optimization and network contention. Experiments show that our techniques achieve a significant performance improvement.
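To make the first two kinds of heterogeneity concrete, here is a minimal sketch of a static schedule for a loop whose iterations have different costs, run on processors with different speeds. It uses a generic greedy heuristic (largest iteration first, placed on the processor that would finish it earliest) rather than the paper's algorithms, and it ignores communication cost and network contention entirely. The name `schedule_loop` and its parameters are hypothetical.

```python
def schedule_loop(costs, speeds):
    """Greedily assign loop iterations to processors.

    costs[i]  : assumed amount of work in iteration i
    speeds[p] : assumed work units per time unit on processor p
    Returns (assignment, makespan), where assignment[p] lists the
    iterations given to processor p.
    """
    finish = [0.0] * len(speeds)          # current finish time per processor
    assignment = [[] for _ in speeds]
    # Place the most expensive iterations first.
    for it in sorted(range(len(costs)), key=lambda i: -costs[i]):
        # Choose the processor on which this iteration would complete earliest.
        best = min(range(len(speeds)),
                   key=lambda p: finish[p] + costs[it] / speeds[p])
        finish[best] += costs[it] / speeds[best]
        assignment[best].append(it)
    return assignment, max(finish)


if __name__ == "__main__":
    plan, makespan = schedule_loop([4, 2, 2], [2, 1])
    print(plan, makespan)
```

The greedy choice is a heuristic and gives no optimality guarantee; a model that also covered the network would add a communication term to the key used when picking a processor.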
The paper focuses on implementing an edge point chaining algorithm under the data parallel programming model. This implementation is not a straightforward transposition of the classical algorithms developed so far; all of those are based upon a video scanning of the image, and are thus sequential by nature. Therefore, a new data parallel algorithm has been designed. The principle of the data parallel implementation is detailed. The implementation technique is analogous to the parallel region growing algorithm.
ISBN: 9781424437184 (print)
The behavioral correctness of parallel programs plays a pivotal role in computational science and engineering applications, as researchers draw scientific conclusions from the results generated by parallel applications. Moreover, with the advent of multicore processors, the development of parallel programs should be made easier for mainstream developers. While numerous programming models and APIs exist for parallel programming, we take the view that more emphasis should be placed on designing the synchronization mechanisms of parallel programs independently from the design of their functional behaviors. More importantly, program behaviors evolve (due to new requirements and changes of configuration), thereby creating a need for techniques and tools that enable developers to reason about the behavioral evolution of parallel programs. With such motivations, we introduce a framework for automated design/evolution of the synchronization mechanisms of parallel programs.
Rendering, in particular the computation of global illumination, uses computationally very demanding algorithms. As a consequence, many researchers have looked into speeding up the computation by distributing it over a number of computational units. However, in almost all cases they completely redesigned the relevant algorithms in order to achieve high efficiency for the particular distributed or parallel environment. At the same time, global illumination algorithms have become more and more sophisticated and complex. Often several basic algorithms are combined in multi-pass arrangements to achieve the desired lighting effects. As a result, it is becoming increasingly difficult to analyze and adapt the algorithms for optimal parallel execution at the lower levels. Furthermore, these bottom-up approaches destroy the basic design of an algorithm by polluting it with distribution logic, which easily makes it unmaintainable. We present a top-down approach for designing distributed applications based on their existing object-oriented decomposition. Distribution logic, in our case based on the CORBA middleware standard, is introduced transparently to the existing application logic. The design approach is demonstrated using several examples of multi-pass global illumination computation and ray tracing. The results show that a good speedup can usually be obtained even with minimal intervention into existing applications.
The "main-stream" inter-process communication models (shared-memory and message-passing) make programmers responsible for constructing a very complex state machine for parallel processing. This has resulted in multiple difficulties, including programming, performance tuning, debugging, job scheduling and fault tolerance. Most troubling is the degree of difficulty, which increases exponentially as the multiprocessor grows in size. Inspired by the successes of packet switching protocols, this paper reports our preliminary findings in using decoupling technologies for parallel applications with high reliability, high performance and programmability.
A current limitation of compilers for shared memory parallel languages is their restricted use of traditional code-improving transformations, such as constant propagation and dead code elimination. A major problem lies in the lack of data flow analysis techniques for programs with user-specified parallelism. The authors demonstrate how data flow analysis remains quite viable in a compiler for shared memory parallel programs in a structured distributed shared memory environment, in which a shared space of tuples is accessed by properly synchronized methods. They demonstrate standard intraprocess data flow analysis performed in the midst of tuplespace communication statements, and present improvements to the precision of the analysis in the presence of these statements. They present a data flow system to compute reaching definitions across process boundaries, and a technique to improve the precision of this interprocess analysis. Lastly, some transformations enabled by this analysis are presented.
Programming a distributed memory parallel machine generally entails a high degree of complexity. Load balancing in particular is a demanding task. If high efficiency is to be maintained, this task cannot be solved by a distributed operating system alone, but must involve the application programmer. Instead of the underlying message passing architecture being shielded from the programmer, it should be explicitly modeled. Three key concepts of a parallel operating system (dual, mobile and reactive objects) are presented. They provide simple but efficient mechanisms that can be easily utilized for such complex tasks as load balancing, i.e., initial placement and migration of application entities. To illustrate the applicability of these concepts, a simple VR application, geoview, was implemented on a message passing architecture, and serves as an example throughout the paper.