The difficulty of multithreaded programming remains a major obstacle for programmers to fully exploit multicore chips. Transactional memory (TM) has been proposed as an abstraction capable of ameliorating the challenges of traditional lock-based parallel programming. Hardware transactional memory (HTM) systems implement the mechanisms needed to provide transactional semantics efficiently. To keep the hardware simple, current HTM designs apply fixed policies tuned for the most commonly expected application behaviour, and many of these proposals explicitly assume that commits will be far more frequent than aborts in future transactional workloads. This paper shows that some applications developed under the TM programming model are by nature prone to experience many conflicts. As a result, aborted transactions can become common and may seriously hurt performance. Our characterization, performed with truly transactional benchmarks on the LogTM system, shows that certain programs composed of large transactions indeed suffer very high abort rates. Thus, if TM is to unburden developers from the programmability-performance trade-off, HTM systems must deliver good performance in the presence of frequent aborts, which requires more flexible data versioning policies as well as more sophisticated recovery schemes.
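The abort-and-retry cost discussed above can be sketched with a minimal software model of optimistic, versioned memory. This is illustrative only (class and field names are hypothetical), not LogTM's actual hardware mechanism: a transaction reads a value and its version, computes, and commits only if the version is still current; a conflicting writer forces an abort and the work is repeated.

```python
import threading

class VersionedCell:
    """Toy versioned memory cell: a commit succeeds only if the version
    observed at read time is still current (optimistic concurrency)."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self.value, self.version

    def try_commit(self, new_value, read_version):
        with self._lock:
            if self.version != read_version:
                return False          # conflict detected: abort
            self.value = new_value
            self.version += 1
            return True

def transactional_increment(cell, times, stats):
    """Each increment is a tiny transaction; every abort re-executes it,
    so high abort rates translate directly into wasted work."""
    for _ in range(times):
        while True:                    # retry loop
            v, ver = cell.read()
            if cell.try_commit(v + 1, ver):
                stats['commits'] += 1
                break
            stats['aborts'] += 1
```

The longer the transaction, the more work each abort discards, which is why programs built from large, conflict-prone transactions are hit hardest.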
The paper focuses on the implementation of an edge point chaining algorithm under the data parallel programming model. This implementation is not a straightforward transposition of classical algorithms developed so far; indeed, all of those are based on a raster (video) scan of the image and are thus sequential by nature. Therefore, a new data parallel algorithm has been designed. The principle of the data parallel implementation is detailed; the implementation technique is analogous to the parallel region growing algorithm.
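One common way to chain edge points in a data parallel, region-growing style is synchronous label propagation: every edge pixel starts with a unique label, and in each step all pixels simultaneously adopt the minimum label among their 8-connected edge neighbours until a fixpoint. The sketch below assumes this scheme; it illustrates the data parallel idea but is not necessarily the paper's exact algorithm.

```python
def chain_edge_points(edge, max_iters=100):
    """Data-parallel edge chaining sketch via label propagation.
    `edge` is a 2-D list of 0/1 flags; connected edge pixels end up
    sharing one chain label."""
    h, w = len(edge), len(edge[0])
    # Unique initial label per edge pixel; None for non-edge pixels.
    labels = [[r * w + c if edge[r][c] else None for c in range(w)]
              for r in range(h)]
    for _ in range(max_iters):
        changed = False
        new = [row[:] for row in labels]   # synchronous (SIMD-style) update
        for r in range(h):
            for c in range(w):
                if labels[r][c] is None:
                    continue
                best = labels[r][c]
                for dr in (-1, 0, 1):      # scan the 8-neighbourhood
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] is not None:
                            best = min(best, labels[rr][cc])
                if best != labels[r][c]:
                    new[r][c] = best
                    changed = True
        labels = new
        if not changed:                    # fixpoint reached
            break
    return labels
```

Because every pixel updates from its neighbours independently in each step, the inner loops map directly onto a data parallel machine, with no sequential raster scan.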
We present several heterogeneous partitioning algorithms for parallel numerical applications. The goal is to adapt the partitioning to dynamic and unpredictable load changes on the nodes. The methods are based on existing homogeneous algorithms such as orthogonal recursive bisection, parallel strips, and scattering. We apply these algorithms to a parallel numerical application running on a network of heterogeneous workstations. The behavior of the individual methods in a system with dynamic load changes and heterogeneous nodes is investigated. In addition, the new methods are compared with the conventional methods for homogeneous partitioning.
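A heterogeneous variant of the parallel-strips method can be sketched as follows: instead of equal-width strips, strip widths are made proportional to each node's currently measured speed, so faster nodes receive more columns. This is a minimal illustration of the idea, not the paper's exact method:

```python
def heterogeneous_strips(total_columns, speeds):
    """Split a grid's columns into parallel strips whose widths are
    proportional to each node's measured speed. The last node absorbs
    the rounding remainder so all columns are assigned exactly once."""
    total_speed = sum(speeds)
    widths, assigned = [], 0
    for i, s in enumerate(speeds):
        if i == len(speeds) - 1:
            w = total_columns - assigned   # remainder goes to the last node
        else:
            w = round(total_columns * s / total_speed)
        widths.append(w)
        assigned += w
    return widths
```

Repartitioning on a load change just means calling this again with fresh speed estimates and migrating the boundary columns.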
Some important issues in engineering the requirements of a distributed software system and methods that facilitate software system design for distributed or parallel implementations are discussed. The issues are presented from a knowledge engineering perspective and are divided into four levels: acquisition; representation; structuring; and design. The acquisition level entails the methods for eliciting system requirements data (attributes and relationships of software entities) from the end-user group using a model of context classes. The representation level deals with the language paradigm for representing the attributes and relationships of the software entities. The structuring level addresses methods for rearranging and grouping the software objects of the context classes into related clusters. The design level deals with methods for mapping or transforming the clusters of software objects into specification modules to facilitate distributed design. To this end, the design level uses an object-based paradigm for specifying the attributes and abstract behavior of the objects within the modules.
ISBN (print): 9780818670886
In this paper we study the problem of scheduling parallel loops at compile time for a heterogeneous network of machines. We consider heterogeneity in three aspects of parallel programming: program, processor, and network. A heterogeneous program has parallel loops with different amounts of work in each iteration; heterogeneous processors have different speeds; and a heterogeneous network has different communication costs between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms for generating optimal and sub-optimal loop schedules that address load balancing, communication optimizations, and network contention. Experiments show that a significant performance improvement is achieved using our techniques.
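The load-balancing part of such a compile-time schedule can be sketched with a greedy earliest-finish-time heuristic: iterations with heterogeneous costs are assigned, largest first, to the processor that would finish them soonest given its speed. This sketch covers load balancing only; the paper's model additionally accounts for communication cost and network contention, which are omitted here.

```python
def schedule_iterations(costs, speeds):
    """Greedy static schedule: costs[i] is the work in iteration i
    (heterogeneous program), speeds[j] is processor j's speed
    (heterogeneous processors). Returns the per-processor iteration
    lists and finish times."""
    finish = [0.0] * len(speeds)
    assignment = [[] for _ in speeds]
    # Largest-first assignment (LPT-style) tends to balance better.
    for it in sorted(range(len(costs)), key=lambda i: -costs[i]):
        p = min(range(len(speeds)),
                key=lambda j: finish[j] + costs[it] / speeds[j])
        finish[p] += costs[it] / speeds[p]
        assignment[p].append(it)
    return assignment, finish
```

The schedule's makespan is `max(finish)`; a communication-aware model would add a transfer term to the candidate finish time before taking the minimum.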
Efforts to support high performance computing (HPC) applications' requirements in the context of cloud computing have motivated us to design HPC Shelf, a cloud computing services platform to build and deploy large-scale parallel computing systems. We introduce Alite, the contextual contract system of HPC Shelf, to select component implementations according to requirements of the host application, target parallel computing platform characteristics (e.g., clusters and MPPs), quality of service (QoS) properties, and cost restrictions. It is evaluated through a small-scale case study employing two complementary component-based frameworks. The first one aims to represent components that implement linear algebra computations based on the BLAS interface. In turn, the second one aims to represent parallel computing platforms on the IaaS cloud offered by Amazon EC2 Service.
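The essence of contextual contract resolution can be sketched as constraint filtering plus cost minimization: keep the component implementations whose contract matches the target platform and meets the QoS floor, then pick the cheapest. The field names below are illustrative, not Alite's actual contract language:

```python
def select_implementation(candidates, platform, min_qos):
    """Toy contextual-contract resolution: filter candidate component
    implementations by target platform and QoS requirement, then choose
    the one with minimum cost. Returns None if no candidate qualifies."""
    feasible = [c for c in candidates
                if platform in c['platforms'] and c['qos'] >= min_qos]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c['cost'])
```

In a real system the "platform" term would itself be structured (cluster topology, accelerator availability, instance types), but the select-by-constraints-then-optimize shape stays the same.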
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine xGEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as a metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K×32K matrix with 128 Trident lanes, using matrix-vector operations in the standard Golub and Kahan algorithm yields a speedup of around 1.5× over using vector operations. However, using matrix operations in the xGEBRD subroutine gives a speedup of around 3× over vector operations, and around 2× over using matrix-vector operations in the standard Golub and Kahan algorithm.
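For reference, the Golub and Kahan reduction alternates Householder reflections from the left (zeroing a column below the diagonal) and from the right (zeroing a row beyond the superdiagonal). The pure-Python sketch below shows the textbook structure for an m×n matrix with m ≥ n; it is a readability-oriented reconstruction, not the Trident or LAPACK implementation, and the inner sums are exactly the matrix-vector work the paper maps onto the parallel lanes.

```python
import math

def householder(x):
    """Return (v, beta) with (I - beta * v * v^T) x = alpha * e1,
    choosing the sign of alpha to avoid cancellation."""
    alpha = -math.copysign(math.sqrt(sum(xi * xi for xi in x)), x[0])
    v = x[:]
    v[0] -= alpha
    norm2 = sum(vi * vi for vi in v)
    beta = 0.0 if norm2 == 0 else 2.0 / norm2
    return v, beta

def bidiagonalize(A):
    """Golub-Kahan reduction of an m x n matrix (m >= n, list of lists)
    to upper bidiagonal form via two-sided Householder reflections."""
    m, n = len(A), len(A[0])
    B = [row[:] for row in A]
    for k in range(n):
        # Left reflection: zero B[k+1:, k] (column below the diagonal).
        v, beta = householder([B[i][k] for i in range(k, m)])
        for j in range(k, n):
            s = sum(v[i - k] * B[i][j] for i in range(k, m))
            for i in range(k, m):
                B[i][j] -= beta * s * v[i - k]
        if k < n - 2:
            # Right reflection: zero B[k, k+2:] (row past the superdiagonal).
            v, beta = householder([B[k][j] for j in range(k + 1, n)])
            for i in range(k, m):
                s = sum(v[j - k - 1] * B[i][j] for j in range(k + 1, n))
                for j in range(k + 1, n):
                    B[i][j] -= beta * s * v[j - k - 1]
    return B
```

Since the reflections are orthogonal, the Frobenius norm of the matrix is preserved, which gives a convenient correctness check.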
Increasing system complexity of SOC applications leads to an increasing demand for powerful embedded DSP processors. To increase the computational power of DSP processors, the number of pipeline stages has been increased to reach higher clock frequencies, and the number of instructions executed in parallel has been increased to raise the computational bandwidth. To program the parallel units, the VLIW (very long instruction word) has been introduced. Programming all parallel units at the same time, however, either leads to an expanded program memory port or to the limitation that only a few units can be used in parallel. To overcome this limitation, this paper proposes a scalable long instruction word (xLIW). The xLIW concept allows full usage of the available units in parallel with optimal code density. An included instruction buffer reduces the power dissipation at the program memory ports during loop handling. The xLIW concept is part of a development project for a configurable DSP.
In order to improve the performance of applications on OpenMP/JIAJIA, we present a new abstraction, the Array Relation Vector (ARV), to describe the relation between the data elements of two consistent shared arrays accessed in one computation phase. Based on the ARV, we use array grouping to eliminate the pseudo distribution of small shared data and to improve page locality. Experimental results show that ARV-based array grouping can greatly improve the performance of applications with non-contiguous data access and strict access affinity on an OpenMP/JIAJIA cluster. For applications with small shared arrays, array grouping noticeably improves performance when the number of processors is small.
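The page-locality benefit of grouping two related arrays can be sketched with toy addresses: when `a` and `b` are placed separately, a phase that reads `a[k]` and `b[k]` together touches two pages; when their elements are interleaved into one grouped array, the pair lands on one page. The sizes and layout below are illustrative, not JIAJIA's actual page management:

```python
def pages_per_access(addr_pair, page_size):
    """Number of distinct DSM pages touched when one computation phase
    reads the given pair of (abstract, element-granular) addresses."""
    return len({addr // page_size for addr in addr_pair})

N, PAGE = 8, 4  # toy sizes: 8-element arrays, 4 elements per page

# Separate placement: a occupies addresses [0, N), b occupies [N, 2N),
# so phase k touches a[k] and b[k] on two different pages.
separate = [(k, N + k) for k in range(N)]

# Grouped placement: a[k] and b[k] interleaved into one array, so each
# phase touches two adjacent addresses on the same page.
grouped = [(2 * k, 2 * k + 1) for k in range(N)]
```

Halving the pages touched per phase cuts the page faults and coherence traffic a software DSM must pay for, which is where the reported speedups come from.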