Programming a distributed memory parallel machine generally entails a high degree of complexity. Load balancing in particular is a demanding task. If high efficiency is to be maintained, this task cannot be solved by a distributed operating system alone, but must involve the application programmer. Instead of the underlying message passing architecture being shielded from the programmer, it should be explicitly modeled. Three key concepts of a parallel operating system (dual, mobile, and reactive objects) are presented. They provide simple but efficient mechanisms that can be easily utilized for such complex tasks as load balancing, i.e., initial placement and migration of application entities. To illustrate the applicability of these concepts, a simple VR application, geoview, was implemented on a message passing architecture and serves as an example throughout the paper.
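The abstract does not give the system's actual interface, but the flavor of explicit initial placement and migration can be sketched as follows. The C++ types and functions here (MobileObject, Node, place, migrate_once) are purely illustrative assumptions, not the paper's API; placement goes to the least-loaded node and a migration step moves one object from the most-loaded to the least-loaded node.

```cpp
// Hypothetical sketch of "mobile objects" for explicit load balancing.
// Names and policies are illustrative, not taken from the paper.
#include <algorithm>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct MobileObject {
    std::string id;
    double load;   // estimated work carried by this application entity
    MobileObject(std::string i, double l) : id(std::move(i)), load(l) {}
};

struct Node {
    int rank = 0;
    std::vector<std::shared_ptr<MobileObject>> objects;
    double load() const {
        double s = 0;
        for (const auto& o : objects) s += o->load;
        return s;
    }
};

// Initial placement: assign each object to the currently least-loaded node.
void place(const std::vector<std::shared_ptr<MobileObject>>& objs,
           std::vector<Node>& nodes) {
    for (const auto& o : objs) {
        auto it = std::min_element(nodes.begin(), nodes.end(),
            [](const Node& a, const Node& b) { return a.load() < b.load(); });
        it->objects.push_back(o);
    }
}

// Migration: move one object from the most-loaded node to the least-loaded one.
void migrate_once(std::vector<Node>& nodes) {
    auto src = std::max_element(nodes.begin(), nodes.end(),
        [](const Node& a, const Node& b) { return a.load() < b.load(); });
    auto dst = std::min_element(nodes.begin(), nodes.end(),
        [](const Node& a, const Node& b) { return a.load() < b.load(); });
    if (src != dst && !src->objects.empty()) {
        dst->objects.push_back(src->objects.back());
        src->objects.pop_back();
    }
}

int main() {
    std::vector<Node> nodes(2);
    nodes[0].rank = 0;
    nodes[1].rank = 1;
    std::vector<std::shared_ptr<MobileObject>> objs;
    for (int i = 0; i < 6; ++i)
        objs.push_back(std::make_shared<MobileObject>("obj" + std::to_string(i), 1.0 + i));
    place(objs, nodes);     // initial placement by current node load
    migrate_once(nodes);    // one migration step to rebalance
    for (const auto& n : nodes)
        std::cout << "node " << n.rank << " carries load " << n.load() << "\n";
}
```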
A current limitation of compilers for shared memory parallel languages is their restricted use of traditional code-improving transformations, such as constant propagation and dead code elimination. A major problem lies in the lack of data flow analysis techniques for programs with user-specified parallelism. The authors demonstrate how data flow analysis remains quite viable in a compiler for shared memory parallel programs in a structured distributed shared memory environment, in which a shared space of tuples is accessed by properly synchronized methods. They demonstrate standard intraprocess data flow analysis performed in the midst of tuplespace communication statements, and present improvements to the precision of the analysis in the presence of these statements. They present a data flow system to compute reaching definitions across process boundaries, and a technique to improve the precision of this interprocess analysis. Lastly, some transformations enabled by this analysis are presented.
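As a point of reference for the analysis described above, here is a minimal, self-contained sketch of iterative reaching-definitions analysis over a small control-flow graph, using the standard data-flow equations IN[b] = union of OUT over predecessors and OUT[b] = GEN[b] ∪ (IN[b] − KILL[b]). It covers only the classical intraprocess case, none of the tuplespace-aware extensions from the paper, and the example CFG and definition numbering are invented for illustration.

```cpp
// Minimal iterative reaching-definitions analysis on a tiny CFG (intraprocess
// only; the paper's interprocess/tuplespace extensions are not shown).
#include <bitset>
#include <cstdio>
#include <vector>

constexpr int MAX_DEFS = 64;
using DefSet = std::bitset<MAX_DEFS>;

struct Block {
    DefSet gen, kill;          // definitions generated / killed in the block
    std::vector<int> preds;    // predecessor block indices
};

int main() {
    // Hypothetical 4-block CFG: 0 -> 1 -> 2 -> 3, with a back edge 2 -> 1.
    std::vector<Block> cfg(4);
    cfg[1].preds = {0, 2};
    cfg[2].preds = {1};
    cfg[3].preds = {2};
    cfg[0].gen.set(0);                        // d0 defined in block 0
    cfg[1].gen.set(1); cfg[1].kill.set(0);    // d1 redefines the same variable
    cfg[2].gen.set(2);

    std::vector<DefSet> in(cfg.size()), out(cfg.size());
    bool changed = true;
    while (changed) {                         // iterate to a fixed point
        changed = false;
        for (size_t b = 0; b < cfg.size(); ++b) {
            DefSet newIn;
            for (int p : cfg[b].preds) newIn |= out[p];           // IN = union of preds' OUT
            DefSet newOut = cfg[b].gen | (newIn & ~cfg[b].kill);  // OUT = GEN | (IN - KILL)
            if (newIn != in[b] || newOut != out[b]) {
                in[b] = newIn; out[b] = newOut; changed = true;
            }
        }
    }
    for (size_t b = 0; b < cfg.size(); ++b)
        std::printf("block %zu: IN=%s OUT=%s\n", b,
                    in[b].to_string().substr(MAX_DEFS - 4).c_str(),
                    out[b].to_string().substr(MAX_DEFS - 4).c_str());
}
```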
ISBN (print): 9781450328210
A data-graph computation — popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi — is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by PRISM to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let G = (V, E) be a degree-Δ graph colored with χ colors, and suppose that Q ⊆ V is the set of active vertices in a round. Define size(Q) = |Q| + Σ_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of PRISM performs the updates in Q using O(χ(lg(Q/χ) + lg Δ) + lg P) span and Θ(size(Q) + χ + P) work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2–2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is
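A compact sketch of the chromatic-scheduling idea is given below, assuming a precomputed vertex coloring and a made-up "average of neighbors" update rule. It mirrors the round structure described above (bucket active vertices by color, then update one color class at a time), but it is not PRISM's implementation: the multibag, the parallel runtime, and the PRISM-R machinery are all omitted.

```cpp
// Hedged sketch of chromatic scheduling for a dynamic data-graph computation.
#include <cstdio>
#include <vector>

struct Graph {
    int n;                                 // number of vertices
    std::vector<std::vector<int>> adj;     // adjacency lists
    std::vector<int> color;                // a valid vertex coloring
};

// Execute one round on the active set and return the next round's active set.
std::vector<int> run_round(const Graph& g, const std::vector<int>& active,
                           std::vector<double>& data, int num_colors) {
    // Bucket active vertices by color (a simplified stand-in for the multibag).
    std::vector<std::vector<int>> bags(num_colors);
    for (int v : active) bags[g.color[v]].push_back(v);

    std::vector<char> changed(g.n, 0);
    for (int c = 0; c < num_colors; ++c) {
        // Vertices of one color share no edge, so these updates are independent
        // and could be executed in parallel without locks.
        for (int v : bags[c]) {
            double avg = 0;
            for (int u : g.adj[v]) avg += data[u];
            if (!g.adj[v].empty()) avg /= g.adj[v].size();
            if (avg > data[v]) { data[v] = avg; changed[v] = 1; }
        }
    }

    // Every neighbor of a vertex whose value changed becomes active next round.
    std::vector<char> activate(g.n, 0);
    for (int v = 0; v < g.n; ++v)
        if (changed[v]) for (int u : g.adj[v]) activate[u] = 1;
    std::vector<int> next;
    for (int v = 0; v < g.n; ++v) if (activate[v]) next.push_back(v);
    return next;
}

int main() {
    // A 4-cycle 0-1-2-3-0, properly colored with 2 colors.
    Graph g{4, {{1, 3}, {0, 2}, {1, 3}, {0, 2}}, {0, 1, 0, 1}};
    std::vector<double> data{0.0, 1.0, 2.0, 3.0};
    std::vector<int> active{0, 1, 2, 3};
    for (int round = 0; round < 3 && !active.empty(); ++round)
        active = run_round(g, active, data, 2);
    for (int v = 0; v < g.n; ++v) std::printf("v%d = %.2f\n", v, data[v]);
}
```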
ISBN (print): 9781728129877
This work is devoted to the problem of detecting and handling faults of computing nodes during the execution of parallel programs on distributed computing systems. The fault tolerance tools of PBS/TORQUE are considered. A functional model for optimizing fault handling is proposed.
Parallelism is a suitable approach for speeding up the massive computations of applications, but parallel programming is still difficult. An algorithmic skeleton is a parallel programming model that provides a high level of abstraction for programmers. This approach uses pre-defined components to facilitate easier parallel programming. Divide and conquer (DC) is an appropriate parallel pattern to implement as a skeleton: the solution of the original problem is obtained by dividing it into smaller sub-problems and solving them in parallel. Today, the graphics processing unit (GPU) is an attractive computational processor for performing tasks in parallel, because it has a large number of processing units. In this paper, a divide and conquer skeleton on the GPU, named DC_GPU, is proposed. DC_GPU is a divide and conquer skeleton implemented on the GPU that uses a consistent programming interface in C++ for easier parallel programming. The performance of this skeleton has been evaluated with mergesort and Sobel edge detection. The results show that the speedup obtained with this skeleton on the GPU is more than 2.
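The abstract does not show the skeleton's interface, so the following is a hedged, CPU-only sketch of what a C++ divide-and-conquer skeleton of this kind typically looks like; the template divide_and_conquer and the summation example are assumptions for illustration, not DC_GPU's actual API or its GPU implementation.

```cpp
// CPU-only sketch of a divide-and-conquer skeleton interface in C++.
#include <functional>
#include <iostream>
#include <vector>

template <typename Problem, typename Result>
Result divide_and_conquer(
    const Problem& p,
    const std::function<bool(const Problem&)>& is_base,
    const std::function<Result(const Problem&)>& solve_base,
    const std::function<std::vector<Problem>(const Problem&)>& divide,
    const std::function<Result(const std::vector<Result>&)>& combine) {
    if (is_base(p)) return solve_base(p);
    std::vector<Result> partial;
    for (const Problem& sub : divide(p))     // sub-problems are independent and
        partial.push_back(                   // could be solved in parallel (e.g. on a GPU)
            divide_and_conquer<Problem, Result>(sub, is_base, solve_base, divide, combine));
    return combine(partial);
}

int main() {
    using Problem = std::vector<int>;
    // Instantiate the skeleton for a simple reduction (summation).
    auto is_base = [](const Problem& p) { return p.size() <= 2; };
    auto solve_base = [](const Problem& p) {
        int s = 0; for (int x : p) s += x; return s;
    };
    auto divide = [](const Problem& p) {
        std::size_t mid = p.size() / 2;
        return std::vector<Problem>{Problem(p.begin(), p.begin() + mid),
                                    Problem(p.begin() + mid, p.end())};
    };
    auto combine = [](const std::vector<int>& r) {
        int s = 0; for (int x : r) s += x; return s;
    };
    Problem input{1, 2, 3, 4, 5, 6, 7, 8};
    std::cout << divide_and_conquer<Problem, int>(input, is_base, solve_base,
                                                  divide, combine) << "\n";
}
```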
In this paper we analyze the teaching and learning of parallel processing through performance analysis using a software tool called Prober. This tool is a functional and performance analyzer of parallel programs that we proposed and developed during an undergraduate research project. Our teaching and learning approach consists of a practical class where students receive explanations about some concepts of parallel processing and the use of the tool. They carry out simple, guided performance tests on parallel programs and analyze the results using Prober as the sole supporting tool. Finally, students answer a self-assessment questionnaire about their educational background, their knowledge of parallel processing concepts, and the usability of Prober. Our main goal is to show that students can learn concepts of parallel processing in a clearer, faster and more efficient way using our approach.
The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis as it is a basic component of many methods used in these areas. In this work, we present a minimalistic...
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine GEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as a metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix and 128 Trident lanes, the speedup of using matrix-vector operations in the standard Golub and Kahan algorithm is around 1.5 times over using vector operations. However, using matrix operations in the GEBRD subroutine gives a speedup of around 3 times over vector operations, and 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.
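To make the structure of the algorithm concrete, the following plain C++ sketch performs Golub-Kahan Householder bidiagonalization of a small dense matrix using only vector and matrix-vector style loops; the mapping onto Trident lanes, the ring, and the communication registers discussed in the paper is not reproduced, and the routine names are illustrative.

```cpp
// Plain C++ sketch of Golub-Kahan Householder bidiagonalization (m >= n).
#include <cmath>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Left reflector: zero A(k+1:m, k) by applying H = I - 2 v v^T / (v^T v).
static void house_left(Matrix& A, int k, int m, int n) {
    double norm = 0;
    for (int i = k; i < m; ++i) norm += A[i][k] * A[i][k];
    norm = std::sqrt(norm);
    if (norm == 0) return;
    double alpha = (A[k][k] > 0 ? -norm : norm);
    std::vector<double> v(m, 0.0);
    for (int i = k; i < m; ++i) v[i] = A[i][k];
    v[k] -= alpha;
    double vtv = 0;
    for (int i = k; i < m; ++i) vtv += v[i] * v[i];
    if (vtv == 0) return;
    for (int j = k; j < n; ++j) {             // matrix-vector style update
        double s = 0;
        for (int i = k; i < m; ++i) s += v[i] * A[i][j];
        s = 2.0 * s / vtv;
        for (int i = k; i < m; ++i) A[i][j] -= s * v[i];
    }
}

// Right reflector: zero A(k, k+2:n) by applying the reflector from the right.
static void house_right(Matrix& A, int k, int m, int n) {
    double norm = 0;
    for (int j = k + 1; j < n; ++j) norm += A[k][j] * A[k][j];
    norm = std::sqrt(norm);
    if (norm == 0) return;
    double alpha = (A[k][k + 1] > 0 ? -norm : norm);
    std::vector<double> v(n, 0.0);
    for (int j = k + 1; j < n; ++j) v[j] = A[k][j];
    v[k + 1] -= alpha;
    double vtv = 0;
    for (int j = k + 1; j < n; ++j) vtv += v[j] * v[j];
    if (vtv == 0) return;
    for (int i = k; i < m; ++i) {
        double s = 0;
        for (int j = k + 1; j < n; ++j) s += A[i][j] * v[j];
        s = 2.0 * s / vtv;
        for (int j = k + 1; j < n; ++j) A[i][j] -= s * v[j];
    }
}

int main() {
    Matrix A{{4, 1, 2}, {2, 3, 1}, {1, 2, 5}, {3, 1, 4}};   // 4x3 example
    const int m = 4, n = 3;
    for (int k = 0; k < n; ++k) {
        house_left(A, k, m, n);
        if (k < n - 2) house_right(A, k, m, n);
    }
    for (int i = 0; i < m; ++i) {            // result is upper bidiagonal
        for (int j = 0; j < n; ++j) std::printf("%8.4f ", A[i][j]);
        std::printf("\n");
    }
}
```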
ISBN (print): 0780373715
The recent popularity of the Java programming language has brought automatic dynamic memory management (a.k.a. garbage collection) into the mainstream. Traditional garbage collectors suffer from long garbage collection pauses (stop-the-world mark-sweep algorithms) or an inability to collect cyclic garbage (reference counting approaches). Generational garbage collection, however, is based only on the weak generational hypothesis that most objects die young. In this paper, the performance evaluation of a new multithreaded concurrent generational garbage collector (MCGC) based on mark-sweep with the assistance of reference counting is reported. The MCGC can take advantage of multiple CPUs in an SMP system and of the merits of lightweight processes. Furthermore, the long garbage collection pause can be reduced and garbage collection efficiency can be enhanced. Measurement results indicate that the MCGC improves the garbage collection pause time by up to 96.75% over the traditional stop-the-world mark-sweep garbage collector. Moreover, the MCGC incurs minimal time and space penalties, as shown by the reported total execution time, memory footprint, and sticky reference count rate.
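For contrast with the concurrent collector evaluated above, here is a toy stop-the-world mark-sweep pass over a hand-built object graph. It is only meant to make the baseline's mark and sweep phases concrete (including the fact that it reclaims cyclic garbage, unlike pure reference counting); it has none of the MCGC's multithreading or generational machinery, and all names are illustrative.

```cpp
// Toy stop-the-world mark-sweep collector over a hand-built object graph.
#include <cstdio>
#include <vector>

struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;          // outgoing references
};

void mark(Obj* o) {                  // mark phase: trace reachable objects
    if (o == nullptr || o->marked) return;
    o->marked = true;
    for (Obj* r : o->refs) mark(r);
}

// Sweep phase: reclaim every unmarked object in the heap, clear marks.
void sweep(std::vector<Obj*>& heap) {
    std::vector<Obj*> live;
    for (Obj* o : heap) {
        if (o->marked) { o->marked = false; live.push_back(o); }
        else delete o;
    }
    heap.swap(live);
}

int main() {
    std::vector<Obj*> heap;
    for (int i = 0; i < 4; ++i) heap.push_back(new Obj);
    heap[0]->refs.push_back(heap[1]);   // root -> 1 stays reachable
    heap[2]->refs.push_back(heap[3]);   // 2 and 3 are unreachable garbage,
    heap[3]->refs.push_back(heap[2]);   // even though they form a cycle
    Obj* root = heap[0];
    mark(root);                          // the whole mutator is paused here
    sweep(heap);
    std::printf("live objects: %zu\n", heap.size());
    for (Obj* o : heap) delete o;        // cleanup
}
```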
In order to improve the performance of applications on OpenMP/JIAJIA, we present a new abstraction, Array Relation Vector (ARV), to describe the relation between the data elements of two consistent shared arrays accessed in one computation phase. Based on ARV, we use array grouping to eliminate the pseudo data distribution of small shared data and improve page locality. Experimental results show that ARV-based array grouping can greatly improve the performance of applications with non-contiguous data access and strict access affinity on the OpenMP/JIAJIA cluster. For applications with small shared arrays, array grouping can improve performance noticeably when the number of processors is small.
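The effect of array grouping can be illustrated with a small C++ example: two arrays that a computation phase always indexes together are fused into a single array of records, so each iteration's accesses fall in one contiguous region rather than on two possibly distant pages. The ARV analysis that decides when this transformation applies is not shown, and the example names are assumptions for illustration.

```cpp
// Illustration of array grouping: fuse two arrays accessed with the same index.
#include <cstdio>
#include <vector>

int main() {
    const int N = 8;

    // Before grouping: two separate shared arrays, possibly on distant pages.
    std::vector<double> a(N, 1.0), b(N, 2.0), c1(N);
    for (int i = 0; i < N; ++i) c1[i] = a[i] + b[i];        // a[i] and b[i] far apart

    // After grouping: elements with the same index live next to each other.
    struct Pair { double a, b; };
    std::vector<Pair> ab(N, {1.0, 2.0});
    std::vector<double> c2(N);
    for (int i = 0; i < N; ++i) c2[i] = ab[i].a + ab[i].b;  // one contiguous access

    std::printf("%f %f\n", c1[0], c2[0]);
}
```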