We present a case for automating the selection of MPI-IO performance optimizations, with the ultimate goal of relieving the application programmer of these details and thereby improving productivity. Programmer productivity has always been overlooked in the high performance computing community compared to performance optimization. In this paper we present RFSA, a reduced function set abstraction based on an existing parallel programming interface for I/O (MPI-IO). MPI-IO provides high performance I/O function calls to the scientists and engineers writing parallel programs, but requires them to choose the most appropriate optimization of a specific function, which limits programmer productivity. We therefore propose a set of reduced functions with an automatic selection algorithm that decides which specific MPI-IO function to use. We implement a selection algorithm for I/O functions such as read and write. RFSA replaces six different flavors of the read and write functions with one read and one write function. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show that the RFSA functions impose minimal performance penalties.
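As a rough illustration of the idea (not the paper's actual API), a reduced write call could dispatch to a specific MPI-IO flavor from a few access-pattern hints. All names here (`Variant`, `select_write_variant`, the hint parameters) are hypothetical stand-ins for the RFSA selection logic:

```python
from enum import Enum, auto

class Variant(Enum):
    # Hypothetical stand-ins for MPI-IO write flavors
    INDEPENDENT = auto()        # e.g. MPI_File_write_at
    COLLECTIVE = auto()         # e.g. MPI_File_write_at_all
    NONBLOCKING = auto()        # e.g. MPI_File_iwrite_at
    SPLIT_COLLECTIVE = auto()   # e.g. MPI_File_write_at_all_begin/end

def select_write_variant(n_procs, contiguous, overlap_compute):
    """Toy selection heuristic in the spirit of RFSA: map one generic
    write() onto a specific MPI-IO flavor from simple access hints."""
    if n_procs > 1 and not contiguous:
        # Many small interleaved requests benefit from collective buffering.
        return Variant.SPLIT_COLLECTIVE if overlap_compute else Variant.COLLECTIVE
    if overlap_compute:
        return Variant.NONBLOCKING
    return Variant.INDEPENDENT
```

The point is that the caller sees a single function; the cost of choosing among six flavors moves into one documented decision procedure.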
In this paper we analyze the teaching and learning of parallel processing through performance analysis using a software tool called Prober. This tool is a functional and performance analyzer of parallel programs that we proposed and developed during an undergraduate research project. Our teaching and learning approach consists of a practical class in which students receive explanations of some concepts of parallel processing and of the use of the tool. They then perform simple, guided performance tests on parallel programs and analyze the results using Prober as their sole aid. Finally, students answer a self-assessment questionnaire about their educational background, their knowledge of parallel processing concepts, and the usability of Prober. Our main goal is to show that, with our approach, students can learn concepts of parallel processing in a clearer, faster, and more efficient way.
ISBN: (Print) 9781728129877
This work is devoted to the problem of detecting and handling faults of computing nodes during the execution of parallel programs on distributed computing systems. The fault tolerance tools of PBS/TORQUE are considered, and a functional model for optimizing fault handling is proposed.
ISBN: (Print) 9781450328210
A data-graph computation — popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi — is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. PRISM uses a multibag data structure to maintain the dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let G=(V,E) be a degree-Δ graph colored with χ colors, and suppose that Q⊆V is the set of active vertices in a round. Define size(Q) = |Q| + Σ_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of PRISM performs the updates in Q using O(χ(lg(|Q|/χ) + lg Δ) + lg P) span and Θ(size(Q) + χ + P) work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2–2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is
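The chromatic-scheduling idea can be sketched in a few lines. This is a serial toy model, not the paper's implementation: vertices in one color class are pairwise non-adjacent, so their updates never conflict and could safely run in parallel. The names (`greedy_color`, `prism_round`, `relax_max`) are illustrative only:

```python
from collections import defaultdict

def greedy_color(adj):
    """Greedy vertex coloring: neighbors never share a color."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def prism_round(adj, color, data, active, update):
    """One round in the style of PRISM: bucket the active vertices by
    color (the 'multibag'), then process each color class in turn.
    Vertices of one class are non-adjacent, so within a class the
    updates are conflict-free and need no locks."""
    bags = defaultdict(list)
    for v in active:
        bags[color[v]].append(v)
    next_active = set()
    for c in sorted(bags):
        for v in bags[c]:              # safe to parallelize per color
            next_active |= set(update(v, data, adj))
    return next_active

def relax_max(v, data, adj):
    """Example update: take the max over the neighborhood and
    reactivate any neighbor that still lags behind."""
    new = max([data[v]] + [data[u] for u in adj[v]])
    if new != data[v]:
        data[v] = new
        return [u for u in adj[v] if data[u] < new]
    return []
```

Running `prism_round` until the active set empties propagates the maximum label over a connected graph, illustrating how the updates in one round determine the next round's active set.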
Parallelism is a suitable approach for speeding up the massive computations of applications, but parallel programming is still difficult. An algorithmic skeleton is a parallel programming model that provides a high level of abstraction for programmers, using pre-defined components to make parallel programming easier. Divide and conquer (DC) is an appropriate parallel pattern to implement as a skeleton: the solution of the original problem is obtained by dividing it into smaller sub-problems and solving them in parallel. Today, the graphics processing unit (GPU) is an attractive processor for performing tasks in parallel because it has a large number of processing units. In this paper, a divide and conquer skeleton on GPU, named OC_GPU, is proposed. OC_GPU is a divide and conquer skeleton implemented on GPU that uses a consistent programming interface in C++ for easier parallel programming. The performance of this skeleton has been evaluated with mergesort and Sobel edge detection. The results show that the speedup obtained by this skeleton on GPU is more than 2.
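The shape of such a skeleton is easy to show in miniature. In this sketch (sequential, and not the paper's C++/GPU interface; `dc_skeleton` and its parameter names are invented for illustration), the four user-supplied functions are the only problem-specific code, exactly as in the skeleton approach:

```python
def dc_skeleton(problem, is_base, base_solve, divide, combine):
    """Generic divide-and-conquer skeleton. A GPU implementation would
    launch the sub-problems in parallel; here they simply recurse."""
    if is_base(problem):
        return base_solve(problem)
    subs = divide(problem)
    return combine([dc_skeleton(s, is_base, base_solve, divide, combine)
                    for s in subs])

def merge(parts):
    """Combine step for mergesort: merge two sorted lists."""
    left, right = parts
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def mergesort(xs):
    """Mergesort expressed as an instance of the DC skeleton."""
    return dc_skeleton(
        xs,
        is_base=lambda p: len(p) <= 1,
        base_solve=lambda p: list(p),
        divide=lambda p: (p[:len(p) // 2], p[len(p) // 2:]),
        combine=merge,
    )
```

Mergesort here is one of the two benchmarks the abstract mentions; Sobel edge detection would plug different `divide`/`combine` functions into the same skeleton.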
The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis as it is a basic component of many methods used in these areas. In this work, we present a minimalistic...
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine GEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as the metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix and 128 Trident lanes, the speedup of using matrix-vector operations in the standard Golub and Kahan algorithm is around 1.5 times over using vector operations. Using matrix operations in the GEBRD subroutine, however, gives a speedup of around 3 times over vector operations, and 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.
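The core step of Golub-Kahan bidiagonalization is the Householder reflection, which zeroes all but the first entry of a column (or row). A minimal sketch of that one step, in plain Python rather than the paper's vector hardware (function names are ours): the reflector H = I − 2vvᵀ/(vᵀv) maps x to (±‖x‖, 0, …, 0), and applying it as a matrix-vector expression avoids ever forming H.

```python
from math import sqrt

def householder_vec(x):
    """Return (v, alpha) such that H = I - 2 v v^T / (v^T v) maps x to
    (alpha, 0, ..., 0). The sign of alpha is chosen opposite to x[0]
    to avoid cancellation when forming v."""
    alpha = -sqrt(sum(xi * xi for xi in x))
    if x[0] < 0:
        alpha = -alpha
    v = list(x)
    v[0] -= alpha
    return v, alpha

def apply_reflection(v, x):
    """Compute H x without forming H: H x = x - 2 (v^T x)/(v^T v) v."""
    vv = sum(vi * vi for vi in v)
    if vv == 0:
        return list(x)
    coef = 2 * sum(vi * xi for vi, xi in zip(v, x)) / vv
    return [xi - coef * vi for vi, xi in zip(v, x)]
```

Bidiagonalization alternates such reflections from the left (on columns) and the right (on rows); the matrix-vector-rich structure the abstract refers to comes from applying each reflector to the remaining submatrix.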
ISBN: (Print) 0780373715
The recent popularity of the Java programming language has brought automatic dynamic memory management (a.k.a. garbage collection) into the mainstream. Traditional garbage collectors suffer from long collection pauses (the stop-the-world mark-sweep algorithm) or an inability to collect cyclic garbage (the reference counting approach). Generational garbage collection, meanwhile, rests only on the weak generational hypothesis that most objects die young. In this paper, we report the performance evaluation of a new multithreaded concurrent generational garbage collector (MCGC) based on mark-sweep with the assistance of reference counting. The MCGC can take advantage of multiple CPUs in an SMP system and of the merits of lightweight processes. Furthermore, it reduces the long garbage collection pause and enhances garbage collection efficiency. Measurement results indicate that the MCGC improves the garbage collection pause time by up to 96.75% over the traditional stop-the-world mark-sweep garbage collector. Moreover, the MCGC incurs minimal time and space penalties, as shown by measurements of the total execution time, the memory footprint, and the sticky reference count rate.
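The trade-off the abstract names — mark-sweep collects cycles but pauses the world, reference counting is incremental but leaks cycles — is visible in even a toy collector. This sketch (our own illustration, not the MCGC) shows why mark-sweep is needed as a backstop: a cycle unreachable from the roots is reclaimed even though every object in it still holds a reference:

```python
class Obj:
    """A heap object in a toy object graph."""
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing references
        self.marked = False

def mark(obj):
    """Mark phase: transitively mark everything reachable."""
    if not obj.marked:
        obj.marked = True
        for r in obj.refs:
            mark(r)

def mark_sweep(heap, roots):
    """Minimal stop-the-world mark-sweep. Unlike pure reference
    counting, it reclaims cyclic garbage; the MCGC of the paper runs a
    concurrent multithreaded variant with reference-count assistance."""
    for obj in heap:
        obj.marked = False
    for r in roots:
        mark(r)
    return [o for o in heap if o.marked]   # survivors; the rest is swept
```

With a root pointing at `c` and an `a ↔ b` cycle off to the side, reference counts for `a` and `b` never drop to zero, yet mark-sweep correctly discards them.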
The increasing system complexity of SOC applications leads to an increasing requirement for powerful embedded DSP processors. To increase the computational power of DSP processors, the number of pipeline stages has been increased for higher frequencies, along with the number of instructions executed in parallel, to increase computational bandwidth. To program the parallel units, the VLIW (very long instruction word) was introduced. Programming all parallel units at the same time, however, leads either to an expanded program memory port or to the limitation that only a few units can be used in parallel. To overcome this limitation, this paper proposes a scalable long instruction word (xLIW). The xLIW concept allows full usage of the available units in parallel with optimal code density. An included instruction buffer reduces power dissipation at the program memory ports during loop handling. The xLIW concept is part of a development project for a configurable DSP.
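The density gain of a scalable instruction word can be illustrated with a toy encoding (our own sketch, not the actual xLIW format): instead of reserving one fixed field per functional unit as classic VLIW does, a small header bitmask records which units are active and only those fields are emitted.

```python
def pack_xliw(slots, n_units=8):
    """Toy xLIW-style packing: `slots` has one entry per functional
    unit (None = idle). Classic VLIW would emit all len(slots) fields;
    here a bitmask header plus only the active fields are emitted,
    which is where the code-density win comes from."""
    mask = 0
    body = []
    for i, op in enumerate(slots):
        if op is not None:
            mask |= 1 << i
            body.append(op)
    return mask, body

def unpack_xliw(mask, body, n_units=8):
    """Inverse: expand the compact word back to one slot per unit."""
    slots, it = [], iter(body)
    for i in range(n_units):
        slots.append(next(it) if mask & (1 << i) else None)
    return slots
```

A word using two of eight units costs a header plus two fields rather than eight fields, while a fully parallel word can still address every unit.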
In order to improve the performance of applications on OpenMP/JIAJIA, we present a new abstraction, the Array Relation Vector (ARV), to describe the relation between the data elements of two consistent shared arrays accessed in one computation phase. Based on ARV, we use array grouping to eliminate the pseudo distribution of small shared data and improve page locality. Experimental results show that ARV-based array grouping can greatly improve the performance of applications with non-contiguous data access and strict access affinity on an OpenMP/JIAJIA cluster. For applications with small shared arrays, array grouping improves performance noticeably when the number of processors is small.
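The grouping transformation itself is simple to picture. As a hedged sketch (function names are ours; the real system works on shared pages in a software DSM, not Python lists): two arrays whose elements `a[i]` and `b[i]` are always touched together in a phase are interleaved into one allocation, so each access pair lands on the same page instead of two distant ones.

```python
def group_arrays(a, b):
    """Array grouping in the spirit of ARV: interleave two arrays that
    are accessed element-wise together, so a[i] and b[i] become
    neighbors in memory (one page fault instead of two per pair)."""
    assert len(a) == len(b)
    return [(x, y) for x, y in zip(a, b)]

def ungroup(grouped):
    """Recover the two logical arrays from the grouped layout."""
    a = [x for x, _ in grouped]
    b = [y for _, y in grouped]
    return a, b
```

This is the classic structure-of-arrays to array-of-structures trade, applied here to reduce pseudo distribution of small shared arrays across pages.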