We present several heterogeneous partitioning algorithms for parallel numerical applications. The goal is to adapt the partitioning to dynamic and unpredictable load changes on the nodes. The methods are based on existing homogeneous algorithms such as orthogonal recursive bisection, parallel strips, and scattering. We apply these algorithms to a parallel numerical application on a network of heterogeneous workstations. The behavior of the individual methods in a system with dynamic load changes and heterogeneous nodes is investigated. In addition, the new methods are compared with the conventional methods for homogeneous partitioning.
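As an illustration of the general idea (not the paper's actual algorithms), a heterogeneous variant of the parallel-strips method can size each node's strip in proportion to an estimate of that node's current speed; the function and speed values below are hypothetical:

```python
def heterogeneous_strips(num_rows, node_speeds):
    """Split num_rows grid rows into contiguous strips whose sizes are
    proportional to the current speed estimate of each node, so that faster
    (or less loaded) nodes receive proportionally more work."""
    strips, start = [], 0
    remaining_rows, remaining_speed = num_rows, sum(node_speeds)
    for speed in node_speeds:
        rows = round(remaining_rows * speed / remaining_speed)
        strips.append((start, start + rows))
        start += rows
        remaining_rows -= rows
        remaining_speed -= speed
    return strips

# Three workstations, the second currently about twice as fast as the others.
print(heterogeneous_strips(100, [1.0, 2.0, 1.0]))  # [(0, 25), (25, 75), (75, 100)]
```

Repartitioning with fresh speed estimates every few time steps is one simple way to track the dynamic load changes the abstract refers to.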
Some important issues in engineering the requirements of a distributed software system and methods that facilitate software system design for distributed or parallel implementations are discussed. The issues are presented from a knowledge engineering perspective and are divided into four levels: acquisition; representation; structuring; and design. The acquisition level entails the methods for eliciting system requirements data (attributes and relationships of software entities) from the end-user group using a model of context classes. The representation level deals with the language paradigm for representing the attributes and relationships of the software entities. The structuring level addresses methods for rearranging and grouping the software objects of the context classes into related clusters. The design level deals with methods for mapping or transforming the clusters of software objects into specification modules to facilitate distributed design. To this end, the design level uses an object-based paradigm for specifying the attributes and abstract behavior of the objects within the modules.
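A rough sketch of the structuring and design levels, under the simplifying assumption that software entities and their relationships are given as a graph and that a cluster is just a connected group of entities (the data and helper names are illustrative, not the method's own notation):

```python
from collections import defaultdict

def cluster_entities(relationships):
    """Structuring step: entities connected by any relationship end up in the
    same cluster (a simple connected-components grouping)."""
    neighbours = defaultdict(set)
    for a, b in relationships:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for entity in neighbours:
        if entity in seen:
            continue
        stack, cluster = [entity], set()
        while stack:
            e = stack.pop()
            if e in cluster:
                continue
            cluster.add(e)
            stack.extend(neighbours[e] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

def to_modules(clusters, attributes):
    """Design step: map each cluster to a specification module listing the
    attributes of the objects it contains."""
    return [{obj: attributes.get(obj, []) for obj in sorted(c)} for c in clusters]

relationships = [("Order", "Customer"), ("Order", "Invoice"), ("Sensor", "Logger")]
attributes = {"Order": ["id", "date"], "Customer": ["name"], "Sensor": ["unit"]}
print(to_modules(cluster_entities(relationships), attributes))
```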
The paper focuses on the implementation of an edge point chaining algorithm under the data parallel programming model. This implementation is not a straightforward transposition of the classical algorithms developed so far; indeed, all of those are based on video scanning of the image and are thus sequential by nature. Therefore, a new data parallel algorithm has been designed. The principle of the data parallel implementation is detailed. The implementation technique is analogous to the parallel region growing algorithm.
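The abstract does not spell out the algorithm; one common data parallel formulation of chaining, sketched below purely for illustration, is iterative label propagation, in which every edge pixel simultaneously adopts the smallest label among its edge neighbours until labels stabilise (this is also how parallel region growing is often expressed):

```python
import numpy as np

def chain_edge_points(edges, max_iters=10_000):
    """Illustrative data parallel edge chaining: every edge pixel gets a unique
    label and repeatedly takes the minimum label over its 8-connected edge
    neighbours; connected edge points converge to a common chain label."""
    h, w = edges.shape
    big = h * w                                    # sentinel for non-edge pixels
    labels = np.where(edges, np.arange(h * w).reshape(h, w), big)
    for _ in range(max_iters):
        padded = np.pad(labels, 1, constant_values=big)
        # All pixels examine their neighbours at once: one data parallel step.
        neighbour_min = np.min(np.stack([
            padded[i:i + h, j:j + w]
            for i in range(3) for j in range(3) if (i, j) != (1, 1)
        ]), axis=0)
        new_labels = np.where(edges, np.minimum(labels, neighbour_min), big)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return np.where(edges, labels, -1)             # -1 marks non-edge pixels

edge_image = np.zeros((4, 6), dtype=bool)
edge_image[1, 1:5] = True                          # one chain of four edge points
edge_image[3, 0:2] = True                          # a second, separate chain
print(chain_edge_points(edge_image))
```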
ISBN (print): 9780818670886
In this paper we study the problem of scheduling parallel loops at compile time for a heterogeneous network of machines. We consider heterogeneity in three aspects of parallel programming: program, processor, and network. A heterogeneous program has parallel loops with a different amount of work in each iteration; heterogeneous processors have different speeds; and a heterogeneous network has different communication costs between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms for generating optimal and sub-optimal loop schedules for load balancing, communication optimization, and network contention. Experiments show that a significant performance improvement is achieved using our techniques.
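As a toy illustration of the load-balancing part of such a model (the cost model and names below are assumptions, and communication and contention are ignored), loop iterations can be assigned in contiguous chunks so that each processor's estimated time, assigned work divided by speed, is roughly equal:

```python
def schedule_loop(iter_work, proc_speeds):
    """Compile-time sketch: split iterations 0..N-1 into contiguous chunks so
    that each processor's estimated time (assigned work / speed) is balanced.
    iter_work[i] is the estimated work of iteration i; proc_speeds[p] is the
    relative speed of processor p."""
    total_work = sum(iter_work)
    total_speed = sum(proc_speeds)
    chunks, start, done_work = [], 0, 0.0
    for p, speed in enumerate(proc_speeds):
        # Processor p stops once its cumulative share of the work is reached;
        # the last processor always takes whatever is left.
        target = done_work + total_work * speed / total_speed
        end, acc = start, done_work
        while end < len(iter_work) and (
                acc + iter_work[end] <= target or p == len(proc_speeds) - 1):
            acc += iter_work[end]
            end += 1
        chunks.append(range(start, end))
        start, done_work = end, acc
    return chunks

# Triangular loop (iteration i does i units of work) on a fast and a slow machine.
work = [i for i in range(10)]
print([list(c) for c in schedule_loop(work, [2.0, 1.0])])
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9]]
```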
In this paper we analyze the teaching and learning of parallel processing through performance analysis using a software tool called Prober. This tool is a functional and performance analyzer of parallel programs that we proposed and developed during an undergraduate research project. Our teaching and learning approach consists of a practical class where students receive explanations about some concepts of parallel processing and about the use of the tool. They perform simple, guided performance tests on parallel programs and analyze the results using Prober as the sole supporting tool. Finally, students answer a self-assessment questionnaire about their background, their knowledge of parallel processing concepts, and the usability of Prober. Our main goal is to show that students can learn concepts of parallel processing in a clearer, faster, and more efficient way using our approach.
ISBN (print): 9781728129877
This work is devoted to the problem of detecting and handling faults of computing nodes during the execution of parallel programs on distributed computing systems. The fault tolerance tools of PBS/TORQUE are considered. A functional model for optimizing fault handling is proposed.
ISBN (print): 9781450328210
A data-graph computation — popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi — is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round.

This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by PRISM to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let G=(V,E) be a degree-Δ graph colored with χ colors, and suppose that Q⊆V is the set of active vertices in a round. Define size(Q) = |Q| + Σ_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of PRISM performs the updates in Q using O(χ(lg(|Q|/χ) + lg Δ) + lg P) span and Θ(size(Q) + χ + P) work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2–2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior.

This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is
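A compact serial sketch of the chromatic-scheduling idea (the real PRISM is a parallel runtime; the helper names here are illustrative): active vertices are kept in per-colour buckets, and because same-coloured vertices are never adjacent, all updates within one bucket are independent and could run in parallel without locks.

```python
from collections import defaultdict

def chromatic_schedule(adj, color, update, active):
    """adj[v] lists v's neighbours, color[v] is a valid vertex colouring, and
    update(v) updates v's data and returns the neighbours it activates for the
    next round.  Colours are processed one at a time; vertices of one colour
    form an independent set, so that inner loop is the parallelisable step."""
    active = set(active)
    while active:
        # Multibag-style structure: active vertices partitioned by colour.
        buckets = defaultdict(list)
        for v in active:
            buckets[color[v]].append(v)
        next_active = set()
        for c in sorted(buckets):          # one colour at a time
            for v in buckets[c]:           # independent updates within a colour
                next_active.update(update(v))
        active = next_active

# Toy use: propagate a minimum over a 2-coloured path graph.
adj = {0: [1], 1: [0, 2], 2: [1]}
color = {0: 0, 1: 1, 2: 0}
val = {0: 5, 1: 9, 2: 7}

def update(v):
    new = min([val[v]] + [val[u] for u in adj[v]])
    changed = new != val[v]
    val[v] = new
    return adj[v] if changed else []

chromatic_schedule(adj, color, update, active=adj.keys())
print(val)   # all values converge to the minimum, 5
```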
Parallelism is a suitable approach for speeding up the massive computations of applications, but parallel programming is still difficult. An algorithmic skeleton is a parallel programming model that provides a high level of abstraction for programmers. This approach uses pre-defined components to make parallel programming easier. Divide and conquer (DC) is an appropriate parallel pattern for implementation as a skeleton: the solution of the original problem is obtained by dividing it into smaller sub-problems and solving them in parallel. Today, the graphics processing unit (GPU) is an attractive computational processor for performing tasks in parallel, because it has a large number of processing units. In this paper, a divide and conquer skeleton on the GPU has been proposed, named OC_GPU. OC_GPU is a divide and conquer skeleton implemented on the GPU that uses a consistent programming interface in C++ for easier parallel programming. The performance of this skeleton has been evaluated with merge sort and Sobel edge detection. The results show that the speedup obtained with this skeleton is more than 2 on the GPU.
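The skeleton itself is a C++/GPU component; the sketch below only illustrates the generic divide and conquer skeleton pattern (sequentially, in Python) and instantiates it as merge sort, one of the two benchmarks mentioned:

```python
def dc_skeleton(problem, is_basic, solve_basic, divide, combine):
    """Generic divide and conquer skeleton: the user supplies the four
    problem-specific operations and the skeleton supplies the control
    structure.  (In the paper the independent sub-problems are solved in
    parallel on the GPU; here the recursion is shown sequentially.)"""
    if is_basic(problem):
        return solve_basic(problem)
    subproblems = divide(problem)
    # These recursive calls are independent: this is the parallel step.
    subsolutions = [dc_skeleton(p, is_basic, solve_basic, divide, combine)
                    for p in subproblems]
    return combine(subsolutions)

# Instantiating the skeleton as merge sort.
def merge(parts):
    left, right = parts
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(dc_skeleton([5, 2, 9, 1, 7, 3],
                  is_basic=lambda xs: len(xs) <= 1,
                  solve_basic=lambda xs: xs,
                  divide=lambda xs: [xs[:len(xs) // 2], xs[len(xs) // 2:]],
                  combine=merge))   # [1, 2, 3, 5, 7, 9]
```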
The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis as it is a basic component of many methods used in these areas. In this work, we present a minimalistic...
Increasing system complexity of SoC applications leads to an increasing demand for powerful embedded DSP processors. To increase the computational power of DSP processors, the number of pipeline stages has been increased to allow higher frequencies, and the number of instructions executed in parallel has been increased to raise the computational bandwidth. To program the parallel units, the VLIW (very long instruction word) has been introduced. Programming all parallel units at the same time, however, leads either to an expanded program memory port or to the limitation that only a few units can be used in parallel. To overcome this limitation, this paper proposes a scalable long instruction word (xLIW). The xLIW concept allows full usage of the available units in parallel with optimal code density. An included instruction buffer reduces the power dissipation at the program memory ports during loop handling. The xLIW concept is part of a development project for a configurable DSP.
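The abstract does not give the encoding, but the code-density argument can be illustrated with a hypothetical variable-length instruction word in which a unit-usage bit mask is followed only by the opcodes of the units actually used:

```python
def pack_xliw(bundle, num_units=8, op_bits=24):
    """Illustrative variable-length instruction word encoder: bundle is a dict
    {unit_index: opcode}.  The word stores a unit-usage bit mask followed only
    by the opcodes of the units actually used, so a bundle that keeps two units
    busy costs far less program memory than a fixed 8-slot VLIW word."""
    mask, word = 0, 0
    shift = num_units                  # opcodes are appended after the mask bits
    for unit in sorted(bundle):
        mask |= 1 << unit
        word |= (bundle[unit] & ((1 << op_bits) - 1)) << shift
        shift += op_bits
    word |= mask
    return word, shift                 # encoded word and its length in bits

# A bundle using 2 of 8 units: 8 + 2*24 = 56 bits instead of 8*24 = 192.
word, nbits = pack_xliw({0: 0x12ABCD, 3: 0x0F00FF})
print(nbits, hex(word))
```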