Distributed shared memory (DSM) systems could overcome major obstacles to the widespread use of distributed-memory multiprocessors while retaining the low cost and good scalability common to distributed-memory machines. A DSM system allows a natural and portable programming model on distributed-memory machines, making it possible to build a relatively inexpensive and scalable parallel system on which programmers can develop parallel application codes. Because of these potential advantages, DSM has received increasing attention. This panel addresses the challenges in building efficient DSM systems for a wide range of applications.
As simulation technologies shift noticeably from serial to parallel programming, it is increasingly important to understand how different parallel programming paradigms interact. We discuss some of the resulting issues in the context of transforming a shared-memory parallel program with two nested levels of parallelism into a hybrid parallel program. Here, hybrid programming refers to a combination of shared and distributed memory. In particular, we focus on performance aspects of shared-memory parallel programming when the time to access a memory location varies from thread to thread. Rather than analyzing these issues in general, this position paper focuses on a particular case study from geothermal reservoir engineering.
We present a new mechanism-oriented memory model called Commit-Reconcile & Fences (CRF) and define it using algebraic rules. Many existing memory models can be described as restricted versions of CRF. The model has been designed so that it is both easy for architects to implement and stable enough to serve as a target machine interface for compilers of high-level languages. The CRF model exposes a semantic notion of caches (saches) and decomposes load and store instructions into finer-grain operations. We sketch how to integrate CRF into modern microprocessors and outline an adaptive coherence protocol to implement CRF in distributed shared-memory systems. CRF offers an upward-compatible way to design next-generation computer systems.
Parallel computing systems have been based on multicore CPUs and specialized coprocessors, such as GPUs. Work stealing is a scheduling technique used to distribute and redistribute the workload among resources efficiently. This work proposes, implements, and validates a scheduling approach based on work stealing for parallel systems that use CPUs and GPUs simultaneously. Results show that our approach, called WORMS, performs competitively with the reference tool for multicore CPUs (Cilk). In the hybrid scenario, WORMS with multicore+GPU outperforms WORMS and Cilk with multicore only, as well as the GPU reference tool (Thrust).
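The general work-stealing idea mentioned in this abstract can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions and not the WORMS scheduler: the function name `run_work_stealing`, the round-robin stepping, and the random victim selection are all hypothetical choices made for illustration, and no distinction between CPU and GPU workers is modeled.

```python
from collections import deque
import random


def run_work_stealing(tasks_per_worker, seed=0):
    """Simulate work stealing over per-worker double-ended queues.

    Hypothetical illustration: each worker owns a deque. An owner takes
    work from the back of its own deque; an idle worker steals from the
    front of a randomly chosen non-empty victim deque.
    Returns (tasks executed by each worker, number of steals).
    """
    rng = random.Random(seed)
    queues = [deque(tasks) for tasks in tasks_per_worker]
    done = [[] for _ in queues]
    steals = 0

    # One loop iteration is one scheduling round over all workers.
    while any(queues):
        for worker, queue in enumerate(queues):
            if queue:
                # Owner side: newest task from its own end.
                done[worker].append(queue.pop())
            else:
                # Thief side: oldest task from the opposite end of a victim.
                victims = [v for v, vq in enumerate(queues) if vq]
                if victims:
                    victim = rng.choice(victims)
                    done[worker].append(queues[victim].popleft())
                    steals += 1
    return done, steals


if __name__ == "__main__":
    executed, steals = run_work_stealing([list(range(8)), [], []])
    print(executed, steals)
```

Taking from opposite ends of the deque is the usual design choice in work-stealing schedulers because it keeps owners and thieves from contending for the same task; a real implementation would run workers concurrently and use a synchronized deque.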
A software architecture for dynamic balancing of calculations in heterogeneous computing clusters is proposed. Existing systems for automatic and semiautomatic parallelization are analyzed. The software was implemented in C++ and used to parallelize algorithms with different data dependencies between processors. DDCI (dynamic distribution calculations interface) is compared with MPI. Conclusions are drawn about the advantages of DDCI and directions for future work.
The authors develop graph-theoretical techniques using automorphisms to solve the problem of reconfiguring embedded task graphs in faulty hypercubes. Four cases are considered: a single node or link failure, a small number (<5) of random node failures, multiple adjacent failures, and a large number (≥5) of random node failures. For a single node (or link) failure, any task graph embedded in a hypercube can be remapped to another instance of that graph by simple bit-flip operations. It takes at most O(n) such operations in an n-dimensional hypercube. For a small number of faults (<5) and for multiple adjacent node failures, necessary and sufficient conditions are derived under which the remapping can be achieved with these techniques. It is also shown that the general problem of reconfiguring around multiple failures is as hard as the graph isomorphism problem. However, an algorithm is developed to determine whether the reconfiguration can be achieved with the graph-theoretical techniques when the necessary conditions are satisfied.
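The single-node-failure case can be illustrated with a short sketch. It relies on one well-known fact: XOR-ing every node label of a hypercube with a fixed mask is an automorphism, so it preserves adjacency. The code below is an illustration under assumptions, not the authors' algorithm: the names `remap_around_fault` and `is_adjacent` are hypothetical, and the sketch assumes at least one hypercube node is unused by the embedding.

```python
def is_adjacent(a, b):
    """Two hypercube nodes are adjacent iff their labels differ in exactly one bit."""
    diff = a ^ b
    return diff != 0 and diff & (diff - 1) == 0


def remap_around_fault(embedding, faulty, n):
    """Remap a task-to-node embedding in an n-cube so it avoids `faulty`.

    Hypothetical sketch: pick any unused node `spare` and flip the bits in
    mask = faulty XOR spare on every assigned label. The mask has at most
    n set bits, i.e. at most n bit-flip operations. The new image cannot
    contain `faulty`, because that would require `spare` to have been used.
    """
    used = set(embedding.values())
    if faulty not in used:
        return dict(embedding)  # nothing to do
    for spare in range(1 << n):
        if spare not in used:
            mask = faulty ^ spare
            return {task: node ^ mask for task, node in embedding.items()}
    raise ValueError("embedding uses every node; no spare node to remap onto")


if __name__ == "__main__":
    # A 4-task ring embedded on one face of a 3-cube; node 3 fails.
    ring = {0: 0b000, 1: 0b001, 2: 0b011, 3: 0b010}
    print(remap_around_fault(ring, faulty=0b011, n=3))
```

Because the same mask is applied to every label, any two tasks that were neighbors before the remapping are still neighbors afterwards.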
This paper implements basic computational kernels of scientific computing, such as the matrix-vector product, the matrix product, and Gaussian elimination, on multi-core platforms using several parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and Fast Flow. The aim of this paper is to present a unified quantitative and qualitative study of these tools for the parallel computation of scientific computing kernels on multicore platforms. Based on this study, we conclude that the Intel ArBB and SWARM parallel programming tools are the most appropriate because they combine good performance with simplicity of programming.
Analyses three of the most widely used message passing environments for parallel systems, namely Occam/TDS for Transputer-based systems, Vertex-Vortex for nCUBE, and Express, which is an architecture-independent message passing system. The aim of our analysis is to contrast the features provided by the different systems and to select the most suitable one with respect to three criteria: program portability, efficiency, and flexibility of the message passing system. Program portability should ensure that today's parallel software can be used tomorrow on new parallel architectures, thus preserving our investment. Efficiency of the application is a must for parallel software. By flexibility, we mean that the message passing system can accommodate different parallel programming paradigms (such as pipelined and data parallel) in a natural way. The use of a general-purpose message passing system, like those considered, is also contrasted with the use of an application-oriented high-level notation.
Designing reconfigurable systems that beneficially exploit the spatial and temporal domains is a cumbersome task that is hardly supported by current design methods. In particular, if we aim to bridge the gap between the application and the reconfigurable substrate, we require concrete concepts that make it possible to utilize the inherent parallelism and adaptiveness of reconfigurable devices. We propose algorithmic skeletons as a sophisticated technique for this purpose. Algorithmic skeletons are programming templates for the parallel computing domain that separate the structure of a computation from the computation itself. Hence, they offer a means to extract the temporal and spatial characteristics of an application, which can be used to make reconfigurability explicit. In this work, we present the conceptual background as well as a concrete means of implementing the method.
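The separation of structure from computation that skeletons provide can be shown with a small software analogue. This sketch is not the paper's implementation and does not target reconfigurable hardware: the names `farm` and `pipeline` are hypothetical, and thread pools stand in for the parallel resources.

```python
from concurrent.futures import ThreadPoolExecutor


def farm(worker, workers=4):
    """Skeleton: apply `worker` to every item independently (spatial parallelism).

    Only the structure is fixed here; the computation is the `worker` argument.
    """
    def run(items):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(worker, items))
    return run


def pipeline(*stages):
    """Skeleton: feed the output of each stage into the next (temporal structure)."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


if __name__ == "__main__":
    square_then_increment = pipeline(
        farm(lambda x: x * x),
        farm(lambda x: x + 1),
    )
    print(square_then_increment([1, 2, 3]))
```

Because the skeletons expose which parts run side by side (`farm`) and which run in sequence (`pipeline`), a tool could in principle read those characteristics from the composition without inspecting the worker functions.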
In distributed Java environments, locality of objects and threads is crucial for the performance of parallel applications. We introduce dynamic locality optimizations in the context of JavaParty, a programming and runtime environment for parallel Java applications. Until now, an optimal distribution of the individual objects of an application has had to be found manually, which has several drawbacks. Building on an earlier static approach, we develop a dynamic methodology for automatic locality optimizations. By measuring the processing and communication times of remote method calls at runtime, a placement strategy can be computed that maps each object of the distributed system to its optimal virtual machine. Objects are then migrated between the processing nodes to realize this placement strategy. We evaluate our approach by comparing the performance of two benchmark applications with manually distributed versions. The results show that our approach is particularly suitable for dynamic applications in which the optimal object distribution varies at runtime.