Distributed shared memory (DSM) systems could overcome major obstacles to the widespread use of distributed-memory multiprocessors while retaining the low cost and good scalability common to distributed-memory machines. A DSM system allows a natural and portable programming model on distributed-memory machines, making it possible to construct a relatively inexpensive and scalable parallel system on which programmers can develop parallel application codes. Because of these potential advantages, DSM has received increasing attention. This panel addresses and discusses the challenges in building efficient DSM systems for a wide range of applications.
As simulation technologies shift noticeably from serial to parallel programming, it is increasingly important to understand the interplay of different parallel programming paradigms. We discuss some of the related issues in the context of transforming a shared-memory parallel program with two nested levels of parallelism into a hybrid parallel program. Here, hybrid programming refers to a combination of shared and distributed memory. In particular, we focus on performance aspects of shared-memory parallel programming when the time to access a memory location varies between threads. Rather than analyzing these issues in general, this position paper focuses on a particular case study from geothermal reservoir engineering.
We present a new mechanism-oriented memory model called Commit-Reconcile & Fences (CRF) and define it using algebraic rules. Many existing memory models can be described as restricted versions of CRF. The model has been designed to be both easy for architects to implement and stable enough to serve as a target machine interface for compilers of high-level languages. The CRF model exposes a semantic notion of caches, called saches, and decomposes load and store instructions into finer-grain operations. We sketch how to integrate CRF into modern microprocessors and outline an adaptive coherence protocol to implement CRF in distributed shared-memory systems. CRF offers an upward-compatible way to design next-generation computer systems.
A software architecture for dynamic balancing of calculations in heterogeneous computing clusters is proposed. Existing systems for automatic and semiautomatic parallelization are analyzed. The software was implemented in C++ and used to parallelize algorithms with different data dependencies between processors. DDCI (dynamic distribution calculations interface) is compared with MPI. Conclusions are drawn about the advantages of DDCI and directions for future work.
The authors develop graph-theoretical techniques using automorphisms to solve the problem of reconfiguring embedded task graphs in faulty hypercubes. Four cases are considered: a single node or link failure, a small number (<5) of random node failures, multiple adjacent failures, and a large number (≥5) of random node failures. For a single node (or link) failure, any task graph embedded in a hypercube can be remapped to another instance of that graph by simple bit-flip operations. It takes at most O(n) such operations in an n-dimensional hypercube. For a small number of faults (<5) and for multiple adjacent node failures, necessary and sufficient conditions are derived under which the remapping can be achieved with these techniques. It is also shown that the general problem of reconfiguring around multiple failures is as hard as the graph isomorphism problem. However, an algorithm is developed to determine whether the reconfiguration can be achieved with the graph-theoretical techniques when the necessary conditions are satisfied.
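The bit-flip remapping mentioned in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes nodes are labeled with n-bit integers and only tries XOR translations (flipping a fixed set of bits in every label), which are one family of hypercube automorphisms. The names `remap_by_bit_flips` and `is_adjacent` are hypothetical, and the brute-force search over all masks is for illustration only.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")


def is_adjacent(a: int, b: int) -> bool:
    """Two hypercube nodes are neighbors when their labels differ in one bit."""
    return popcount(a ^ b) == 1


def remap_by_bit_flips(embedding, faulty, n):
    """Search for a bit-flip mask that moves an embedding off faulty nodes.

    embedding: dict mapping task name -> node label in range(2**n)
    faulty:    set of faulty node labels
    n:         hypercube dimension

    XOR with a fixed mask preserves Hamming distance, so every edge of the
    embedded task graph still maps to a hypercube edge after the remap.
    Masks are tried in order of increasing number of flipped bits.
    Returns (mask, new_embedding), or None if no translation avoids the faults.
    """
    used = set(embedding.values())
    masks = sorted(range(2 ** n), key=lambda m: (popcount(m), m))
    for mask in masks:
        if all((node ^ mask) not in faulty for node in used):
            return mask, {task: node ^ mask for task, node in embedding.items()}
    return None


# A 4-cycle a-b-c-d placed on one face of a 3-cube; node 3 then fails.
cycle = {"a": 0, "b": 1, "c": 3, "d": 2}
result = remap_by_bit_flips(cycle, faulty={3}, n=3)
```

In this example the sketch selects mask 4 (flip the highest bit), which moves the cycle to the opposite face on nodes 4, 5, 7 and 6 while keeping every task-graph edge on a hypercube link.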
This paper implements basic computational kernels of scientific computing, such as matrix-vector product, matrix product and Gaussian elimination, on multi-core platforms using several parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and FastFlow. The aim of this paper is to present a unified quantitative and qualitative study of these tools for the parallel computation of scientific computing kernels on multi-core platforms. Based on this study, we conclude that the Intel ArBB and SWARM parallel programming tools are the most appropriate because they offer both good performance and simplicity of programming.
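As an illustration of the simplest kernel named in this abstract, the following sketch splits a matrix-vector product into contiguous row blocks and hands each block to a worker thread. It is written in Python for readability and is not taken from the paper, which used the tools listed above. The function names and the `workers` parameter are assumptions, and in standard CPython the global interpreter lock means this pure-Python version shows only the decomposition, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(A, x, lo, hi):
    """Compute rows lo..hi-1 of the product A @ x sequentially."""
    return [sum(a * b for a, b in zip(A[i], x)) for i in range(lo, hi)]


def parallel_matvec(A, x, workers=4):
    """Row-block parallel matrix-vector product.

    Each worker gets a contiguous block of rows; blocks are independent,
    so no synchronization is needed beyond collecting the results in order.
    """
    n = len(A)
    workers = max(1, min(workers, n))
    # Block k covers rows [k*n/workers, (k+1)*n/workers).
    bounds = [(k * n // workers, (k + 1) * n // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: matvec_rows(A, x, b[0], b[1]), bounds))
    # Concatenate the per-block results in row order.
    return [y for part in parts for y in part]


A = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
y = parallel_matvec(A, x, workers=2)
```

The same row-block decomposition is what a loop-level directive or a parallel-for construct would typically express in the tools the paper compares.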
Analyses three of the most widely used message passing environments for parallel systems, namely Occam/TDS for Transputer-based systems, Vertex-Vortex for nCUBE, and Express, which is an architecture-independent message passing system. The aim of our analysis is to contrast the features provided by the different systems and to select the most suitable one with respect to three criteria: program portability, efficiency, and flexibility of the message passing system. Program portability should ensure that today's parallel software can still be used tomorrow on new parallel architectures, thus preserving our investment. Efficiency of the application is a must for parallel software. By flexibility, we mean that the message passing system can accommodate different parallel programming paradigms (such as pipelined or data parallel) in a natural way. The use of a general-purpose message passing system, like those considered, is also contrasted with the use of an application-oriented high-level notation.
Designing reconfigurable systems that beneficially exploit the spatial and temporal domain is a cumbersome task that is hardly supported by current design methods. In particular, if we aim to bridge the gap between application and reconfigurable substrate, we require concrete concepts that allow us to exploit the inherent parallelism and adaptiveness of reconfigurable devices. We propose algorithmic skeletons as a sophisticated technique for this purpose. Algorithmic skeletons are programming templates for the parallel computing domain that separate the structure of a computation from the computation itself. Hence, they offer a means to extract the temporal and spatial characteristics of an application, which can be used to make reconfigurability explicit. In this work, we present the conceptual background as well as a concrete means of implementing the method.
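The separation of structure from computation that this abstract attributes to algorithmic skeletons can be sketched with two higher-order functions. This is a generic illustration, not the implementation described in the paper: the names `farm` and `pipeline` follow common skeleton terminology but are defined here as assumptions, and both run sequentially so that only the structural idea is shown.

```python
def farm(worker):
    """Farm skeleton: apply the same worker to every item independently.

    The skeleton fixes the structure (independent tasks over a collection);
    the worker supplies the computation. A parallel backend could replace
    the list comprehension without touching the worker.
    """
    def run(items):
        return [worker(item) for item in items]
    return run


def pipeline(*stages):
    """Pipeline skeleton: feed the output of each stage into the next."""
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run


# Composition: square every element (farm), then reduce with the built-in sum.
sum_of_squares = pipeline(farm(lambda v: v * v), sum)
total = sum_of_squares([1, 2, 3])
```

Because the composed program names its structure explicitly (a farm feeding a reduction stage), a mapping tool can inspect that structure to decide which parts run side by side and which run in sequence.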
With the progress of semiconductor technologies and the advent of multi-core processors, parallel programming models are evolving, and education is needed to help sequential programmers adapt to the requirements of these new technologies and architectures. Multi-core related content has now been adopted into the curricula of more than 100 universities in China, but how this content should be organized and delivered to students is still a big challenge. In this paper, we present the current status of multi-core education in China and divide the related content into several parts. We also introduce the "contracted Problem/Project Based Learning (cP²BL)" strategy, which has been adopted in the course "Multi-core Architecture and Multithreaded Programming Technologies" and runs well at Wuhan University.
GPUs (Graphics Processing Units) have evolved into extremely powerful and flexible processors that can be used to process many kinds of data. This advantage can be used in game development to optimize the game loop. Most GPGPU work deals with only some steps of the game loop, leaving the CPU to process most of the game logic. This work differs from the traditional approach by presenting and implementing practically the entire game loop inside the GPU. This is a major advance for game development, since CPUs are evolving toward multi-core designs and future games will need parallelism similar to that of GPU programs.