With multi-core technology now widely used in PCs, multithreading has become the most efficient programming tool for improving computational capacity on a single processor. This is especially true in image processing, where most numerical approaches operate over pixel matrices, so the computation time grows with image size. In this paper, a parallel method for PDE-based image registration is discussed, and the multithreaded implementation with OpenMP on a dual-core computer is detailed. Experimental results show that this method can handle large-size parallel image registration and save nearly half the computing time.
We present a new mechanism-oriented memory model called Commit-Reconcile & Fences (CRF) and define it using algebraic rules. Many existing memory models can be described as restricted versions of CRF. The model has been designed so that it is both easy for architects to implement and stable enough to serve as a target machine interface for compilers of high-level languages. The CRF model exposes a semantic notion of caches (saches), and decomposes load and store instructions into finer-grain operations. We sketch how to integrate CRF into modern microprocessors and outline an adaptive coherence protocol to implement CRF in distributed shared-memory systems. CRF offers an upward compatible way to design next generation computer systems.
Distributed shared memory (DSM) systems could overcome major obstacles of the widespread use of distributed-memory multiprocessors, while retaining the attractive features of low cost and good scalability common to distributed-memory machines. A DSM system allows a natural and portable programming model on distributed-memory machines, making it possible to construct a relatively inexpensive and scalable parallel system on which programmers can develop parallel application codes. Due to its potential advantages, DSM has received increasing attention. In this panel, challenges in building efficient DSM systems for a wide range of applications are addressed and discussed.
Parallel computing systems have been based on multicore CPUs and specialized coprocessors such as GPUs. Work stealing is a scheduling technique used to distribute and redistribute the workload among resources efficiently. This work proposes, implements, and validates a scheduling approach based on work stealing in parallel systems that use CPUs and GPUs simultaneously. Results show that our approach, called WORMS, delivers competitive performance compared to the reference tool for multicore CPUs (Cilk). In the hybrid scenario, WORMS with multicore+GPU outperforms both WORMS and Cilk on multicore only, as well as the GPU reference tool (Thrust).
As the shift from serial to parallel programming in simulation technologies progresses, it is increasingly important to better understand the interplay of different parallel programming paradigms. We discuss some corresponding issues in the context of transforming a shared-memory parallel program that involves two nested levels of parallelism into a hybrid parallel program. Here, hybrid programming refers to a combination of shared and distributed memory. In particular, we focus on performance aspects arising in shared-memory parallel programming when the time to access a memory location varies across threads. Rather than analyzing these issues in general, this position paper focuses on a particular case study from geothermal reservoir engineering.
Software generation in the OORHS (object-oriented reciprocative hypercomputing system) is user-transparent. It addresses the issue of ease of use by minimizing the number of steps leading to a programming solution. The OORHS requires from the user only a high-level APPL program, which is, in effect, a specification. For every APPL program, the system automatically performs all the necessary distributed computing steps. The precompiler, based on the object-oriented paradigm, instantiates the encapsulated program objects embedded in an APPL program. These program objects are distributed at the source level, then compiled and executed at the allocated sites. This unique approach, known as local compilation, eliminates the need to store locally the compilers used by other machines, and it enhances the compatibility between the compiled program and the host processor. The precompiler generates a program-objects dictionary for every APPL program; the contents of the dictionary facilitate program visualization.
This paper investigates the performance implications of data placement in OpenMP programs running on modern ccNUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of state-of-the-art ccNUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution of pages, incur only modest performance losses. We also show that performance leaks stemming from suboptimal page placement can be remedied with a smart user-level page migration engine. The main body of the paper describes how the OpenMP runtime environment can use page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results support the effectiveness of these mechanisms and provide a proof of concept that there is no need to introduce data distribution directives into OpenMP, thereby preserving the portability of the programming model.
We present the GRIDS programming system for parallel computations on unstructured grids. The need for adaptive parallelization, which takes into account the specific reasons for the degradation of parallel efficiency in systems ranging from workstation clusters to parallel supercomputers, is discussed. Implications for the design of programming models for unified parallel and distributed computing are shown. Some strategies for adaptive optimization that are integrated into the GRIDS system are presented in detail. The effects of balancing the load, bundling messages, and dynamically reordering the operations are analysed in the general context of computations on unstructured grids. Performance measurements show the impact of these schemes on parallel efficiency on a workstation cluster and a parallel computer.
ISBN:
(print) 0852965095
This paper explores the role of learning techniques, motivated by approaches in artificial intelligence, in improving parallel program performance. The authors present an adaptive control model that improves parallel program performance through dynamic modification of scheduling parameters under various run-time environments. Optimal or improved scheduling strategies learned from previous program executions provide feedback to subsequent executions.