With multi-core technology now widely used in PCs, multithreading has become the most efficient programming tool for improving computation capacity on a single processor. This is especially true in image processing, where most numerical approaches are based on computations over pixel matrices, so that processing time grows as the image size increases. In this paper, a parallel method for PDE-based image registration is discussed, and the multithreading process with OpenMP on a dual-core computer is detailed. Experimental results show that this method can perform image registration on large images in parallel and save nearly half of the computing time.
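As a rough illustration of the kind of parallelization this abstract describes (not the authors' code), the sketch below splits a per-pixel update loop across threads with a single OpenMP directive. The image size and the diffusion-style update rule are placeholders.

#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int W = 2048, H = 2048;                     // hypothetical image size
    std::vector<float> u(W * H, 0.0f), f(W * H, 1.0f);

    double t0 = omp_get_wtime();
    // Rows are independent, so they can be distributed across the cores.
    #pragma omp parallel for
    for (int y = 1; y < H - 1; ++y) {
        for (int x = 1; x < W - 1; ++x) {
            // Placeholder PDE-style update over the pixel matrix.
            u[y * W + x] = 0.25f * (f[(y - 1) * W + x] + f[(y + 1) * W + x] +
                                    f[y * W + x - 1] + f[y * W + x + 1]);
        }
    }
    double t1 = omp_get_wtime();
    std::printf("elapsed: %.3f s on %d threads\n", t1 - t0, omp_get_max_threads());
    return 0;
}

On a dual-core machine the loop above is the sort of computation that can approach the roughly 2x speedup the abstract reports.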
ISBN:
(Print) 9781424479535; 9781424479528
Custom acceleration has been a standard choice in embedded systems thanks to the power density and performance efficiency it provides. Parallelism is another, orthogonal scalability path that efficiently overcomes the increasing limitations of frequency scaling in current general-purpose architectures. In this paper we propose a multi-accelerator architecture that combines the best of both worlds, parallelism and custom acceleration, while addressing the programmability inconvenience of heterogeneous multiprocessing systems. A Chip Multi-Accelerator (CMA) is a regular parallel architecture in which each core is complemented with a custom accelerator to speed up specific functions. Furthermore, by using techniques to efficiently merge more than one custom accelerator, we are able to pack as many accelerators as needed by an application or a domain of applications. We demonstrate our approach on a Software Defined Radio (SDR) case study. We show that, starting from a baseline description of several SDR waveforms and candidate tasks for acceleration, we are able to map the different waveforms onto the heterogeneous multi-accelerator architecture while keeping the logical view of a regular multi-core architecture, thus simplifying the mapping of the waveforms onto the multi-accelerator.
Irregular applications, which rely on pointer-based data structures, are often difficult to parallelize. The input-dependent nature of their execution means that traditional parallelization techniques are unable to exploit any latent parallelism in these algorithms. Instead, we turn to optimistic parallelism, where regions of code are speculatively run in parallel while runtime mechanisms ensure proper execution. The performance of such optimistically parallelized algorithms is often dependent on the schedule for parallel execution; improper choices can prevent successful parallel execution. We demonstrate this through the motivating example of Delaunay mesh refinement, an irregular algorithm, which we have parallelized optimistically using the Galois system. We apply several scheduling policies to this algorithm and investigate their performance, showing that careful consideration of scheduling is necessary to maximize parallel performance.
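The sketch below illustrates the general idea of optimistic parallelism and why the schedule matters; it is not the Galois API. Each thread speculatively claims a work item plus a neighbouring item it would conflict with, commits if both claims succeed, and aborts and requeues otherwise. The conflict pattern and the FIFO policy are illustrative placeholders.

#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Item {
    int id = 0;
    std::mutex lock;
};

int main() {
    const int N = 1000;
    std::vector<Item> items(N);
    for (int i = 0; i < N; ++i) items[i].id = i;

    std::deque<int> worklist;                 // processing order = scheduling policy
    for (int i = 0; i < N; ++i) worklist.push_back(i);
    std::mutex wl_lock;
    std::atomic<int> retries{0};

    auto worker = [&]() {
        for (;;) {
            int idx;
            {
                std::lock_guard<std::mutex> g(wl_lock);
                if (worklist.empty()) return;
                idx = worklist.front();       // FIFO here; LIFO etc. are other policies
                worklist.pop_front();
            }
            // Speculate: claim the item and a neighbouring item it conflicts with.
            Item &a = items[idx];
            Item &b = items[(idx + 1) % N];
            if (a.lock.try_lock()) {
                if (b.lock.try_lock()) {
                    // Commit point: a real algorithm (e.g. mesh refinement)
                    // would update the claimed elements here.
                    b.lock.unlock();
                    a.lock.unlock();
                    continue;
                }
                a.lock.unlock();
            }
            // Conflict detected: abort the speculation and requeue the item.
            ++retries;
            std::lock_guard<std::mutex> g(wl_lock);
            worklist.push_back(idx);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker);
    for (auto &th : pool) th.join();
    std::printf("aborted and retried speculations: %d\n", retries.load());
    return 0;
}

A schedule that hands neighbouring items to different threads at the same time drives the retry count up, which is the performance effect the paper studies.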
Cloud computing has gained significant traction in recent years. The Map-Reduce framework is currently the most dominant programming model in cloud computing settings. In this paper, we describe Granules, a lightweight, streaming-based runtime for cloud computing which incorporates support for the Map-Reduce framework. Granules provides rich lifecycle support for developing scientific applications, with support for iterative, periodic and data-driven semantics for individual computations and pipelines. We describe our support for variants of the Map-Reduce framework. The paper presents a survey of related work in this area. Finally, this paper describes our performance evaluation of various aspects of the system, including (where possible) comparisons with comparable systems.
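For readers unfamiliar with the pattern the runtime supports, here is a minimal single-process sketch of Map-Reduce itself (word count), not of Granules' API: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums the counts per key.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using KV = std::pair<std::string, int>;

// Map: split a line into words and emit (word, 1) for each.
std::vector<KV> map_fn(const std::string &line) {
    std::vector<KV> out;
    std::istringstream ss(line);
    std::string word;
    while (ss >> word) out.push_back({word, 1});
    return out;
}

// Reduce: sum the partial counts collected for one key.
int reduce_fn(const std::vector<int> &counts) {
    int sum = 0;
    for (int c : counts) sum += c;
    return sum;
}

int main() {
    std::vector<std::string> input = {"a b a", "b c"};
    std::map<std::string, std::vector<int>> groups;   // shuffle: group by key
    for (const auto &line : input)
        for (const auto &kv : map_fn(line))
            groups[kv.first].push_back(kv.second);
    for (const auto &g : groups)
        std::cout << g.first << ": " << reduce_fn(g.second) << "\n";
    return 0;
}

A streaming runtime like the one described would keep such map and reduce computations resident and feed them data repeatedly, rather than running them once per batch.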
ISBN:
(Print) 0852965095
Explores the role of learning techniques, motivated by approaches in artificial intelligence, in improving parallel program performance. The authors present an adaptive control model that improves parallel program performance through dynamic modification of scheduling parameters under various run-time environments. Optimal or improved scheduling strategies learned from previous program executions provide feedback to subsequent executions.
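A hedged sketch of the feedback idea, not the authors' model: time a loop under several candidate scheduling parameters and keep the best one for future executions. The workload, the candidate chunk sizes, and the use of OpenMP's dynamic schedule are all assumptions made for illustration.

#include <omp.h>
#include <cstdio>
#include <vector>

// Run the placeholder workload once with a given dynamic chunk size.
double run_once(int chunk, std::vector<double> &a) {
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < (int)a.size(); ++i)
        a[i] = a[i] * 1.0001 + 1.0;
    return omp_get_wtime() - t0;
}

int main() {
    std::vector<double> a(1 << 22, 1.0);
    int best_chunk = 1;
    double best_time = 1e9;
    for (int chunk : {1, 16, 256, 4096}) {            // candidate scheduling parameters
        double t = run_once(chunk, a);
        std::printf("chunk %4d: %.4f s\n", chunk, t);
        if (t < best_time) { best_time = t; best_chunk = chunk; }
    }
    std::printf("feedback: use chunk=%d for future executions\n", best_chunk);
    return 0;
}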
Software generation in the OORHS (object-oriented reciprocative hypercomputing system) is user-transparent. It addresses the issue of ease of use by minimizing the number of steps leading to a programming solution. The OORHS requires from the user only a high-level APPL program, which is, in effect, a specification. For every APPL program, the system automatically performs all the necessary distributed computing steps. The precompiler, based on the object-oriented paradigm, instantiates the encapsulated program objects embedded in an APPL program. These program objects are distributed at the source level, then compiled and executed at the allocated sites. This unique approach, known as local compilation, eliminates the need to store locally the compilers used by other machines and enhances compatibility between the compiled program and the host processor. The precompiler generates a program-object dictionary for every APPL program, and the contents of the dictionary facilitate program visualization.
We present the GRIDS programming system for parallel computations on unstructured grids. The need for adaptive parallelization is discussed: it takes into account the specific causes of degraded parallel efficiency on systems ranging from workstation clusters to parallel supercomputers. Implications for the design of programming models for unified parallel and distributed computing are shown. Some strategies for adaptive optimization that are integrated in the GRIDS system are presented in detail. The effects of balancing the load, bundling messages, and dynamically reordering the operations are analysed in the general context of computations on unstructured grids. Performance measurements show the impact of these schemes on parallel efficiency on a workstation cluster and a parallel computer.
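To make the message-bundling strategy concrete, here is a hedged sketch using plain MPI rather than the GRIDS system itself: many small updates bound for the same destination are packed into one buffer and sent as a single message, avoiding per-message latency. The counts and values are placeholders.

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1000;                       // number of small updates to bundle
    if (rank == 0) {
        std::vector<double> bundle(N);
        for (int i = 0; i < N; ++i) bundle[i] = i * 0.5;    // collect the updates
        // One bundled message instead of N one-element messages.
        MPI_Send(bundle.data(), N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        std::vector<double> bundle(N);
        MPI_Recv(bundle.data(), N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d bundled values in a single message\n", N);
    }
    MPI_Finalize();
    return 0;
}

On a workstation cluster, where per-message latency is high relative to bandwidth, this kind of aggregation is one of the main levers the abstract's measurements evaluate.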
This paper investigates the performance implications of data placement in OpenMP programs running on modern ccNUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of state-of-the-art ccNUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution of pages, incur only modest performance losses. We also show that performance leaks stemming from suboptimal page placement can be remedied with a smart user-level page migration engine. The main body of the paper describes how the OpenMP runtime environment can use page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results support the effectiveness of these mechanisms and provide a proof of concept that data distribution directives need not be introduced into OpenMP, thus preserving the portability of the programming model.
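For context, the sketch below shows the baseline data-placement concern the paper addresses (first-touch placement), not its page-migration engine: initializing an array under the same static schedule as the later compute loop lets each thread touch, and thereby localize, the pages it will use. The array sizes and workload are illustrative.

#include <omp.h>
#include <cstdio>

int main() {
    const long N = 1L << 24;
    // Allocate without touching the pages; on a first-touch ccNUMA kernel the
    // physical placement is decided by whichever thread writes a page first.
    double *a = new double[N];
    double *b = new double[N];

    // Initialize with the same static schedule as the compute loop, so each
    // thread touches (and localizes) the pages it will later access.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) a[i] += 2.0 * b[i];   // mostly node-local accesses
    std::printf("compute loop: %.4f s\n", omp_get_wtime() - t0);

    delete[] a;
    delete[] b;
    return 0;
}

A runtime-level migration engine, as described in the abstract, moves pages after the fact when the access pattern turns out not to match the initial placement.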
The authors describe the design and implementation of C40PVM, a PVM runtime environment for TMS320C40 networks. With the C40PVM runtime environment, parallel applications can be easily developed on C40 systems and then ported to other parallel computing platforms. The performance of the runtime environment is also analyzed using a DSP vector quantization application.
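As a hedged illustration of the portability argument, here is a minimal master/worker exchange written against the standard PVM 3 message-passing calls, the style of program such a runtime is meant to host; the task name "worker", the message tags, and the data are placeholders, and this is not taken from the paper.

#include <pvm3.h>
#include <cstdio>

int main() {
    pvm_mytid();                              // enroll this process in PVM
    int parent = pvm_parent();

    if (parent == PvmNoParent) {              // master branch
        int tid;
        pvm_spawn((char *)"worker", nullptr, PvmTaskDefault, nullptr, 1, &tid);
        int data = 42;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&data, 1, 1);
        pvm_send(tid, 1);                     // send one tagged message
        pvm_recv(tid, 2);                     // wait for the reply
        int result;
        pvm_upkint(&result, 1, 1);
        std::printf("master received %d\n", result);
    } else {                                  // worker branch
        pvm_recv(parent, 1);
        int data;
        pvm_upkint(&data, 1, 1);
        data *= 2;                            // placeholder computation
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&data, 1, 1);
        pvm_send(parent, 2);
    }
    pvm_exit();
    return 0;
}

Because only standard PVM calls appear, the same source can in principle run on a C40 network under C40PVM or on a workstation cluster under stock PVM.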