Emerging accelerating architectures, such as GPUs, have proved successful in providing significant performance gains to various application domains. This is done by exploiting data parallelism in existing algorithms. ...
This paper deals with the issue of developing efficient algorithms for accelerating SIFT (Scale Invariant Feature Transform) feature extraction in a distributed environment. The proposed distributed dynamic parallel...
Parallel efficiency has always been a fundamental research topic in high-performance computing. This paper focuses on parallel computing on a high-performance computing cluster with the CASTEP program, discusses multi-core parall...
The paper describes, from a software engineering perspective, a framework for the formal development of parallel algorithms on arbitrary architectures. The algorithms are synthesised in a transformational way, i.e. by a...
ISBN: 9781538655559 (print)
Despite the fact that multicore CPUs are present in virtually every personal computer or cell phone and distributed systems in the form of cloud services are steadily penetrating various domains of our lives, only a minority of programmers and computer science graduates are able to effectively design and develop parallel and distributed applications. Serial thinking is natural to all humans and it is also encouraged by many computer science curricula. Even though leading educational institutions are attempting to rectify this trend by introducing parallel programming courses into their study programs, these courses are often reserved for more experienced students in their fourth or fifth year, since mastering modern parallel technologies like OpenMP or CUDA requires a certain level of programming skill. It can be argued that parallel thinking should be taught much sooner, perhaps even before tertiary education. To this end, we have created an educational platform, Parapple, that aims to introduce parallelism and related problems like load balancing or synchronization to inexperienced programmers in an entertaining form. Our platform is web-based, so it can run in any modern browser on all operating systems without installation, and the users are required to have only a very basic understanding of structured imperative programming.
We discuss parallel sorting algorithms and their implementations suitable for cluster architectures in order to optimize cluster resources. We focus on the time spent in computation and the load balancing properties when processors are running at different speeds, i.e. related by a constant multiplicative factor (our weak definition of a heterogeneous platform). One scheme is under study: parallel sorting by sampling (either the regular sampling technique introduced by Shi and Schaeffer [J. Parallel Distrib. Comput. 14 (4) (1992) 361] or the over-partitioning scheme introduced by Li and Sevcik [Parallel sorting by over-partitioning, in: Proceedings of the Sixth Annual Symposium on Parallel Algorithms and Architectures, ACM Press, New York, June 1994]). What is important in the paper is mainly the load balance factor and not necessarily the execution time. It is clear that improved load balance leads to improved execution time. The results presented in the paper demonstrate that load balancing for the case of computers with heterogeneous processing capacity is more challenging than for the homogeneous case. The survey, through the sorting case study, allows us to identify some algorithmic issues and software challenges that must be mastered on heterogeneous cluster platforms in order to better utilize them: data decomposition techniques, scheduling and load balancing methods. (C) 2002 Elsevier Science B.V. All rights reserved.
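As a rough illustration of the regular sampling scheme mentioned in this abstract, the sketch below simulates the phases sequentially for p equally fast workers. It is a simplified sketch, not the paper's implementation: the function name `psrs_sort`, the pivot positions, and the reporting of per-worker loads are assumptions made for this example, and the heterogeneous case (workers related by a constant speed factor) is not modelled.

```python
from bisect import bisect_right
from heapq import merge


def psrs_sort(data, p):
    """Sequentially simulate parallel sorting by regular sampling on p workers.

    Returns (sorted_list, loads), where loads[k] is the number of elements
    worker k ends up merging. The name and return shape are hypothetical.
    """
    n = len(data)
    if p <= 1 or n < p * p:
        # Too little data to sample from: fall back to a single worker.
        return sorted(data), [n]

    # Phase 1: split the input into p contiguous blocks and sort each locally.
    bounds = [i * n // p for i in range(p + 1)]
    blocks = [sorted(data[bounds[i]:bounds[i + 1]]) for i in range(p)]

    # Phase 2: every worker contributes p regularly spaced samples.
    samples = []
    for block in blocks:
        samples.extend(block[j * len(block) // p] for j in range(p))
    samples.sort()

    # Phase 3: pick p - 1 pivots at regular positions in the sorted sample
    # (a simplified choice of positions, assumed for this sketch).
    pivots = [samples[k * p] for k in range(1, p)]

    # Phase 4: each worker cuts its block at the pivots; piece k goes to worker k.
    received = [[] for _ in range(p)]
    for block in blocks:
        cuts = [0] + [bisect_right(block, pv) for pv in pivots] + [len(block)]
        for k in range(p):
            received[k].append(block[cuts[k]:cuts[k + 1]])

    # Phase 5: each worker merges the sorted pieces it received.
    merged = [list(merge(*pieces)) for pieces in received]
    loads = [len(part) for part in merged]
    return [x for part in merged for x in part], loads
```

Comparing `max(loads)` with the mean load `len(data) / p` gives one simple way to look at load balance after the exchange phase; whether this matches the load balance factor used in the paper is not stated in the abstract.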
Based on the characteristics of multi-core architectures and the binary storage property of integer sequences, this paper proposes an efficient thread-level parallel algorithm for sorting integer sequences on multi-core...
ISBN: 9781665497473 (print)
The hardware complexity of modern machines makes the design of adequate programming models crucial for jointly ensuring performance, portability, and productivity in high-performance computing (HPC). Sequential task-based programming models paired with advanced runtime systems allow the programmer to write a sequential algorithm independently of the hardware architecture in a productive and portable manner, and let a third-party software layer, the runtime system, deal with the burden of scheduling a correct, parallel execution of that algorithm to ensure performance. Many HPC algorithms have successfully been implemented following this paradigm, as a testimony of its effectiveness. Developing algorithms that specifically require fine-grained tasks along this model is still considered prohibitive, however, due to per-task management overhead [1], forcing the programmer to resort to a less abstract, and hence more complex, "task+X" model. We thus investigate the possibility of offering a tailored execution model, trading dynamic mapping for efficiency by using a decentralized, conservative in-order execution of the task flow, while preserving the benefits of relying on the sequential task-based programming model. We propose a formal specification of the execution model as well as a prototype implementation, which we assess on a shared-memory multicore architecture with several synthetic workloads. The results show that, provided the programmer supplies a proper task mapping, the pressure on the runtime system is significantly reduced and the execution of fine-grained task flows is much more efficient.
The speed of calculating, tracking and filling the isolines has a direct impact on the performance of user interaction. In this paper, we begin with the serial algorithm of visualization and implement its parallel alg...
We demonstrate an approach to parallel programming based on skeletons: parameterized program schemas with efficient implementations over diverse architectures. The contribution of the paper is two-fold: (1) we classify divide-and-conquer (DC) algorithms and provide a family of provably correct parallel implementations for a particular DC skeleton, called DH (distributable homomorphism); (2) we adjust the mathematical specification of the Fast Fourier Transform (FFT) to the DH skeleton and, thereby, obtain a generic SPMD program, well suited for implementation under MPI. The generic program includes the efficient FFT solutions used in practice (the binary-exchange and the 2D- and 3D-transpose implementations) as special cases.
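To make the divide-and-conquer structure of the FFT concrete, the following is a minimal sequential sketch of the textbook radix-2 recursion. It is not the paper's DH skeleton or its SPMD/MPI program; the function name `fft` and the power-of-two length restriction are assumptions of this example. It only shows the shape that a DC skeleton has to capture: two half-size subproblems whose results are combined elementwise with a "plus" and a "minus" operation.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")

    # Divide: transform the even-indexed and odd-indexed halves independently.
    even = fft(x[0::2])
    odd = fft(x[1::2])

    # Combine: scale the odd half by the twiddle factors, then apply the
    # elementwise "plus" for the first half and "minus" for the second half.
    half = n // 2
    scaled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    first = [even[k] + scaled[k] for k in range(half)]
    second = [even[k] - scaled[k] for k in range(half)]
    return first + second
```

The two recursive calls are independent of each other, which is what makes the scheme attractive for parallel execution; how the paper maps them onto processors is not described in the abstract.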