ISBN: (Print) 9781479986484
Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-element computations for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task flow, scheduled in an out-of-order fashion by a dynamic runtime that allows the programmer to tune the task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
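To make the "sequential task-flow" formulation concrete, here is a minimal sketch of how a divide-and-conquer tridiagonal solver can be submitted as tasks with data dependencies. It assumes a generic OpenMP task runtime with depend clauses rather than the runtime used in the paper, and the kernels solve_leaf() and merge_subproblems() are hypothetical placeholders for the numerical work; tuning LEAF plays the role of the task-granularity knob mentioned in the abstract.

```c
/* Minimal sketch, not the paper's implementation: a divide-and-conquer
 * tridiagonal eigensolver expressed as a sequential task flow, with
 * OpenMP tasks standing in for the dynamic out-of-order runtime.      */
#include <stdio.h>

#define N    1024   /* matrix order (assumed for the example) */
#define LEAF 128    /* subproblem size solved directly = task granularity */

/* Placeholder kernels: real code would solve the small eigenproblem
 * directly and perform the rank-one (secular equation) merge step.    */
static void solve_leaf(double *d, double *e, int n)        { (void)e; (void)n; d[0] += 0.0; }
static void merge_subproblems(double *d, double *e, int n) { (void)e; (void)n; d[0] += 0.0; }

/* Submit the whole task graph sequentially; the runtime executes the
 * tasks out of order as soon as their dependencies are satisfied.     */
static void dc_taskflow(double *d, double *e, int i0, int n)
{
    if (n <= LEAF) {
        #pragma omp task depend(inout: d[i0])
        solve_leaf(&d[i0], &e[i0], n);
        return;
    }
    int half = n / 2;
    dc_taskflow(d, e, i0, half);             /* left subproblem  */
    dc_taskflow(d, e, i0 + half, n - half);  /* right subproblem */

    /* The merge waits for both halves through the data dependencies. */
    #pragma omp task depend(inout: d[i0], d[i0 + half])
    merge_subproblems(&d[i0], &e[i0], n);
}

int main(void)
{
    static double d[N], e[N];                /* diagonal / off-diagonal */
    #pragma omp parallel
    #pragma omp single
    dc_taskflow(d, e, 0, N);                 /* sequential submission   */
    printf("all tasks executed\n");
    return 0;
}
```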
ISBN: (Print) 9781467380119
Understanding three-dimensional seismic wave propagation in complex media is still one of the main challenges of quantitative seismology. Because of its simplicity and numerical efficiency, the finite-difference method is one of the standard techniques used to solve the elastodynamics equation. Additionally, this class of modeling relies heavily on parallel architectures in order to tackle large-scale geometries including a detailed description of the physics. Over the last decade, significant efforts have been devoted to efficient implementations of finite-difference methods on emerging architectures. These contributions have demonstrated their efficiency, leading to robust industrial applications. The growing prevalence of heterogeneous architectures combining general-purpose multicore platforms and accelerators requires redesigning current parallel applications. In this paper, we consider the StarPU task-based runtime system in order to harness the power of heterogeneous CPU+GPU computing nodes. We detail our implementation and compare the performance obtained with the classical CPU-only or GPU-only versions. Preliminary results demonstrate significant speedups in comparison with the best implementation suitable for homogeneous cores.
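As a rough illustration of the task-based approach described above, the sketch below registers one data block per sub-domain with StarPU and submits a placeholder task per block. It relies only on StarPU's documented C API (starpu_vector_data_register, starpu_task_insert, etc.); the kernel is a trivial stand-in for the elastodynamics finite-difference update, and a real heterogeneous build would also provide a CUDA implementation in the codelet's cuda_funcs field. This is not the paper's code.

```c
/* Illustrative StarPU sketch: one task per sub-domain block, with a
 * dummy kernel standing in for the finite-difference update.          */
#include <starpu.h>
#include <stdlib.h>

#define NBLOCKS 8
#define BLOCKSZ 4096

/* CPU implementation of the placeholder kernel. */
static void stencil_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    double *u = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 1; i + 1 < n; i++)     /* dummy 1D update */
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

static struct starpu_codelet stencil_cl = {
    .cpu_funcs = { stencil_cpu },
    /* .cuda_funcs = { stencil_cuda },  added for CPU+GPU nodes */
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

int main(void)
{
    if (starpu_init(NULL) != 0) return 1;

    double *field = calloc((size_t)NBLOCKS * BLOCKSZ, sizeof(*field));
    starpu_data_handle_t handles[NBLOCKS];

    /* Register each block so the runtime can move it between memories. */
    for (int b = 0; b < NBLOCKS; b++)
        starpu_vector_data_register(&handles[b], STARPU_MAIN_RAM,
                                    (uintptr_t)&field[b * BLOCKSZ],
                                    BLOCKSZ, sizeof(double));

    /* Submit one task per block; the scheduler picks the workers. */
    for (int b = 0; b < NBLOCKS; b++)
        starpu_task_insert(&stencil_cl, STARPU_RW, handles[b], 0);

    starpu_task_wait_for_all();
    for (int b = 0; b < NBLOCKS; b++)
        starpu_data_unregister(handles[b]);
    free(field);
    starpu_shutdown();
    return 0;
}
```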
ISBN: (Print) 9781450337236
Task-parallel programming models with input-annotation-based concurrency extraction at runtime present a promising paradigm for programming multicore processors. Through management of dependencies, task assignments, and orchestration, these models markedly simplify the programming effort for parallelization while exposing higher levels of concurrency. In this paper we show that for multicores with a shared last-level cache (LLC), the concurrency extraction framework can also be used to improve shared LLC performance. Based on the input annotations of future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse while evicting blocks with no future reuse. These instructions allow the hardware to preserve the blocks needed by at least some of the future tasks and to evict dead blocks. This leads to a considerable improvement in cache efficiency over hardware-only replacement policies, which may evict blocks needed by future tasks, resulting in poor hit rates. The proposed hardware-software technique leads to a mean improvement of 18% in application performance and a mean reduction of 26% in misses over a shared LLC managed by the Least Recently Used replacement policy, for a set of input-annotated task-parallel programs using the OmpSs programming model implemented on the NANOS++ runtime. In contrast, the state-of-the-art thread-based partitioning scheme suffers an average performance loss of 2% and an average increase of 15% in misses over the same baseline.
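The ingredient the runtime exploits is that the data usage of not-yet-executed tasks is already known from their annotations. The sketch below conveys that idea using standard OpenMP depend clauses as a stand-in for OmpSs's in/out/inout annotations: from the window of submitted future tasks, a runtime can tell that the blocks of A are reused in phase 2 while the blocks of TMP are dead after their single consumer, which is the kind of information the proposed hardware hints are derived from. The arrays and kernels are illustrative placeholders, not the paper's benchmarks.

```c
/* Illustrative only: input/output annotations expose future reuse.
 * Standard OpenMP depend clauses approximate OmpSs-style annotations;
 * the first element of each block is used as the dependence item.     */
#include <stdio.h>

#define NB 8        /* number of blocks      */
#define BS 1024     /* block size (elements) */

static double A[NB][BS], TMP[NB][BS], R[NB];

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* Phase 1: produce TMP from A. */
        for (int b = 0; b < NB; b++) {
            #pragma omp task depend(in: A[b][0]) depend(out: TMP[b][0])
            for (int i = 0; i < BS; i++)
                TMP[b][i] = 2.0 * A[b][i];
        }
        /* Phase 2: these tasks are submitted while phase 1 may still be
         * running, so the lookahead reveals that each A[b] has future
         * reuse, whereas each TMP[b] is dead after its single consumer. */
        for (int b = 0; b < NB; b++) {
            #pragma omp task depend(in: A[b][0], TMP[b][0]) depend(out: R[b])
            {
                double s = 0.0;
                for (int i = 0; i < BS; i++)
                    s += A[b][i] + TMP[b][i];
                R[b] = s;
            }
        }
    }
    printf("R[0] = %f\n", R[0]);
    return 0;
}
```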
Many applications, ranging from big data analytics to nanostructure design, require the solution of large dense singular value decomposition (SVD) or eigenvalue problems. A first step in the solution methodology for these problems is the reduction of the matrix at hand to condensed form by two-sided orthogonal transformations. This step is standard practice because it significantly accelerates the solution process. We present a performance analysis of the main two-sided factorizations used in these reductions: the bidiagonalization, the tridiagonalization, and the upper Hessenberg factorization, on heterogeneous systems of multicore CPUs and Xeon Phi coprocessors. We derive a performance model and use it to guide the analysis and to evaluate performance. We develop optimized implementations of these methods that achieve up to 80% of the optimal performance bounds. Finally, we describe the heterogeneous multicore and coprocessor development considerations and the techniques that enable us to achieve these high-performance results. This work presents the first highly optimized implementation of these main factorizations for Xeon Phi coprocessors. Compared to the LAPACK versions optimized by Intel for Xeon Phi (in MKL), we achieve up to a 50% speedup.
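For reference, the three condensed-form reductions discussed above correspond to standard LAPACK routines, called below through the LAPACKE C interface. This is plain LAPACK usage with placeholder sizes and data, not the optimized Xeon Phi implementation described in the abstract.

```c
/* The three two-sided reductions via LAPACKE (reference usage only). */
#include <lapacke.h>
#include <stdlib.h>
#include <stdio.h>

#define N 512

int main(void)
{
    double *A    = malloc(sizeof(double) * N * N);
    double *d    = malloc(sizeof(double) * N);
    double *e    = malloc(sizeof(double) * N);
    double *tau  = malloc(sizeof(double) * N);
    double *taup = malloc(sizeof(double) * N);
    for (int i = 0; i < N * N; i++) A[i] = rand() / (double)RAND_MAX;

    /* Bidiagonalization (first step of the SVD). */
    lapack_int info = LAPACKE_dgebrd(LAPACK_COL_MAJOR, N, N, A, N,
                                     d, e, tau, taup);
    printf("dgebrd: info = %d\n", (int)info);

    /* Tridiagonalization (symmetric eigenproblem); in real code A would
     * be refilled with a symmetric matrix, here we only show the call. */
    info = LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', N, A, N, d, e, tau);
    printf("dsytrd: info = %d\n", (int)info);

    /* Upper Hessenberg reduction (nonsymmetric eigenproblem). */
    info = LAPACKE_dgehrd(LAPACK_COL_MAJOR, N, 1, N, A, N, tau);
    printf("dgehrd: info = %d\n", (int)info);

    free(A); free(d); free(e); free(tau); free(taup);
    return 0;
}
```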
The article addresses the challenges of software development for current and future parallel computers, which are expected to be dominated by multicore and many-core architectures. Using these multicore processors in cluster systems will create systems with thousands of cores and deep memory hierarchies. To efficiently exploit the tremendous parallelism of these hardware platforms, a new generation of programming methodologies is needed. This article proposes a parallel programming methodology exploiting a task-based representation of application software. For the specification of task-based programs, a coordination language is presented, which uses external variables to express the cooperation between tasks. For the actual execution of a task-based program on a specific parallel architecture, different dynamic scheduling algorithms embedded into an execution environment are introduced. Runtime experiments with complex methods from numerical analysis are performed on different parallel execution platforms.
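The two mechanisms named in the abstract, cooperation through external variables and dynamic scheduling inside an execution environment, can be pictured with the generic pthread sketch below. It is not the paper's coordination language or scheduler: the "external" variable is simply shared state that outlives any single task, and the scheduler is a minimal work queue from which idle workers pull ready tasks.

```c
/* Generic sketch: tasks cooperate through an external variable and are
 * dispatched dynamically by a minimal shared work queue.              */
#include <pthread.h>
#include <stdio.h>

#define NTASKS   16
#define NWORKERS 4

static double external_sum = 0.0;   /* "external variable" shared by tasks */
static pthread_mutex_t sum_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;            /* shared task queue: a simple index  */

/* One task: compute a partial result and accumulate it into the
 * external variable that later tasks (or the main program) read.      */
static void run_task(int id)
{
    double partial = (double)id * id;
    pthread_mutex_lock(&sum_lock);
    external_sum += partial;
    pthread_mutex_unlock(&sum_lock);
}

/* Worker loop: dynamic scheduling, each idle worker grabs the next task. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int id = (next_task < NTASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&queue_lock);
        if (id < 0) break;
        run_task(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t workers[NWORKERS];
    for (int w = 0; w < NWORKERS; w++)
        pthread_create(&workers[w], NULL, worker, NULL);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(workers[w], NULL);
    printf("external_sum = %f\n", external_sum);
    return 0;
}
```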
ISBN: (Print) 9781617828379
Programming models using parallel tasks provide portable performance and scalability for modular applications on many high-performance systems. This is achieved by the flexibility of a two-level programming structure supporting mixed task and data parallelism. Due to the emerging importance of energy efficiency in high-performance computing, programming models with parallel tasks should be extended to take energy concerns into account. Based on a well-accepted analytical model of a processor's energy consumption, this article explores the energy consumption of parallel tasks with communication that are executed concurrently with other tasks. Simulations show the energy consumption for different task-cooperation scenarios and demonstrate the potential for flexible energy usage on varying parallel platforms.
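To give a feel for the kind of analytical model the abstract refers to, the sketch below estimates the energy of one parallel task from a static power term, a frequency-dependent dynamic power term, and a communication time that does not scale with frequency. The constants (p_static, gamma, alpha) and the cost terms are generic placeholders, not the paper's exact formulation.

```c
/* Generic DVFS-style energy estimate for one parallel task; all
 * parameter values are illustrative assumptions.                      */
#include <math.h>
#include <stdio.h>

/* Power drawn by one core at frequency f (GHz):
 * static part plus a frequency-dependent dynamic part ~ f^alpha.      */
static double core_power(double f, double p_static, double gamma_, double alpha)
{
    return p_static + gamma_ * pow(f, alpha);
}

/* Energy of a task executed on m cores at frequency f: computation time
 * scales as work/(m*f); communication time t_comm does not scale with f. */
static double task_energy(double work, double t_comm, int m, double f,
                          double p_static, double gamma_, double alpha)
{
    double t_comp = work / (m * f);
    return m * core_power(f, p_static, gamma_, alpha) * (t_comp + t_comm);
}

int main(void)
{
    /* Compare operating frequencies for the same task (illustrative). */
    for (double f = 1.0; f <= 3.0; f += 1.0)
        printf("f = %.1f GHz -> E = %.2f J\n",
               f, task_energy(1e3, 0.5, 8, f, 10.0, 1.5, 3.0));
    return 0;
}
```

The trade-off the model captures is that running faster shortens the computation phase but raises dynamic power superlinearly, while the communication phase burns static power regardless of frequency, which is why concurrent execution with other tasks changes the energy-optimal operating point.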