This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout onto the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size, which also corresponds to the matrix bandwidth after the first-stage reduction, on the total execution time. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 x 12000 matrix against state-of-the-art open-source and commercial numerical software packages, namely LAPACK (compiled with optimized and multithreaded BLAS from MKL) and Intel MKL version 10.2.
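For reference, the classical one-stage reduction that LAPACK provides, and against which the two-stage BRD is benchmarked, is the dgebrd routine. A minimal sketch of calling it through the LAPACKE C interface; the 4 x 4 test matrix and the build line are illustrative assumptions:

/* One-stage bidiagonal reduction via LAPACK's dgebrd: A = Q*B*P^T,
 * with B upper bidiagonal when m >= n.
 * Assumed build line: cc brd.c -llapacke -llapack -lblas */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int m = 4, n = 4;
    double a[16] = { 4, 1, 0, 0,      /* column-major test matrix */
                     1, 3, 1, 0,
                     0, 1, 2, 1,
                     0, 0, 1, 1 };
    double d[4], e[3];                /* diagonal and superdiagonal of B */
    double tauq[4], taup[4];          /* Householder scalars of Q and P  */

    lapack_int info = LAPACKE_dgebrd(LAPACK_COL_MAJOR, m, n,
                                     a, m, d, e, tauq, taup);
    if (info != 0) { fprintf(stderr, "dgebrd failed: %d\n", (int)info); return 1; }

    printf("diag(B):      ");
    for (int i = 0; i < n; i++)     printf("% .4f ", d[i]);
    printf("\nsuperdiag(B): ");
    for (int i = 0; i < n - 1; i++) printf("% .4f ", e[i]);
    printf("\n");
    return 0;
}

The two-stage algorithm computes the same factorization but splits the work so that the bulk of it runs at Level 3 BLAS speed.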
ISBN (print): 9780769548647; 9781467330275
We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high-performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement, which run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) tree reduction techniques, which expose more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge system, both providing component-based energy monitoring at all levels of the system, through the PowerPack framework and the Running Average Power Limit model, respectively.
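The first strategy is what LAPACK's dsgesv routine implements: the LU factorization runs in single precision and the solution is refined back to double precision accuracy. A minimal sketch through the LAPACKE C interface, with an illustrative 3 x 3 system:

/* Mixed precision solve with iterative refinement (LAPACK's dsgesv). */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int n = 3, nrhs = 1;
    double a[9] = { 4, 1, 2,          /* column-major test matrix */
                    1, 5, 3,
                    2, 3, 6 };
    double b[3] = { 7, 9, 11 };
    double x[3];
    lapack_int ipiv[3], iter;

    lapack_int info = LAPACKE_dsgesv(LAPACK_COL_MAJOR, n, nrhs,
                                     a, n, ipiv, b, n, x, n, &iter);
    if (info != 0) { fprintf(stderr, "dsgesv failed: %d\n", (int)info); return 1; }

    /* iter > 0: converged after `iter` single precision refinement steps;
     * iter < 0: the routine fell back to a full double precision solve. */
    printf("iter = %d, x = [%g %g %g]\n", (int)iter, x[0], x[1], x[2]);
    return 0;
}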
Classical solvers for the dense symmetric eigenvalue problem suffer from the first step, a reduction to tridiagonal form that is dominated by the cost of accessing memory during the panel factorization. The solution is to reduce the matrix to a banded form, which then requires the eigenvalues of the banded matrix to be computed. The standard divide and conquer algorithm can be modified for this purpose. The paper combines this insight with tile algorithms that can be scheduled via a dynamic runtime system onto multicore architectures. A detailed analysis of performance and accuracy is included. Speedups of 14-fold and 4-fold are reported relative to LAPACK and Intel's Math Kernel Library, respectively.
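For the banded eigenproblem itself, LAPACK already exposes a divide-and-conquer band eigensolver, dsbevd. A minimal sketch for the narrowest band width (kd = 1, i.e., tridiagonal); the band storage layout and test matrix are illustrative, and this is the stock routine, not the authors' modified algorithm:

/* Divide-and-conquer eigensolver for a symmetric band matrix (dsbevd). */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int n = 4, kd = 1, ldab = 2;
    /* Lower band storage, column-major: ab[0 + 2*j] = A(j,j) and
     * ab[1 + 2*j] = A(j+1,j).  Matrix: diag = 2, off-diag = -1. */
    double ab[8] = { 2, -1,  2, -1,  2, -1,  2, 0 };
    double w[4];                      /* eigenvalues  */
    double z[16];                     /* eigenvectors */

    lapack_int info = LAPACKE_dsbevd(LAPACK_COL_MAJOR, 'V', 'L', n, kd,
                                     ab, ldab, w, z, n);
    if (info != 0) { fprintf(stderr, "dsbevd failed: %d\n", (int)info); return 1; }

    for (int i = 0; i < n; i++)
        printf("lambda[%d] = %.6f\n", i, w[i]);
    return 0;
}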
The objective of this paper is to extend, in the context of multicore architectures, the concepts of tile algorithms [Buttari et al., 2007] for Cholesky, LU, and QR factorizations to the family of two-sided factorizations. In particular, the bidiagonal reduction of a general, dense matrix is very often used as a preprocessing step for calculating the singular value decomposition. Furthermore, in the Top500 list of June 2008, 98 percent of the fastest parallel systems in the world were based on multicores. This confronts the scientific software community with both a daunting challenge and a unique opportunity. The challenge arises from the disturbing mismatch between the design of systems based on this new chip architecture (hundreds of thousands of nodes, a million or more cores, reduced bandwidth and memory available to cores) and the components of the traditional software stack, such as numerical libraries, on which scientific applications have relied for their accuracy and performance. The many-core trend has exacerbated the problem even further, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. By exploiting the concept of tile algorithms in the multicore environment (i.e., a high level of parallelism with fine granularity and a high-performance data representation combined with dynamic data-driven execution), the band bidiagonal reduction presented here achieves 94 Gflop/s on a 12,000 x 12,000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for the bidiagonal reduction is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrix to the required form.
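The tile data layout mentioned above amounts to repacking a column-major matrix so that each tile is contiguous in memory. A sketch under simplifying assumptions (square matrix, dimension divisible by the tile size); the function name and the tile-column storage order are illustrative choices, not PLASMA's actual packing code:

#include <stdio.h>
#include <string.h>

/* Repack a column-major matrix A (leading dimension lda) into square
 * nb x nb tiles; tiles are stored tile-column by tile-column, and each
 * tile is itself column-major. */
static void cm_to_tiles(int n, int nb, const double *a, int lda, double *tiles)
{
    int nt = n / nb;                              /* tiles per dimension */
    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            double *t = tiles + ((size_t)(tj * nt + ti)) * nb * nb;
            for (int j = 0; j < nb; j++)          /* copy one tile column */
                memcpy(t + (size_t)j * nb,
                       a + (size_t)(tj * nb + j) * lda + ti * nb,
                       sizeof(double) * nb);
        }
}

int main(void)
{
    enum { N = 4, NB = 2 };
    double a[N * N], tiles[N * N];
    for (int k = 0; k < N * N; k++) a[k] = k;     /* column-major fill */
    cm_to_tiles(N, NB, a, N, tiles);
    /* Element A(2,2) = 10 lives in tile (1,1), local entry (0,0). */
    printf("tile(1,1)[0] = %g\n", tiles[3 * NB * NB]);
    return 0;
}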
This paper presents the power profile of two high-performance dense linear algebra libraries, LAPACK and PLASMA. The former is based on block algorithms that use the fork-join paradigm to achieve parallel performance. The latter uses fine-grained task parallelism that recasts the computation to operate on submatrices called tiles; in this way, tile algorithms are formed. We show results from the power profiling of the most common routines, which permits us to clearly identify the different phases of the computations. This allows us to isolate the bottlenecks in terms of energy efficiency. Our results show that PLASMA surpasses LAPACK not only in terms of performance but also in terms of energy efficiency.
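On recent Linux systems, a component-level energy counter in the spirit of these measurements is exposed through the RAPL powercap interface in sysfs. A minimal sketch; the path is platform-dependent, and a real profiler must also handle the wraparound of the microjoule counter:

#include <stdio.h>

static long long read_energy_uj(const char *path)
{
    FILE *f = fopen(path, "r");
    long long uj = -1;
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    /* Package-level RAPL counter; the exact path varies by platform. */
    const char *rapl = "/sys/class/powercap/intel-rapl:0/energy_uj";

    long long e0 = read_energy_uj(rapl);
    /* ... run the computation being profiled here ... */
    long long e1 = read_energy_uj(rapl);

    if (e0 >= 0 && e1 >= e0)
        printf("package energy: %.6f J\n", (e1 - e0) * 1e-6);
    else
        printf("RAPL counter unavailable or wrapped around\n");
    return 0;
}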
The development of radar systems on general-purpose off-the-shelf parallel hardware represents an effective means of providing efficient implementations with reasonable realisation costs. However, the fulfilment of the required real-time constraints poses serious problems of performance and efficiency: parallel architectures need to be exploited to the fullest, providing scalable parallelisations able to reach the desired throughput and latency levels. In this paper we discuss the implementation issues of the computational kernel of a well-known radar filtering technique, space-time adaptive processing, on today's general-purpose parallel architectures (multi-/many-core platforms). In order to address the performance constraints imposed by the real-time implementation of this filtering technique, we apply a structured approach (structured parallel programming) to develop parallel computations as instances and compositions of well-known parallelisation patterns. This paper provides a thorough description of the implementation issues and discusses the performance peaks achievable on a broad range of existing multi-core architectures.
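The simplest of these parallelisation patterns, the map, distributes independent output cells across cores. A sketch with OpenMP; the per-cell kernel below is a placeholder dot product, not an actual STAP weight computation:

#include <stdio.h>

#define CELLS 1024   /* independent output cells (e.g. range bins) */
#define TAPS    64   /* filter length */

int main(void)
{
    static double in[CELLS][TAPS], w[TAPS], out[CELLS];

    for (int t = 0; t < TAPS; t++) w[t] = 1.0 / TAPS;
    for (int c = 0; c < CELLS; c++)
        for (int t = 0; t < TAPS; t++) in[c][t] = (double)(c + t);

    /* Map pattern: every cell is independent, so the loop parallelises
     * with no inter-iteration communication. */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < CELLS; c++) {
        double acc = 0.0;
        for (int t = 0; t < TAPS; t++)
            acc += w[t] * in[c][t];
        out[c] = acc;
    }

    printf("out[0] = %g, out[%d] = %g\n", out[0], CELLS - 1, out[CELLS - 1]);
    return 0;
}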
ISBN (print): 9783319460796; 9783319460789
The growing popularity of the Intel Xeon Phi coprocessors and the continued development of this new many-core architecture have created the need for an open-source, scalable, and cross-platform task-based dense linear algebra package that can efficiently use this type of hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra library, to run effectively on Intel's Knights Corner Xeon Phi coprocessor. First, we modified PLASMA's tiled Cholesky decomposition to use OpenMP for its scheduling mechanism to enable Xeon Phi compatibility. We then compared the performance of our modified code to that of the original dynamic scheduler running on an Intel Xeon Sandy Bridge CPU. Finally, we looked at the performance of the new OpenMP tiled Cholesky decomposition on a Knights Corner coprocessor. We found that desirable performance for this architecture was attainable with the right code optimizations; these changes were necessary to account for differences in the runtimes and in the hardware itself.
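A sketch of what an OpenMP port of the tiled Cholesky factorization looks like, with task depend clauses expressing the inter-tile dependencies. This is an illustrative reimplementation rather than PLASMA's code; the tile count, tile size, test matrix, and build line are assumptions:

/* Task-based tiled Cholesky with OpenMP task dependencies.
 * Assumed build line: cc -fopenmp chol.c -llapacke -lcblas */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

#define NT 4                               /* tiles per dimension */
#define NB 64                              /* tile size           */
#define T(i, j) (tiles + ((size_t)((j) * NT + (i))) * NB * NB)

int main(void)
{
    double *tiles = malloc(sizeof(double) * NB * NB * NT * NT);
    /* Diagonally dominant symmetric matrix: ones everywhere plus a
     * large constant on the diagonal, so dpotrf succeeds. */
    for (int j = 0; j < NT; j++)
        for (int i = 0; i < NT; i++)
            for (int jj = 0; jj < NB; jj++)
                for (int ii = 0; ii < NB; ii++)
                    T(i, j)[jj * NB + ii] =
                        (i == j && ii == jj) ? NT * NB + 1.0 : 1.0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NT; k++) {
            #pragma omp task depend(inout: T(k, k)[0])
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, T(k, k), NB);

            for (int m = k + 1; m < NT; m++) {
                #pragma omp task depend(in: T(k, k)[0]) depend(inout: T(m, k)[0])
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit, NB, NB,
                            1.0, T(k, k), NB, T(m, k), NB);
            }
            for (int m = k + 1; m < NT; m++) {
                #pragma omp task depend(in: T(m, k)[0]) depend(inout: T(m, m)[0])
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                            -1.0, T(m, k), NB, 1.0, T(m, m), NB);

                for (int i = k + 1; i < m; i++) {
                    #pragma omp task depend(in: T(m, k)[0], T(i, k)[0]) \
                                     depend(inout: T(m, i)[0])
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                NB, NB, NB, -1.0, T(m, k), NB,
                                T(i, k), NB, 1.0, T(m, i), NB);
                }
            }
        }
    }
    printf("L(0,0) first entry = %f\n", T(0, 0)[0]);
    free(tiles);
    return 0;
}

Because each task names the tiles it reads and writes, the OpenMP runtime can discover the same task graph that a dedicated dynamic scheduler would track explicitly.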
ISBN (print): 9780769546759
We present new high-performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing the overheads associated with their high-level scheduling during the so-called bulge chasing procedure, an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve the numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy; as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound, which renders them amenable to efficient execution on multicore architectures. The performance of the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvements (~130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron shared-memory system with a matrix size of 24,000 x 24,000. Last but not least, we provide a comprehensive study of the impact of the coalescing group size.
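The coalescing idea can be sketched independently of the bulge-chasing details: spawn one scheduler task per group of kernel invocations instead of one per kernel, so the scheduling overhead is amortised over the group. The kernel below is a stand-in for the actual annihilation step, and GROUP plays the role of the coalescing group size:

#include <stdio.h>

#define N     4096
#define GROUP    8        /* coalescing group size */

static void fine_grained_kernel(double *x, int i)
{
    x[i] = x[i] * 0.5 + 1.0;   /* placeholder for one annihilation step */
}

int main(void)
{
    static double x[N];
    #pragma omp parallel
    #pragma omp single
    {
        for (int base = 0; base < N; base += GROUP) {
            /* One task per group of GROUP kernels, not one per kernel. */
            #pragma omp task firstprivate(base)
            {
                for (int i = base; i < base + GROUP; i++)
                    fine_grained_kernel(x, i);
            }
        }
    }
    printf("x[0] = %g\n", x[0]);
    return 0;
}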
The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks target three recent multicore and manycore architectures: Intel Xeon, Intel Xeon Phi, and IBM POWER8 processors.