Search results - Inner Mongolia University Library

High-Performance Bidiagonal Reduction Using Tile Algorithms on Homogeneous Multicore Architectures

ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2013, vol. 39, no. 3, pp. 1-22

Authors: Ltaief, Hatem; Luszczek, Piotr; Dongarra, Jack. Affiliations: KAUST Supercomp Lab, Thuwal, Saudi Arabia; Univ Tennessee, Dept Elect Engn & Comp Sci, Innovat Comp Lab, Knoxville, TN 37996, USA; Oak Ridge Natl Lab, Oak Ridge, TN, USA; Univ Manchester, Manchester M13 9PL, Lancs, England

This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. It extends the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time; the tile size also determines the matrix bandwidth after the first-stage reduction. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 x 12000 matrix against state-of-the-art open source and commercial numerical software packages, namely LAPACK compiled with optimized, multithreaded BLAS from MKL, as well as Intel MKL version 10.2.
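
The relationship between column-major storage and tile data layout, mentioned in features (1) and (3) of the abstract above, can be illustrated with a small index mapping. This is only a sketch under stated assumptions, not the authors' translation layer: the function names `tile_offset` and `to_tile_layout` are hypothetical, tiles are square of size `nb`, `nb` is assumed to divide both matrix dimensions, and tiles are assumed to be ordered column by column.

```python
def tile_offset(i, j, m, nb):
    """Offset of element (i, j) in tile layout (hypothetical helper).

    Assumes nb x nb tiles, each stored contiguously in column-major order,
    with the tiles themselves ordered column by column over the tile grid.
    """
    mt = m // nb                 # number of tile rows
    ti, tj = i // nb, j // nb    # which tile holds the element
    li, lj = i % nb, j % nb      # position inside that tile
    tile_id = ti + tj * mt
    return tile_id * nb * nb + li + lj * nb


def to_tile_layout(a, m, n, nb):
    """Copy a column-major m x n matrix (flat list) into tile layout."""
    if m % nb or n % nb:
        raise ValueError("this sketch assumes nb divides both m and n")
    out = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            out[tile_offset(i, j, m, nb)] = a[i + j * m]
    return out


if __name__ == "__main__":
    # 4 x 4 matrix whose entries equal their column-major offsets.
    print(to_tile_layout(list(range(16)), 4, 4, 2))
```

With this ordering each tile occupies one contiguous block of memory, which is the property the abstract credits for the efficient data representation.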

Keywords: algorithms; performance; bidiagonal reduction; tile algorithms; two-stage approach; bulge chasing; data translation layer; high performance kernels; dynamic scheduling

Energy Footprint of Advanced Dense Numerical Linear Algebra Using Tile Algorithms on Multicore Architectures

2nd International Conference on Cloud and Green Computing / 2nd International Conference on Social Computing and its Applications (CGC/SCA)

Authors: Dongarra, Jack; Ltaief, Hatem; Luszczek, Piotr; Weaver, Vincent M. Affiliations: Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996, USA; Oak Ridge Natl Lab, Comp Sci & Math Div, Oak Ridge, TN, USA; Univ Manchester, Sch Math, Sch Comp Sci, Manchester, NH, USA; Natl Sci Fdn & Microsoft Research, Manchester, NH, USA; KAUST Supercomputing Lab, Thuwal, Saudi Arabia

ISBN: (print) 9780769548647; 9781467330275

We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement, which run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) a tree reduction technique, which exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the PowerPack framework and the Running Average Power Limit model, respectively.
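
Strategy (1) in the abstract above, mixed precision with iterative refinement, can be sketched in a few lines. This is an illustration under stated assumptions, not the PLASMA implementation: single precision is emulated by rounding every intermediate result through `struct`, the matrix is assumed nonsingular and well conditioned, and the names `f32`, `lu_f32`, `solve_f32`, and `refine` are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_f32(a):
    """LU with partial pivoting; every result is rounded to single precision."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        lu[k], lu[p] = lu[p], lu[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])  # assumes a nonzero pivot
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, piv


def solve_f32(lu, piv, b):
    """Solve with the single-precision factors (forward, then back substitution)."""
    n = len(b)
    y = [f32(b[p]) for p in piv]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def refine(a, b, iters=5):
    """Factor once in low precision, then correct the solution in double."""
    lu, piv = lu_f32(a)
    x = solve_f32(lu, piv, b)
    for _ in range(iters):
        # The residual is computed in double precision; this is the step
        # that recovers the accuracy lost in the low-precision solve.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(a, b)]
        d = solve_f32(lu, piv, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

The expensive factorization is done once in the cheaper precision; each refinement step only needs a matrix-vector product and two triangular solves.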

Keywords: Power Consumption; Dense Linear Algebra; Mixed Precision algorithms; Tree Reduction; tile algorithms; Dynamic Scheduling; PowerPack; RAPL

TOWARD A HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM

SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2012, vol. 34, no. 6, pp. C249-C274

Authors: Haidar, Azzam; Ltaief, Hatem; Dongarra, Jack. Affiliations: Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996, USA; KAUST Supercomp Lab, Thuwal, Saudi Arabia; Oak Ridge Natl Lab, Div Math & Comp Sci, Oak Ridge, TN, USA; Univ Manchester, Sch Math, Manchester, Lancs, England; Univ Manchester, Sch Comp Sci, Manchester, Lancs, England

Classical solvers for the dense symmetric eigenvalue problem suffer from the first step, which involves a reduction to tridiagonal form that is dominated by the cost of accessing memory during the panel factorization. The solution is to reduce the matrix to a banded form, which then requires the eigenvalues of the banded matrix to be computed. The standard divide and conquer algorithm can be modified for this purpose. The paper combines this insight with tile algorithms that can be scheduled via a dynamic runtime system on multicore architectures. A detailed analysis of performance and accuracy is included. Speedups of 14-fold and 4-fold are reported relative to LAPACK and Intel's Math Kernel Library, respectively.

Keywords: divide and conquer; symmetric eigenvalue solver; tile algorithms; dynamic scheduling

Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2010, vol. 21, no. 4, pp. 417-423

Authors: Ltaief, Hatem; Kurzak, Jakub; Dongarra, Jack. Affiliation: Univ Tennessee, Dept Elect Engn & Comp Sci, Innovat Comp Lab, Knoxville, TN 37996, USA

The objective of this paper is to extend, in the context of multicore architectures, the concepts of tile algorithms [Buttari et al., 2007] for Cholesky, LU, and QR factorizations to the family of two-sided factorizations. In particular, the bidiagonal reduction of a general, dense matrix is very often used as a preprocessing step for calculating the Singular Value Decomposition. Furthermore, in the Top500 list of June 2008, 98 percent of the fastest parallel systems in the world were based on multicores. This confronts the scientific software community with both a daunting challenge and a unique opportunity. The challenge arises from the disturbing mismatch between the design of systems based on this new chip architecture (hundreds of thousands of nodes, a million or more cores, reduced bandwidth and memory available to cores) and the components of the traditional software stack, such as numerical libraries, on which scientific applications have relied for their accuracy and performance. The many-core trend has exacerbated the problem even further, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. By exploiting the concept of tile algorithms in the multicore environment (i.e., high level of parallelism with fine granularity and high-performance data representation combined with a dynamic data-driven execution), the band bidiagonal reduction presented here achieves 94 Gflop/s on a 12,000 x 12,000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for the bidiagonal reduction is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrix to the required form.

Keywords: Bidiagonal reduction; singular value decomposition; tile algorithms; multicores

Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency

COMPUTER SCIENCE-RESEARCH AND DEVELOPMENT, 2012, vol. 27, no. 4, pp. 277-287

Authors: Ltaief, Hatem; Luszczek, Piotr; Dongarra, Jack. Affiliations: KAUST Supercomp Lab, Thuwal, Saudi Arabia; Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996, USA

This paper presents the power profile of two high performance dense linear algebra libraries, LAPACK and PLASMA. The former is based on block algorithms that use the fork-join paradigm to achieve parallel performance. The latter uses fine-grained task parallelism that recasts the computation to operate on submatrices called tiles, which gives rise to tile algorithms. We show results from the power profiling of the most common routines, which permits us to clearly identify the different phases of the computations. This allows us to isolate the bottlenecks in terms of energy efficiency. Our results show that PLASMA surpasses LAPACK not only in terms of performance but also in terms of energy efficiency.
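
To make the link between a power profile and energy efficiency concrete, the following sketch integrates sampled power over time to obtain energy and then derives an efficiency figure. It is an illustration only: the function names and the sample values are hypothetical and are not measurements from the paper.

```python
def energy_joules(samples):
    """Integrate (time_seconds, power_watts) samples with the trapezoid rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def gflops_per_watt(flop_count, samples):
    """Energy efficiency: Gflop per joule, which equals Gflop/s per watt."""
    return flop_count / energy_joules(samples) / 1e9


if __name__ == "__main__":
    # Hypothetical trace: a flat phase followed by a ramp in power draw.
    trace = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]
    print(energy_joules(trace), gflops_per_watt(5e11, trace))
```

A routine that finishes sooner can consume less energy even when its instantaneous power draw is higher, so performance and energy efficiency have to be compared separately.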

Keywords: Power profile; Energy efficiency; Dense linear algebra; tile algorithms; Multicore architectures

Performance analysis and structured parallelisation of the space-time adaptive processing computational kernel on multi-core architectures

INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS, 2014, vol. 29, no. 5, pp. 460-498

Authors: Buono, Daniele; Mencagli, Gabriele; Pascucci, Alessio; Vanneschi, Marco. Affiliation: Univ Pisa, Dept Comp Sci, Largo B Pontecorvo 3, I-56127 Pisa, Italy

The development of radar systems on general-purpose off-the-shelf parallel hardware represents an effective means of providing efficient implementations with reasonable realisation costs. However, the fulfilment of the required real-time constraints poses serious problems of performance and efficiency: parallel architectures need to be exploited to the fullest, providing scalable parallelisations able to reach the desired throughput and latency levels. In this paper we discuss the implementation issues of the computational kernel of a well-known radar filtering technique, space-time adaptive processing, on today's general-purpose parallel architectures (multi-/many-core platforms). In order to address the performance constraints imposed by the real-time implementation of this filtering technique, we apply a structured approach (structured parallel programming) to develop parallel computations as instances and compositions of well-known parallelisation patterns. This paper provides a thorough description of the implementation issues and discusses the performance peaks achievable on a broad range of existing multi-core architectures.

Keywords: space-time adaptive processing; tile algorithms; multi-core architectures; structured parallel programming; performance evaluation

A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction

26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) / Workshop on High Performance Data Intensive Computing

Authors: Haidar, Azzam; Ltaief, Hatem; Luszczek, Piotr; Dongarra, Jack. Affiliations: Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996, USA; KAUST Supercomput Lab, Thuwal, Saudi Arabia

ISBN: (print) 9780769546759

We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound, which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (approximately 130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 x 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms o

Keywords: Bidiagonal Reduction; Two-Stage Approach; tile algorithms; Bulge Chasing; Granularity Analysis; Power Profiling; Dynamic Scheduling

Task-Based Cholesky Decomposition on Knights Corner Using OpenMP

International Supercomputing Conference (ISC High Performance)

Authors: Dorris, Joseph; Kurzak, Jakub; Luszczek, Piotr; YarKhan, Asim; Dongarra, Jack. Affiliation: Innovat Comp Lab, Knoxville, TN 37996, USA

ISBN: (print) 9783319460796; 9783319460789

The growing popularity of the Intel Xeon Phi coprocessors and the continued development of this new many-core architecture have created the need for an open-source, scalable, and cross-platform task-based dense linear algebra package that can efficiently use this type of hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra library, to run effectively on Intel's Knights Corner Xeon Phi coprocessor. First, we modified PLASMA's tiled Cholesky decomposition to use OpenMP for its scheduling mechanism to enable Xeon Phi compatibility. We then compared the performance of our modified code to that of the original dynamic scheduler running on an Intel Xeon Sandy Bridge CPU. Finally, we looked at the performance of the new OpenMP tiled Cholesky decomposition on a Knights Corner coprocessor. We found that desirable performance for this architecture was attainable with the right code optimizations; these changes were necessary to account for differences in the runtimes and in the hardware itself.
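
The task structure that a scheduler has to respect in a tiled Cholesky decomposition can be sketched by listing the tile kernels and the tiles they touch. This is a simplified illustration, not PLASMA's code: it uses the standard right-looking ordering and the usual LAPACK/BLAS kernel names (POTRF, TRSM, SYRK, GEMM), performs no arithmetic, and the function names `cholesky_tasks` and `dependencies` are hypothetical.

```python
def cholesky_tasks(nt):
    """List the tasks of a right-looking tile Cholesky on an nt x nt tile grid.

    Each task is (kernel, tiles_read, tile_written); tiles are (row, col)
    pairs in the lower triangle.
    """
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", [], (k, k)))                 # factor diagonal tile
        for m in range(k + 1, nt):
            tasks.append(("TRSM", [(k, k)], (m, k)))        # solve panel tiles
        for m in range(k + 1, nt):
            tasks.append(("SYRK", [(m, k)], (m, m)))        # update diagonal tile
            for n in range(k + 1, m):
                tasks.append(("GEMM", [(m, k), (n, k)], (m, n)))  # update off-diagonal
    return tasks


def dependencies(tasks):
    """For each task, the indices of earlier tasks it must wait for.

    Tracks the last writer of every tile (read-after-write and
    write-after-write). Write-after-read hazards do not arise in this task
    list, because a tile is only read after its final write.
    """
    last_writer = {}
    deps = []
    for idx, (_, reads, write) in enumerate(tasks):
        needed = {last_writer[t] for t in reads + [write] if t in last_writer}
        deps.append(sorted(needed))
        last_writer[write] = idx
    return deps


if __name__ == "__main__":
    tasks = cholesky_tasks(3)
    for task, dep in zip(tasks, dependencies(tasks)):
        print(task[0], task[2], "waits for", dep)
```

Declaring which tiles each task reads and writes is the same information that OpenMP task dependence clauses express, which is why a tiled algorithm maps naturally onto such a runtime.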

Keywords: Cholesky decomposition; Linear algebra; OpenMP; PLASMA; Task-based programming; tile algorithms; Xeon Phi

PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP

ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2019, vol. 45, no. 2, pp. 16-16

Authors: Dongarra, Jack; Gates, Mark; Haidar, Azzam; Kurzak, Jakub; Luszczek, Piotr; Wu, Panruo; Yamazaki, Ichitaro; Yarkhan, Asim; Abalenkovs, Maksims; Bagherpour, Negin; Hammarling, Sven; Sistek, Jakub; Stevens, David; Zounon, Mawussi; Relton, Samuel D. Affiliations: Univ Tennessee, Dept Elect Engn & Comp Sci, 1122 Volunteer Blvd, Suite 203, Knoxville, TN 37996, USA; Univ Houston, Dept Comp Sci, 3551 Cullen Blvd, Houston, TX 77204, USA; Univ Manchester, Sch Math, Manchester M13 9PL, Lancs, England; Czech Acad Sci, Inst Math, Zitna 25, Prague 11567, Czech Republic; Numer Algorithms Grp Manchester, One53 Portland St, Manchester M1 3LD, Lancs, England; Univ Leeds, Inst Hlth Sci, Leeds LS2 9LJ, W Yorkshire, England

The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks target three recent multicore and manycore architectures, namely Intel Xeon, Intel Xeon Phi, and IBM POWER 8 processors.

Keywords: Numerical linear algebra libraries; tile algorithms; task-based programming; multicore processors; OpenMP; PLASMA
