Search results - Inner Mongolia University Library

Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation

JOURNAL OF COMPUTATIONAL SCIENCE, 2013, Vol. 4, No. 6, pp. 480-488

Authors: Gansterer, Wilfried N.; Niederbrucker, Gerhard; Strakova, Hana; Grotthoff, Stefan Schulze. Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1090 Vienna, Austria

The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to failures compared to existing aggregation methods. It is illustrated that on a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and that it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method rdmGS, which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms. (C) 2013 Elsevier B.V. All rights reserved.
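As a rough illustration of what "orthogonalization built on top of data aggregation" can mean, the sketch below runs modified Gram-Schmidt on a matrix whose rows are split across nodes, so that every norm and inner product becomes one sum-reduction over per-node partial results. It is not the rdmGS method from the paper: the function names `all_reduce_sum` and `distributed_mgs`, the row-block layout, and the use of an exact sum in place of a randomized aggregation are all assumptions made for this example.

```python
import math


def all_reduce_sum(local_values):
    """Stand-in for a distributed sum (assumption: exact and failure-free).

    A gossip-based aggregation routine could be passed in instead.
    """
    return math.fsum(local_values)


def distributed_mgs(blocks, reduce_sum=all_reduce_sum):
    """Modified Gram-Schmidt on a matrix whose rows are split across nodes.

    blocks[p] is the list of rows held by hypothetical node p.
    Returns (q_blocks, r): the row blocks of Q and the upper triangular R.
    """
    n_cols = len(blocks[0][0])
    q = [[row[:] for row in block] for block in blocks]  # work on copies
    r = [[0.0] * n_cols for _ in range(n_cols)]
    for j in range(n_cols):
        # Norm of column j: local sums of squares, then one reduction.
        r[j][j] = math.sqrt(reduce_sum([sum(row[j] ** 2 for row in b) for b in q]))
        for b in q:
            for row in b:
                row[j] /= r[j][j]
        for k in range(j + 1, n_cols):
            # Inner product of columns j and k: local partial sums, one reduction.
            r[j][k] = reduce_sum([sum(row[j] * row[k] for row in b) for b in q])
            for b in q:
                for row in b:
                    row[k] -= r[j][k] * row[j]
    return q, r


if __name__ == "__main__":
    # A 4x2 matrix split over two hypothetical nodes, two rows each.
    q_blocks, r = distributed_mgs([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 9.0]]])
    print(q_blocks)
    print(r)
```

Because the reduction is passed in as a parameter, the accuracy of the factorization depends directly on the accuracy of whatever aggregation routine is plugged in.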

Keywords: Distributed reduction operation; push-flow algorithm; Distributed orthogonalization; Distributed matrix computations; Fault tolerant matrix computations


Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2019, Vol. 33, No. 2, pp. 366-383

Authors: Casas, Marc; Gansterer, Wilfried N.; Wimmer, Elias. Barcelona Supercomp Ctr, Barcelona, Spain; Univ Vienna, Fac Comp Sci, Res Grp Theory & Applicat Algorithms, Vienna, Austria; TU Wien, Fac Informat, Res Grp Parallel Comp, Vienna, Austria

We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.
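For readers unfamiliar with gossip-based reduction, the following is a minimal simulation of the classical push-sum scheme, the baseline that such algorithms usually start from. It is not one of the new algorithms proposed in the article. The fully connected network, the synchronous rounds, and the function name `push_sum_average` are assumptions made for this sketch.

```python
import random


def push_sum_average(values, rounds, seed=0):
    """Classical push-sum gossip on a simulated fully connected network.

    Each node holds a pair (s, w). In every round each node keeps half of
    its pair and pushes the other half to a uniformly random other node.
    Returns the per-node estimates s / w after the given number of rounds.
    """
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        inbox_s = [0.0] * n
        inbox_w = [0.0] * n
        for i in range(n):
            target = rng.choice([j for j in range(n) if j != i])
            s[i] *= 0.5
            w[i] *= 0.5
            inbox_s[target] += s[i]
            inbox_w[target] += w[i]
        # Deliver all messages of this round at once (synchronous model).
        for i in range(n):
            s[i] += inbox_s[i]
            w[i] += inbox_w[i]
    return [s[i] / w[i] for i in range(n)]


if __name__ == "__main__":
    data = list(range(16))
    true_mean = sum(data) / len(data)
    for rounds in (5, 20, 80):
        worst = max(abs(e - true_mean) for e in push_sum_average(data, rounds))
        print(rounds, worst)
```

Every node carries a usable estimate after each round, and the estimate only gets better with more rounds. This is why a loose accuracy target can be met with few rounds, which is relevant to the abstract's remark about low accuracy requirements.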

Keywords: All-to-all reduction; all-reduce; gossip algorithm; fault tolerance; bit-flip; silent data corruption; recursive doubling; push-flow algorithm


Improving Fault Tolerance and Accuracy of a Distributed Reduction Algorithm


25th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Authors: Niederbrucker, Gerhard; Strakova, Hana; Gansterer, Wilfried N. Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1010 Vienna, Austria

ISBN: (print) 9780769549569; 9781467362184

Most existing algorithms for parallel or distributed reduction operations are not able to handle temporary or permanent link and node failures. Only recently, methods were proposed which are in principle capable of tolerating link and node failures as well as soft errors like bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that in this method the failure recovery often implies severe performance drawbacks. Existing mechanisms for failure handling may basically lead to a fall-back to an early stage of the computation and consequently slow down convergence or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance and fault tolerance over all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
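The abstract does not spell out how a flow-based reduction tolerates message loss, so the sketch below shows one common reading of the idea. Each node stores a flow per neighbour and always transmits the whole flow rather than an increment, which lets a later delivery on the same edge repair an earlier loss. The update rule, the function name `flow_gossip_average`, the fully connected network, and the loss model are assumptions made for illustration. This is not the paper's push-flow or push-cancel-flow implementation and does not include the PCF enhancement.

```python
import random


def flow_gossip_average(values, lossy_steps, clean_steps, loss_prob, seed=0):
    """Flow-based gossip averaging with lost messages (simulated).

    Node i stores its initial pair v[i] = (value, 1.0) and one flow pair
    flow[i][j] per neighbour j. Its current estimate pair is
    v[i] - sum_j flow[i][j]. A sender adds half of its estimate to the flow
    on the chosen edge and transmits the whole flow. The receiver overwrites
    its own flow on that edge with the negated value.
    """
    rng = random.Random(seed)
    n = len(values)
    v = [(float(x), 1.0) for x in values]
    flow = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]

    def estimate(i):
        return [v[i][c] - sum(flow[i][j][c] for j in range(n)) for c in range(2)]

    for step in range(lossy_steps + clean_steps):
        i = rng.randrange(n)  # one node is activated per step
        k = rng.choice([j for j in range(n) if j != i])
        e = estimate(i)
        for c in range(2):
            flow[i][k][c] += 0.5 * e[c]
        # Messages can only be lost during the first lossy_steps steps.
        lost = step < lossy_steps and rng.random() < loss_prob
        if not lost:
            for c in range(2):
                flow[k][i][c] = -flow[i][k][c]
    pairs = [estimate(i) for i in range(n)]
    return [p[0] / p[1] for p in pairs]


if __name__ == "__main__":
    data = [3.0, 9.0, 1.0, 7.0, 5.0, 11.0, 2.0, 10.0]
    print(flow_gossip_average(data, lossy_steps=400, clean_steps=4000, loss_prob=0.3))
```

In this sketch the flows themselves keep changing for as long as the algorithm runs, and each estimate is formed by subtracting them from the initial value. That subtraction is one plausible source of the accuracy problems the abstract says can occur even in failure-free systems.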

Keywords: push-cancel-flow algorithm; push-flow algorithm; push-sum algorithm
