Search results - Inner Mongolia University Library

Fast and scalable startup of MPI programs in InfiniBand clusters

11th International Conference on High Performance Computing, HiPC 2004

Authors: Yu, Weikuan; Wu, Jiesheng; Panda, Dhabaleswar K. (Network-Based Computing Lab, Dept. of Computer Science and Engineering, The Ohio State University, United States)

ISBN: (print) 9783540241294

One of the major challenges in parallel computing over large-scale clusters is fast and scalable process startup, which typically can be divided into two phases: process initiation and connection setup. In this paper, we characterize the startup of MPI programs in InfiniBand clusters and identify two startup scalability issues: serialized process initiation in the initiation phase and high communication overhead in the connection setup phase. To reduce the connection setup time, we have developed one approach with data reassembly to reduce data volume, and another with a bootstrap channel to parallelize the communication. Furthermore, a process management framework, the Multi-Purpose Daemons (MPD) system, is exploited to speed up process initiation. Our experimental results show that job startup time has been improved by more than 4 times for 128-process jobs, and the improvement can be more than two orders of magnitude for 2048-process jobs, as suggested by our analytical models. © Springer-Verlag Berlin Heidelberg 2004.

Keywords: Artificial intelligence

Design and implementation of Open MPI over Quadrics/Elan4

19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005

Authors: Yu, Weikuan; Woodall, Tim S.; Graham, Rich L.; Panda, Dhabaleswar K. (Network-Based Computing Lab., Dept. of Computer Sci. and Engineering, Ohio State University; Advanced Computing Laboratory, Computer and Computation Sci. Division, Los Alamos National Laboratory)

ISBN: (print) 0769523129

Open MPI is a project recently initiated to provide a fault-tolerant, multi-network capable implementation of MPI-2 [16], based on experiences gained from the FT-MPI [7], LA-MPI [10], LAM/MPI [23], and MVAPICH [18] projects. Its initial communication architecture is layered on top of TCP/IP. In this paper, we have designed and implemented the Open MPI point-to-point layer on top of a high-end interconnect, Quadrics/Elan4 [21]. The restriction of the Quadrics static process model has been overcome to accommodate the requirement of MPI-2 dynamic process management. Quadrics Queue-based Direct Memory Access (QDMA) and Remote Direct Memory Access (RDMA) mechanisms have been integrated to form a low-overhead, high-performance transport layer. Light-weight asynchronous progress is made possible with a combination of Quadrics chained event and QDMA mechanisms. Experimental results indicate that the resulting point-to-point transport layer is able to achieve performance comparable to Quadrics native QDMA operations, from which it is derived. Our implementation provides an MPI-2 compliant message passing library over Quadrics/Elan4 with performance comparable to MPICH-Quadrics.

Keywords: Multiprocessing systems

Efficient and scalable barrier over Quadrics and Myrinet with a new NIC-based collective message passing protocol

Proceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)

Authors: Yu, Weikuan; Buntinas, Darius; Graham, Rich L.; Panda, Dhabaleswar K. (Network-Based Computing Lab., Dept. of Computer Science, Ohio State University; Argonne National Laboratory, Mathematics and Computer Science, Argonne, IL 60439; Los Alamos National Laboratory, Advanced Computing Laboratory, Los Alamos, NM 87545)

ISBN: (print) 0769521320

Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from the host CPU. In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we have demonstrated that much of the communication processing can be greatly simplified with this collective protocol. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high-performance interconnects, Quadrics and Myrinet. Our evaluation shows that, over a Quadrics cluster of 8 nodes with an Elan3 network, the NIC-based barrier operation achieves a barrier latency of only 5.60 μs. This result is an improvement by a factor of 2.48 over the Elanlib tree-based barrier operation. Over a Myrinet cluster of 8 nodes with LANai-XP NIC cards, a barrier latency of 14.20 μs is achieved. This is an improvement by a factor of 2.64 over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13 μs latency over Quadrics and with 38.94 μs latency over Myrinet. These results indicate the potential for developing high-performance communication subsystems for next-generation clusters.
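As a quick sanity check on the 8-node figures quoted above, the baseline latencies implied by the reported improvement factors can be worked out directly. The sketch below is only arithmetic on the numbers in the abstract. It assumes that the improvement factor means baseline latency divided by NIC-based latency, which the abstract does not state explicitly, and the function name is hypothetical.

```python
def implied_baseline_us(nic_latency_us: float, improvement_factor: float) -> float:
    """Baseline latency implied by a NIC-based latency and its improvement factor.

    Assumption: factor = baseline latency / NIC-based latency.
    """
    return nic_latency_us * improvement_factor


# 8-node figures quoted in the abstract
quadrics_baseline = implied_baseline_us(5.60, 2.48)  # Elanlib tree-based barrier
myrinet_baseline = implied_baseline_us(14.20, 2.64)  # host-based barrier

print(f"Implied Elanlib tree-based barrier latency: {quadrics_baseline:.2f} us")
print(f"Implied host-based Myrinet barrier latency: {myrinet_baseline:.2f} us")
```

Under that assumption, the implied baselines are roughly 13.9 μs on Quadrics and 37.5 μs on Myrinet. These are derived values, not measurements reported in the record.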

Keywords: Network protocols

Fast and scalable startup of MPI programs in InfiniBand clusters

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2004, vol. 3296, pp. 440-449

Authors: Yu, Weikuan; Wu, Jiesheng; Panda, D.K. (Network-Based Computing Lab, Dept. of Computer Science and Engineering, Ohio State University, United States)

Keywords: Artificial intelligence

Scalable, high-performance NIC-based all-to-all broadcast over Myrinet/GM

IEEE International Conference on Cluster Computing

Authors: Weikuan Yu; D.K. Panda; D. Buntinas (Network-Based Computing Lab, Department of Computer Science and Engineering, Ohio State University, USA; Mathematics and Computer Science Division, Argonne National Laboratory, USA)

All-to-all broadcast is one of the common collective operations that involve dense communication between all processes in a parallel program. Previously, programmable network interface cards (NICs) have been leveraged to efficiently support collective operations, including barrier, broadcast, and reduce. This work explores the characteristics of all-to-all broadcast and proposes new algorithms to exploit the potential advantages of NIC programmability. Along with these algorithms, salient strategies have been used to provide scalable topology management, global buffer management, efficient communication processing, and message reliability. The algorithms have been incorporated into a NIC-based collective protocol over Myrinet/GM. The NIC-based all-to-all broadcast operations improve all-to-all broadcast bandwidth over 16 nodes by a factor of 3, compared to host-based all-to-all broadcast operation. Furthermore, the NIC-based operations have been demonstrated to achieve better scalability to large systems and very low host CPU utilization.
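To make the operation itself concrete: in an all-to-all broadcast (often called allgather), every process contributes one item and every process ends up with all items. The sketch below simulates a generic ring-based version in plain Python on the host. It is an illustration of the collective only, not the NIC-based algorithms of the paper, and the function name `ring_allgather` is hypothetical.

```python
def ring_allgather(local_values):
    """Simulate all-to-all broadcast (allgather) over a logical ring.

    local_values[i] is the single item contributed by process i.
    Returns, for each process, the list of all items in rank order.
    """
    n = len(local_values)
    # Each process starts knowing only its own contribution.
    gathered = [{rank: value} for rank, value in enumerate(local_values)]
    # The (source rank, item) pair each process forwards next; initially its own.
    in_flight = list(enumerate(local_values))
    for _ in range(n - 1):
        # Every process receives the in-flight item of its left neighbour.
        received = [in_flight[(rank - 1) % n] for rank in range(n)]
        for rank, (src, value) in enumerate(received):
            gathered[rank][src] = value
        # In the next step, each process forwards what it just received.
        in_flight = received
    return [[g[src] for src in range(n)] for g in gathered]


result = ring_allgather(["a", "b", "c", "d"])
for rank, items in enumerate(result):
    print(f"process {rank} holds {items}")
```

With n processes the ring needs n - 1 steps, and in each step every process sends and receives exactly one item, which is the dense communication pattern the abstract refers to.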

Keywords: Broadcasting; Topology; Bandwidth; Protocols; Computer science; Network interfaces; Scalability; Clustering algorithms; Computer networks; Concurrent computing

Efficient and scalable barrier over Quadrics and Myrinet with a new NIC-based collective message passing protocol

International Symposium on Parallel and Distributed Processing (IPDPS)

Authors: W. Yu; D. Buntinas; R.L. Graham; D.K. Panda (Network-Based Computing Lab, Department of Computer and Info. Science, Ohio State University, USA; Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, USA; Los Alamos National Laboratory, Advanced Computing Laboratory, Los Alamos, NM, USA)

Summary form only given. Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from the host CPU. We explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we have demonstrated that much of the communication processing can be greatly simplified with this collective protocol. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high-performance interconnects, Quadrics and Myrinet. Our evaluation shows that, over a Quadrics cluster of 8 nodes with an Elan3 network, the NIC-based barrier operation achieves a barrier latency of only 5.60 μs. This result is an improvement by a factor of 2.48 over the Elanlib tree-based barrier operation. Over a Myrinet cluster of 8 nodes with LANai-XP NIC cards, a barrier latency of 14.20 μs is achieved. This is an improvement by a factor of 2.64 over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13 μs latency over Quadrics and with 38.94 μs latency over Myrinet. These results indicate the potential for developing high-performance communication subsystems for next-generation clusters.

Keywords: Message passing; Protocols; Delay; Hardware; Broadcasting; Laboratories; Computer networks; Network interfaces; Mathematics; Computer science
