One of the major challenges in parallel computing over large scale clusters is fast and scalable process startup, which typically can be divided into two phases: process initiation and connection setup. In this paper,...
ISBN: (print) 0769523129
Open MPI is a project recently initiated to provide a fault-tolerant, multi-network capable implementation of MPI-2 [16], based on experiences gained from the FT-MPI [7], LA-MPI [10], LAM/MPI [23], and MVAPICH [18] projects. Its initial communication architecture is layered on top of TCP/IP. In this paper, we have designed and implemented the Open MPI point-to-point layer on top of a high-end interconnect, Quadrics/Elan4 [21]. The restriction of the Quadrics static process model has been overcome to accommodate the requirement of MPI-2 dynamic process management. The Quadrics Queue-based Direct Memory Access (QDMA) and Remote Direct Memory Access (RDMA) mechanisms have been integrated to form a low-overhead, high-performance transport layer. Light-weight asynchronous progress is made possible by combining the Quadrics chained event and QDMA mechanisms. Experimental results indicate that the resulting point-to-point transport layer achieves performance comparable to the Quadrics native QDMA operations from which it is derived. Our implementation provides an MPI-2 compliant message passing library over Quadrics/Elan4 with performance comparable to MPICH-Quadrics.
ISBN: (print) 0769521320
Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from the host CPU. In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we have demonstrated that much of the communication processing can be greatly simplified with this collective protocol. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high performance interconnects, Quadrics and Myrinet. Our evaluation shows that, over a Quadrics cluster of 8 nodes with an Elan3 network, the NIC-based barrier operation achieves a barrier latency of only 5.60μs, an improvement by a factor of 2.48 over the Elanlib tree-based barrier operation. Over a Myrinet cluster of 8 nodes with LANai-XP NICs, a barrier latency of 14.20μs is achieved, an improvement by a factor of 2.64 over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13μs latency over Quadrics and with 38.94μs latency over Myrinet. These results indicate the potential for developing high performance communication subsystems for next generation clusters.
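The baseline latencies are not stated in the abstract, but they can be estimated from the figures it does give. The short Python sketch below does that arithmetic under one assumption that the abstract does not spell out: that a "factor of improvement" means the baseline latency divided by the NIC-based latency. The dictionary and function names are illustrative only.

```python
# Figures quoted in the abstract, both for 8-node clusters:
# (NIC-based barrier latency in microseconds, reported improvement factor).
reported = {
    "Quadrics/Elan3 vs. Elanlib tree-based barrier": (5.60, 2.48),
    "Myrinet/LANai-XP vs. host-based barrier": (14.20, 2.64),
}


def implied_baseline_us(nic_latency_us, factor):
    """Baseline latency implied by the ASSUMED definition
    factor = baseline latency / NIC-based latency."""
    return nic_latency_us * factor


# Estimated (not reported) baseline latencies, keyed like `reported`.
implied = {
    name: implied_baseline_us(latency, factor)
    for name, (latency, factor) in reported.items()
}

for name, baseline in implied.items():
    print(f"{name}: implied baseline of about {baseline:.2f} us")
```

Under that assumption the baselines come out at roughly 13.9μs for the Elanlib tree-based barrier and 37.5μs for the host-based Myrinet barrier; these are derived estimates, not measurements reported in the abstract.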
All-to-all broadcast is one of the common collective operations that involve dense communication between all processes in a parallel program. Previously, programmable network interface cards (NICs) have been leveraged to efficiently support collective operations, including barrier, broadcast, and reduce. This work explores the characteristics of all-to-all broadcast and proposes new algorithms to exploit the potential advantages of NIC programmability. Along with these algorithms, several strategies have been used to provide scalable topology management, global buffer management, efficient communication processing, and message reliability. The algorithms have been incorporated into a NIC-based collective protocol over Myrinet/GM. The NIC-based all-to-all broadcast operations improve all-to-all broadcast bandwidth over 16 nodes by a factor of 3 compared to the host-based all-to-all broadcast operation. Furthermore, the NIC-based operations have been demonstrated to achieve better scalability to large systems and very low host CPU utilization.
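To make the communication pattern concrete, the sketch below simulates an all-to-all broadcast (what MPI calls an allgather) using a textbook ring algorithm in Python. It is not the NIC-based algorithm proposed in the paper, which the abstract does not describe; the function name `ring_allgather` and the in-memory "processes" are assumptions made purely for illustration.

```python
def ring_allgather(contributions):
    """Simulate a ring all-to-all broadcast among len(contributions) processes.

    contributions[i] is the block initially owned by process i. Returns one
    buffer per process, each holding every block in rank order.
    """
    n = len(contributions)
    buffers = [[None] * n for _ in range(n)]
    for rank, block in enumerate(contributions):
        buffers[rank][rank] = block

    # In step s, process r forwards the block that originated at rank
    # (r - s) mod n to its right-hand neighbour (r + 1) mod n.
    for step in range(n - 1):
        # Collect all sends first so every process sends "simultaneously".
        sends = [
            ((r + 1) % n, (r - step) % n, buffers[r][(r - step) % n])
            for r in range(n)
        ]
        for dest, origin, block in sends:
            buffers[dest][origin] = block
    return buffers


result = ring_allgather(["a", "b", "c", "d"])
```

After n - 1 steps every simulated process holds all n blocks, so each process both sends and receives n - 1 messages. That dense traffic is what makes the operation a natural candidate for offloading.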
Summary form only given. Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from the host CPU. We explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we have demonstrated that much of the communication processing can be greatly simplified with this collective protocol. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high performance interconnects, Quadrics and Myrinet. Our evaluation shows that, over a Quadrics cluster of 8 nodes with an Elan3 network, the NIC-based barrier operation achieves a barrier latency of only 5.60μs, an improvement by a factor of 2.48 over the Elanlib tree-based barrier operation. Over a Myrinet cluster of 8 nodes with LANai-XP NICs, a barrier latency of 14.20μs is achieved, an improvement by a factor of 2.64 over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13μs latency over Quadrics and with 38.94μs latency over Myrinet. These results indicate the potential for developing high performance communication subsystems for next generation clusters.