As the size of High Performance Computing clusters grows, so does the probability of interconnect hot spots that degrade the latency and effective bandwidth the network provides. This paper presents a solution to this scalability problem for real-life fat-tree topologies with constant bisectional bandwidth. It is shown that maximal bandwidth and cut-through latency can be achieved for MPI global collective traffic. To form such a congestion-free configuration, MPI programs should use collective communication, the MPI node order should be topology aware, and the packet routing should match the MPI communication patterns. First, we show that MPI collectives can be classified into unidirectional and bidirectional shifts. Using this property, we propose a scheme for congestion-free routing of the global collectives in fully and partially populated fat trees running a single job. The no-contention result is then extended to multiple jobs running on the same fat tree by applying some restrictions on job size and placement. Simulation results for the proposed routing, MPI node order and communication patterns show no contention, which yields a 40% throughput improvement over previously published results for all-to-all collectives. (C) 2012 Elsevier Inc. All rights reserved.
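The shift classification mentioned above can be illustrated with a minimal sketch. The abstract does not define the two classes, so the code rests on assumptions: a unidirectional shift is taken to be a stage in which rank i sends to rank (i + s) mod N, and a bidirectional stage is illustrated with the recursive-doubling exchange i <-> i XOR 2^k. The function names are hypothetical and are not taken from the paper.

```python
def shift_stage(num_ranks, s):
    """Destinations for one unidirectional shift: rank i sends to (i + s) mod N."""
    return [(rank + s) % num_ranks for rank in range(num_ranks)]


def exchange_stage(num_ranks, k):
    """Destinations for one pairwise exchange: rank i swaps with i XOR 2**k.

    Assumes num_ranks is a power of two (recursive-doubling style).
    """
    return [rank ^ (1 << k) for rank in range(num_ranks)]


def is_permutation(dests):
    """True if every rank receives from exactly one sender in this stage."""
    return sorted(dests) == list(range(len(dests)))


def is_bidirectional(dests):
    """True if the partner relation is symmetric: i -> j implies j -> i."""
    return all(dests[dests[rank]] == rank for rank in range(len(dests)))


def all_to_all_as_shifts(num_ranks):
    """An all-to-all expressed as N - 1 unidirectional shift stages."""
    return [shift_stage(num_ranks, s) for s in range(1, num_ranks)]


if __name__ == "__main__":
    for stage in all_to_all_as_shifts(4):
        print(stage, is_permutation(stage))
    print(exchange_stage(8, 1), is_bidirectional(exchange_stage(8, 1)))
```

Each stage is a permutation, so every rank sends once and receives once. This is the property that lets routing be matched to the traffic pattern one stage at a time. The sketch says nothing about the fat-tree routing itself.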
This article describes the IBM Blue Gene/Q interconnection network and message unit. Blue Gene/Q is the third generation in the IBM Blue Gene line of massively parallel supercomputers and can be scaled to 20 petaflops and beyond. For better application scalability and performance, Blue Gene/Q has new routing algorithms and techniques to parallelize the injection and reception of packets in the network interface.
ISBN: 9781479907298 (print)
This paper presents the underlying theory and the performance of a cluster using a new 2-hop network topology. This topology is constructed using a symmetric equation and Singer Difference Sets and is called SymSig. The degree of connections at each node with SymSig is about half that of previous methods using Singer Difference Sets. A comparison with a cluster of Clos topology shows significant advantages. The worst-case congestion in the SymSig topology for unicast permutation is 2, whereas in Clos it is proportional to the radix of the building-block switches used. For SymSig, the number of switches required is smaller by about 25%, the size of the cluster is larger by about 15% and the worst-case bandwidth is better by about 50%. These advantages are retained for peta- and exascale systems. Its performance on a set of collectives such as exchange-all, shift-all, broadcast-all and all-to-all send/receive shows improvements ranging from 39% to 83%. Its performance on the molecular dynamics application GROMACS shows an improvement of up to 33%. This network is particularly suitable for applications that require global all-to-all communications. The low latency of this network makes it scalable and an attractive alternative for building peta- and exascale systems.
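To show why difference sets lead to 2-hop networks, here is a minimal sketch of the plain construction. It is an assumption made for illustration and is not the SymSig construction: the abstract does not give the symmetric equation, so the degree reduction is not reproduced. Node i is linked to i + d and i - d (mod n) for each d in a perfect difference set D. Every nonzero residue is a difference of two elements of D, so any two nodes are at most two hops apart. The sets {0, 1, 3} mod 7 and {0, 1, 3, 9} mod 13 are standard small examples, and the function names are hypothetical.

```python
from collections import deque
from itertools import permutations


def is_perfect_difference_set(dset, modulus):
    """True if each nonzero residue occurs exactly once as a difference a - b."""
    diffs = sorted((a - b) % modulus for a, b in permutations(dset, 2))
    return diffs == list(range(1, modulus))


def neighbours(node, dset, modulus):
    """Undirected links of the plain construction: node +/- d for d in dset."""
    linked = {(node + d) % modulus for d in dset}
    linked |= {(node - d) % modulus for d in dset}
    return linked - {node}


def max_hops(dset, modulus):
    """Largest hop count from node 0, by breadth-first search.

    Node 0 is representative because the graph looks the same from every node.
    """
    dist = {0: 0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(node, dset, modulus):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    if len(dist) != modulus:
        raise ValueError("graph is not connected")
    return max(dist.values())


if __name__ == "__main__":
    for dset, modulus in (([0, 1, 3], 7), ([0, 1, 3, 9], 13)):
        print(modulus, is_perfect_difference_set(dset, modulus), max_hops(dset, modulus))
```

In this plain form each node has up to 2(|D| - 1) links. The abstract states that SymSig needs about half the degree of earlier difference-set methods, which this sketch does not attempt to model.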