We present a complete classification of the deterministic distributed time complexity for a family of graph problems: binary labeling problems in trees in the usual LOCAL model of distributedcomputing. These are loca...
详细信息
ISBN:
(纸本)9781450375825
We present a complete classification of the deterministic distributed time complexity for a family of graph problems: binary labeling problems in trees in the usual LOCAL model of distributedcomputing. These are locally checkable problems that can be encoded with an alphabet of size two in the edge labeling formalism. Examples of binary labeling problems include sinkless orientation, sinkless and sourceless orientation, 2-vertex coloring, and perfect matching. We show that the complexity of any such problem is in one of the following classes: O(1), Theta(log n), Theta(n), or unsolvable. Furthermore, given the description of any binary labeling problem, we can easily determine in which of the four classes it is and what is an asymptotically optimal algorithm for solving it.
We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (Apsp) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matri...
详细信息
ISBN:
(纸本)9781450382175
We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (Apsp) for GPU accelerated clusters. The Floyd-Warshall algorithm due to its structural similarities to matrix-multiplication is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and addressing limited GPU memory. To reduce high communication costs, we redesign the parallel Floyd-Warshall (a) to expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) tailored MPI-collective. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to GPU on-demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel Floyd-Warshall implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS/sec on 256 nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time.
The problem of coloring the edges of an n-node graph of maximum degree. with 2 Delta - 1 colors is one of the key symmetry breaking problems in the area of distributed graph algorithms. While there has been a lot of p...
详细信息
ISBN:
(纸本)9781450375825
The problem of coloring the edges of an n-node graph of maximum degree. with 2 Delta - 1 colors is one of the key symmetry breaking problems in the area of distributed graph algorithms. While there has been a lot of progress towards the understanding of this problem, the dependency of the running time on. has been a long-standing open question. Very recently, Kuhn [SODA '20] showed that the problem can be solved in time 2O(root log Delta) + O(log* n). In this paper, we study the edge coloring problem in the distributed LOCAL model. We show that the (degree + 1)-list edge coloring problem, and thus also the (2 Delta- 1)-edge coloring problem, can be solved deterministically in time 2(O(log2 log Delta))+ O(log* n). This is a significant improvement over the result of Kuhn [SODA '20].
High-energy physics faces unprecedented computing challenges in preparation for the 'high-luminosity' phase of the Large Hadron Collider, which will be known as the HL-LHC. The complexity of particle-collision...
详细信息
ISBN:
(纸本)9781450382175
High-energy physics faces unprecedented computing challenges in preparation for the 'high-luminosity' phase of the Large Hadron Collider, which will be known as the HL-LHC. The complexity of particle-collision events will increase, together with the data collection rate, substantially outstripping the gains expected from technology evolution. The LHC experiments, through the Worldwide LHC computing Grid (WLCG), operate a distributedcomputing infrastructure at about 170 sites over more than 40 countries. This infrastructure has successfully exploited the exabyte of data collected and processed during the first 10 years of the program. During the HL-LHC regime, each experiment will collect an exabyte of data annually and additional computing resources will be needed. The efficient use of HPC facilities may be an important opportunity to address the anticipated resource gap. In this talk, I will discuss the future computing needs in high-energy physics and how these can be met combining our dedicated distributedcomputing infrastructure with large-scale HPC sites. As a community, we have identified common challenges for integrating these large facilities into our computing ecosystem. I will also discuss the current progress in addressing those challenges, focusing on software development for heterogeneous architectures, data management at scale, supporting services and opportunities for collaboration.
All-reduce is the crucial communication primitive to reduce model parameters in distributed Deep Neural Networks (DNN) training. Most existing all-reduce algorithms are designed for traditional electrical interconnect...
详细信息
ISBN:
(纸本)9798400700156
All-reduce is the crucial communication primitive to reduce model parameters in distributed Deep Neural Networks (DNN) training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements for distributed training of large DNNs due to the low data bandwidth of the electrical interconnect systems. One of the promising alternatives for electrical interconnect is optical interconnect, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing all-reduce operation in optical interconnect systems. WRHT can take advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. Simulations using real DNN models show that, compared to all-reduce algorithms in the electrical and optical network systems, our approach reduces communication time by 75.76% and 91.86%, respectively.
distributed quantum computing (DQC) is a promising approach to extending the computational power of near-term quantum hardware. However, the non-local quantum communication between quantum nodes is much more expensive...
详细信息
ISBN:
(纸本)9781665462723
distributed quantum computing (DQC) is a promising approach to extending the computational power of near-term quantum hardware. However, the non-local quantum communication between quantum nodes is much more expensive and error-prone than the local quantum operation within each quantum device. Previous DQC compilers focus on optimizing the implementation of each non-local gate and adopt similar compilation designs to single-node quantum compilers. The communication patterns in distributed quantum programs remain unexplored, leading to a far-from-optimal communication cost. In this paper, we identify burst communication, a specific qubit-node communication pattern that widely exists in various distributed quantum programs and can be leveraged to guide communication overhead optimization. We then propose AutoComm, an automatic compiler framework to extract burst communication patterns from input programs and then optimize the communication steps of burst communication discovered. Compared to state-of-the-art DQC compilers, experimental results show that our proposed AutoComm can reduce the communication resource consumption and the program latency by 72.9% and 69.2% on average, respectively.
We give efficient randomized and deterministic distributed algorithms for computing a distance-2 vertex coloring of a graph G in the CONGEST model. In particular, if Delta is the maximum degree of G, we show that ther...
详细信息
ISBN:
(纸本)9781450375825
We give efficient randomized and deterministic distributed algorithms for computing a distance-2 vertex coloring of a graph G in the CONGEST model. In particular, if Delta is the maximum degree of G, we show that there is a randomized CONGEST model algorithm to compute a distance-2 coloring of G with Delta(2) + 1 colors in O(log Delta . log n) rounds. Further if the number of colors is slightly increased to (1 + epsilon)Delta(2) for some epsilon > 1/polylog n, we show that it is even possible to compute a distance-2 coloring deterministically in polylog n time in the CONGEST model. Finally, we give a O(Delta(2) + log* n)-round deterministic CONGEST algorithm to compute distance-2 coloring with Delta(2) + 1 colors.
暂无评论