ISBN: 9798350337662 (print)
This paper investigates the multi-valued fault-tolerant distributed consensus problem that pursues exact output. To this end, we study voting validity, which requires the consensus output of non-faulty nodes to be the exact plurality of the inputs of non-faulty nodes. Considering a specific distribution of non-faulty votes, we first give impossibility results and a tight lower bound on the system tolerance needed to achieve agreement, termination, and voting validity. We then propose a practical consensus algorithm that satisfies voting validity in the Byzantine fault model. To ensure the exactness of outputs under any non-faulty vote distribution, we further propose safety-critical tolerance and a corresponding protocol that prioritizes voting validity over the termination property. To refine the proposed protocols, we propose an incremental threshold algorithm that accelerates protocol execution. We also optimize the consensus algorithms with the local broadcast model to enhance their fault tolerance.
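The voting-validity requirement (the output equals the exact plurality of the non-faulty inputs) can be illustrated with a minimal plurality decision rule. The function name and the margin threshold below are hypothetical illustrations, not the paper's protocol:

```python
from collections import Counter

def plurality_decision(votes, f):
    """Return the plurality value among received votes, or None when the
    margin over the runner-up is too small to tolerate f Byzantine votes.
    Illustrative threshold only, not the paper's algorithm."""
    ranked = Counter(votes).most_common()
    top_value, top_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # Up to f fabricated votes may inflate the leader and another f may be
    # cast for the runner-up, so a margin of more than 2*f is required for
    # the observed plurality to match the non-faulty plurality.
    if top_count - runner_up > 2 * f:
        return top_value
    return None
```

With f = 1, five votes for `a` against one for `b` decide `a`, while a 3-to-2 split yields no safe decision.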
ISBN: 9798350337662 (print)
Incremental quantum circuit simulation has emerged as an important tool for simulation-driven quantum applications, such as circuit synthesis, verification, and analysis. When a small portion of the circuit is modified, the simulator must incrementally update state amplitudes for reasonable turnaround time and productivity. However, this type of incrementality has been largely ignored by existing research. To fill this gap, we introduce a new incremental quantum circuit simulator called qTask. qTask leverages a task-parallel decomposition strategy to explore both inter- and intra-gate operation parallelisms from partitioned data blocks. Our partitioning strategy effectively narrows down incremental update to a small set of partitions affected by circuit modifiers. We have demonstrated the promising performance of qTask on QASMBench benchmarks. Compared to two state-of-the-art simulators, Qulacs and Qiskit, qTask is respectively 1.46x and 1.71x faster for full simulation and 5.77x and 9.76x faster for incremental simulation.
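qTask's partition-based incrementality can be sketched as dependency propagation over data blocks: only partitions reachable from the modified gates need recomputation. The block mapping and propagation rule below are simplified assumptions, not qTask's actual task-graph scheme:

```python
def affected_partitions(circuit, modified_gates, gate_blocks):
    """Given a gate list, the indices of modified gates, and a map from
    gate index to the set of data-block ids it touches, return the blocks
    whose amplitudes must be recomputed: blocks of the modified gates,
    plus blocks of any later gate overlapping the affected set.
    A simplified propagation rule for illustration."""
    if not modified_gates:
        return set()
    start = min(modified_gates)
    affected = set()
    for g in sorted(modified_gates):
        affected |= gate_blocks[g]
    # Later gates that touch an affected block spread the update forward.
    for g in range(start + 1, len(circuit)):
        if gate_blocks[g] & affected:
            affected |= gate_blocks[g]
    return affected
```

Modifying only the first gate of a three-gate circuit then touches just the blocks shared along its downstream path, leaving independent blocks untouched.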
ISBN: 9798350311990 (print)
Peer-to-peer networks, such as IPFS, adopt the distributed hash table (DHT) to efficiently find contents. In particular, Kademlia, which is used in IPFS, requires parameters governing the lookup concurrency, the number of next hops, and the k-bucket size. However, these values are set manually, so the configuration is not optimal for minimizing network latency under changing network dynamics. In this paper, we present a method for automatically deriving the optimal lookup parameters for KadRTT, a modified version of Kademlia that improves lookup latency. We derive the optimal values for the k-bucket size, lookup concurrency, and number of next hops from the lookup message arrival rate, initial ID distance, and lookup iteration count. Through experimental comparisons in both simulation and emulation, we show that our proposal reduces the lookup latency and overlay hop count.
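Kademlia lookups rank peers by XOR distance to the target ID; a minimal sketch of that metric and of selecting the k closest known nodes (the quantity bounded by the k-bucket size), independent of the paper's KadRTT modifications:

```python
def xor_distance(a, b):
    """Kademlia's distance metric: the XOR of two node/content IDs,
    interpreted as an integer."""
    return a ^ b

def closest_nodes(target_id, known_ids, k):
    """Return the k known IDs closest to target_id under the XOR metric,
    as a node does when choosing the next lookup hops."""
    return sorted(known_ids, key=lambda n: xor_distance(n, target_id))[:k]
```

For target 10 and peers {8, 2, 11, 7}, peer 11 (distance 1) and peer 8 (distance 2) are the two closest, even though 8 and 11 are numerically on opposite sides of 10.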
ISBN: 9798350311990 (print)
In this paper, we propose a new high-performance computing approach to road traffic simulation. Multi-agent road traffic simulators are the most accurate and realistic that currently exist; however, they are very resource intensive, and the data are massive when simulating a metropolis or a region. The main contribution of this paper is a new concept based on the Unite and Conquer approach, which sets up several parallel executions with different parameters. Applied to MATSim, one of the most prominent multi-agent traffic simulators in the literature, it reduces the number of iterations needed to converge to the optimal solution. In this paper, we show how the Unite and Conquer approach applied to MATSim brings a substantial gain in computing time, and we point out its potential application to other multi-agent simulators.
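The Unite and Conquer idea, several instances exploring with different parameters and periodically uniting on the best state found, can be sketched on a toy optimization problem. The objective, parameter sets, and loop below are hypothetical stand-ins, not MATSim's co-evolutionary iteration:

```python
import random

def unite_and_conquer(objective, param_sets, rounds, steps_per_round, seed=0):
    """Toy Unite-and-Conquer loop: several hill-climbing searches run with
    different step sizes; after each round the best state found is shared,
    and every instance restarts from it in the next round."""
    rng = random.Random(seed)
    best_x, best_val = 0.0, objective(0.0)
    for _ in range(rounds):
        results = []
        for step in param_sets:  # each instance could run in parallel
            x = best_x
            for _ in range(steps_per_round):
                cand = x + rng.uniform(-step, step)
                if objective(cand) < objective(x):
                    x = cand
            results.append(x)
        round_best = min(results, key=objective)  # "unite" on the winner
        if objective(round_best) < best_val:
            best_x, best_val = round_best, objective(round_best)
    return best_x
```

The uniting step is what shortens convergence: coarse instances find the right region quickly, and fine-grained instances refine it, mirroring how differently parameterized simulator runs can jointly reach a good plan faster.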
ISBN: 9798350311990 (print)
The Vertically Integrated Projects (VIP) Program at Georgia Tech provides a multidisciplinary research experience aimed at engaging undergraduate and graduate research students in large-scale computing research projects. Since 2019, the Future Computing with the Rogues Gallery VIP course has engaged over 75 students in research on topics related to novel architectures and "post-Moore" computing platforms built around quantum, neuromorphic, near-memory, and reconfigurable computing. One of the key takeaways from this course for the course designers has been the correlation between these novel computing platforms and traditional skills, techniques, and tools that are used in the HPC and parallel computing arenas. We discuss these parallels as well as the impacts of this course on general student success and research outcomes.
ISBN: 9798350337662 (print)
Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e., setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning, namely data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D, and 46% over Sputnik, a sparse matrix computation baseline.
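The sparse subnetworks the approach exploits typically come from magnitude pruning; a toy sketch of global magnitude pruning and of the message-size saving when only nonzeros (value plus index) are communicated. The accounting below is a simplification, not AxoNN's implementation:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (global
    magnitude pruning), a common way to obtain sparse subnetworks."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

def sparse_message_size(weights):
    """Entries needed to communicate only the nonzeros, one value plus
    one index each: the source of the communication savings."""
    nnz = sum(1 for w in weights if w != 0.0)
    return 2 * nnz
```

At 80-90% sparsity, 2 entries per nonzero is far smaller than one entry per dense parameter, which is why both gradient traffic and activation memory shrink.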
ISBN: 9798350320107 (print)
As the availability of SAR images continues to grow, efficient coregistration of massive SAR images presents a greater challenge. Traditional serial coregistration methods impose an unbearable time overhead. To reduce this overhead and make full use of computing resources, a parallel coregistration strategy based on Hadoop is proposed for SAR images. The Hadoop Distributed File System (HDFS) is used to store SAR image data in chunks, and Hadoop's distributed computing framework MapReduce is used to realize distributed parallel processing of SAR images. Two distributed parallel coregistration methods are presented with the proposed parallel strategy: one based on the maximum correlation method and the other on the DEM-assisted coregistration method. These methods are evaluated through coregistration experiments on the same dataset and verified by comparing the coregistration results and processing time.
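The map/reduce split the strategy relies on can be sketched in plain Python; the function names are hypothetical stand-ins for Hadoop mappers reading HDFS blocks and a reducer reassembling the results:

```python
def map_coregister(chunks, coregister_fn):
    """Map phase: each (chunk_id, data) pair is coregistered
    independently, as a mapper would process one HDFS block."""
    return [(chunk_id, coregister_fn(data)) for chunk_id, data in chunks]

def reduce_assemble(mapped):
    """Reduce phase: gather per-chunk results back into chunk order."""
    return [result for _, result in sorted(mapped)]
```

Because chunks carry their ids through the map phase, the reducer can restore image order no matter which worker finished first.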
ISBN: 9798350337662 (print)
A random projection tree that partitions data points by projecting them onto random vectors is widely used for approximate nearest neighbor search in high-dimensional space. We consider a particular case of random projection trees for constructing a k-nearest neighbor graph (KNNG) from high-dimensional data. We develop a distributed-memory Random Projection Tree (DRPT) algorithm for constructing sparse random projection trees and then running a query on the forest to create the KNN graph. DRPT uses sparse matrix operations and a communication reduction scheme to scale KNN graph construction to thousands of processes on a supercomputer. The accuracy of DRPT is comparable to state-of-the-art methods for approximate nearest neighbor search, while it runs two orders of magnitude faster than its peers. DRPT is available at https://***/HipGraph/DRPT.
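The core operation of a random projection tree, splitting points at the median of their projections onto a random direction, can be sketched as follows (a pure-Python illustration of the recursion step, not DRPT's sparse-matrix, distributed-memory implementation):

```python
import random

def rp_tree_split(points, rng):
    """One random-projection split: project points onto a random unit
    direction and divide at the median projection. A random projection
    tree applies this recursively to each half."""
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(c * c for c in direction) ** 0.5
    direction = [c / norm for c in direction]
    proj = [sum(p[i] * direction[i] for i in range(dim)) for p in points]
    median = sorted(proj)[len(proj) // 2]
    left = [p for p, s in zip(points, proj) if s <= median]
    right = [p for p, s in zip(points, proj) if s > median]
    return left, right
```

Splitting at the median keeps the tree balanced, so nearby points in high-dimensional space usually end up in the same small leaf, where exact neighbor search is cheap.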
ISBN: 9798350311990 (print)
We present an iterative breadth-first approach to maximum clique enumeration on the GPU. The memory required to store all of the intermediate clique candidates poses a significant challenge. To mitigate this issue, we employ a variety of strategies to prune away non-maximum candidates and present a thorough examination of the performance and memory benefits of each of these options. We also explore a windowing strategy as a middle ground between breadth-first and depth-first approaches, and investigate the resulting tradeoff between parallel efficiency and memory usage. Our results demonstrate that when we are able to manage the memory requirements, our approach achieves high throughput for large graphs, indicating that this approach is a good choice for GPU performance. We demonstrate an average speedup of 1.9x over previous parallel work, and obtain our best performance on graphs with low average degree.
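The breadth-first level structure, where level k holds all k-cliques and each is extended by vertices adjacent to every member, can be sketched sequentially. This toy version shows why intermediate-candidate memory grows level by level; the paper's GPU pruning and windowing strategies are not modeled:

```python
def max_cliques_bfs(adj):
    """Breadth-first clique enumeration: level k holds all k-cliques as
    frozensets (deduplicated); each is extended by every vertex adjacent
    to all its members. Returns the cliques of maximum size."""
    n = len(adj)
    level = [frozenset([v]) for v in range(n)]
    best = level
    while True:
        nxt = set()  # the memory-hungry candidate frontier
        for clique in level:
            for v in range(n):
                if v not in clique and all(v in adj[u] for u in clique):
                    nxt.add(clique | {v})
        if not nxt:
            return [set(c) for c in best]
        level = list(nxt)
        best = level
```

Storing the whole frontier in `nxt` is exactly the memory pressure the paper's pruning targets: each level can be far larger than the final answer.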
ISBN: 9798350337662 (print)
We present GraphTensor, a comprehensive open-source framework that supports efficient parallel neural network processing on large graphs. GraphTensor offers a set of easy-to-use programming primitives that appreciate both graph and neural network execution behaviors from the beginning (graph sampling) to the end (dense data processing). Our framework runs diverse graph neural network (GNN) models in a destination-centric, feature-wise manner, which can significantly shorten training execution times on a GPU. In addition, GraphTensor rearranges multiple GNN kernels based on their system hyperparameters in a self-governing manner, thereby reducing the processing dimensionality and the latencies further. From the end-to-end execution viewpoint, GraphTensor significantly shortens the service-level GNN latency by applying pipeline parallelism for efficient graph dataset preprocessing. Our evaluation shows that GraphTensor exhibits 1.4x better training performance than emerging GNN frameworks under the execution of large-scale, real-world graph workloads. For the end-to-end services, GraphTensor reduces training latencies of an advanced version of the GNN frameworks (optimized for multi-threaded graph sampling) by 2.4x, on average.
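Destination-centric, feature-wise aggregation, where each destination vertex accumulates its in-neighbors' feature vectors, can be sketched as follows (a simplified scalar-loop stand-in for the dense GNN kernel, not GraphTensor's primitives):

```python
def destination_centric_aggregate(edges, features):
    """Destination-centric neighbor aggregation: for each (src, dst)
    edge, add the source's feature vector into the destination's
    accumulator. This sum is the core of one GNN layer before the
    learned transformation is applied."""
    n = len(features)
    d = len(features[0])
    out = [[0.0] * d for _ in range(n)]
    for src, dst in edges:
        for i in range(d):
            out[dst][i] += features[src][i]
    return out
```

Organizing the loop around destinations means each output row is owned by one vertex, which is what makes the feature-wise inner loop easy to parallelize without write conflicts.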