ISBN:
(Print) 9798350307924
In the era of exascale computing, the adoption of large numbers of CPU cores and nodes by high-performance computing (HPC) applications has made MPI collective performance increasingly crucial. As the number of cores and nodes increases, the importance of optimizing MPI collective performance becomes more evident. Current collective algorithms, including kernel-assisted inter-process data exchange techniques and data-sharing-based shared-memory approaches, are prone to significant performance degradation due to the overhead of system calls and page faults or the cost of extra data-copy latency. These issues can negatively impact the efficiency and scalability of HPC applications. To address them, we propose PiP-MColl, a Process-in-Process-based Multi-object Interprocess MPI Collective design that maximizes small-message MPI collective performance at scale. We also present designs that boost performance for larger messages, yielding a comprehensive improvement across a range of message sizes beyond small messages. PiP-MColl features efficient multiple-sender and multiple-receiver collective algorithms and leverages Process-in-Process shared-memory techniques to eliminate unnecessary system call and page fault overhead and extra data copies, which results in improved intra- and inter-node message rate and throughput. Experimental results demonstrate that PiP-MColl significantly outperforms popular MPI libraries, including OpenMPI, MVAPICH2, and Intel MPI, by up to 4.6X for the MPI collectives MPI_Scatter, MPI_Allgather, and MPI_Allreduce.
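The copy-avoiding idea behind shared-memory collectives like the one above can be illustrated with a toy model: every "rank" writes its contribution into a disjoint slot of a single shared buffer, and the reduction is done in place out of that buffer, with no kernel-assisted inter-process copies. This is a minimal illustrative sketch, not the PiP-MColl algorithm; real ranks would be separate processes mapped into one address space, and the slot layout is an assumption.

```python
# Toy "shared-memory allreduce" (sum): disjoint slots, in-place reduce.
import struct

NRANKS, COUNT = 4, 8                      # "processes" and doubles per rank
shared = bytearray(NRANKS * COUNT * 8)    # stands in for a shared segment

def contribute(rank: int) -> None:
    """Rank writes its vector into its own slot (disjoint, so no lock)."""
    struct.pack_into(f"{COUNT}d", shared, rank * COUNT * 8,
                     *([float(rank + 1)] * COUNT))

def reduce_in_place() -> tuple:
    """Leader sums all slots directly out of the shared buffer."""
    total = [0.0] * COUNT
    for r in range(NRANKS):
        vals = struct.unpack_from(f"{COUNT}d", shared, r * COUNT * 8)
        total = [a + b for a, b in zip(total, vals)]
    struct.pack_into(f"{COUNT}d", shared, 0, *total)   # result in slot 0
    return tuple(total)

for r in range(NRANKS):
    contribute(r)
result = reduce_in_place()
print(result[0])   # 1 + 2 + 3 + 4 = 10.0
```

In a real design the buffer lives in memory visible to all ranks, so the reduce reads peer data directly instead of receiving copies of it.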
Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to solve the appropriate systems of d...
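The finite-differencing methods mentioned above replace derivatives with stencil sums over neighboring grid points. As a minimal, purely illustrative example (not from the paper), here is an explicit Euler step of the 1D heat equation u_t = u_xx on a uniform grid, i.e. a single "block" of a block-structured mesh, using the standard 3-point second-derivative stencil:

```python
# One explicit step of u_t = u_xx with the 3-point stencil
#   u_xx ~ (u[i-1] - 2 u[i] + u[i+1]) / dx^2 ; boundaries held fixed.
def heat_step(u, dt, dx):
    r = dt / dx**2                         # must satisfy r <= 0.5 for stability
    return [u[0]] + [
        u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

u = [0.0] * 5 + [1.0] + [0.0] * 5          # initial spike at the center
for _ in range(50):
    u = heat_step(u, dt=0.1, dx=1.0)       # r = 0.1: stable, monotone
print(round(max(u), 4))                    # peak has diffused: 0 < max(u) < 1
```

Block-structured codes apply exactly this kind of update block by block, exchanging ghost cells at block boundaries.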
ISBN:
(Print) 9783030967727; 9783030967710
High performance computing (HPC) systems have become highly parallel aggregations of heterogeneous system elements. Different kinds of processors, memory regions, interconnects, and software resources constitute the modern HPC computing platform. This makes software development and efficient program execution a challenging task. Previously, we developed a platform description framework for describing multiple aspects of computing platforms. It enables tools and users to better cope with the complexities of heterogeneous platforms in a programming-model- and system-independent way. In this paper we present how our platform model can be used to describe program implementation variants that utilize different parallel programming models. We show that by matching platform models of program implementations to descriptions of a concrete heterogeneous system we can increase overall resource utilization. In addition, we show that our model featuring control relationships brings significant performance gains for finding platform patterns within a commonly used heterogeneous compute cluster configuration.
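The matching of implementation variants to a platform description can be sketched as a feasibility check plus a scoring rule. The attribute names ("cores", "gpus") and the utilization score below are illustrative assumptions, not the paper's actual platform model:

```python
# Pick the feasible implementation variant with the highest utilization.
platform = {"cores": 64, "gpus": 4, "numa_domains": 2}

variants = [
    {"name": "openmp",   "needs": {"cores": 16}},
    {"name": "cuda",     "needs": {"cores": 2, "gpus": 1}},
    {"name": "mpi+cuda", "needs": {"cores": 8, "gpus": 4}},
]

def feasible(v):
    # Every requested resource class must exist in sufficient quantity.
    return all(platform.get(k, 0) >= n for k, n in v["needs"].items())

def utilization(v):
    # Fraction of each requested resource class used, averaged.
    return sum(n / platform[k] for k, n in v["needs"].items()) / len(v["needs"])

best = max((v for v in variants if feasible(v)), key=utilization)
print(best["name"])   # mpi+cuda: the only variant using all 4 GPUs
```

A real platform model would also encode memory regions, interconnect topology, and the control relationships the paper describes.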
There is a great requirement for cryptographic systems to secure shared or transmitted data over the internet. Periodic modification of the currently used cryptographic scheme to encrypt the data is also suggested to ...
ISBN:
(Print) 9781665494236
Graph Neural Networks (GNNs) have emerged as a popular toolbox for solving complex problems on graph data structures. Graph neural networks use machine learning techniques to learn vector representations of nodes and/or edges. Learning these representations demands a huge amount of memory and computing power. Traditional shared-memory multiprocessors are insufficient to meet the computing requirements of real-world data; hence, research has gained momentum toward distributed GNNs. Scaling distributed GNNs poses the following challenges: (1) the input graph needs to be efficiently partitioned, (2) the cost of communication between compute nodes should be reduced, and (3) the sampling strategy should be chosen carefully to minimize the loss in accuracy. To address these challenges, we propose a joint partitioning and sampling algorithm, which partitions the input graph with weighted METIS and uses a biased sampling strategy to minimize total communication costs. We implemented our approach using the DistDGL framework and evaluated it using several real-world datasets. We observe that our approach (1) shows an average reduction in communication overhead of 53%, (2) requires less partitioning time to partition a graph, (3) shows improved accuracy, and (4) achieves a speedup of 1.5x on the OGB-Arxiv dataset, when compared to the state-of-the-art DistDGL implementation.
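The "biased sampling" idea can be sketched in a few lines: when sampling a node's neighbors, weight same-partition neighbors higher so that fewer sampled edges cross partitions, which is what drives communication. The bias weight and the tiny graph below are illustrative assumptions; the paper's exact scheme and its METIS weighting differ.

```python
# Partition-aware neighbor sampling: prefer local-partition neighbors.
import random
random.seed(0)

adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}          # a 2-way partition

def sample_neighbors(u, k, local_bias=4.0):
    nbrs = adj[u]
    # Same-partition neighbors get a higher sampling weight.
    w = [local_bias if part[v] == part[u] else 1.0 for v in nbrs]
    return random.choices(nbrs, weights=w, k=k)

samples = [v for _ in range(1000) for v in sample_neighbors(0, 2)]
local = sum(part[v] == part[0] for v in samples) / len(samples)
print(f"fraction of local samples: {local:.2f}")   # well above the unbiased 0.5
```

Node 0 has two local and two remote neighbors, so unbiased sampling crosses the partition half the time; the bias pushes the local fraction toward 4/(4+1) = 0.8.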
ISBN:
(Print) 9781665494236
Deep Learning (DL) has become a prominent machine learning technique due to the availability of efficient computational resources in the form of Graphics Processing Units (GPUs), large-scale datasets, and a variety of models. The newer generation of GPUs is being designed with special emphasis on optimizing performance for DL applications. Also, the availability of easy-to-use DL frameworks like PyTorch and TensorFlow has enhanced the productivity of domain experts working on custom DL applications from diverse domains. However, existing Deep Neural Network (DNN) training approaches may not fully utilize newly emerging powerful GPUs like the NVIDIA A100; this is the primary issue we address in this paper. Our motivating analyses show that GPU utilization on the NVIDIA A100 can be as low as 43% using traditional DNN training approaches for small-to-medium DL models and input data sizes. This paper proposes AccDP, a data-parallel distributed DNN training approach, to accelerate GPU-based DL applications. AccDP exploits the Message Passing Interface (MPI) communication library coupled with NVIDIA's Multi-Process Service (MPS) to increase the amount of work assigned to parallel GPUs, resulting in higher utilization of compute resources. We evaluate our proposed design on different small-to-medium DL models and input sizes on state-of-the-art HPC clusters. By injecting more parallelism into DNN training using our approach, the evaluation shows up to 58% improvement in training performance on a single GPU and up to 62% on 16 GPUs compared to regular DNN training. Furthermore, we conduct an in-depth characterization to determine the impact of several DNN training factors and best practices, including the batch size and the number of data-loading workers, on optimally utilizing GPU devices. To the best of our knowledge, this is the first work that explores the use of MPS and MPI to maximize the utilization of GPUs in distributed DNN training.
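The data-parallel pattern that AccDP builds on can be reduced to its essence: each worker computes gradients on its shard of the batch, and the gradients are averaged (an allreduce) before the update. In AccDP the workers are MPI processes co-scheduled on one GPU via MPS; in this conceptual sketch they are plain Python functions and the "model" is a single weight w in y = w*x. Everything here is illustrative, not the paper's implementation.

```python
# Data-parallel gradient averaging on a toy linear model y = w*x.
def grad(w, shard):
    # d/dw of the mean squared error 0.5*(w*x - y)^2 over the shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

data = [(x, 3.0 * x) for x in range(1, 9)]       # ground truth: w = 3
workers = [data[0:4], data[4:8]]                 # batch split across 2 workers

w = 0.0
for _ in range(200):
    # Each worker's gradient, then the "allreduce" average.
    g = sum(grad(w, shard) for shard in workers) / len(workers)
    w -= 0.01 * g                                # SGD update, same on all workers
print(round(w, 3))   # converges to 3.0
```

The paper's contribution is in how many such workers share a GPU, not in this textbook update rule.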
ISBN:
(Print) 9783031587337; 9783031587344
Minimizing the round complexity of byzantine broadcast is a fundamental question in distributed computing and cryptography. In this work, we present the first early stopping byzantine broadcast protocol that tolerates up to t = n - 1 malicious corruptions and terminates in O(min{f^2, t+1}) rounds for any execution with f <= t actual corruptions. Our protocol is deterministic, adaptively secure, and works assuming a plain public key infrastructure. Prior early-stopping protocols all either require honest majority or tolerate only up to t = (1 - epsilon)n malicious corruptions while requiring either trusted setup or strong number theoretic hardness assumptions. As our key contribution, we show a novel tool called a polariser that allows us to transfer certificate-based strategies from the honest majority setting to settings with a dishonest majority.
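For context, the classical baseline that early-stopping protocols improve on is Dolev-Strong broadcast, which always runs t+1 rounds regardless of how many faults actually occur. The toy simulation below shows its structure with "signatures" modeled as appended node ids (a real protocol uses a PKI, and this is not the paper's polariser-based protocol). With all nodes honest, every node extracts the sender's value in round 1 and outputs it after round t+1.

```python
# Toy Dolev-Strong broadcast: all-honest execution, t+1 rounds.
N, T = 4, 3
SENDER = 0

def run_broadcast(value):
    extracted = [set() for _ in range(N)]        # values accepted per node
    inbox = [[] for _ in range(N)]               # (value, signer chain)
    for i in range(N):
        inbox[i].append((value, (SENDER,)))      # sender "signs" and sends
    for r in range(1, T + 2):                    # rounds 1 .. t+1
        outbox = [[] for _ in range(N)]
        for i in range(N):
            for v, chain in inbox[i]:
                # Valid iff chain starts at the sender and has r distinct signers.
                if chain[0] == SENDER and len(set(chain)) == len(chain) == r \
                        and v not in extracted[i]:
                    extracted[i].add(v)
                    if i not in chain:           # append own "signature", relay
                        for j in range(N):
                            outbox[j].append((v, chain + (i,)))
        inbox = outbox
    # Output the unique extracted value, or None on ambiguity.
    return [next(iter(e)) if len(e) == 1 else None for e in extracted]

print(run_broadcast("attack"))   # every node outputs 'attack'
```

The point of early stopping is to terminate in rounds that depend on the actual fault count f rather than always paying the worst-case t+1, which is what the O(min{f^2, t+1}) bound above captures.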
ISBN:
(Print) 9781450397339
De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using matrix abstraction, we mask branches in the string graph and compute connected components to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize the load imbalance in local assembly, i.e., concatenation of sequences from a given contig. Based on the assignment obtained by partitioning, we apply an induced-subgraph function to redistribute sequences between processes, resulting in a set of local sparse matrices. Finally, we traverse each matrix using depth-first search to concatenate sequences. Our algorithm shows good scaling with parallel efficiency up to 80% on 128 nodes, resulting in uniform genome coverage and showing promising results in terms of assembly quality. Our contig generation algorithm localizes the assembly process to significantly reduce the amount of computation spent on this step. Our work is a step forward for efficient de novo long-read assembly of large genomes in distributed memory.
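Two of the steps above, masking branch vertices then grouping linear chains via connected components, followed by multiway number partitioning, can be sketched serially on a toy string graph. The paper does this with distributed sparse matrices; this pure-Python version with union-find and a greedy longest-processing-time (LPT) heuristic is illustrative only.

```python
# (1) mask branch vertices (degree > 2), group chains by connected
# components; (2) balance chains across P processes with greedy LPT.
from collections import defaultdict

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5),   # vertex 3 is a branch
         (6, 7), (7, 8)]
nverts = 9

deg = [0] * nverts
for u, v in edges:
    deg[u] += 1; deg[v] += 1
mask = {v for v in range(nverts) if deg[v] > 2}    # branch vertices

parent = list(range(nverts))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]; x = parent[x]
    return x
for u, v in edges:
    if u not in mask and v not in mask:
        parent[find(u)] = find(v)                  # union unmasked endpoints

chains = defaultdict(list)
for v in range(nverts):
    if v not in mask:
        chains[find(v)].append(v)
contigs = sorted(chains.values(), key=len, reverse=True)

P = 2
loads, assign = [0] * P, [[] for _ in range(P)]
for c in contigs:                                  # LPT: largest first,
    p = loads.index(min(loads))                    # to the lightest process
    assign[p].append(c); loads[p] += len(c)
print(contigs, loads)
```

Masking vertex 3 splits the graph into two 3-vertex chains and two singletons, which LPT balances to equal loads of 4 per process.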
ISBN:
(Print) 9781665481557
In this paper, we propose a distributed-reservoir-computing-based parallel nonlinear equalization for 100 Gb/s vertical cavity surface emitting laser (VCSEL) enabled optical interconnects. The equalization performance of the proposed equalizer is compared with neural-network and Volterra-series-based equalizers; similar performance can be achieved, but with a much simpler, lower-complexity training process. Moreover, this approach, summarized as "small reservoirs make a mickle," is a scalable network-generation solution that is promising for parallel hardware implementation.
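The reservoir update at the heart of reservoir computing is s[t+1] = tanh(W s[t] + w_in u[t]), with W and w_in fixed and random; only a linear readout is trained, which is why training is so cheap compared to a full neural network. The sketch below shows the state update and the fading-memory ("echo state") property for a tiny 3-neuron reservoir; the weights are arbitrary small values (contraction well below 1), not from the paper.

```python
# Echo-state property: two reservoirs with different initial states
# converge once driven by the same input sequence.
import math, random
random.seed(1)

N = 3
W = [[0.2 * (random.random() - 0.5) for _ in range(N)] for _ in range(N)]
w_in = [0.5, -0.3, 0.8]

def step(s, u):
    # s' = tanh(W s + w_in * u), elementwise over the N neurons.
    return [math.tanh(sum(W[i][j] * s[j] for j in range(N)) + w_in[i] * u)
            for i in range(N)]

u_seq = [math.sin(0.3 * t) for t in range(200)]
s_a, s_b = [1.0] * N, [-1.0] * N          # very different initial states
for u in u_seq:
    s_a, s_b = step(s_a, u), step(s_b, u)
gap = max(abs(a - b) for a, b in zip(s_a, s_b))
print(gap)   # ~0: the state depends on recent inputs, not the start
```

In an equalizer, the trained linear readout maps these states back to transmitted symbols; the paper's contribution is running many such small reservoirs in parallel.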
Implementing the hardware structure of a filter architecture requires area, power, and delay efficiency. Memory complexity is also important in a 2D FIR filter architecture when used for image pr...
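A direct-form 2D FIR filter applies a K x K coefficient window to every pixel; the hardware architectures discussed above trade off area, power, delay, and line-buffer memory when mapping exactly this computation. Below is a minimal pure-Python reference (zero padding at the borders), illustrative only and not tied to any specific architecture from the paper.

```python
# Direct-form 2D FIR filtering: out[i][j] = sum_{a,b} h[a][b] * img[i+a-c][j+b-c].
def fir2d(img, coeffs):
    H, W = len(img), len(img[0])
    K = len(coeffs)
    off = K // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for a in range(K):
                for b in range(K):
                    y, x = i + a - off, j + b - off
                    if 0 <= y < H and 0 <= x < W:   # zero padding
                        acc += coeffs[a][b] * img[y][x]
            out[i][j] = acc
    return out

blur = [[1 / 9] * 3 for _ in range(3)]     # 3x3 moving-average kernel
img = [[9.0] * 4 for _ in range(4)]
out = fir2d(img, blur)
print(out[1][1])   # interior pixel: fully covered window, ~9.0
```

In hardware, the inner K x K loop becomes a multiply-accumulate array, and the row accesses `img[i+a-off]` are what line buffers exist to serve, which is where the memory complexity mentioned above comes from.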