ISBN:
(Print) 9798350364613; 9798350364606
There is a growing need, for example in machine learning and analytics, to decompose applications into smaller schedulable units. Such decomposition can improve performance, reduce energy consumption, and increase resource utilization. Unfortunately, enabling fine-grained parallelism comes with significant overheads and requires improvements at all layers of the programming stack. We consider the challenges of supporting fine-grained parallelism in the increasingly popular Python-based programming libraries. Specifically, we focus on Parsl, a Python library that is widely used to parallelize the execution of fine-grained Python functions. Parsl's Python-based runtime supports a maximum throughput of around 1200 tasks per second, which is insufficient for modern application needs. We perform a comprehensive analysis of Parsl and identify the areas that prevent it from achieving higher throughput. We first profile Parsl's components and find that, with fine-grained tasks, workers are often not saturated: tasks spend the majority of their time in the components between the scheduler and the workers, while the scheduler itself is capable of submitting thousands of tasks per second. We then develop new optimizations and reimplement crucial components in C to improve throughput. Our new implementation increases Parsl's throughput six-fold.
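To make the per-task overhead concrete, below is a minimal sketch of submitting many tiny tasks through Parsl's public API and measuring end-to-end throughput. The executor configuration and task count are illustrative choices, not the paper's benchmark harness.

import time

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# Load a local high-throughput executor with default settings (illustrative).
parsl.load(Config(executors=[HighThroughputExecutor()]))

@python_app
def noop():
    # A deliberately trivial task body, so the measurement is dominated by
    # per-task scheduling and dispatch overhead rather than useful work.
    return 0

start = time.time()
futures = [noop() for _ in range(10_000)]  # 10k fine-grained tasks (arbitrary)
results = [f.result() for f in futures]
elapsed = time.time() - start
print(f"observed throughput: {len(results) / elapsed:.0f} tasks/s")

A run of this kind is what exposes the roughly 1200 tasks/s ceiling described above: the workers finish the trivial body almost instantly, so throughput is bounded by the Python components sitting between the scheduler and the workers.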
ISBN:
(Print) 9798350364613; 9798350364606
The observation of the advancing and retreating pattern of polar sea ice cover stands as a vital indicator of global warming. This research aims to develop a robust, effective, and scalable system for classifying polar sea ice as thick/snow-covered, young/thin, or open water using Sentinel-2 (S2) images. Since the S2 satellites are actively capturing high-resolution imagery of the earth's surface, a large volume of images needs to be classified. One major obstacle is the absence of labeled S2 training data (images) to act as the ground truth. We demonstrate a scalable and accurate method for segmenting and automatically labeling S2 images using carefully determined color thresholds. We employ a parallel workflow using PySpark and achieve a 9-fold data-loading and 16-fold map-reduce speedup when auto-labeling S2 images with thin-cloud- and shadow-filtered color-based segmentation to generate label data. The auto-labeled data generated by this process are then employed to train a U-Net machine learning model, resulting in good classification accuracy. As training the U-Net classification model is computationally heavy and time-consuming, we distribute the training over 8 GPUs using the Horovod framework on a DGX cluster, obtaining a 7.21x speedup without affecting the accuracy of the model. Using the Antarctic's Ross Sea region as an example, the U-Net model trained on auto-labeled data achieves a classification accuracy of 98.97% for auto-labeled training datasets when the thin clouds and shadows in the S2 images are filtered out.
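The parallel auto-labeling step can be sketched as a PySpark map over image tiles with per-pixel color thresholds. The threshold values, band handling, and tile names below are hypothetical stand-ins, not the paper's calibrated pipeline (which reads real S2 tiles and filters thin clouds and shadows first).

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s2-autolabel").getOrCreate()
sc = spark.sparkContext

THICK_ICE_MIN = 0.7   # hypothetical reflectance threshold: thick/snow-covered ice
YOUNG_ICE_MIN = 0.35  # hypothetical threshold: young/thin ice; below is open water

def label_pixels(path):
    # In practice this would load the S2 tile (e.g., with rasterio) and apply
    # cloud/shadow filtering; here a synthetic band stands in for the data.
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    band = rng.random((256, 256))
    labels = np.where(band >= THICK_ICE_MIN, 2,
             np.where(band >= YOUNG_ICE_MIN, 1, 0))
    return path, np.bincount(labels.ravel(), minlength=3).tolist()

tiles = [f"tile_{i:04d}.tif" for i in range(64)]  # placeholder tile names
counts = sc.parallelize(tiles).map(label_pixels).collect()
for path, (water, young, thick) in counts:
    print(path, {"open_water": water, "young_ice": young, "thick_ice": thick})
spark.stop()

Because each tile is labeled independently, the work distributes cleanly across executors, which is what makes the reported data-loading and map-reduce speedups attainable.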
ISBN:
(Print) 9789819628636; 9789819628292
The proceedings contain 76 papers. The special focus in this conference is on Network and Parallel Computing. The topics include: AsymFB: Accelerating LLM Training Through Asymmetric Model Parallelism; DaCP: Accelerating Synchronization-Free SpTRSV via GPU-Friendly Data Communication and Parallelism Strategies; Diagnosability of the Lexicographic Product of Paths and Complete Bipartite Graphs Under PMC Model; DTuner: A Construction-Based Optimization Method for Dynamic Tensor Operators Accelerating; Efficient Implementation of the LOBPCG Algorithm on a CPU-GPU Cluster; HP-CSF: A GPU Optimization Method for CP Decomposition of Incomplete Tensors; JediGAN: A Fully Decentralized Training of GAN with Adaptive Discriminator Averaging and Generator Selection; Optimizing Vo-Viso: A Modified Methodology to Parallel Computing with Isolating Data in Memristor Arrays; Parallel Computation of the Combination of Two Point Operations in Conic Curves Cryptosystem over GF(2^n) Using Tile Self-assembly; Parallel Construction of Independent Spanning Trees on 3-ary n-cube Networks; SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling; swDarknet: A Heterogeneous Parallel Deep Learning Framework Suitable for SW26010 Pro Processor; VConv: Autotiling Convolution Algorithm Based on MLIR for Multi-core Vector Accelerators; ACH-Code: An Efficient Erasure Code to Reduce Average Repair Cost in Cloud Storage Systems of Multiple Availability Zones; CMS: A Computility Resource Status Management and Storage Framework; Fast Memory Disaggregation with SwiftSwap; HASLB: Huge Page Allocation Strategy Optimized for Load-Balance in Parallel Computing Programs; LightFinder: Finding Persistent Items with Small Memory; miDedup: A Restore-Friendly Deduplication Method on Docker Image Storage Systems; SPLR: A Selective Packet Loss Recovery for Improved RDMA Performance; A Cluster-Based Platoon Formation Scheme for Realistic Automated Vehicle Platooning; AnaNET: Anatomical Network fo
As the sizes of datasets and neural network models increase, automatic parallelization methods for models have become a research hotspot in recent years. The existing auto-parallel methods based on machine learning or...
ISBN:
(Print) 9798350364613; 9798350364606
A totally asynchronous gradient algorithm with fixed step size is proposed for federated learning. A mathematical model is presented and a convergence result is established. The convergence result is based on the concept of a macro-iteration sequence. The interest of the contribution is in showing that the asynchronous federated learning method converges when gradients of the loss functions are updated by workers without ordering or synchronization and with possibly unbounded delays.
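For concreteness, one standard way to write such a totally asynchronous update (in the spirit of the classical Bertsekas-Tsitsiklis model) is sketched below; the symbols (fixed step size \gamma, delay maps \tau_j, update set K^k) are our notation, not necessarily the paper's.

% Worker block i at global event k updates using possibly stale iterates
% x_j^{\tau_j(k)} with \tau_j(k) \le k; the delays k - \tau_j(k) may be unbounded.
x_i^{k+1} =
\begin{cases}
  x_i^{k} - \gamma \, \nabla_i F\bigl(x_1^{\tau_1(k)}, \dots, x_n^{\tau_n(k)}\bigr), & i \in K^k, \\
  x_i^{k}, & i \notin K^k,
\end{cases}

In this classical setting, a macro iteration is a window of events within which every block has been updated at least once and information predating the window has been purged; convergence is then argued along the sequence of such windows.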
ISBN:
(Print) 9798400717932
Accurately calculating the electronic structure of strongly correlated chemical systems necessitates a detailed description of both static and dynamical electron correlations, posing a significant challenge in ab initio quantum chemistry. Although high memory and computational demands generally limit these calculations to relatively modest systems, the advanced computational capabilities of modern GPUs provide new avenues to expand these limits. However, the complex control flow inherent in the computation notably impairs performance on GPUs. Furthermore, the significant disparity in computational load across different branches leads to load imbalance, challenging large-scale simulations. In this work, we introduce PASCI, a heterogeneous parallel computing framework designed to quickly and efficiently parallelize the determinant-based computation of the dynamical correlation energy. The features of the PASCI framework include (1) a divergence-avoiding GPU algorithm, (2) a three-level load-mapping strategy that ensures load balance across processors, GPU warps, and GPU threads, (3) performance models for memory footprint and computation, and (4) seamless integration with existing quantum chemistry software. Experimental results on an NVIDIA A100 GPU demonstrate that our new GPU algorithm achieves an average 6.6x (up to 13.8x) peak performance increase and a 2-4 order-of-magnitude speedup in practical usage compared to the original GPU implementation. Moreover, PASCI exhibits excellent scalability, highlighting its potential as a powerful high-performance computing tool for complex quantum chemistry research.
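The processor-level tier of a load-mapping strategy of this kind can be illustrated with a longest-processing-time (LPT) greedy heuristic: sort work units by estimated cost and always assign the next one to the least-loaded processor. The cost values and unit granularity below are hypothetical stand-ins for PASCI's own performance model, not its actual implementation.

import heapq

def lpt_assign(costs, num_procs):
    # Min-heap of (accumulated load, processor id): the root is always the
    # least-loaded processor, so each unit lands where it balances best.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Place the most expensive units first (classic LPT ordering).
    for unit, cost in sorted(enumerate(costs), key=lambda t: -t[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(unit)
        heapq.heappush(heap, (load + cost, p))
    return assignment

# Example: 12 determinant batches with skewed costs mapped onto 4 processors.
batches = [9.0, 7.5, 7.0, 3.0, 2.5, 2.0, 1.5, 1.0, 0.8, 0.5, 0.3, 0.2]
print(lpt_assign(batches, 4))

The same balancing idea would then be repeated at finer granularity (across GPU warps, then GPU threads) to address the branch-to-branch load disparity the abstract describes.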
CephFS represents a prominent distributed file system that utilizes directory fragment migration to achieve improved runtime balance. However, its imprecise imbalance model and subtree selection algorithms can result ...
Railway infrastructure plays a vital role in modern transportation systems, facilitating the efficient movement of people and goods. However, the integrity and performance of railroad structures are subject to various...