ISBN: 9798400714436 (Print)
The proceedings contain 49 papers. The topics discussed include: Semi-StructMG: a fast and scalable semi-structured algebraic multigrid; LibRTS: a spatial indexing library by ray tracing; high-performance visual semantics compression for AI-driven science; COMPSO: optimizing gradient compression for distributed training with second-order optimizers; TurboFFT: co-designed high-performance and fault-tolerant fast Fourier transform on GPUs; Helios: efficient distributed dynamic graph sampling for online GNN inference; triangle counting on tensor cores; AC-Cache: a memory-efficient caching system for small objects via exploiting access correlations; Magneto: accelerating parallel structures in DNNs via co-optimization of operators; and FlashSparse: minimizing computation redundancy for fast sparse matrix multiplications on tensor cores.
ISBN: 9798400704352 (Print)
The proceedings contain 44 papers. The topics discussed include: FastFold: optimizing AlphaFold training and inference on GPU clusters; Liger: interleaving intra- and inter-operator parallelism for distributed large model inference; optimizing collective communications with error-bounded lossy compression for GPU clusters; OsirisBFT: say no to task replication for scalable Byzantine fault tolerant analytics; RELAX: durable data structures with swift recovery; a row decomposition-based approach for sparse matrix multiplication on GPUs; Tetris: accelerating sparse convolution by exploiting memory reuse on GPU; scaling up transactions with slower clocks; towards scalable unstructured mesh computations on shared memory many-cores; AGAThA: fast and efficient GPU acceleration of guided sequence alignment for long read mapping; and shared memory-contention-aware concurrent DNN execution for diversely heterogeneous system-on-chips.
ISBN: 9798400700156 (Print)
The proceedings contain 43 papers. The topics discussed include: provably good randomized strategies for data placement in distributed key-value stores; provably fast and space-efficient parallel biconnectivity; practically and theoretically efficient garbage collection for multiversioning; fast and scalable channels in Kotlin coroutines; high-performance GPU-to-CPU transpilation and optimization via high-level parallel constructs; lifetime-based optimization for simulating quantum circuits on a new Sunway supercomputer; Merchandiser: data placement on heterogeneous memory for task-parallel HPC applications with load-balance awareness; visibility algorithms for dynamic dependence analysis and distributed coherence; Block-STM: scaling blockchain execution by turning ordering curse to a performance blessing; TDC: towards extremely efficient CNNs on GPUs via hardware-aware Tucker decomposition; and improving energy saving of one-sided matrix decompositions on CPU-GPU heterogeneous systems.
ISBN: 9781450392044 (Print)
The proceedings contain 46 papers. The topics discussed include: stream processing with dependency-guided synchronization; Mashup: making serverless computing useful for HPC workflows via hybrid execution; parallel block-delayed sequences; near-optimal sparse Allreduce for distributed deep learning; Vapro: performance variance detection and diagnosis for production-run parallel applications; interference relation-guided SMT solving for multi-threaded program verification; extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms; scaling graph traversal to 281 trillion edges with 40 million cores; asymmetry-aware scalable locking; the performance power of software combining in persistence; and multi-queues can be state-of-the-art priority schedulers.
ISBN: 9781450382946 (Print)
The proceedings contain 48 papers. The topics discussed include: efficient algorithms for persistent transactional memory; investigating the semantics of futures in transactional memory systems; constant-time snapshots with applications to concurrent data structures; reasoning about recursive tree traversals; synthesizing optimal collective algorithms; scaling implicit parallelism via dynamic control replication; efficiently reclaiming memory in concurrent search data structures while bounding wasted memory; are dynamic memory managers on GPUs slow? a survey and benchmarks; improving communication by optimizing on-node data movement with data layout; and Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory.
ISBN: 9798400714436 (Print)
Sequence alignment is a fundamental and often time-consuming step in genomic data analysis. It typically follows the seed-and-extend paradigm, and numerous accelerator-based approaches have been proposed to optimize one of the two kernels. However, these approaches often increase hardware cost while contributing only modestly to the overall alignment process. To address this, we designed FastBWA, an optimized full pipeline that improves performance at low cost by exploiting the untapped potential of CPU computing resources. Our implementation demonstrates that FastBWA achieves up to 2.5x and 1.8x the end-to-end alignment throughput of BWA-MEM and its newer version BWA-MEM2, respectively.
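To make the seed-and-extend paradigm mentioned above concrete, the following is a minimal, didactic sketch of the two kernels: exact-match seeding via a k-mer index and greedy extension of each seed hit. It is not FastBWA's implementation (production aligners such as BWA-MEM use FM-index seeding and vectorized banded dynamic programming); all names and parameters here are illustrative.

```python
# Didactic sketch of seed-and-extend alignment (not FastBWA's actual code).

def build_kmer_index(reference: str, k: int = 11) -> dict:
    """Map every k-mer in the reference to its start positions (the seeds)."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def extend(reference: str, read: str, ref_pos: int, read_pos: int) -> int:
    """Greedily extend a seed hit to the right, counting matching bases."""
    length = 0
    while (ref_pos + length < len(reference)
           and read_pos + length < len(read)
           and reference[ref_pos + length] == read[read_pos + length]):
        length += 1
    return length

def align(reference: str, read: str, k: int = 11):
    """Return the best (ref_start, match_length) found by seeding + extension."""
    index = build_kmer_index(reference, k)
    best = None
    for j in range(len(read) - k + 1):                 # seeding kernel
        for pos in index.get(read[j:j + k], []):
            match = extend(reference, read, pos, j)    # extension kernel
            if best is None or match > best[1]:
                best = (pos - j, match)
    return best

if __name__ == "__main__":
    ref = "ACGTACGTGGGACCTTACGTACGATCGGATCC"
    print(align(ref, "GGGACCTTACG", k=5))   # -> (8, 11)
```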
ISBN: 9798400714436 (Print)
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. This paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02x and 4.19x, respectively.
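As a rough illustration of the parallel-structure fusion idea described above, the sketch below executes two independent branches that read the same input as a single batched matmul rather than two separate kernels. This is a hand-written NumPy analogy, not Magneto's compiler transformation; the function names and shapes are invented for the example.

```python
# Toy illustration of fusing two parallel linear branches into one kernel.
import numpy as np

def branches_unfused(x, w1, w2):
    # Two independent linear operators launched one after another.
    return x @ w1, x @ w2

def branches_fused(x, w1, w2):
    # Concatenate the weights so both branches run as a single matmul,
    # then split the result back into the two branch outputs.
    w = np.concatenate([w1, w2], axis=1)
    y = x @ w
    return y[:, :w1.shape[1]], y[:, w1.shape[1]:]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    w1 = rng.standard_normal((8, 16))
    w2 = rng.standard_normal((8, 16))
    a, b = branches_unfused(x, w1, w2)
    c, d = branches_fused(x, w1, w2)
    assert np.allclose(a, c) and np.allclose(b, d)   # same results, one launch
```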
ISBN: 9798400714436 (Print)
There are several strategies to parallelize graph neural network (GNN) training over multiple GPUs. We observe that no single strategy is a consistent winner (i.e., has the shortest running time); the optimal strategy depends on the graph dataset, GNN model, training algorithm, and hardware configuration. We therefore design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs among the strategies and design simple yet effective cost models to compare their execution times and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run any of them. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and that training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://***/kaihaoma/APT.
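A hedged sketch of cost-model-driven strategy selection, the core idea described above: estimate each candidate strategy's execution time from a task profile and pick the cheapest. The strategy names and cost formulas below are placeholders for illustration only, not APT's actual models.

```python
# Illustrative cost-model-based strategy selection (not APT's real cost models).
from dataclasses import dataclass

@dataclass
class TaskProfile:
    num_nodes: int            # graph size
    feature_dim: int          # input feature width
    num_gpus: int
    interconnect_gbps: float

def cost_data_parallel(p: TaskProfile) -> float:
    # Placeholder formula: compute split across GPUs plus per-node communication.
    return p.num_nodes * p.feature_dim / p.num_gpus + p.num_nodes / p.interconnect_gbps

def cost_graph_partitioned(p: TaskProfile) -> float:
    # Placeholder formula: compute split plus cross-partition feature exchange.
    return (p.num_nodes * p.feature_dim / p.num_gpus
            + 0.2 * p.num_nodes * p.feature_dim / p.interconnect_gbps)

STRATEGIES = {
    "data_parallel": cost_data_parallel,
    "graph_partitioned": cost_graph_partitioned,
}

def select_strategy(profile: TaskProfile) -> str:
    """Pick the strategy whose estimated cost is lowest for this task."""
    return min(STRATEGIES, key=lambda name: STRATEGIES[name](profile))

if __name__ == "__main__":
    print(select_strategy(TaskProfile(10_000_000, 256, 8, 100.0)))
```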
ISBN: 9798400714436 (Print)
Scaling blockchain performance through parallel smart contract execution has gained significant attention, as traditional methods remain constrained by the performance of a single virtual machine (VM), even in multi-chain or Layer-2 systems. Parallel VMs offer a compelling solution by enabling concurrent transaction execution within a single smart contract, using multiple CPU cores. However, Ethereum's sequential, shared-everything model limits the efficiency of existing parallel mechanisms, resulting in frequent rollbacks with optimistic methods and high overhead with pessimistic methods due to state dependency analysis and locking. This paper introduces Crystality, a programming model for smart contracts on parallel Ethereum Virtual Machines (EVMs) that enables developers to express and leverage the parallelism inherent in smart contracts. Crystality introduces Programmable Contract Scopes to partition contract state into non-overlapping, parallelizable segments and to decompose a smart contract function into finer-grained components. Crystality also features Asynchronous Functional Relay to manage execution flow across EVMs. These features simplify the expression of parallelism and enable asynchronous execution of commutative contract operations. Crystality extends Solidity with directives, transpiling Crystality code into standard Solidity code for EVM compatibility. The system supports two execution modes: an asynchronous mode for transactions involving commutative operations and an optimistic fallback that preserves the block-defined transaction order. Our experiments demonstrate Crystality's superior performance compared to Ethereum, Aptos, and Sui on a 64-core machine.
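The sketch below illustrates, in plain Python rather than Crystality's extended Solidity, the two mechanisms named above: contract state partitioned into non-overlapping scopes (one per account) and commutative credits relayed asynchronously instead of being applied in place. The class and method names are invented for this example and do not reflect Crystality's actual syntax.

```python
# Conceptual sketch of scoped state plus asynchronous relay of commutative ops.
from collections import defaultdict, deque

class ScopedToken:
    """Each account balance lives in its own scope, so transfers touching
    different accounts can run on different executors without conflicts."""

    def __init__(self):
        self.balances = defaultdict(int)   # scope key: account address
        self.relay_queue = deque()         # deferred cross-scope messages

    def mint(self, account: str, amount: int):
        self.balances[account] += amount   # touches a single scope

    def transfer(self, src: str, dst: str, amount: int):
        # The debit happens in the sender's scope immediately; the credit is
        # commutative, so it is relayed and applied later.
        if self.balances[src] < amount:
            raise ValueError("insufficient balance")
        self.balances[src] -= amount
        self.relay_queue.append((dst, amount))

    def drain_relays(self):
        # Credits commute with each other, so the order in which they are
        # applied does not change the final state.
        while self.relay_queue:
            dst, amount = self.relay_queue.popleft()
            self.balances[dst] += amount

if __name__ == "__main__":
    token = ScopedToken()
    token.mint("alice", 100)
    token.transfer("alice", "bob", 30)
    token.transfer("alice", "carol", 20)
    token.drain_relays()
    print(dict(token.balances))   # {'alice': 50, 'bob': 30, 'carol': 20}
```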
ISBN: 9798400714436 (Print)
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach; it integrates prior risk information into a Bayesian Boolean Lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that must be distinguished individually and simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework that tackles BMGT's computational challenges through three key innovations: 1) a parallel binary-encoded product lattice model with up to 99.8% efficiency; 2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4096 cores; and 3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STDs datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
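For readers unfamiliar with the underlying idea, the following is a minimal sketch of classical two-stage (Dorfman) group testing: test pools first, then retest only the members of positive pools. It shows why pooling reduces test counts, but it is not the Bayesian multinomial strategy (BMGT/BBPA) proposed in the paper.

```python
# Minimal two-stage group testing sketch (Dorfman scheme, for illustration only).

def group_test(samples: list[bool], pool_size: int) -> tuple[list[bool], int]:
    """Return the classification of every sample and the number of tests used."""
    results = [False] * len(samples)
    tests = 0
    for start in range(0, len(samples), pool_size):
        pool = list(range(start, min(start + pool_size, len(samples))))
        tests += 1                                # one test for the whole pool
        if any(samples[i] for i in pool):         # pool is positive:
            for i in pool:                        # retest each member individually
                tests += 1
                results[i] = samples[i]
    return results, tests

if __name__ == "__main__":
    # 100 samples with 3 positives: far fewer than 100 tests are needed.
    truth = [False] * 100
    for i in (7, 42, 88):
        truth[i] = True
    classified, used = group_test(truth, pool_size=10)
    assert classified == truth
    print(f"{used} tests instead of 100")   # 10 pool tests + 30 retests = 40
```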