ISBN: (Print) 9798400714436
The proceedings contain 49 papers. The topics discussed include: Semi-StructMG: a fast and scalable semi-structured algebraic multigrid; LibRTS: a spatial indexing library by ray tracing; high-performance visual semantics compression for AI-driven science; COMPSO: optimizing gradient compression for distributed training with second-order optimizers; TurboFFT: co-designed high-performance and fault-tolerant fast Fourier transform on GPUs; Helios: efficient distributed dynamic graph sampling for online GNN inference; triangle counting on tensor cores; AC-Cache: a memory-efficient caching system for small objects via exploiting access correlations; Magneto: accelerating parallel structures in DNNs via co-optimization of operators; and FlashSparse: minimizing computation redundancy for fast sparse matrix multiplications on tensor cores.
ISBN: (Print) 9781450362252
The proceedings contain 58 papers. The topics discussed include: beyond human-level accuracy: computational challenges in deep learning; throughput-oriented GPU memory allocation; SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU; incremental flattening for nested data parallelism; modular transactions: bounding mixed races in space and time; processing transactions in a predefined order; data-flow/dependence profiling for structured transformations; lightweight hardware transactional memory profiling; provably and practically efficient granularity control; semantics-aware scheduling policies for synchronization determinism; and a round-efficient distributed betweenness centrality algorithm.
ISBN: (Print) 9798400714436
Sequence alignment is a fundamental and often time-consuming step in genomic data analysis. It typically adheres to the seed-and-extension paradigm, and numerous accelerator-based approaches have been proposed to optimize one kernel or the other. However, these approaches often increase costs while contributing little to the overall alignment process. To address this, we have designed an optimized full pipeline, FastBWA, which seeks to enhance performance while keeping costs low and explores the potential of CPU computing resources. Our implementation demonstrates that FastBWA achieves up to 2.5x and 1.8x higher end-to-end alignment throughput compared to BWA-MEM and its newer version, BWA-MEM2, respectively.
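For readers unfamiliar with the paradigm, the following is a minimal Python sketch of seed-and-extension alignment. It is purely illustrative: production aligners such as BWA-MEM use FM-index seeding and banded Smith-Waterman extension, and none of these function names come from FastBWA.

```python
# Minimal seed-and-extend sketch (illustrative; not FastBWA's actual code).
from collections import defaultdict

K = 11  # seed (k-mer) length; real aligners use FM-indexes, not hash tables

def build_index(reference: str) -> dict:
    """Map every k-mer in the reference to its start positions (the seed index)."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def extend(reference: str, read: str, ref_pos: int, read_pos: int) -> int:
    """Score a gapless extension around a seed hit (real tools use banded DP)."""
    score = 0
    i, j = ref_pos + K, read_pos + K          # extend right of the seed
    while i < len(reference) and j < len(read):
        score += 1 if reference[i] == read[j] else -1
        i, j = i + 1, j + 1
    i, j = ref_pos - 1, read_pos - 1          # extend left of the seed
    while i >= 0 and j >= 0:
        score += 1 if reference[i] == read[j] else -1
        i, j = i - 1, j - 1
    return score + K                          # seed bases match exactly

def align(reference: str, index: dict, read: str):
    """Seed: look up exact k-mer hits. Extend: score each hit, keep the best."""
    best = None
    for read_pos in range(len(read) - K + 1):
        for ref_pos in index.get(read[read_pos:read_pos + K], ()):
            score = extend(reference, read, ref_pos, read_pos)
            if best is None or score > best[0]:
                best = (score, ref_pos - read_pos)
    return best  # (score, inferred mapping position) or None

ref = "ACGTACGTGGTCACGTTGACCTGAAGCTTACGTACGT"
idx = build_index(ref)
print(align(ref, idx, "GGTCACGTTGACCTGA"))
```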
ISBN: (Print) 9798400714436
Scaling blockchain performance through parallel smart contract execution has gained significant attention, as traditional methods remain constrained by the performance of a single virtual machine (VM), even in multi-chain or Layer-2 systems. Parallel VMs offer a compelling solution by enabling concurrent transaction execution within a single smart contract, using multiple CPU cores. However, Ethereum's sequential, shared-everything model limits the efficiency of existing parallel mechanisms, resulting in frequent rollbacks with optimistic methods and high overhead with pessimistic methods due to state dependency analysis and locking. This paper introduces Crystality, a programming model for smart contracts on parallel Ethereum Virtual Machines (EVMs) that enables developers to express and leverage the parallelism inherent in smart contracts. Crystality introduces Programmable Contract Scopes to partition contract state into non-overlapping, parallelizable segments and to decompose a smart contract function into finer-grained components. Crystality also features Asynchronous Functional Relay to manage execution flow across EVMs. These features simplify the expression of parallelism and enable asynchronous execution for commutative contract operations. Crystality extends Solidity with directives, transpiling Crystality code into standard Solidity code for EVM compatibility. The system supports two execution modes: an asynchronous mode for transactions involving commutative operations and an optimistic fallback to preserve the block-defined transaction order. Our experiments demonstrate Crystality's superior performance compared to Ethereum, Aptos, and Sui on a 64-core machine.
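As a rough analogy for why scoped, commutative operations avoid rollbacks, here is a Python sketch (Crystality itself extends Solidity; `Scope` and `relay` are hypothetical names, not its syntax): state is partitioned into non-overlapping scopes, and commutative operations such as credits can be relayed to each scope asynchronously because their order does not affect the result.

```python
# Python analogy of scoped, commutative contract state (not Crystality's syntax).
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class Scope:
    """A non-overlapping partition of contract state; each scope could run on its own VM."""
    def __init__(self, name):
        self.name, self.balance, self._lock = name, 0, Lock()

    def credit(self, amount):
        # Commutative: credits apply in any order, so no global ordering
        # (and no optimistic rollback) is needed across scopes.
        with self._lock:
            self.balance += amount

def relay(pool, scope, amount):
    """'Asynchronous functional relay' stand-in: hand the op to the scope's executor."""
    return pool.submit(scope.credit, amount)

scopes = {name: Scope(name) for name in ("alice", "bob")}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [relay(pool, scopes[n], amt)
               for n, amt in [("alice", 5), ("bob", 3), ("alice", 2)]]
    for f in futures:
        f.result()
print({n: s.balance for n, s in scopes.items()})  # {'alice': 7, 'bob': 3}
```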
ISBN: (Print) 9798400714436
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. This paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02x and 4.19x, respectively.
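To make the notion of a "parallel structure" concrete, here is a NumPy sketch (not Magneto's implementation) of the classic fusion opportunity such compilers generalize: two independent operators that share an input can be combined into a single batched operation, improving hardware utilization over launching each small operator separately.

```python
# Illustrative parallel-operator fusion sketch (not Magneto's code).
import numpy as np

x = np.random.rand(32, 64)                                  # shared input
w1, w2 = np.random.rand(64, 128), np.random.rand(64, 128)   # parallel branches

# Unfused: two separate operator launches.
y1, y2 = x @ w1, x @ w2

# Fused: stack the branch weights and issue a single batched matmul.
w = np.stack([w1, w2])   # shape (2, 64, 128)
y = x @ w                # broadcasts to (2, 32, 128), one launch

assert np.allclose(y[0], y1) and np.allclose(y[1], y2)
```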
ISBN: (Print) 9798400714436
There are several strategies to parallelize graph neural network (GNN) training over multiple GPUs. We observe that there is no consistent winner (i.e., with the shortest running time): the optimal strategy depends on the graph dataset, GNN model, training algorithm, and hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs among the strategies and design simple yet effective cost models to compare their execution time and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run different strategies. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and that training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://***/kaihaoma/APT.
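A minimal sketch of cost-model-driven strategy selection in this spirit follows; the strategy names and cost formulas below are hypothetical toys, not APT's actual models.

```python
# Toy cost-model-based strategy selection (illustrative; not APT's models).
def pick_strategy(costs: dict) -> str:
    """Given an estimated per-epoch time for each strategy, pick the cheapest."""
    return min(costs, key=costs.get)

def estimate_costs(num_nodes, feat_dim, num_gpus, pcie_gbps=16e9, flops=1e13):
    """Toy models: split compute time plus strategy-specific communication time."""
    compute = num_nodes * feat_dim * 2 / flops / num_gpus
    feature_bytes = num_nodes * feat_dim * 4          # fp32 features
    return {
        # replicate the graph, all-reduce gradients each step
        "data_parallel": compute + feature_bytes / pcie_gbps,
        # partition features, exchange partial aggregations
        "feature_parallel": compute + feature_bytes / num_gpus / pcie_gbps * 2,
    }

costs = estimate_costs(num_nodes=10_000_000, feat_dim=256, num_gpus=4)
print(costs, "->", pick_strategy(costs))
```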
ISBN: (Print) 9798400714436
Deterministic parallelism is a key building block for distributed and fault-tolerant systems, offering substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel systems (DPS), we identify design pitfalls, such as batched execution and inefficient runtime synchronization, that preclude them from meeting the demands of μs-scale, high-throughput distributed systems deployed in modern datacenters. We present DORADD, a deterministically parallel runtime with low latency and high throughput, designed for modern datacenter services. DORADD introduces a hybrid scheduling scheme that effectively decouples request dispatching from execution. It employs a single dispatcher to deterministically construct a dynamic dependency graph of incoming requests, and worker pools that can independently execute requests in a work-conserving and synchronization-free manner. Furthermore, DORADD overcomes the single-dispatcher throughput bottleneck via core pipelining. We use DORADD to build an in-memory database and compare it with Caracal, the current state-of-the-art deterministic database, on the YCSB and TPC-C benchmarks. Our evaluation shows up to 2.5x better throughput and more than 150x and 300x better tail latency in non-contended and contended cases, respectively. We also compare DORADD with Caladan, the state-of-the-art non-deterministic remote procedure call (RPC) scheduler, and demonstrate that determinism in DORADD does not incur any performance overhead.
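The dispatch/execute split can be illustrated with a short Python sketch (not DORADD's code): a single dispatcher deterministically records, for each request, the last earlier request that touched any of its keys; workers then wait only on those true predecessors, so execution order on independent keys is free while the outcome stays deterministic.

```python
# Sketch of deterministic dispatch plus synchronization-light execution.
from concurrent.futures import ThreadPoolExecutor

def dispatch(requests):
    """Deterministic dispatch: request i depends on the last earlier request
    that touched any key it touches (a dynamic dependency graph)."""
    last_writer, deps = {}, []
    for i, (_, keys) in enumerate(requests):
        deps.append({last_writer[k] for k in keys if k in last_writer})
        for k in keys:
            last_writer[k] = i
    return deps

def run(requests, deps, workers=4):
    """Workers execute each request as soon as its predecessors finish."""
    futures = [None] * len(requests)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, (fn, _) in enumerate(requests):
            def task(i=i, fn=fn):
                for d in deps[i]:        # wait only on true predecessors
                    futures[d].result()
                return fn()
            futures[i] = pool.submit(task)
        return [f.result() for f in futures]

store = {}
def write(key, value):
    """A toy request: a function to run plus the set of keys it touches."""
    def fn():
        store[key] = value
        return value
    return (fn, {key})

reqs = [write("a", 1), write("b", 2), write("a", 3)]
print(run(reqs, dispatch(reqs)), store)  # deterministic: store["a"] == 3
```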
ISBN: (Print) 9798400714436
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach, which integrates prior risk information into a Bayesian Boolean lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that can be distinguished individually and simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian-based model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework that tackles BMGT's computational challenges through three key innovations: (1) a parallel binary-encoded product lattice model with up to 99.8% efficiency; (2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4,096 cores; and (3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STD datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
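For context, the following toy shows the basic economics of plain (non-Bayesian, single-attribute) group testing that BMGT builds on: pool the samples, test each pool once, and retest individuals only in positive pools.

```python
# Two-round group testing toy (illustrative; not the BMGT/SBMGT algorithms).
def pool_test(samples, pool_size, is_positive):
    """Return the positive sample indices and the number of tests used."""
    tests, positives = 0, []
    for start in range(0, len(samples), pool_size):
        pool = list(range(start, min(start + pool_size, len(samples))))
        tests += 1                                  # one test for the whole pool
        if any(is_positive(samples[i]) for i in pool):
            for i in pool:                          # retest members individually
                tests += 1
                if is_positive(samples[i]):
                    positives.append(i)
    return positives, tests

samples = [0] * 100
samples[17] = samples[42] = 1                       # two positive samples
found, tests = pool_test(samples, pool_size=10, is_positive=bool)
print(found, f"{tests} tests instead of {len(samples)}")  # 30 tests instead of 100
```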
Speculative data-parallel algorithms for language recognition have been widely explored for various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular e...
ISBN: (Print) 9798400714436
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be supported with close to the maximum (4x) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8x) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
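The arithmetic behind the claimed speedup can be sketched in NumPy (the real kernel is a hand-tuned GPU implementation; this illustrates only the math): weights stored at 4 bits per element with per-group scales cut weight traffic roughly 4x versus fp16, and are dequantized on the fly before the matmul.

```python
# Mixed-precision weight-quantization math sketch (not the MARLIN kernel).
import numpy as np

def quantize_4bit(w, group_size=128):
    """Symmetric 4-bit quantization with one float scale per group of weights."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7     # int4 range: -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale, shape):
    """Dequantize then multiply (a fused GPU kernel does this in registers)."""
    w = (q.astype(np.float32) * scale).reshape(shape)
    return x @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((16, 256)).astype(np.float32)   # batch of 16 clients
q, s = quantize_4bit(w)
err = np.abs(dequant_matmul(x, q, s, w.shape) - x @ w).max()
print(f"max abs error from 4-bit weights: {err:.3f}")
```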