ISBN: (Print) 9781450368186
The proceedings contain 46 papers. The topics discussed include: kite: efficient and available release consistency for the datacenter; oak: a scalable off-heap allocated key-value map; taming unbalanced training workloads in deep learning with partial collective operations; scalable top-k retrieval with sparta; waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data; scaling concurrent queues by using HTM to profit from failed atomic operations; a wait-free universal construction for large objects; universal wait-free memory reclamation; and using sample-based time series data for automated diagnosis of scalability losses in parallel programs.
The proceedings contain 13 papers. The topics discussed include: a guided walk into link key candidate extraction with relational concept analysis; reflections on profiling and cataloguing the content of SPARQL endpoints using SPORTAL; reflections on: modeling linked open statistical data; reflections on: DCAT-AP representation of Czech national open data catalog and its impact; reflections on: deep learning for noise-tolerant RDFS reasoning; reflections on: finding melanoma drugs through a probabilistic knowledge graph; reflections on: knowledge graph fact prediction via knowledge-enriched tensor factorization; the semantic sensor network ontology, revamped; and reflections on: knowmore - knowledge base augmentation with structured web markup.
ISBN: (Print) 9798400714436
Sequence alignment is a fundamental and often time-consuming step in genomic data analysis. Typically, it adheres to the seed-and-extend paradigm, and numerous accelerator-based approaches have been proposed to optimize one kernel or the other. However, these approaches often increase costs while contributing minimally to the overall alignment process. To address this, we have designed an optimized full pipeline, FastBWA, which seeks to enhance performance while keeping costs low and explores the potential of CPU computing resources. Our implementation demonstrates that FastBWA achieves up to 2.5x and 1.8x higher end-to-end alignment throughput than BWA-MEM and its newer version, BWA-MEM2, respectively.
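The seed-and-extend paradigm mentioned in the abstract can be illustrated with a minimal sketch: index short exact-match seeds from the reference, then extend each seed hit and score it. The seed length and greedy match-counting extension below are invented for illustration; they are not FastBWA's actual algorithm.

```python
# Minimal sketch of seed-and-extend alignment. SEED_LEN and the toy
# scoring are illustrative placeholders, not FastBWA's parameters.

SEED_LEN = 4

def build_seed_index(reference):
    """Map every SEED_LEN-mer in the reference to its start positions."""
    index = {}
    for i in range(len(reference) - SEED_LEN + 1):
        index.setdefault(reference[i:i + SEED_LEN], []).append(i)
    return index

def extend(reference, read, ref_pos):
    """Extend a seed hit to the right, counting exact matches."""
    score = 0
    for j in range(len(read)):
        if ref_pos + j < len(reference) and reference[ref_pos + j] == read[j]:
            score += 1
    return score

def align(reference, read, index):
    """Seed with the read's first SEED_LEN-mer, then extend each hit."""
    best = (-1, None)  # (score, reference position)
    for pos in index.get(read[:SEED_LEN], []):
        best = max(best, (extend(reference, read, pos), pos))
    return best

ref = "ACGTACGTTGCA"
idx = build_seed_index(ref)
print(align(ref, "ACGTT", idx))  # → (5, 4): the hit at position 4 extends best
```

Real aligners replace the dictionary index with an FM-index over the BWT and the greedy extension with banded Smith-Waterman, but the two-phase structure is the same.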
ISBN: (Print) 9798400714436
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. This paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02x and 4.19x, respectively.
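The co-optimization idea in the abstract can be sketched abstractly: rather than tuning each parallel branch in isolation and running them one after another, pick a single configuration that minimizes the cost of the fused group. The cost model and candidate tile sizes below are invented for illustration; they are not Magneto's tuning algorithm.

```python
# Toy comparison of per-branch tuning vs. co-tuning fused parallel operators.
# The "cost" counts kernel launches under an invented tiling model.

def branch_cost(work, tile):
    """Toy cost model: number of tile-sized kernel launches for one branch."""
    full_tiles, rem = divmod(work, tile)
    return full_tiles + (1 if rem else 0)

def tune_separately(branches, tiles):
    """Each branch picks its own best tile; branches then run serially."""
    return sum(min(branch_cost(w, t) for t in tiles) for w in branches)

def co_tune_fused(branches, tiles):
    """Fused branches share each launch, so the group's cost is the max branch."""
    return min(max(branch_cost(w, t) for w in branches) for t in tiles)

branches = [96, 128, 64]   # per-branch work of three parallel operators
tiles = [16, 32, 64]
print(tune_separately(branches, tiles))  # → 5 launches, branches run serially
print(co_tune_fused(branches, tiles))    # → 2 launches when fused and co-tuned
```

The point of the sketch is the shape of the search: co-tuning evaluates one shared configuration across all branches, which is exactly the opportunity that per-operator tuners miss.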
ISBN: (Print) 9798400714436
Scaling blockchain performance through parallel smart contract execution has gained significant attention, as traditional methods remain constrained by the performance of a single virtual machine (VM), even in multi-chain or Layer-2 systems. Parallel VMs offer a compelling solution by enabling concurrent transaction execution within a single smart contract, using multiple CPU cores. However, Ethereum's sequential, shared-everything model limits the efficiency of existing parallel mechanisms, resulting in frequent rollbacks with optimistic methods and high overhead with pessimistic methods due to state dependency analysis and locking. This paper introduces Crystality, a programming model for smart contracts on parallel Ethereum Virtual Machines (EVMs) that enables developers to express and leverage the parallelism inherent in smart contracts. Crystality introduces Programmable Contract Scopes to partition contract states into non-overlapping, parallelizable segments and decompose a smart contract function into finer-grained components. Crystality also features Asynchronous Functional Relay to manage execution flow across EVMs. These features simplify parallelism expression and enable asynchronous execution for commutative contract operations. Crystality extends Solidity with directives, transpiling Crystality code into standard Solidity code for EVM compatibility. The system supports two execution modes: an asynchronous mode for transactions involving commutative operations and an optimistic fallback to ensure block-defined transaction order. Our experiments demonstrate Crystality's superior performance compared to Ethereum, Aptos, and Sui on a 64-core machine.
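Why commutative operations on partitioned state need no global ordering can be shown with a small sketch (in Python rather than Crystality's Solidity-based syntax, and with invented names): when each account's balance is its own scope and the only operations are commutative credits, every execution order yields the same final state.

```python
# Illustrative sketch: state partitioned into non-overlapping per-account
# scopes, updated only by commutative credits. Every interleaving of the
# operations produces the same deterministic final state.

from itertools import permutations

class ScopedLedger:
    """Balances partitioned by account: each account is its own scope."""
    def __init__(self):
        self.scopes = {}

    def credit(self, account, amount):
        # Addition commutes, so credits need no global ordering.
        self.scopes[account] = self.scopes.get(account, 0) + amount

ops = [("alice", 5), ("bob", 3), ("alice", 2), ("bob", 7)]

finals = set()
for order in permutations(ops):
    ledger = ScopedLedger()
    for account, amount in order:
        ledger.credit(account, amount)
    finals.add(tuple(sorted(ledger.scopes.items())))

print(finals)  # a single final state, regardless of execution order
```

This order-independence is what lets such transactions run asynchronously; non-commutative operations are the ones that would need the optimistic fallback to restore block-defined order.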
ISBN: (Print) 9798400714436
There are several strategies to parallelize graph neural network (GNN) training over multiple GPUs. We observe that there is no consistent winner (i.e., with the shortest running time), and the optimal strategy depends on the graph dataset, GNN model, training algorithm, and hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs of the strategies and design simple yet effective cost models to compare their execution time and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run different strategies. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and the training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://***/kaihaoma/APT.
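Cost-model-driven strategy selection of the kind described above has a simple shape: estimate each strategy's execution time from workload parameters and pick the minimum. The strategy names and cost formulas below are invented placeholders, not APT's actual models.

```python
# Toy cost-model-based strategy selection. The formulas are made up to
# illustrate that the winner flips with workload parameters.

def cost_data_parallel(n_nodes, feat_dim, n_gpus):
    # Toy model: per-node sync cost independent of feature size.
    return n_nodes * feat_dim / n_gpus + 0.5 * n_nodes

def cost_graph_partitioned(n_nodes, feat_dim, n_gpus):
    # Toy model: cross-partition traffic grows with feature size.
    return n_nodes * feat_dim / n_gpus + 0.02 * n_nodes * feat_dim

COST_MODELS = {
    "data_parallel": cost_data_parallel,
    "graph_partitioned": cost_graph_partitioned,
}

def select_strategy(n_nodes, feat_dim, n_gpus):
    """Pick the strategy with the lowest estimated execution time."""
    return min(COST_MODELS, key=lambda s: COST_MODELS[s](n_nodes, feat_dim, n_gpus))

print(select_strategy(n_nodes=10_000, feat_dim=512, n_gpus=4))  # wide features
print(select_strategy(n_nodes=10_000, feat_dim=16, n_gpus=4))   # narrow features
```

Even this toy version shows why there is "no consistent winner": changing one workload parameter (here, the feature dimension) flips which strategy the models prefer.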
ISBN: (Print) 9798400714436
Deterministic parallelism is a key building block for distributed and fault-tolerant systems that offers substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel systems (DPS), we identify certain design pitfalls, such as batched execution and inefficient runtime synchronization, that preclude them from meeting the demands of μs-scale and high-throughput distributed systems deployed in modern datacenters. We present DORADD, a deterministically parallel runtime with low latency and high throughput, designed for modern datacenter services. DORADD introduces a hybrid scheduling scheme that effectively decouples request dispatching from execution. It employs a single dispatcher to deterministically construct a dynamic dependency graph of incoming requests, and worker pools that can independently execute requests in a work-conserving and synchronization-free manner. Furthermore, DORADD overcomes the single-dispatcher throughput bottleneck through core pipelining. We use DORADD to build an in-memory database and compare it with Caracal, the current state-of-the-art deterministic database, via the YCSB and TPC-C benchmarks. Our evaluation shows up to 2.5x better throughput and more than 150x and 300x better tail latency in non-contended and contended cases, respectively. We also compare DORADD with Caladan, the state-of-the-art non-deterministic remote procedure call (RPC) scheduler, and demonstrate that determinism in DORADD does not incur any performance overhead.
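The dispatch/execute split described above can be sketched in miniature: a single dispatcher deterministically records, for each request, the earlier requests that touched the same keys, and workers may then execute in any order that respects those edges. The key-based dependency rule and serial worker simulation below are invented for illustration; this is not DORADD's implementation.

```python
# Toy single-dispatcher dependency graph plus dependency-respecting execution.

def dispatch(requests):
    """Single dispatcher: deterministically build a dependency graph over keys."""
    last_toucher = {}                         # key -> index of last request using it
    deps = {i: set() for i in range(len(requests))}
    for i, (_, keys) in enumerate(requests):
        for k in keys:
            if k in last_toucher:
                deps[i].add(last_toucher[k])  # must run after that request
            last_toucher[k] = i
    return deps

def execute(requests, deps):
    """Workers: run any request whose dependencies are done (simulated serially)."""
    done, log = set(), []
    while len(done) < len(requests):
        for i, (name, _) in enumerate(requests):
            if i not in done and deps[i] <= done:
                log.append(name)
                done.add(i)
    return log

reqs = [("r0", {"a"}), ("r1", {"b"}), ("r2", {"a", "b"}), ("r3", {"c"})]
deps = dispatch(reqs)
print(deps)            # r2 depends on r0 and r1; r3 is independent
print(execute(reqs, deps))
```

Because the graph depends only on the deterministic arrival order and key sets, every worker schedule that respects it produces the same observable outcome, which is the property that lets workers run synchronization-free.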