the proceedings contain 46 papers. the topics discussed include: kite: efficient and available release consistency for the datacenter;Oak: a scalable off-heap allocated key-value map;optimizing batched Winograd convol...
the proceedings contain 46 papers. the topics discussed include: kite: efficient and available release consistency for the datacenter;Oak: a scalable off-heap allocated key-value map;optimizing batched Winograd convolution on GPUs;taming unbalanced training workloads in deep learning with partial collective operations;scalable top-K retrieval with Sparta;waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data;scaling concurrent queues by using HTM to profit from failed atomic operations;a wait-free universal construction for large objects;using sample-based time series data for automated diagnosis of scalability losses in parallel programs;scaling out speculative execution of finite-state machines withparallel merge;and detecting and reproducing error-code propagation bugs in MPI implementations.
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Tes...
详细信息
ISBN:
(纸本)9798400714436
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach, which integrates prior risk information into a Bayesian Boolean Lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that can be individually distinguished simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian-based model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework to tackle BMGT's computational challenges by proposing three key innovations: 1) a parallel binaryencoded product lattice model with up to 99.8% efficiency;2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4096 cores;and 3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STDs datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
Deterministic parallelism is a key building block for distributed and fault-tolerant systems that offers substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel...
详细信息
ISBN:
(纸本)9798400714436
Deterministic parallelism is a key building block for distributed and fault-tolerant systems that offers substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel systems (DPS), we identify certain design pitfalls, such as batched execution and inefficient runtime synchronization, that preclude them from meeting the demands of mu s-scale and high-throughput distributed systems deployed in modern datacenters. We present DORADD, a deterministically parallel runtime with low latency and high throughput, designed for modern datacenter services. DORADD introduces a hybrid scheduling scheme that effectively decouples request dispatching from execution. It employs a single dispatcher to deterministically construct a dynamic dependency graph of incoming requests and worker pools that can independently execute requests in a work-conserving and synchronization-free manner. Furthermore, DORADD overcomes the single-dispatcher throughput bottleneck based on core pipelining. We use DORADD to build an in-memory database and compare it with Caracal, the current state-of-the-art deterministic database, via the YCSB and TPC-C benchmarks. Our evaluation shows up to 2.5x better throughput and more than 150x and 300x better tail latency in non-contended and contended cases, respectively. We also compare DO-RADD with Caladan, the state-of-the-art non-deterministic remote procedure call (RPC) scheduler, and demonstrate that determinism in DORADD does not incur any performance overhead.
Molecular dynamics simulation emerges as an important area that HPC+AI helps to investigate the physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (M...
详细信息
ISBN:
(纸本)9798400714436
Molecular dynamics simulation emerges as an important area that HPC+AI helps to investigate the physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (ML) tools have been leveraged in MLIPs, but they are not perfectly matched with each other, since many optimization opportunities in MLIPs have been missed by ML tools. this inefficiency arises from the fact that HPC+AI applications work with far more computational complexity compared with pure AI scenarios. this paper has developed an MLIP, named TensorMD, independently from any ML tool. TensorMD has been evaluated on two supercomputers and scaled to 51.8 billion atoms, i.e., similar to 3x compared with state-of-the-art.
Speculative data-parallel algorithms for language recognition have been widely experimented for various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular e...
详细信息
Integer sorting is a fundamental problem in computer science. this paper studies parallel integer sort both in theory and in practice. In theory, we show tighter bounds for a class of existing practical integer sort a...
详细信息
ISBN:
(纸本)9798400704352
Integer sorting is a fundamental problem in computer science. this paper studies parallel integer sort both in theory and in practice. In theory, we show tighter bounds for a class of existing practical integer sort algorithms, which provides a solid theoretical foundation for their widespread usage in practice and strong performance. In practice, we design a new integer sorting algorithm, DovetailSort, that is theoreticallyefficient and has good practical performance. In particular, DovetailSort overcomes a common challenge in existing parallel integer sorting algorithms, which is the difficulty of detecting and taking advantage of duplicate keys. the key insight in DovetailSort is to combine algorithmic ideas from both integer- and comparison-sorting algorithms. In our experiments, DovetailSort achieves competitive or better performance than existing state-of-the-art parallel integer and comparison sorting algorithms on various synthetic and real-world datasets.
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced withthe ability to use tasks to mak...
详细信息
ISBN:
(纸本)9798400704352
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced withthe ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and, (b) by leveraging *** lock-free data structures in shared memory to achieve highperformance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups up to 2.1 Chi on the CoMD molecular dynamics and the miniAMR adaptive mesh *** applications scaling up to 4,096 cores.
the proceedings contain 46 papers. the topics discussed include: kite: efficient and available release consistency for the datacenter;oak: a scalable off-heap allocated key-value map;taming unbalanced training workloa...
ISBN:
(纸本)9781450368186
the proceedings contain 46 papers. the topics discussed include: kite: efficient and available release consistency for the datacenter;oak: a scalable off-heap allocated key-value map;taming unbalanced training workloads in deep learning with partial collective operations;scalable top-k retrieval with sparta;waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data;scaling concurrent queues by using HTM to profit from failed atomic operations;a wait-free universal construction for large objects;universal wait-free memory reclamation;and using sample-based time series data for automated diagnosis of scalability losses in parallel programs.
By identifying the.. largest or smallest elements in a set of data, top-k selection is critical for modern high-performance databases and machine learning systems, especially with large data volumes. However, previous...
详细信息
ISBN:
(纸本)9798400704352
By identifying the.. largest or smallest elements in a set of data, top-k selection is critical for modern high-performance databases and machine learning systems, especially with large data volumes. However, previous studies on its GPU implementation are mostly merge-based and rely heavily on the high-speed but size-limited on-chip memory, thereby resulting in a restricted upper bound on... this paper introduces RadiK, a highly optimized GPU-parallel radix top-k selection that is scalable with.., input length, and batch size. With a carefully designed optimization framework targeting high memory bandwidth and resource utilization, RadiK supports far larger.. than the prior art, achieving up to 2.5x speedup for non-batch queries and up to 4.8x speedup for batch queries. We also propose a lightweight refinement that strengthens the robustness of RadiK against skewed distributions by adaptively scaling the input elements.
Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing parallel systems. However, current graph processing engines do not scale well beyond a few tens of computing nodes becau...
详细信息
ISBN:
(纸本)9798400704352
Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing parallel systems. However, current graph processing engines do not scale well beyond a few tens of computing nodes because they are oblivious to the communication cost variations across the interconnection hierarchy. We introduce GraphCube, a better approach to optimizing graph processing on large-scale parallel systems with complex interconnections. GraphCube features a new graph partitioning approach to achieve better load balancing and minimize communication overhead across multiple levels of the interconnection hierarchy. We evaluate GraphCube by applying it to fundamental graph operations performed on synthetic and real-world graph datasets. Our evaluation used up to 79,024 computing nodes and 1.2+ million processor cores. Our large-scale experiments show that GraphCube outperforms state-of-the-art parallel graph processing methods in throughput and scalability. Furthermore, GraphCube outperformed the top-ranked systems on the Graph 500 list.
暂无评论