ISBN: 9781450349826 (print)
The proceedings contain 50 papers. The topics discussed include: interval-based memory reclamation; harnessing epoch-based reclamation for efficient range queries; a persistent lock-free queue for non-volatile memory; hierarchical memory management for mutable state; bridging the gap between deep learning and sparse matrix format selection; optimizing N-dimensional, Winograd-based convolution for manycore CPUs; vSensor: leveraging fixed-workload snippets of programs for performance variance detection; featherlight on-the-fly false-sharing detection; communication-avoiding parallel minimum cuts and connected components; and an effective fusion and tile size model for optimizing image processing pipelines.
ISBN: 9798400714436 (print)
Molecular dynamics simulation has emerged as an important area where HPC+AI helps investigate physical properties, with machine-learning interatomic potentials (MLIPs) at its core. General-purpose machine-learning (ML) tools have been leveraged to build MLIPs, but the two are not a perfect match: ML tools miss many optimization opportunities specific to MLIPs. This inefficiency arises because HPC+AI applications involve far more computational complexity than pure AI scenarios. This paper develops an MLIP, named TensorMD, independently of any ML tool. TensorMD has been evaluated on two supercomputers and scales to 51.8 billion atoms, roughly 3x the previous state of the art.
ISBN: 0897913906 (print)
The proceedings contain 21 papers. The topics discussed include: optimal schedules for parallel prefix computation with bounded resources; parallel-program transformation using a metalanguage; mapping concurrent programs to VLIW processors; a unified framework for systematic loop transformations; scanning polyhedra with DO loops; removal of redundant dependences in DOACROSS loops with constant dependences; exploitation of APL data parallelism on a shared-memory MIMD machine; Andorra-I: a parallel Prolog system that transparently exploits both and- and or-parallelism; and scalable reader-writer synchronization for shared-memory multiprocessors.
ISBN: 9781450349826 (print)
Many sequential loops are actually scans or reductions and can be parallelized across iterations despite the loop-carried dependences. In this work, we consider the parallelization of such scan/reduction loops, and propose a practical runtime approach called sampling-and-reconstruction to extract the hidden scan/reduction patterns in these loops.
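For illustration only, here is a minimal sketch, not the paper's sampling-and-reconstruction technique, of a loop whose loop-carried dependence is in fact a prefix sum, together with the parallel scan it can be rewritten into once the pattern is recognized (C++17, standard library only):

```cpp
// Illustrative sketch: a "sequential" loop that is really a scan, and its
// parallel equivalent. Not the paper's technique; it only shows the kind of
// hidden pattern such loops contain.
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Sequential form: acc[i] depends on acc[i-1], so the loop looks inherently serial.
std::vector<double> running_total_seq(const std::vector<double>& x) {
    std::vector<double> acc(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i];   // loop-carried dependence
        acc[i] = sum;
    }
    return acc;
}

// Once recognized as a scan, the same computation parallelizes across iterations.
std::vector<double> running_total_par(const std::vector<double>& x) {
    std::vector<double> acc(x.size());
    std::inclusive_scan(std::execution::par, x.begin(), x.end(), acc.begin());
    return acc;
}
```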
ISBN: 9781450349826 (print)
This work proposes a low-overhead half-barrier pattern to schedule fine-grain parallel loops and considers its integration in the Intel OpenMP and Cilkplus schedulers. Experimental evaluation demonstrates that the scheduling overhead of our techniques is 43% lower than Intel OpenMP and 12.1x lower than Cilk. We observe a 22% speedup on 48 threads, with a peak speedup of 2.8x.
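As a rough, hedged illustration of a one-sided barrier, the sketch below has loop workers merely signal completion without waiting for one another, while only the continuation thread blocks; this is an interpretation of the general idea, not the Intel OpenMP or Cilkplus integration described in the paper:

```cpp
// Hedged sketch of a one-sided ("half") barrier: workers signal completion and
// exit immediately; only the thread that needs the loop's results waits.
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void parallel_for_half_barrier(int iterations, int num_workers,
                               const std::function<void(int)>& body) {
    std::atomic<int> remaining{num_workers};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            for (int i = w; i < iterations; i += num_workers) body(i);
            remaining.fetch_sub(1, std::memory_order_release);  // signal, never wait
        });
    }
    // Only the continuation blocks until every chunk has finished.
    while (remaining.load(std::memory_order_acquire) != 0) std::this_thread::yield();
    for (auto& t : workers) t.join();
}
```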
ISBN: 9781450349826 (print)
Interaction with physical objects often imposes latency requirements on multi-core embedded systems. One consequence is the need for synchronisation algorithms that provide predictable latency in addition to high throughput. We present a synchronisation algorithm that needs at most 7 atomic memory operations per asynchronous critical section. Its performance is at least competitive with locks.
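The paper's algorithm is not reproduced here; as a point of reference, the sketch below shows a conventional ticket lock in which acquiring and releasing the lock each cost exactly one atomic read-modify-write, illustrating how the atomic operations spent per critical section can be counted and bounded:

```cpp
// Hedged sketch: a ticket lock with a fixed, small number of atomic
// read-modify-write operations per critical section. This is NOT the paper's
// asynchronous algorithm; it only illustrates bounding atomic operations
// to obtain predictable latency.
#include <atomic>

class TicketLock {
    std::atomic<unsigned> next_{0};     // next ticket to hand out
    std::atomic<unsigned> serving_{0};  // ticket currently allowed to enter

public:
    void lock() {
        unsigned my = next_.fetch_add(1, std::memory_order_acq_rel);  // 1 atomic RMW
        while (serving_.load(std::memory_order_acquire) != my) {
            // spin with plain loads; no further read-modify-writes
        }
    }
    void unlock() {
        serving_.fetch_add(1, std::memory_order_acq_rel);  // 1 atomic RMW
    }
};
```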
ISBN: 9781450349826 (print)
Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that exhibit dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We present a new concurrent work queue, the Broker Queue, a highly efficient, linearizable queue for fine-granular work distribution on the GPU. We evaluate its usability and benefits against existing queuing algorithms. Our queue is up to one order of magnitude faster than non-blocking queues, and outperforms simpler queue designs that are unfit for fine-granular work distribution.
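The Broker Queue itself is not reproduced here; the sketch below is a far simpler single-producer/single-consumer ring buffer, shown only to illustrate the general style of fixed-capacity, atomics-based queues used for work distribution:

```cpp
// Hedged sketch: a bounded single-producer/single-consumer ring buffer. Much
// simpler than the Broker Queue (which is multi-producer/multi-consumer and
// linearizable on the GPU); illustrative only.
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // next slot to pop
    std::atomic<std::size_t> tail_{0};  // next slot to push

public:
    bool push(const T& v) {                       // producer side
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {                      // consumer side
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return v;
    }
};
```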
ISBN: 9781450349826 (print)
The complexity of shared memory systems is becoming more relevant as the number of memory domains increases, with different access latencies and bandwidth rates depending on the proximity between the cores and the devices containing the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages, or both, and are typically applied by the system software. We propose techniques at the runtime system level to reduce NUMA effects on parallel applications, leveraging runtime system metadata expressed as a task dependency graph. Our approach, based on graph partitioning methods, provides parallel performance improvements of 1.12x on average with respect to the state of the art.
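As a toy illustration of the mapping step only (not the paper's runtime implementation, which relies on proper graph partitioning methods), the sketch below bisects a task dependency graph in BFS order so that dependent tasks tend to be assigned to the same NUMA domain:

```cpp
// Hedged sketch: assign tasks to one of two NUMA domains by splitting a BFS
// traversal of the task dependency graph, so neighbouring (dependent) tasks
// tend to share a domain. A real runtime would use a proper partitioner.
#include <cstddef>
#include <queue>
#include <vector>

// graph[i] lists the tasks adjacent to task i in the dependency graph.
std::vector<int> bisect_tasks(const std::vector<std::vector<int>>& graph) {
    const std::size_t n = graph.size();
    std::vector<int> domain(n, -1);  // NUMA domain per task, -1 = unassigned
    std::size_t assigned = 0;
    std::queue<std::size_t> frontier;
    for (std::size_t seed = 0; seed < n; ++seed) {
        if (domain[seed] != -1) continue;
        frontier.push(seed);
        while (!frontier.empty()) {
            std::size_t t = frontier.front();
            frontier.pop();
            if (domain[t] != -1) continue;
            // First half of the traversal goes to domain 0, the rest to domain 1.
            domain[t] = assigned < n / 2 ? 0 : 1;
            ++assigned;
            for (int next : graph[t])
                if (domain[next] == -1) frontier.push(static_cast<std::size_t>(next));
        }
    }
    return domain;
}
```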
ISBN: 9781450349826 (print)
We present novel scalable parallel algorithms for finding global minimum cuts and connected components, which are important and fundamental problems in graph processing. To take advantage of future massively parallel architectures, our algorithms are communication-avoiding: they reduce the costs of communication across the network and the cache hierarchy. The fundamental technique underlying our work is the randomized sparsification of a graph: removing a fraction of graph edges, deriving a solution for such a sparsified graph, and using the result to obtain a solution for the original input. We design and implement sparsification with O(1) synchronization steps. Our global minimum cut algorithm reduces both communication and computation costs compared with the state of the art, while our connected components algorithm incurs few cache misses and synchronization steps. We validate our approach by evaluating MPI implementations of the algorithms on a petascale supercomputer. We also provide an approximate variant of the minimum cut algorithm and show that it approximates the exact solutions well while using a fraction of the cores in a fraction of the time.
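As a minimal shared-memory illustration of the sparsification idea (omitting the communication-avoiding, distributed aspects the paper focuses on), the sketch below merges components over a random sample of edges first and then completes the result with a pass over the full edge list:

```cpp
// Hedged sketch: connected components with randomized edge sparsification.
// A sampled subset of edges performs most of the merges cheaply; a fix-up pass
// over all edges guarantees correctness. Illustrative only.
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<std::size_t> parent;
    explicit UnionFind(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
        return x;
    }
    void unite(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

std::vector<std::size_t> connected_components(
    std::size_t n, const std::vector<std::pair<std::size_t, std::size_t>>& edges,
    double keep_prob = 0.1) {
    std::mt19937 rng(42);
    std::bernoulli_distribution keep(keep_prob);
    UnionFind uf(n);
    // Pass 1: a sparsified sample of the edges does the bulk of the merging.
    for (const auto& [u, v] : edges)
        if (keep(rng)) uf.unite(u, v);
    // Pass 2: the remaining edges fix up anything the sample missed.
    for (const auto& [u, v] : edges) uf.unite(u, v);
    std::vector<std::size_t> label(n);
    for (std::size_t i = 0; i < n; ++i) label[i] = uf.find(i);
    return label;
}
```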
ISBN: 9781450349826 (print)
Ordered (key-value) maps are an important and widely used data type for large-scale data processing frameworks. Beyond simple search, insertion, and deletion, more advanced operations such as range extraction, filtering, and bulk updates form a critical part of these frameworks. We describe an interface for ordered maps that is augmented to support fast range queries and sums, and introduce a parallel and concurrent library called PAM (Parallel Augmented Maps) that implements the interface. The interface includes a wide variety of functions on augmented maps, ranging from basic insertion and deletion to more interesting functions such as union, intersection, filtering, extracting ranges, splitting, and range-sums. We describe algorithms for these functions that are efficient both in theory and in practice. As examples of the use of the interface and of the performance of PAM, we apply the library to four applications: simple range sums, interval trees, 2D range trees, and ranked word index searching. The interface greatly simplifies the implementation of these data structures over direct implementations. Sequentially, the code achieves performance that matches or exceeds that of existing libraries designed specifically for a single application; in parallel, our implementation achieves speedups ranging from 40 to 90 on 72 cores with 2-way hyperthreading.
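PAM's actual interface is not reproduced here; the toy sketch below only illustrates the underlying idea of an augmented ordered map, a search tree whose nodes carry subtree sums so that range-sum queries take time proportional to the tree height:

```cpp
// Hedged sketch: an unbalanced BST augmented with subtree sums, supporting
// range-sum queries in O(height). PAM provides balanced, parallel, and far
// richer versions of this idea; this is illustrative only.
#include <memory>

struct Node {
    int key;
    double value;
    double subtree_sum;  // augmentation: sum of all values in this subtree
    std::unique_ptr<Node> left, right;
    Node(int k, double v) : key(k), value(v), subtree_sum(v) {}
};

double sum_of(const Node* n) { return n ? n->subtree_sum : 0.0; }

void insert(std::unique_ptr<Node>& root, int key, double value) {
    if (!root) { root = std::make_unique<Node>(key, value); return; }
    if (key < root->key)      insert(root->left, key, value);
    else if (key > root->key) insert(root->right, key, value);
    else                      root->value = value;
    root->subtree_sum = root->value + sum_of(root->left.get()) + sum_of(root->right.get());
}

// Sum of values with key >= lo, skipping whole subtrees via their cached sums.
double sum_at_least(const Node* n, int lo) {
    if (!n) return 0.0;
    if (n->key < lo) return sum_at_least(n->right.get(), lo);
    return n->value + sum_of(n->right.get()) + sum_at_least(n->left.get(), lo);
}

// Sum of values with key <= hi.
double sum_at_most(const Node* n, int hi) {
    if (!n) return 0.0;
    if (n->key > hi) return sum_at_most(n->left.get(), hi);
    return n->value + sum_of(n->left.get()) + sum_at_most(n->right.get(), hi);
}

// Sum of values whose keys lie in [lo, hi].
double range_sum(const Node* n, int lo, int hi) {
    if (!n) return 0.0;
    if (hi < n->key) return range_sum(n->left.get(), lo, hi);
    if (lo > n->key) return range_sum(n->right.get(), lo, hi);
    return sum_at_least(n->left.get(), lo) + n->value + sum_at_most(n->right.get(), hi);
}
```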