ISBN: (Print) 9798400704161
The proceedings contain 54 papers. The topics discussed include: expediting hazard pointers with bounded RCU critical sections; Alock: asymmetric lock primitive for RDMA systems; when is parallelism fearless and zero-cost with Rust?; efficient parallel reinforcement learning framework using the reactor model; parallel best arm identification in heterogeneous environments; brief announcement: lock-free learned search data structure; brief announcement: LIT: lookup interlocked table for range queries; brief announcement: a fast scalable detectable unrolled lock-based linked list; scheduling out-trees online to optimize maximum flow; optimizing dynamic data center provisioning through speed scaling: a primal-dual perspective; scheduling jobs with work-inefficient parallel solutions; and multi bucket queues: efficient concurrent priority scheduling.
ISBN: (Print) 9781450395458
The proceedings contain 47 papers. The topics discussed include: Quancurrent: a concurrent quantiles sketch; an efficient scheduler for task-parallel interactive applications; efficient synchronization-light work stealing; balanced allocations in batches: the tower of two choices; massively parallel tree embeddings for high dimensional spaces; deterministic massively parallel symmetry breaking for sparse graphs; an associativity threshold phenomenon in set-associative caches; increment-and-freeze: every cache, everywhere, all of the time; multidimensional approximate agreement with asynchronous fallback; a tight characterization of fast failover routing: resiliency to two link failures is possible; releasing memory with optimistic access: a hybrid approach to memory reclamation and allocation in lock-free programs; transactional composition of nonblocking data structures; applying hazard pointers to more concurrent data structures; and nearly optimal parallel algorithms for longest increasing subsequence.
ISBN: (Print) 9781450391467
The proceedings contain 44 papers. The topics discussed include: deterministic distributed sparse and ultra-sparse spanners and connectivity certificates; fully polynomial-time distributed computation in low-treewidth graphs; adaptive massively parallel algorithms for cut problems; preparing for disaster: leveraging precomputation to efficiently repair graph structures upon failures; the energy complexity of Las Vegas leader election; a fully-distributed peer-to-peer protocol for Byzantine-resilient distributed hash tables; brief announcement: the (limited) power of multiple identities: asynchronous Byzantine reliable broadcast with improved resilience through collusion; brief announcement: composable dynamic secure emulation; and robust and optimal contention resolution without collision detection.
ISBN: (Print) 9781450391467
In recent years, the ever-increasing impact of memory access bottlenecks has brought forth renewed interest in near-memory processing (NMP) architectures. In this work, we propose and empirically evaluate hybrid data structures, which are concurrent data structures custom-designed for these new NMP architectures. We focus on cache-optimized data structures, such as skiplists and B+ trees, that are often used as index structures in online transaction processing (OLTP) systems to enable fast key-based lookups. These data structures are hierarchical: lookups begin at a small number of top-level nodes and diverge to many different node paths as they move down the hierarchy, so nodes in higher levels benefit more from caching. Our proposed hybrid data structures split traditional hierarchical data structures into a host-managed portion consisting of the higher-level nodes and an NMP-managed portion consisting of the remaining lower-level nodes, thus retaining and further enhancing the cache-conscious optimizations of their conventional implementations. Although the idea might seem relatively simple, splitting the data structure introduces new synchronization problems, and careful implementation is required to ensure high concurrency and correctness. We provide implementations of a hybrid skiplist and a hybrid B+ tree, and we empirically evaluate them on a cycle-accurate full-system architecture simulator. Our results show that hybrid data structures have the potential to improve performance by more than 2x compared to state-of-the-art concurrent data structures.
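The host/NMP split described above can be illustrated with a minimal, single-threaded sketch (hypothetical class and split point, not the paper's implementation): the sparse upper index stays host-managed and cache-resident, while the dense bottom level plays the role of the NMP-managed portion.

```python
import bisect

class HybridSkiplist:
    """Toy sketch of a hybrid index: sparse upper levels are host-managed
    and cache-friendly; the dense bottom level is NMP-managed (simulated)."""

    def __init__(self, keys, fanout=4):
        self.fanout = fanout
        # Bottom level holds every key; in hardware it would live near memory.
        self.nmp_level = sorted(keys)
        # Host-managed sparse index: every `fanout`-th key of the bottom level.
        self.host_index = self.nmp_level[::fanout]

    def lookup(self, key):
        # Host step: binary-search the small cached index for a start hint.
        i = bisect.bisect_right(self.host_index, key) - 1
        start = max(i, 0) * self.fanout
        # NMP step: a short scan of the dense level; in a real NMP system this
        # runs near memory, and only the hint crosses the host/NMP boundary.
        for k in self.nmp_level[start:]:
            if k == key:
                return True
            if k > key:
                return False
        return False
```

Splitting lookups this way keeps the hot upper levels in the host cache; the real difficulty the abstract points to, synchronizing concurrent updates that span both portions, is deliberately omitted from this single-threaded sketch.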
ISBN: (Print) 9781665458382
The proceedings contain 26 papers. The topics discussed include: automated performance prediction of microservice applications using simulation; exact and efficient protective jamming in SINR-based wireless networks; deep learning models for automated identification of scheduling policies; energy-efficiency comparison of common sorting algorithms; scaling up the performance of distributed key-value stores with in-switch coordination; simulation modeling of urban e-scooter mobility; performance characterization of MPI Allreduce in cloud data center networks; enabling extremely fine-grained parallelism via scalable concurrent queues on modern many-core architectures; and mechanisms for transition from monolithic to distributed architecture in software development process.
ISBN: (Print) 9781665458382
Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900x improvement compared to traditional synchronization mechanisms) across our diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300x) better throughput than existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4x to 6x speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116x in some cases.
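The relaxed-ordering idea can be illustrated with a toy multi-lane queue (a hypothetical sketch, not XQueue's actual design): each producer owns a private lane, so enqueues never contend on a shared tail, and consumers round-robin across lanes, preserving FIFO order only within each lane.

```python
from collections import deque
from itertools import cycle

class RelaxedMultiQueue:
    """Toy relaxed-ordering queue: one lane per producer removes
    enqueue-side contention; dequeues round-robin over lanes, so
    ordering is FIFO per lane but only approximate globally."""

    def __init__(self, num_lanes):
        self.lanes = [deque() for _ in range(num_lanes)]
        self._cursor = cycle(range(num_lanes))

    def enqueue(self, lane_id, item):
        # Each producer writes only to its own lane (no shared tail pointer).
        self.lanes[lane_id].append(item)

    def dequeue(self):
        # Visit each lane at most once, starting at the round-robin cursor.
        for _ in range(len(self.lanes)):
            lane = self.lanes[next(self._cursor)]
            if lane:
                return lane.popleft()
        return None  # all lanes empty
```

A real lock-less implementation would replace the Python deques with atomic ring buffers; the point here is only the ordering relaxation that makes per-lane scaling possible.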
Uncertain or probabilistic graphs have been ubiquitously used in many emerging applications. Previously, CPU-based techniques were proposed to use sampling but suffer from (1) low computation efficiency and large memor...
ISBN: (Print) 9783959772044
Partitioning a graph into blocks of "roughly equal" weight while cutting only few edges is a fundamental problem in computer science with a wide range of applications. In particular, the problem is a building block in applications that require parallel processing. While the number of available cores in parallel architectures has significantly increased in recent years, state-of-the-art graph partitioning algorithms do not work well if the input needs to be partitioned into a large number of blocks. Currently available algorithms often compute highly imbalanced solutions, solutions of low quality, or have excessive running time for this case. This is because most high-quality general-purpose graph partitioners are multilevel algorithms, which perform graph coarsening to build a hierarchy of graphs, initial partitioning to compute an initial solution, and local improvement to improve the solution throughout the hierarchy. However, for a large number of blocks, the smallest graph in the hierarchy that is used for initial partitioning still has to be large. In this work, we substantially mitigate these problems by introducing deep multilevel graph partitioning and a shared-memory implementation thereof. Our scheme continues the multilevel approach deep into initial partitioning, integrating it into a framework where recursive bipartitioning and direct k-way partitioning are combined such that they can operate with high performance and quality. Our integrated approach is stronger, more flexible, arguably more elegant, and reduces bottlenecks for parallelization compared to existing multilevel approaches. For example, for a large number of blocks our algorithm is on average at least an order of magnitude faster than competing algorithms while computing partitions with comparable solution quality. At the same time, our algorithm consistently produces balanced solutions. Moreover, for a small number of blocks, our algorithms are the fastest among competing systems wit...
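The recursive-bipartitioning half of the scheme can be sketched as follows (a simplified illustration, not the paper's algorithm: the placeholder bisection ignores edge cuts and balances only vertex weight):

```python
def bipartition(vertices, left_target):
    """Placeholder bisection: greedily fill the left block up to its
    weight target. A real partitioner would also minimize the edge cut."""
    left, right, acc = [], [], 0
    for v, w in vertices:
        if acc + w <= left_target:
            left.append((v, w))
            acc += w
        else:
            right.append((v, w))
    return left, right

def recursive_partition(vertices, k):
    """Partition weighted vertices into k blocks by recursive
    bipartitioning: split k into k//2 and k - k//2 and divide the
    total weight proportionally at each level of the recursion."""
    if k == 1:
        return [[v for v, _ in vertices]]
    k_left = k // 2
    total = sum(w for _, w in vertices)
    left, right = bipartition(vertices, total * k_left / k)
    return recursive_partition(left, k_left) + recursive_partition(right, k - k_left)
```

Deep multilevel partitioning, as described above, interleaves this recursion with the coarsening hierarchy rather than running initial partitioning on a single small graph.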
ISBN: (Print) 9781939133175
Deep Neural Networks (DNNs) are fast becoming ubiquitous for their ability to attain good accuracy in various machine learning tasks. A DNN's architecture (i.e., its hyperparameters) broadly determines the DNN's accuracy and performance, and is often confidential. Attacking a DNN in the cloud to obtain its architecture can potentially provide major commercial value. Further, attaining a DNN's architecture facilitates other existing DNN attacks. This paper presents Cache Telepathy: an efficient mechanism to help obtain a DNN's architecture using the cache side channel. The attack is based on the insight that DNN inference relies heavily on tiled GEMM (Generalized Matrix Multiply), and that DNN architecture parameters determine the number of GEMM calls and the dimensions of the matrices used in the GEMM functions. Such information can be leaked through the cache side channel. This paper uses Prime+Probe and Flush+Reload to attack the VGG and ResNet DNNs running OpenBLAS and Intel MKL libraries. Our attack is effective in helping obtain the DNN architectures by very substantially reducing the search space of target DNN architectures. For example, when attacking the OpenBLAS library, for the different layers in VGG-16, it reduces the search space from more than 5.4 × 10^12 architectures to just 16; for the different modules in ResNet-50, it reduces the search space from more than 6 × 10^46 architectures to only 512.
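The dimension-to-hyperparameter mapping at the heart of the attack can be shown for the simplest case, a fully connected layer (hypothetical helpers, not the paper's code): the layer is computed as one GEMM C[m×n] = A[m×k] · B[k×n], with A the activations and B the weights, so recovering (m, k, n) from the side channel directly reveals the layer's widths.

```python
def fc_layer_from_gemm(m, k, n):
    """Invert the GEMM mapping for a fully connected layer:
    A is (batch x in_features), B is (in_features x out_features),
    so the observed dimensions reveal the layer hyperparameters."""
    return {"batch_size": m, "in_features": k, "out_features": n}

def count_gemms(layer_widths):
    """A chain of fully connected layers issues one GEMM per layer,
    so the number of observed GEMM calls reveals the network depth."""
    return len(layer_widths) - 1  # widths include the input layer
```

Convolutional layers leak similarly once lowered to GEMM (e.g., via im2col), which is why the observed tiled-GEMM call counts and dimensions constrain the architecture search space so sharply.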
ISBN: (Print) 9781450357999
The proceedings contain 47 papers. The topics discussed include: parallel minimum cuts in near-linear work and low depth; trees for vertex cuts, hypergraph cuts and minimum hypergraph bisection; dynamic representations of sparse distributed networks: a locality-sensitive approach; constant-depth and subcubic-size threshold circuits for matrix multiplication; integrated model, batch, and domain parallelism in training neural networks; brief announcement: on approximating PageRank locally with sublinear query complexity; brief announcement: coloring-based task mapping for dragonfly systems; brief announcement: parallel transitive closure within 3D crosspoint memory; and lock-free contention adapting search trees.