ISBN:
(Print) 9798400704161
The proceedings contain 54 papers. The topics discussed include: expediting hazard pointers with bounded RCU critical sections; Alock: asymmetric lock primitive for RDMA systems; when is parallelism fearless and zero-cost with Rust?; efficient parallel reinforcement learning framework using the reactor model; parallel best arm identification in heterogeneous environments; brief announcement: lock-free learned search data structure; brief announcement: LIT: lookup interlocked table for range queries; brief announcement: a fast scalable detectable unrolled lock-based linked list; scheduling out-trees online to optimize maximum flow; optimizing dynamic data center provisioning through speed scaling: a primal-dual perspective; scheduling jobs with work-inefficient parallel solutions; and multi bucket queues: efficient concurrent priority scheduling.
ISBN:
(Print) 9781450395458
The proceedings contain 47 papers. The topics discussed include: Quancurrent: a concurrent quantiles sketch; an efficient scheduler for task-parallel interactive applications; efficient synchronization-light work stealing; balanced allocations in batches: the tower of two choices; massively parallel tree embeddings for high dimensional spaces; deterministic massively parallel symmetry breaking for sparse graphs; an associativity threshold phenomenon in set-associative caches; increment-and-freeze: every cache, everywhere, all of the time; multidimensional approximate agreement with asynchronous fallback; a tight characterization of fast failover routing: resiliency to two link failures is possible; releasing memory with optimistic access: a hybrid approach to memory reclamation and allocation in lock-free programs; transactional composition of nonblocking data structures; applying hazard pointers to more concurrent data structures; and nearly optimal parallel algorithms for longest increasing subsequence.
ISBN:
(Print) 9781450391467
The proceedings contain 44 papers. The topics discussed include: deterministic distributed sparse and ultra-sparse spanners and connectivity certificates; fully polynomial-time distributed computation in low-treewidth graphs; adaptive massively parallel algorithms for cut problems; preparing for disaster: leveraging precomputation to efficiently repair graph structures upon failures; the energy complexity of Las Vegas leader election; a fully-distributed peer-to-peer protocol for Byzantine-resilient distributed hash tables; brief announcement: the (limited) power of multiple identities: asynchronous Byzantine reliable broadcast with improved resilience through collusion; brief announcement: composable dynamic secure emulation; and robust and optimal contention resolution without collision detection.
Vector search has drawn rapidly increasing interest in the research community due to its use in novel AI applications. Maximizing its performance is essential for many tasks but remains only preliminarily understood...
ISBN:
(Print) 9781450391467
In recent years, the ever-increasing impact of memory access bottlenecks has brought forth a renewed interest in near-memory processing (NMP) architectures. In this work, we propose and empirically evaluate hybrid data structures, which are concurrent data structures custom-designed for these new NMP architectures. We focus on cache-optimized data structures, such as skiplists and B+ trees, that are often used as index structures in online transaction processing (OLTP) systems to enable fast key-based lookups. These data structures are hierarchical: lookups begin at a small number of top-level nodes and diverge to many different node paths as they move down the hierarchy, so nodes in higher levels benefit more from caching. Our proposed hybrid data structures split traditional hierarchical data structures into a host-managed portion consisting of higher-level nodes and an NMP-managed portion consisting of the remaining lower-level nodes, thus retaining and further enhancing the cache-conscious optimizations of their conventional implementations. Although the idea might seem relatively simple, splitting the data structure prompts new synchronization problems, and careful implementation is required to ensure high concurrency and correctness. We provide implementations of a hybrid skiplist and a hybrid B+ tree, and we empirically evaluate them on a cycle-accurate full-system architecture simulator. Our results show that the hybrid data structures have the potential to improve performance by more than 2× compared to state-of-the-art concurrent data structures.
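The host/NMP split the abstract describes is easy to picture in miniature. Below is a minimal, single-threaded sketch of a hybrid skiplist lookup: the host walks the cache-friendly upper levels, then hands the descent off to the portion that the paper would place near memory. The names HOST_LEVELS and NMPPortion are illustrative inventions, the NMP side is only simulated in host memory, and the paper's synchronization machinery, which is the hard part, is omitted entirely.

```python
import random

HOST_LEVELS = 2  # top levels kept host-side; levels below this live "near memory"

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] = successor at level i

class NMPPortion:
    """Stand-in for the lower levels managed by the near-memory unit."""
    def lookup(self, start, key):
        # Continue the descent where the host left off, over levels < HOST_LEVELS.
        node = start
        for level in range(HOST_LEVELS - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key <= key:
                node = node.next[level]
        return node.key == key

class HybridSkiplist:
    def __init__(self, max_height=8):
        self.head = Node(float("-inf"), max_height)
        self.max_height = max_height
        self.nmp = NMPPortion()

    def insert(self, key):
        # Classic skiplist insert with a geometric height distribution.
        height = 1
        while height < self.max_height and random.random() < 0.5:
            height += 1
        new = Node(key, height)
        node = self.head
        for level in range(self.max_height - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            if level < height:  # splice into every level the new node spans
                new.next[level] = node.next[level]
                node.next[level] = new

    def contains(self, key):
        # Host traverses the cache-friendly upper levels...
        node = self.head
        for level in range(self.max_height - 1, HOST_LEVELS - 1, -1):
            while node.next[level] is not None and node.next[level].key <= key:
                node = node.next[level]
        # ...then hands the rest of the descent to the NMP-managed portion.
        return self.nmp.lookup(node, key)

sl = HybridSkiplist()
for k in (3, 1, 4, 15, 9, 2, 6):
    sl.insert(k)
assert sl.contains(9) and not sl.contains(7)
```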
ISBN:
(Print) 9781665405843
The proceedings contain 28 papers. The topics discussed include: a compiler framework for optimizing dynamic parallelism on GPUs; a compiler for sound floating-point computations using affine arithmetic; aggregate update problem for multi-clocked dataflow languages; Palmed: throughput characterization for superscalar architectures; automatic generation of debug headers through Blackbox equivalence checking; gadgets splicing: dynamic binary transformation for precise rewriting; lambda the ultimate SSA: optimizing functional programs in SSA; and HECATE: performance-aware scale optimization for homomorphic encryption compiler.
ISBN:
(Print) 9781665420273
To efficiently deploy state-of-the-art deep neural network (DNN) workloads with growing computational intensity and structural complexity, scalable DNN accelerators featuring multiple tensor engines and distributed on-chip buffers have been proposed in recent years. Such spatial architectures significantly expand the scheduling space in terms of parallelism and data-reuse potential, which demands delicate workload orchestration. Previous work on the DNN hardware-mapping problem mainly focuses on operator-level loop transformation for a single array, which is insufficient for this new challenge. Resource-partitioning methods for multiple engines, such as CNN-partition and inter-layer pipelining, have been studied; however, their intrinsic disadvantages of workload imbalance and pipeline delay still prevent scalable accelerators from realizing their full potential. In this paper, we propose atomic dataflow, a novel graph-level scheduling and mapping approach developed for DNN inference. Instead of partitioning hardware resources into fixed regions and binding each DNN layer to a certain region sequentially, atomic dataflow schedules the DNN computation graph at a workload-specific granularity (atoms) to ensure PE-array utilization, supports flexible atom ordering to exploit parallelism, and orchestrates atom-engine mapping to optimize data reuse between spatially connected tensor engines. First, we propose a simulated-annealing-based atomic tensor generation algorithm to minimize load imbalance. Second, we develop a dynamic-programming-based atomic DAG scheduling algorithm to systematically explore the massive ordering space. Finally, to improve data locality and reduce expensive off-chip memory accesses, we present mapping and buffering strategies that efficiently utilize distributed on-chip storage. With an automated optimization framework established, experimental results show significant improvements over baseline approaches in terms of performance, har…
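To give a feel for the simulated-annealing step, here is a toy sketch under stated assumptions: atoms are modeled as fixed-cost tiles to be assigned across tensor engines, and load imbalance is measured as the maximum engine load. The abstract does not specify the paper's actual objective or move set, so anneal_atoms, tile_costs, and num_engines are hypothetical names for illustration only.

```python
import math
import random

def anneal_atoms(tile_costs, num_engines, steps=20000, t0=1.0, cooling=0.9995):
    """Assign tiles to engines via simulated annealing, minimizing max load."""
    assign = [random.randrange(num_engines) for _ in tile_costs]
    loads = [0.0] * num_engines
    for c, e in zip(tile_costs, assign):
        loads[e] += c

    best, best_assign = max(loads), assign[:]
    temp = t0
    for _ in range(steps):
        i = random.randrange(len(tile_costs))
        src, dst = assign[i], random.randrange(num_engines)
        if src == dst:
            continue
        old = max(loads)
        loads[src] -= tile_costs[i]  # tentatively move tile i to engine dst
        loads[dst] += tile_costs[i]
        assign[i] = dst
        new = max(loads)
        # Metropolis rule: always keep improvements; keep regressions with
        # probability exp(-delta / temp), which shrinks as the system cools.
        if new > old and random.random() >= math.exp((old - new) / temp):
            loads[dst] -= tile_costs[i]  # revert the move
            loads[src] += tile_costs[i]
            assign[i] = src
        elif new < best:
            best, best_assign = new, assign[:]
        temp *= cooling
    return best_assign, best

# Usage: balance 64 random-cost tiles across 4 engines.
costs = [random.uniform(1, 10) for _ in range(64)]
assignment, makespan = anneal_atoms(costs, num_engines=4)
print(f"max engine load after annealing: {makespan:.2f}")
```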
ISBN:
(Print) 9781665420273
Graph convolutional networks (GCNs) are promising for enabling machine learning on graphs. GCNs exhibit mixed computational kernels, involving regular neural-network-like computing and irregular graph-analytics-like processing. Existing GCN accelerators follow a divide-and-conquer philosophy, architecting two separate types of hardware to accelerate these two types of GCN kernels, respectively. This hybrid architecture improves intra-kernel efficiency but gives little holistic consideration to inter-kernel interactions for improving overall efficiency. In this paper, we present a new GCN accelerator, REFLIP, with three key innovations in terms of architecture design, algorithm mapping, and practical implementation. First, REFLIP leverages PIM-featured crossbar architectures to build a unified architecture for supporting the two types of GCN kernels simultaneously. Second, REFLIP adopts novel algorithm mappings that maximize the potential performance gains reaped from the unified architecture by exploiting the massive crossbar-structured parallelism. Third, REFLIP assembles software/hardware co-optimizations to process real-world graphs efficiently. Compared to state-of-the-art software frameworks running on an Intel Xeon E5-2680v4 CPU and an NVIDIA Tesla V100 GPU, REFLIP achieves average speedups of 6,432× and 86.32× and average energy savings of 9,817× and 302.44×, respectively. In addition, REFLIP outperforms a state-of-the-art GCN hardware accelerator, AWB-GCN, achieving an average speedup of 5.06× and an average energy saving of 15.63×.
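The two kernel types the abstract contrasts are visible in the standard GCN layer equation H' = ReLU(Â·H·W): the combination H·W is a dense, neural-network-like product, while the aggregation Â·(H·W) multiplies by a sparse, irregular adjacency matrix. The sketch below is not REFLIP's mapping, just a minimal NumPy rendering of the two kernels it unifies; since both reduce to matrix-vector products, a single crossbar fabric (the native PIM operation) can in principle serve both. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f_in, f_out = 6, 4, 3

A = (rng.random((n, n)) < 0.3).astype(float)  # random sparse adjacency
A_hat = A + np.eye(n)                         # add self-loops
A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)  # row-normalize

H = rng.random((n, f_in))   # node feature matrix
W = rng.random((f_in, f_out))  # layer weights

combination = H @ W              # regular, NN-like kernel (dense GEMM)
aggregation = A_hat @ combination  # irregular, graph-analytics-like kernel
H_next = np.maximum(aggregation, 0.0)  # ReLU
print(H_next.shape)  # (6, 3)
```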
ISBN:
(Print) 9781665420273
Contemporary Vector Processors (VPs) are designed either for short vector lengths, e.g., the Fujitsu A64FX with 512-bit ARM SVE vector support, or for long vectors, e.g., the NEC Aurora Tsubasa with a 16-Kbit Maximum Vector Length (MVL). Unfortunately, both approaches have drawbacks. On the one hand, short-vector VP designs struggle to provide high efficiency for applications featuring long vectors with high Data-Level Parallelism (DLP). On the other hand, long-vector VP designs waste resources and underutilize the Vector Register File (VRF) when executing low-DLP applications with short vector lengths. Therefore, those long-vector VP implementations are limited to a specialized subset of applications, where relatively high DLP must be present to achieve excellent performance with high efficiency. Modern scientific applications are growing more diverse, and the vector lengths in those applications vary widely. To overcome these limitations, we propose an Adaptable Vector Architecture (AVA) that achieves the best of both worlds. AVA is designed for short vectors (MVL = 16 elements) and is thus area- and energy-efficient. However, AVA can reconfigure the MVL, thereby exploiting the benefits of a longer-vector microarchitecture of up to 128 elements when abundant DLP is present. We model AVA on the gem5 simulator and evaluate its performance with six applications taken from the RiVEC Benchmark Suite. To obtain area and power consumption metrics, we model AVA on McPAT for 22-nm technology. Our results show that by reconfiguring our small VRF (8 KB) together with our novel issue-queue scheme, AVA yields a 2× speedup over the default configuration for short vectors. Additionally, AVA shows competitive performance compared to a long-vector VP while saving 50% of the area.
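One way to see the resource trade-off behind MVL reconfiguration: with a fixed-size VRF, lengthening the vectors means grouping physical registers into longer logical ones, so fewer logical registers remain (reminiscent of RISC-V V's LMUL register grouping). The toy model below uses the paper's 8 KB VRF and 16-to-128-element MVL range, but the 64-bit element width and the exact grouping scheme are assumptions, not AVA's documented design.

```python
VRF_BYTES = 8 * 1024  # 8 KB VRF, as evaluated in the paper
ELEM_BYTES = 8        # assumption: 64-bit elements (width not stated here)
BASE_MVL = 16         # default short-vector configuration
NUM_LOGICAL_REGS = VRF_BYTES // (BASE_MVL * ELEM_BYTES)  # 64 regs at MVL=16

def reconfigure(mvl):
    """Return the logical vector-register count for a requested MVL."""
    if mvl % BASE_MVL != 0 or not BASE_MVL <= mvl <= 128:
        raise ValueError("MVL must be a multiple of 16 in [16, 128]")
    group = mvl // BASE_MVL  # physical registers fused per logical register
    return NUM_LOGICAL_REGS // group

for mvl in (16, 32, 64, 128):
    print(f"MVL={mvl:4d} -> {reconfigure(mvl):2d} logical vector registers")
```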
Uncertain or probabilistic graphs are ubiquitous in many emerging applications. Previously, CPU-based techniques were proposed that use sampling, but they suffer from (1) low computation efficiency and large memor...