ISBN (print): 9798400706318
The proceedings contain 27 papers. The topics discussed include: SZKP: a scalable accelerator architecture for zero-knowledge proofs; recompiling QAOA circuits on various rotational directions; MIREncoder: multi-modal IR-based pretrained embeddings for performance optimizations; a parallel hash table for streaming applications; PipeGen: automated transformation of a single-core pipeline into a multicore pipeline for a given memory consistency model; NavCim: comprehensive design space exploration for analog computing-in-memory architectures; optimizing tensor computation graphs with equality saturation and Monte Carlo tree search; Chimera: leveraging hybrid offsets for efficient data prefetching; Toast: a heterogeneous memory management system; and a transducers-based programming framework for efficient data transformation.
ISBN (print): 9781450398688
The proceedings contain 50 papers. The topics discussed include: Com-CAS: effective cache apportioning under compiler guidance; transfer-tuning: reusing auto-schedules for efficient tensor program code generation; slice-and-forge: making better use of caches for graph convolutional network accelerators; GNNear: accelerating full-batch training of graph neural networks with near-memory processing; T-GCN: a sampling based streaming graph neural network system with hybrid architecture; optimizing aggregate computation of graph neural networks with on-GPU interpreter-style programming; Pavise: integrating fault tolerance support for persistent memory applications; efficient atomic durability on eADR-enabled persistent memory; custom high-performance vector code generation for data-specific sparse computations; and decoupling scheduler, topology layout, and algorithm to easily enlarge the tuning space of GPU graph processing.
ISBN (print): 9781665442787
The proceedings contain 25 papers. The topics discussed include: a flexible approach to autotuning multi-pass machine learning compilers; program lifting using gray-box behavior; HERTI: a reinforcement learning-augmented system for efficient real-time inference on heterogeneous embedded systems; X-Layer: building composable pipelined dataflows for low-rank convolutions; precision batching: bitserial decomposition for efficient neural network inference on GPUs; Google neural network models for edge devices: analyzing and mitigating machine learning inference bottlenecks; and ultra efficient acceleration for de novo genome assembly via near-memory computing.
ISBN (print): 9781450380751
The proceedings contain 46 papers. The topics discussed include: TAFE: thread address footprint estimation for capturing data/thread locality in GPU systems; SparseRT: accelerating unstructured sparsity on GPUs for deep learning inference; GOPipe: a granularity-oblivious programming framework for pipelined stencil executions on GPU; exploring the design space of static and incremental graph connectivity algorithms on GPUs; Fireiron: a data-movement-aware scheduling language for GPUs; automatic generation of multi-objective polyhedral compiler transformations; bandwidth-aware loop tiling for DMA-supported scratchpad memory; deep program structure modeling through multi-relational graph-based learning; intelligent data placement on discrete GPU nodes with unified memory; deep learning assisted resource partitioning for improving performance on commodity servers; and decoupled address translation for heterogeneous memory systems.
ISBN (print): 9798350342543
The proceedings contain 33 papers. The topics discussed include: MBAPIS: multi-level behavior analysis guided program interval selection for microarchitecture studies; automatic code generation for high-performance graph algorithms; SimplePIM: a software framework for productive and efficient processing-in-memory; Drishyam: an image is worth a data prefetcher; architecture-aware currying; PreFlush: lightweight hardware prediction mechanism for cache line flush and writeback; retargeting applications for heterogeneous systems with the Tribble source-to-source framework; dynamic allocation of processor cores to graph applications on commodity servers; parallelizing maximal clique enumeration on GPUs; and HugeGPT: storing guest page tables on host huge pages to accelerate address translation.
ISBN (print): 9781450359863
The proceedings contain 36 papers. The topics discussed include: massively parallel skyline computation for processing-in-memory architectures; data motifs: a lens towards fully understanding big data and AI workloads; performance extraction and suitability analysis of multi- and many-core architectures for next generation sequencing secondary analysis; synergistic cache layout for reuse and compression; an efficient graph accelerator with parallel data conflict management; revealing parallel scans and reductions in recurrences through function reconstruction; compiler assisted coalescing; stencil codes on a vector length agnostic architecture; maximizing system utilization via parallelism management for co-located parallel applications; a portable, automatic data quantizer for deep neural networks; Mage: online interference-aware scheduling in multi-scale heterogeneous systems; and towards concurrency race debugging: an integrated approach of constraint solving and dynamic slicing.
ISBN (digital): 9798400706318
ISBN (print): 9798400706318
Data movement limits program performance. This bottleneck is more significant in multi-threaded programs but also more difficult to analyze, especially across multiple thread counts. For regular loop nests parallelized with OpenMP, this paper presents a new technique that predicts their miss ratio in the shared cache. It uses two statistical models, one for cache sharing and one for data sharing. Both models take the number of threads as a symbolic parameter, making it trivial to compute the miss ratio for any additional thread count after the initial analysis. The technique is implemented in a tool called PLUSS. When tested on 73 parallel loops drawn from scientific kernels, image processing, and machine learning, PLUSS produces results that closely match profiling while reducing the analysis cost by up to two orders of magnitude.
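To make the "symbolic thread count" idea concrete, the following is a minimal sketch of a reuse-distance-based miss-ratio predictor in which the thread count stays a plain parameter. It assumes uniform interleaving of threads (each private reuse distance dilates by the thread count in the shared cache); this is a generic textbook-style model, not the PLUSS models themselves, and all names here are hypothetical.

```python
# Hypothetical sketch: predict a shared-cache miss ratio for T threads
# from a single-thread reuse-distance histogram, assuming uniform
# interleaving. This illustrates the general idea of a symbolic
# thread-count model only; it is NOT the PLUSS technique.

def predict_miss_ratio(reuse_hist, cache_lines, num_threads):
    """reuse_hist maps reuse distance (in cache lines) to access count;
    float('inf') marks cold (first-touch) accesses."""
    total = sum(reuse_hist.values())
    misses = 0
    for dist, count in reuse_hist.items():
        # Under uniform interleaving, T threads stretch each private
        # reuse distance by roughly a factor of T in the shared cache.
        if dist * num_threads >= cache_lines:
            misses += count
    return misses / total

# Because num_threads is an ordinary parameter, sweeping additional
# thread counts after one profiling pass costs almost nothing.
hist = {1: 50, 8: 30, 64: 15, float('inf'): 5}
for t in (1, 2, 4, 8):
    print(t, predict_miss_ratio(hist, 256, t))
```

In this toy model, the histogram is collected once, and every subsequent thread count is evaluated in constant time per histogram bucket.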
ISBN (digital): 9798400706318
ISBN (print): 9798400706318
Hash tables are important data structures for a wide range of data-intensive applications across many domains. They offer compact storage for sparse data, but their performance struggles to scale with rapidly growing data volumes because they typically offer only a single access port. Building a hash table with multiple parallel ports either incurs excessive memory cost, i.e., requiring redundant copies of its contents, or degrades to the worst-case performance of a single-port memory due to bank conflicts. This work introduces a new multi-port hash table design, called Multi Hash Table, that offers conflict-free parallelism without content replication. Multi Hash Table avoids conflicts among its parallel banks by (i) supporting different dynamic mappings from hash-table addresses to bank indices, and (ii) caching (and aggregating) accesses to frequently used entries. Applied to reconfigurable single-sliding-window stream aggregation, the Multi Hash Table increases processing throughput by 7.5x.
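The two conflict-avoidance ideas above, a remappable address-to-bank function and a small front cache for hot entries, can be sketched in software as follows. This is a loose, software-level illustration under stated assumptions (hash-based banking, a fixed-capacity cache); the class, method names, and parameters are all hypothetical and not the paper's design or API.

```python
# Hypothetical sketch of a banked hash table that serves parallel
# lookups without replicating contents. Two ideas loosely modeled:
# (i) a remappable address-to-bank mapping ("salt"), and
# (ii) a small front cache that absorbs repeated hot-key accesses.
# This is an illustration, NOT the Multi Hash Table implementation.

class BankedHashTable:
    def __init__(self, num_banks=4, cache_capacity=8):
        self.num_banks = num_banks
        self.banks = [dict() for _ in range(num_banks)]
        self.salt = 0                  # dynamic remapping parameter
        self.cache = {}                # tiny cache for hot entries
        self.cache_capacity = cache_capacity

    def _bank(self, key):
        # Changing `salt` redistributes keys across banks, which lets
        # a controller pick a mapping with fewer bank conflicts.
        return (hash(key) ^ self.salt) % self.num_banks

    def put(self, key, value):
        self.banks[self._bank(key)][key] = value
        self.cache.pop(key, None)      # keep the cache coherent

    def get(self, key):
        if key in self.cache:          # hot entry: no bank access
            return self.cache[key]
        value = self.banks[self._bank(key)].get(key)
        if value is not None and len(self.cache) < self.cache_capacity:
            self.cache[key] = value
        return value

    def batch_get(self, keys):
        # Keys hitting the cache or distinct banks could be served in
        # parallel; keys sharing a bank would conflict. Returns the
        # values plus the number of same-bank conflicts in this batch.
        by_bank = {}
        for k in keys:
            if k not in self.cache:
                by_bank.setdefault(self._bank(k), []).append(k)
        conflicts = sum(len(v) - 1 for v in by_bank.values())
        return [self.get(k) for k in keys], conflicts
```

In a sliding-window aggregation setting, a batch of window entries that lands on distinct banks (or in the cache) could be read in a single cycle, which is the source of the parallelism the design exploits.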