ISBN: (Print) 9781467395243
The proceedings contain 50 papers. The topics discussed include: performance improvement via always-abort HTM; DRUT: an efficient turbo boost solution via load balancing in decoupled look-ahead architecture; proxy benchmarks for emerging big-data workloads; lightweight provenance service for high-performance computing; application-driven near-data processing for similarity search; improving datacenter efficiency through partitioning-aware scheduling; the liberation day of nondeterministic programs; location-aware computation mapping for manycore processors; DaQueue: a data-aware work-queue design for GPGPUs; accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls; design space exploration for performance optimization of deep neural networks on shared memory accelerators; cutting the fat: speeding up RBM for fast deep learning through generalized redundancy elimination; statement reordering to alleviate register pressure for stencils on GPUs; NUMA-aware power management for chip multiprocessors; elastic reconfiguration for heterogeneous NoCs with BiNoCHS; Leeway: addressing variability in dead-block prediction for last-level caches; application clustering policies to address system fairness with Intel's cache allocation technology; Graphie: large-scale asynchronous graph traversals on just a GPU; weak memory models: balancing definitional simplicity and implementation flexibility; exploiting asymmetric SIMD register configurations in ARM-to-x86 dynamic binary translation; and a generalized framework for automatic scripting language parallelization.
ISBN: (Print) 9798400708435
As deep neural networks (DNNs) become increasingly large and complicated, pruning techniques have been proposed to lower their memory footprint and make inference more efficient. The most critical kernel for executing pruned sparse DNNs on GPUs is Sparse-dense Matrix Multiplication (SpMM). Although recent tensor compilers can generate high-performance SpMM code, they often spend a long time iteratively searching candidate configurations, which slows down the cycle of exploring better DNN architectures or pruning algorithms. In this paper, we propose EC-SpMM to efficiently generate high-performance SpMM kernels for sparse DNN inference. Based on an analysis of the layout of nonzero elements, a characterization of the GPU architecture, and a rank-based cost model, EC-SpMM effectively reduces the search space and eliminates likely low-performance candidates. Experimental results show that EC-SpMM reduces compilation time by a factor of 35x, while the performance of the generated SpMM kernels is comparable to, or even better than, the state-of-the-art sparse tensor compiling solution.
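The abstract's core operation can be illustrated with a minimal reference sketch of the SpMM kernel being tuned: a sparse matrix in CSR format multiplied by a dense matrix. This is a plain-Python toy for clarity only; the kernels EC-SpMM generates are tiled, compiler-generated GPU code, and the matrix values here are illustrative.

```python
def spmm_csr(indptr, indices, data, B, n_cols):
    """Row-wise CSR SpMM: C = A @ B, where sparse A is (indptr, indices, data)."""
    n_rows = len(indptr) - 1
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Iterate only over the nonzeros of row i of A.
        for k in range(indptr[i], indptr[i + 1]):
            j, a = indices[k], data[k]
            for c in range(n_cols):
                C[i][c] += a * B[j][c]
    return C

# A = [[1, 0], [0, 2]] in CSR form; B is a 2x2 dense matrix.
indptr, indices, data = [0, 1, 2], [0, 1], [1.0, 2.0]
B = [[1.0, 2.0], [3.0, 4.0]]
print(spmm_csr(indptr, indices, data, B, 2))  # [[1.0, 2.0], [6.0, 8.0]]
```

The search space a tensor compiler explores for this kernel (tile shapes, loop orders, load vectorization) is what EC-SpMM prunes with its rank-based cost model.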
ISBN: (Print) 9783959772693
The proceedings contain 8 papers. The topics discussed include: ByteNite: a new business model for grid computing; challenges and opportunities in C/C++ source-to-source compilation; Rust-encoded stream ciphers on a RISC-V parallel ultra-low-power processor; an evaluation of the state-of-the-art software and hardware implementations of BIKE; MonTM: monitoring-based thermal management for mixed-criticality systems; dynamic power consumption of the full posit processing unit: analysis and experiments; and adjacent LSTM-based page scheduling for hybrid DRAM/NVM memory systems.
ISBN: (Print) 9798400710797
Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. PARENDI is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. PARENDI scales up to 5,888 cores on 4 Graphcore IPU sockets, allowing large RTL designs to run up to 4x faster than on the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation. The paper comprehensively analyzes these factors and details the strategies that PARENDI uses to optimize them.
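The partitioning problem the abstract mentions can be sketched in miniature: assign fine-grained simulation tasks, each with a per-cycle computation cost, to cores so that the most loaded core is as light as possible. This is a generic longest-processing-time-first greedy heuristic under illustrative task names and costs, not PARENDI's actual partitioner, which also models synchronization and communication.

```python
import heapq

def partition(task_costs, n_cores):
    """Greedy LPT assignment: place each task (heaviest first) on the
    currently least-loaded core. Returns (load, core_id, tasks) triples."""
    cores = [(0, i, []) for i in range(n_cores)]  # (load, core_id, task list)
    heapq.heapify(cores)
    for task, cost in sorted(task_costs.items(), key=lambda t: -t[1]):
        load, i, tasks = heapq.heappop(cores)     # least-loaded core
        tasks.append(task)
        heapq.heappush(cores, (load + cost, i, tasks))
    return list(cores)

# Hypothetical per-cycle costs of RTL processes in a small design.
costs = {"alu": 8, "decode": 5, "fetch": 4, "lsu": 7, "regfile": 3}
for load, i, tasks in sorted(partition(costs, 2), key=lambda c: c[1]):
    print(f"core {i}: load={load} tasks={tasks}")
```

With thousands of IPU cores, the quality of this assignment directly bounds the per-cycle critical path, which is why the paper quantifies computation cost alongside communication and synchronization.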
Neutral atom arrays, particularly the reconfigurable field programmable qubit arrays (FPQA) with atom movement, show strong promise for quantum computing. FPQA has a dynamic qubit connectivity, facilitating cost-effec...
ISBN: (Print) 9798400718021
The proceedings contain 19 papers. The topics discussed include: structures and techniques for streaming dynamic graph processing on decentralized message-driven systems; interference-aware function inlining for code size reduction; the rewriting of DataRaceBench benchmark for OpenCL program validations; support post quantum cryptography with SIMD everywhere on RISC-V architectures; substitution of kernel functions based on pattern matching on schedule trees; fusing depthwise and pointwise convolutions for efficient inference on GPUs; design of a decentralized Web3 access interface; a distributed particle swarm optimization algorithm based on Apache Spark for asynchronous parallel training of deep neural networks; and graph federated learning with center moment constraints for node classification.
ISBN: (Print) 9798350380415; 9798350380408
Photonic computing, known for its high bandwidth and energy efficiency, harnesses physical phenomena in the optical domain to accelerate a wide range of computational operations such as dot products, matrix multiplication, Fourier transforms, 1D convolution, and more. However, this multitude of computational operations poses challenges in mapping realistic neural network workloads onto the underlying photonic hardware. This complexity requires extensive expertise and laborious programming, impeding the practical adoption and deployment of photonic acceleration. To address this gap, we propose an end-to-end compilation framework, the Photonic Compiler Collection (PCC), which automates the mapping of high-level deep neural network (DNN) specifications onto target architectures of photonic-electronic accelerators. Additionally, we present a method to streamline neural network workloads by leveraging the multi-level intermediate representation (MLIR) and compiler optimization techniques targeting photonic-specific patterns. Moreover, we conduct a comprehensive case study illustrating the integration of a typical computational operator, the Mach-Zehnder Interferometer (MZI) mesh, into PCC. Our experimental results demonstrate that PCC achieves up to a 4x speedup on DNN workloads compared with hand-crafted implementations. In summary, our proposed framework offers a practical and automated solution for compiling, optimizing, and flexibly supporting newer photonic device operators. We anticipate that it will significantly accelerate the development and deployment of photonic applications in real-world AI scenarios.
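The pattern-driven lowering described above can be illustrated with a toy rewrite pass: high-level DNN ops are matched against photonic-specific patterns and rewritten to accelerator primitives, while unmatched ops fall back to the electronic host. The op names, pattern table, and `mzi_mesh`/`cpu` dialect prefixes are all hypothetical stand-ins, not PCC's actual MLIR dialects.

```python
# Hypothetical pattern table: high-level op -> photonic primitive.
PHOTONIC_PATTERNS = {
    "matmul": "mzi_mesh.matmul",   # lowered to a Mach-Zehnder interferometer mesh
    "conv1d": "photonic.conv1d",
    "fft":    "photonic.fourier",
}

def lower(ops):
    """Rewrite each (op, args) pair; ops without a photonic pattern stay on the host."""
    return [(PHOTONIC_PATTERNS.get(op, f"cpu.{op}"), args) for op, args in ops]

# A tiny two-layer MLP graph: matmuls go to the MZI mesh, ReLU stays on the CPU.
graph = [("matmul", ("x", "w1")), ("relu", ("h",)), ("matmul", ("h", "w2"))]
print(lower(graph))
```

A real MLIR-based flow would express this as dialect conversion patterns with legality checks rather than a dictionary lookup, but the partitioning decision it makes is the same.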
Within CPSS, traditional blockchain technology encounters scalability and flexibility issues. With its multi-chain structure, parallel blockchain provides an innovative way to address these problems. This paper examin...
In this study, the efficiency of parallel and serial computation techniques for aggregating data from diverse regions is investigated. Parallel computation breaks data into smaller segments and assigns each segment to...
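The scheme described above can be sketched in a few lines: split the data into segments, reduce each segment independently on a worker, then combine the partial results. This is a generic illustration (thread pool, summation as the aggregate), not the study's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def segment_sum(segment):
    """Reduce one segment independently; in the study, one worker per region."""
    return sum(segment)

def parallel_aggregate(data, n_segments=4):
    """Split data into segments, aggregate each in parallel, combine the partials."""
    step = max(1, len(data) // n_segments)
    segments = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        return sum(pool.map(segment_sum, segments))

print(parallel_aggregate(list(range(100))))  # 4950, same result as serial sum
```

The serial baseline is simply `sum(data)`; the parallel version trades the overhead of splitting and combining for concurrent per-segment work.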