ISBN (print): 9798350393132; 9798350393149
IoT devices commonly use flash memory for both data and code storage. Flash memory consumes a significant portion of the overall energy of such devices. This is problematic because IoT devices are energy constrained due to their reliance on batteries or energy harvesting. To save energy, we leverage a unique property of flash memory: write operations take unequal amounts of energy depending on whether a bit is flipped from 1 to 0 or from 0 to 1. We exploit this asymmetry to reduce energy consumption with FLIPBIT, a hardware-software approximation approach that limits costly 0-to-1 transitions in flash. Instead of performing an exact write, we write an approximated value that avoids any costly 0-to-1 bit flips. Using FLIPBIT, we reduce the mean energy used by flash by 68% on video streaming applications while maintaining 42 dB PSNR. On machine learning models, we reduce energy by an average of 39% and up to 71% with only a 1% accuracy loss. Additionally, by reducing the number of program-erase cycles, we increase the flash lifetime by 68%.
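To make the write constraint concrete, the sketch below brute-forces, for a single 8-bit cell, the value closest to a requested write that needs no 0-to-1 flips relative to the currently stored value (i.e., any written value must be a bit-submask of the stored one). This is only an illustration of the constraint FLIPBIT exploits, not the paper's approximation algorithm.

```python
# Toy illustration of the write constraint FLIPBIT exploits (not the paper's
# algorithm): the value actually written may only clear bits (1 -> 0) relative
# to the stored value, so it must be a bit-submask of what the cell holds.
def approx_write(stored: int, requested: int, bits: int = 8) -> int:
    """Return the value closest to `requested` writable without 0->1 flips."""
    mask = (1 << bits) - 1
    candidates = [v for v in range(1 << bits) if v & ~stored & mask == 0]
    return min(candidates, key=lambda v: abs(v - requested))

# Example: the cell holds 0b0111 (7); an exact write of 8 would need a 0->1
# flip, so the nearest allowed approximation is 7.
print(approx_write(0b0111, 8))   # -> 7
```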
ISBN (print): 9798350393132; 9798350393149
Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very-wide (e.g., 16,384-262,144-bit-wide) data-parallel operations, in a single-instruction multiple-data (SIMD) fashion. However, DRAM rows' large and rigid granularity limits the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD parallelism (which is often smaller than the DRAM row granularity), PUD execution often leads to underutilization, throughput loss, and energy waste. Second, due to the high area cost of implementing interconnects that connect columns in a wide DRAM row, most PUD architectures are limited to the execution of parallel map operations, where a single operation is performed over equally sized input and output arrays. Third, the need to feed the wide DRAM row with tens of thousands of data elements, combined with the lack of adequate compiler support for PUD systems, creates a programmability barrier, since programmers need to manually extract SIMD parallelism from an application and map computation to the PUD hardware. Our goal is to design a flexible PUD system that overcomes the limitations caused by the large and rigid granularity of PUD. To this end, we propose MIMDRAM, a hardware/software co-designed PUD system that introduces new mechanisms to allocate and control only the necessary resources for a given PUD operation. The key idea of MIMDRAM is to leverage fine-grained DRAM (i.e., the ability to independently access smaller segments of a large DRAM row) for PUD computation. MIMDRAM exploits this key idea to enable a multiple-instruction multiple-data (MIMD) execution model in each DRAM subarray (and SIMD execution within each DRAM row segment). We evaluate MIMDRAM using twelve real-world applications and 495 multi-programmed application mixes. Our evaluation shows that MIMDRAM provides 34x the performance, 14.3x the energy efficiency, 1.7x the throughput...
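As a rough software analogy for MIMDRAM's allocation of only the necessary resources, the sketch below greedily assigns each data-parallel operation only as many independently accessible row segments ("mats") as its SIMD width requires, so several independent operations can occupy one subarray at once. The segment width, segment count, and greedy policy are illustrative assumptions, not the paper's hardware mechanism.

```python
# Toy allocator illustrating the MIMD-in-a-subarray idea: each operation gets
# only the DRAM row segments its SIMD width needs, leaving the rest for other
# operations. Constants and policy are assumptions for illustration only.
SEGMENT_BITS = 8 * 1024          # assumed independently accessible segment width
SEGMENTS_PER_SUBARRAY = 8        # assumed number of segments per DRAM row

def allocate(ops):
    """ops: list of (name, simd_bits). Returns {name: [segment indices]}."""
    free = list(range(SEGMENTS_PER_SUBARRAY))
    placement = {}
    for name, simd_bits in ops:
        needed = -(-simd_bits // SEGMENT_BITS)   # ceiling division
        if needed > len(free):
            raise RuntimeError(f"{name} does not fit in the remaining segments")
        placement[name] = [free.pop(0) for _ in range(needed)]
    return placement

# Three independent operations with different SIMD widths share one subarray.
print(allocate([("vec_add", 12_000), ("reduce", 6_000), ("vec_mul", 20_000)]))
```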
ISBN (print): 9781665476522
Serverless computing has emerged as a popular cloud computing paradigm. Serverless environments are convenient to users and efficient for cloud providers. However, they can induce substantial application execution overheads, especially in applications with many functions. In this paper, we propose to accelerate serverless applications with a novel approach based on software-supported speculative execution of functions. Our proposal is termed Speculative Function-as-a-Service (SpecFaaS). It is inspired by out-of-order execution in modern processors, and is grounded in a characterization analysis of FaaS applications. In SpecFaaS, functions in an application are executed early, speculatively, before their control and data dependences are resolved. Control dependences are predicted as in pipeline branch prediction, and data dependences are speculatively satisfied with memoization. With this support, the execution of downstream functions is overlapped with that of upstream functions, substantially reducing the end-to-end execution time of applications. We prototype SpecFaaS on Apache OpenWhisk, an open-source serverless computing platform. For a set of applications in a warmed-up environment, SpecFaaS attains an average speedup of 4.6x. Further, on average, the application throughput increases by 3.9x and the tail latency decreases by 58.7%.
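The core mechanism can be sketched in a few lines: a downstream function is launched early with a memoized prediction of its upstream input, and its speculative result is committed only if the prediction validates against the actual upstream output. The memo table, thread-pool execution, and validation-by-equality below are simplifying assumptions, not the OpenWhisk-based SpecFaaS prototype.

```python
# Minimal sketch of speculative function chaining with memoized predictions.
from concurrent.futures import ThreadPoolExecutor

memo = {}   # function name -> last observed output, used as a prediction

def run_chain(f_upstream, f_downstream, request):
    with ThreadPoolExecutor() as pool:
        actual_up = pool.submit(f_upstream, request)
        predicted = memo.get(f_upstream.__name__)
        spec_down = pool.submit(f_downstream, predicted) if predicted is not None else None

        up_out = actual_up.result()
        memo[f_upstream.__name__] = up_out
        if spec_down is not None and predicted == up_out:
            return spec_down.result()           # speculation validated: commit
        return f_downstream(up_out)             # misspeculation: re-execute

# Example usage with two toy "functions".
resize = lambda req: {"size": 128}
classify = lambda meta: f"classified at {meta['size']}px"
print(run_chain(resize, classify, {"img": "a.png"}))   # first call: no prediction
print(run_chain(resize, classify, {"img": "b.png"}))   # second call: speculates
```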
In this paper, we discuss an IEEE 754 compliant normalized floating-point divide and square root unit that utilizes iterative approximation. We provide a robust architecture that allows multiple formats and all IEEE 7...
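The abstract above only names iterative approximation; one widely used scheme of that kind is Newton-Raphson refinement of a reciprocal (and reciprocal square root) seed. The sketch below is a software illustration of that generic technique, not a description of the paper's hardware unit, and it ignores IEEE 754 rounding and exception handling.

```python
# Generic Newton-Raphson divide and square root via iterative approximation.
import math

def nr_reciprocal(b: float, iters: int = 4) -> float:
    """Refine x ~ 1/b with x_{k+1} = x_k * (2 - b * x_k)."""
    m, e = math.frexp(b)                       # b = m * 2**e with 0.5 <= m < 1
    x = (48 / 17 - 32 / 17 * m) * 2.0 ** -e    # classic linear seed for 1/m
    for _ in range(iters):
        x = x * (2.0 - b * x)
    return x

def nr_divide(a: float, b: float) -> float:
    return a * nr_reciprocal(b)

def nr_sqrt(a: float, iters: int = 6) -> float:
    """sqrt(a) = a * rsqrt(a); refine rsqrt with x_{k+1} = x_k*(1.5 - 0.5*a*x_k^2)."""
    if a == 0.0:
        return 0.0
    m, e = math.frexp(a)
    x = (1.5 - 0.5 * m) * 2.0 ** (-e / 2)      # crude seed for 1/sqrt(a)
    for _ in range(iters):
        x = x * (1.5 - 0.5 * a * x * x)
    return a * x

print(nr_divide(10.0, 3.0), 10.0 / 3.0)
print(nr_sqrt(2.0), math.sqrt(2.0))
```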
ISBN (print): 9798350393132; 9798350393149
Modern cloud applications are prone to high tail latencies since their requests typically follow highly-dispersive distributions. Prior work has proposed both OS- and system-level solutions to reduce tail latencies for microsecond-scale workloads through better scheduling. Unfortunately, existing approaches, such as customized dataplane OSes, require significant OS changes, experience scalability limitations, or do not reach the full performance capabilities the hardware offers. We propose LibPreemptible, a preemptive user-level threading library that is flexible, lightweight, and scalable. LibPreemptible is based on three key techniques: 1) a fast and lightweight hardware mechanism for delivery of timed interrupts, 2) a general-purpose user-level scheduling interface, and 3) an API for users to express adaptive scheduling policies tailored to the needs of their applications. Compared to the prior state-of-the-art scheduling system Shinjuku, our system achieves significant tail latency and throughput improvements for various workloads without the need to modify the kernel. We also demonstrate the flexibility of LibPreemptible across scheduling policies for real applications experiencing varying load levels and characteristics.
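As a rough software analogy for combining timed interrupts with user-level scheduling, the sketch below uses a POSIX interval timer to deliver a periodic "interrupt" that asks the running task to yield back to a round-robin user-level scheduler. Real LibPreemptible relies on a lightweight hardware mechanism for timed interrupt delivery and supports adaptive policies; the quantum, generator-based tasks, and cooperative yield points here are illustrative assumptions.

```python
# Software-only analogy: a POSIX timer sets a preemption flag, and tasks yield
# at preemption points so a user-level scheduler can switch between them.
import signal, collections

QUANTUM = 0.005                     # assumed 5 ms scheduling quantum
preempt = False

def on_timer(signum, frame):
    global preempt
    preempt = True                  # the "timed interrupt" arrives in user space

def task(name, iters):
    done = 0
    while done < iters:
        done += 1                   # one unit of work
        if preempt:                 # preemption point
            yield
    print(name, "finished")

def scheduler(tasks):
    global preempt
    ready = collections.deque(tasks)
    signal.signal(signal.SIGALRM, on_timer)
    signal.setitimer(signal.ITIMER_REAL, QUANTUM, QUANTUM)
    while ready:
        preempt = False
        t = ready.popleft()
        try:
            next(t)                 # run until the task yields at a preemption point
            ready.append(t)         # requeue the preempted task
        except StopIteration:
            pass                    # task completed
    signal.setitimer(signal.ITIMER_REAL, 0)

scheduler([task("A", 1_000_000), task("B", 1_000_000)])
```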
ISBN (print): 9798350305487
Data-centric applications are increasingly common, causing the issues brought on by the discrepancy between processor and memory technologies to become increasingly apparent. Near-Data Processing (NDP) is an approach to mitigate this issue. It proposes moving some of the computation close to the memory, thus allowing for reduced data movement and aiding data-intensive workloads. Analytical database queries are commonly used in NDP research due to their intrinsic use of very large volumes of data. In this paper, we investigate the migration of the most time-consuming database operators to VIMA, a novel 3D-stacked memory-based NDP architecture. We consider the selection, projection, and bloom join database query operators, commonly used by data analytics applications, comparing the Vector-In-Memory Architecture (VIMA) to a high-performance x86 baseline. We pit VIMA against both a single-thread baseline and a modern 16-thread x86 system to evaluate its performance. Against a single-thread baseline, our experiments show that VIMA is able to speed up execution by up to 5x for selection, 2.5x for projection, and 16x for join while consuming up to 99% less energy. When considering a multi-thread baseline, VIMA matches the execution-time performance even at the largest dataset sizes considered. In comparison to existing state-of-the-art NDP platforms, we find that our approach achieves superior performance for these operators.
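One of the operators evaluated above, bloom join, can be summarized in a few lines: build a Bloom filter over the join keys of the build-side relation and use it to discard probe-side rows before the exact join. The sketch below is a generic host-side illustration with arbitrary hash and size choices, unrelated to how VIMA executes the operator.

```python
# Generic bloom join: a Bloom filter over the build side prunes probe rows
# cheaply before the exact hash-join lookup.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 16, k=3):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def bloom_join(build_rows, probe_rows, key):
    bf, build_index = BloomFilter(), {}
    for row in build_rows:
        bf.add(row[key])
        build_index.setdefault(row[key], []).append(row)
    for row in probe_rows:
        if bf.might_contain(row[key]):              # cheap filtering step
            for match in build_index.get(row[key], []):
                yield {**match, **row}

orders = [{"cust": 1, "item": "disk"}, {"cust": 3, "item": "ram"}]
custs  = [{"cust": 1, "name": "Ada"}, {"cust": 2, "name": "Bob"}]
print(list(bloom_join(custs, orders, "cust")))
```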
ISBN (print): 9798350393132
The proceedings contain 78 papers. The topics discussed include: exploitation of security vulnerability on retirement; GadgetSpinner: a new transient execution primitive using the loop stream detector; uncovering and exploiting AMD speculative memory access predictors for fun and profit; Revet: a language and compiler for dataflow threads; an optimizing framework on MLIR for efficient FPGA-based accelerator generation; Celeritas: out-of-core based unsupervised graph neural network via cross-layer computing; MEGA: a memory-efficient GNN accelerator exploiting degree-aware mixed-precision quantization; Gemini: mapping and architecture co-exploration for large-scale DNN Chiplet accelerators; STELLAR: energy-efficient and low-latency SNN algorithm and hardware co-design with spatiotemporal computation; and MIMDRAM: an end-to-end processing-using-DRAM system for high-throughput, energy-efficient and programmer-transparent multiple-instruction multiple-data computing.
ISBN (print): 9798350393132; 9798350393149
Graph neural networks (GNNs), one of the most popular neural network models, are extensively applied in graph-related fields, including drug discovery, recommendation systems, etc. Unsupervised graph learning, one type of GNN workload, plays a crucial role in various graph-related missions like node classification and edge prediction. However, with the increasing size of real-world graph datasets, processing such massive graphs in host memory becomes impractical, and GNN training demands a substantial storage volume to accommodate the vast amount of graph data. Consequently, GNN training results in significant I/O migration between the host and storage. Although state-of-the-art frameworks have made strides in mitigating I/O overhead by considering embedding locality, their GNN frameworks still suffer from long training times. In this paper, we propose a fully out-of-core framework, called Celeritas, which speeds up unsupervised GNN training on a single machine by co-designing the GNN algorithm and storage systems. First, based on theoretical analysis, we propose a new partial combination operation to enable embedding updates across GNN layers. This cross-layer computing performs future computation for embeddings stored in memory to save data migration. Second, due to the dependency between embeddings and edges, we consider their data locality together. Based on the cross-layer computing property, we propose a new loading order to fully utilize the data stored in main memory to save I/O. Finally, a new sampling scheme called two-level sampling is proposed, together with a new partition algorithm, to further reduce data migration and computation overhead while maintaining similar training accuracy. Real-system experiments indicate that the proposed Celeritas can reduce the total training time of different GNN models by 44.76% to 73.85% compared to state-of-the-art schemes for different graph datasets.
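A toy reading of the partial-combination idea is sketched below: while a node's embedding is resident in memory, its contribution to the next layer's aggregation for other in-memory neighbors is accumulated immediately, so that embedding need not be reloaded from storage later. The sum aggregation and the fixed in-memory set are assumptions for illustration, not Celeritas's actual mechanism.

```python
# Accumulate next-layer contributions for edges whose endpoints are already
# in memory; edges touching out-of-memory nodes are deferred to a later pass.
import numpy as np

def partial_combine(embeddings, edges, in_memory):
    """embeddings: {node: np.ndarray}; edges: iterable of (src, dst) pairs."""
    dim_template = next(iter(embeddings.values()))
    partial_sum = {v: np.zeros_like(dim_template) for v in embeddings}
    deferred = []                                # edges needing a future I/O pass
    for src, dst in edges:
        if src in in_memory and dst in in_memory:
            partial_sum[dst] += embeddings[src]  # contribution paid for now
        else:
            deferred.append((src, dst))          # must touch storage later
    return partial_sum, deferred

emb = {0: np.ones(4), 1: 2 * np.ones(4), 2: 3 * np.ones(4)}
sums, later = partial_combine(emb, [(0, 1), (1, 2), (2, 0)], in_memory={0, 1})
print(sums[1], later)    # node 1 already holds node 0's contribution
```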
ISBN (print): 9798350393132; 9798350393149
Sparse Matrix Dense Matrix Multiplication (SpMM) is an important kernel with applications across a wide range of domains, including machine learning and linear algebra solvers. In many sparse matrices, the pattern of nonzeros is nonuniform: nonzeros form dense and sparse regions, rather than being uniformly distributed across the whole matrix. We refer to this property as Intra-Matrix Heterogeneity (IMH). Currently, SpMM accelerator designs do not leverage this heterogeneity. They employ the same processing elements (PEs) for all the regions of a sparse matrix, resulting in suboptimal acceleration. To address this limitation, we utilize heterogeneous SpMM accelerator architectures, which include different types of PEs to exploit IMH. We develop an analytical modeling framework to predict the performance of different types of accelerator PEs, taking into account IMH. Furthermore, we present a heuristic for partitioning sparse matrices among heterogeneous PEs. We call our matrix modeling and partitioning method HotTiles. To evaluate HotTiles, we simulate three different heterogeneous architectures. Each one consists of two types of workers (i.e., PEs): one suited for compute-bound denser regions (Hot Worker) and one for memory-bound sparser regions (Cold Worker). Our results show that exploiting IMH with HotTiles is very effective. Depending on the architecture, heterogeneous execution with HotTiles outperforms homogeneous execution using only hot or only cold workers by 9.2-16.8x and 1.4-3.7x, respectively. In addition, HotTiles outperforms the best worker type used on a per-matrix basis by 1.3-2.5x. Finally, HotTiles outperforms an IMH-unaware heterogeneous execution strategy by 1.4-2.2x.
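The partitioning idea lends itself to a short sketch: tile the matrix, measure each tile's nonzero density, and route dense tiles to the hot worker and sparse tiles to the cold worker. The fixed density threshold below is an arbitrary stand-in for HotTiles' analytical model and heuristic.

```python
# Simplified intra-matrix heterogeneity partitioning: per-tile density decides
# whether a tile goes to the compute-bound ("hot") or memory-bound ("cold") PE.
import numpy as np

def partition_tiles(mat, tile=64, density_threshold=0.05):
    hot, cold = [], []
    for r in range(0, mat.shape[0], tile):
        for c in range(0, mat.shape[1], tile):
            block = mat[r:r + tile, c:c + tile]
            density = np.count_nonzero(block) / block.size
            (hot if density >= density_threshold else cold).append((r, c))
    return hot, cold

rng = np.random.default_rng(0)
A = np.where(rng.random((512, 512)) < 0.01, rng.random((512, 512)), 0.0)
A[:64, 128:192] = rng.random((64, 64))          # one artificially dense region (IMH)
hot, cold = partition_tiles(A)
print(f"{len(hot)} hot tiles, {len(cold)} cold tiles")   # expect 1 hot, 63 cold
```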
ISBN (print): 9798400700958
With the rapid growth of classification scale in deep learning systems, the final classification layer becomes extreme classification, with a memory footprint exceeding the main memory capacity of the CPU or GPU. The emerging in-storage-computing technique offers an opportunity because SSDs have enough storage capacity for the parameters of extreme classification. However, the limited performance of naive in-storage-computing schemes is insufficient to support the heavy workload of extreme classification. To this end, we propose ECSSD, the first hardware/data-layout co-designed in-storage-computing architecture for extreme classification, based on the approximate screening algorithm. We propose an alignment-free floating-point MAC circuit technique to improve computational ability under the limited area budget of in-storage-computing schemes, so that compute throughput can match the SSD's high internal bandwidth. We present a heterogeneous data layout design for the 4/32-bit weight data in the approximate screening algorithm to avoid data transfer interference and further utilize the internal DRAM bandwidth of the SSD. Moreover, we propose a learning-based adaptive interleaving framework to balance the access workload in each flash channel and improve channel-level bandwidth utilization. Putting these together, ECSSD achieves 3.24-49.87x performance improvements compared with state-of-the-art baselines.
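The screening idea can be illustrated independently of the hardware: score all classes with low-precision weights to shortlist a small candidate set, then rescore only the shortlist in full precision. The int8 quantization below is a stand-in for ECSSD's 4/32-bit split, and nothing here models the in-SSD execution or data layout.

```python
# Approximate screening for extreme classification: a cheap low-precision pass
# over all classes, followed by an exact pass over only the top candidates.
import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def screened_classify(x, W, top_k=32):
    """x: (d,) feature vector; W: (num_classes, d) classifier weights."""
    Wq, scale = quantize_int8(W)
    approx = (Wq.astype(np.float32) @ x) * scale       # cheap pass over all classes
    candidates = np.argpartition(-approx, top_k)[:top_k]
    exact = W[candidates] @ x                          # precise pass over few classes
    return int(candidates[np.argmax(exact)])

rng = np.random.default_rng(0)
W = rng.standard_normal((50_000, 256)).astype(np.float32)   # "extreme" class count
x = rng.standard_normal(256).astype(np.float32)
print(screened_classify(x, W), int(np.argmax(W @ x)))       # the two usually agree
```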