ISBN: (print) 9781450353168
The emergence of 3D-DRAM has rekindled interest in near data computing (NDC) research. This article introduces dataflow processing in memory (DFPIM), which melds near data computing, dataflow architecture, coarse-grained reconfigurable logic (CGRL), and 3D-DRAM technologies to provide high performance and very high energy efficiency for stream-oriented and big-data application kernels. Applying a dataflow architecture with a CGRL implementation yields a flexible, energy-efficient computing platform. The initial evaluation presented in this paper shows an average speedup of 5.5 with an energy efficiency factor of 460.
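As a rough illustration of the dataflow firing rule that a CGRL fabric implements in hardware, the following sketch executes a small dataflow graph in which a node fires as soon as all of its input tokens are available. The node format and operators are invented for the example and are not DFPIM's:

```python
# Minimal dataflow-execution sketch (assumed node format, not DFPIM's):
# a node fires as soon as all of its input tokens have arrived, which is
# the firing rule a CGRL fabric realizes spatially in hardware.

def run_dataflow(graph, inputs):
    """graph: {node: (op, [operand nodes])}; inputs: {node: value}."""
    values = dict(inputs)
    pending = dict(graph)
    while pending:
        for node, (op, deps) in list(pending.items()):
            if all(d in values for d in deps):   # all tokens present -> fire
                values[node] = op(*(values[d] for d in deps))
                del pending[node]
    return values

graph = {
    "sum": (lambda a, b: a + b, ["x", "y"]),
    "out": (lambda s, c: s * c, ["sum", "c"]),
}
print(run_dataflow(graph, {"x": 2, "y": 3, "c": 10})["out"])  # -> 50
```

Note that firing order falls out of data availability alone; no program counter sequences the nodes, which is what makes the model a natural fit for reconfigurable spatial hardware.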
Edge intelligence (EI), the combination of artificial intelligence (AI) and the internet of things (IoT), is the key to realizing an interconnected world of everything. EI needs to process the data collected by edge devices locally. For edge devices with insufficient computing power, a near data computing (NDC) system can provide local computing capability by adding a co-processing unit near the memory. NDC systems have been studied extensively with a focus on optimizing the hardware architecture, but they lack universal and transparent system support. Meanwhile, we note that non-volatile memory (NVM) plays an important role in edge intermittent computing. In this article, we propose a novel in-memory processing file system (IMPFS) based on NVM to provide general support for NDC systems running on edge devices. IMPFS uses standard file system interfaces and is optimized for NVM and AI application management. To verify IMPFS, a prototype system is designed. The results show that IMPFS not only provides universal support for EI but also further improves processing speed by reducing redundant software overhead.
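The idea of exposing near data computing through ordinary file-system interfaces can be sketched as follows; every class and method name here is invented for illustration and is not IMPFS's actual API:

```python
# Purely illustrative sketch of the IMPFS idea (all names invented):
# the application uses ordinary file-style calls, and the "file system"
# routes a computation to where the (NVM-backed) data lives instead of
# copying the data out to the host CPU.

class NearDataFS:
    """Toy in-memory 'file system' with a co-processing hook per file."""
    def __init__(self):
        self.files = {}          # path -> bytes, standing in for NVM pages

    def write(self, path, data: bytes):
        self.files[path] = data

    def read(self, path) -> bytes:
        return self.files[path]

    def process_near_data(self, path, kernel):
        # The kernel runs "next to the data": only the (usually much
        # smaller) result crosses back to the caller.
        return kernel(self.files[path])

fs = NearDataFS()
fs.write("/sensor/frame0", bytes([7, 1, 9, 3]))
# Offload a reduction instead of reading the buffer out and reducing on the host.
print(fs.process_near_data("/sensor/frame0", lambda buf: max(buf)))  # -> 9
```

The point of the sketch is the interface shape: because offload happens behind standard read/write-style calls, applications need no NDC-specific programming model.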
ISBN: (print) 9781450366694
Data transfer overhead between computing cores and the memory hierarchy has been a persistent issue for von Neumann architectures, and the problem has only become more challenging with the emergence of manycore systems. A conceptually powerful approach to mitigate this overhead is to bring the computation closer to the data, known as near data computing (NDC). Recently, NDC has been investigated in different flavors for CPU-based multicores, while the GPU domain has received little attention. In this paper, we present a novel NDC solution for GPU architectures with the objective of minimizing on-chip data transfer between the computing cores and the Last-Level Cache (LLC). To achieve this, we first identify frequently occurring Load-Compute-Store instruction chains in GPU applications. These chains, when offloaded to a compute unit closer to where the data resides, can significantly reduce data movement. We develop two offloading techniques, called LLC-Compute and Omni-Compute. The first technique, LLC-Compute, augments the LLCs with computational hardware for handling the computation offloaded to them. The second technique, Omni-Compute, employs simple bookkeeping hardware to enable GPU cores to compute instructions offloaded by other GPU cores. Our experimental evaluations on nine GPGPU workloads indicate that the LLC-Compute technique provides, on average, a 19% performance improvement (IPC), an 11% performance/watt improvement, and a 29% reduction in on-chip data movement compared to the baseline GPU design. The Omni-Compute design boosts these benefits to 31%, 16%, and 44%, respectively.
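A toy version of the chain-identification step might look like this; the instruction encoding, and the restriction to single ADD/MUL compute ops, are assumptions made for the sketch rather than the paper's actual analysis:

```python
# Hypothetical sketch of detecting Load-Compute-Store (LCS) chains in a
# toy instruction stream, in the spirit of the offload-candidate
# identification described above. Encoding: (opcode, dest, [sources]).

def find_lcs_chains(instructions):
    """Return index triples (load, compute, store) forming a dependent chain."""
    chains = []
    for i, (op_i, dst_i, _) in enumerate(instructions):
        if op_i != "LD":
            continue
        for j in range(i + 1, len(instructions)):
            op_j, dst_j, srcs_j = instructions[j]
            if op_j in ("ADD", "MUL") and dst_i in srcs_j:
                for k in range(j + 1, len(instructions)):
                    op_k, _, srcs_k = instructions[k]
                    if op_k == "ST" and dst_j in srcs_k:
                        chains.append((i, j, k))
                        break
                break
    return chains

stream = [
    ("LD",  "r1", ["a"]),          # load operand from memory
    ("ADD", "r2", ["r1", "r0"]),   # compute on the loaded value
    ("ST",  "b",  ["r2"]),         # store the result back to memory
    ("MUL", "r3", ["r0", "r0"]),   # unrelated compute, not part of a chain
]
print(find_lcs_chains(stream))  # -> [(0, 1, 2)]
```

Each reported triple touches memory twice and a register file once in between, so shipping the whole chain to logic near the LLC removes both core-to-LLC transfers.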
General-Purpose Graphics Processing Units (GPGPUs) have become a dominant computing paradigm for accelerating diverse classes of applications, primarily because of their higher throughput and better energy efficiency compared to CPUs. Moreover, GPU performance has been increasing rapidly due to technology scaling, increased core counts, and larger GPU cores. This has made GPUs an ideal substrate for building high-performance, energy-efficient computing systems. However, in spite of many architectural innovations in state-of-the-art GPUs, their deliverable performance falls far short of the achievable performance due to several issues. One of the major impediments to further improving the performance and energy efficiency of GPUs is the overhead associated with data movement. The main motivation behind this dissertation is to investigate techniques that mitigate the performance impact of data movement on throughput architectures. It consists of three main components. The first part of this dissertation focuses on developing intelligent compute scheduling techniques for GPU architectures with support for processing in memory (PIM). It performs an in-depth kernel-level analysis of GPU applications and develops a prediction model for efficient compute scheduling and management between the GPU and the PIM-enabled memory. The second part of this dissertation focuses on reducing the on-chip data movement footprint via efficient near data computing mechanisms. It identifies the basic forms of instructions that are ideal candidates for offloading, and provides the necessary compiler and hardware support to enable offloading computations closer to where the data resides, improving performance and energy efficiency. The third part of this dissertation focuses on investigating new warp formation and scheduling mechanisms for GPUs. It identifies code regions that lead to under-utilization of the GPU core. Specifically, it tackles the c...
A novel processing-in-storage (PRinS) architecture based on Resistive CAM (ReCAM) is proposed for Smith-Waterman (S-W) sequence alignment. The ReCAM PRinS massively parallel compare operation finds matching base pairs in a fixed number of cycles, regardless of sequence length. The ReCAM PRinS S-W algorithm is simulated and compared to FPGA, Xeon Phi, and GPU-based implementations, showing at least 4.7 times higher throughput and at least 15 times lower power dissipation.
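The dependence structure that the ReCAM hardware exploits is visible even in plain software Smith-Waterman: every cell on an anti-diagonal depends only on the two previous anti-diagonals, so all cells of a diagonal can be scored in parallel. A minimal scoring sketch, using textbook default parameters rather than the paper's:

```python
# Smith-Waterman local-alignment score, walked by anti-diagonals to show
# which cells are mutually independent (everything on one diagonal).
# Scoring parameters are common textbook defaults, not the paper's.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for d in range(2, rows + cols - 1):            # walk anti-diagonals
        for i in range(max(1, d - cols + 1), min(rows, d)):
            j = d - i                               # cells on diagonal d
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,      # match/mismatch
                          H[i - 1][j] + gap,        # gap in b
                          H[i][j - 1] + gap)        # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("AGC", "AGCT"))  # -> 6 (three matches at 2 each)
```

In the ReCAM design the inner loop collapses: the content-addressable compare scores an entire anti-diagonal in a fixed number of cycles, which is why throughput is independent of sequence length.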
ISBN: (print) 9781450341219
Processing data in or near memory (PIM), as opposed to in conventional computational units in a processor, can greatly alleviate the performance and energy penalties of data transfers from/to main memory. Graphics Processing Unit (GPU) architectures and applications, where main memory bandwidth is a critical bottleneck, can benefit from the use of PIM. To this end, an application should be properly partitioned and scheduled to execute on either the main, powerful GPU cores that are far away from memory or the auxiliary, simple GPU cores that are close to memory (e.g., in the logic layer of 3D-stacked DRAM). This paper investigates two key code scheduling issues in such a GPU architecture that has PIM capabilities, to maximize performance and energy-efficiency: (1) how to automatically identify the code segments, or kernels, to be offloaded to the cores in memory, and (2) how to concurrently schedule multiple kernels on the main GPU cores and the auxiliary GPU cores in memory. We develop two new run-time techniques: (1) a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and (2) a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on main GPU cores and the GPU cores in memory. Our experimental evaluations across 25 GPU applications demonstrate that these two techniques can significantly improve both application performance (by 25% and 42%, respectively, on average) and energy efficiency (by 28% and 27%).
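A regression-based affinity predictor of the kind described can be caricatured in a few lines; the feature names, weights, and threshold below are invented for illustration and are not the paper's trained model:

```python
# Illustrative sketch (not the paper's actual model): a linear
# regression-style affinity score decides, per kernel, whether offloading
# to the in-memory GPU cores is likely to pay off. Weights are invented.

WEIGHTS = {"mem_intensity": 2.0, "compute_intensity": -1.5, "parallelism": 0.3}
BIAS = -0.4

def pim_affinity(features):
    """Higher score => more likely to benefit from PIM offloading."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())

def schedule(kernels):
    """Partition kernels between main GPU cores and in-memory cores."""
    main, pim = [], []
    for name, feats in kernels.items():
        (pim if pim_affinity(feats) > 0 else main).append(name)
    return main, pim

kernels = {
    "stencil": {"mem_intensity": 0.9, "compute_intensity": 0.2, "parallelism": 0.8},
    "matmul":  {"mem_intensity": 0.3, "compute_intensity": 0.9, "parallelism": 0.9},
}
main, pim = schedule(kernels)
print(main, pim)  # memory-bound kernel goes near memory, compute-bound stays
```

The paper's full mechanism additionally predicts kernel execution times and respects inter-kernel dependencies so that main and in-memory cores can run different kernels concurrently; the sketch covers only the affinity decision.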
ISBN: (print) 9781467372114
To illustrate the architecture on-demands for data intensive geoscience product (AODIGP) processing, this paper proposes a method for dividing the working domains into a task domain, a resource domain, and a flow domain. Taking remote sensing products with common properties as an instance, product requirements are parsed by means of a knowledge base and an inference engine. An abstract description method for remote sensing products is used to constitute the workflow according to near data computing (NDC) rules. Finally, we verify the availability of AODIGP by constructing a data center resources cooperatively-scheduled system (DCRCSS) and outline improvements for future work.
ISBN: (print) 9781450328098
Software data structures are a critical aspect of emerging data-centric applications, which makes it imperative to improve the energy efficiency of data delivery. We propose SQRL, a hardware accelerator that integrates with the last-level cache (LLC) and enables energy-efficient iterative computation on data structures. SQRL integrates a data-structure-specific LLC refill engine (Collector) with a compute array of lightweight processing elements (PEs). The Collector exploits knowledge of the compute kernel to (i) run ahead of the PEs in a decoupled fashion to gather data objects and (ii) throttle the fetch rate and adaptively tile the dataset based on its locality characteristics. The Collector exploits data structure knowledge to extract memory-level parallelism and eliminate data-structure instructions.
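The Collector's decoupled run-ahead behavior can be mimicked in software with a bounded buffer between the access side and the compute loop; the traversal, buffer capacity, and throttle policy below are assumptions made for the sketch:

```python
# Toy sketch of SQRL-style decoupled access/execute. Assumptions: a plain
# iterable stands in for the data-structure traversal, a bounded deque for
# the Collector's buffer, and "stop when full" for the throttle policy.

from collections import deque

class Collector:
    """Runs ahead of the PEs, gathering data objects into a bounded buffer."""
    def __init__(self, nodes, capacity=4):
        self.nodes = iter(nodes)       # data-structure traversal order
        self.buffer = deque()
        self.capacity = capacity

    def run_ahead(self):
        # Throttle: fetch no further once the buffer is full.
        while len(self.buffer) < self.capacity:
            try:
                self.buffer.append(next(self.nodes))
            except StopIteration:
                break

def compute_kernel(collector, fn):
    """PE-side loop: consumes gathered objects, never issuing loads itself."""
    results = []
    collector.run_ahead()
    while collector.buffer:
        results.append(fn(collector.buffer.popleft()))
        collector.run_ahead()          # refill as the PEs drain the buffer
    return results

c = Collector(range(10), capacity=4)
print(compute_kernel(c, lambda x: x * x))  # squares of 0..9
```

Because the compute loop contains no address generation or pointer chasing, the PEs stay simple; all data-structure knowledge lives in the Collector, mirroring the division of labor the abstract describes.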