Various Internet of Things (IoT) devices bring great convenience to users. However, their privacy protection and overall performance still need improvement because of their limited storage and computation capabilities. Cloud ...
Compared with traditional relational database management systems (RDBMS), specialized graph databases (GDBs) can store and process graph data efficiently in both time and space. Hence, domains like social networks often use GDBs for data...
ISBN:
(Print) 9798400700958
Tensor-train (TT) decomposition enables ultra-high compression ratios, making deep neural network (DNN) accelerators based on this method very attractive. TIE, the state-of-the-art TT-based DNN accelerator, achieved high performance by leveraging a compact inference scheme to remove unnecessary computation and memory access. However, TIE increases memory costs for stage-wise intermediate results and additional intra-layer data transfer, leading to limited speedups even when the models are highly compressed. To unleash the full potential of TT decomposition, this paper proposes ETTE, an algorithm and hardware co-optimization framework for an Efficient Tensor-Train Engine. At the algorithm level, ETTE proposes new tensor core construction and computation ordering mechanisms to reduce stage-wise computation and storage costs at the same time. At the hardware level, ETTE proposes a lookahead-style across-stage processing scheme to eliminate unnecessary stage-wise data movement. By fully leveraging the decoupled input and output dimension factors, ETTE develops a low-cost, partition-free memory access scheme to efficiently support the desired matrix transformations. We demonstrate the effectiveness of ETTE by implementing a 16-PE hardware prototype in 28 nm CMOS technology. Compared with a GPU on various workloads, ETTE achieves 6.5x - 253.1x higher throughput and 189.2x - 9750.5x higher energy efficiency. Compared with the state-of-the-art DNN accelerators, ETTE brings 1.1x - 58.3x, 2.6x - 1170.4x, and 1.8x - 2098.2x improvements in throughput, energy efficiency, and area efficiency, respectively.
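To make the compression argument concrete, the sketch below (plain NumPy; the 256x256 layer shape, the factorizations, and the TT-ranks are illustrative assumptions, not ETTE's actual configuration) factorizes a fully connected layer into four TT cores and applies it without ever materializing the dense weight matrix:

import numpy as np

# Illustrative shapes: a 256x256 dense layer with input/output factors (4,4,4,4)
# and TT-ranks (1,8,8,8,1); these are assumptions for the sketch only.
m = n = (4, 4, 4, 4)                       # prod(m) = prod(n) = 256
r = (1, 8, 8, 8, 1)
rng = np.random.default_rng(0)
cores = [rng.standard_normal((r[k], m[k], n[k], r[k + 1])) for k in range(4)]

# Dense layer: 256*256 = 65,536 parameters; TT cores: 2,304 parameters (~28x fewer).
print(sum(c.size for c in cores))

x = rng.standard_normal(256)

# Apply the layer core by core, never forming the full 256x256 matrix.
y = np.einsum('piea,ajfb,bkgc,clhq,ijkl->efgh',
              *cores, x.reshape(m)).reshape(-1)

# Sanity check against the explicitly reconstructed dense matrix.
W = np.einsum('piea,ajfb,bkgc,clhq->ijklefgh', *cores).reshape(256, 256)
assert np.allclose(y, W.T @ x)

Roughly speaking, the stage-wise intermediate tensors produced by such a core-by-core evaluation are what TIE has to buffer, and what ETTE's reordered core construction and lookahead across-stage processing aim to shrink and keep moving.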
ISBN:
(Print) 9798350381603
Seismic data contains valuable information about the Earth's subsurface, which is useful in oil and gas (O&G) exploration. Seismic attributes are derived from seismic data to highlight relevant data structures and properties, improving geological or geophysical data interpretation. However, when calculated on large datasets, quite common in the O&G industry, these attributes may be computationally expensive in terms of computing power and memory capacity. Deep learning techniques can reduce these costs by avoiding direct attribute calculation. Some of these techniques may, however, be too complex, require large volumes of training data, and demand large computational capacity. This work shows that a conventional U-Net Convolutional Neural Network (CNN) model, with 31 million parameters, can be used to compute diverse seismic attributes directly from seismic data. The F3 dataset and attributes calculated on it were employed to train the models, each specialized in a specific attribute. The trained CNN models yield low prediction errors for most of the tested attributes. These results show that simple CNN models are able to infer seismic attributes with high accuracy.
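As a rough illustration of the approach (a minimal PyTorch sketch with random stand-in tensors; the paper's model is a conventional 31M-parameter U-Net trained on F3 patches, and the layer sizes below are assumptions), a small encoder-decoder with a skip connection can be trained to regress one attribute patch from an amplitude patch:

import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """Two-level U-Net-style regressor; far smaller than the paper's model."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = block(1, 16), block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = block(32, 16)        # upsampled features + skip connection
        self.head = nn.Conv2d(16, 1, 1)  # one output channel per attribute

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)

model = TinyUNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seismic = torch.randn(8, 1, 64, 64)      # stand-in amplitude patches
attribute = torch.randn(8, 1, 64, 64)    # stand-in target attribute (e.g. envelope)
loss = nn.functional.mse_loss(model(seismic), attribute)
opt.zero_grad(); loss.backward(); opt.step()

Training one such model per attribute, with the conventionally computed attribute as the regression target, mirrors the paper's one-specialized-model-per-attribute setup.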
ISBN:
(Print) 9798400700958
As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores pre-emption as a scheduling parameter to dynamically change resource allocations, and it automatically allocates resources to enable suitable parallelism for an application without additional user input. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating it on a set of real-world benchmarks. In our results, we achieve up to 5.7x lower average response times compared to a no-sharing, no-virtualization scheduling algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times compared to other high-performance algorithms.
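The preemption idea can be illustrated with a toy slot scheduler (pure Python; the App fields, the priority convention, and the eviction rule below are assumptions for illustration, not the published Nimblock algorithm):

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class App:
    priority: int                                    # lower value = more urgent
    name: str = field(compare=False, default="")
    slots_needed: int = field(compare=False, default=1)

def schedule(free_slots, ready, running):
    """Admit ready apps into free slots; preempt strictly lower-priority
    running apps when a more urgent app cannot otherwise be placed."""
    heapq.heapify(ready)
    while ready:
        app = heapq.heappop(ready)
        if app.slots_needed <= free_slots:
            running.append(app)
            free_slots -= app.slots_needed
            continue
        victims = sorted((a for a in running if a.priority > app.priority),
                         key=lambda a: a.priority, reverse=True)
        if free_slots + sum(v.slots_needed for v in victims) < app.slots_needed:
            heapq.heappush(ready, app)               # cannot place it yet
            break
        while free_slots < app.slots_needed:
            victim = victims.pop(0)
            running.remove(victim)
            heapq.heappush(ready, victim)            # preempted app re-queues
            free_slots += victim.slots_needed
        running.append(app)
        free_slots -= app.slots_needed
    return free_slots, running

# An urgent 2-slot app preempts a low-priority resident when only 1 slot is free.
free, running = schedule(1, [App(1, "vision", 2)], [App(5, "batch-sim", 2)])

A real scheduler would additionally weigh deadlines and the performance degradation caused by evicting or shrinking an application, which is exactly the balance Nimblock's evaluation is about.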
ISBN:
(Print) 9798400700958
Although serverless computing is a popular paradigm, current serverless environments have high overheads. Recently, it has been shown that serverless workloads frequently exhibit bursts of invocations of the same function. Such a pattern is not handled well in current platforms, and supporting it efficiently can speed up serverless execution substantially. In this paper, we target this dominant pattern with a new serverless platform design named MXFaaS. MXFaaS improves function performance by efficiently multiplexing (i.e., sharing) processor cycles, I/O bandwidth, and memory/processor state between concurrently executing invocations of the same function. MXFaaS introduces a new container abstraction called MXContainer. To enable efficient use of processor cycles, an MXContainer carefully helps schedule same-function invocations for minimal response time. To enable efficient use of I/O bandwidth, an MXContainer coalesces remote storage accesses and remote function calls from same-function invocations. Finally, to enable efficient use of memory/processor state, an MXContainer first initializes the state of its container and only later, on demand, spawns a process per function invocation, so that all invocations can share unmodified memory state and hence minimize memory footprint. We implement MXFaaS in two serverless platforms and run diverse serverless benchmarks. With MXFaaS, serverless environments are much more efficient. Compared to a state-of-the-art serverless environment, MXFaaS on average speeds up execution by 5.2x, reduces P99 tail latency by 7.4x, and improves throughput by 4.8x. In addition, it reduces average memory usage by 3.4x.
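The I/O-coalescing part of this design can be sketched in a few lines of asyncio (the fetch_remote stub and the key space are placeholders, not the MXFaaS API): concurrent invocations that request the same remote object share a single in-flight access.

import asyncio

_pending: dict[str, asyncio.Future] = {}     # key -> in-flight remote access

async def fetch_remote(key: str) -> bytes:
    await asyncio.sleep(0.1)                 # stand-in for a remote storage read
    return f"value-of-{key}".encode()

async def coalesced_get(key: str) -> bytes:
    fut = _pending.get(key)
    if fut is None:
        # The first invocation asking for this key issues the real access ...
        fut = asyncio.ensure_future(fetch_remote(key))
        _pending[key] = fut
        fut.add_done_callback(lambda _f: _pending.pop(key, None))
    # ... and every concurrent invocation of the same function awaits it.
    return await fut

async def main():
    # 100 concurrent invocations touch the same object; only one read is issued.
    results = await asyncio.gather(*(coalesced_get("user/42") for _ in range(100)))
    assert len(set(results)) == 1

asyncio.run(main())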
ISBN:
(Print) 9798400700958
Coarse-grained reconfigurable architecture (CGRA) has become a promising candidate for data-intensive computing due to its flexibility and high energy efficiency. CGRA compilers map data flow graphs (DFGs) extracted from applications onto CGRAs, playing a fundamental role in fully exploiting hardware resources for acceleration. Yet existing compilers are time-consuming and cannot guarantee optimal results due to the traversal of enormous search spaces brought about by the spatio-temporal flexibility of CGRA structures and the complexity of DFGs. Inspired by the remarkable progress of reinforcement learning (RL) and Monte-Carlo tree search (MCTS) on real-world problems, we consider constructing a compiler that can learn from past experiences and comprehensively understand the target DFG and CGRA. In this paper, we propose an architecture-aware compiler for CGRAs based on RL and MCTS, called MapZero - a framework to automatically extract the characteristics of the DFG and CGRA hardware and map operations onto varied CGRA fabrics. We apply a Graph Attention Network to generate adaptive embeddings for DFGs and also model the functionality and interconnection status of the CGRA, aiming to train an RL agent to perform placement and routing intelligently. Experimental results show that MapZero generates superior-quality mappings and reduces compilation time by hundreds of times compared to state-of-the-art methods. MapZero can find high-quality mappings very quickly even when the feasible solution space is so small that all other compilers fail. We also demonstrate the scalability and broad applicability of our framework.
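As a hint of how the DFG side could be embedded (a single-head attention layer written from scratch in PyTorch; this is an illustrative stand-in, not MapZero's actual Graph Attention Network or state encoding), each operation node attends over its neighbors to produce the features an RL placement agent would consume:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head graph attention layer producing per-operation embeddings."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, d_in) node features; adj: (N, N) adjacency with self-loops
        h = self.W(x)
        n = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1))          # attention logits
        e = e.masked_fill(adj == 0, float('-inf'))
        return torch.softmax(e, dim=-1) @ h                 # neighbor-weighted sum

# Toy 4-operation DFG: edges 0->1, 1->2, 1->3, plus self-loops.
adj = torch.eye(4) + torch.tensor([[0., 1, 0, 0],
                                   [0, 0, 1, 1],
                                   [0, 0, 0, 0],
                                   [0, 0, 0, 0]])
feat = torch.randn(4, 8)                    # per-op features (opcode, degree, ...)
emb = GraphAttention(8, 16)(feat, adj)      # (4, 16) embeddings for the agent

In the paper, such embeddings are paired with a model of the CGRA's functional units and interconnect state so that the agent, guided by MCTS, can score candidate placement and routing moves.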
Compute nodes in modern HPC systems are growing in size and their hardware has become ever more diverse. Still, many HPC centers allocate the resources of full nodes exclusively to avoid contention, despite the associ...
ISBN:
(Print) 9798400700958
Microservices are emerging as a popular cloud-computing paradigm. Microservice environments execute typically-short service requests that interact with one another via remote procedure calls (often across machines) and are subject to stringent tail-latency constraints. In contrast, current processors are designed for traditional monolithic applications. They support global hardware cache coherence, provide large caches, incorporate microarchitecture for long-running, predictable applications (such as advanced prefetching), and are optimized to minimize average latency rather than tail latency. To address this imbalance, this paper proposes µManycore, an architecture optimized for cloud-native microservice environments. Based on a characterization of microservice applications, µManycore is designed to minimize unnecessary microarchitecture and mitigate overheads to reduce tail latency. Indeed, rather than supporting manycore-wide hardware cache coherence, µManycore has multiple small hardware cache-coherent domains, called villages. Clusters of villages are interconnected with an on-package leaf-spine network, which has many redundant, low-hop-count paths between clusters. To minimize latency overheads, µManycore schedules and queues service requests in hardware and includes hardware support to save and restore process state when performing a context switch. Our simulation-based results show that µManycore delivers high performance. A cluster of 10 servers with a 1024-core µManycore in each server delivers 3.7x lower average latency, 15.5x higher throughput, and, importantly, 10.4x lower tail latency than a cluster with iso-power conventional server-class multicores. Similarly good results are attained compared to a cluster with power-hungry iso-area conventional server-class multicores.
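A back-of-the-envelope way to see the redundancy a leaf-spine fabric provides (the topology sizes below are illustrative assumptions, not µManycore's exact configuration): with every leaf switch connected to every spine switch, any two clusters of villages are joined by one two-hop route per spine.

import itertools

def leaf_spine_paths(num_leaves: int, num_spines: int):
    """Return {(leaf_a, leaf_b): all 2-hop routes, one through each spine}."""
    return {(a, b): [(f"leaf{a}", f"spine{s}", f"leaf{b}")
                     for s in range(num_spines)]
            for a, b in itertools.combinations(range(num_leaves), 2)}

paths = leaf_spine_paths(num_leaves=8, num_spines=4)
print(len(paths[(0, 1)]), "redundant 2-hop routes between any pair of clusters")

The many equal-length routes keep hop counts low and spread request traffic, which is part of how the design attacks tail latency alongside hardware request queueing and fast context save/restore.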
ISBN:
(Print) 9798350384147
The proceedings contain 16 papers. The topics discussed include: a low-power and real-time neural-rendering dense SLAM processor with 3-level hierarchical sparsity exploitation; reinforcement learning hardware accelerator using cache-based memoization for optimized Q-table selection; branch divergence-aware flexible approximating technique on GPUs; a 22 nm 10 TOPS mixed-precision neural network SoC for image processing with energy-efficient dilated convolution support; bit-separable radix-4 booth multiplier for power-efficient CNN accelerator; a microservice scheduler for heterogeneous resources on the edge-cloud computing continuum; power-efficient acceleration of GCNs on coarse-grained linear arrays; and MRCA: multi-grained reconfigurable cryptographic accelerator for diverse security requirements.