Nowadays, the use of unmanned aerial vehicles (UAVs) as fog access points (F-APs) is of high practical value to future fog radio access networks (F-RANs). Compared to the F-AP, a UAV-enabled F-AP possesses stronger ...
ISBN (print): 9798400700958
As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores pre-emption as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating on a set of real-world benchmarks. In our results, we achieve up to 5.7x lower average response times when compared to a no-sharing and no-virtualization scheduling algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms.
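As a hedged illustration of the scheduling idea this abstract describes, the sketch below models slot-based preemptive scheduling in Python: waiting tasks are ranked by an urgency score blending priority, deadline slack, and observed slowdown, and a waiting task may preempt the least urgent running one. The `Task` fields, the `urgency` formula, and the greedy preemption rule are illustrative assumptions, not Nimblock's published algorithm.

```python
# Minimal sketch of a preemptive, priority-aware scheduler over
# reconfigurable FPGA slots. All names and formulas are assumptions.
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Task:
    score: float                        # lower = more urgent (heap key)
    name: str = field(compare=False)
    priority: int = field(compare=False)
    slots_needed: int = field(compare=False)
    deadline: float = field(compare=False)

def urgency(priority, deadline, now, slowdown):
    # Blend static priority, deadline slack, and observed degradation
    # (an illustrative formula, not the paper's).
    return (deadline - now) / priority + slowdown

def schedule(ready, running, free_slots):
    """Greedy pass: admit urgent tasks into free slots, preempting the
    least urgent running task when a waiting task outranks it."""
    heapq.heapify(ready)
    while ready:
        task = heapq.heappop(ready)
        if task.slots_needed <= free_slots:
            running.append(task)
            free_slots -= task.slots_needed
        elif running and task.score < max(r.score for r in running):
            victim = max(running, key=lambda r: r.score)
            running.remove(victim)          # preempt: reclaim its slots
            free_slots += victim.slots_needed
            heapq.heappush(ready, victim)   # victim waits again
            heapq.heappush(ready, task)     # retry with freed slots
        else:
            heapq.heappush(ready, task)     # nothing else fits now
            break
    return running, free_slots

# Tiny demo: a tight-deadline task preempts a low-priority one.
now = 0.0
a = Task(urgency(1, 100.0, now, 0.0), "batch", 1, 3, 100.0)
b = Task(urgency(8, 5.0, now, 0.5), "interactive", 8, 2, 5.0)
running, free = schedule([b], *schedule([a], [], 4))
print([t.name for t in running], "free slots:", free)  # ['interactive'] 2
```

In a real system a pass like this would rerun each scheduling epoch with recomputed urgency scores, so degradation feedback steers future allocations.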
Most learned B-frame codecs with hierarchical temporal prediction suffer from the domain shift issue caused by the discrepancy between the Group-of-Pictures (GOP) sizes used for training and testing. As such, the motion estima...
ISBN (print): 9798350304817
The proceedings contain 72 papers. The topics discussed include: InsightsSumm - summarization of ITOps incidents through in-context prompt engineering; fine-grained heterogeneous execution framework with energy aware scheduling; learning representations on logs for AIOps; EN-Beats: a novel ensemble learning-based method for multiple resource predictions in cloud; the case for the anonymization of offloaded computation; deep reinforcement learning in cloud elasticity through offline learning and return based scaling; demystifying deep learning in predictive monitoring for cloud-native SLOs; Storm-RTS: stream processing with stable performance for multi-cloud and cloud-edge; blaze: a high-performance, scalable, and efficient data transfer framework with configurable and extensible features; and Kepler: a framework to calculate the energy consumption of containerized applications.
ISBN (print): 9781665420273
General matrix multiply (GEMM) is an important operation in broad applications, especially the thriving deep neural networks. To achieve low power consumption for GEMM, researchers have already leveraged unary computing, which manipulates bitstreams with extremely simple logic. However, existing unary architectures do not generalize well to the varying GEMM configurations of versatile applications and are incompatible with the binary computing stack, making it challenging to execute unary GEMM effortlessly. In this work, we address the problem by architecting a hybrid unary-binary systolic array, uSystolic, to inherit the legacy binary data scheduling with slow (thus power-efficient) data movement, i.e., data bytes crawl out from memory to drive uSystolic. uSystolic exhibits tremendous area and power improvements as a joint effect of 1) a low-power computing kernel, 2) spatial-temporal bitstream reuse, and 3) on-chip SRAM elimination. For the evaluated edge computing scenario, compared with the binary parallel design, the rate-coded uSystolic reduces the systolic array area and total on-chip area by 59.0% and 91.3%, with the on-chip energy and power efficiency improved by up to 112.2x and 44.8x for AlexNet.
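For readers unfamiliar with unary computing, the toy sketch below shows the core principle the abstract relies on: under rate coding, a value in [0, 1] becomes the density of 1s in a bitstream, so a single AND gate approximates a multiplication. This is a generic software model of rate-coded arithmetic, not uSystolic's hybrid unary-binary datapath.

```python
# Toy illustration of rate-coded unary arithmetic: a value in [0, 1] is
# encoded as the density of 1s in a bitstream, so one AND gate per lane
# approximates a multiply. This shows the principle behind unary GEMM
# kernels, not uSystolic's actual circuit.
import random

def to_bitstream(value, length=2048):
    """Rate coding: each bit is 1 with probability `value`."""
    return [1 if random.random() < value else 0 for _ in range(length)]

def unary_mul(a_bits, b_bits):
    """AND of two independent rate-coded streams estimates a * b."""
    return sum(x & y for x, y in zip(a_bits, b_bits)) / len(a_bits)

def unary_dot(a_vec, b_vec, length=2048):
    """Approximate dot product for vectors with entries in [0, 1]."""
    return sum(unary_mul(to_bitstream(a, length), to_bitstream(b, length))
               for a, b in zip(a_vec, b_vec))

# 0.5 * 0.25 ~= 0.125; the error shrinks as the stream length grows.
print(unary_mul(to_bitstream(0.5), to_bitstream(0.25)))
print(unary_dot([0.9, 0.1, 0.4], [0.2, 0.8, 0.5]))  # ~= 0.46
```

The extreme simplicity of the per-bit logic is what yields the power savings; the cost, which architectures like uSystolic must manage, is the long bitstreams needed for accuracy.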
ISBN (print): 9781665476522
Database Management Systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centers. There have been many efforts to accelerate DBMS application performance. One of the most explored techniques is the use of vector processing. Unfortunately, conventional vector architectures have not been able to exploit the full potential of DBMS acceleration. In this paper, we present VAQUERO, our Scratchpad-based Vector Accelerator for QUEry pROcessing. VAQUERO improves the efficiency of vector architectures for DBMS operations such as data aggregation and hash joins featuring lookup tables. Lookup tables are significant contributors to performance bottlenecks in DBMS processing, suffering from insufficient ISA support in the form of scatter-gather instructions. VAQUERO introduces a novel Advanced Scratchpad Memory specifically designed with two mapping modes, direct and associative. These mapping modes enable VAQUERO to accelerate real-world databases with workload sizes that significantly exceed the scratchpad memory capacity. Additionally, the associative mode allows VAQUERO to be used with DBMS operators that rely on hashed keys, e.g., hash-join and hash-aggregate. VAQUERO has been designed around general DBMS algorithm requirements instead of being based on a particular database organization. For this reason, VAQUERO is capable of accelerating DBMS operators for both row- and column-oriented databases. In this paper, we evaluate the efficiency of VAQUERO using two highly optimized popular open-source DBMS, namely the row-based PostgreSQL and the column-based MonetDB. We implemented VAQUERO at the RTL level and prototyped it, by performing Place&Route, at the 7nm technology node. VAQUERO incurs a modest 0.15% area overhead compared with an Intel Ice Lake processor. Our evaluation shows that VAQUERO significantly outperforms PostgreSQL and MonetDB by 2.09x and 3.32x respectively, when processing operators and queries.
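A minimal software model of the two mapping modes might look like the following; the class shape, write-through policy, and evict-on-miss refill are illustrative assumptions, not VAQUERO's actual Advanced Scratchpad Memory design. The distinction modeled here is that direct mode indexes with the raw (dense integer) key, while associative mode hashes arbitrary keys and relies on tag checks, which is what lets workloads larger than the scratchpad still resolve correctly against backing memory.

```python
# Minimal two-mode scratchpad lookup, loosely inspired by the direct-
# and associative-mode description in the abstract. Names and policies
# are assumptions for illustration only.
class Scratchpad:
    def __init__(self, capacity, mode="direct", backing=None):
        self.capacity = capacity
        self.mode = mode                    # "direct" or "associative"
        self.lines = [None] * capacity      # each line holds (key, value)
        self.backing = backing if backing is not None else {}  # main memory

    def _slot(self, key):
        raw = key if self.mode == "direct" else hash(key)
        return raw % self.capacity

    def lookup(self, key):
        """Gather: return the value for key, refilling on a miss."""
        line = self.lines[self._slot(key)]
        if line is not None and line[0] == key:
            return line[1]                  # scratchpad hit
        value = self.backing.get(key)       # miss: fetch from main memory
        self.lines[self._slot(key)] = (key, value)  # refill, evicting
        return value

    def update(self, key, value):
        """Scatter: write the value; write-through keeps backing correct."""
        self.lines[self._slot(key)] = (key, value)
        self.backing[key] = value

# Hash-aggregate (per-key count) over more keys than the scratchpad holds:
spm = Scratchpad(capacity=8, mode="associative")
for key in [3, 7, 3, 42, 7, 99, 3]:
    spm.update(key, (spm.lookup(key) or 0) + 1)
print(spm.backing)                          # {3: 3, 7: 2, 42: 1, 99: 1}
```

The write-through choice in this toy keeps aggregates correct even when hot lines are evicted, mirroring why a scratchpad can serve tables far larger than its capacity.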
The amazing success of deep neural networks benefits from the rise of big data. As deep learning models grow larger in scale than ever before, their requirements for memory bandwidth are growing at a tremendous pace...
ISBN (print): 9798400700958
Although serverless computing is a popular paradigm, current serverless environments have high overheads. Recently, it has been shown that serverless workloads frequently exhibit bursts of invocations of the same function. Such a pattern is not handled well by current platforms. Supporting it efficiently can speed up serverless execution substantially. In this paper, we target this dominant pattern with a new serverless platform design named MXFaaS. MXFaaS improves function performance by efficiently multiplexing (i.e., sharing) processor cycles, I/O bandwidth, and memory/processor state between concurrently executing invocations of the same function. MXFaaS introduces a new container abstraction called MXContainer. To enable efficient use of processor cycles, an MXContainer carefully helps schedule same-function invocations for minimal response time. To enable efficient use of I/O bandwidth, an MXContainer coalesces remote storage accesses and remote function calls from same-function invocations. Finally, to enable efficient use of memory/processor state, an MXContainer first initializes the state of its container and only later, on demand, spawns a process per function invocation, so that all invocations can share unmodified memory state and hence minimize the memory footprint. We implement MXFaaS in two serverless platforms and run diverse serverless benchmarks. With MXFaaS, serverless environments are much more efficient. Compared to a state-of-the-art serverless environment, MXFaaS on average speeds up execution by 5.2x, reduces P99 tail latency by 7.4x, and improves throughput by 4.8x. In addition, it reduces the average memory usage by 3.4x.
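One of the mechanisms the abstract describes, coalescing remote storage accesses across same-function invocations, can be sketched in a few lines of asyncio Python. The `MXContainer` class below is a toy model under assumed names; the real MXFaaS operates at the container and OS level rather than inside one event loop.

```python
# Toy model of MXContainer-style I/O coalescing: concurrent invocations
# of the same function that read the same remote key share one fetch.
# The asyncio structure and all names here are illustrative assumptions.
import asyncio

class MXContainer:
    def __init__(self, fn, storage):
        self.fn = fn                        # the shared user function
        self.storage = storage              # remote storage stub
        self.inflight = {}                  # key -> pending fetch task

    async def _coalesced_get(self, key):
        """All concurrent invocations reading `key` await one fetch."""
        if key not in self.inflight:
            self.inflight[key] = asyncio.create_task(self.storage(key))
        try:
            return await self.inflight[key]
        finally:
            self.inflight.pop(key, None)    # first finisher cleans up

    async def invoke(self, request):
        payload = await self._coalesced_get(request["key"])
        return self.fn(payload, request)

async def main():
    async def storage(key):                 # stand-in for a remote store
        await asyncio.sleep(0.05)           # simulated network latency
        return f"blob:{key}"

    box = MXContainer(lambda blob, req: (req["id"], blob), storage)
    burst = [{"id": i, "key": "model"} for i in range(8)]  # same-key burst
    print(await asyncio.gather(*(box.invoke(r) for r in burst)))

asyncio.run(main())
```

In this run, all eight invocations in the burst complete after a single simulated storage round trip instead of eight, which is the bandwidth saving the coalescing mechanism targets.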
ISBN (print): 9798400700958
Coarse-grained reconfigurable architecture (CGRA) has become a promising candidate for data-intensive computing due to its flexibility and high energy efficiency. CGRA compilers map data flow graphs (DFGs) extracted from applications onto CGRAs, playing a fundamental role in fully exploiting hardware resources for acceleration. Yet existing compilers are time-consuming and cannot guarantee optimal results because they must traverse the enormous search spaces created by the spatio-temporal flexibility of CGRA structures and the complexity of DFGs. Inspired by the remarkable progress of reinforcement learning (RL) and Monte-Carlo tree search (MCTS) on real-world problems, we construct a compiler that can learn from past experience and comprehensively understand the target DFG and CGRA. In this paper, we propose MapZero, an architecture-aware compiler for CGRAs based on RL and MCTS: a framework that automatically extracts the characteristics of the DFG and the CGRA hardware and maps operations onto varied CGRA fabrics. We apply a Graph Attention Network to generate adaptive embeddings for DFGs and also model the functionality and interconnection status of the CGRA, aiming to train an RL agent to perform placement and routing intelligently. Experimental results show that MapZero generates superior-quality mappings and reduces compilation time by hundreds of times compared to state-of-the-art methods. MapZero finds high-quality mappings very quickly even when the feasible solution space is small and all other compilers fail. We also demonstrate the scalability and broad applicability of our framework.
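To make the search component concrete, here is a toy Monte-Carlo tree search loop for placing DFG operations onto processing elements. In MapZero the search is guided by a learned Graph Attention Network policy; in this sketch selection is plain UCT, the rollout is a single random expansion, and the `expand`/`reward` functions are placeholder assumptions, not the paper's cost model.

```python
# Toy MCTS for assigning DFG operations to CGRA processing elements.
# All names, the UCT constant, and the reward are assumptions.
import math, random

class Node:
    def __init__(self, placement, parent=None):
        self.placement = placement          # PE chosen for ops 0..k-1
        self.parent, self.children = parent, []
        self.visits, self.value = 0, 0.0

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")             # explore unvisited first
        return (self.value / self.visits +
                c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root, expand, reward, iters=500):
    for _ in range(iters):
        node = root
        while node.children:                # 1. select by UCT
            node = max(node.children, key=Node.uct)
        for state in expand(node.placement):  # 2. expand
            node.children.append(Node(state, node))
        leaf = random.choice(node.children) if node.children else node
        r = reward(leaf.placement)          # 3. evaluate
        while leaf:                         # 4. backpropagate
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).placement

# Demo: place 3 ops on 4 PEs, rewarding consecutive ops on nearby PEs.
OPS, PES = 3, 4
expand = lambda p: [p + (pe,) for pe in range(PES)] if len(p) < OPS else []
reward = lambda p: -sum(abs(a - b) for a, b in zip(p, p[1:]))
print("best first placement:", mcts(Node(()), expand, reward))
```

Replacing the random rollout and hand-written reward with learned models is, at a high level, what lets an approach like MapZero prune the search space that exhaustive compilers must traverse.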