ISBN (digital): 9781728161495
ISBN (print): 9781728161501
In current convolutional neural network (CNN) accelerators, communication (i.e., memory access) dominates the energy consumption. This work provides a comprehensive analysis and methodologies to minimize the communication for CNN accelerators. For the off-chip communication, we derive the theoretical lower bound for any convolutional layer and propose a dataflow that reaches the lower bound. This fundamental problem has never been solved by prior studies. The on-chip communication is minimized based on an elaborate workload and storage mapping scheme. We additionally design a communication-optimal CNN accelerator architecture. Evaluations based on the 65nm technology demonstrate that the proposed architecture nearly reaches the theoretical minimum communication in a three-level memory hierarchy and that it is computation dominant. The gap between the energy efficiency of our accelerator and the theoretical best value is only 37-87%.
From assisting with assembling the Orion capsule to using highly immersive virtual environments for astronaut training, MR technologies provide a powerful mechanism to alter the perception of the physical world and deliver realistic personalized visual stimuli to users. In this paper, we discuss a novel strategy to utilize MR technologies as a design element to enhance the interior architecture of the space habitat and enrich the inhabitants' personal experience. We discuss two scenarios that entail long-duration missions as well as a customized experience for space tourists in the Low Earth Orbit (LEO). A series of spacecraft volumetric studies of the ergonomics associated with the application of MR technologies are reported. Physical, virtual and combined experiences are mapped within the volumes with respect to crew ConOps. The experiences are then analyzed and translated to architectural design requirements that inform criteria for the development of personalized MR-based interventions. For the first scenario, NASA's 500 days on the surface of Mars mission is considered, which requires 600 additional days in microgravity transit inside the Deep Space Transfer vehicle, a 7.2 m wide hard-shell module. In this scenario, MR experiences are used as a stress countermeasure to help a crew of four to sustain psychological and behavioral health, maintain productivity, and stimulate teamwork and performance. This is accomplished by providing novelty in the habitat as well as designing content that can increase the volumetric perception of the environment. The second scenario is presented in the context of space tourism where habitats with minimum physical interior design elements can be transformed into comfortable habitable personal environments. Bigelow Space Operations' B330 was selected as a reference site for a 12-day LEO tourism mission. We discuss a design approach that provides tourists with a high level of comfort by using projection-based MR technologies to cus
This paper develops a model to plan energy-efficient speed trajectories of electric trucks in real time by taking into account the information of topography and traffic ahead of the vehicle. In this real time control ...
The last two decades have witnessed a large number of proposals on the last-level cache (LLC) replacement policy aiming to minimize the number of LLC read misses. Another independent large body of work has explored mechanisms to address the inefficiencies arising from the DRAM writes introduced by the LLC replacement policy. These DRAM scheduling proposals, however, leave the LLC replacement policy unchanged and, as a result, miss the opportunity of synergistically shaping and scheduling the DRAM write bandwidth demand. In this paper, we argue that DRAM read and write bandwidth demands must be coordinated carefully from the LLC side and hence, introduce bandwidth-awareness in the LLC policy. Our bandwidth-aware LLC policy proposal enables long uninterrupted stretches of DRAM reads while maintaining the efficiency of the last-level cache and controlling precisely when and for how long writes can demand DRAM bandwidth. Our proposal comfortably outperforms the state-of-the-art eager DRAM write scheduling proposals and bridges 75% of the performance gap between the baseline and a hypothetical system that deploys an unbounded DRAM write buffer.
Double buffering is an effective mechanism to hide the latency of data transfers between on-chip and off-chip memory. However, in dataflow architecture, the swapping of two buffers during the execution of many tiles decreases the performance because of repetitive filling and draining of the dataflow accelerator. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. The proposed non-stop mechanism assigns tiles to the processing element array without stopping the execution of processing elements by optimizing the control logic in dataflow architecture. Moreover, we propose a work-flow program to cooperate with the non-stop double buffering mechanism. After optimizing both the control logic and the work-flow program, the filling and draining of the array need to be done only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism for dataflow architecture achieves a 16.2% average efficiency improvement over the baseline without the optimization.
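The effect described in this abstract can be illustrated with a toy cycle count. This is only a sketch under assumed parameters: the function names and the fill, compute, and drain figures are hypothetical and are not taken from the paper, and the model ignores data-transfer latency, which double buffering is assumed to hide completely.

```python
def cycles_with_stops(n_tiles, fill, compute, drain):
    """Baseline model: every buffer swap drains and refills the PE array."""
    return n_tiles * (fill + compute + drain)


def cycles_non_stop(n_tiles, fill, compute, drain):
    """Non-stop model: the array is filled and drained once per dataflow graph."""
    return fill + n_tiles * compute + drain


def efficiency(n_tiles, compute, total_cycles):
    """Fraction of the total cycles spent on steady-state computation."""
    return n_tiles * compute / total_cycles


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    n_tiles, fill, compute, drain = 8, 10, 100, 10
    base = cycles_with_stops(n_tiles, fill, compute, drain)
    non_stop = cycles_non_stop(n_tiles, fill, compute, drain)
    print("baseline cycles:", base)
    print("non-stop cycles:", non_stop)
    print("baseline efficiency:", efficiency(n_tiles, compute, base))
    print("non-stop efficiency:", efficiency(n_tiles, compute, non_stop))
```

In this model the saving grows with the number of tiles and with the ratio of fill and drain time to per-tile compute time, which is consistent with the abstract's claim that repeated filling and draining is the source of the loss.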
Recent studies have focused on leveraging large-scale artificial intelligence (LAI) models to improve semantic representation and compression capabilities. However, the substantial computational demands of LAI models ...
Many recent excellent methods for efficient real-time semantic segmentation are of low precision and rely heavily on multiple GPUs for training. In this paper, we rethink the critical factors affecting the accuracy of efficient segmentation models. Previous works usually reduce the input resolution prior to training the parameters of models by cropping or resizing the images. In contrast, our empirical study shows that the reduced images lose important content information and details, which are vital to high precision. However, the previous methods are unable to train on the original high-resolution images due to memory-limited GPUs. To tackle this problem, we propose a novel versatile network (VNet), which employs a reversible mechanism and asymmetric convolution to achieve high efficiency and extremely low memory consumption in backward propagation. In particular, we keep all the detailed spatial information of the input images without cropping or resizing to pursue decent prediction accuracy. It is worth noting that VNet can train on multiple 1024×2048 high-resolution images on only one standard GPU card. Under the same conditions, our model achieves a new state-of-the-art result on the Cityscapes dataset. Specifically, it can process 1024×2048 high-resolution inputs at a rate of 37.4 and 15.5 frames per second (fps) on a standard GPU and an edge device, respectively, with only 0.16 million parameters.
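The abstract does not say which reversible mechanism VNet uses. As a hedged illustration of why reversibility saves memory in backward propagation, the sketch below uses the standard additive coupling found in reversible residual networks. The sub-functions `f` and `g` are arbitrary stand-ins chosen for this example, not VNet layers. Because the inputs can be recomputed exactly from the outputs, intermediate activations do not have to be stored.

```python
def f(x):
    """Hypothetical stand-in for the first residual sub-network."""
    return [2.0 * v + 1.0 for v in x]


def g(x):
    """Hypothetical stand-in for the second residual sub-network."""
    return [v * v for v in x]


def forward(x1, x2):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Recover the inputs from the outputs, with no stored activations."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


if __name__ == "__main__":
    x1, x2 = [1.0, 2.0], [3.0, 4.0]
    y1, y2 = forward(x1, x2)
    print("outputs:", y1, y2)
    print("recovered inputs:", inverse(y1, y2))
```

Neither `f` nor `g` needs to be invertible on its own; the coupling structure alone makes the block invertible.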
Fast detection of heavy flows (e.g., heavy hitters and heavy changers) in massive network traffic is challenging due to the stringent requirements of fast packet processing and limited resource availability. Invertibl...
The eXtended isogeometric analysis (X-IGA) combined with particle swarm optimization (PSO) is used for crack identification in two-dimensional linear elastic problems, formulated as an inverse problem. The application of fractu...