ISBN: 9781450364942 (print)
Recent advances such as deep learning algorithms and virtual reality applications demand unprecedented computational power from ever smaller devices. Heterogeneous architectures are seen as a solution to this problem, since they provide significantly better performance per watt. However, their comparatively difficult integration and programming are a major drawback. The Heterogeneous System Architecture (HSA) Foundation took up this problem and published a set of specifications that provide a uniform solution across different architectures. These specifications have been successfully applied to GPUs and DSPs, but remain insufficiently researched for FPGAs. Yet reconfigurable logic can further increase the efficiency of embedded systems, while high-level synthesis (HLS) can lower the integration hurdle. Combining the HSA specifications with HLS therefore seems a worthwhile way to increase heterogeneity and ease the programmability of FPGAs. This paper provides a suitability analysis of the HSA Intermediate Language (HSAIL) for HLS, suggests further improvements to the specification, and presents an associated proof of concept. The results show that it is indeed possible to build such a hybrid HLS flow based on the HSA standard, where parts of the process can be shared with other classes of accelerators such as GPUs. This implies that progress in the GPU sector, such as the available source languages, can carry over to FPGAs more quickly and improve their general accessibility.
ISBN: 9781450364942 (print)
Reconfigurable hardware is becoming increasingly mainstream, evolving into a valid alternative to hardware accelerators based on Graphics Processing Units (GPUs). However, several major challenges remain in migrating existing software to heterogeneous reconfigurable architectures. The EXTRA project aims to develop an integrated environment for developing and programming reconfigurable architectures. The EXTRA platform enables the joint optimization of architecture, tools, and reconfiguration technology, and targets future High Performance Computing hardware nodes. In this paper, we present four innovative EXTRA technologies: (1) a hardware-software co-design framework; (2) a parallel memory system; (3) a decoupled access-execute framework for reconfigurable technology; and (4) transparent access to and virtualization of reconfigurable hardware accelerators. Moreover, we describe how the EXTRA technologies, targeting Amazon F1 cloud compute instances, can be used in medical applications such as retinal image segmentation.
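As a rough software analogy for the decoupled access-execute idea named in item (3), the sketch below splits a gather-and-compute loop into an access stage that only generates addresses and loads operands, and an execute stage that only computes, connected by a bounded FIFO. Python threads and `queue.Queue` merely stand in for hardware units and an on-chip FIFO; the function name `decoupled_sum_of_squares` and the `depth` parameter are hypothetical and are not part of the EXTRA framework.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the access stream


def decoupled_sum_of_squares(data, indices, depth=8):
    """Sum of data[i]**2 over `indices`, split into an access stage and an execute stage."""
    fifo = queue.Queue(maxsize=depth)  # bounded FIFO decoupling the two stages
    result = []

    def access():
        # Access stage: address generation and loads only (an irregular gather here).
        for i in indices:
            fifo.put(data[i])
        fifo.put(_DONE)

    def execute():
        # Execute stage: pure computation on operands as they arrive.
        acc = 0
        while True:
            value = fifo.get()
            if value is _DONE:
                break
            acc += value * value
        result.append(acc)

    stages = [threading.Thread(target=access), threading.Thread(target=execute)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]


print(decoupled_sum_of_squares([1, 2, 3, 4], [3, 0, 2]))  # 16 + 1 + 9 = 26
```

The bounded queue lets the access stage run up to `depth` elements ahead of the execute stage, which is the property that lets a decoupled design hide memory latency behind computation.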
ISBN: 9781450364942 (print)
Mutual Information (MI) and Transfer Entropy (TE) algorithms compute statistical measures of the information shared between two dependent random processes. These measures have been applied to pairwise computations on time series in a broad range of fields, such as econometrics, neuroscience, data mining, and computer vision. Unlike previous works, which mostly focus on 8-bit computer vision applications, this work proposes the first generic hardware architectures for accelerating the MI and TE algorithms on any dataset on a realistic multi-FPGA platform. We evaluate and compare two such systems, the Maxeler MAX3A Vectis and the Convey HC-2ex platforms, and provide insight into each one's benefits and limitations. All reported results are from actual experimental runs, including I/O overhead, and represent lower bounds of our systems' full capabilities for large-scale datasets. These results are compared to equivalent optimized multi-threaded software implementations, yielding a ~19x speedup over out-of-the-box software packages and a ~2.5x speedup over the highly optimized software presented in related work. These hardware architectures use only a small fraction of the FPGA resources and are limited by I/O bandwidth. This means that with near-future FPGA I/O capabilities, the performance of the presented architectures for the O(n^2) Mutual Information and O(n^3) Transfer Entropy problems will scale up easily.
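To make the two measures concrete, the following minimal sketch computes plug-in (histogram) estimates of MI and of TE from `y` to `x` for discrete-valued series. The history length of 1, the base-2 logarithm, and the function names are assumptions for illustration; this is not the paper's hardware architecture, only the textbook quantities it accelerates. Note that TE needs a three-way joint count table where MI needs only a two-way one, which is one reason TE is the costlier problem.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two equal-length discrete series."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # Sum over observed (a, b): p(a,b) * log2(p(a,b) / (p(a) p(b))), written with raw counts.
    return sum((c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


def transfer_entropy(x, y):
    """Plug-in estimate of TE from y to x in bits, using a history length of 1."""
    m = len(x) - 1  # number of (x_{t+1}, x_t, y_t) transitions
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))  # counts of (x_{t+1}, x_t, y_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))        # counts of (x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))         # counts of (x_{t+1}, x_t)
    past = Counter(x[:-1])                         # counts of x_t
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        # p(x1 | x0, y0) / p(x1 | x0) expressed with counts
        ratio = (c * past[x0]) / (pairs_xy[(x0, y0)] * pairs_xx[(x1, x0)])
        te += (c / m) * math.log2(ratio)
    return te


# x copies y with a one-step lag, so information flows from y to x.
y = [0, 1, 1, 0, 1, 0, 0, 1]
x = [0] + y[:-1]
print(mutual_information(x, y), transfer_entropy(x, y))
```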
ISBN: 9781728159539 (electronic)
ISBN: 9781728159546 (print)
This paper presents a Model-Based Design (MBD) methodology as a promising approach for the rapid and secure development of embedded system applications, including those involving digital controllers. Building on existing application-oriented hardware and software tools from STMicroelectronics, a new modeling technique has been implemented to move from a traditional design workflow to an MBD one using the MathWorks® software platform. As a practical application of the proposed approach, a case study is reported on the design of a Permanent Magnet Synchronous Motor (PMSM) drive based on the STM32 MCU family. The main contribution of this paper is its focus on concretely building a single, unified model architecture for a complex software/hardware system that implements all the major aspects of the MBD methodology, such as executable requirements, Normal and Processor-In-the-Loop (PIL) simulation modes, continuous testing and verification, and automatic code generation. To this end, a new Simulink® blockset dedicated to the STM32 Motor Control ecosystem was developed. It includes Simulink blocks for math, algorithms, and IP, electronic circuitry, MCU peripherals, and speed sensors. The Simulink® Embedded Coder tool was used to automatically generate code for a specific STM32 MCU target, avoiding the time-consuming and error-prone process of handwritten coding.
ISBN: 9783030223403 (print)
The proceedings contain 50 papers. The special focus in this conference is on Adaptive Instructional Systems. The topics include: Ibigkas! 2.0: Directions for the Design of an Adaptive Mobile-Assisted Language Learning App; Adaptive Learning Technology for AR Training: Possibilities and Challenges; Intelligent Tutoring Design Alternatives in a Serious Game; Missing Pieces: Infrastructure Requirements for Adaptive Instructional Systems; Standards Needed: Competency Modeling and Recommender Systems; Measuring the Complexity of Learning Content to Enable Automated Comparison, Recommendation, and Generation; Capturing AIS Behavior Using xAPI-like Statements; Standardizing Unstructured Interaction Data in Adaptive Instructional Systems; Exploring Methods to Promote Interoperability in Adaptive Instructional Systems; Adaptive Team Training for One; Examining Elements of an Adaptive Instructional System (AIS) Conceptual Model; Interoperability Standards for Adaptive Instructional Systems: Vertical and Horizontal Integrations; Integrating Engagement Inducing Interventions into Traditional, Virtual and Embedded Learning Environments; Productive Failure and Subgoal Scaffolding in Novel Domains; Adaptation and Pedagogy at the Collective Level: Recommendations for Adaptive Instructional Systems; Developing an Adaptive Trainer for Joint Terminal Attack Controllers; Using an Adaptive Intelligent Tutoring System to Promote Learning Affordances for Adults with Low Literacy Skills; Development of Cognitive Transfer Tasks for Virtual Environments and Applications for Adaptive Instructional Systems; Application of Theory to the Development of an Adaptive Training System for a Submarine Electronic Warfare Task; Learning Analytics of Playing Space Fortress with Reinforcement Learning; Adaptive Training: Designing Training for the Way People Work and Learn; Wrong in the Right Way: Balancing Realism Against Other Constraints in Simulation-Based Training; Adaptive Remediation with Multi-modal Content.
ISBN: 9781450364942 (print)
Because of growing concern about the energy consumption of embedded devices, application quality is now considered a new tunable parameter during the implementation phase. Approximations are deliberately introduced to gain performance. Nevertheless, when an approximate computing technique is implemented, quality degradation may appear. Several metrics can be used to check that the application's Quality of Service is still met despite the induced approximations. This paper introduces an algorithm-level approximate computing method in a stereovision algorithm. The proposed approximation aims at reducing the computational load of a stereo matching algorithm that outputs a depth map from two rectified images. Based on a smart loop perforation technique, this method offers an interesting quality/complexity trade-off. However, when the results are compared to those of a more basic approximation technique, they show that the quality/computation time trade-off strongly depends on the metric used. Our paper presents the impact of the choice of quality metric on the results of the proposed approximate computing technique.
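The sketch below illustrates plain loop perforation, not the paper's "smart" variant, on a toy winner-take-all block-matching stereo algorithm (sum of absolute differences). With `skip > 1`, only every `skip`-th column is matched and skipped pixels reuse the last computed disparity. Two common quality metrics, mean absolute error and bad-pixel rate, are included to show that the same approximate map can be scored differently. The matching cost, window size, fill-in rule, synthetic image pair, and all function names are assumptions for illustration.

```python
import random


def block_cost(left, right, row, col, d, half):
    """SAD between the (2*half+1)^2 block at (row, col) in `left` and the block shifted by d in `right`."""
    h = len(left)
    cost = 0
    for dr in range(-half, half + 1):
        r = min(max(row + dr, 0), h - 1)  # clamp rows at the image border
        for dc in range(-half, half + 1):
            cost += abs(left[r][col + dc] - right[r][col + dc - d])
    return cost


def disparity_map(left, right, max_disp=4, half=1, skip=1):
    """Winner-take-all block-matching disparity map.

    skip > 1 perforates the column loop: only every skip-th column is matched and
    the skipped columns reuse the last computed disparity.
    """
    h, w = len(left), len(left[0])
    out = [[0] * w for _ in range(h)]
    lo, hi = half + max_disp, w - half  # columns whose blocks stay inside both images
    for r in range(h):
        last = 0
        for c in range(lo, hi):
            if (c - lo) % skip == 0:
                costs = [block_cost(left, right, r, c, d, half) for d in range(max_disp + 1)]
                last = costs.index(min(costs))  # ties resolve to the smallest disparity
            out[r][c] = last
    return out


def mean_abs_error(ref, approx):
    diffs = [abs(a - b) for ra, rb in zip(ref, approx) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def bad_pixel_rate(ref, approx, threshold=1):
    bad = [abs(a - b) > threshold for ra, rb in zip(ref, approx) for a, b in zip(ra, rb)]
    return sum(bad) / len(bad)


# Synthetic rectified pair: random texture with a constant true disparity of 2.
rng = random.Random(0)
w, h = 16, 8
left = [[rng.randrange(256) for _ in range(w)] for _ in range(h)]
right = [[row[min(c + 2, w - 1)] for c in range(w)] for row in left]
exact = disparity_map(left, right)
approx = disparity_map(left, right, skip=2)
print(mean_abs_error(exact, approx), bad_pixel_rate(exact, approx))
```

On a scene with one constant disparity, perforation loses nothing. Errors appear near disparity discontinuities, and whether they look large or small depends on which metric is used to measure them.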
ISBN: 9781728195353 (electronic)
ISBN: 9781728195360 (print)
Full-system emulators allow guest operating systems and applications to run without access to the real target hardware. For many applications, besides correct functional modeling, the full-system emulator must also be time-accurate. In this paper, we present a new full-system multi-core simulator that delivers time-accurate execution and preserves the functional correctness of guest applications. The proposed solution is based on QEMU, which we enriched with various timing models of multi-core platforms. We call this new full-system simulator mcQEMU. mcQEMU supports guest CPUs with out-of-order and in-order *** validated mcQEMU by emulating multi-core ARM processors in system mode. The timing accuracy of mcQEMU is evaluated with the TACLeBench benchmark suite. From a timing prediction viewpoint, mcQEMU achieves an average estimation error of only 15% when emulating the out-of-order ***6Quad processor by NXP. For full-system simulation, mcQEMU runs at 35 MIPS for in-order architectures and 25 MIPS for out-of-order ones. In user-mode simulation, mcQEMU can achieve up to 65 MIPS.
ISBN: 9781450364942 (print)
Traditional software testing methods are inefficient when data inputs alone do not determine the outcome of a program's execution. To verify such software, testing is often complemented by analysis of the execution trace. To monitor the execution trace, most approaches today insert additional instructions at the binary level, which makes the monitoring intrusive. Binary instrumentation operates at a low level, making it difficult to properly modify a program's state and to quantify its code coverage. In this paper, we present a framework for testing complex embedded multithreaded software at the logical level. Testing software at this level avoids dependency on specific compilers and relates the execution to the source code, thus enabling coverage measurement. Our non-intrusive execution monitoring and control is implemented using the interpreter of the LLVM compiler infrastructure. Instead of forcing thread interleavings, we suggest simulating interleaving effects through non-intrusive changes of shared variables. This makes it possible to test a single thread without executing the full software stack, which is especially useful when the full software stack is not available (e.g., in pre-integration testing). We complement existing approaches with new features such as dynamic configuration of monitoring and execution rollback to checkpoints. Our approach introduces acceptable overhead without requiring any complex setup.
Network pruning is a promising compression technique for reducing the computation and memory access cost of deep neural networks. In this paper, we propose a novel group-level pruning method to accelerate deep neural networks on mobile GPUs, in which several adjacent weights are pruned as a group while high accuracy is maintained. Although several group-level pruning techniques have been proposed, previous techniques cannot achieve the desired accuracy at high sparsity. To address this, we propose an unaligned approach that improves the accuracy of the compressed model.
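For readers unfamiliar with group-level pruning, the minimal sketch below shows the conventional aligned baseline: each weight row is cut into consecutive groups of `group_size` adjacent weights starting at column 0, and the fraction `sparsity` of groups with the smallest L2 norm is zeroed. The L2-norm criterion and the function name are assumptions for illustration, and the paper's unaligned approach, which relaxes where groups may start, is not reproduced here.

```python
import math


def group_prune(weights, group_size, sparsity):
    """Aligned group-level magnitude pruning of a 2-D weight matrix (list of rows).

    Each row is cut into consecutive groups of `group_size` weights starting at
    column 0; the fraction `sparsity` of groups with the smallest L2 norm is zeroed.
    """
    groups = []  # (L2 norm, row index, start column)
    for r, row in enumerate(weights):
        if len(row) % group_size:
            raise ValueError("row length must be a multiple of group_size")
        for s in range(0, len(row), group_size):
            norm = math.sqrt(sum(v * v for v in row[s:s + group_size]))
            groups.append((norm, r, s))
    groups.sort()  # smallest norms first; ties broken by position
    pruned = [list(row) for row in weights]
    for _, r, s in groups[:int(sparsity * len(groups))]:
        pruned[r][s:s + group_size] = [0.0] * group_size
    return pruned


w = [[1.0, 1.0, 0.1, 0.1],
     [0.2, 0.2, 3.0, 3.0]]
print(group_prune(w, group_size=2, sparsity=0.5))
# [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 3.0, 3.0]]
```

Because whole groups of adjacent weights become zero, the surviving nonzeros stay in contiguous runs, which is what makes such sparsity easier to exploit on GPUs than scattered individual zeros.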
ISBN: 9781728169262 (electronic)
ISBN: 9781728169279 (print)
Driving behavior estimation in car-following scenarios based on contextual traffic information is an essential capability for autonomous driving systems. Real-time motion planning based on incomplete environment perception requires complex probabilistic models of interactions with surrounding objects and road conditions. Hidden Markov Models (HMMs) with Gaussian emissions have been used to model driving behaviors because of their ability to infer unobserved states. Because the high-dimensional contextual data is processed continuously, the system must be high-performance and power-efficient to make real-time decisions for safe operation. Field Programmable Gate Arrays (FPGAs) are increasingly used in embedded Systems-on-Chip (SoCs) for mobile applications, mainly because of their parallel computation and low power consumption. This paper implements FAuto, a framework coupling an HMM with a Gaussian Mixture Model (GMM), on a Xilinx PYNQ-Z2 board for autonomous systems. We design the hybrid GMM-HMM model in Python and train it on a CPU platform using Next Generation Simulation (NGSIM) trajectory data. The hardware accelerator is designed with Vivado HLS 2018.2 and verified with a Jupyter notebook. FAuto achieves a power efficiency of 2.59 TOPS/W and a 10.39× speedup compared to a Python software implementation running on a quad-core i7-7500U CPU.
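To show what inference in a GMM-HMM involves, here is a minimal sketch of the scaled forward algorithm with Gaussian-mixture emissions. It returns the sequence log-likelihood and the filtered posterior over hidden states. The one-dimensional observations, the parameter layout, and all names and values are assumptions for illustration; FAuto works on high-dimensional contextual features, and this is not its HLS kernel.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_emission(x, weights, means, variances):
    """Emission probability of x under one state's Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def forward_log_likelihood(obs, start, trans, emissions):
    """Scaled forward algorithm for an HMM whose states emit Gaussian mixtures.

    emissions[s] = (weights, means, variances) for state s; trans[i][j] = p(j | i).
    Returns (log p(obs), filtered posterior p(state_T | obs)).
    """
    n = len(start)
    alpha = [start[s] * gmm_emission(obs[0], *emissions[s]) for s in range(n)]
    scale = sum(alpha)
    log_lik = math.log(scale)
    alpha = [a / scale for a in alpha]  # rescale each step to avoid underflow
    for x in obs[1:]:
        b = [gmm_emission(x, *emissions[s]) for s in range(n)]
        alpha = [b[j] * sum(alpha[i] * trans[i][j] for i in range(n)) for j in range(n)]
        scale = sum(alpha)
        log_lik += math.log(scale)  # product of scales equals p(obs)
        alpha = [a / scale for a in alpha]
    return log_lik, alpha


# Hypothetical two-state model: state 0 is a two-component mixture, state 1 a single Gaussian.
emissions = [([0.6, 0.4], [0.0, 1.0], [1.0, 0.5]),
             ([1.0], [5.0], [1.0])]
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
log_lik, posterior = forward_log_likelihood([0.2, 0.5, 4.8, 5.1], start, trans, emissions)
print(log_lik, posterior)
```

Each time step needs one mixture evaluation per state plus a small matrix-vector product, and both are independent across states. That independence is the kind of parallelism an FPGA implementation can exploit.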