ISBN:
(Print) 9781538634370
Crafting accelerators using reconfigurable hardware is a promising way to achieve improved performance and power/energy efficiency. However, deploying reconfigurable accelerators is still cumbersome, as it involves overall system integration issues and runtime reconfigurable resource management. We describe the design and implementation of RACOS, a Reconfigurable ACcelerator OS that provides a simple and intuitive software interface to load/unload reconfigurable hardware accelerators and perform data I/O transparently to the user. Multiple partially reconfigurable regions are supported, and each region can host either single- or dual-threaded accelerators, effectively virtualizing the reconfigurable resources. RACOS allows multiple applications to use one or more accelerators each, and schedules accelerators for execution according to four policies: simple and in-order, which respect the order of requests, and out-of-order and forced, which aim to reduce the number of reconfigurations. We evaluate our proposed system varying the number of instances of an accelerated application and show that, despite its generality, RACOS can achieve both high reconfiguration and data communication throughput, close to the maximum reported in the literature, at a very small resource cost, comparable to or better than the current state of the art.
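A rough Python sketch of the difference between the order-respecting and reconfiguration-avoiding policies (the queue model, single-region eviction, and function names are illustrative assumptions, not RACOS internals):

```python
from collections import deque

def schedule(requests, regions, policy="in_order"):
    """Simulate accelerator scheduling and count bitstream reconfigurations.
    requests: accelerator names in arrival order.
    regions: currently loaded accelerator per region (None = empty)."""
    queue = deque(requests)
    loaded = list(regions)
    reconfigs = 0
    order = []
    while queue:
        if policy == "out_of_order":
            # prefer a pending request whose bitstream is already loaded
            pick = next((r for r in queue if r in loaded), queue[0])
            queue.remove(pick)
        else:  # in_order: strictly respect arrival order
            pick = queue.popleft()
        if pick not in loaded:
            loaded[0] = pick  # naive eviction: always reconfigure region 0
            reconfigs += 1
        order.append(pick)
    return order, reconfigs
```

With requests A, B, A, B and accelerator A already loaded, the in-order policy reconfigures three times while the out-of-order policy reconfigures once, illustrating the trade-off the abstract describes.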
ISBN:
(Print) 9781538634370
Due to its computational complexity, the Scale-Invariant Feature Transform (SIFT) algorithm poses a challenge for use in embedded applications. To meet real-time requirements at low power, hardware acceleration is necessary. This paper presents an FPGA-based balanced processor system for real-time SIFT feature detection, containing a dedicated hardware coprocessor coupled to a custom VLIW soft-core processor via a FIFO memory. The coprocessor calculates the scale-space and performs the extrema detection for the extraction of feature candidates, whereas the VLIW soft-core processor performs sub-pixel localization and stability checks to obtain stable SIFT features. The system achieves a peak frame rate of up to 338 fps on 1,024x376 px images at less than 3 W on a Xilinx Virtex-6 FPGA. The filters within the Gaussian pyramid operate in a time-multiplexed scheme at clock frequencies of up to 400 MHz. Furthermore, this paper presents a comprehensive design space exploration, evaluating architectural performance, hardware resources, and power consumption trade-offs, as well as exposing performance-balanced and Pareto-optimal design variants.
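The coprocessor's two stages, scale-space construction and extrema detection, can be sketched in software as follows (sigma values, kernel radii, and the contrast threshold are illustrative; the paper's hardware operates on streaming pixels, not whole frames):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    # separable Gaussian filter: rows first, then columns
    k = gaussian_kernel(sigma, int(3 * sigma) + 1)
    tmp = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, k, mode="same")

def dog_extrema(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=1e-3):
    # scale-space followed by 3x3x3 extrema detection, the two stages
    # the paper assigns to the hardware coprocessor
    scales = [blur(img, s) for s in sigmas]
    dogs = np.stack([b - a for a, b in zip(scales, scales[1:])])
    kps = []
    for s in range(1, dogs.shape[0] - 1):
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                cube = dogs[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                v = dogs[s, y, x]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    kps.append((y, x))
    return kps
```

On a synthetic Gaussian blob, the blob center shows up as a difference-of-Gaussians extremum, which is the kind of feature candidate handed off to the soft-core processor for refinement.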
ISBN:
(Digital) 9781728109459
ISBN:
(Print) 9781728109466
Deep neural networks (DNNs) have obtained compelling performance on many visual tasks at the cost of a significant increase in computation and memory consumption, which severely impedes their application on resource-constrained systems like smart mobiles or embedded devices. To solve these problems, recent efforts toward compressing DNNs have received increased focus. In this paper, we propose an effective end-to-end channel pruning approach to compress DNNs. To this end, firstly, we introduce additional auxiliary classifiers to enhance the discriminative power of shallow and intermediate layers. Secondly, we impose L1-regularization on the scaling factors and shifting factors in the batch normalization (BN) layers, and adopt the fast iterative shrinkage-thresholding algorithm (FISTA) to effectively prune the redundant channels. Finally, by forcing the selected factors to zero, we can safely prune the corresponding unimportant channels, thus obtaining a compact model. We empirically reveal the prominent performance of our approach with several state-of-the-art DNN architectures, including VGGNet and MobileNet, on different datasets. For instance, on the CIFAR-10 dataset, the pruned MobileNet achieves a 26.9x reduction in model parameters and a 3.9x reduction in computational operations with only a 0.04% increase in classification error.
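A minimal numpy sketch of the shrinkage-and-prune idea, assuming the standard ISTA/FISTA proximal step for the L1 penalty (function names and the pruning threshold are illustrative, not the paper's code):

```python
import numpy as np

def soft_threshold(gamma, lam, lr):
    # proximal (shrinkage) step used by ISTA/FISTA for the L1 penalty on
    # the BN scaling factors: small factors are driven exactly to zero
    return np.sign(gamma) * np.maximum(np.abs(gamma) - lr * lam, 0.0)

def prune_channels(weights, gamma, beta, eps=1e-4):
    # channels whose scaling factor was forced to (near) zero are removed,
    # yielding a structurally smaller, compact layer
    keep = np.abs(gamma) > eps
    return weights[keep], gamma[keep], beta[keep], keep
```

For example, scaling factors [1.0, 0.05, -0.3, 0.02] shrunk with step 0.1 become [0.9, 0.0, -0.2, 0.0], so the second and fourth channels can be pruned safely.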
ISBN:
(Print) 9781538634370
The Number Theoretic Transform (NTT) is a necessary part of most lattice-based cryptographic schemes. In particular, it offers an efficient means to achieve polynomial multiplication within the more efficient ring-based schemes. The NTT is also a crucial component whose implementation is critical, since it is often the bottleneck and the most resource-consuming block of the whole design. As a result, the NTT is an appealing target for exploring different architectures and design trade-offs. In this paper, we compare various optimization strategies applied to maximize performance or to reduce resource utilization. Our analysis covers general-purpose processors as well as dedicated hardware implemented on reconfigurable platforms and on ASICs. Previously explored design strategies range from the traditional computation, where the multiplicative factors (called twiddle factors) are calculated on the fly, versus a memory trade-off (using memory to store pre-computed twiddle factors), to the use of different butterfly designs for implementing the Fast Fourier Transform and its inverse in software, and the sharing of resources between hardware implementations of the forward and inverse NTT. The problem of side-channel resistance is also addressed, discussing designs which are robust against power analysis attacks.
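For reference, a minimal Python NTT with precomputed twiddle factors, using the schoolbook O(n^2) form rather than the butterfly networks the paper compares (the parameters q = 17, omega = 9, n = 8 are a toy instance, far smaller than cryptographic sizes):

```python
def ntt(a, omega, q):
    # textbook O(n^2) transform; a real design would use an FFT-style
    # butterfly network with log(n) stages instead
    n = len(a)
    tw = [pow(omega, k, q) for k in range(n)]  # precomputed twiddle factors
    return [sum(a[j] * tw[(i * j) % n] for j in range(n)) % q
            for i in range(n)]

def intt(A, omega, q):
    # inverse transform: same kernel with omega^-1, scaled by n^-1
    n = len(A)
    a = ntt(A, pow(omega, q - 2, q), q)
    n_inv = pow(n, q - 2, q)
    return [(x * n_inv) % q for x in a]

def polymul(a, b, omega, q):
    # pointwise product in the NTT domain realizes polynomial
    # multiplication modulo x^n - 1 (cyclic convolution)
    A, B = ntt(a, omega, q), ntt(b, omega, q)
    return intt([x * y % q for x, y in zip(A, B)], omega, q)
```

Storing `tw` up front versus recomputing each `pow(omega, k, q)` inside the loop is exactly the memory-versus-computation trade-off the paper explores for twiddle factors.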
ISBN:
(Print) 9783319992297; 9783319992280
Autonomous Vehicles (AVs) are expected to provide relevant benefits to society in terms of safety, efficiency, and accessibility. However, AVs are safety-critical systems, and it is mandatory to assure that they will be safe when operating on public roads; yet the safety of AVs is still an open and challenging issue. A combination of simulation, test-track, and on-road testing approaches is recommended to validate AV safety performance. Testing AVs in real-world scenarios is a widely used, but neither an efficient nor a safe, approach to validating safety. Therefore, simulation-based approaches are in demand. Motivated by this challenge, we have developed a simulation-based safety analysis framework, based on open-source tools, aimed at future road transportation systems. However, the open-source tools we have adopted for the framework have limitations in modeling real-world elements, especially perception sensors. We thus present the extensions made to these open-source tools, focused on the development of a perception sensor model in the native OpenDS tool, which enables detecting obstacles around the vehicle, considering the same main characteristics observed in Radar and LiDAR sensors. As the main conclusion, these tool enhancements have improved the capabilities of the simulation-based safety analysis framework for modeling, simulating, and analyzing, in a more precise way and for safety validation purposes, the behavior of AVs in simulated traffic scenarios when different embedded detection sensor characteristics are considered in their deployment.
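A minimal sketch of such a perception sensor model, reduced to a range and field-of-view test (the pose convention and function names are assumptions, not the actual interface of the OpenDS extension):

```python
import math

def detect(vehicle, heading_deg, obstacles, max_range, fov_deg):
    """Idealized Radar/LiDAR-style sensor: return the obstacles that fall
    inside the sensor's maximum range and horizontal field of view."""
    vx, vy = vehicle
    hits = []
    for ox, oy in obstacles:
        dx, dy = ox - vx, oy - vy
        dist = math.hypot(dx, dy)
        bearing = math.degrees(math.atan2(dy, dx)) - heading_deg
        bearing = (bearing + 180) % 360 - 180  # normalize to [-180, 180)
        if dist <= max_range and abs(bearing) <= fov_deg / 2:
            hits.append((ox, oy, dist))
    return hits
```

Varying `max_range` and `fov_deg` corresponds to studying how different embedded detection sensor characteristics affect AV behavior in the simulated traffic scenarios.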
ISBN:
(Print) 9781538634370
Spark is one of the most widely used frameworks for data analytics that offers fast development of applications like machine learning and graph computations in distributed systems. In this paper, we present SPynq: a framework for the efficient utilization of hardware accelerators over the Spark framework on heterogeneous MPSoC FPGAs, such as Zynq. Spark has been mapped to the Pynq platform, and the proposed framework allows the seamless utilization of the programmable logic for the hardware acceleration of computationally intensive Spark kernels. We have also developed the required libraries in Spark that hide the accelerator details, minimizing the design effort needed to utilize the accelerators. A cluster of 4 nodes (workers) based on the all-programmable MPSoCs has been implemented, and the proposed platform is evaluated on a typical machine learning application based on logistic regression. The logistic regression kernel has been developed as an accelerator and incorporated into Spark. The developed system is compared to a high-performance Xeon cluster that is typically used in cloud computing. The performance evaluation shows that the heterogeneous accelerator-based MPSoC can achieve up to 2.3x system speedup compared with a Xeon system (at 90% accuracy) and 20x better energy efficiency. For embedded applications, the proposed system can achieve up to 40x speedup compared to a software-only implementation on low-power embedded processors, with 30x lower energy consumption.
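The library idea of hiding the accelerator behind an unchanged call can be sketched as follows (class and method names are hypothetical; the FPGA path is stubbed out, since bitstream loading is platform-specific):

```python
import numpy as np

def lr_gradient_sw(X, y, w):
    # software fallback: logistic-regression gradient on the CPU
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

class AccelLRGradient:
    """Hypothetical wrapper in the spirit of SPynq: user code always calls
    gradient(); whether it runs on the FPGA fabric or in software is
    hidden behind the same interface."""
    def __init__(self, use_fpga=False):
        self.use_fpga = use_fpga  # a real port would load a bitstream here

    def gradient(self, X, y, w):
        if self.use_fpga:
            raise NotImplementedError("FPGA path is platform-specific")
        return lr_gradient_sw(X, y, w)
```

The point of such a wrapper is that Spark application code stays identical whether the kernel executes on the Xeon cluster, the ARM cores, or the programmable logic.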
ISBN:
(Print) 9781538634370
Transport Triggered Architecture (TTA) processors allow unique low-level compiler optimizations such as software bypassing and operand sharing. Previously, these optimizations have mostly been performed inside single basic blocks, leaving much of their potential unused. In this work, software bypassing and operand sharing are integrated with loop scheduling, allowing optimizations across loop iteration boundaries. This considerably reduces register file accesses and immediate value transfers in tight loops, in some cases even eliminating all register file accesses from the loop body. In the 12 benchmarked small loops, compared to traditional VLIW-style processors, on average 63% of register file reads and 77% of register file writes could be eliminated. Compared to a compiler which performs these optimizations only inside a basic block, on average 58% of register file reads and 28% of register file writes could be eliminated. The additional register access reductions allow both direct energy savings from fewer register accesses and indirect energy savings by allowing the use of simpler register files with fewer read and write ports and a simpler interconnect network with fewer transport buses.
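A toy model of how bypassing removes register file traffic (single-assignment registers and a one-instruction-deep bypass are simplifying assumptions; a real TTA compiler works on scheduled transport-level code):

```python
def count_rf_accesses(instrs, bypass=False):
    """instrs: list of (dest, srcs) over register names. Counts register
    file accesses, optionally applying a one-cycle-deep software bypass:
    a read of the previous instruction's result comes straight off the
    function-unit port, and a write whose only consumer is the next
    instruction is dropped entirely."""
    uses = {}
    for _, srcs in instrs:
        for s in srcs:
            uses[s] = uses.get(s, 0) + 1
    reads = writes = 0
    for i, (dest, srcs) in enumerate(instrs):
        prev_dest = instrs[i - 1][0] if i > 0 else None
        for s in srcs:
            if not (bypass and s == prev_dest):
                reads += 1  # read must go through the register file
        sole_next_use = (i + 1 < len(instrs) and uses.get(dest, 0) == 1
                         and dest in instrs[i + 1][1])
        if not (bypass and sole_next_use):
            writes += 1  # result must be committed to the register file
    return reads, writes
```

In a three-instruction chain, bypassing cuts reads from 6 to 3 and writes from 3 to 2; extending the same idea across loop iteration boundaries is what lets the paper eliminate register file accesses from entire loop bodies.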
ISBN:
(Print) 9781538634370
Deep neural networks have been widely applied in many areas, such as computer vision, natural language processing, and information retrieval. However, due to their high computation and memory demands, deep learning applications have not been adopted in edge learning. In this paper, we exploit the sparsity in tensors to reduce computation overheads and memory demands. Unlike other approaches, which rely on hardware accelerator designs or sacrifice model accuracy for performance by pruning parameters, we adaptively partition and deploy the workload to heterogeneous devices to reduce computation and memory requirements and increase computing efficiency. We implemented our partitioning algorithms in Google's TensorFlow and evaluated them on an AMD Kaveri system, an HSA-based heterogeneous computing system. Our method effectively reduces computation time, cache accesses, and cache miss rates without impacting the accuracy of the learning models. Our approach achieves 66% and 88% speedup for the lenet-5 model and the lenet-1024-1024 model, respectively. For memory traffic, our approach reduces instruction cache references by 71% and data cache references by 32%. Our system also improves the cache miss rate from 1.6% to 0.5% during training of the lenet-1024-1024 model.
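The partitioning idea can be sketched for a single sparse matrix-vector product: dense rows go to a throughput-oriented device as a plain GEMV, sparse rows to a device that skips zeros (the density threshold and function names are illustrative, not the paper's TensorFlow implementation):

```python
import numpy as np

def partition_by_sparsity(W, threshold=0.5):
    # per-row nonzero density decides which device handles each row
    density = (W != 0).mean(axis=1)
    return density >= threshold  # True = dense partition

def matvec_partitioned(W, x, dense_mask):
    y = np.zeros(W.shape[0])
    y[dense_mask] = W[dense_mask] @ x          # dense device: plain GEMV
    for i in np.where(~dense_mask)[0]:         # sparse device: skip zeros
        idx = np.nonzero(W[i])[0]
        y[i] = W[i, idx] @ x[idx]
    return y
```

Because both partitions compute exact dot products, the result matches the unpartitioned product, which mirrors the paper's claim of reduced work without any accuracy loss.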
ISBN:
(Print) 9781538634370
Disaggregated computing aims at overcoming the problem of fixed resource proportionality in existing infrastructures while advancing resource allocation to virtual machines, which is currently restricted by the physical boundaries of a server tray. Organizing resources into large homogeneous pools (e.g., compute, memory, accelerators) enables the demand-driven, fine-grained allocation of resources, effectively leading to improved resource utilization and significant power savings. However, the success of this approach relies on how efficiently the underlying resources are utilized by the software application. To facilitate software development in disaggregated computing environments, we introduce a versatile multi-FPGA evaluation platform that can serve as an early exploration tool for the involved trade-offs and execution alternatives given the application at hand. To increase the functionality of the proposed development/evaluation platform, we consider three types of building blocks, namely compute, memory, and accelerator blocks, providing the developer with the option to instantiate and interconnect them in proportion to the application demands, thus facilitating both compute- and memory-intensive applications. We have implemented a fully fledged prototype platform, based on three interconnected Zynq boards, and rely on a thin user-level API to allocate compute and memory resources on remote blocks, transfer data, and deploy reconfigurable accelerators. As a case study, we employ one of the Seven Dwarfs of Symbolic Computation, the matrix multiply benchmark.
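A mock of what such a thin user-level API might look like (all names are illustrative; the real prototype talks to remote Zynq boards over an interconnect rather than to local Python objects):

```python
class RemoteBlock:
    """Mock of a thin disaggregated-resource API: allocate memory on a
    remote block, transfer data in and out, and deploy/invoke an
    accelerator on it. Names here are assumptions, not the paper's API."""
    def __init__(self, size):
        self.mem = bytearray(size)  # stand-in for remote block memory
        self.accels = {}

    def write(self, off, data):
        self.mem[off:off + len(data)] = data  # host-to-block transfer

    def read(self, off, n):
        return bytes(self.mem[off:off + n])   # block-to-host transfer

    def deploy(self, name, fn):
        self.accels[name] = fn  # stand-in for loading a bitstream

    def invoke(self, name, off, n):
        # run the deployed accelerator in place on a memory region
        out = self.accels[name](self.read(off, n))
        self.write(off, out)
```

An application would instantiate one such handle per compute, memory, or accelerator block and compose them in proportion to its demands, which is the exploration workflow the platform is meant to support.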
ISBN:
(Print) 9783319866154
This volume highlights new trends and challenges in research on agents and the new digital and knowledge economy, and includes 23 papers classified into the following categories: business process management, agent-based modeling and simulation, and anthropic-oriented computing. All papers were originally presented at the 11th International KES Conference on Agents and Multi-Agent Systems: Technologies and Applications (KES-AMSTA 2017), held June 21-23, 2017 in Vilamoura, Algarve, Portugal. Today's economy is driven by technologies and knowledge. Digital technologies can free, shift, and multiply choices, and often intrude on the territory of other industries by providing new ways of conducting business operations and creating value for customers and companies. The topics covered in this volume include software agents, multi-agent systems, agent modeling, mobile and cloud computing, big data analysis, business intelligence, artificial intelligence, social systems, computer-embedded systems, and nature-inspired manufacturing, all of which contribute to the modern digital economy. The results presented here will be of theoretical and practical value to researchers and industrial practitioners working in the fields of artificial intelligence, collective computational intelligence, innovative business models, the new digital and knowledge economy and, in particular, agent and multi-agent systems, technologies, tools, and applications.