Modern embedded systems require design methodologies that are productive, yet still harvest continuing innovations in embedded processor and sensing technologies. This course outlines techniques for mastering such designs using new tools and software abstractions based on the ARM Cortex-M processor family. Methodologies are presented that enhance productivity in large-scale designs while also exploiting the advanced hardware accelerators present in modern embedded processors. We then elaborate on ways to design and use platforms for networked systems that build on multi-sensor integration functions.
As modern embedded systems such as cars need high-power integrated CPUs-GPU SoCs for various real-time applications such as lane or pedestrian detection, they face greater thermal problems than before, which may, in turn, incur higher failure rates and cooling costs. We demonstrate, via experimentation on a representative CPUs-GPU platform, the importance of accounting for two distinct thermal characteristics in real-time scheduling: the platform's temperature imbalance and the different power dissipation of different tasks. Accounting for both avoids bursts of power dissipation while guaranteeing all timing constraints. To achieve this goal, we propose a new Real-Time Thermal-Aware Scheduling (RT-TAS) framework. We first capture the different CPU core temperatures caused by different GPU power dissipations (i.e., CPUs-GPU thermal coupling) with core-specific thermal coupling coefficients. We then develop thermally-balanced task-to-core assignment and CPUs-GPU co-scheduling. The former addresses the platform's temperature imbalance by efficiently distributing the thermal load across cores while preserving scheduling feasibility. Building on the thermally-balanced task assignment, the latter cooperatively schedules CPU and GPU computations to avoid simultaneous peak power dissipation on both CPUs and GPU, thus mitigating excessive temperature rises while meeting task deadlines. We have implemented and evaluated RT-TAS on an automotive embedded platform, demonstrating that it reduces the maximum temperature by 6-12.2 degrees C over existing approaches without violating any task deadline.
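The thermally-balanced assignment can be pictured as a greedy heuristic: place the hottest tasks first, each on the core whose coupling-weighted thermal load is currently lowest. This is an illustrative approximation, not the RT-TAS algorithm itself; the task powers and coupling coefficients below are hypothetical.

```python
def assign_tasks(tasks, coupling):
    """Greedy thermally-balanced task-to-core assignment (illustrative sketch).

    tasks: list of (name, power_estimate) pairs.
    coupling: per-core thermal coupling coefficient (hypothetical values);
              a larger coefficient means the core heats up more per watt.
    """
    loads = [0.0] * len(coupling)  # coupling-weighted thermal load per core
    assignment = {}
    # Hottest-first placement keeps the per-core loads balanced.
    for name, power in sorted(tasks, key=lambda t: -t[1]):
        core = min(range(len(coupling)),
                   key=lambda c: loads[c] + coupling[c] * power)
        loads[core] += coupling[core] * power
        assignment[name] = core
    return assignment, loads

# Hypothetical workload: two vision tasks and a logging task on two cores,
# where core 1 couples more strongly to GPU-induced heat.
assignment, loads = assign_tasks(
    [("lane", 3.0), ("ped", 2.0), ("log", 1.0)], coupling=[1.0, 1.2])
```

A real scheduler would additionally check schedulability (e.g., utilization bounds) before committing each placement, which the paper's feasibility-preserving assignment does.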
Treatment of patients using high-quality precision medicine requires a thorough understanding of a patient's genetic composition. Ideally, the identification of unique variations in an individual's genome is needed to specify the necessary treatment. A variant calling workflow is a pipeline of tools integrating state-of-the-art software for alignment, sorting, and variant calling on whole genome sequencing (WGS) data. This pipeline is used to identify unique variations in an individual's genome relative to a reference genome. Currently, such workflows run on high-performance computers (with additional GPUs or FPGAs) or in the cloud. Such systems are large, costly, and rely on the internet for genome data transfer, which makes them unusable in remote locations without internet connectivity; processing at a different facility also raises privacy concerns. To overcome these limitations, in this paper we present, for the first time, a cost-efficient, offline, scalable, portable, and energy-efficient computing system named SWARAM for variant calling workflow processing. The system uses a novel architecture and algorithms that match against partial reference genomes to exploit the smaller memory sizes typically available in tiny processing systems. Extensive tests on a standard benchmark dataset (NA12878 Illumina platinum genome) confirm that the time consumed for data transfer and completing the variant calling workflow on SWARAM was competitive with a 32-core Intel Xeon server at similar accuracy, while costing less than a fifth and consuming less than 40% of the energy of the server system. The original scripts and code we developed for executing the variant calling workflow on SWARAM are available in the associated GitHub repository https://***/Rammohanty/swaram.
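The partial-reference idea can be illustrated with a toy sketch: split the reference into overlapping segments so that only one segment must be memory-resident at a time, then search each segment independently. This is not the SWARAM implementation (real aligners use indexed, approximate matching); the segment sizes and sequences below are made up, and the overlap must be at least the read length minus one so reads spanning a segment boundary are still found.

```python
def partition(reference, segment_len, overlap):
    """Yield (offset, segment) pairs covering the reference with overlap."""
    step = segment_len - overlap
    for start in range(0, len(reference), step):
        yield start, reference[start:start + segment_len]
        if start + segment_len >= len(reference):
            break

def align_reads(reference, reads, segment_len=20, overlap=7):
    """Exact-match reads against one resident segment at a time (toy model)."""
    hits = {}
    for offset, segment in partition(reference, segment_len, overlap):
        for read in reads:
            pos = segment.find(read)
            if pos != -1:
                # Translate segment-local position back to a global offset.
                hits.setdefault(read, set()).add(offset + pos)
    return hits

# Toy example: a 30-base "reference" scanned in 20-base segments.
hits = align_reads("ACGTACGTGGCCAACGTTACGATCGGATCC", ["GGCCAA"])
```

The memory footprint is bounded by one segment rather than the whole reference, which is the property that lets small devices process genome-scale data.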
Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computational load and the number of parameters compared to conventional convolution layers. Many deep neural network (DNN) accelerators adopt architectures that exploit the high data-reuse factor of DNN computations, such as systolic arrays. However, depthwise convolutions have a low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts PE utilization for depthwise convolutions on a systolic array with minimal overhead. In addition, the PEs in systolic arrays can be used efficiently only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange data items and manage data movement during DNN computations. RiSA instead provides a lightweight set of tensor management tasks within the PE array itself, eliminating the need for an additional tensor-reshaping module. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models, while maintaining high area efficiency. Compared to Eyeriss v2, RiSA improves the area and energy efficiency for MobileNet-V1 inference by 1.91x and 1.31x, respectively.
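The reuse gap the paper addresses can be made concrete with a back-of-the-envelope multiply-accumulate (MAC) count: a depthwise layer needs a factor of c_out fewer MACs than a standard convolution of the same spatial size because it performs no cross-channel accumulation, and that same missing accumulation is why it offers so little weight reuse to a systolic array. The layer dimensions below are illustrative, not taken from the paper.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def depthwise_conv_macs(h, w, c, k):
    """MACs for a depthwise k x k convolution: one filter per channel,
    so there is no c_in * c_out cross-channel term."""
    return h * w * c * k * k

# Illustrative MobileNet-like layer: 56x56 map, 128 channels, 3x3 kernels.
std = standard_conv_macs(56, 56, 128, 128, 3)
dw = depthwise_conv_macs(56, 56, 128, 3)
ratio = std // dw  # equals c_out: the computation (and reuse) factor
```

The same factor that makes the layer cheap also means each weight is reused across only one input channel, leaving most PEs of a reuse-oriented array idle, which is the utilization problem RiSA targets.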
ISBN (print): 1424402115
This paper presents an efficient design methodology and a systematic approach for implementing squaring functions on large integers using small-size embedded multipliers. A general architecture of the squarer and a set of equations are derived to aid the realization. The inputs of the squarer are split into several segments, leading to efficient utilization of the small-size embedded multipliers and a reduced number of required addition operations. Various benchmarks were tested with the number of segments ranging from 2 to 5, targeting a Xilinx Spartan-3 FPGA. Synthesis was performed with the Xilinx ISE 7.1 XST tool, and our approach was compared with the traditional technique using the same tool. The results show that our design approach is very efficient in terms of both timing and area: the combinational delay is reduced by an average of 15.8%, and the area saving is about 50% in terms of the number of slices and 4-input LUTs. The number of required embedded multipliers is also reduced by an average of 32.3% compared to the traditional technique.
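The segmentation rests on the identity (a·2^m + b)² = a²·2^(2m) + 2ab·2^m + b²: one wide square becomes three narrow products, each of which fits a small embedded multiplier. A minimal two-segment software sketch of this identity (the paper's hardware design generalizes to 2 to 5 segments, and the segment width m here is an arbitrary choice):

```python
def square_two_segments(x, m):
    """Square x by splitting it at bit m into high (a) and low (b) segments:
    x = a*2^m + b, so x^2 = a^2*2^(2m) + 2*a*b*2^m + b^2.
    Only the three narrow products a*a, a*b, b*b are 'multiplier' work;
    the rest is shifts and additions."""
    a, b = x >> m, x & ((1 << m) - 1)
    return (a * a << (2 * m)) + ((a * b) << (m + 1)) + b * b

# The identity is exact for any m, e.g. a 16-bit square from 8-bit halves:
result = square_two_segments(0xBEEF, 8)
```

In the FPGA realization each narrow product maps to one embedded multiplier block, and the shift-and-add recombination is what the derived set of equations organizes.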
We present an architectural approach toward energy-efficient synthesis of circuits used in neural processing units. Neural network applications are shown to tolerate varying operand precisions across different inputs, accuracy targets, execution phases, and learning methods without significantly impacting classification accuracy. Using multiple instances of systolic arrays at different precisions, we show that significant energy gains are possible beyond the conventional approach of using the same circuit for all precisions.
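One way to picture the dispatch decision is a software model that quantizes an operand at several fixed-point widths and picks the narrowest one meeting an accuracy target; in hardware the analogue would be routing that input to the matching systolic-array instance. The widths and tolerance below are hypothetical, not taken from the paper.

```python
def quantize(x, bits):
    """Symmetric fixed-point quantization of x in [-1, 1] to 'bits' bits."""
    scale = (1 << (bits - 1)) - 1
    return round(x * scale) / scale

def pick_precision(x, tolerance, widths=(4, 8, 16)):
    """Return the narrowest width whose quantization error is within
    tolerance; fall back to the widest available array otherwise."""
    for bits in widths:
        if abs(quantize(x, bits) - x) <= tolerance:
            return bits
    return widths[-1]

# Loose accuracy targets dispatch to the cheap 4-bit array; tight ones
# fall through to the 16-bit array.
loose = pick_precision(0.5, tolerance=0.1)
tight = pick_precision(0.123, tolerance=0.001)
```

The energy gain comes from the fact that most operands, under most accuracy targets, take the narrow path, so the wide (expensive) array is exercised only when needed.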