ISBN (digital): 9781728165585
ISBN (print): 9781728165592
In this work, we propose a novel device built on a bulk silicon substrate and evaluate it as a dynamic random access memory (DRAM) using TCAD simulation. Its operating principle is similar to that of the Z²-FET, which was previously demonstrated on SOI substrates. In our device, LDD doping and gate control are combined to build the carrier injection barriers that enable the feedback mechanism. The designed device shows a similarly sharp switch and gate-controlled hysteresis in its output characteristics. It is further demonstrated for DRAM application without the need for an extra capacitor. The DRAM operation shows high speed and a reasonably long retention time.
A unijunction transistor based on fully-depleted silicon-on-insulator substrate is proposed. The device structure is similar to a junction field effect transistor. By conducting the TCAD simulation, we observe sharp s...
ISBN (digital): 9781728193601
ISBN (print): 9781728193618
In this paper, we introduce a coding framework, VIP-ICT-Codec, based on the VTM (Versatile Video Coding Test Model). First, we propose a color space conversion from the RGB to the YUV domain using a PCA-like operation. A method for calculating the PCA mean is proposed to de-correlate the residual components of the YUV channels. In addition, the correlation between the U and V components is compensated, since they share the same coding tree in VVC. We also learn a residual mapping to alleviate the over-filtering and under-filtering of specific images. Finally, we treat rate control as an unconstrained Lagrangian problem to reach the target bpp. The results show that we achieve 32.625 dB in the validation phase.
PCB routing becomes time-consuming as the complexity of PCB design increases. Unlike traditional schemes that treat the two essential PCB routing processes, escape routing and bus routing, separately, we consider the continuity between them and present a golden-pin-based routing scheme to find the desired solution under angle and topology constraints. Further, conventional rip-up and reroute methods are often ineffective and inefficient for congestion alleviation and routability optimization. We construct a component graph by modeling components as vertices and apply a minimum-weight vertex cover method to improve routability. A self-adaptable ordering method is presented for escape routing to arrange the pin order on the component boundary, guaranteeing successful bus routing. In addition, escape routing is performed with a disjoint path method. We construct a dynamic Hanan grid in bus routing and use a novel congestion adjustment technique to improve solution quality. Compared with FreeRouting and Allegro, the experimental results show that our algorithm achieves high routability and a significant 90% runtime reduction.
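As a rough illustration of the grid structure mentioned in the abstract above, the following sketch builds a plain (static) Hanan grid from a set of pin coordinates. The abstract does not describe how the grid is made dynamic or how congestion is adjusted, so the function names `hanan_points` and `hanan_edges` and the sample pins are hypothetical and cover only the textbook construction.

```python
from itertools import product


def hanan_points(pins):
    """Return all Hanan grid points induced by a list of (x, y) pins.

    The Hanan grid is formed by drawing a horizontal and a vertical
    line through every pin and taking all intersections.
    """
    xs = sorted({x for x, _ in pins})
    ys = sorted({y for _, y in pins})
    return list(product(xs, ys))


def hanan_edges(pins):
    """Return the grid edges that join neighbouring Hanan points."""
    xs = sorted({x for x, _ in pins})
    ys = sorted({y for _, y in pins})
    edges = []
    # Horizontal segments between consecutive x coordinates on each row.
    for y in ys:
        for x0, x1 in zip(xs, xs[1:]):
            edges.append(((x0, y), (x1, y)))
    # Vertical segments between consecutive y coordinates on each column.
    for x in xs:
        for y0, y1 in zip(ys, ys[1:]):
            edges.append(((x, y0), (x, y1)))
    return edges


if __name__ == "__main__":
    pins = [(0, 0), (2, 3), (5, 1)]  # hypothetical pin locations
    print(len(hanan_points(pins)), "grid points")
    print(len(hanan_edges(pins)), "grid edges")
```

A router would typically search for paths over these edges; a dynamic variant would presumably add or remove grid lines as nets are routed, which is not shown here.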
In this paper, we present an optimized face recognition algorithm for edge computing. We replace the original VGG16 with the more compact MobileNet to obtain a much smaller SSD model for face detection, and use traditional 2-D face key point detection to obtain an aligned face image. For face representation, we re-train SphereFace using FP16 instead of FP32 to further reduce resource consumption. We then build the face detection and face representation modules on a Movidius NCS2 device for acceleration, and build the face alignment module on a Raspberry Pi 3B+ embedded system. Experimental results show that our optimized face recognition increases processing speed by more than 60 times and performs real-time face recognition at 7.031 FPS with a high accuracy of 93%. Moreover, the whole computing system has a low power consumption of 6.7 W, 17 times lower than similar solutions using a CPU and GPU.
Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependencies between objects. Owing to the complexity of graph data, deep learning approaches for graph data have emerged. Graph convolution networks (GCNs) in particular try to replicate the success of CNNs on graph data by defining graph convolutions via graph spectral theory or spatial locality. This paper presents a new GCN accelerator for the graph convolution layers of ST-GCN [1], which has been successfully applied to action recognition. The accelerator breaks the graph convolution down into a convolution and a matrix multiplication with the adjacency matrix. To optimize power efficiency, the dataflow of the convolution is carefully designed, and a sparse matrix-vector multiplication is proposed to exploit the sparsity of the adjacency matrix. The accelerator is implemented on NSA.241 accelerator and reaches a peak performance of 46.0 GOP/s at 188 MHz with about 220 DSPs, achieving high DSP efficiency.
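To make the sparse matrix-vector multiplication step in the abstract above concrete, here is a minimal software sketch using the compressed sparse row (CSR) layout. The abstract does not state which sparse format or dataflow the hardware uses, so CSR, the helper names `dense_to_csr` and `csr_spmv`, and the small adjacency matrix are assumptions chosen for illustration.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix into CSR arrays.

    Returns (values, col_idx, row_ptr); only nonzero entries are stored.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    # Hypothetical 3-node graph with self-loops and one edge (0 <-> 2).
    adjacency = [
        [1, 0, 1],
        [0, 1, 0],
        [1, 0, 1],
    ]
    features = [1.0, 2.0, 3.0]  # one feature channel per node
    print(csr_spmv(*dense_to_csr(adjacency), features))
```

The work per output row is proportional to the number of nonzeros in that row, which is the property a sparse adjacency matrix lets an accelerator exploit.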
In versatile video coding (VVC), proposed by the Joint Video Exploration Team (JVET), the quad-tree with nested multi-type tree (QTMT) partition scheme has been adopted, building on the quadtree structure of high efficiency video coding (HEVC). The coding quality of VVC is better than that of HEVC, but the algorithm complexity has also increased greatly. In this work, we present an adaptive CU split decision for intra frames using a pooling-variable convolutional neural network (CNN) that targets various coding unit (CU) shapes. The shape-adaptive CNN is realized by a variable pooling layer size, which makes the most of the pooling layer in the CNN while retaining the original information. With the proposed CNN, whether to split a CU is decided by a single trained network, with the same architecture and parameters for CUs of multiple sizes. Moreover, with the proposed shape-based CNN training scheme, training samples of various sizes can be processed successfully. The CU-based network avoids full rate-distortion optimization for the CU split and also enables CU-level rate control. The experimental results show that the proposed method saves 33% of coding time with only a 0.99% Bjøntegaard Delta bitrate (BD-rate) increase.
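One way to read the variable pooling layer size described above is that the pooling window scales with the CU shape so that every CU yields a feature map of the same size. The sketch below illustrates that idea only; the use of max pooling, the 4×4 output size, and the name `variable_max_pool` are assumptions, since the abstract gives no such details.

```python
def variable_max_pool(block, out_h=4, out_w=4):
    """Max-pool a 2-D block to a fixed out_h x out_w grid.

    The pooling window (kh, kw) is derived from the input shape, so
    blocks of different sizes all produce the same output shape.
    """
    h, w = len(block), len(block[0])
    if h % out_h or w % out_w:
        raise ValueError("block dimensions must be multiples of the output size")
    kh, kw = h // out_h, w // out_w
    return [
        [
            max(
                block[r][c]
                for r in range(i * kh, (i + 1) * kh)
                for c in range(j * kw, (j + 1) * kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


if __name__ == "__main__":
    # Two hypothetical CU shapes: 8x16 and 32x8 (rows x columns).
    for h, w in [(8, 16), (32, 8)]:
        cu = [[r * w + c for c in range(w)] for r in range(h)]
        pooled = variable_max_pool(cu)
        print((h, w), "->", (len(pooled), len(pooled[0])))
```

Because the pooled output has a fixed shape, the layers after it can share one set of parameters across CU sizes, which matches the single-network claim in the abstract.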
Fast design turnaround is attracting considerable interest as the scale of IC designs increases. In this paper, we propose a generic parallel global placement strategy on CPUs to accelerate VLSI placement. Analytical placement methods with a nonlinear wirelength model achieve high-quality placements but involve numerous arithmetic computations. We decompose the essential operations, including the evaluation of the objective function and its gradients, into subtasks and parallelize them using a thread pool. Parallel reduction is introduced to avoid resource contention and thus further improve parallel efficiency. The combination of the two techniques leads to a significant speedup with respectable scalability. Experimental results demonstrate that the proposed algorithm achieves a 24× speedup on 32 CPU cores without significant wirelength degradation.
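The decomposition described above can be sketched as follows: nets are split into chunks, each worker accumulates its own partial objective and gradient, and the partial results are combined in a final reduction so that no worker writes to shared state. The abstract does not name the wirelength model, so the one-dimensional log-sum-exp (LSE) model, the function names, and the toy netlist are assumptions. In CPython the global interpreter lock limits the actual speedup of this pure-Python version; the sketch shows the structure rather than the performance.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def lse_wirelength_1d(coords, gamma=1.0):
    """LSE approximation of max(coords) - min(coords) and its gradient."""
    hi, lo = max(coords), min(coords)
    # Shift exponents by hi / lo for numerical stability.
    pos = [math.exp((c - hi) / gamma) for c in coords]
    neg = [math.exp((lo - c) / gamma) for c in coords]
    sp, sn = sum(pos), sum(neg)
    value = (hi + gamma * math.log(sp)) - (lo - gamma * math.log(sn))
    grad = [p / sp - n / sn for p, n in zip(pos, neg)]
    return value, grad


def partial_eval(nets, x, gamma):
    """Evaluate a chunk of nets into a private total and gradient."""
    total, grad = 0.0, [0.0] * len(x)
    for net in nets:
        value, g = lse_wirelength_1d([x[p] for p in net], gamma)
        total += value
        for pin, gp in zip(net, g):
            grad[pin] += gp
    return total, grad


def parallel_wirelength(nets, x, gamma=1.0, workers=4):
    """Evaluate all nets with a thread pool, then reduce the partials."""
    chunks = [nets[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: partial_eval(c, x, gamma), chunks))
    total = sum(t for t, _ in partials)
    grad = [sum(col) for col in zip(*(g for _, g in partials))]
    return total, grad


if __name__ == "__main__":
    x = [0.0, 1.0, 3.0, 6.0]    # hypothetical cell x-coordinates
    nets = [[0, 1], [1, 2, 3]]  # each net lists the cells it connects
    print(parallel_wirelength(nets, x))
```

Giving each worker a private gradient buffer trades extra memory for the absence of locks or atomic updates, which is the contention the reduction step is meant to avoid.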
As an essential step in High-Level Synthesis (HLS) tools, the scheduling algorithm plays a decisive role in transforming a behavioral specification into a register transfer level (RTL) circuit. Scheduling based on a system of difference constraints (SDC) casts the scheduling problem as a special linear-programming problem by introducing scheduling variables and expressing the scheduling constraints as integer difference constraints. In this paper, we propose applying min-cost network flow to the scheduling algorithm. Experiments show that scheduling with min-cost network flow has an advantage in time complexity over scheduling based on a traditional linear-programming algorithm. We conclude that applying the min-cost network flow algorithm to HLS scheduling is feasible.
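To show what the integer difference form looks like in practice, the sketch below encodes each dependency as a constraint s_u - s_v ≤ c and finds a feasible schedule with Bellman-Ford relaxation. This is not the min-cost network flow method the abstract proposes: it only checks feasibility and ignores any cost objective. The function name `solve_sdc` and the three-operation example are hypothetical.

```python
def solve_sdc(num_vars, constraints):
    """Solve a system of difference constraints x[u] - x[v] <= c.

    `constraints` is a list of (u, v, c) triples. Returns a list of
    non-negative integer start times, or None if the system is
    infeasible (the constraint graph has a negative cycle).
    """
    # Starting from all zeros is equivalent to adding a virtual source
    # with zero-weight edges to every variable.
    dist = [0] * num_vars
    for _ in range(num_vars):
        changed = False
        for u, v, c in constraints:
            if dist[v] + c < dist[u]:
                dist[u] = dist[v] + c
                changed = True
        if not changed:
            base = min(dist)
            return [d - base for d in dist]  # shift so the schedule starts at 0
    return None  # still relaxing after num_vars passes: negative cycle


if __name__ == "__main__":
    # Hypothetical chain a -> b -> c with latencies 1 and 2:
    #   s_a + 1 <= s_b  becomes  s_a - s_b <= -1
    #   s_b + 2 <= s_c  becomes  s_b - s_c <= -2
    print(solve_sdc(3, [(0, 1, -1), (1, 2, -2)]))
```

A flow-based or LP-based scheduler would additionally minimize an objective such as total latency or register lifetime over the same constraint set.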
A high-throughput encoder based on Recursive Convolutional Encoder (RCE) circuits is designed for the multi-code Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes of the CCSDS standard. The use of system registers is reduced by employing a parallel RCE circuit structure; by configuring these parallel RCE structures, the encoder can support multiple patterns and code rates. The FPGA implementation results show that the design can be implemented on the Xilinx VC690T chip, and that its normalized register and LUT usage is reduced by 39.4% and 24.4%, respectively, compared with existing solutions.