This paper presents a 6-bit 800MS/s successive approximation register (SAR) analog-to-digital converter (ADC) in 28nm CMOS with a grouped digital-to-analog converter (DAC) capacitor array. High-speed operation is achiev...
In this work, we explored an efficient automatic layout routing algorithm for connecting the power and ground pins in analog integrated circuits. A rectilinear minimum spanning tree (RMST) algorithm for two sets of pi...
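The RMST idea above can be sketched in a few lines: Prim's algorithm over pin coordinates under Manhattan (rectilinear) distance. This is a toy illustration with assumed inputs, not the paper's actual two-pin-set routing algorithm; `manhattan` and `rmst_edges` are hypothetical names.

```python
def manhattan(p, q):
    """Rectilinear (L1) distance between two pin coordinates."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def rmst_edges(pins):
    """Prim's algorithm over pairwise Manhattan distances.

    Returns the edge list (index pairs) of a rectilinear minimum
    spanning tree connecting all pins.
    """
    if not pins:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < len(pins):
        # Cheapest edge from the tree to any pin not yet connected.
        best = min(
            ((i, j) for i in in_tree
             for j in range(len(pins)) if j not in in_tree),
            key=lambda e: manhattan(pins[e[0]], pins[e[1]]),
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges
```

For example, three pins at (0,0), (0,2), and (3,0) yield a two-edge tree of total rectilinear length 5.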
ISBN (digital): 9798350352030
ISBN (print): 9798350352047
High-dimensional analog circuit sizing with machine learning-based surrogate models suffers from the high sampling cost of evaluating expensive black-box objective functions in huge design spaces. This work addresses the sampling-efficiency challenge by carefully reducing the dimensionality of the input spaces, enabling efficient optimization for automated analog circuit sizing. We propose a latent space optimization method that uses an iteratively updated generative model, based on a variational autoencoder (VAE), to embed the solution manifold of analog circuits into a low-dimensional, continuous space, where the latent variables are optimized with Bayesian optimization (BO). The effectiveness of the proposed method has been verified on two real-world analog circuits with 18 and 59 design variables. Compared with BO in the original high-dimensional spaces, or in latent low-dimensional spaces built by other embedding strategies, the proposed method achieves 23%~73% improvements in optimization performance within the same runtime limits. We also conduct a technology migration experiment using the pre-trained VAE model, which demonstrates the necessity of pre-training and the scalability of the proposed method.
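The latent-space idea can be illustrated with a toy sketch: a fixed linear map stands in for the trained VAE decoder, and plain random search stands in for Bayesian optimization, showing how candidates are proposed in a low-dimensional latent space but evaluated in the full design space. All names, sizes, and the quadratic objective are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 18, 4  # original and latent dimensionality (toy sizes)
# Stand-in for a trained VAE decoder; a real flow would use the network.
A = rng.standard_normal((D, d))

def decode(z):
    """Map a latent point back to the full design-variable space."""
    return A @ z

def objective(x):
    """Toy black-box figure of merit; a real flow would call a simulator."""
    return float(np.sum((x - 1.0) ** 2))

def optimize_latent(n_samples=200):
    """Random-search stand-in for BO: propose in latent space,
    decode, and evaluate in the original design space."""
    best_z, best_f = None, float("inf")
    for _ in range(n_samples):
        z = rng.standard_normal(d)
        f = objective(decode(z))
        if f < best_f:
            best_z, best_f = z, f
    return best_z, best_f
```

The point of the structure is that the search loop only ever samples d = 4 variables, while the objective still sees all D = 18.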
Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion...
Coarse-Grained Reconfigurable Architecture (CGRA) is a domain-specific reconfigurable architecture. Generally, a CGRA consists of IO, memory, coarse-grained processing elements (PEs), and interconnect. Usually, the ALU in each PE contains a relatively complete set of operations, and most interconnects adopt neighbor-to-neighbor (N2N) [1], switch-based [2], or combined connection-box/switch-box (CB-SB) [3] patterns. However, complex operation sets and switch-based or CB-SB fully-connected interconnects provide sufficient reconfigurability at the cost of resource overhead. It is therefore important to build a parameterized CGRA architecture that balances hardware overhead, flexibility, and performance through automatic design space exploration (DSE).
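A minimal sketch of what such a DSE loop might look like, assuming a parameterized template with three axes (array size, interconnect pattern, ALU operation set) and a toy area model. The axes, factors, and budget are all hypothetical; a real flow would use calibrated area/timing models.

```python
from itertools import product

# Hypothetical design-space axes for a parameterized CGRA template.
ARRAY_SIZES = [(4, 4), (8, 8)]
INTERCONNECTS = {"N2N": 1.0, "switch": 1.8, "CB-SB": 2.2}  # toy area factors
OP_SETS = {"minimal": 0.8, "full": 1.5}                     # toy ALU area factors

def area_cost(size, inter, ops):
    """Toy area model: PE count times per-PE interconnect/ALU factors."""
    rows, cols = size
    return rows * cols * INTERCONNECTS[inter] * OP_SETS[ops]

def explore(max_area):
    """Exhaustive DSE: keep configurations under the area budget and
    prefer the largest PE array (a crude stand-in for performance)."""
    feasible = [
        (size, inter, ops)
        for size, inter, ops in product(ARRAY_SIZES, INTERCONNECTS, OP_SETS)
        if area_cost(size, inter, ops) <= max_area
    ]
    return max(feasible, key=lambda c: c[0][0] * c[0][1], default=None)
```

Under a budget of 40 toy area units only 4x4 configurations survive; relaxing the budget to 60 admits an 8x8 array with the minimal operation set and N2N interconnect.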
As designs scale to multi-billion transistors, synthesis runtime becomes an important issue, particularly for design verification and prototyping, where synthesis may be run many times as the design changes. Module...
ISBN (digital): 9798331513351
ISBN (print): 9798331513368
In this work, we have explored the use of a hierarchical optimization technique to automatically design low-dropout regulators. When a genetic algorithm is used for multi-objective optimization, an increase in design variables greatly increases the difficulty of optimization. Therefore, we divide the whole circuit into different levels and pass only the subcircuits' combined design parameters upward, reducing the number of design variables and shortening optimization time. We explore the optimization results under different constraints and compare them systematically with the results of direct optimization.
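The hierarchical decomposition can be sketched as follows: each sub-block is optimized independently (here by random search standing in for the genetic algorithm), and only the per-block solutions are passed upward, so the top level never searches the full joint variable space. The objective and block sizes are toy assumptions, not the paper's LDO circuit.

```python
import random

random.seed(1)

def evaluate_subcircuit(params):
    """Toy per-block figure of merit; a real flow would run SPICE."""
    return sum((p - 0.5) ** 2 for p in params)

def optimize_block(n_vars, n_iter=100):
    """Random search over one sub-block's variables (GA stand-in)."""
    best, best_f = None, float("inf")
    for _ in range(n_iter):
        cand = [random.random() for _ in range(n_vars)]
        f = evaluate_subcircuit(cand)
        if f < best_f:
            best, best_f = cand, f
    return best, best_f

def hierarchical_optimize(blocks):
    """Optimize each sub-block separately, then combine: only the
    per-block solutions travel upward, shrinking the search space
    from sum(blocks) joint variables to one block at a time."""
    solution, total = [], 0.0
    for n_vars in blocks:
        params, f = optimize_block(n_vars)
        solution.extend(params)
        total += f
    return solution, total
```

With blocks of 3, 4, and 2 variables, each search runs in at most a 4-dimensional space instead of the 9-dimensional joint space.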
Neural Radiance Field (NeRF) is a state-of-the-art algorithm for novel view synthesis and has the potential to be used in AR/VR. However, NeRF inference is time-consuming. Motivated by resource-constrained scenarios on edge and mixed-reality devices, our essential idea is to bridge this gap while improving throughput and power consumption. This paper proposes a high-performance FPGA-based accelerator with a fully-pipelined design tailored for the vanilla NeRF algorithm. We also design a mechanism that monitors the output of the rendering module to reduce operations. Experimental results show that our accelerator achieves 3.63× better energy efficiency than a GPU implementation on an NVIDIA V100, and a 1.31× speedup over a state-of-the-art ASIC design when running at the same clock frequency.
The COordinate Rotation DIgital Computer (CORDIC) computes elementary functions using only bit-shift operations and additions. However, the number of iterations grows with the required accuracy, causing long latency. In this paper, we propose a decision-based CORDIC hardware for arctangent calculation, which introduces a comparator to determine whether the rotation in each iteration is necessary. By bypassing these redundant rotations, the proposed CORDIC achieves higher accuracy. Besides, its regular structure is also friendly to a folded implementation. Experiments show that, compared with the conventional CORDIC, the decision-based CORDIC hardware improves accuracy by 28.1%. Synthesis results in 65nm technology show that the proposed unfolded hardware occupies an area of 0.774 mm² and consumes 0.644 mW, improvements of about 13.7% and 3% over the conventional CORDIC, while the folded structure occupies 0.164 mm² and consumes 0.243 mW.
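The decision idea can be modeled in software: a vectoring-mode CORDIC for atan(y/x) where, before each micro-rotation, a comparison checks whether rotating would actually shrink the residual y. This is an illustrative floating-point model under an assumed skip criterion, not the paper's exact fixed-point hardware.

```python
import math

def cordic_atan(y, x, n_iter=16):
    """Vectoring-mode CORDIC for atan(y/x), x > 0, with a per-iteration
    decision: rotate only when doing so shrinks the residual |y|,
    skipping redundant micro-rotations."""
    angle = 0.0
    for i in range(n_iter):
        step = 2.0 ** -i
        # Comparator: rotating changes y by -sign(y)*x*step, which only
        # reduces |y| when |y| exceeds half that amount; otherwise skip.
        if abs(y) > x * step / 2.0:
            d = 1.0 if y > 0 else -1.0
            x, y = x + d * y * step, y - d * x * step
            angle += d * math.atan(step)
    return angle
```

For y = x (a 45-degree vector), the first micro-rotation drives y to zero exactly and every remaining iteration is skipped, so the result is atan(1) with no residual error.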
There are already some works on accelerating transformer networks with field-programmable gate arrays (FPGAs). However, many accelerators focus only on attention computation or suffer from fixed data streams without flexibility. Moreover, their hardware performance is limited without schedule optimization and full use of hardware resources. In this article, we propose a flexible and efficient FPGA-based overlay processor, named FET-OPU. Specifically, we design an overlay architecture for general acceleration of transformer networks. We propose a unique matrix multiplication unit (MMU), which consists of a processing element (PE) array based on modified DSP-packing technology and a FIFO array for data caching and rearrangement. An efficient non-linear function unit (NFU) is also introduced, which can calculate arbitrary single-input non-linear functions. We also customize an instruction set for our overlay architecture, dynamically controlling data flows through instructions generated on the software side. In addition, we introduce a two-level compiler and optimize the parallelism and memory-allocation schedule. Experimental results show that our FET-OPU achieves 7.33-21.27× speedup and 231× less energy consumption compared with CPU, and 1.56-4.08× latency reduction with 5.85-66.36× less energy consumption compared with GPU. Furthermore, we observe 1.56-8.21× better latency and 5.28-6.24× less energy consumption compared with previously customized FPGA/ASIC accelerators, and can be 2.05× faster than NPE with 5.55× less energy consumption.