The performance gap between processors and main memory is continuously widening, a problem known as the memory wall. Emerging nonvolatile devices are capable of in-memory processing and thus have the potential to partially alleviate this bottleneck. Researchers have adopted nonvolatile devices to build various accelerators that target different problems and applications. In this work, we adopt one of the emerging nonvolatile devices, the ferroelectric field-effect transistor (FeFET), to build a multifunctional in-memory processing unit named FeMAT. From a structural point of view, FeMAT is an FeFET-based memory array composed of 3T-based cells. From a functional point of view, FeMAT is not only a nonvolatile memory but can also perform, inside the memory, some logic operations (the processing-in-memory (PIM) mode), binary convolutions (the binary convolutional neural network (BCNN) acceleration mode), and content searching (the ternary content-addressable memory (TCAM) mode). These functions are seamlessly fused into the FeFET-based memory array and can be configured online without changing the circuit structure. Our experiments demonstrate superior energy efficiency in comparisons with a resistive random-access memory (ReRAM) based equivalent, as well as with a TCAM and a BCNN accelerator based on complementary metal-oxide-semiconductor (CMOS) devices.
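As a rough software illustration of what the TCAM mode computes, the sketch below matches a binary search key against stored ternary words. It is not the FeMAT circuit: the function name `tcam_search`, the string encoding, and the use of `X` as the don't-care symbol are assumptions made for this example, and a hardware TCAM compares all entries in parallel rather than in a loop.

```python
from typing import List, Optional


def tcam_search(entries: List[str], key: str) -> Optional[int]:
    """Return the index of the first stored word that matches `key`.

    Each stored word is a string over {'0', '1', 'X'}; 'X' (a hypothetical
    encoding chosen for this sketch) matches either key bit.
    """
    for index, entry in enumerate(entries):
        if len(entry) != len(key):
            continue  # words of a different width can never match
        # A position matches if the stored symbol is don't-care or equals the key bit.
        if all(stored in ("X", bit) for stored, bit in zip(entry, key)):
            return index
    return None  # no stored word matched


table = ["10X1", "0XX0", "1111"]
first_hit = tcam_search(table, "1011")   # matches "10X1"
second_hit = tcam_search(table, "0110")  # matches "0XX0"
miss = tcam_search(table, "1110")        # matches nothing
```

Returning the first matching index mimics the priority encoding commonly placed behind a TCAM array; whether FeMAT resolves multiple matches this way is not stated in the abstract.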
ISBN: 9781728174679 (digital)
ISBN: 9781728174686 (print)
Public key cryptography is an important part of the global digital communication infrastructure. However, the emergence of quantum computers and Shor's algorithm greatly threatens the security of public key cryptography. CRYSTALS-KYBER, a lattice-based key encapsulation mechanism (KEM), passed three rounds of the global solicitation for post-quantum cryptography algorithms held by the National Institute of Standards and Technology (NIST). This paper explores the implementation and optimization space of hardware designs for the CRYSTALS-KYBER algorithm. We analyze its software code, try different strategies to optimize the hardware implementation, and conduct a comparative analysis in terms of area and speed. The experimental results show that performance can be greatly improved by moderately optimizing the loops. Compared with the best results of [12], our optimizations improve performance by up to 74.6% for the encapsulation algorithm and 54.4% for the decapsulation algorithm.
Many NoSQL (Not Only SQL) databases have been proposed to store and query huge amounts of data. Some of them, such as BigTable, PNUTS, and HBase, can be modeled as distributed ordered tables (DOTs). Many additional indexing techniques have been presented to support queries on non-key columns for DOTs. However, there has been no comprehensive analysis or comparison of these techniques, which makes it difficult for users to select or propose a proper indexing technique for a given workload. This paper proposes a taxonomy based on six indexing issues to classify indexing techniques on DOTs and provides a comprehensive review of the state-of-the-art techniques. Based on the taxonomy, we propose a performance model named QSModel to estimate the query time and storage cost of these techniques, and we run experiments on a practical workload from Tencent to evaluate this model. The results show that the maximum error rates for query time and storage cost are 24.2% and 9.8%, respectively. Furthermore, we propose IndexComparator, an open source project that implements representative indexing techniques. Users can therefore select the best-fit indexing technique based on both theoretical analysis and practical experiments.
Owing to increasing amounts of data and compute resources, deep learning has achieved many successes in various domains. Recently, researchers and engineers have made efforts to apply intelligent algorithms to the mobile ...
Tracing back the instruction execution sequence to debug a multicore system can be very time-consuming because the relationships of the instructions can be very complex. For instructions that cannot be checked by the ...
The poor energy proportionality of servers is seen as the principal cause of the low energy efficiency of modern data centers. We find that different resource configurations of an application can lead to similar performance but distinct energy consumption. We call this phenomenon "performance-equivalent resource configurations (PERC)", and the corresponding performance range is called the equivalent region (ER). Based on PERC, one basic idea for improving energy efficiency is to select the most efficient configuration from the PERC of each application. However, this approach cannot give every application its optimal configuration when thousands of applications run simultaneously on resource-bounded servers. Here we propose a heuristic scheme, CPicker, based on genetic programming to improve the energy efficiency of servers. To speed up convergence, CPicker initializes a high-quality population by first choosing configurations from regions that have high energy variation. Experiments show that CPicker improves energy efficiency by more than 17% compared with the greedy approach, and loses less than 4% efficiency compared with the oracle case.
Exploring the spatiotemporal patterns of the relationships between urban indicators and urban temperature is essential to improve the mitigation effectiveness when we intend to adjust built environment for moderating ...
The technological breakthrough in Generative Adversarial Networks (GANs) has propelled the advancement of content-generation applications such as AI-based painting, style transfer, and music composition. However, in contrast to previous deep learning models for prediction and categorization, generative networks generally rely on instance normalization (IN) layers for better feature distribution, as IN performs significantly better than batch normalization (BN) in image style transfer, image-to-image translation, and similar tasks. Unlike batch or group normalization, which can be fused into convolutional layers and ignored during the network inference stage, an instance normalization layer induces intensive computation and memory access. Prior deep learning accelerator designs for traditional neural networks and GANs mostly focus on accelerating convolution and deconvolution layers but lack support for IN operations, which can become a performance bottleneck on edge devices with insufficient computational power. To address this problem, we propose an inference accelerator for content generation (ACG-Engine) that supports the fundamental operations of generative networks, including convolution layers, deconvolution layers, and, specifically, instance normalization layers. We perform a hardware-aware mathematical transformation of the IN operation that reduces its computational complexity and makes it more memory-friendly, so that it can be efficiently mapped to the classic 2D processing element array. Owing to the proposed optimization techniques, ACG-Engine achieves a 4.56X speedup and improves power efficiency by up to 29X compared with the prior baseline acceleration scheme for generative networks. In addition, ACG-Engine achieves performance comparable to classic CNN-specific accelerators with negligible power consumption and area overhead.
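To make the cost of instance normalization concrete, the sketch below normalizes one feature map of one instance in two ways. The abstract does not say which transformation ACG-Engine applies, so the single-pass rewrite here (accumulating the sum and the sum of squares, then applying one multiply-add per element) is only an assumed, textbook-style illustration; the function names and the `eps` default are hypothetical.

```python
import math
from typing import List


def instance_norm_two_pass(channel: List[float], eps: float = 1e-5) -> List[float]:
    """Reference form: one pass for the mean, a second pass for the variance."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return [(v - mean) / math.sqrt(var + eps) for v in channel]


def instance_norm_one_pass(channel: List[float], eps: float = 1e-5) -> List[float]:
    """Assumed rewrite: gather both statistics in a single pass over the data."""
    n = len(channel)
    total = 0.0
    total_sq = 0.0
    for v in channel:
        total += v
        total_sq += v * v
    mean = total / n
    # Var[x] = E[x^2] - E[x]^2; clamp at zero to absorb rounding error.
    var = max(total_sq / n - mean * mean, 0.0)
    scale = 1.0 / math.sqrt(var + eps)
    shift = -mean * scale
    # Each output is now a single multiply-add, like a per-channel affine layer.
    return [v * scale + shift for v in channel]


feature_map = [1.0, 2.0, 3.0, 4.0]
reference = instance_norm_two_pass(feature_map)
rewritten = instance_norm_one_pass(feature_map)
```

Unlike batch normalization, whose statistics are fixed after training, `scale` and `shift` here depend on the input itself, which is why the layer cannot simply be folded into the preceding convolution at inference time.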
ISBN: 9781728148038 (digital)
ISBN: 9781728148045 (print)
Weakly supervised object detection (WSOD), which needs only image-level annotations, has attracted much attention recently. By combining a convolutional neural network with multiple instance learning (MIL), the Multiple Instance Detection Network (MIDN) has become the most popular method for addressing the WSOD problem and has been adopted as the initial model in many works. We argue that MIDN tends to converge to the most discriminative object parts, which limits the performance of methods based on it. In this paper, we propose a novel Coupled Multiple Instance Detection Network (C-MIDN) to address this problem. Specifically, we use a pair of MIDNs that work in a complementary manner through proposal removal. The localization information of the MIDNs is further coupled to obtain tighter bounding boxes and to localize multiple objects. We also introduce a Segmentation Guided Proposal Removal (SGPR) algorithm to guarantee the MIL constraint after the removal and to ensure the robustness of C-MIDN. Through a simple implementation of C-MIDN with online detector refinement, we obtain 53.6% and 50.3% mAP on the challenging PASCAL VOC 2007 and 2012 benchmarks, respectively, which significantly outperforms the previous state of the art.
Although rice cultivation is one of the most important agricultural sources of methane (CH4) and contributes ∼8% of total global anthropogenic emissions, large discrepancies remain among estimates of global CH4 emissions from rice cultivation (ranging from 18 to 115 Tg CH4 yr−1) due to a lack of observational constraints. The spatial distribution of paddy-rice emissions has been assessed at regional-to-global scales by bottom-up inventories and land surface models at coarse spatial resolution (e.g., > 0.5°) or over spatial units (e.g., agro-ecological zones). However, high-resolution CH4 flux estimates capable of capturing the effects of local climate and management practices on emissions, as well as replicating in situ data, remain challenging to produce because of the scarcity of high-resolution maps of paddy rice and an insufficient understanding of CH4 predictors. Here, we combine paddy-rice methane-flux data from 23 global eddy covariance sites and MODIS remote sensing data with machine learning to 1) evaluate data-driven model performance and variable importance for predicting rice CH4 fluxes; and 2) produce gridded up-scaling estimates of rice CH4 emissions at 5000-m resolution across Monsoon Asia, where ∼87% of global rice area is cultivated and ∼90% of global rice production occurs. Our random-forest model achieved Nash-Sutcliffe Efficiency values of 0.59 and 0.69 for 8-day CH4 fluxes and site mean CH4 fluxes, respectively, with land surface temperature, biomass, and water-availability-related indices as the most important predictors. We estimate the average annual (winter fallow season excluded) paddy-rice CH4 emissions throughout Monsoon Asia to be 20.6 ± 1.1 Tg yr−1 for 2001–2015, which is at the lower end of the range of previous inventory-based estimates (20–32 Tg CH4 yr−1). Our estimates also suggest that CH4 emissions from paddy rice in this region have been declining from 2007 through 2015 following declines in both paddy-rice growing area and emission rates per unit
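For readers unfamiliar with the Nash-Sutcliffe Efficiency (NSE) values reported above, the sketch below computes the standard definition, NSE = 1 − Σ(obs − pred)² / Σ(obs − mean(obs))². The function name and the sample numbers are made up for illustration and are not data from the study.

```python
from typing import Sequence


def nash_sutcliffe_efficiency(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Return 1 minus the ratio of residual variance to observed variance."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equally long")
    mean_observed = sum(observed) / len(observed)
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total_ss = sum((o - mean_observed) ** 2 for o in observed)
    if total_ss == 0:
        raise ValueError("NSE is undefined when all observations are identical")
    return 1.0 - residual_ss / total_ss


# Illustrative values only (not measurements from the paper).
observed_flux = [1.0, 2.0, 3.0]
perfect_score = nash_sutcliffe_efficiency(observed_flux, [1.0, 2.0, 3.0])
mean_only_score = nash_sutcliffe_efficiency(observed_flux, [2.0, 2.0, 2.0])
```

An NSE of 1 means the predictions reproduce the observations exactly, 0 means the model is no better than always predicting the observed mean, and negative values mean it is worse than that baseline.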