Many image-to-image computer vision approaches have made great progress through end-to-end frameworks with encoder-decoder architectures. However, the image-to-image eye fixation prediction task differs from those tasks in that it focuses more on salient regions than on precise predictions for every pixel. Thus, it is not appropriate to apply an end-to-end encoder-decoder directly to eye fixation prediction. In addition, although high-level features are important, the contribution of low-level features should also be preserved and balanced in a computational model. Nevertheless, some low-level features that attract attention are easily lost while passing through a deep network. Therefore, effectively integrating low-level and high-level features to improve eye fixation prediction remains a challenging task. In this paper, a coarse-to-fine network (CFN) encompassing two pathways with different training strategies is proposed: the coarse perceiving network (CFN-Coarse) can be a simple encoder network or any existing pretrained network that captures the distribution of salient regions and generates high-quality feature maps; the fine integrating network (CFN-Fine) uses fixed parameters from the CFN-Coarse and combines features from deep to shallow in the deconvolution process by adding skip connections between the down-sampling and up-sampling paths to efficiently integrate deep and shallow features. The saliency maps obtained by the method are evaluated on 6 standard benchmark datasets, namely SALICON, MIT1003, MIT300, Toronto, OSIE, and SUN500. The results demonstrate that the method surpasses the state-of-the-art accuracy of eye fixation prediction and achieves competitive performance to date under most evaluation metrics on the SALICON Saliency Prediction Challenge (LSUN2017).
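The skip connections between down-sampling and up-sampling paths described above can be sketched in NumPy; the shapes and the nearest-neighbor upsampling are illustrative assumptions, not the paper's exact layers:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling for a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_fuse(decoder_feat, encoder_feat):
    """Fuse an upsampled deep decoder feature with the matching shallow encoder feature."""
    up = upsample2x(decoder_feat)
    assert up.shape == encoder_feat.shape
    return up + encoder_feat  # element-wise addition as the skip connection

# Hypothetical feature maps: deep decoder (64, 8, 8), shallow encoder (64, 16, 16)
deep = np.random.rand(64, 8, 8)
shallow = np.random.rand(64, 16, 16)
fused = skip_fuse(deep, shallow)
print(fused.shape)  # (64, 16, 16)
```

The additive fusion lets low-level detail from the encoder survive into the decoder without increasing channel count.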
Currently, SLAM (simultaneous localization and mapping) systems based on monocular cameras cannot directly obtain depth information; most of them suffer from scale uncertainty and need to be initialized. In application scenarios that require navigation and obstacle avoidance, the inability to achieve dense mapping is a further defect of monocular SLAM. In response to the above problems, this paper proposes a method that learns depth estimation with DenseNet and a CNN for a monocular SLAM system. We use an encoder-decoder architecture based on transfer learning and convolutional neural networks to estimate the depth of monocular RGB images. At the same time, through front-end ORB feature extraction and back-end direct RGB-D Bundle Adjustment optimization, it is possible to obtain accurate camera poses and achieve dense indoor mapping using the estimated depth information. The experimental results show that the monocular depth estimation model used in this paper achieves good results and is competitive with current popular methods. On this basis, the camera pose estimation error is also smaller than that of traditional monocular SLAM solutions, and the system can complete the dense indoor reconstruction task. The result is a complete SLAM system based on a monocular camera.
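Using an estimated depth map for dense mapping amounts to back-projecting each pixel through the pinhole camera model; a minimal sketch follows, with hypothetical intrinsics (not values from the paper):

```python
import numpy as np

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) with estimated depth into a 3D camera-frame point."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics for a 640x480 camera
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
p = backproject(320.0, 240.0, 2.0, K)
print(p)  # [0. 0. 2.] -- the principal point maps onto the optical axis
```

Applying this to every pixel of the predicted depth map yields the dense point cloud that a purely monocular pipeline cannot recover on its own.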
Images captured in low-brightness environments often exhibit poor visibility and artifacts such as low brightness, low contrast, and color distortion. These artifacts not only affect the visual perception of the human eye but also degrade the performance of computer vision algorithms. Existing deep learning-based image enhancement methods are quite slow and usually require extensive hardware. Conversely, lightweight enhancement approaches do not provide satisfactory performance compared to state-of-the-art methods. Therefore, we propose LiCENt (Light Channel Enhancement Network), a fast and lightweight deep learning-based algorithm for low-light image enhancement using the lightness channel of Hue Saturation Lightness (HSL). LiCENt uses a combination of an autoencoder and a convolutional neural network (CNN) to train a low-light enhancer that first improves the illumination and then the details of the low-light image in a unified framework. The method operates on the single lightness channel 'L' of the HSL color space instead of the traditional RGB color channels, which reduces the number of learnable parameters by a factor of up to 8.92. LiCENt also benefits significantly from Brilliance Perception Adjustment, which enables the model to avoid issues such as over-enhancement and color distortion. The experimental results demonstrate that our approach generalizes well to synthetic and natural low-light images and outperforms other methods in terms of qualitative and quantitative metrics.
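The HSL lightness channel the network operates on is simple to extract: per pixel, it is the mean of the largest and smallest RGB components. A sketch, assuming RGB values normalized to [0, 1]:

```python
import numpy as np

def hsl_lightness(rgb):
    """HSL lightness per pixel: L = (max(R, G, B) + min(R, G, B)) / 2."""
    return (rgb.max(axis=-1) + rgb.min(axis=-1)) / 2.0

# Hypothetical 2x2 RGB image: red, mid-gray, black, white
img = np.array([[[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
                [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]])
L = hsl_lightness(img)
print(L)  # [[0.5 0.5]
          #  [0.  1. ]]
```

Enhancing only this one channel, then recombining with the untouched hue and saturation, is what keeps the learnable parameter count low relative to processing three RGB channels.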
In this paper, we propose a lightweight asymmetric refining fusion network (LARFNet) for real-time semantic segmentation, addressing the problem that some existing models cannot achieve good segmentation accuracy at real-time inference speed on mobile devices due to their huge computational overhead. Specifically, LARFNet adopts an asymmetric encoder-decoder structure. A depth-wise separable asymmetric interaction module (DSAI module) is designed for the encoding process; with optimized convolutions it effectively extracts local and surrounding information under different receptive fields while ensuring communication between channels. In the decoder, we design the bilateral pyramid pooling attention module (BPPA module) and the multi-stage refinement fusion module (MRF module). The BPPA module integrates the multi-scale context information of the high-level output. Based on spatial and channel attention mechanisms, the MRF module refines the feature maps of different resolutions and guides feature fusion. Experimental results show that LARFNet achieves 69.2% mIoU and 65.6% mIoU on the Cityscapes and CamVid datasets at 127 FPS and 222 FPS respectively, using only a single NVIDIA GeForce GTX2080Ti GPU and 0.72M parameters, without any pre-training or pre-processing. Compared with most existing state-of-the-art models, the proposed method makes efficient use of network parameters at a faster speed, reduces the number of network parameters, and still achieves good segmentation accuracy. (c) 2022 Elsevier Ltd. All rights reserved.
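The parameter savings behind depth-wise separable convolution, the building block of modules like DSAI, can be checked with simple arithmetic (example channel counts are illustrative):

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Depth-wise k x k conv per input channel, then a 1 x 1 point-wise mix."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 64)          # 36864
sep = dw_separable_params(3, 64, 64)  # 4672
print(std, sep, round(std / sep, 1))
```

The roughly 8x reduction at these sizes is why such factorizations dominate lightweight segmentation backbones.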
This study investigates the capability of sequence-to-sequence machine learning (ML) architectures in an effort to develop streamflow forecasting tools for Canadian watersheds. Such tools are useful to inform local and region-specific water management and flood forecasting activities. Two powerful deep-learning variants of the Recurrent Neural Network were investigated, namely the standard and attention-based encoder-decoder long short-term memory (LSTM) models. Both models were forced with past hydro-meteorological states and daily meteorological data with a look-back time window of several days. These models were tested on 10 different watersheds from the Ottawa River watershed, located within the Great Lakes Saint-Lawrence region of Canada, an economic powerhouse of the country. The results of the training and testing phases suggest that both models simulate overall hydrograph patterns well when compared to observational records. Between the two, the attention model significantly outperforms the standard model in all watersheds, suggesting the importance and usefulness of the attention mechanism in ML architectures, not well explored for hydrological applications. The mean performance accuracy of the attention model on unseen data, assessed in terms of mean Nash-Sutcliffe Efficiency and Kling-Gupta Efficiency, is found to be 0.985 and 0.954, respectively, for these watersheds. Streamflow forecasts with lead times of up to 5 days with the attention model demonstrate overall skillful performance, well above the benchmark accuracy of 70%. The results of the study suggest that the encoder-decoder LSTM with attention mechanism is a powerful modelling choice for developing streamflow forecasting systems for Canadian watersheds.
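The attention mechanism's role over a look-back window can be sketched with plain dot-product attention in NumPy; the window length, hidden size, and scoring function are illustrative assumptions, not the study's exact configuration:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Score each past time step against the decoder state, softmax, weighted sum."""
    scores = encoder_states @ decoder_state   # one score per look-back step, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over time steps
    context = weights @ encoder_states        # context vector, shape (hidden,)
    return weights, context

# Hypothetical: 7-day look-back window, hidden size 4
rng = np.random.default_rng(0)
enc = rng.normal(size=(7, 4))
dec = rng.normal(size=4)
w, ctx = attention_context(dec, enc)
print(w.sum())  # 1.0 -- the weights form a distribution over look-back days
```

Rather than compressing the whole window into one fixed vector, the decoder can re-weight which past days matter at each forecast step, which is the property credited for the attention model's edge.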
In recent years, how to strike a good trade-off between accuracy, inference speed, and model size has become the core issue for real-time semantic segmentation applications, which play a vital role in real-world scenarios such as autonomous driving systems and drones. In this study, we devise a novel lightweight network using a multi-scale context fusion scheme (MSCFNet), which explores an asymmetric encoder-decoder architecture to alleviate these problems. More specifically, the encoder adopts efficient asymmetric residual (EAR) modules, which are composed of factorized depth-wise convolution and dilated convolution. Meanwhile, instead of complicated computation, simple deconvolution is applied in the decoder to further reduce the number of parameters while still maintaining high segmentation accuracy. Also, MSCFNet has branches with efficient attention modules at different stages of the network to capture multi-scale contextual information well. We then combine them before the final classification to enhance the expression of the features and improve segmentation efficiency. Comprehensive experiments on challenging datasets have demonstrated that the proposed MSCFNet, which contains only 1.15M parameters, achieves 71.9% Mean IoU on the Cityscapes testing dataset and can run at over 50 FPS on a single Titan XP GPU.
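The two tricks inside the EAR modules, asymmetric factorization and dilation, can both be quantified with short arithmetic (channel counts here are illustrative):

```python
def asym_factor_params(k, c):
    """A k x k depth-wise kernel factorized into k x 1 plus 1 x k (per channel)."""
    return 2 * k * c  # vs. k * k * c for the unfactorized kernel

def dilated_receptive_field(k, d):
    """Effective receptive field of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

print(asym_factor_params(3, 64), 3 * 3 * 64)  # 384 vs 576 weights
print(dilated_receptive_field(3, 4))          # 9 -- 3x3 cost, 9x9 view
```

Factorization trims parameters; dilation widens context at no extra parameter cost, which together is how such encoders stay around one million parameters.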
Context information corrupted by speckle, together with class imbalance in the labeled data, makes pixelwise classification of high-resolution (HR) synthetic aperture radar (SAR) images a challenging task. To address these issues, we propose a global-context pyramidal and class-balanced network (GPCNet) for HR SAR image classification. The proposed structure follows an encoder-decoder architecture. In the encoder module, multiscale convolutional and global-local cross-channel attention (GCA) blocks are employed to capture global-context and distinguishable deep feature statistics while reducing the impact of random fluctuations in homogeneous regions. The channel information of convolutional layers at different scales is efficiently learned by local cross-channel interaction in the GCA block. Besides, a sampled class-balanced loss based on the effective number of samples is utilized to alleviate the class imbalance of HR SAR images. Experiments carried out on a TerraSAR-X image classification dataset demonstrate that GPCNet yields superior performance compared with other related networks. (C) 2022 Society of Photo-Optical Instrumentation Engineers (SPIE)
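A class-balanced loss built on the effective number of samples typically weights each class by the inverse of (1 - beta^n) / (1 - beta); a minimal sketch, with a hypothetical class histogram and beta value:

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights inversely proportional to the effective number (1 - beta^n) / (1 - beta)."""
    counts = np.asarray(counts, dtype=float)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(counts)  # normalize to mean 1

# Hypothetical histogram: one dominant class, two increasingly rare ones
w = class_balanced_weights([100000, 5000, 500])
print(w)  # the rarest class receives the largest weight
```

Because the effective number saturates as a class grows, rare classes get markedly larger weights while abundant classes are damped only mildly, which is gentler than raw inverse-frequency weighting.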
The energy load data in a micro-energy network form a time series with sequential and nonlinear characteristics. This paper proposes a model based on the encoder-decoder architecture and ConvLSTM for multi-scale prediction of multi-energy loads in the micro-energy network. We combine ConvLSTM, LSTM, an attention mechanism, and multi-task learning to construct a model specifically for energy load forecasting in the micro-energy network. ConvLSTM encodes the input time series; the attention mechanism assigns different weights to the features, which are subsequently decoded by the decoder LSTM layer; finally, a fully connected layer interprets the output. The model is applied to forecast the multi-energy load data of a micro-energy network in a certain area of Northwest China. The test results show that our model converges and that its evaluation metrics are better than those of the multi-task FC-LSTM and the single-task FC-LSTM. In particular, the attention mechanism makes the model converge faster and with higher precision.
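At the core of both the ConvLSTM encoder and the LSTM decoder is the gated cell update; a minimal NumPy sketch of one plain-LSTM step (ConvLSTM replaces the matrix products with convolutions), with hypothetical sizes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell; W, U, b pack the 4 gates (i, f, o, g)."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # forget old state, admit new candidate
    h_new = sigmoid(o) * np.tanh(c_new)               # gated hidden-state output
    return h_new, c_new

# Hypothetical sizes: 3 input loads (e.g. electricity, heat, gas), hidden size 5
rng = np.random.default_rng(1)
n_in, n_h = 3, 5
W = rng.normal(size=(4 * n_h, n_in))
U = rng.normal(size=(4 * n_h, n_h))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape)  # (5,) -- hidden state, bounded by the tanh/sigmoid gating
```

The forget and input gates are what let the cell track long sequential dependencies in the load series without the gradients vanishing.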
With the increasing demand for application scenarios such as autonomous driving and drone aerial photography, achieving the best trade-off between segmentation accuracy and inference speed while reducing the number of parameters has become a challenging problem. In this paper, a lightweight and efficient asymmetric network (LEANet) for real-time semantic segmentation is proposed to address it. Specifically, LEANet adopts an asymmetric encoder-decoder architecture. In the encoder, a depth-wise asymmetric bottleneck module with separation and shuffling operations (SS-DAB module) is proposed to jointly extract local and context information. In the decoder, a pyramid pooling module based on channel-wise attention (CA-PP module) is proposed to aggregate multi-scale context information and guide feature selection. Without any pre-training or post-processing, LEANet achieves 71.9% and 67.5% mean Intersection over Union (mIoU) at 77.3 and 98.6 Frames Per Second (FPS) on the Cityscapes and CamVid test sets, respectively. These experimental results show that LEANet achieves an optimal trade-off between segmentation accuracy and inference speed with only 0.74 million parameters.
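The channel-shuffling operation used in modules like SS-DAB is a pure reshape-transpose trick; a sketch with a tiny illustrative tensor:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups so grouped convolutions can exchange information."""
    c, h, w = x.shape
    assert c % groups == 0
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# 6 channels in 2 groups: [0 1 2 | 3 4 5] -> [0 3 1 4 2 5]
x = np.arange(6, dtype=float).reshape(6, 1, 1)
y = channel_shuffle(x, 2)
print(y.ravel())  # [0. 3. 1. 4. 2. 5.]
```

Without the shuffle, separated (grouped) convolutions would keep each channel group isolated; interleaving restores cross-group information flow at zero parameter cost.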
ISBN (digital): 9783031298608
ISBN (print): 9783031298592; 9783031298608
Image captioning is the process of generating a textual description of an image, integrating both computer vision and natural language processing. Approaches based on encoder-decoder architectures have recently been proposed to solve image captioning problems. The main objective of this paper is to conduct a comparative study between the two most widely used approaches for natural language processing tasks, namely LSTMs and Transformers. We used the Flickr8k dataset as the source of input images and the VGG16 model for image feature extraction. To evaluate the descriptions generated by the models, the BLEU score metric is used to measure the performance of both. Both models were able to generate grammatically correct and expressive captions.
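The BLEU metric used for evaluation combines clipped n-gram precision with a brevity penalty; a minimal unigram (BLEU-1) sketch in pure Python, with a made-up caption pair (the full metric averages precisions up to 4-grams):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Unigram BLEU with brevity penalty, for a single reference caption."""
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)          # clipped unigram matches
    precision = sum(overlap.values()) / len(cand)
    # penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("a dog runs on grass", "a dog runs on the grass"))
```

Clipping the match counts prevents a degenerate caption like "the the the" from scoring highly, and the brevity penalty stops models from gaming precision with very short outputs.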