A large crack detection dataset of 2446 manually labeled images is established to cover a wide range of noise and to evaluate the performance of end-to-end deep convolutional networks in detecting cracking. Five state-of-the-art end-to-end deep computer vision architectures for semantic segmentation are trained and evaluated: Fully Convolutional Network (FCN), Global Convolutional Network (GCN), Pyramid Scene Parsing Network (PSPNet), UPerNet, and DeepLabv3+. VGG, ResNet, and DenseNet are adopted as backbones. Based on the comparison of test set metrics, DeepLabv3+ with the ResNet101 backbone achieved the highest IoU of 0.6298, the highest recall of 0.6834, and the highest F1 score of 0.7732. The influence of database choice and image noise on crack detection performance is reported. Based on the comparison of predicted images, UPerNet with the ResNet101 backbone performs best on images with shading, while DeepLabv3+ with the ResNet101 backbone performs best on images with blemishes. The research outcome can provide a reference for the fast and accurate detection of cracks in civil engineering.
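The IoU, recall, and F1 metrics named above can be illustrated with a minimal sketch for binary crack masks. This is not the authors' evaluation code: the function name `segmentation_metrics`, the nested-list mask representation, and the convention that 1 marks a crack pixel are assumptions made for illustration, and the abstract does not say whether metrics are computed per image or over the whole test set.

```python
from typing import Dict, Sequence


def segmentation_metrics(
    pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Pixel-wise IoU, precision, recall and F1 for binary masks (1 = crack, assumed)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # crack pixel correctly predicted
            elif p and not t:
                fp += 1  # background predicted as crack
            elif t and not p:
                fn += 1  # crack pixel missed

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of dividing by zero for empty masks.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical 2x3 masks, used only to exercise the function.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
metrics = segmentation_metrics(pred_mask, true_mask)
```

In this toy case there are two true positives, one false positive, and one false negative, so IoU is 2/4 while precision, recall, and F1 are all 2/3.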
RGB-thermal salient object detection (RGB-T SOD) has unique advantages in terms of handling challenging scenes with cluttered backgrounds, low illumination, and low contrast. However, because they do not consider the significant differences between different imaging mechanisms and inherent characteristics of thermal images, existing RGB-T SOD methods are generally unable to handle diverse feature fusion demands and may yield unsatisfactory performance. To overcome this problem and achieve more effective RGB-T SOD, we propose an asymmetric cross-modal activation network to exploit the interactions of modality-specific features based on an asymmetric feature fusion strategy. Specifically, a two-stream asymmetric feature aggregation encoder module is proposed to fuse multimodality features adaptively and extract complementary information. The self-attention of multimodality features is leveraged to guide cross-modal interactions, which can propagate long-range contextual dependencies and extract effective saliency cues. Furthermore, a multitask decoder is proposed to achieve SOD and thermal image reconstruction in a unified framework. Salient objects can be located and segmented accurately based on reconstructed high-resolution feature representations. Extensive experiments on public RGB-T and RGB-D SOD datasets demonstrate the superiority of the proposed network, and ablation experiments highlight the effectiveness of each component. Our code and saliency maps are available at: ***/xanxuso/ACMANet. (c) 2022 Elsevier B.V. All rights reserved.
ISBN: 9783030863319 (print)
Table structure recognition is an important task in document analysis and attracts the attention of many researchers. However, due to the diversity of table types and the complexity of table structure, the performance of table structure recognition methods is still not good enough in practice. Row and column separators play a significant role in two-stage table structure recognition, and a better row and column separator segmentation result can improve the final recognition results. Therefore, in this paper, we present a novel deep learning model to detect row and column separators. This model contains a convolutional encoder and two parallel decoders, one for rows and one for columns. The encoder extracts visual features using convolution blocks; each decoder formulates the feature map as a sequence and uses a sequence labeling model, a bidirectional long short-term memory network (BiLSTM), to detect row and column separators. Experiments have been conducted on PubTabNet, and the model is benchmarked on several available datasets, including PubTabNet, UNLV, ICDAR13, and ICDAR19. The results show that our model achieves state-of-the-art performance compared with other strong models. In addition, our model shows better generalization ability. The code is available on this site (www ***/L597383845/row-col-table-recognition).
ISBN: 9781665403696 (print)
Automatic extraction of buildings from high-resolution remote sensing imagery is very useful in many applications such as city management, mapping, urban planning, and geographic information updating. However, due to the general texture of buildings and the complexity of the image background, high-precision building segmentation from high-resolution remote sensing images is still a challenging task. Existing state-of-the-art frameworks use repeated pooling and striding operations, which lead to the loss of detailed information. Thus, high-resolution representations are essential for building extraction. On this basis, our proposed network, named HRLinkNet, builds on LinkNet and maintains high-resolution representations throughout the whole process. We tested it on the WHU Building dataset. Experimental results show that the proposed HRLinkNet is superior to LinkNet, UNet, DLinkNet, SegNet, and other networks.
ISBN: 9781665435970 (print)
This paper develops probabilistic PV forecasters by taking advantage of recent breakthroughs in deep learning. A tailored forecasting tool, named encoder-decoder, is implemented to compute intraday multi-output PV quantile forecasts that efficiently capture the time correlation. The models are trained using quantile regression, a non-parametric approach that assumes no prior knowledge of the probabilistic forecasting distribution. The case study is composed of PV production monitored on-site at the University of Liege (ULiege), Belgium. The weather forecasts from the regional climate model provided by the Laboratory of Climatology are used as inputs to the deep learning models. The forecast quality is quantitatively assessed by the continuous ranked probability score and the interval score. The results indicate that this architecture improves the forecast quality and is computationally efficient enough to be incorporated in an intraday decision-making tool for robust optimization.
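Quantile regression is commonly trained by minimizing the pinball (quantile) loss. The sketch below shows that loss in plain Python; it is an illustration of the general technique, not the paper's training code. The function names, the example quantile levels, and the toy numbers are assumptions.

```python
from typing import Sequence


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss of one forecast y_pred for quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(
    observations: Sequence[float],
    quantile_forecasts: Sequence[Sequence[float]],
    quantiles: Sequence[float],
) -> float:
    """Average pinball loss over all time steps and all quantile levels."""
    total = 0.0
    count = 0
    for y, forecasts in zip(observations, quantile_forecasts):
        for q, forecast in zip(quantiles, forecasts):
            total += pinball_loss(y, forecast, q)
            count += 1
    return total / count


# Hypothetical values: two time steps, three quantile forecasts each.
quantile_levels = [0.1, 0.5, 0.9]
observed_pv = [2.0, 3.0]
forecast_pv = [[1.0, 2.0, 4.0], [2.0, 3.0, 3.5]]
loss = mean_pinball_loss(observed_pv, forecast_pv, quantile_levels)
```

Because the loss depends only on the observation and the predicted quantile, no distributional form has to be assumed, which matches the non-parametric framing in the abstract.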
ISBN: 9783030893910; 9783030893903 (print)
Judgment documents contain rich legal information, but they are also lengthy and have a complex structure. This calls for an effective way of summarizing judgment documents. By analyzing the structural features of Chinese judgment documents, we propose an automatic summarization method that consists of an extraction model and an abstraction model. In the extraction model, all the sentences are encoded by a Self-Attention network and are classified into key sentences and non-key sentences. In the abstraction model, the initial summary is refined into a final summary by a unidirectional-bidirectional attention network. Such a summary could help improve the efficiency of case handling and make judgment documents more accessible to general readers. The experimental results on the CAIL2020 dataset are satisfactory.
ISBN: 9781728176055 (print)
Electrocardiography (ECG) is a conventional method for arrhythmia diagnosis. In this paper, we propose a novel neural network model that treats the typical heartbeat classification task as a 'translation' problem. We introduce the Transformer structure into the model and add a heartbeat-aware attention mechanism to enhance the alignment between the encoded and decoded sequences. After training on an ECG database collected from 200k patients in over 2000 hospitals over more than 10 years, validation on an independent test dataset shows that the new heartbeat-aware Transformer model outperforms the classic Transformer and other sequence-to-sequence methods. Finally, we show that visualizing the encoder-decoder attention weights provides more interpretable information about how a Transformer makes a diagnosis from raw ECG signals, which can guide clinical diagnosis.
ISBN: 9781728176055 (print)
This paper proposes a new model, referred to as the show and speak (SAS) model, which for the first time is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
ISBN: 9789897584862 (print)
In this paper, we propose a blended Attention-Connectionist Temporal Classification (CTC) network architecture for text-image recognition of a unique script, Amharic. Amharic is an indigenous Ethiopic script that uses 34 consonant characters, each with 7 vowel variants, and 50 labialized characters derived, with a small change, from the 34 consonant characters. The change involves modifying the structure of these characters by adding a straight line, or by shortening and/or elongating one of the main legs, including the addition of small diacritics to the right, left, top, or bottom of the character. Such a small change affects the orthographic identity of a character and results in shape similarity among characters, which makes Amharic an interesting but challenging task for OCR research. Motivated by the recent success of attention mechanisms in neural machine translation, we propose an attention-based CTC approach that blends the attention mechanism directly into the CTC network. The proposed model consists of an encoder module, an attention module, and a transcription module in a unified framework. Experiments on the Amharic language show that the attention mechanism allows learning powerful representations by integrating information from different time steps. Our method outperforms state-of-the-art methods and achieves character error rates of 1.04% and 0.93% on the ADOCR test datasets.
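The character error rate reported above is usually defined as the edit distance between predicted and reference text divided by the number of reference characters. The following is a minimal sketch of that common definition; whether the paper uses exactly this normalization is an assumption, and the function names and sample strings are hypothetical.

```python
from typing import Sequence


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(
    references: Sequence[str], hypotheses: Sequence[str]
) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical text lines: one substitution in four reference characters.
cer = character_error_rate(["abcd"], ["abed"])
```

Python strings iterate over Unicode code points, so the same functions apply to Ethiopic characters, each of which occupies a single code point.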
ISBN: 9789897584909 (print)
Automatically generating diagnostic reports with accurate content, standardized structure, and clear semantics is very challenging because of the complexity of medical images and the detailed paragraph descriptions written for them. The structure and the semantic content of the historical report are very helpful for generating the current report. This paper proposes a text report generation method assisted by historical reports. In the proposed method, the previous report and the keywords generated from the current images are modeled by two separate encoders. A co-attention mechanism is introduced to jointly learn from the historical reports and the keywords. A decoder based on the co-attention is used to generate a long description of the image. Learning from the historical and current reports in the training set helps to generate an accurate report for a new image. Furthermore, the structure of the historical report helps to generate a more natural text report. We conducted experiments on real ultrasound data provided by a prestigious hospital in China. The experimental results show that the reports generated by the proposed method are closer to the reports written by radiologists.