This paper studies multi-turn text-to-SQL generation, which is a new but important task in semantic parsing. To deal with its two challenges, i.e., multi-turn interaction and cross-domain evaluation, this paper proposes a multiple-integration encoder, which derives the vector representations of user utterances and database schemas using three custom-designed modules for information integration. First, an utterance representation enhancing module is built to integrate the information of history utterances into the representation of each token in the current utterance by attentive selection. Second, a schema discrepancy enhancing module is designed to integrate the previously predicted SQL query into the representations of schema items. Third, a latent schema linking module is employed to integrate schema information into utterance representations to better deal with unseen database schemas. These three modules are all implemented with a lightweight multi-head attention mechanism, which reduces the number of parameters compared with conventional multi-head attention. Experimental results on the SParC dataset show that our method achieves higher multi-turn text-to-SQL generation accuracy than state-of-the-art baselines. Further ablation studies and analyses also demonstrate the effectiveness of the three information-integration modules in the encoder.
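A minimal sketch of what a parameter-reduced multi-head attention layer could look like, assuming the reduction comes from a single fused Q/K/V projection shared across heads (the paper's exact parameterization may differ); the class name, interfaces, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LightweightMultiHeadAttention(nn.Module):
    """Hypothetical lightweight MHA: one fused projection produces Q, K, V,
    which is then split across heads, instead of separate per-head layers."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # assumption
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, query, context):
        # query:   (B, Lq, d_model) e.g. current-utterance tokens
        # context: (B, Lc, d_model) e.g. history utterances or schema items
        B, Lq, _ = query.shape
        Lc = context.size(1)
        d = self.n_heads * self.d_head
        q = self.qkv(query)[..., :d]
        k, v = self.qkv(context)[..., d:].chunk(2, dim=-1)
        def split(x, L):
            return x.view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q, Lq), split(k, Lc), split(v, Lc)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Lq, -1)
        return self.out(out)  # query tokens enriched with context information
```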
ISBN (Print): 9798350364866; 9798350364873
Segmenting the kidney and kidney tumours from CT scans is crucial for addressing the challenges of early kidney cancer detection. Several segmentation methods are available to segment the kidney and kidney tumours from 3D CT scans. However, these methods have several drawbacks, including dependency on pixel-wise classification, limited generalisation, and manual annotation requirements. Hence, this paper introduces a novel kidney and kidney tumour segmentation approach employing an encoder-decoder-based architecture. The proposed segmentation approach is assessed against two encoder-decoder-based architectures, namely U-Net and DeepLabv3+. The proposed approach precisely identifies the kidney and kidney tumours in a 3D CT scan. Its performance is analysed using the 2023 Kidney and Kidney Tumour Segmentation Challenge (KiTS23) dataset. Evaluation metrics such as the Dice coefficient and Intersection over Union (IoU) are used to assess performance. Our results on the KiTS23 dataset show that DeepLabv3+ outperforms U-Net, so the paper discusses the DeepLabv3+ approach in detail. DeepLabv3+ achieves an average improvement over U-Net of 0.82% in Dice coefficient, 1.60% in IoU, and 39.28% in loss during training, and 0.94% in Dice coefficient, 1.82% in IoU, and 44.88% in loss during validation.
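For reference, the Dice coefficient and IoU for a binary segmentation mask can be computed as in the sketch below; this is a generic illustration, not the KiTS23 challenge's official evaluation script.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Compute Dice coefficient and IoU for binary masks (values 0/1)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou

# Example: compare a predicted kidney mask against ground truth.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_and_iou(pred, gt))  # approx. (0.667, 0.5)
```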
ISBN (Print): 9781665473583
Deep neural networks have been widely used in medical image analysis, and medical image segmentation is one of the most important tasks. U-shaped encoder-decoder networks are prevailing and have achieved great success in various segmentation tasks. While CNNs treat an image as a grid of pixels in Euclidean space and Transformers recognize an image as a sequence of patches, a graph-based representation is more general and can construct connections between any parts of an image. In this paper, we propose ViG-UNet, a novel graph neural network-based U-shaped architecture with an encoder, a decoder, a bottleneck, and skip connections. The downsampling and upsampling modules are also carefully designed. Experimental results on the ISIC 2016, ISIC 2017 and Kvasir-SEG datasets demonstrate that our proposed architecture outperforms most existing classic and state-of-the-art U-shaped networks.
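A minimal sketch of the graph-based view of an image that ViG-style models rely on: patch features are connected to their k nearest neighbours in feature space and aggregated with a max-relative update. The aggregation rule and tensor layout here are assumptions for illustration, not ViG-UNet's exact grapher block.

```python
import torch

def knn_graph_aggregate(x: torch.Tensor, k: int = 8) -> torch.Tensor:
    """x: (B, N, C) patch features. Connect each patch to its k nearest
    neighbours and aggregate with a max-relative update, returning (B, N, 2C)."""
    dist = torch.cdist(x, x)                                   # (B, N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices[..., 1:]     # (B, N, k), drop self
    B, N, C = x.shape
    neighbours = torch.gather(
        x.unsqueeze(1).expand(B, N, N, C), 2,
        idx.unsqueeze(-1).expand(B, N, k, C)
    )                                                          # (B, N, k, C)
    relative = neighbours - x.unsqueeze(2)                     # neighbour minus centre
    return torch.cat([x, relative.max(dim=2).values], dim=-1)
```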
ISBN (Print): 9789819916412; 9789819916429
Generating textual descriptions of images by describing them in words is a fundamental problem that connects computer vision and natural language processing. A single image may include several entities, with their orientations, appearance, and position in a scene as well as their complex spatial interactions, leading to many possible captions for an image. Beam search has been employed for sentence generation for the last couple of decades, although it returns roughly the same captions with only minor changes in wording. Another search strategy, Diverse M-Best (where M denotes the number of independent, diverse beam searches), runs M beam searches from diverse starting statements, keeps the best output from each beam search, and discards the remaining (B-1) captions. This method mostly yields many diverse generated sequences, but running beam search M times is computationally expensive. Building on this prior work in vision, we have devised and implemented a novel algorithm, Modified Beam Search (MBS), for generating diverse and better captions, with an increase in computational complexity compared to beam search. We obtained improvements of 1-3% in BLEU-3 and BLEU-4 scores over the top-2 predicted captions from the original beam search.
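A compact sketch of standard beam search over a step function that returns log-probabilities for the next token; the `step_fn` interface and the toy bigram table are assumptions meant only to illustrate the baseline that diverse variants such as Diverse M-Best and MBS build on.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    """step_fn(prefix) -> dict {token: log_prob} for the next token."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

# Toy usage: a fixed bigram model over a tiny vocabulary.
table = {"<s>": {"a": math.log(0.6), "the": math.log(0.4)},
         "a": {"cat": math.log(0.7), "dog": math.log(0.3)},
         "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
         "cat": {"</s>": 0.0}, "dog": {"</s>": 0.0}}
print(beam_search(lambda seq: table[seq[-1]], "<s>", "</s>", beam_size=2))
```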
The encoder-decoder architecture is widely used as a lightweight semantic segmentation network. However, its performance is limited compared to well-designed Dilated-FCN models for two major reasons. First, commonly used upsampling methods in the decoder, such as interpolation and deconvolution, suffer from a local receptive field and are unable to encode global contexts. Second, low-level features may introduce noise into the decoder through skip connections because early encoder layers lack adequate semantic concepts. To tackle these challenges, a Global Enhancement Method is proposed to aggregate global information from high-level feature maps and adaptively distribute it to different decoder layers, alleviating the shortage of global contexts in the upsampling process. Besides, a Local Refinement Module is developed that uses the decoder features as semantic guidance to refine the noisy encoder features before the two are fused. Then, the two methods are integrated into a Context Fusion Block, and based on that, a novel Attention-guided Global enhancement and Local refinement Network (AGLN) is elaborately designed. Extensive experiments on the PASCAL Context, ADE20K, and PASCAL VOC 2012 datasets demonstrate the effectiveness of the proposed approach. In particular, with a vanilla ResNet-101 backbone, AGLN achieves a state-of-the-art result (56.23% mean IoU) on the PASCAL Context dataset. The code is available at https://***/zhasen1996/AGLN.
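A minimal sketch of the idea behind global enhancement: pool global context from a high-level feature map and redistribute it to a decoder feature map through channel-wise gating. This is an illustrative simplification with assumed module and argument names, not AGLN's actual Context Fusion Block.

```python
import torch
import torch.nn as nn

class GlobalEnhancement(nn.Module):
    """Aggregate global context from high-level features and inject it
    into a decoder feature map via channel-wise gating (illustrative)."""
    def __init__(self, high_channels: int, dec_channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(high_channels, dec_channels),
            nn.Sigmoid(),
        )

    def forward(self, high_feat, dec_feat):
        # high_feat: (B, Ch, H, W) deepest encoder features
        # dec_feat:  (B, Cd, H', W') decoder features at some stage
        context = high_feat.mean(dim=(2, 3))          # global average pooling
        gate = self.gate(context)[:, :, None, None]   # (B, Cd, 1, 1)
        return dec_feat + dec_feat * gate             # globally re-weighted decoder features
```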
Routine visual inspection of concrete structures is essential to maintain safe conditions. Therefore, studies of concrete crack segmentation using deep learning methods have been conducted extensively in recent years. However, insufficient performance remains a major challenge in diverse field-inspection scenarios. In this study, a novel SegCrack model for pixel-level crack segmentation is therefore proposed, using a hierarchically structured Transformer encoder to output multiscale features and a top-down pathway with lateral connections to progressively upsample and fuse features from the deepest layer of the encoder. Furthermore, an online hard example mining strategy was adopted to strengthen the detection of hard samples and improve model performance. The effect of dataset size on segmentation performance was then investigated. The results indicated that SegCrack achieved a precision, recall, F1 score, and mean intersection over union of 96.66%, 95.46%, 96.05%, and 92.63%, respectively, on the test set.
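Online hard example mining for pixel-wise cross-entropy can be sketched as below: only the hardest pixels (those with the highest loss) contribute to the averaged loss. The keep ratio and ignore index are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, keep_ratio=0.25, ignore_index=255):
    """logits: (B, C, H, W), target: (B, H, W). Average the loss only over
    the hardest `keep_ratio` fraction of pixels."""
    loss = F.cross_entropy(logits, target, reduction="none",
                           ignore_index=ignore_index)   # per-pixel loss (B, H, W)
    loss = loss.flatten()
    n_keep = max(1, int(keep_ratio * loss.numel()))
    hard_loss, _ = loss.topk(n_keep)                    # hardest pixels only
    return hard_loss.mean()
```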
In this study, we aimed to develop and assess a hydrological model using a deep learning algorithm for improved water management. Single-output long short-term memory (LSTM SO) and encoder-decoder long short-term memory (LSTM ED) models were developed, and their performances were compared using different input variables. We used water-level and rainfall data from 2018 to 2020 for the Takayama Reservoir (Nara Prefecture, Japan) to train, test, and assess both models. The root-mean-square error and Nash-Sutcliffe efficiency were estimated to compare the model performances. The results showed that the LSTM ED model had better accuracy. Using both water levels and water-level changes as inputs produced better results than using water levels alone. However, the accuracy of the model was significantly lower when predicting water levels outside the range of the training datasets. Within this range, the developed model could be used for water management to reduce the risk of downstream flooding while ensuring sufficient water storage for irrigation, because it can determine an appropriate amount of water to release from the reservoir before rainfall events.
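A minimal encoder-decoder LSTM for multi-step water-level forecasting might look like the sketch below; the layer sizes, input variables, and the autoregressive decoding scheme are assumptions for illustration, not the study's exact configuration.

```python
import torch
import torch.nn as nn

class LSTMEncoderDecoder(nn.Module):
    """Encode a past window of (rainfall, water level) and decode a
    multi-step forecast of future water levels (illustrative sizes)."""
    def __init__(self, n_features=2, hidden=64, horizon=24):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, past):                 # past: (B, T_in, n_features)
        _, (h, c) = self.encoder(past)
        y = past[:, -1:, -1:]                # seed with the last observed water level
        outputs = []
        for _ in range(self.horizon):        # feed back the previous prediction
            out, (h, c) = self.decoder(y, (h, c))
            y = self.head(out)               # (B, 1, 1)
            outputs.append(y)
        return torch.cat(outputs, dim=1)     # (B, horizon, 1) forecast
```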
Water body segmentation is an important tool for the hydrological monitoring of the Earth. With the rapid development of convolutional neural networks, semantic segmentation techniques have been used on remote sensing images to extract water bodies. However, some difficulties need to be overcome to achieve good results in water body segmentation, such as complex backgrounds, huge scale variation, water connectivity, and rough edges. In this study, a water body segmentation model (DUPnet) with dense connectivity and multi-scale pyramid pooling is proposed to rapidly and accurately extract water bodies from Gaofen satellite and Landsat 8 OLI (Operational Land Imager) images. The proposed method includes three parts: (1) a multi-scale spatial pyramid pooling module (MSPP) is introduced to combine shallow and deep features for small water bodies and to compensate for the feature loss caused by the sampling process; (2) dense blocks are used in DUPnet's backbone to extract more spatial features, increasing feature propagation and reuse; (3) a regression loss function is proposed to train the network to deal with the unbalanced dataset caused by small water bodies. The experimental results show that the F1, MIoU, and FWIoU of DUPnet on the 2020 Gaofen dataset are 97.67%, 88.17%, and 93.52%, respectively, and on the Landsat River dataset they are 96.52%, 84.72%, and 91.77%, respectively.
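The multi-scale spatial pyramid pooling idea can be sketched as pooling a feature map at several grid sizes, projecting each pooled map, upsampling back, and concatenating with the input; the pooling sizes and channel counts below are illustrative assumptions, not DUPnet's exact MSPP module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePyramidPooling(nn.Module):
    """Pool the input at several grid sizes, reduce channels, upsample,
    and concatenate with the input (illustrative MSPP-style block)."""
    def __init__(self, in_channels, out_channels, sizes=(1, 2, 4, 8)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_channels, out_channels, 1))
            for s in sizes
        )

    def forward(self, x):                    # x: (B, C, H, W)
        h, w = x.shape[2:]
        pyramids = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear",
                          align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(pyramids, dim=1)    # (B, C + len(sizes)*out_channels, H, W)
```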
This paper presents a novel two-stage 3D point cloud object detector named ASCNet for autonomous driving. Most current works project 3D point clouds to 2D space, where quantization loss in the transformation is inevitable. A Pillar-wise Spatial Context Feature Encoding (PSCFE) module is proposed in this paper to drive the learning of discriminative features and reduce the loss of detailed information. The inhomogeneity inherent in 3D object detection from point clouds, such as the inconsistent number of points in the pillars and the diverse sizes of Regions of Interest (RoI), should be treated carefully due to the sparsity and individual specificity of the data. We introduce a length-adaptive RNN-based module to address this inhomogeneity. A novel backbone combining an encoder-decoder structure and shortcut connections is designed to learn multi-scale features for 3D object detection. Additionally, we utilize multiple RoI heads and class-wise NMS to deal with the class imbalance in scenes. Extensive experiments on the KITTI dataset demonstrate that our algorithm achieves competitive performance in 3D bounding box detection and BEV detection. (c) 2021 Elsevier B.V. All rights reserved.
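Class-wise NMS, mentioned above, simply runs non-maximum suppression independently per class. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) and is a generic illustration, not ASCNet's implementation for rotated 3D boxes.

```python
import torch
from torchvision.ops import nms

def classwise_nms(boxes, scores, labels, iou_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,); labels: (N,).
    Run NMS independently for each class and return the kept indices."""
    keep = []
    for cls in labels.unique():
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        kept = nms(boxes[idx], scores[idx], iou_thresh)  # per-class suppression
        keep.append(idx[kept])
    return torch.cat(keep) if keep else torch.empty(0, dtype=torch.long)
```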
Face parsing refers to the labeling of each facial component in a face image and has been employed in facial simulation, expression recognition, and makeup applications, effectively providing a basis for further analysis, computation, animation, modification, and numerous other applications. Although existing face parsing methods have demonstrated good performance, they fail to extract rich features and recover accurate segmentation maps, particularly for faces with large variations in expression and highly similar appearances. Moreover, these approaches neglect the semantic gaps and dependencies between facial categories and their boundaries. To address these drawbacks, we propose an efficient dilated convolution network with different aspect ratios to attain accurate face parsing output by exploiting its feature extraction capability. The proposed multiscale dilated encoder-decoder convolution model obtains rich component information and efficiently improves the capture of global information by obtaining low- and high-level semantic features. To achieve delicate parsing of the face components along their borders and to analyze the connections between face categories and their border edges, a semantic edge map is learned using a conditional random field, which aims to distinguish border and non-border pixels during modeling. We conducted experiments using three well-known publicly available face databases. The recorded results demonstrate the high accuracy and capacity of the proposed method in comparison with previous state-of-the-art methods. Our proposed model achieved a mean accuracy of 90% on the CelebAMask-HQ dataset for the category case and 81.43% for the accessory case, and achieved accuracies of 91.58% and 92.44% on the HELEN and LaPa datasets, respectively, thereby demonstrating its effectiveness. (C) 2022 The Author(s). Published by Elsevier B.V.
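One way to realize dilated convolutions with different aspect ratios is to run parallel 3x3 branches with asymmetric dilation rates, as in the sketch below; the specific rates, channel counts, and fusion step are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AspectRatioDilatedBlock(nn.Module):
    """Parallel 3x3 convolutions with asymmetric dilation rates so each
    branch covers a receptive field with a different aspect ratio."""
    def __init__(self, in_ch, out_ch, rates=((1, 1), (1, 3), (3, 1), (2, 2))):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):                        # x: (B, in_ch, H, W)
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))  # fused multi-aspect features
```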