Semantic segmentation, as a pixel-level recognition task, has been widely used in a variety of practical scenes. Most existing methods try to improve network performance by fusing information from high and low layers. Simple concatenation or element-wise addition, however, leads to unbalanced fusion and low utilization of inter-level features. To solve this problem, we propose the Inter-Level Feature Balanced Fusion Network (IFBFNet) to guide inter-level feature fusion towards a more balanced and effective direction. Our overall network follows an encoder-decoder architecture. In the encoder, we use a relatively deep convolutional network to extract rich semantic information. In the decoder, skip connections fuse in low-level spatial features to gradually restore clearer boundaries, and we add an inter-level feature balanced fusion module to each skip connection. Additionally, to better capture boundary information, we add a shallower spatial information stream that supplements finer spatial details. Experiments demonstrate the effectiveness of our module. IFBFNet achieves competitive performance on the Cityscapes dataset using only finely annotated data for training and improves substantially over the baseline network.
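The abstract does not specify the internals of the balanced fusion module, so the following is only a minimal sketch of one plausible reading: low- and high-level skip features are projected to a common width and combined through a learned gate so that neither level dominates. The class and parameter names (BalancedFusion, low_ch, high_ch, out_ch) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedFusion(nn.Module):
    """Hypothetical inter-level fusion: re-weight low- and high-level
    features with a learned gate so neither branch dominates the sum."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.high_proj = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        # Gate predicts a per-channel balance weight from the concatenation.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        # Upsample the semantically rich (but coarse) high-level map
        # to the spatial size of the low-level map.
        high = F.interpolate(self.high_proj(high), size=low.shape[2:],
                             mode="bilinear", align_corners=False)
        low = self.low_proj(low)
        w = self.gate(torch.cat([low, high], dim=1))   # balance weight in (0, 1)
        return w * low + (1 - w) * high                # convex, "balanced" fusion

# Usage on dummy skip-connection tensors.
fuse = BalancedFusion(low_ch=64, high_ch=256, out_ch=128)
out = fuse(torch.randn(1, 64, 128, 128), torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 128, 128, 128])
```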
Generating natural questions from an image is a semantic task that requires using vision and language modalities to learn multimodal representations. Images can carry multiple visual and language cues such as places, captions, and tags. In this paper, we propose a principled deep Bayesian learning framework that combines these cues to produce natural questions. We observe that with the addition of more cues, and by minimizing the uncertainty among the cues, the Bayesian network becomes more confident. We propose Minimizing Uncertainty of Mixture of Cues (MUMC), which minimizes the uncertainty present in a mixture of cue experts for generating probabilistic questions. The framework is Bayesian, and the generated questions show a remarkable similarity to natural questions, as validated by a human study. Ablation studies of our model indicate that using a subset of the cues is inferior at this task, and hence the principled fusion of cues is preferred. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU-n, METEOR, ROUGE, and CIDEr). The project page for Deep Bayesian VQG is available at https://***/BVQG/. (c) 2021 Elsevier B.V. All rights reserved.
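The exact probabilistic formulation of MUMC is not given in the abstract; purely as an illustration, one common way to let less-uncertain cues dominate a fused representation is precision-weighted averaging of per-cue Gaussian estimates. Everything below (CueExpert, fuse_cues, the dimensions) is a hypothetical sketch, not the paper's model.

```python
import torch
import torch.nn as nn

class CueExpert(nn.Module):
    """Hypothetical expert: maps one cue embedding (image, place, caption,
    or tag) to a mean and log-variance over a shared latent space used to
    condition the question decoder."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)

    def forward(self, x):
        return self.mu(x), self.logvar(x)

def fuse_cues(experts, cues):
    """Precision-weighted fusion: cues with lower predictive variance
    (i.e., less uncertainty) contribute more to the fused latent."""
    mus, precisions = [], []
    for expert, cue in zip(experts, cues):
        mu, logvar = expert(cue)
        mus.append(mu)
        precisions.append(torch.exp(-logvar))        # 1 / sigma^2
    mus = torch.stack(mus)                            # (num_cues, B, D)
    precisions = torch.stack(precisions)
    return (precisions * mus).sum(0) / precisions.sum(0)

experts = nn.ModuleList([CueExpert(512, 256) for _ in range(3)])
cues = [torch.randn(4, 512) for _ in range(3)]        # e.g. place, caption, tag embeddings
print(fuse_cues(experts, cues).shape)                  # torch.Size([4, 256])
```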
Maritime safety is an important issue for global shipping industries. Currently, most collision accidents at sea are caused by misjudgement by ships' operators. The deployment of maritime autonomous surface ships (MASS) can greatly reduce ships' reliance on human operators by using an automated intelligent collision avoidance system to replace human decision-making. To successfully develop such a system, the capability to autonomously identify other ships and evaluate the associated encounter situation is of paramount importance. In this paper, we aim to identify ships' encounter situation modes using deep learning methods based upon Automatic Identification System (AIS) data. First, a segmentation process is developed to divide each ship's AIS data into segments that contain only one encounter situation mode. This differs from the majority of studies, which have proposed encounter situation mode classification using hand-crafted features that may not reflect the ship's actual movement states. Furthermore, many existing classification approaches rely on substantial labelled AIS data and a supervised training paradigm, which is not applicable to our dataset as it contains a large amount of unlabelled AIS data. Therefore, a method called Semi-Supervised Convolutional Encoder-Decoder Network (SCEDN) for ship encounter situation classification based on AIS data is proposed. The network is not only able to automatically extract features from AIS segments but can also share training parameters with the unlabelled data. The SCEDN uses an encoder-decoder convolutional structure with four channels for each segment: distance, speed, Time to the Closest Point of Approach (TCPA), and Distance to the Closest Point of Approach (DCPA). The performance of the SCEDN model is evaluated against several baselines, with the experimental results demonstrating that a higher accuracy can be achieved by our model.
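As a rough illustration of the semi-supervised encoder-decoder idea described above, the sketch below pairs a 1-D convolutional encoder-decoder (which can be trained to reconstruct unlabelled 4-channel AIS segments) with a small classification head on the shared encoding. Layer sizes, the segment length, and the number of encounter classes are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SemiSupConvED(nn.Module):
    """Sketch of a semi-supervised 1-D convolutional encoder-decoder: the
    decoder reconstructs the 4-channel AIS segment (usable on unlabelled
    data), while a small head classifies the encounter mode from the shared
    encoding (usable on labelled data)."""
    def __init__(self, n_channels=4, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(32, n_channels, 4, stride=2, padding=1),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, n_classes)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = SemiSupConvED()
segment = torch.randn(8, 4, 64)            # channels: distance, speed, TCPA, DCPA
recon, logits = model(segment)
recon_loss = nn.functional.mse_loss(recon, segment)   # unsupervised objective
print(recon.shape, logits.shape)           # (8, 4, 64) and (8, 3)
```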
In this paper, we propose a novel stroke constrained attention network (SCAN) which treats the stroke as the basic unit for encoder-decoder based online handwritten mathematical expression recognition (HMER). Unlike previous methods which use trace points or image pixels as basic units, SCAN makes full use of stroke-level information for better alignment and representation. The proposed SCAN can be adopted in both single-modal (online or offline) and multi-modal HMER. For single-modal HMER, SCAN first employs a CNN-GRU encoder to extract point-level features from input traces in online mode and a CNN encoder to extract pixel-level features from input images in offline mode, and then uses stroke-constrained information to convert them into online and offline stroke-level features. Using stroke-level features explicitly groups points or pixels belonging to the same stroke, thereby reducing the difficulty of symbol segmentation and recognition for the attention-based decoder. For multi-modal HMER, in addition to fusing multi-modal information in the decoder, SCAN can also fuse multi-modal information in the encoder by utilizing the stroke-based alignments between online and offline modalities. Encoder fusion is a better way of combining multi-modal information, as it moves the information interaction one step before decoder fusion, so that the advantages of multiple modalities can be exploited earlier and more adequately. In addition, we propose an approach combining encoder fusion and decoder fusion, namely encoder-decoder fusion, which can further improve performance. Evaluated on a benchmark published by the CROHME competition, the proposed SCAN achieves state-of-the-art performance. Furthermore, by conducting experiments on an additional task, online handwritten Chinese character recognition (HCCR), we demonstrate the generality of the proposed method. (c) 2021 Elsevier Ltd. All rights reserved.
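The key data manipulation implied by the abstract, turning point-level features into stroke-level features by grouping points that belong to the same stroke, can be sketched as a simple pooling step. The function below averages point features per stroke id; it is an assumed illustration, and the paper's actual conversion may be more elaborate.

```python
import torch

def stroke_pool(point_feats, stroke_ids):
    """Hypothetical stroke-level pooling: average the point-level features
    of all points that share a stroke id, yielding one vector per stroke.

    point_feats: (num_points, feat_dim) features from a point-level encoder.
    stroke_ids:  (num_points,) integer id of the stroke each point belongs to.
    """
    num_strokes = int(stroke_ids.max()) + 1
    feat_dim = point_feats.size(1)
    sums = torch.zeros(num_strokes, feat_dim).index_add_(0, stroke_ids, point_feats)
    counts = torch.zeros(num_strokes).index_add_(
        0, stroke_ids, torch.ones_like(stroke_ids, dtype=torch.float)
    )
    return sums / counts.unsqueeze(1)

feats = torch.randn(10, 8)                      # 10 trace points, 8-d features
ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3])
print(stroke_pool(feats, ids).shape)            # torch.Size([4, 8]) -> 4 strokes
```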
ISBN (print): 9783030606329; 9783030606336
Since the results of CNN-based methods for monocular depth estimation are often visually unsatisfying, we propose Feature Fusion GAN (FF-GAN) to address this issue. First, an end-to-end network based on an encoder-decoder structure is proposed as the generator of FF-GAN, which can exploit information at different scales. The encoder of our generator fuses features from different levels with a feature fusion module, while the main component of the decoder is a module that captures information from multi-scale receptive fields. Second, to match the generator, the discriminator of FF-GAN is designed to efficiently learn information at different scales by applying a pyramid structure. Experiments on public datasets demonstrate the effectiveness of our generator and discriminator. Compared with CNN methods, the depth maps predicted by FF-GAN show significantly less texture loss and edge blur while maintaining accuracy, and the visual quality is better.
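The abstract mentions a pyramid-structured discriminator that judges depth maps at multiple scales; one simple way to realise that idea is to run a small patch critic over the same depth map at several resolutions. The sketch below is an assumed illustration, with channel widths, the scale set, and the single shared critic being guesses rather than the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidDiscriminator(nn.Module):
    """Sketch of a multi-scale (pyramid) discriminator: the same small
    convolutional critic scores the depth map at several resolutions, so
    both global structure and local detail are judged."""
    def __init__(self, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.critic = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, padding=1),            # patch-level real/fake scores
        )

    def forward(self, depth):
        scores = []
        for s in self.scales:
            x = depth if s == 1.0 else F.interpolate(
                depth, scale_factor=s, mode="bilinear", align_corners=False)
            scores.append(self.critic(x))
        return scores

disc = PyramidDiscriminator()
outs = disc(torch.randn(2, 1, 128, 160))        # a batch of predicted depth maps
print([o.shape for o in outs])                   # one score map per scale
```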
ISBN (print): 9781728141442
Breast cancer diagnosis is based on radiology reports describing observations made from medical imagery, such as X-rays obtained during mammography. The reports are written by radiologists and contain a conclusion summarizing the observations. Manually summarizing the reports is time-consuming and leads to high text variability. This paper investigates the automated summarization of Dutch radiology reports. We propose a hybrid model consisting of a language model (encoder-decoder with attention) and a separate BI-RADS score classifier. The summarization model achieved a ROUGE-L F1 score of 51.5% on the Dutch reports, which is comparable to results in other languages and domains. For BI-RADS classification, the language model (accuracy 79.1%) was outperformed by the separate classifier (accuracy 83.3%), leading us to propose the hybrid approach for radiology report summarization. Our qualitative evaluation with experts found the generated conclusions to be comprehensible and to cover mostly relevant content; the main focus for improvement should be their factual correctness. While the current model is not accurate enough to be employed in clinical practice, our results indicate that hybrid models might be a worthwhile direction for future research.
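The division of labour in the hybrid model, with the free-text conclusion coming from the seq2seq language model and the BI-RADS score from a dedicated classifier, can be made concrete with a small wrapper. The sketch below uses toy stand-ins for both components and is not the authors' implementation; only the interface (report text in, conclusion plus score out) follows the abstract.

```python
from dataclasses import dataclass

@dataclass
class ReportSummary:
    conclusion: str
    birads: int

class HybridSummarizer:
    """Sketch of the hybrid idea: the free-text conclusion comes from a
    seq2seq summarizer, while the BI-RADS score comes from a separate
    classifier that is more accurate at that sub-task. Any summarizer and
    classifier with these call signatures could be plugged in."""
    def __init__(self, seq2seq, classifier):
        self.seq2seq = seq2seq          # report text -> conclusion text
        self.classifier = classifier    # report text -> BI-RADS score

    def summarize(self, report_text: str) -> ReportSummary:
        return ReportSummary(
            conclusion=self.seq2seq(report_text),
            birads=self.classifier(report_text),
        )

# Toy stand-ins, for illustration only.
model = HybridSummarizer(
    seq2seq=lambda text: text.split(".")[0] + ".",   # "summary" = first sentence
    classifier=lambda text: 2,                        # fixed dummy BI-RADS score
)
print(model.summarize("Dense tissue. No suspicious mass or calcifications."))
```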
Video object segmentation, which aims to segment the foreground objects given the annotation of the first frame, has been attracting increasing attention. Many state-of-the-art approaches achieve strong performance by relying on online model updating or mask-propagation techniques. However, most online models incur high computational cost due to model fine-tuning during inference, while most mask-propagation based models are faster but achieve relatively low performance because they fail to adapt to object appearance variation. In this paper, we aim to design a new model that strikes a good balance between speed and performance. We propose a model, called NPMCA-net, which directly localizes foreground objects based on mask propagation and a non-local technique by matching pixels in the reference and target frames. Since we bring in information from both the first and the previous frames, our network is robust to large object appearance variation and can better adapt to occlusions. Extensive experiments show that our approach achieves new state-of-the-art performance at fast speed (86.5% IoU on DAVIS-2016 and 72.2% IoU on DAVIS-2017, at 0.11 s per frame) under the same level of comparison. Source code is available at https://***/siyueyu/NPMCA-net.
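The pixel-matching step the abstract describes, localizing the target object by matching target-frame pixels against reference-frame pixels and carrying the reference mask over, corresponds closely to a non-local (attention-style) affinity. The function below is a generic sketch of that operation under assumed tensor shapes; it is not the exact NPMCA-net formulation.

```python
import torch

def nonlocal_match(target_feat, ref_feat, ref_mask):
    """Sketch of non-local mask propagation: every target pixel attends to
    all reference pixels, and the reference mask is carried over with the
    resulting affinity weights.

    target_feat, ref_feat: (B, C, H, W) frame features.
    ref_mask:              (B, 1, H, W) foreground mask of the reference frame.
    """
    B, C, H, W = target_feat.shape
    t = target_feat.flatten(2).transpose(1, 2)           # (B, HW, C)
    r = ref_feat.flatten(2)                              # (B, C, HW)
    affinity = torch.softmax(t @ r / C ** 0.5, dim=-1)   # (B, HW, HW)
    m = ref_mask.flatten(2).transpose(1, 2)              # (B, HW, 1)
    return (affinity @ m).transpose(1, 2).reshape(B, 1, H, W)

feat_t = torch.randn(1, 64, 30, 40)
feat_r = torch.randn(1, 64, 30, 40)
mask_r = torch.rand(1, 1, 30, 40)
print(nonlocal_match(feat_t, feat_r, mask_r).shape)      # torch.Size([1, 1, 30, 40])
```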
Neural encoder-decoder architectures have been used extensively for image captioning. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are popularly used as encoder and decoder models. Recurrent Neural Networks are popular architectures in natural language processing used for language modeling, but they are sequential in nature. The transformer model solves this problem of sequential dependency by using an attention mechanism. Many works are available for image captioning in the English language, but models for generating Hindi captions are limited; hence, we have tried to fill this gap. We created a Hindi dataset for image captioning by manually translating the popular MSCOCO dataset from English to Hindi. Experimental results show that our proposed model outperforms other models. The proposed model attains a BLEU-1 score of 62.9, a BLEU-2 score of 43.3, a BLEU-3 score of 29.1, and a BLEU-4 score of 19.0.
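As a generic illustration of the CNN-encoder / Transformer-decoder captioning setup (the abstract does not state the exact configuration), the sketch below decodes caption tokens while attending to projected CNN grid features. The vocabulary size, model width, layer counts, and 2048-d ResNet-style features are assumptions; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    """Sketch of a CNN-encoder / Transformer-decoder captioner: grid features
    from a CNN act as memory, and the decoder attends to them while
    generating caption tokens (e.g. Hindi word-piece ids)."""
    def __init__(self, vocab_size, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.feat_proj = nn.Linear(2048, d_model)        # assumed CNN grid features
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, grid_feats, tokens):
        memory = self.feat_proj(grid_feats)               # (B, regions, d_model)
        tgt = self.embed(tokens)                          # (B, seq_len, d_model)
        seq_len = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                           # next-token logits

model = CaptionTransformer(vocab_size=8000)
logits = model(torch.randn(2, 49, 2048), torch.randint(0, 8000, (2, 12)))
print(logits.shape)                                       # torch.Size([2, 12, 8000])
```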
E-Bibliotherapy deals with adolescent psychological stress by manually or automatically recommending multiple reading articles around their stressful events, using electronic devices as a medium. To make E-Bibliotherapy really useful, generating instructive questions before reading is an important step. Such a question shall (a) attract teens' attention; (b) convey the essential message of the reading materials so as to improve teens' active comprehension; and, most importantly, (c) highlight teens' stress to enable them to generate emotional resonance and thus willingness to pursue the reading. Therefore, in this paper we propose to generate instructive questions from the multiple recommended articles to guide teens to read. Four solutions based on the neural encoder-decoder model are presented to tackle the task. For model training and testing, we construct a novel large-scale QA dataset named TeenQA, which is specific to adolescent stress. Because of the variability of question expressions, we incorporate three groups of automatic evaluation metrics as well as one group of human evaluation metrics to examine the quality of the generated questions. The experimental results show that the proposed encoder-decoder with Summary on Contexts with Feature-rich embeddings (ED-SoCF) solution can generate good questions for guiding reading, achieving performance comparable to that of humans on some semantic similarity metrics.
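Only the high-level shape of ED-SoCF is given in the abstract, so the sketch below shows the plain encoder-decoder skeleton it builds on: a recurrent encoder reads the (summarised) article contexts and a recurrent decoder emits question tokens. The feature-rich embeddings and context summarisation of ED-SoCF are not reproduced; sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class QuestionGenerator(nn.Module):
    """Generic sketch of the encoder-decoder idea: a GRU encoder reads the
    article contexts and a GRU decoder emits question tokens conditioned on
    the final encoder state. Plain word embeddings stand in for the
    feature-rich embeddings used by ED-SoCF."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, context_ids, question_ids):
        _, h = self.encoder(self.embed(context_ids))        # summary of the contexts
        dec_out, _ = self.decoder(self.embed(question_ids), h)
        return self.out(dec_out)                             # next-token logits

model = QuestionGenerator(vocab_size=5000)
logits = model(torch.randint(0, 5000, (4, 200)),             # concatenated contexts
               torch.randint(0, 5000, (4, 15)))               # question so far
print(logits.shape)                                           # torch.Size([4, 15, 5000])
```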
ISBN (print): 9791095546344
We investigate the effect of data augmentation on low-resource morphological segmentation. We compare two settings: the pure low-resource one, in which only 100 annotated word forms are available, and the augmented one, in which we use the original training set and 1000 unlabeled word forms to generate 1000 artificial inflected forms. Evaluating on the Sigmorphon 2018 dataset, we observe that using the better of these two models reduces the error rate of the state-of-the-art model by 6%, while for our baseline model the error reduction is 17%.
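To make the closing figures concrete: the 6% and 17% are relative error-rate reductions. The snippet below simply spells out that arithmetic; the baseline and new error rates used here are hypothetical, chosen only so the ratios come out to the reported percentages.

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative error-rate reduction, as reported in the abstract."""
    return (baseline_err - new_err) / baseline_err

# Hypothetical error rates, chosen only to illustrate the arithmetic:
# a 6% relative reduction over a 20% error rate lands at 18.8%.
print(f"{relative_error_reduction(0.20, 0.188):.0%}")   # 6%
print(f"{relative_error_reduction(0.20, 0.166):.0%}")   # 17%
```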