Recently, the stereo matching task has been dramatically advanced by deep learning methods. In particular, the encoder-decoder framework with skip connections achieves outstanding performance. The skip connection scheme carries detailed, or in other words residual, information into the final prediction and thus improves performance; it has been successfully applied in many other pixel-wise prediction tasks, such as semantic segmentation and depth estimation. In contrast to those tasks, the residual information for stereo matching can be obtained explicitly, by back-warping the right image and computing the reconstruction error. The reconstruction error has been successfully used as an unsupervised loss, but has not been explored as a skip connection. In this Letter, the authors show that the reconstruction error in the feature space is very helpful for bringing residual information into the final prediction. They validate the effectiveness of using the reconstruction error for skip connections through experiments on the KITTI 2015 and Scene Flow datasets. The experiments show that the proposed scheme improves performance by a notable margin and achieves state-of-the-art performance with a very fast processing time.
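Since the back-warping step is the core of the described skip connection, a minimal PyTorch sketch may help make it concrete: it warps right-view features to the left view using a disparity map and takes the difference as the reconstruction error. The function names and the use of grid_sample are our own assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def warp_right_to_left(right_feat, disparity):
    """Back-warp right-view features to the left view using disparity.

    right_feat: (B, C, H, W) features from the right image.
    disparity:  (B, 1, H, W) left-view disparity estimate, in pixels.
    """
    b, _, h, w = right_feat.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().to(right_feat.device).expand(b, h, w)
    ys = ys.float().to(right_feat.device).expand(b, h, w)
    # A left-image pixel at x corresponds to the right-image pixel at x - d.
    xs = xs - disparity.squeeze(1)
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(right_feat, grid, align_corners=True)

def reconstruction_error_skip(left_feat, right_feat, disparity):
    # The difference between left features and the warped right features
    # carries the residual information used as the skip connection.
    return left_feat - warp_right_to_left(right_feat, disparity)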
Image captioning means automatically generating a caption for an image. As a recently emerged research area, it is attracting more and more attention. To achieve the goal of image captioning, the semantic information of images needs to be captured and expressed in natural language. Connecting the research communities of computer vision and natural language processing, image captioning is quite a challenging task. Various approaches have been proposed to solve this problem. In this paper, we present a survey of advances in image captioning research. Based on the technique adopted, we classify image captioning approaches into different categories. Representative methods in each category are summarized, and their strengths and limitations are discussed. We first discuss methods used in early work, which are mainly retrieval- and template-based. Then, we focus our main attention on neural network based methods, which give state-of-the-art results. Neural network based methods are further divided into subcategories based on the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets, followed by a discussion of future research directions. (C) 2018 Elsevier B.V. All rights reserved.
This paper studies a couplet generation model that automatically generates the second line of a couplet given the first line. Unlike other sequence generation problems, couplet generation must consider not only the sequential context within a line but also the relationships between the corresponding words of the first and second lines. Therefore, we first develop a trapezoidal context character embedding vector model, which considers the 'sequence context' and the 'corresponding word context' simultaneously. We then adopt the typical encoder-decoder framework for this sequence-to-sequence problem, using a bi-directional GRU as the encoder and a GRU as the decoder. To further increase the semantic consistency of the first and second lines of couplets, the pre-trained sentence vector of the first line is added to the attention mechanism of the model. To verify the effectiveness of the method, we apply it to a real data set. Experimental results show that our proposed model is competitive with up-to-date methods, and that both adding sentence vectors to the attention and using trapezoidal context character vectors improve the effectiveness of the algorithm.
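As an illustration of how a pre-trained sentence vector can enter the attention step, here is a minimal PyTorch sketch; the module name, dimensions, and additive scoring form are assumptions rather than the paper's code.

import torch
import torch.nn as nn

class SentenceAwareAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, sent_dim, attn_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim + dec_dim + sent_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_outs, dec_state, sent_vec):
        """enc_outs: (B, T, enc_dim) bi-GRU encoder outputs;
        dec_state: (B, dec_dim) current GRU decoder state;
        sent_vec: (B, sent_dim) pre-trained vector of the first line."""
        t = enc_outs.size(1)
        # Broadcast the decoder state and sentence vector over encoder steps,
        # so alignment scores also reflect whole-sentence semantics.
        dec = dec_state.unsqueeze(1).expand(-1, t, -1)
        sent = sent_vec.unsqueeze(1).expand(-1, t, -1)
        e = self.score(torch.tanh(self.proj(torch.cat([enc_outs, dec, sent], -1))))
        alpha = torch.softmax(e.squeeze(-1), dim=1)            # (B, T)
        context = (alpha.unsqueeze(-1) * enc_outs).sum(dim=1)  # (B, enc_dim)
        return context, alpha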
Efficient RGB-D semantic segmentation has received considerable attention in mobile robotics, where it plays a vital role in analyzing and recognizing environmental information. According to previous studies, depth information can provide the geometric relationships of objects and scenes, but real depth data are usually noisy. To avoid unfavorable effects on segmentation accuracy and computation, it is necessary to design an efficient framework that leverages cross-modal correlations and complementary cues. In this article, we propose an efficient lightweight encoder-decoder network that reduces the computational cost and guarantees the robustness of the algorithm. Equipped with channel and spatial fusion attention modules, our network effectively captures multi-level RGB-D features. A globally guided local affinity context module is proposed to obtain sufficient high-level context information. The decoder uses a lightweight residual unit (LRU) that combines short- and long-distance information with few redundant computations. Experimental results on the NYUv2, SUN RGB-D, and Cityscapes datasets show that our method achieves a better tradeoff among segmentation accuracy, inference time, and parameter count than state-of-the-art (SOTA) methods.
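To make the channel and spatial fusion attention idea concrete, a small PyTorch sketch follows: RGB and depth features are merged and then re-weighted per channel and per location. The module layout, reduction ratio, and kernel size are our assumptions, not the authors' released code.

import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite per channel.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial attention: a single-channel gate over locations.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, rgb, depth):
        fused = rgb + depth                  # merge the two modalities
        fused = fused * self.channel(fused)  # re-weight per channel
        fused = fused * self.spatial(fused)  # re-weight per spatial location
        return fused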
Automated image caption generation with attention mechanisms focuses on visual features of the image, including objects, attributes, actions, and scenes, to understand and produce more detailed captions, and has attracted great attention in the multimedia field. However, deciding which aspects of an image to highlight for better captioning remains a challenge. Most advanced captioning models use only one attention module to assign attention weights to visual vectors, which may not be enough to create an informative caption. To tackle this issue, we propose an innovative and well-designed Guided Visual Attention (GVA) approach that incorporates an additional attention mechanism to re-adjust the attention weights on the visual feature vectors and feeds the resulting context vector to the language LSTM. Using the first-level attention module as guidance for the GVA module and re-weighting the attention weights significantly enhances caption quality. Deep neural networks have enabled the encoder-decoder architecture to make use of visual attention mechanisms: Faster R-CNN is used for feature extraction in the encoder, and a visual attention-based LSTM is applied in the decoder. Extensive experiments have been conducted on both the MS-COCO and Flickr30k benchmark datasets. Compared with state-of-the-art methods, our approach achieved an average improvement of 2.4% on BLEU@1 and 13.24% on CIDEr for the MS-COCO dataset, as well as 4.6% on BLEU@1 and 12.48% on CIDEr for the Flickr30k dataset, based on cross-entropy optimization. These results demonstrate the clear superiority of our proposed approach over existing methods on standard evaluation metrics. The implementation code can be found here: (https://***/mdbipu/GVA).
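A minimal PyTorch sketch of the two-stage idea, assuming a second attention module that takes the first module's weights as extra input before the context vector goes to the language LSTM; all names and scoring forms are illustrative, not the released GVA code.

import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, attn_dim):
        super().__init__()
        self.first = nn.Linear(feat_dim + hid_dim, attn_dim)
        self.first_score = nn.Linear(attn_dim, 1)
        # Second stage also sees the first-stage weight for each region.
        self.second = nn.Linear(feat_dim + hid_dim + 1, attn_dim)
        self.second_score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        """regions: (B, K, feat_dim) region features (e.g. from Faster R-CNN);
        hidden: (B, hid_dim) current LSTM hidden state."""
        k = regions.size(1)
        h = hidden.unsqueeze(1).expand(-1, k, -1)
        # First-level attention over region features.
        a1 = torch.softmax(self.first_score(torch.tanh(
            self.first(torch.cat([regions, h], -1)))).squeeze(-1), dim=1)
        # Second-level attention, guided by the first-level weights.
        a2 = torch.softmax(self.second_score(torch.tanh(
            self.second(torch.cat([regions, h, a1.unsqueeze(-1)], -1)))).squeeze(-1), dim=1)
        context = (a2.unsqueeze(-1) * regions).sum(dim=1)
        return context  # fed to the language LSTM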
ISBN:
(Print) 9781450365628
With the increasing availability of medical images from different modalities (X-ray, CT, PET, MRI, ultrasound, etc.), and with the huge advances in fast, accurate, and enhanced computing power from current graphics processing units, automatic caption generation from medical images has become a new way to improve healthcare and a key method for getting better results at lower cost. In this paper, we give a comprehensive overview of the task of image captioning in the medical domain, covering existing models, the benchmark medical image caption datasets, and the evaluation metrics that have been used to measure the quality of the generated captions.
Significant improvement has been achieved in automated audio captioning (AAC) with recent models. However, these models have become increasingly large as their performance is enhanced. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in encoder-decoder based AAC models, it is more effective to distill knowledge into the encoder than into the decoder. To this end, we incorporate an encoder-level KD loss into training, in addition to the standard supervised loss and a sequence-level KD loss. We investigate two encoder-level KD methods, based on mean squared error (MSE) loss and contrastive loss, respectively. Experimental results demonstrate that contrastive KD is more robust than MSE KD, exhibiting superior performance in data-scarce situations. By leveraging audio-only data during training in the KD framework, our student model achieves competitive performance, with an inference speed that is 19 times faster.
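A contrastive encoder-level KD loss of the kind described can be sketched in a few lines of PyTorch; the InfoNCE form, projection-free setup, and temperature value here are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_emb, teacher_emb, temperature=0.07):
    """student_emb, teacher_emb: (B, D) pooled encoder outputs for the same
    audio clips. Matching (student, teacher) pairs are positives; all other
    pairs in the batch serve as negatives."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

The MSE variant would simply replace this with F.mse_loss(student_emb, teacher_emb); the contrastive form only requires student and teacher embeddings to agree relative to the rest of the batch, which plausibly explains its robustness when data are scarce.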
ISBN:
(Print) 9783031636455; 9783031636462
Numerical reasoning over hybrid data aims to extract critical facts from long-form documents and tables and to generate arithmetic expressions based on these facts to answer a question. Most existing methods are based on the retriever-generator model. However, the inferential power of the retriever-generator model is poor, resulting in insufficient attention to critical facts. To solve these problems, we combine Large Language Models (LLMs) and Case-Based Reasoning (CBR) and propose a Case-Based Reasoning driven Retriever-generator model (CBR-Ren) to enhance the retriever-generator model's ability to retrieve and distinguish critical facts. In the retrieval stage, the model introduces a golden explanation via LLM prompting, which helps the retriever construct explicit templates for inferring critical facts and reduces the impact of non-critical facts on the generator. In the generation stage, the CBR-driven retrieval algorithm enhances the representation learning ability of the encoder and retrieves relevant knowledge from the decoder history. In addition, the model introduces fact weighting, which enhances the ability to locate critical facts and helps generate correct numerical expressions. Experimental results on the FinQA and Conv-FinQA datasets demonstrate the effectiveness of CBR-Ren, which outperforms all baselines.
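One plausible reading of the fact-weighting component is a learned relevance score that scales each retrieved fact's representation before generation; the PyTorch sketch below follows that reading, with the scoring head and dimensions entirely our assumptions.

import torch
import torch.nn as nn

class FactWeighting(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, fact_reprs):
        """fact_reprs: (B, N, dim) encoder representations of N retrieved facts."""
        w = torch.softmax(self.score(fact_reprs).squeeze(-1), dim=1)  # (B, N)
        # Up-weight facts the model deems critical, down-weight the rest,
        # before the generator produces the arithmetic expression.
        return fact_reprs * w.unsqueeze(-1)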
ISBN:
(Print) 9781450366007
In recent years, the encoder-decoder framework has been widely used in image captioning. During prediction, many methods feed the word generated at the previous time step back as the input at the current step, so an early error can cause the following generated words to get worse. This paper proposes to use the correct rate of the preceding words to constrain the loss weight of the following words, so that the loss weight of the following words increases as the preceding word error rate decreases; we call this Automatic Constraint Loss (ACL), and it reduces the discrepancy between the training and test phases. Experimental results on the MSCOCO dataset show that adding the proposed method to the original model greatly improves the BLEU-1 and BLEU-2 scores, and that the attention mechanism can more accurately select image regions.
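A minimal PyTorch sketch of this weighting idea follows: each word's cross-entropy is scaled by the running accuracy of the words before it. The specific weighting function (1 + prefix accuracy) is our assumption, not the paper's exact formula.

import torch
import torch.nn.functional as F

def automatic_constraint_loss(logits, targets, pad_id=0):
    """logits: (B, T, V) decoder outputs; targets: (B, T) caption tokens."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         ignore_index=pad_id, reduction="none")  # (B, T)
    correct = (logits.argmax(-1) == targets).float()
    # Accuracy of the preceding words at each step (defined as 1.0 at step 0).
    cum_correct = torch.cumsum(correct, dim=1) - correct
    steps = torch.arange(1, targets.size(1) + 1,
                         device=targets.device).float()
    prefix_acc = torch.where(steps > 1, cum_correct / (steps - 1),
                             torch.ones_like(cum_correct))
    # Higher prefix accuracy -> larger loss weight on the current word.
    weights = 1.0 + prefix_acc
    mask = (targets != pad_id).float()
    return (weights * ce * mask).sum() / mask.sum()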
ISBN:
(Print) 9783319736181; 9783319736174
Due to the difficulty of abstractive summarization, the great majority of past work on document summarization has been extractive, while the recent success of the sequence-to-sequence framework has made abstractive summarization viable: a set of recurrent neural network models based on the attention encoder-decoder has achieved promising performance on short-text summarization tasks. Unfortunately, these attention encoder-decoder models often suffer from the undesirable shortcomings of generating repeated words or phrases and of failing to deal with out-of-vocabulary words appropriately. To address these issues, in this work we propose to add an attention mechanism on the output sequence to avoid repetitive content and to use the subword method to deal with rare and unknown words. We applied our model to the public dataset provided by NLPCC 2017 Shared Task 3. The evaluation results show that our system achieved the best ROUGE performance among all participating teams and is also competitive with some state-of-the-art methods.
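The attention over the output sequence can be sketched as an intra-decoder attention that summarizes the already-emitted words, so the output layer can be discouraged from repeating them; the module name and bilinear scoring here are illustrative assumptions, not the system's actual implementation.

import torch
import torch.nn as nn

class OutputAttention(nn.Module):
    def __init__(self, dec_dim, attn_dim):
        super().__init__()
        self.q = nn.Linear(dec_dim, attn_dim)
        self.k = nn.Linear(dec_dim, attn_dim)

    def forward(self, dec_state, prev_states):
        """dec_state: (B, dec_dim) current decoder state;
        prev_states: (B, T, dec_dim) states at already-emitted words."""
        scores = torch.einsum("bd,btd->bt", self.q(dec_state), self.k(prev_states))
        alpha = torch.softmax(scores, dim=1)
        # A summary of what has already been said, which the output layer can
        # use to penalize generating the same content again.
        return (alpha.unsqueeze(-1) * prev_states).sum(dim=1)

The out-of-vocabulary side of the proposal needs no new module: subword segmentation (e.g. byte-pair encoding) decomposes rare words into frequent units, so the decoder's vocabulary can stay small while still covering unseen words.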