Image captioning means automatically generating a caption for an image. As a recently emerged research area, it is attracting more and more attention. To achieve the goal of image captioning, the semantic information of images needs to be captured and expressed in natural language. Connecting the research communities of computer vision and natural language processing, image captioning is a quite challenging task. Various approaches have been proposed to solve this problem. In this paper, we present a survey on advances in image captioning research. Based on the technique adopted, we classify image captioning approaches into different categories. Representative methods in each category are summarized, and their strengths and limitations are discussed. We first discuss the methods used in early work, which are mainly retrieval based and template based. Then, we focus our main attention on neural network based methods, which give state-of-the-art results. Neural network based methods are further divided into subcategories based on the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets, followed by a discussion of future research directions. (C) 2018 Elsevier B.V. All rights reserved.
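A minimal sketch of the neural encoder-decoder captioning pipeline that this survey concentrates on, assuming a PyTorch setup: the backbone (ResNet-18), vocabulary size, and hidden dimensions below are illustrative assumptions, not details from any surveyed paper.

# Minimal CNN encoder + RNN decoder captioning sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: ResNet-18 with the classifier head removed.
        backbone = models.resnet18(weights=None)  # load pretrained weights in practice
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)
        # RNN decoder: word embeddings + LSTM + output projection over the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image into a single feature vector.
        feats = self.encoder(images).flatten(1)          # (B, 512)
        img_token = self.img_proj(feats).unsqueeze(1)    # (B, 1, E)
        # Prepend the image feature as the first "token" of the caption sequence.
        words = self.embed(captions)                     # (B, T, E)
        inputs = torch.cat([img_token, words], dim=1)    # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                          # logits over the vocabulary

# Usage: logits = CaptionModel(vocab_size=10000)(images, caption_token_ids)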
Recently, the stereo matching task has been dramatically advanced by deep learning methods. Specifically, the encoder-decoder framework with skip connections achieves outstanding performance over others. The skip connection scheme brings detailed, or in other words residual, information to the final prediction and thus improves performance; it has been successfully applied in many other pixel-wise prediction tasks, such as semantic segmentation and depth estimation. In contrast to other tasks, the authors can explicitly obtain the residual information for stereo matching by back-warping the right image and calculating the reconstruction error. The reconstruction error has been successfully used as an unsupervised loss, but it has not been explored for skip connections. In this Letter, the authors show that the reconstruction error in the feature space is very helpful for bringing residual information to the final prediction. They validate the effectiveness of using the reconstruction error for skip connections by conducting experiments on the KITTI 2015 and Scene Flow datasets. Experiments show that the proposed scheme improves performance by a notable margin and achieves state-of-the-art results with very fast processing time.
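A generic PyTorch sketch of the back-warping and reconstruction-error idea described above (not the authors' code): the right image is warped to the left view with a predicted left-view disparity map, and the per-pixel residual can then feed a skip connection or a loss. Shapes and function names are assumptions for illustration.

# Sketch of back-warping the right image and computing the reconstruction error.
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """right: (B, C, H, W) right image or feature map;
    disparity: (B, 1, H, W) disparity predicted for the left view, in pixels."""
    b, _, h, w = right.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=right.device),
                            torch.arange(w, device=right.device),
                            indexing="ij")
    xs = xs.unsqueeze(0).float() - disparity.squeeze(1)   # shift columns by disparity
    ys = ys.unsqueeze(0).float().expand(b, -1, -1)
    # Normalize coordinates to [-1, 1] for grid_sample.
    grid_x = 2.0 * xs / (w - 1) - 1.0
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (B, H, W, 2)
    return F.grid_sample(right, grid, align_corners=True)

def reconstruction_error(left, right, disparity):
    # Per-pixel residual between the left view and the warped right view;
    # this tensor is the "residual information" mentioned in the abstract.
    return (left - warp_right_to_left(right, disparity)).abs()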
ISBN (Print): 9781450365628
With the increasing availability of medical images from different modalities (X-ray, CT, PET, MRI, ultrasound, etc.) and the huge advances in fast, accurate, and enhanced computing power offered by current graphics processing units, automatic caption generation from medical images has become a new way to improve healthcare and a key method for getting better results at lower cost. In this paper, we give a comprehensive overview of the task of image captioning in the medical domain, covering existing models, the benchmark medical image caption datasets, and the evaluation metrics that have been used to measure the quality of the generated captions.
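As a small illustration of the caption-quality metrics surveyed in such overviews, the following sketch scores a generated caption against a single reference with BLEU using NLTK; the example captions, reference, and weights are made up for illustration and do not come from any medical dataset.

# Illustrative BLEU scoring of a generated caption against one reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["chest", "x-ray", "shows", "no", "acute", "abnormality"]
candidate = ["chest", "x-ray", "shows", "no", "abnormality"]

# BLEU-1 to BLEU-4 with smoothing, as commonly reported for caption benchmarks.
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")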
ISBN (Print): 9783030012168; 9783030012151
Recently, much progress has been made in image captioning, and an encoder-decoder framework has been adopted by all the state-of-the-art models. Under this framework, an input image is encoded by a convolutional neural network (CNN) and then translated into natural language with a recurrent neural network (RNN). The existing models relying on this framework employ only one kind of CNN, e.g., ResNet or Inception-X, which describes the image contents from only one specific viewpoint. Thus, the semantic meaning of the input image cannot be comprehensively understood, which limits further improvement in performance. In this paper, to exploit the complementary information from multiple encoders, we propose a novel recurrent fusion network (RFNet) for the image captioning task. The fusion process in our model can exploit the interactions among the outputs of the image encoders and generate new compact and informative representations for the decoder. Experiments on the MSCOCO dataset demonstrate the effectiveness of the proposed RFNet, which sets a new state of the art for image captioning.
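A simplified sketch of the general idea of fusing features from two different CNN encoders into one representation for the caption decoder; this is a plain gated fusion with assumed feature sizes, not the actual recurrent fusion used in RFNet.

# Gated fusion of image features from two CNN encoders (illustrative only).
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, dim_a, dim_b, fused_dim=512):
        super().__init__()
        # Project both encoders' outputs into a common space.
        self.proj_a = nn.Linear(dim_a, fused_dim)
        self.proj_b = nn.Linear(dim_b, fused_dim)
        # Learn a gate deciding how much each view contributes.
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, feats_a, feats_b):
        a = torch.tanh(self.proj_a(feats_a))            # (B, fused_dim)
        b = torch.tanh(self.proj_b(feats_b))            # (B, fused_dim)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b                    # fused representation

# Usage with hypothetical feature sizes (e.g., ResNet 2048-d, Inception 2048-d):
# fused = FeatureFusion(2048, 2048)(resnet_feats, inception_feats)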
ISBN (Print): 9781538669877
This paper introduces an alternative approach to embedding emotional information at the encoder stage of sequence-to-sequence based emotional response generation. It explores different positions and styles of the embedding, which represent associations of emotion with specific words or with the whole sentence. The experiment was set up with a standard dataset as well as a dataset annotated with emotional classifiers. Preliminary results showed that this new approach better represents sentence-level emotion and works well with a standard recurrent neural network (RNN) with long short-term memory (LSTM) architecture.
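A hedged sketch of the two embedding styles mentioned above, attaching an emotion embedding either to every word or once to the whole sentence at the encoder; the number of emotion classes, dimensions, and the exact combination scheme are assumptions, not the paper's architecture.

# Encoder sketch: word-level vs. sentence-level emotion embedding (illustrative).
import torch
import torch.nn as nn

class EmotionalEncoder(nn.Module):
    def __init__(self, vocab_size, num_emotions=6, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.emotion_embed = nn.Embedding(num_emotions, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens, emotion_id, word_level=True):
        words = self.word_embed(tokens)                    # (B, T, E)
        emo = self.emotion_embed(emotion_id).unsqueeze(1)  # (B, 1, E)
        if word_level:
            # Word-level style: associate the emotion with every input word.
            inputs = words + emo
        else:
            # Sentence-level style: prepend the emotion once as an extra token.
            inputs = torch.cat([emo, words], dim=1)
        return self.lstm(inputs)                           # encoder states + final state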
ISBN (Print): 9783319736181; 9783319736174
Due to the difficulty of abstractive summarization, the great majority of past work on document summarization has been extractive. The recent success of the sequence-to-sequence framework has made abstractive summarization viable, and recurrent neural network models based on the attention encoder-decoder have achieved promising performance on short-text summarization tasks. Unfortunately, these attention encoder-decoder models often suffer from the undesirable shortcomings of generating repeated words or phrases and an inability to deal with out-of-vocabulary words appropriately. To address these issues, in this work we propose to add an attention mechanism over the output sequence to avoid repetitive content and to use the subword method to deal with rare and unknown words. We applied our model to the public dataset provided by the NLPCC 2017 shared task 3. The evaluation results show that our system achieved the best ROUGE performance among all the participating teams and is also competitive with some state-of-the-art methods.
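A generic sketch of one way to realize attention over the output sequence, by letting the decoder attend to its own previously generated states so that repeated content is discouraged; this is a standard intra-decoder attention formulation with assumed shapes, not necessarily the exact model in the paper.

# Attention over previously generated decoder states (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, current_state, past_states):
        """current_state: (B, H) decoder state at this step;
        past_states: (B, T_prev, H) states of already generated tokens."""
        q = self.query(current_state).unsqueeze(1)       # (B, 1, H)
        k = self.key(past_states)                        # (B, T_prev, H)
        scores = torch.bmm(q, k.transpose(1, 2))         # (B, 1, T_prev)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, past_states)        # (B, 1, H)
        # The decoder conditions on both the current state and this summary of
        # what it has already produced, which helps it avoid repeating itself.
        return torch.cat([current_state, context.squeeze(1)], dim=-1)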
Generating a natural language description of an image is a challenging but meaningful task. This task combines two significant artificial intelligence fields: computer vision and natural language processing. It is valuable for many applications, such as searching images and assisting visually impaired people to view the world. Most approaches adopt an encoder-decoder framework, and many later methods are improved on the basis of this framework. In these methods, image features are extracted by a VGG net or other networks, but the feature map loses important information during extraction. In this paper, we fuse different kinds of image features extracted by two networks, VGG19 and ResNet50, and feed them into the neural network for training. We also add an attention mechanism to a basic neural encoder-decoder model for generating natural sentence descriptions: at each time step, our model attends to the image features and picks out the most meaningful parts to generate words. We test our model on the benchmark dataset IAPR TC-12 and, comparing with other methods, validate that our model achieves state-of-the-art performance.
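A hedged sketch of the per-time-step visual attention described above: given the decoder's hidden state, weight the spatial positions of a (possibly fused) CNN feature map and return a context vector. The shapes, dimensions, and the additive scoring function are illustrative assumptions.

# Spatial attention over a CNN feature map at each decoding step (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feature_map, hidden_state):
        """feature_map: (B, N, feat_dim) spatial features, e.g. fused
        VGG19/ResNet50 maps flattened to N positions; hidden_state: (B, H)."""
        e = torch.tanh(self.feat_proj(feature_map) +
                       self.state_proj(hidden_state).unsqueeze(1))  # (B, N, A)
        alpha = F.softmax(self.score(e).squeeze(-1), dim=-1)        # (B, N)
        # Weighted sum over spatial positions gives the attended image context.
        return (alpha.unsqueeze(-1) * feature_map).sum(dim=1)       # (B, feat_dim)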