Text-based image captioning (TextCap) aims to remedy the shortcoming of existing image captioning tasks, which ignore text content when describing images. Instead, it requires models to recognize and describe images from both visual and textual content to achieve a deeper comprehension of the images. However, existing methods tend to rely on numerous complex network architectures to improve performance; these still fail to adequately model the relationship between vision and text, and at the same time lead to long running times, high memory consumption, and other unfavorable deployment problems. To solve these issues, we develop a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances high efficiency with high performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present a collaborative attention module for visual and text information, VTCAM, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments conducted on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://***/DengHY258/LCM-Captioner. (c) 2023 Elsevier Ltd. All rights reserved.
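The abstract does not give implementation details, but the two components can be pictured roughly as a dimensionality-reducing projection of the multimodal token features followed by a visual-text co-attention step. The PyTorch sketch below is a hypothetical illustration under those assumptions (module names, dimensions, and structure are guesses, not the authors' code):

```python
# Hypothetical sketch (not the authors' code): a feature-lightening projection
# followed by a visual-text collaborative attention step, in PyTorch.
import torch
import torch.nn as nn


class FeatureLightening(nn.Module):
    """Project high-dimensional multimodal features to a smaller width."""

    def __init__(self, in_dim=2048, light_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, light_dim), nn.LayerNorm(light_dim), nn.ReLU()
        )

    def forward(self, feats):          # feats: (batch, tokens, in_dim)
        return self.proj(feats)        # -> (batch, tokens, light_dim)


class CollaborativeAttention(nn.Module):
    """Let visual and text tokens attend to each other for semantic alignment."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis_aligned, _ = self.vis_to_txt(query=vis, key=txt, value=txt)
        txt_aligned, _ = self.txt_to_vis(query=txt, key=vis, value=vis)
        return vis + vis_aligned, txt + txt_aligned


if __name__ == "__main__":
    lighten = FeatureLightening()
    collab = CollaborativeAttention()
    vis = lighten(torch.randn(2, 36, 2048))   # e.g. 36 detected object features
    txt = lighten(torch.randn(2, 30, 2048))   # e.g. 30 OCR-token features
    vis_out, txt_out = collab(vis, txt)
    print(vis_out.shape, txt_out.shape)       # both (2, tokens, 512)
```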
Text-based image captioning has been posed as a novel problem since 2020. The task remains challenging because it requires the model to comprehend not only the visual context but also the scene texts that appear in an image. Therefore, the way images and scene texts are embedded into the main model for training is crucial. Based on the M4C-Captioner model, this paper proposes a simple but effective EAES embedding module for embedding images and scene texts into the multimodal Transformer layers. In detail, the EAES module contains two significant sub-modules: Objects-augmented and Grid features augmentation. With the Objects-augmented module, we provide relative geometry features representing the relations between objects and between OCR tokens. Furthermore, we extract grid features for an image with the Grid features augmentation module and combine them with the visual object features, which helps the model focus on both salient objects and the general context of the image, leading to better performance. We use the TextCaps dataset as the benchmark and evaluate our approach on five standard metrics: BLEU4, METEOR, ROUGE-L, SPICE and CIDEr. Without bells and whistles, our method achieves 20.21% on BLEU4 and 85.78% on CIDEr, 1.31% and 4.78% higher, respectively, than the baseline M4C-Captioner. Furthermore, the results are highly competitive with other methods on the METEOR, ROUGE-L and SPICE metrics. Source code is available at https://***/UIT-Together/EAES_m4c.
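The abstract does not specify the exact form of the relative geometry feature; a common formulation in the literature (which the Objects-augmented module plausibly resembles) encodes, for each pair of bounding boxes, log-scaled center offsets and size ratios. A hypothetical NumPy sketch of that common formulation:

```python
# Hypothetical sketch of a pairwise relative geometry feature between
# bounding boxes (a common formulation; not necessarily the paper's exact one).
import numpy as np


def relative_geometry(boxes):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns (N, N, 4) pairwise features."""
    x1, y1, x2, y2 = boxes.T
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0           # box centers
    w, h = (x2 - x1) + 1e-6, (y2 - y1) + 1e-6            # widths / heights

    # Pairwise log-scaled center offsets and size ratios.
    dx = np.log(np.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-6)
    dy = np.log(np.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-6)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)           # (N, N, 4)


if __name__ == "__main__":
    boxes = np.array([[10, 10, 50, 60], [40, 20, 120, 80], [5, 70, 30, 120]], float)
    print(relative_geometry(boxes).shape)                # (3, 3, 4)
```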
ISBN (Print): 9798400701085
Text-based image captioning is a vital but under-explored task that aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. First, current models cannot capture and generate scene text in non-Latin scripts, which severely limits the objectivity and information completeness of the generated captions. Second, current models tend to describe images in a monotonous, templated style, which greatly limits the diversity of the generated captions. Although these issues can be alleviated through carefully designed annotations, that process is undoubtedly laborious and time-consuming. To address them, we propose a Zero-shot Framework for text-based image captioning (Zero-TextCap). Concretely, to generate candidate sentences starting from the prompt 'image of' and iteratively refine them to improve the quality and diversity of captions, we introduce a Hybrid-sampling masked language model (H-MLM). To read multi-lingual scene text and model the relationships between the tokens, we introduce a robust OCR system. To ensure that the captions generated by the H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module to insert OCR tokens and filter candidate sentences. Zero-TextCap is capable of generating captions containing multi-lingual scene text and boosting caption diversity. Extensive experiments demonstrate the effectiveness of the proposed Zero-TextCap. Our code is available at https://***/VILAN-Lab/Zero_textCap.
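The abstract does not detail the guidance module; the filtering step can be pictured as ranking candidate captions by CLIP image-text similarity and keeping the best-matching ones. Below is a minimal sketch using the Hugging Face transformers CLIP API; the model name, the `keep` parameter, and the overall flow are illustrative assumptions, not the paper's settings:

```python
# Hypothetical sketch of CLIP-based candidate filtering: score each candidate
# caption against the image and keep the best-matching ones.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def filter_candidates(image: Image.Image, candidates: list[str], keep: int = 3):
    """Rank candidate captions by CLIP image-text similarity; return the top `keep`."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(0)   # (num_candidates,)
    order = logits.argsort(descending=True)[:keep]
    return [candidates[i] for i in order.tolist()]


# Usage (illustrative):
# img = Image.open("example.jpg")
# best = filter_candidates(img, ["image of a shop sign reading 'CAFE'",
#                                "image of a red bus on the street"])
```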
Text-based image captioning is an important task that aims to generate descriptions by reading and reasoning over the scene texts in images. A text-based image contains both textual and visual information, which is difficult to describe comprehensively. Recent works fail to adequately model the relationship between features of different modalities and their fine-grained alignment. Because of the multimodal nature of scene text, its representations usually come from separate visual and textual encoders, leading to heterogeneous features. Although many works have focused on fusing features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence within scene text has not been fully exploited. In this paper, we propose a Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and apply it to text-based image captioning. The HAM is designed to capture the coherence between the different modalities of OCR tokens and to provide context-aware scene text representations for generating accurate image captions. To the best of our knowledge, we are the first to apply a heterogeneous attention mechanism to explore the coherence of OCR tokens for text-based image captioning. By calculating heterogeneous similarity, we interactively enhance the alignment between the visual and textual information of OCR tokens. We conduct experiments on the TextCaps dataset. Under the same settings, the results show that our model achieves competitive performance compared with advanced methods, and an ablation study demonstrates that our framework improves the original model on all metrics.
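The abstract does not give the exact formulation; one plausible reading is a cross-modal attention in which the textual features of the OCR tokens attend to their visual features through a similarity matrix. A hypothetical PyTorch sketch under that assumption (dimensions and the single-head form are illustrative):

```python
# Hypothetical sketch of a heterogeneous (cross-modal) attention step between
# the visual and textual features of the same set of OCR tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.q_txt = nn.Linear(dim, dim)
        self.k_vis = nn.Linear(dim, dim)
        self.v_vis = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, ocr_txt, ocr_vis):
        """ocr_txt, ocr_vis: (batch, num_ocr, dim) textual / visual OCR features."""
        sim = self.q_txt(ocr_txt) @ self.k_vis(ocr_vis).transpose(1, 2) * self.scale
        attn = F.softmax(sim, dim=-1)                  # heterogeneous similarity
        return ocr_txt + attn @ self.v_vis(ocr_vis)    # context-aware OCR representation


if __name__ == "__main__":
    ham = HeterogeneousAttention()
    txt = torch.randn(2, 30, 768)   # e.g. word-embedding-style OCR features, projected
    vis = torch.randn(2, 30, 768)   # e.g. appearance features of OCR regions, projected
    print(ham(txt, vis).shape)      # torch.Size([2, 30, 768])
```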
ISBN (Print): 9783031159190; 9783031159183
The text-based image captioning (TextCaps) task aims to describe a given image reasonably based on scene text and visual objects simultaneously. Although previous works have shown great success, they pay too much attention to the text modality while ignoring other important visual information, and the correlations between objects and text are not fully exploited. Moreover, traditional transformer-based architectures ignore global information reflecting the entire image, which may cause missing-object and erroneous-reasoning problems. In this paper, we propose a Relation-aware Global-augmented Transformer (RGT) framework to tackle these problems. Specifically, we utilize a scene graph extracted from the image to explicitly model the relative semantic and spatial relationships between objects via a graph convolutional network, which not only enhances the visual representations but also encodes explicit semantic features of the objects. In addition, we add a multi-modal alignment (MMA) module as a supplement to the multi-modal transformer to strengthen the association between scene text and objects. Finally, a global-augmented transformer (GAT) is designed to obtain a more comprehensive representation of the image, which alleviates the missing-object and erroneous-reasoning problems. Our method outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 105.0 to 107.2.
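The abstract only names the components; the relation-aware encoding can be pictured as a graph-convolution pass over the scene graph's adjacency, mixing each object's feature with those of its related neighbours. A hypothetical PyTorch sketch (layer sizes, normalization, and the single-layer form are assumptions, not the RGT implementation):

```python
# Hypothetical sketch of a single graph-convolution layer over a scene graph,
# where edges connect objects that a scene-graph parser has related.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneGraphConv(nn.Module):
    def __init__(self, dim=2048, out_dim=768):
        super().__init__()
        self.lin = nn.Linear(dim, out_dim)

    def forward(self, obj_feats, adj):
        """obj_feats: (num_obj, dim); adj: (num_obj, num_obj) 0/1 relation matrix."""
        adj = adj + torch.eye(adj.size(0))                # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # degree normalization
        return F.relu(self.lin((adj / deg) @ obj_feats))  # relation-aware object features


if __name__ == "__main__":
    feats = torch.randn(5, 2048)                # 5 detected objects
    adj = torch.zeros(5, 5)
    adj[0, 1] = adj[1, 0] = 1.0                 # e.g. an edge for "man - riding - bike"
    gcn = SceneGraphConv()
    print(gcn(feats, adj).shape)                # torch.Size([5, 768])
```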