Text-based image captioning (TextCap) aims to remedy the shortcoming of existing image captioning tasks, which ignore text content when describing images. Instead, it requires models to recognize and describe images from both visual and textual content to achieve a deeper comprehension of the images. However, existing methods tend to rely on numerous complex network architectures to improve performance; these still fail to adequately model the relationship between vision and text, and at the same time lead to long running times, high memory consumption, and other unfavorable deployment problems. To solve these issues, we develop a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances high efficiency with high performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present a collaborative attention module for visual and text information, VTCAM, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments conducted on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://***/DengHY258/LCM-Captioner. (c) 2023 Elsevier Ltd. All rights reserved.
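The abstract does not give implementation details, but the two components can be pictured roughly as a dimensionality-reducing projection of the multimodal token features followed by a visual-text co-attention step. The PyTorch sketch below is a hypothetical illustration under those assumptions (module names, dimensions, and structure are guesses, not the authors' code):

```python
# Hypothetical sketch (not the authors' code): a feature-lightening projection
# followed by a visual-text collaborative attention step, in PyTorch.
import torch
import torch.nn as nn


class FeatureLightening(nn.Module):
    """Project high-dimensional multimodal features to a smaller width."""

    def __init__(self, in_dim=2048, light_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, light_dim), nn.LayerNorm(light_dim), nn.ReLU()
        )

    def forward(self, feats):          # feats: (batch, tokens, in_dim)
        return self.proj(feats)        # -> (batch, tokens, light_dim)


class CollaborativeAttention(nn.Module):
    """Let visual and text tokens attend to each other for semantic alignment."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis_aligned, _ = self.vis_to_txt(query=vis, key=txt, value=txt)
        txt_aligned, _ = self.txt_to_vis(query=txt, key=vis, value=vis)
        return vis + vis_aligned, txt + txt_aligned


if __name__ == "__main__":
    lighten = FeatureLightening()
    collab = CollaborativeAttention()
    vis = lighten(torch.randn(2, 36, 2048))   # e.g. 36 detected object features
    txt = lighten(torch.randn(2, 30, 2048))   # e.g. 30 OCR-token features
    vis_out, txt_out = collab(vis, txt)
    print(vis_out.shape, txt_out.shape)       # both (2, tokens, 512)
```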
Text-based image captioning has been posed as a novel problem since 2020. The task remains challenging because it requires the model to comprehend not only the visual context but also the scene texts that appear in an image. Therefore, the way images and scene texts are embedded into the main model for training is crucial. Based on the M4C-Captioner model, this paper proposes a simple but effective EAES embedding module for embedding images and scene texts into the multimodal Transformer layers. In detail, the EAES module contains two significant sub-modules: Objects-augmented and Grid features augmentation. With the Objects-augmented module, we provide relative geometry features representing the relations between objects and between OCR tokens. Furthermore, we extract grid features for an image with the Grid features augmentation module and combine them with the visual object features, which helps the model focus on both salient objects and the general context of the image, leading to better performance. We use the TextCaps dataset as the benchmark and evaluate our approach on five standard metrics: BLEU4, METEOR, ROUGE-L, SPICE and CIDEr. Without bells and whistles, our method achieves 20.21% on BLEU4 and 85.78% on CIDEr, 1.31% and 4.78% higher, respectively, than the baseline M4C-Captioner. Furthermore, the results are highly competitive with other methods on the METEOR, ROUGE-L and SPICE metrics. Source code is available at https://***/UIT-Together/EAES_m4c.
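The abstract does not specify the exact form of the relative geometry feature; a common formulation in the literature (which the Objects-augmented module plausibly resembles) encodes, for each pair of bounding boxes, log-scaled center offsets and size ratios. A hypothetical NumPy sketch of that common formulation:

```python
# Hypothetical sketch of a pairwise relative geometry feature between
# bounding boxes (a common formulation; not necessarily the paper's exact one).
import numpy as np


def relative_geometry(boxes):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns (N, N, 4) pairwise features."""
    x1, y1, x2, y2 = boxes.T
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0           # box centers
    w, h = (x2 - x1) + 1e-6, (y2 - y1) + 1e-6            # widths / heights

    # Pairwise log-scaled center offsets and size ratios.
    dx = np.log(np.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1e-6)
    dy = np.log(np.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1e-6)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)           # (N, N, 4)


if __name__ == "__main__":
    boxes = np.array([[10, 10, 50, 60], [40, 20, 120, 80], [5, 70, 30, 120]], float)
    print(relative_geometry(boxes).shape)                # (3, 3, 4)
```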
ISBN (Print): 9798400701085
Text-based image captioning is a vital but under-explored task that aims to automatically describe images with captions containing scene text. Recent studies have made encouraging progress, but they still suffer from two issues. First, current models cannot capture and generate scene text in non-Latin scripts, which severely limits the objectivity and information completeness of the generated captions. Second, current models tend to describe images in a monotonous, templated style, which greatly limits the diversity of the generated captions. Although these issues can be alleviated through carefully designed annotations, that process is undoubtedly laborious and time-consuming. To address them, we propose a Zero-shot Framework for text-based image captioning (Zero-TextCap). Concretely, to generate candidate sentences starting from the prompt 'image of' and iteratively refine them to improve the quality and diversity of captions, we introduce a Hybrid-sampling masked language model (H-MLM). To read multi-lingual scene text and model the relationships between the tokens, we introduce a robust OCR system. To ensure that the captions generated by the H-MLM contain scene text and are highly relevant to the image, we propose a CLIP-based generation guidance module to insert OCR tokens and filter candidate sentences. Zero-TextCap is capable of generating captions containing multi-lingual scene text and boosting caption diversity. Extensive experiments demonstrate the effectiveness of the proposed Zero-TextCap. Our code is available at https://***/VILAN-Lab/Zero_textCap.
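The abstract does not detail the guidance module; the filtering step can be pictured as ranking candidate captions by CLIP image-text similarity and keeping the best-matching ones. Below is a minimal sketch using the Hugging Face transformers CLIP API; the model name, the `keep` parameter, and the overall flow are illustrative assumptions, not the paper's settings:

```python
# Hypothetical sketch of CLIP-based candidate filtering: score each candidate
# caption against the image and keep the best-matching ones.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def filter_candidates(image: Image.Image, candidates: list[str], keep: int = 3):
    """Rank candidate captions by CLIP image-text similarity; return the top `keep`."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(0)   # (num_candidates,)
    order = logits.argsort(descending=True)[:keep]
    return [candidates[i] for i in order.tolist()]


# Usage (illustrative):
# img = Image.open("example.jpg")
# best = filter_candidates(img, ["image of a shop sign reading 'CAFE'",
#                                "image of a red bus on the street"])
```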
Text-based image captioning is an important task that aims to generate descriptions by reading and reasoning over the scene texts in images. A text-based image contains both textual and visual information, which is difficult to describe comprehensively. Recent works fail to adequately model the relationship between features of different modalities and their fine-grained alignment. Because of the multimodal nature of scene text, its representations usually come from separate visual and textual encoders, leading to heterogeneous features. Although many works have focused on fusing features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence within scene text has not been fully exploited. In this paper, we propose a Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and apply it to text-based image captioning. The HAM is designed to capture the coherence between the different modalities of OCR tokens and to provide context-aware scene text representations for generating accurate image captions. To the best of our knowledge, we are the first to apply a heterogeneous attention mechanism to explore the coherence of OCR tokens for text-based image captioning. By calculating heterogeneous similarity, we interactively enhance the alignment between the visual and textual information of OCR tokens. We conduct experiments on the TextCaps dataset. Under the same settings, the results show that our model achieves competitive performance compared with advanced methods, and an ablation study demonstrates that our framework improves the original model on all metrics.
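The abstract does not give the exact formulation; one plausible reading is a cross-modal attention in which the textual features of the OCR tokens attend to their visual features through a similarity matrix. A hypothetical PyTorch sketch under that assumption (dimensions and the single-head form are illustrative):

```python
# Hypothetical sketch of a heterogeneous (cross-modal) attention step between
# the visual and textual features of the same set of OCR tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousAttention(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.q_txt = nn.Linear(dim, dim)
        self.k_vis = nn.Linear(dim, dim)
        self.v_vis = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, ocr_txt, ocr_vis):
        """ocr_txt, ocr_vis: (batch, num_ocr, dim) textual / visual OCR features."""
        sim = self.q_txt(ocr_txt) @ self.k_vis(ocr_vis).transpose(1, 2) * self.scale
        attn = F.softmax(sim, dim=-1)                  # heterogeneous similarity
        return ocr_txt + attn @ self.v_vis(ocr_vis)    # context-aware OCR representation


if __name__ == "__main__":
    ham = HeterogeneousAttention()
    txt = torch.randn(2, 30, 768)   # e.g. word-embedding-style OCR features, projected
    vis = torch.randn(2, 30, 768)   # e.g. appearance features of OCR regions, projected
    print(ham(txt, vis).shape)      # torch.Size([2, 30, 768])
```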
ISBN (Print): 9783031159190; 9783031159183
The text-based image captioning (TextCaps) task aims to describe a given image reasonably based on scene text and visual objects simultaneously. Although previous works have shown great success, they pay too much attention to the text modality while ignoring other important visual information, and the correlations between objects and text are not fully exploited. Moreover, traditional transformer-based architectures ignore global information reflecting the entire image, which may cause missing-object and erroneous-reasoning problems. In this paper, we propose a Relation-aware Global-augmented Transformer (RGT) framework to tackle these problems. Specifically, we utilize a scene graph extracted from the image to explicitly model the relative semantic and spatial relationships between objects via a graph convolutional network, which not only enhances the visual representations but also encodes explicit semantic features of the objects. In addition, we add a multi-modal alignment (MMA) module as a supplement to the multi-modal transformer to strengthen the association between scene text and objects. Finally, a global-augmented transformer (GAT) is designed to obtain a more comprehensive representation of the image, which alleviates the missing-object and erroneous-reasoning problems. Our method outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 105.0 to 107.2.
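The abstract only names the components; the relation-aware encoding can be pictured as a graph-convolution pass over the scene graph's adjacency, mixing each object's feature with those of its related neighbours. A hypothetical PyTorch sketch (layer sizes, normalization, and the single-layer form are assumptions, not the RGT implementation):

```python
# Hypothetical sketch of a single graph-convolution layer over a scene graph,
# where edges connect objects that a scene-graph parser has related.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneGraphConv(nn.Module):
    def __init__(self, dim=2048, out_dim=768):
        super().__init__()
        self.lin = nn.Linear(dim, out_dim)

    def forward(self, obj_feats, adj):
        """obj_feats: (num_obj, dim); adj: (num_obj, num_obj) 0/1 relation matrix."""
        adj = adj + torch.eye(adj.size(0))                # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # degree normalization
        return F.relu(self.lin((adj / deg) @ obj_feats))  # relation-aware object features


if __name__ == "__main__":
    feats = torch.randn(5, 2048)                # 5 detected objects
    adj = torch.zeros(5, 5)
    adj[0, 1] = adj[1, 0] = 1.0                 # e.g. an edge for "man - riding - bike"
    gcn = SceneGraphConv()
    print(gcn(feats, adj).shape)                # torch.Size([5, 768])
```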