Author affiliations: Vietnam Natl Univ Ho Chi Minh City (VNUHCM), Univ Informat Technol, Ho Chi Minh City 7000, Vietnam; Vietnam Natl Univ Ho Chi Minh City (VNUHCM), Ho Chi Minh City 700000, Vietnam
Publication: IEEE Access
Year/Volume: 2022, Vol. 10
Pages: 32443-32452
Core indexing:
Funding: VNUHCM-University of Information Technology's Scientific Research Support Fund
Subjects: Optical character recognition software; Visualization; Feature extraction; Adaptation models; Transformers; Training; Semantics; Image captioning; text-based image captioning; bottom-up top-down; grid feature; multimodal transformer; m4c
Abstract: Text-based Image Captioning is a relatively new problem, studied since 2020. It remains challenging because the model must comprehend not only the visual context but also the scene text that appears in an image. The way images and scene text are embedded into the main model for training is therefore crucial. Building on the M4C-Captioner model, this paper proposes EAES, a simple but effective module for embedding images and scene text into the multimodal Transformer layers. The EAES module contains two sub-modules: Objects-augmented and Grid features augmentation. The Objects-augmented module provides a relative geometry feature that represents the relations between objects and between OCR tokens. The Grid features augmentation module extracts grid features for an image and combines them with the visual object features, which helps the model focus on both the salient objects and the general context of an image, leading to better performance. We use the TextCaps dataset as the benchmark to demonstrate the effectiveness of our approach on five standard metrics: BLEU4, METEOR, ROUGE-L, SPICE and CIDEr. Without bells and whistles, our method achieves 20.21% on the BLEU4 metric and 85.78% on the CIDEr metric, 1.31% and 4.78% higher, respectively, than the baseline M4C-Captioner method. The results are also competitive with other methods on the METEOR, ROUGE-L and SPICE metrics. Source code is available at https://***/UIT-Together/EAES_m4c.
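The abstract does not give the formula behind the relative geometry feature, so the following is only an illustrative sketch of one commonly used pairwise box-geometry encoding (log-scaled centre offsets and size ratios). It is an assumption and not necessarily what EAES computes. The function name `relative_geometry`, the `(x_min, y_min, x_max, y_max)` box layout, and the `eps` floor are all hypothetical choices made for this example.

```python
import math


def relative_geometry(boxes, eps=1e-3):
    """Pairwise geometry features for a list of boxes (illustrative only).

    boxes: list of (x_min, y_min, x_max, y_max) with positive width and height.
    Returns an N x N nested list; entry [i][j] is a 4-tuple describing
    box j relative to box i. This encoding is an assumption, not taken
    from the paper.
    """
    feats = []
    for (x1i, y1i, x2i, y2i) in boxes:
        cxi, cyi = (x1i + x2i) / 2, (y1i + y2i) / 2
        wi, hi = x2i - x1i, y2i - y1i
        row = []
        for (x1j, y1j, x2j, y2j) in boxes:
            cxj, cyj = (x1j + x2j) / 2, (y1j + y2j) / 2
            wj, hj = x2j - x1j, y2j - y1j
            row.append((
                # Centre offsets, normalised by box i's size; eps avoids log(0).
                math.log(max(abs(cxi - cxj) / wi, eps)),
                math.log(max(abs(cyi - cyj) / hi, eps)),
                # Size ratios of box j to box i.
                math.log(wj / wi),
                math.log(hj / hi),
            ))
        feats.append(row)
    return feats


# Example: one object box and one wider OCR-token box to its right.
example = relative_geometry([(0, 0, 2, 2), (2, 0, 6, 2)])
```

In a pipeline like the one the abstract describes, such pairwise features could be computed separately for object boxes and for OCR-token boxes before being embedded, but how EAES actually consumes them is not stated in this record.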