
Improving Visual Grounding with Multi-Modal Interaction and Auto-Regressive Vertex Generation

Authors: Qin, Xiaofei; Li, Fan; He, Changxiang; Wang, Lin; Zhang, Xuedian

Affiliations: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; College of Science, University of Shanghai for Science and Technology, Shanghai 200093, China; Shanghai Key Laboratory of Modern Optical System, Shanghai 200093, China; Key Laboratory of Biomedical Optical Technology and Devices of Ministry of Education, Shanghai 200093, China; Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 201210, China

Published in: SSRN

Year/Volume/Issue: 2024


Subject: Computer vision

Abstract: We propose a concise and consistent network focusing on multi-task learning of Referring Expression Comprehension (REC) and Segmentation (RES) within visual grounding (VG). To simplify the model architecture and achieve parameter sharing, we reformulate the visual grounding task as a floating-point coordinate generation problem conditioned on both image and text inputs. Consequently, rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as a sequence of coordinate tokens and output the two corner points of bounding boxes and the polygon vertices autoregressively. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder directly predicts precise floating-point coordinates, thus avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M2IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. Based on our model, visual grounding is realized through a unified network structure. Experiments conducted on three benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://***/LFUSST/MMI-VG. © 2024, The Authors. All rights reserved.
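
The regression-based decoder described in the abstract can be pictured with a small sketch. The following PyTorch code is a hypothetical illustration, not the authors' released implementation: the class name CoordinateRegressionHead, the hidden_dim and num_bins values, and the soft expectation over bin centers are all assumptions, included only to show how interpolating between discrete coordinate positions yields a continuous floating-point prediction and sidesteps the quantization error of a hard coordinate vocabulary.

# Hypothetical sketch (not the authors' code): a regression-style coordinate head
# that turns a distribution over discrete coordinate bins into a continuous
# floating-point prediction by interpolating between bin centers, instead of
# taking a hard argmax over a quantized coordinate vocabulary.
import torch
import torch.nn as nn


class CoordinateRegressionHead(nn.Module):
    """Maps one decoder hidden state to one continuous coordinate in [0, 1]."""

    def __init__(self, hidden_dim: int = 256, num_bins: int = 64):
        super().__init__()
        self.to_logits = nn.Linear(hidden_dim, num_bins)
        # Fixed, evenly spaced bin centers over the normalized image range.
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) -> coords: (batch,)
        probs = self.to_logits(hidden).softmax(dim=-1)
        # Expected value over bin centers: interpolates between neighbouring
        # bins, so the output is not snapped to a quantized vocabulary entry.
        return (probs * self.bin_centers).sum(dim=-1)


if __name__ == "__main__":
    head = CoordinateRegressionHead()
    step_hidden = torch.randn(2, 256)   # e.g. two autoregressive decoding steps
    print(head(step_hidden))            # two floating-point coordinates in [0, 1]

Under this reading, such a head would be invoked once per coordinate at each autoregressive step, covering both the two box corner points and the polygon vertices; how the paper actually parameterizes the decoder may differ from this sketch.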
