Author affiliations: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; College of Science, University of Shanghai for Science and Technology, Shanghai 200093, China; Shanghai Key Laboratory of Modern Optical System, Shanghai 200093, China; Key Laboratory of Biomedical Optical Technology and Devices of Ministry of Education, Shanghai 200093, China; Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 201210, China
Publication: SSRN
Year/Volume/Issue: 2024
Core indexing:
Abstract: We propose a concise and consistent network focusing on multi-task learning of Referring Expression Comprehension (REC) and Segmentation (RES) within visual grounding (VG). To simplify the model architecture and achieve parameter sharing, we reformulate the visual grounding task as a floating-point coordinate generation problem conditioned on both image and text inputs. Consequently, rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as a sequence of coordinate tokens and autoregressively output the two corner points of bounding boxes and the vertices of segmentation polygons. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder directly predicts precise floating-point coordinates, thus avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M2IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. With our model, visual grounding is realized through a unified network structure. Experiments on three benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://***/LFUSST/MMI-VG. © 2024, The Authors. All rights reserved.
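To make the coordinate-generation idea concrete, the following is a minimal PyTorch sketch of one way a regression-based head could emit continuous coordinates without hard quantization: a softmax over coordinate bins is reduced to its expectation, so the output interpolates between neighbouring bins rather than snapping to a grid (loosely in the spirit of the bilinear-interpolation idea mentioned in the abstract). This is an illustrative assumption, not the paper's actual decoder; the class name SoftCoordinateHead and the parameters hidden_dim and num_bins are hypothetical.

    import torch
    import torch.nn as nn

    class SoftCoordinateHead(nn.Module):
        """Hypothetical regression head: predicts one normalized coordinate in [0, 1]
        as the expectation of a softmax over uniformly spaced bins (soft-argmax)."""

        def __init__(self, hidden_dim: int = 256, num_bins: int = 64):
            super().__init__()
            self.to_logits = nn.Linear(hidden_dim, num_bins)
            # Bin centres spread uniformly over the normalized [0, 1] coordinate range.
            self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

        def forward(self, token_state: torch.Tensor) -> torch.Tensor:
            # token_state: (batch, hidden_dim) decoder state for one coordinate token.
            probs = self.to_logits(token_state).softmax(dim=-1)   # (batch, num_bins)
            coord = (probs * self.bin_centers).sum(dim=-1)        # (batch,) continuous value
            return coord

    if __name__ == "__main__":
        head = SoftCoordinateHead()
        state = torch.randn(2, 256)
        print(head(state))  # two floating-point coordinates, no hard quantization

Under this reading, each autoregressive decoding step would produce one such continuous value (an x or y component), and the resulting sequence would form the box corner points or polygon vertices described in the abstract.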