Author affiliations: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; College of Science, University of Shanghai for Science and Technology, Shanghai 200093, China; Shanghai Key Laboratory of Modern Optical System, Shanghai 200093, China; Key Laboratory of Biomedical Optical Technology and Devices of Ministry of Education, Shanghai 200093, China; Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 201210, China
Publication: SSRN
Year/Volume/Issue: 2024
Core indexing:
Abstract: We propose a concise and consistent network focusing on multi-task learning of Referring Expression Comprehension (REC) and Segmentation (RES) within visual grounding (VG). To simplify the model architecture and achieve parameter sharing, we reformulate the visual grounding task as a floating-point coordinate generation problem conditioned on both image and text inputs. Consequently, rather than separately predicting bounding boxes and pixel-level segmentation masks, we represent both uniformly as a sequence of coordinate tokens and autoregressively output the two corner points of bounding boxes and the vertices of segmentation polygons. To improve the accuracy of point generation, we introduce a regression-based decoder. Inspired by bilinear interpolation, this decoder directly predicts precise floating-point coordinates, thus avoiding quantization errors. Additionally, we devise a Multi-Modal Interaction Fusion (M2IF) module to address the imbalance between visual and language features in the model. This module focuses visual information on regions relevant to the textual description while suppressing the influence of irrelevant areas. With our model, visual grounding is realized through a unified network structure. Experiments on three benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg) demonstrate that the proposed unified network outperforms or is on par with many existing task-customized models. Code is available at https://***/LFUSST/MMI-VG. © 2024, The Authors. All rights reserved.
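To make the coordinate-generation idea concrete, the following is a minimal PyTorch sketch of one way a regression-based head could emit continuous coordinates without hard quantization: a softmax over coordinate bins is reduced to its expectation, so the output interpolates between neighbouring bins rather than snapping to a grid (loosely in the spirit of the bilinear-interpolation idea mentioned in the abstract). This is an illustrative assumption, not the paper's actual decoder; the class name SoftCoordinateHead and the parameters hidden_dim and num_bins are hypothetical.

    import torch
    import torch.nn as nn

    class SoftCoordinateHead(nn.Module):
        """Hypothetical regression head: predicts one normalized coordinate in [0, 1]
        as the expectation of a softmax over uniformly spaced bins (soft-argmax)."""

        def __init__(self, hidden_dim: int = 256, num_bins: int = 64):
            super().__init__()
            self.to_logits = nn.Linear(hidden_dim, num_bins)
            # Bin centres spread uniformly over the normalized [0, 1] coordinate range.
            self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

        def forward(self, token_state: torch.Tensor) -> torch.Tensor:
            # token_state: (batch, hidden_dim) decoder state for one coordinate token.
            probs = self.to_logits(token_state).softmax(dim=-1)   # (batch, num_bins)
            coord = (probs * self.bin_centers).sum(dim=-1)        # (batch,) continuous value
            return coord

    if __name__ == "__main__":
        head = SoftCoordinateHead()
        state = torch.randn(2, 256)
        print(head(state))  # two floating-point coordinates, no hard quantization

Under this reading, each autoregressive decoding step would produce one such continuous value (an x or y component), and the resulting sequence would form the box corner points or polygon vertices described in the abstract.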