
ReAL: Improving Image-Text Retrieval with Authentic Negative Repository Learning

ACM Transactions on Multimedia Computing, Communications, and Applications

Authors: Renjie Pan, Hua Yang, Xiangyu Zhao
Affiliations: Institute of Image Communication and Network Engineering, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University, China; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, China

Current methods for image-text retrieval commonly propose various fusion modules to achieve robust visual-textual alignment, primarily relying on in-batch learning to guide the matching process. Some follow-up methods seek to enlarge the number of negative samples to boost image-text contrastive learning. However, these methods often face challenges posed by semantic-consistent negatives, i.e., negative samples that share correspondence with the ground truth, leading to confusion in learning cross-modal semantics. To address this issue, we propose a novel Retrieve with Authentic negative repository Learning (ReAL) method, which constructs a specific Authentic Negative Repository filled with valuable negative sample pairs. By introducing a Unique Negative Filter with a Discriminative Triplet Ranking Loss, ReAL effectively filters out the semantic-consistent negatives through similarity distribution analysis and threshold learning. Moreover, existing fusion paradigms suffer from intricate use of fine-grained representations from word- and region-level instances to progressively refine the fused embedding. In this paper, we propose a lightweight Cluster Refinement Module to exploit cross-modal semantics in a 1-way-1-out paradigm. Each visual-textual alignment can spontaneously uncover correlations with adjacent alignments through aggregation and re-allocation, without the need for a redundant and cost-inefficient refinement stage. Furthermore, ReAL employs dual momentum encoders with two memory banks, expanding the selection range of the Authentic Negative Repository to include a broader set of negatives. Extensive experiments conducted on Flickr30K, MS-COCO, and the augmented Flickr30K (with more hard negatives) demonstrate the superiority and robustness of ReAL, while also showcasing its significantly reduced inference time compared to other competitive baselines.
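The idea of filtering semantic-consistent negatives before applying a triplet ranking loss can be illustrated with a small sketch. This is not the authors' implementation: the function names, the fixed `threshold` (the abstract describes a learned threshold), the `margin` value, and the toy similarity scores are all assumptions made for illustration only.

```python
from typing import List


def filter_negatives(sim_row: List[float], pos_index: int, threshold: float) -> List[int]:
    """Return the indices of candidates kept as negatives for one query.

    Assumption: a candidate whose similarity to the query is at or above
    `threshold` is treated as likely semantic-consistent with the ground
    truth and is dropped. The threshold is fixed here, not learned.
    """
    return [j for j, s in enumerate(sim_row) if j != pos_index and s < threshold]


def triplet_ranking_loss(sim_row: List[float], pos_index: int,
                         negatives: List[int], margin: float = 0.2) -> float:
    """Hinge-based triplet ranking loss against the hardest kept negative."""
    if not negatives:
        return 0.0
    pos = sim_row[pos_index]
    hardest = max(sim_row[j] for j in negatives)
    return max(0.0, margin + hardest - pos)


# Toy similarities between one image query and four captions (hypothetical values).
# Index 0 is the ground-truth caption; index 1 is a near-duplicate caption.
sims = [0.80, 0.78, 0.35, 0.10]

kept = filter_negatives(sims, pos_index=0, threshold=0.7)
loss_filtered = triplet_ranking_loss(sims, 0, kept)
loss_unfiltered = triplet_ranking_loss(sims, 0, [1, 2, 3])

print(kept, loss_filtered, loss_unfiltered)
```

In this toy case the near-duplicate caption at index 1 is excluded, so the model is not penalized for ranking it close to the ground truth; without the filter, the same caption produces a positive loss.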

Keywords: Image-text Retrieval; Authentic Negative Repository; Cross-modal Fusion
