Current methods for image-text retrieval commonly propose various fusion modules to achieve robust visual-textual alignment, primarily relying on in-batch learning to guide the matching process. Some follow-up methods...
详细信息
Current methods for image-text retrieval commonly propose various fusion modules to achieve robust visual-textual alignment, primarily relying on in-batch learning to guide the matching process. Some follow-up methods seek to enlarge the number of negative samples to boost image-text contrastive learning. However, these methods often face challenges posed by semantic-consistent negatives, i.e., negatives samples that share correspondence with the ground truth, leading to confusion in learning cross-modal semantics. To address this issue, we propose a novel Retrieve with Authentic negative repository Learning (ReAL) method, which constructs a specific Authentic Negative Repository filled with valuable negative sample pairs. By introducing a Unique Negative Filter with a Discriminative Triplet Ranking Loss, ReAL effectively filters out the semantic-consistent negatives through similarity distribution analysis and threshold learning. Moreover, existing fusion paradigms suffer from intricate use of fine-grained representations from word- and region-level instances to progressively refine the fused embedding. In this paper, we propose a lightweight Cluster Refinement Module to exploit cross-modal semantics in a 1-way-1-out paradigm. Each visual-textual alignment can spontaneously uncover correlations with adjacent alignments through aggregation and re-allocation, without the need for a redundant and cost-inefficient refinement stage. Furthermore, ReAL employs dual momentum encoders with two memory banks, expanding the selection range of the Authentic Negative Repository to include a broader set of negatives. Extensive experiments conducted on Flickr30K, MS-COCO, and the augmented Flickr30K (with more hard negatives) demonstrate the superiority and robustness of ReAL, while also showcasing its significantly reduced inference time compared to other competitive baselines.
暂无评论