Contrastive Language-Image Pre-training (CLIP) models exhibit impressive zero-shot performance across various downstream cross-modal tasks by simply computing the dot product between image and text features. CLIP is pre-trained on large-scale image-text pairs using the InfoNCE loss, which maximizes the cosine similarity of positive image-text pairs while minimizing the similarity of negative pairs. However, an objective mismatch exists between the downstream usage and the pre-training phase, as the inference phase fails to exploit information from negative samples. Intuitively, since the CLIP model has been optimized with the InfoNCE loss, its downstream usage should be aligned with that objective. In this paper, we start by analyzing the InfoNCE loss and derive its upper bound. Our derivation reveals that the dot-product operation serves as a zero-order approximation of this upper bound, while a centralization operation represents a first-order approximation. To address the objective mismatch problem, we propose a novel method, Inference Calibration (IC), which leverages the first-order and second-order moments of the data distribution to calibrate features for zero-shot and few-shot scenarios. Experiments on various cross-modal tasks demonstrate the effectiveness of IC in both zero-shot and few-shot scenarios over the dot-product operation and other competing methods.
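To make the contrast between standard dot-product inference and moment-based calibration concrete, here is a minimal illustrative sketch. It is not the paper's exact Inference Calibration formulation (the abstract does not specify it); instead, it assumes a simple scheme in which features are centered with the first-order moment (mean) and rescaled with a second-order moment (per-dimension standard deviation) estimated from a reference set. Function names such as `calibrate_features` are hypothetical and used only for illustration.

```python
# Sketch: dot-product zero-shot scoring vs. a hypothetical moment-based calibration.
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Normalize rows to unit length, as in CLIP's cosine-similarity setup."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def zero_shot_scores(image_feats, text_feats):
    """Standard CLIP inference: cosine similarity via a plain dot product."""
    return l2_normalize(image_feats) @ l2_normalize(text_feats).T


def calibrate_features(feats, ref_feats):
    """Hypothetical calibration: subtract the mean (first-order moment) and
    rescale by the standard deviation (second-order moment) estimated from
    reference features, then re-normalize."""
    mu = ref_feats.mean(axis=0, keepdims=True)
    sigma = ref_feats.std(axis=0, keepdims=True) + 1e-8
    return l2_normalize((feats - mu) / sigma)


def calibrated_scores(image_feats, text_feats, ref_image_feats, ref_text_feats):
    """Similarity after calibrating both modalities with statistics drawn from
    reference features (e.g. unlabeled data in zero-shot or a few labeled
    examples in few-shot settings)."""
    img = calibrate_features(image_feats, ref_image_feats)
    txt = calibrate_features(text_feats, ref_text_feats)
    return img @ txt.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 512))       # 4 query image features
    txt = rng.normal(size=(10, 512))      # 10 class-prompt text features
    ref_img = rng.normal(size=(64, 512))  # reference image features
    ref_txt = rng.normal(size=(10, 512))  # reference text features
    print(zero_shot_scores(img, txt).shape)                     # (4, 10)
    print(calibrated_scores(img, txt, ref_img, ref_txt).shape)  # (4, 10)
```

In this reading, plain dot-product scoring corresponds to the zero-order approximation described above, while subtracting the reference mean corresponds to the first-order (centralization) correction; the variance rescaling stands in for a second-order term.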