Details
ISBN: (Print) 9789819786190; 9789819786206
Fine-grained vision-language retrieval aims to search for corresponding fine-grained images given a text query, or vice versa. The challenge lies in matching cross-modal data by learning an effective alignment. This paper proposes a simple yet effective efficiency-aware fine-grained vision-language retrieval method based on a global-contextual auto-encoder. First, global-contextual features are learned from the images and texts to improve the discriminability of intra-modality features. Then, to strengthen the semantic relevance across heterogeneous modalities, the method employs a semantic auto-encoder. Concretely, the encoder projects the visual features into the semantic space occupied by the textual features, while the decoder imposes an additional constraint that the original visual features be reconstructable from the projected representation. Notably, the auto-encoder is linear and symmetric, which makes it practical to scale to large datasets. Comprehensive experiments on two fine-grained tasks show that the proposed method surpasses several state-of-the-art baselines, validating its effectiveness and efficiency.
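To make the alignment idea concrete, below is a minimal sketch of a linear, symmetric semantic auto-encoder of the kind the abstract describes: an encoder W projects visual features into the textual semantic space, the decoder W^T reconstructs the visual features, and the reconstruction constraint yields a closed-form Sylvester-equation solution commonly used for such linear models. The function name fit_sae, the weight lambda_, and the toy dimensions are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a linear, symmetric semantic auto-encoder (assumed formulation):
#   min_W  ||X - W^T S||_F^2 + lambda_ * ||W X - S||_F^2
# where X holds visual features (d x N) and S holds textual semantic
# features (k x N). Setting the gradient to zero gives the Sylvester equation
#   (S S^T) W + W (lambda_ X X^T) = (1 + lambda_) S X^T.
import numpy as np
from scipy.linalg import solve_sylvester

def fit_sae(X, S, lambda_=0.2):
    """Return the encoder W (k x d) that maps visual features to the text space."""
    A = S @ S.T                      # (k, k) term from the reconstruction loss
    B = lambda_ * (X @ X.T)          # (d, d) term from the projection constraint
    C = (1.0 + lambda_) * (S @ X.T)  # (k, d) right-hand side
    return solve_sylvester(A, B, C)  # solves A W + W B = C

# Toy usage with random features (dimensions are hypothetical).
rng = np.random.default_rng(0)
d, k, N = 512, 300, 100
X = rng.standard_normal((d, N))      # visual features, one column per image
S = rng.standard_normal((k, N))      # textual semantic features, one column per text
W = fit_sae(X, S)
S_pred = W @ X                       # encoder: visual -> textual semantic space
X_rec = W.T @ S_pred                 # symmetric decoder: reconstruct visual features
```

Under this (assumed) formulation, training amounts to a single Sylvester solve whose size depends only on the feature dimensions, not on the number of samples beyond two Gram matrices, which is consistent with the scalability claim above.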