Cross-modal retrieval is a natural and highly valuable need in the current era of exploding multimedia content. This paper addresses unsupervised cross-modal hashing retrieval, which enables efficient retrieval across different modalities (e.g., image-text) without class labels. Most previous methods try to align visual and textual binary representations in a joint Hamming space by independently learning encoding functions for the respective modality domains. However, since paired training data describes the same object from different modalities, data from one modality plays a complementary role in learning the encoding function for the other modality, a point that has been less explored. This paper presents a novel cross-modal retrieval framework, called deep dual variational hashing (DDVH), which explores dual variational mappings between modalities to bridge the inherent modality gap. Specifically, DDVH consists of two sub-modules: visual variational mapping (VVM) and textual variational mapping (TVM). VVM generates semantics-preserving binary codes for visual samples via Gaussian latent embeddings, and TVM learns visually guided binary codes for the corresponding text data. The two sub-modules are jointly optimized under a cyclic consistency mechanism. This dual variational mapping strategy enables DDVH to generate unified binary representations for both modalities through visual-semantic interaction in the Hamming space. Comprehensive experiments on three benchmarks demonstrate that the proposed DDVH approach yields significant improvements over state-of-the-art methods.
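The abstract describes DDVH only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of what a dual variational mapping with a cyclic-consistency objective could look like. The module names VVM and TVM follow the abstract, but all dimensions, the tanh relaxation of sign(), the MSE form of the consistency term, and the loss weights are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a dual variational hashing setup (assumptions, not DDVH itself).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalMapping(nn.Module):
    """Maps modality features to a Gaussian latent embedding, then to
    relaxed binary codes in a joint Hamming space."""

    def __init__(self, in_dim: int, code_len: int, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, code_len)
        self.logvar = nn.Linear(hidden, code_len)

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample a Gaussian latent embedding.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # tanh as a differentiable surrogate for sign(); codes would be
        # binarized with sign() at retrieval time.
        return torch.tanh(z), mu, logvar


def kl_term(mu, logvar):
    # KL divergence to a standard normal prior, as in a VAE.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())


# VVM handles image features, TVM handles text features (toy dimensions).
vvm = VariationalMapping(in_dim=4096, code_len=64)   # visual variational mapping
tvm = VariationalMapping(in_dim=1386, code_len=64)   # textual variational mapping

img_feat = torch.randn(8, 4096)   # paired image features (toy batch)
txt_feat = torch.randn(8, 1386)   # paired text features

b_v, mu_v, lv_v = vvm(img_feat)
b_t, mu_t, lv_t = tvm(txt_feat)

# Cyclic-consistency-style objective: paired samples should share one
# unified code, so the two modalities' codes are pulled together.
consistency = F.mse_loss(b_v, b_t)
loss = consistency + 0.1 * (kl_term(mu_v, lv_v) + kl_term(mu_t, lv_t))
loss.backward()
```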
Tremendous progress has been made on the remote sensing image captioning (RSIC) task in recent years, yet some problems remain unresolved: (1) bridging the gap between visual features and semantic concepts, and (2) reasoning about higher-level relationships between semantic concepts. In this work, we focus on injecting high-level visual-semantic interaction into the RSIC model. First, the semantic concept extractor (SCE), which is end-to-end trainable, precisely captures the semantic concepts contained in the RSIs. In particular, the visual-semantic co-attention (VSCA) is designed to obtain coarse concept-related regions and region-related concepts for multi-modal interaction. Furthermore, we incorporate the two types of attentive vectors, together with semantic-level relational features, into a consensus exploitation (CE) block to learn cross-modal consensus-aware knowledge. Experiments on three benchmark datasets show the superiority of our approach compared with the reference methods.
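To make the co-attention step concrete, here is a minimal PyTorch sketch of a visual-semantic co-attention module that produces concept-related regions and region-related concepts, as the abstract describes. The bilinear affinity formulation, the projection dimensions, and the toy feature sizes are illustrative assumptions, not the paper's VSCA design.

```python
# Minimal sketch of visual-semantic co-attention (assumptions, not the paper's VSCA).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualSemanticCoAttention(nn.Module):
    def __init__(self, region_dim: int, concept_dim: int, shared_dim: int = 256):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, shared_dim)
        self.proj_c = nn.Linear(concept_dim, shared_dim)

    def forward(self, regions: torch.Tensor, concepts: torch.Tensor):
        # regions:  (B, N, region_dim)  grid/region features from the RSI encoder
        # concepts: (B, M, concept_dim) embeddings of detected semantic concepts
        v = self.proj_v(regions)                    # (B, N, D)
        c = self.proj_c(concepts)                   # (B, M, D)
        affinity = torch.bmm(v, c.transpose(1, 2))  # (B, N, M) region-concept affinity
        # Concept-related regions: each concept attends over regions.
        att_r = F.softmax(affinity, dim=1)
        concept_related_regions = torch.bmm(att_r.transpose(1, 2), regions)  # (B, M, region_dim)
        # Region-related concepts: each region attends over concepts.
        att_c = F.softmax(affinity, dim=2)
        region_related_concepts = torch.bmm(att_c, concepts)                 # (B, N, concept_dim)
        return concept_related_regions, region_related_concepts


# Toy usage with assumed dimensions.
vsca = VisualSemanticCoAttention(region_dim=2048, concept_dim=300)
regions = torch.randn(2, 49, 2048)   # e.g. 7x7 CNN grid features
concepts = torch.randn(2, 10, 300)   # e.g. embeddings of the top-10 concepts
crr, rrc = vsca(regions, concepts)
print(crr.shape, rrc.shape)          # torch.Size([2, 10, 2048]) torch.Size([2, 49, 300])
```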