ISBN (print): 9798350365474
Inspired by the remarkable progress achieved by recent Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) take LLMs as their brains and have achieved surprising results in many downstream tasks by training on large amounts of task-specific data. However, when faced with complex tasks that require the collaboration of multiple capabilities, existing MLLMs re-collect training data and retrain the model, overlooking the systematic utilization of LLMs and the capabilities already learned in downstream tasks. Inspired by the way humans tackle complex questions, in this paper we propose a novel framework called Task Navigator. In our framework, LLMs act as navigators that chart a viable path for solving complex tasks and guide MLLMs through the process step by step. Specifically, the LLM iteratively breaks the task down into sub-problems and refines them to be more reasonable and answerable; these are then resolved by the MLLM to obtain the relevant sub-answers, until the LLM has collected enough information to answer the initial question. Task Navigator provides an effective way to extend MLLMs to complex tasks, thus broadening their applicability. To evaluate the performance of the proposed framework, we have curated a carefully designed benchmark called VersaChallenge. Experiments on VersaChallenge demonstrate the effectiveness of our proposed method.
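To make the navigator loop concrete, here is a minimal Python sketch of the iterative decompose-answer-check cycle described above. The llm and mllm objects and all of their methods (propose_subquestion, refine, answer, has_enough_information, final_answer) are hypothetical placeholders, since the abstract specifies neither the prompts nor the stopping criterion.

def task_navigator(question, image, llm, mllm, max_steps=5):
    """The LLM charts a path of sub-problems; the MLLM answers each one."""
    history = []  # (sub_question, sub_answer) pairs collected so far
    for _ in range(max_steps):
        # 1. The LLM breaks off the next sub-problem given progress so far.
        sub_q = llm.propose_subquestion(question, history)
        # 2. The LLM refines it to be more reasonable and answerable.
        sub_q = llm.refine(sub_q)
        # 3. The MLLM grounds the refined sub-question in the image.
        sub_a = mllm.answer(image, sub_q)
        history.append((sub_q, sub_a))
        # 4. Stop once the LLM has collected enough information.
        if llm.has_enough_information(question, history):
            break
    return llm.final_answer(question, history)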
ISBN (print): 9798350301298
Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, even though this structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of a bag of regions, going beyond individual regions. The proposed method groups contextually interrelated regions into a bag. The embeddings of the regions in a bag are treated as the embeddings of words in a sentence and are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is trained to align with the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on the novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://***/wusize/ovdet.
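As a rough illustration of the alignment step, the sketch below treats projected region embeddings as the pseudo-words of a sentence, encodes them with the VLM's text encoder, and pulls the resulting bag embedding toward the frozen VLM's feature for the bag's image crop. The cosine objective and the text_encoder interface are assumptions for illustration, not necessarily the paper's exact loss.

import torch.nn.functional as F

def bag_of_regions_loss(region_embeds, text_encoder, frozen_vlm_feat):
    # region_embeds:   (num_regions, dim) region embeddings projected into
    #                  the text encoder's word-embedding space
    # frozen_vlm_feat: (dim,) frozen-VLM feature of the bag's image crop
    # Treat the bag of regions as a pseudo-sentence of word embeddings.
    bag_embed = text_encoder(region_embeds.unsqueeze(0)).squeeze(0)
    # Pull the bag embedding toward the frozen VLM feature (cosine distance).
    bag_embed = F.normalize(bag_embed, dim=-1)
    target = F.normalize(frozen_vlm_feat, dim=-1)
    return 1.0 - (bag_embed * target).sum()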
ISBN (print): 9798350301298
Complex relationships of high arity across the modality and context dimensions are a critical challenge in the Emotion Recognition in Conversation (ERC) task. Yet previous works tend to encode multimodal and contextual relationships in a loosely coupled manner, which may harm relationship modelling. Recently, Graph Neural Networks (GNNs), which show advantages in capturing data relations, have offered a new solution for ERC. However, existing GNN-based ERC models fail to address some general limitations of GNNs, including the assumption of pairwise relations and the erasure of high-frequency signals, which may be trivial for many applications but are crucial for the ERC task. In this paper, we propose a GNN-based model that explores multivariate relationships and captures the varying importance of emotion discrepancy and commonality by valuing multi-frequency signals. We empower GNNs to better capture the inherent relationships among utterances and to deliver more sufficient multimodal and contextual modelling. Experimental results show that our proposed method outperforms previous state-of-the-art works on two popular multimodal ERC datasets.
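One way to read "valuing multi-frequency signals" is as a mix of low-pass and high-pass graph filters, so the model can trade off emotion commonality (smoothing across neighbours) against emotion discrepancy (contrast between neighbours). The sketch below is such a mixer under that assumption; the gate alpha and the normalized adjacency adj are illustrative, not the paper's exact design. It runs on torch tensors (or NumPy arrays) as-is.

def multi_frequency_propagate(x, adj, alpha):
    # x:     (num_nodes, dim) utterance features
    # adj:   (num_nodes, num_nodes) normalized adjacency matrix A_hat
    # alpha: scalar (or per-node) gate in [0, 1]
    low = adj @ x        # low-pass: smooths neighbours -> emotion commonality
    high = x - adj @ x   # high-pass: neighbour contrast -> emotion discrepancy
    return alpha * low + (1 - alpha) * high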
ISBN (print): 9798350301298
We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are typically observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail because they recover incorrect 3D geometry and because the high cost of differentiable rendering precludes scaling them to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and introducing a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.
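A minimal sketch of the lightweight cross-attention-based renderer, assuming a target-ray query token attends over features gathered at sample points along its epipolar line; the dimensions and the upstream feature-sampling step are placeholders, not the paper's exact architecture.

import torch.nn as nn

class EpipolarCrossAttentionRenderer(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, ray_query, epipolar_feats):
        # ray_query:      (B, 1, dim) embedding of the target ray
        # epipolar_feats: (B, S, dim) features sampled at S points along the
        #                 ray's epipolar line (image-space, no 3D volume)
        out, _ = self.attn(ray_query, epipolar_feats, epipolar_feats)
        return self.to_rgb(out.squeeze(1))  # (B, 3) predicted colour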
ISBN (print): 9798350302493
Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-specific components. We show that our algorithm successfully generates high-quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals. Code is available at ***/arpitbansal297/UniversalGuided-Diffusion.
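A plausible reading of guidance without retraining, sketched below: run the frozen noise predictor, form the denoised estimate with the standard DDPM identity, score it with any differentiable guidance function (segmenter, face recogniser, detector, classifier), and nudge the predicted noise along the gradient. The gradient coefficient and the interfaces are illustrative assumptions.

import torch

def guided_noise(x_t, t, eps_model, alpha_bar_t, guidance_fn, scale):
    # Make x_t a leaf tensor so the guidance loss can be differentiated
    # with respect to it.
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    # Denoised estimate from x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps.
    x0_hat = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    loss = guidance_fn(x0_hat)  # e.g. -log p(target | x0_hat)
    grad, = torch.autograd.grad(loss, x_t)
    # Steer the predicted noise along the guidance gradient.
    return eps + scale * (1 - alpha_bar_t) ** 0.5 * grad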
ISBN (print): 9798350301298
Micro-expressions are spontaneous, rapid, and subtle facial movements that can neither be forged nor suppressed. They are very important nonverbal communication cues, but they are transient and of low intensity, and thus difficult to recognize. Recently, deep-learning-based methods have been developed for micro-expression (ME) recognition using feature extraction and fusion techniques; however, feature learning targeted at ME characteristics and efficient feature fusion still lack further study. To address these issues, we propose a novel framework, Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT). A convolutional Displacement Generation Module (DGM) trained with self-supervised learning extracts dynamic features from onset/apex frames, targeted at the subsequent ME recognition task, and a well-designed Transformer fusion mechanism, composed of three Transformer-based fusion modules (local and global fusion based on AU regions, plus full-face fusion), extracts multi-level informative features after the DGM for the final ME prediction. Extensive experiments with solid leave-one-subject-out (LOSO) evaluation demonstrate the superiority of our proposed FRL-DGT over state-of-the-art methods.
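The displacement idea can be sketched as follows: a small convolutional net predicts a dense (dx, dy) field from the onset/apex pair, and the self-supervision signal reconstructs the apex frame by warping the onset frame with that field. Layer sizes and the L1 reconstruction loss are illustrative assumptions, not the paper's exact DGM.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DisplacementGenerationModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),  # per-pixel (dx, dy)
        )

    def forward(self, onset, apex):  # each (B, 3, H, W)
        return self.net(torch.cat([onset, apex], dim=1))  # (B, 2, H, W)

def warp_loss(onset, apex, disp):
    # Displacements are taken in normalized [-1, 1] coordinates here.
    b, _, h, w = onset.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    grid = base + disp.permute(0, 2, 3, 1)
    recon = F.grid_sample(onset, grid, align_corners=True)
    return F.l1_loss(recon, apex)  # self-supervised reconstruction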
ISBN (print): 9798350365474
Retrieval-augmented generation (RAG) is used in natural language processing (NLP) to provide query-relevant information from enterprise documents to large language models (LLMs). Such enterprise context enables the LLMs to generate more informed and accurate responses. When enterprise data is primarily video, AI models such as vision-language models (VLMs) are necessary to convert the information in videos into text. While essential, this conversion is a bottleneck, especially for a large corpus of videos, and delays the timely use of enterprise videos to generate useful responses. We propose ViTA, a novel method that leverages two unique characteristics of VLMs to expedite the conversion process. First, as VLMs output more text tokens, they incur higher latency. Second, large (heavyweight) VLMs can extract intricate details from images and videos, but they incur much higher latency per output token compared to smaller (lightweight) VLMs, which may miss details. To expedite conversion, ViTA first employs a lightweight VLM to quickly understand the gist or overview of an image or a video clip, then directs a heavyweight VLM (through prompt engineering) to extract additional details using only a few (a preset number of) output tokens. Our experimental results show that ViTA expedites conversion time by as much as 43% without compromising response accuracy, compared to a baseline system that uses only a heavyweight VLM.
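The two-stage conversion can be sketched in a few lines. The light_vlm and heavy_vlm callables, the prompts, and the token cap are hypothetical stand-ins for whatever models and prompt engineering a deployment would use.

def vita_convert(clip, light_vlm, heavy_vlm, max_new_tokens=64):
    # Stage 1: a lightweight VLM quickly produces the gist of the clip.
    gist = light_vlm(clip, prompt="Briefly describe this video clip.")
    # Stage 2: the heavyweight VLM is steered by the gist and capped at a
    # preset number of output tokens, since latency grows with token count.
    prompt = ("The clip is roughly about: " + gist + "\n"
              "Add only important details not covered above.")
    details = heavy_vlm(clip, prompt=prompt, max_new_tokens=max_new_tokens)
    return gist + " " + details  # text handed to the RAG index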
ISBN (print): 9798350301298
Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework for spatiotemporal predictive learning in which a spatial encoder and decoder capture intra-frame features and a middle temporal module captures inter-frame correlations. While mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency due to their unparallelizable architectures. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame static attention and inter-frame dynamic attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization that takes inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks.
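The decomposition can be sketched as a large-receptive-field convolution within each frame (static attention) multiplied by a data-dependent channel gate (dynamic attention), which keeps everything parallelizable across frames. The kernel sizes and gating design below follow common large-kernel-attention practice and are assumptions, not the paper's exact configuration.

import torch.nn as nn

class TemporalAttentionUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Static: large intra-frame receptive field via depthwise convs.
        self.static = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.Conv2d(channels, channels, 7, padding=9, dilation=3,
                      groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        # Dynamic: data-dependent channel attention across frames.
        self.dynamic = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (B, C, H, W) stacked frame features
        return self.static(x) * self.dynamic(x)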
ISBN (print): 9798350301298
Most existing works on the Room-to-Room VLN problem utilize only RGB images and do not consider the local context around candidate views, and thus lack sufficient visual cues about the surrounding environment. Moreover, natural language contains complex semantic information, so its correlations with visual inputs are hard to model with cross attention alone. In this paper, we propose GeoVLN, which learns geometry-enhanced visual representations based on slot attention for robust Visual-and-Language Navigation. The RGB images are complemented with corresponding depth maps and normal maps predicted by Omnidata as visual inputs. Technically, we introduce a two-stage module that combines local slot attention and the CLIP model to produce geometry-enhanced representations from such inputs. We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information. Additionally, a novel multiway attention module is designed, encouraging different phrases of the input instruction to exploit the most related features from the visual input. Extensive experiments demonstrate the effectiveness of our newly designed modules and show the compelling performance of the proposed method.
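The multiway attention idea can be sketched as one cross-attention branch per visual source (RGB, depth, normals) plus a per-phrase softmax gate that picks the most related branch. Head counts and the gating form are illustrative assumptions, not the paper's exact module.

import torch
import torch.nn as nn

class MultiwayAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(3)])
        self.gate = nn.Linear(dim, 3)

    def forward(self, phrases, rgb, depth, normal):
        # phrases: (B, P, dim); rgb/depth/normal: (B, N, dim)
        outs = [attn(phrases, feats, feats)[0]
                for attn, feats in zip(self.branches, (rgb, depth, normal))]
        w = torch.softmax(self.gate(phrases), dim=-1)  # (B, P, 3) gates
        return sum(w[..., i:i + 1] * outs[i] for i in range(3))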
ISBN (print): 9798350301298
Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. However, achieving robust image matching remains challenging because CNN-extracted descriptors usually lack discriminative ability in texture-less regions, and keypoint detectors are only good at identifying keypoints at a specific level of structure. To deal with these issues, we propose a novel image matching method, Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers (D2Former), which includes a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. The proposed D2Former enjoys several merits. First, the CFDL module can model long-range contexts efficiently and effectively with the aid of the designed descriptor agents. Second, the HKDL module can generate keypoint detectors in a hierarchical way, which is helpful for detecting keypoints with diverse levels of structure. Extensive experimental results on four challenging benchmarks show that our proposed method significantly outperforms state-of-the-art image matching methods.
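A sketch of what "descriptor agents" could look like: a small set of learnable agent tokens summarises the dense feature map (agents attend to pixels), then every pixel reads the summary back (pixels attend to agents), giving long-range context at O(N x num_agents) rather than O(N^2) cost. The module below is an assumption-laden illustration, not the paper's exact CFDL design.

import torch
import torch.nn as nn

class DescriptorAgents(nn.Module):
    def __init__(self, dim, num_agents=16, heads=4):
        super().__init__()
        self.agents = nn.Parameter(torch.randn(num_agents, dim))
        self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spread = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):  # feats: (B, N, dim) flattened pixel features
        agents = self.agents.expand(feats.size(0), -1, -1)
        summary, _ = self.collect(agents, feats, feats)    # agents <- pixels
        context, _ = self.spread(feats, summary, summary)  # pixels <- agents
        return feats + context  # contextual descriptors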