ISBN (print): 9798350353013; 9798350353006
Image and video analysis requires not only accurate object detection but also the understanding of relationships among detected objects. Common solutions to relation modeling typically resort to stand-alone object detectors followed by non-differentiable post-processing techniques. Recently introduced detection transformers (DETR) perform end-to-end object detection based on a bipartite matching loss. Such methods, however, lack the ability to jointly detect objects and resolve object associations. In this paper, we build on the DETR approach and extend it to the joint detection of objects and their relationships by introducing an approximated bipartite matching. While our method can generalize to an arbitrary number of objects, here we focus on the modeling of object pairs and their relations. In particular, we apply our method, PairDETR, to the problem of detecting human bodies and faces and associating them to the same person. Our approach not only eliminates the need for hand-designed post-processing but also achieves excellent results for body-face associations. We evaluate PairDETR on the challenging CrowdHuman and CityPersons datasets and demonstrate a large improvement over the state of the art. Our training code and pre-trained models are available at https://***/mts-ai/pairdetr
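To make the pair-level matching idea concrete, below is a minimal sketch of a bipartite matching cost computed over body-face box pairs, in the spirit of PairDETR; the function name, the plain L1 box cost, and the weight are illustrative assumptions rather than the authors' implementation.

import torch
from scipy.optimize import linear_sum_assignment

def match_pairs(pred_body, pred_face, gt_body, gt_face, w_l1=1.0):
    """Match predicted (body, face) box pairs to ground-truth pairs.

    pred_body, pred_face: [num_queries, 4] boxes predicted by one query each
    gt_body, gt_face:     [num_gt, 4] boxes of the same annotated persons
    Returns (pred_idx, gt_idx) index arrays from the Hungarian assignment.
    """
    # The cost of assigning a query to a ground-truth pair sums the per-box
    # L1 costs, so a query is preferred only when both of its boxes agree.
    cost_body = torch.cdist(pred_body, gt_body, p=1)   # [Q, G]
    cost_face = torch.cdist(pred_face, gt_face, p=1)   # [Q, G]
    cost = w_l1 * (cost_body + cost_face)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx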
ISBN (print): 9798350353006
Prompt learning has emerged as a potent and resource-efficient technique in large vision-language models (VLMs). However, its application in adapting VLMs to specialized domains like remote sensing and medical imaging, termed domain prompt learning, remains relatively unexplored. Although large-scale domain-specific foundation models offer a potential solution, their focus on a singular vision level presents challenges in prompting both vision and language modalities. To address this limitation, we propose leveraging domain-specific knowledge from these foundation models to transfer the robust recognition abilities of VLMs from generalized to specialized domains, employing quaternion networks. Our method entails utilizing domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within quaternion networks. Furthermore, we introduce a hierarchical approach that derives vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features. Through this mechanism, quaternion networks can effectively explore intermodal relationships in specific domains, facilitating domain-specific vision-language contrastive learning. Extensive experiments conducted on domain-specific datasets demonstrate that our proposed method achieves new state-of-the-art results in prompt learning. Code is available at https://***/caoql98/DPLQ.
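As an illustration of the quaternion-network idea, the sketch below uses a Hamilton product to let domain-specific vision features steer generalized language context embeddings toward a specialized space; the tensor shapes and the simple four-way channel split are assumptions for exposition, not the paper's code.

import torch

def hamilton_product(q, p):
    """Hamilton product of two quaternion-valued feature tensors [..., 4*d]."""
    a1, b1, c1, d1 = q.chunk(4, dim=-1)
    a2, b2, c2, d2 = p.chunk(4, dim=-1)
    return torch.cat([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ], dim=-1)

# Hypothetical usage: transform generalized context embeddings from the language
# branch, guided by vision features from a domain-specific foundation model.
ctx = torch.randn(16, 512)       # generalized prompt context (assumed dims)
vision = torch.randn(16, 512)    # domain-specific vision features (assumed dims)
specialized_ctx = hamilton_product(ctx, vision)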
ISBN (print): 9798350353006
Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled groundtruth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art image with pseudo-groundtruth enables efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes.
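A rough sketch of the clip-art style compositing step: an unoccluded object crop, obtained from off-the-shelf segmentation, is pasted onto a time-lapse background frame so the occlusion order is known by construction. The function and the paste-in-order rule are hypothetical, not the paper's pipeline.

import numpy as np

def composite(background, object_rgb, object_mask, x, y):
    """Paste object_rgb (h, w, 3) onto background at (x, y) using a binary mask.

    Pasting objects back-to-front means later pastes occlude earlier ones, so the
    resulting occlusion configuration is physically plausible and fully labeled.
    Assumes the crop fits inside the background.
    """
    out = background.copy()
    h, w = object_mask.shape
    region = out[y:y + h, x:x + w]
    m = object_mask.astype(bool)[..., None]            # (h, w, 1) for broadcasting
    out[y:y + h, x:x + w] = np.where(m, object_rgb, region)
    return out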
ISBN (print): 9798350353006
Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless, conventional prompt learning methods overfit their training distribution and lose the generalization ability on test distributions. To improve generalization across various distribution shifts, we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. Within this framework, the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships, we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass, avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts.
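The sketch below illustrates one way a transformer inference network could generate a tailored test prompt in a single feedforward pass from training prompts and a test image feature; the module sizes, token layout, and class name are assumptions for illustration only.

import torch
import torch.nn as nn

class PromptInferenceNet(nn.Module):
    def __init__(self, dim=512, n_ctx=4, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.slots = nn.Parameter(torch.randn(n_ctx, dim))   # test-prompt slots

    def forward(self, train_prompt, test_image_feat):
        # train_prompt: [n_train, dim], test_image_feat: [B, dim].
        # Concatenate learnable slots, the training prompt, and the test image
        # feature, then read the refined slots back out as the test prompt.
        B = test_image_feat.size(0)
        tokens = torch.cat([
            self.slots.unsqueeze(0).expand(B, -1, -1),
            train_prompt.unsqueeze(0).expand(B, -1, -1),
            test_image_feat.unsqueeze(1),
        ], dim=1)
        out = self.encoder(tokens)
        return out[:, : self.slots.size(0)]   # tailored test prompt, no extra training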
ISBN (print): 9798350353006
Diffusion models (DMs) have ushered in a new era of generative modeling and offer more opportunities for efficiently generating high-quality and realistic data samples. However, their widespread use has also brought forth new challenges in model security, which motivates the creation of more effective adversarial attacks on DMs to understand their vulnerabilities. We propose CAAT, a simple but generic and efficient approach that does not require costly training to effectively fool latent diffusion models (LDMs). The approach is based on the observation that cross-attention layers exhibit higher sensitivity to gradient change, allowing subtle perturbations on published images to significantly corrupt the generated images. We show that a subtle perturbation of an image can significantly impact the cross-attention layers, thus changing the mapping between text and image during the fine-tuning of customized diffusion models. Extensive experiments demonstrate that CAAT is compatible with diverse diffusion models and outperforms baseline attack methods in a more effective (more noise) and efficient (twice as fast as Anti-DreamBooth and Mist) manner.
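For intuition, here is a hedged sketch of a PGD-style perturbation loop that maximizes a surrogate loss derived from a latent diffusion model's cross-attention activations; loss_fn, the budget, and the step size are placeholders, and this is not the released CAAT code.

import torch

def perturb(image, loss_fn, eps=8 / 255, alpha=1 / 255, steps=40):
    """Return an adversarially perturbed copy of image (values in [0, 1]).

    loss_fn is assumed to run the image through the LDM and return a scalar that
    grows as the cross-attention maps are corrupted.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(image + delta)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                   # gradient ascent step
            delta.clamp_(-eps, eps)                              # L-infinity budget
            delta.copy_((image + delta).clamp(0, 1) - image)     # keep a valid image
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()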
ISBN (print): 9798350353006
Post-training quantization (PTQ) is an efficient model compression technique that quantizes a pretrained full-precision model using only a small calibration set of unlabeled samples without retraining. PTQ methods for convolutional neural networks (CNNs) provide quantization results comparable to full-precision counterparts. Directly applying them to vision transformers (ViTs), however, incurs severe performance degradation, mainly due to the differences in architectures between CNNs and ViTs. In particular, the distribution of activations for each channel varies drastically according to input instances, making PTQ methods for CNNs inappropriate for ViTs. To address this, we introduce instance-aware group quantization for ViTs (IGQ-ViT). Specifically, we propose to split the channels of activation maps into multiple groups dynamically for each input instance, such that activations within each group share similar statistical properties. We also extend our scheme to quantize softmax attentions across tokens. In addition, the number of groups for each layer is adjusted to minimize the discrepancies between predictions from quantized and full-precision models, under a bit-operation (BOP) constraint. We show extensive experimental results on image classification, object detection, and instance segmentation, with various transformer architectures, demonstrating the effectiveness of our approach.
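The following sketch conveys the core of instance-aware group quantization: for each input instance, channels with similar dynamic range are grouped so that every group shares one quantization scale. Grouping by sorted per-channel range and the symmetric quantizer are simplifying assumptions, not the exact IGQ-ViT procedure.

import torch

def group_quantize(x, num_groups=8, num_bits=4):
    """Quantize activations x of shape [tokens, channels] for one input instance."""
    qmax = 2 ** (num_bits - 1) - 1
    # Sort channels by their dynamic range and split them into groups, so that
    # channels sharing a scale have similar statistics for this instance.
    order = x.abs().amax(dim=0).argsort()
    x_q = torch.empty_like(x)
    for idx in order.chunk(num_groups):
        scale = x[:, idx].abs().amax().clamp_min(1e-8) / qmax
        x_q[:, idx] = (x[:, idx] / scale).round().clamp(-qmax, qmax) * scale
    return x_q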
ISBN (print): 9798350353006
There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into the more well-studied sub-problems of aligning image-to-text and text-to-audio through popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals and improving alignment between audio and video components. Project page: https://***.io/***/
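At a high level, the video-to-text-to-audio pipeline can be sketched as below; describe_events and text_to_audio stand in for the VLM and the text-to-audio diffusion model with the time-controlled adapter, and are placeholders rather than real library calls.

def generate_sound_effects(video_frames, describe_events, text_to_audio):
    # 1) Ask a vision-language model which sound-producing events occur and when,
    #    e.g. [{"t": 2.4, "text": "a door slams"}].
    events = describe_events(video_frames)
    # 2) Turn each suggested event into audio with a text-to-audio diffusion model,
    #    conditioning on the onset time so the adapter keeps audio and video in sync.
    return [text_to_audio(event["text"], onset=event["t"]) for event in events]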
ISBN (print): 9798350353006
In this paper, we tackle the problem of self-supervised video alignment and activity progress prediction using in-the-wild videos. Our proposed self-supervised representation learning method carefully addresses different action orderings, redundant actions, and background frames to generate improved video representations compared to previous methods. Our model generalizes temporal cycle-consistency learning to allow for more flexibility in determining cycle-consistent neighbors. More specifically, to handle repeated actions, we propose a multi-neighbor cycle consistency and a multi-cycle-back regression loss by finding multiple soft nearest neighbors using a Gaussian Mixture Model. To handle background and redundant frames, we introduce a context-dependent drop function in our framework, discouraging the alignment of droppable frames. Furthermore, to learn from videos of multiple activities jointly, we propose a multi-head cross-task network, allowing us to embed a video and estimate progress without knowing its activity label. Experiments on multiple datasets show that our method outperforms the state of the art for video alignment and progress prediction.
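For reference, the single-neighbor form of soft cycle-back regression that the multi-neighbor, GMM-based variant generalizes can be sketched as follows; the temperature and the Euclidean distance are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def cycle_back_regression(u, v, i, temperature=0.1):
    """u: [N, D] frame embeddings of video 1, v: [M, D] of video 2, i: query index."""
    # Soft nearest neighbor of frame i of video 1 inside video 2.
    alpha = F.softmax(-torch.cdist(u[i:i + 1], v) / temperature, dim=-1)   # [1, M]
    v_tilde = alpha @ v                                                    # [1, D]
    # Cycle back: the expected index where v_tilde lands in video 1 should be i.
    beta = F.softmax(-torch.cdist(v_tilde, u) / temperature, dim=-1)       # [1, N]
    frame_ids = torch.arange(u.size(0), dtype=u.dtype, device=u.device)
    mu = (beta * frame_ids).sum()
    return (mu - i) ** 2   # cycle-back regression loss for this query frame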
ISBN (print): 9798350353006
Inspired by the success of general-purpose models in NLP, recent studies attempt to unify different vision tasks in the same sequence format and employ autoregressive Transformers for sequence prediction. They apply uni-directional attention to capture sequential dependencies and generate task sequences recursively. However, such autoregressive Transformers may not fit vision tasks well, as vision task sequences usually lack the sequential dependencies typically observed in natural languages. In this work, we design Masked AutoDecoder (MAD), an effective multi-task vision generalist. MAD consists of two core designs. First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies comprehensively and decode vision task sequences in parallel. Second, we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences. In this way, MAD handles all the tasks by a single network branch and a simple cross-entropy loss with minimal task-specific designs. Extensive experiments demonstrate the great potential of MAD as a new paradigm for unifying various vision tasks. MAD achieves superior performance and inference efficiency compared to autoregressive counterparts while obtaining competitive accuracy with task-specific models. Code will be released at https://***/hanqiu-hq/MAD.
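A compact sketch of the masked sequence modeling objective: a fraction of task tokens is masked, all positions are decoded in parallel with bi-directional attention conditioned on image features, and cross-entropy is applied on the masked positions. The decoder and embed callables, the mask ratio, and the token vocabulary are placeholders, not the released MAD code.

import torch
import torch.nn.functional as F

def masked_decoding_loss(decoder, embed, task_tokens, image_feats,
                         mask_token_id, mask_ratio=0.5):
    """task_tokens: [B, L] discretized task sequence, image_feats: [B, N, D].

    decoder is assumed to map embedded tokens plus image features to per-token
    logits [B, L, vocab] using bi-directional (non-causal) attention.
    """
    mask = torch.rand_like(task_tokens, dtype=torch.float) < mask_ratio
    inputs = task_tokens.masked_fill(mask, mask_token_id)
    logits = decoder(embed(inputs), image_feats)
    # Reconstruct only the masked positions; all tasks share this single loss.
    return F.cross_entropy(logits[mask], task_tokens[mask])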
ISBN (print): 9798350353006
Significant progress in video question answering (VideoQA) has been made thanks to thriving large image-language pretraining frameworks. Although image-language models can efficiently represent both video and language branches, they typically employ goal-free vision perception and do not let vision interact well with language during answer generation, thus omitting crucial visual cues. In this paper, we take inspiration from the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation. VideoDistill generates answers only from question-related visual embeddings and follows a thinking-observing-answering approach that closely resembles human behavior, distinguishing it from previous research. Specifically, we develop a language-aware gating mechanism to replace the standard cross-attention, avoiding the direct fusion of language into visual representations. We incorporate this mechanism into two key components of the framework. The first component is a differentiable sparse sampling module, which selects frames containing the necessary dynamics and semantics relevant to the questions. The second component is a vision refinement module that merges existing spatial-temporal attention layers to ensure the extraction of multi-grained visual semantics associated with the questions. We conduct evaluations on various challenging video question answering benchmarks, and VideoDistill achieves state-of-the-art performance on both general and long-form VideoQA datasets. In addition, we verify that VideoDistill can effectively reduce the reliance on language shortcut solutions in the EgoTaskQA dataset.
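As an illustration of a language-aware gate that modulates, rather than fuses, visual features with the question, consider the sketch below; the dimensions and the single linear gate are assumptions, and this is not the authors' module.

import torch
import torch.nn as nn

class LanguageAwareGate(nn.Module):
    def __init__(self, q_dim=512, v_dim=768):
        super().__init__()
        self.to_gate = nn.Linear(q_dim, v_dim)

    def forward(self, visual_feats, question_emb):
        # visual_feats: [B, T, v_dim] frame features, question_emb: [B, q_dim].
        # The question only scales visual channels, so language is never written
        # directly into the visual representation as in standard cross-attention.
        gate = torch.sigmoid(self.to_gate(question_emb)).unsqueeze(1)   # [B, 1, v_dim]
        return gate * visual_feats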