ISBN (Print): 9798350353006
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
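As a rough illustration of the training recipe described in this abstract, the sketch below (PyTorch, with an assumed discrete visual tokenizer and an illustrative decoder-only transformer, not the authors' code) trains on a "visual sentence" by minimizing cross-entropy for next-token prediction:

```python
# Minimal sketch: a "visual sentence" is a sequence of discrete visual tokens
# (from an assumed VQ-style tokenizer; random ids stand in here), and the model
# is trained with a causal mask to predict the next token under cross-entropy.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 8192, 512, 256

class TinyLVM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each position attends only to earlier visual tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(self.embed(tokens), mask=mask))

model = TinyLVM()
sentence = torch.randint(0, vocab_size, (2, seq_len))   # stand-in visual token ids
logits = model(sentence[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   sentence[:, 1:].reshape(-1))
loss.backward()
```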
ISBN (Print): 9798350353006
The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To ensure that our model utilizes the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression transformer (C-Former), reducing the encoded sequence length and thereby allowing a trade-off between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
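A hedged sketch of the multi-page idea described above, written in PyTorch with illustrative module names and sizes (not GRAM's actual implementation): each page is encoded independently together with a learnable document token, and a document-level layer then lets those tokens exchange information across pages.

```python
# Illustrative only: single-page encoding per page, plus learnable document tokens
# that carry information across pages through a document-level layer.
import torch
import torch.nn as nn

d, pages, toks_per_page = 256, 4, 64

page_encoder = nn.TransformerEncoder(   # stand-in for a pre-trained single-page encoder
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
doc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)  # document-level layer
doc_tokens = nn.Parameter(torch.randn(pages, 1, d))                   # one learnable token per page

page_feats = torch.randn(pages, toks_per_page, d)                 # per-page OCR/layout features
local = page_encoder(torch.cat([doc_tokens, page_feats], dim=1))  # page-level understanding
doc_seq = local[:, 0].unsqueeze(0)                                # gather document tokens: (1, pages, d)
global_ctx = doc_layer(doc_seq)                                   # cross-page information flow
print(global_ctx.shape)                                           # torch.Size([1, 4, 256])
```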
ISBN (Print): 9798350353006
Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://***/TL-UESTC/UniMoS.
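The sketch below is a rough, assumed rendering of the modality-separation idea (dimensions, heads, and the discriminator are illustrative; the actual design is in the UniMoS repository linked above): two lightweight heads split a frozen CLIP visual feature into language-associated and vision-associated components, with a small discriminator available for cross-domain alignment.

```python
# Illustrative modality separation of a CLIP visual feature (not the UniMoS code).
import torch
import torch.nn as nn

d = 512                                             # assumed CLIP embedding size
lang_head = nn.Linear(d, d)                         # language-associated component
vis_head = nn.Linear(d, d)                          # vision-associated component
domain_disc = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))

clip_feat = torch.randn(8, d)                       # features from a frozen CLIP image encoder
f_lang, f_vis = lang_head(clip_feat), vis_head(clip_feat)

text_protos = torch.randn(10, d)                    # CLIP text embeddings of class prompts
logits_lang = f_lang @ text_protos.t()              # language branch: text-prototype logits
domain_logit = domain_disc(f_vis)                   # vision branch: adversarial domain signal
```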
ISBN (Print): 9781538661000
The proceedings contain 307 papers. The topics discussed include: deep features for recognizing disguised faces in the wild; unconstrained fingerphoto database; hybrid user-independent and user-dependent offline signature verification with a two-channel CNN; it takes two to tango: cascading off-the-shelf face detectors; time analysis of pulse-based face anti-spoofing in visible and NIR; a deep face identification network enhanced by facial attributes prediction; gait recognition by deformable registration; and fusion of handcrafted and deep learning features for large-scale multiple iris presentation attack detection.
ISBN (Print): 9798350353006
Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields nearly a 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: https://***/SegoleneMartin/transductive-CLIP.
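For intuition only, the toy loop below mimics the kind of transductive assignment described above: softmaxed vision-text similarities live on the simplex, each class is modeled with a Dirichlet distribution, and soft assignments are refreshed from the class log-likelihoods. The parameter update here is a simplified moment-style heuristic, not the authors' block Majorization-Minimization algorithm.

```python
# Toy EM-like loop over simplex-valued features; not the paper's MM algorithm.
import torch
from torch.distributions import Dirichlet

K, B = 10, 75                                           # classes, query batch size
probs = torch.softmax(torch.randn(B, K) * 3, dim=-1)    # vision-text probability features
alphas = torch.ones(K, K) + 5 * torch.eye(K)            # one Dirichlet concentration per class

for _ in range(10):
    # E-like step: responsibilities from per-class Dirichlet log-densities.
    loglik = torch.stack([Dirichlet(a).log_prob(probs) for a in alphas], dim=1)
    assign = torch.softmax(loglik, dim=-1)
    # M-like step (simplified): pull each class's concentration toward its soft mean.
    means = (assign.t() @ probs) / assign.sum(0, keepdim=True).t().clamp_min(1e-6)
    alphas = 1.0 + 20.0 * means

pred = assign.argmax(dim=-1)                            # transductive class assignments
```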
ISBN (Print): 9798350353006
Given the power of vision transformers, a new learning paradigm, pretraining and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., converting a benign model into a backdoored one. Once under the backdoor mode, a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud APIs, since the malicious behavior cannot be activated or detected under the benign mode, making the attack very stealthy. To attack a pretrained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. They are optimized with a clean loss, which encourages the model to behave normally even when the trigger is present, and a backdoor loss, which ensures the backdoor can be activated by the trigger when the switch is on. Besides, we utilize cross-mode feature distillation to reduce the effect of the switch token on clean samples. Experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving a 95%+ attack success rate while remaining hard to detect and remove. Our code is available at https://***/20000yshust/SWARM.
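The toy code below only illustrates the two objectives named in the abstract, the clean loss and the backdoor loss, using an invented prompted classifier and an additive trigger; it is not SWARM's implementation, and the cross-mode feature distillation term is omitted.

```python
# Illustration of the clean/backdoor objectives with a stand-in prompted classifier.
import torch
import torch.nn.functional as F

class ToyPromptedModel(torch.nn.Module):
    """Flattens the image, appends mean-pooled prompt tokens, then classifies."""
    def __init__(self, num_classes=10, img_dim=3 * 32 * 32, p_dim=16):
        super().__init__()
        self.fc = torch.nn.Linear(img_dim + p_dim, num_classes)
    def forward(self, x, prompts):
        p = prompts.mean(dim=0).expand(x.size(0), -1)
        return self.fc(torch.cat([x.flatten(1), p], dim=1))

model = ToyPromptedModel()
prompts = torch.randn(4, 16, requires_grad=True)     # learnable clean prompt tokens
switch_tok = torch.randn(1, 16, requires_grad=True)  # the extra "switch" token
trigger = 0.1 * torch.randn(1, 3, 32, 32, requires_grad=True)
x, y, target_cls = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)), 0

x_trig = x + trigger
# Clean loss: without the switch token, behave normally even when the trigger is present.
clean = F.cross_entropy(model(x, prompts), y) + F.cross_entropy(model(x_trig, prompts), y)
# Backdoor loss: with the switch token appended, the trigger forces the target class.
backdoor = F.cross_entropy(model(x_trig, torch.cat([prompts, switch_tok])),
                           torch.full_like(y, target_cls))
(clean + backdoor).backward()
```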
ISBN (Print): 9798350365474
Neonatal resuscitations demand an exceptional level of attentiveness from providers, who must process multiple streams of information simultaneously. Gaze strongly influences decision making; thus, understanding where a provider is looking during neonatal resuscitations could inform provider training, enhance real-time decision support, and improve the design of delivery rooms and neonatal intensive care units (NICUs). Current approaches to quantifying neonatal providers' gaze rely on manual coding or simulations, which limit scalability and utility. Here, we introduce an automated, real-time, deep learning approach capable of decoding provider gaze into semantic classes directly from first-person point-of-view videos recorded during live resuscitations. Combining state-of-the-art, real-time segmentation with vision-language models, our low-shot pipeline attains 91% classification accuracy in identifying gaze targets without training. Upon fine-tuning, the performance of our gaze-guided vision transformer exceeds 98% accuracy in semantic gaze analysis, approaching human-level precision. This system, capable of real-time inference, enables objective quantification of provider attention dynamics during live neonatal resuscitation. Our approach offers a scalable solution that seamlessly integrates with existing infrastructure for data-scarce gaze analysis, thereby offering new opportunities for understanding and refining clinical decision making.
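For a sense of how such a pipeline can be wired together, the toy example below uses stand-in masks and embeddings in place of a real-time segmenter and vision-language model (the names, shapes, and classes are assumptions, not the authors' pipeline): find the segment containing the gaze point, then score its crop against text embeddings of the semantic classes.

```python
# Stand-in pipeline: gaze point -> containing segment -> zero-shot class by cosine similarity.
import torch
import torch.nn.functional as F

classes = ["infant", "monitor", "equipment", "caregiver"]   # assumed gaze-target classes
H, W = 480, 640
masks = torch.zeros(3, H, W, dtype=torch.bool)              # segments from a segmenter (stand-in)
masks[0, 100:300, 200:400] = True
masks[1, 0:100, :] = True
masks[2, 350:480, 500:640] = True
gaze_xy = (250, 180)                                        # eye-tracker gaze point (x, y)

# 1) Which segment is the provider looking at?
hits = [i for i, m in enumerate(masks) if m[gaze_xy[1], gaze_xy[0]]]
seg_id = hits[0] if hits else None

# 2) Zero-shot classification of that segment's crop with (stand-in) embeddings.
img_emb = F.normalize(torch.randn(1, 512), dim=-1)          # crop embedding from a VLM
txt_emb = F.normalize(torch.randn(len(classes), 512), dim=-1)
pred = classes[(img_emb @ txt_emb.t()).argmax().item()]
print(seg_id, pred)
```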
ISBN (Print): 9798350353013; 9798350353006
CLIP has demonstrated marked progress in visual recognition due to its powerful pre-training on large-scale image-text pairs. However, a critical challenge still remains: how to transfer image-level knowledge into pixel-level understanding tasks such as semantic segmentation. In this paper, to solve this challenge, we analyze the gap between the capability of the CLIP model and the requirements of the zero-shot semantic segmentation task. Based on our analysis and observations, we propose a novel method for zero-shot semantic segmentation, dubbed CLIP-RC (CLIP with Regional Clues), which brings two main insights. On the one hand, a region-level bridge is necessary to provide fine-grained semantics. On the other hand, overfitting should be mitigated during the training stage. Benefiting from the above discoveries, CLIP-RC achieves state-of-the-art performance on various zero-shot semantic segmentation benchmarks, including PASCAL VOC, PASCAL Context, and COCO-Stuff 164K. Code will be available at https://***/Jittor/JSeg.
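Purely as an illustration of what a region-level bridge can look like (shapes and encoders are assumptions, not CLIP-RC's architecture), dense patch features from a ViT-style image encoder can be scored against class text embeddings and upsampled into a pixel-level prediction:

```python
# Region-level semantics: per-patch features scored against class text embeddings,
# then upsampled to pixel resolution. Illustrative shapes, not the paper's model.
import torch
import torch.nn.functional as F

num_classes, d = 21, 512                                      # e.g. PASCAL VOC classes
patch_feats = F.normalize(torch.randn(1, 14 * 14, d), dim=-1) # dense region features
text_embs = F.normalize(torch.randn(num_classes, d), dim=-1)  # class prompt embeddings

region_logits = patch_feats @ text_embs.t()                   # (1, 196, num_classes)
seg_map = F.interpolate(region_logits.view(1, 14, 14, num_classes).permute(0, 3, 1, 2),
                        size=(224, 224), mode="bilinear", align_corners=False)
pixel_pred = seg_map.argmax(dim=1)                            # (1, 224, 224) pixel-level labels
```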
ISBN (Print): 9798350353006
Vision-language (VL) models have achieved unprecedented success recently, in which the connection module is the key to bridging the modality gap. Nevertheless, the abundant visual clues are not sufficiently exploited in most existing methods. On the vision side, most existing approaches only use the last feature of the vision tower, without using the low-level features. On the language side, most existing methods only introduce shallow vision-language interactions. In this paper, we present a vision-inspired vision-language connection module, dubbed VIVL, which efficiently exploits the vision cue for VL models. To take advantage of the lower-level information from the vision tower, a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers, which enriches the visual cue with negligible parameter and computation overhead. To enhance VL interactions, we propose deep vision-conditioned prompts (DVCP) that allow deep interactions of vision and language features efficiently. Our VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task, which greatly improves data efficiency. When used as a plug-in module, VIVL consistently improves the performance for various backbones and VL frameworks, delivering new state-of-the-art results on multiple benchmarks, e.g., NoCaps and VQAv2.
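The schematic below sketches the two components named in the abstract in PyTorch, with all module names, sizes, and the attention-based prompt generator being assumptions rather than the VIVL implementation: a feature-pyramid-style extractor that mixes intermediate vision-tower layers, and vision-conditioned prompt tokens intended to be inserted into the language side.

```python
# Illustrative FPE-like fusion of intermediate layers and vision-conditioned prompt tokens.
import torch
import torch.nn as nn

d_v, d_l, n_prompt = 768, 512, 8

class FeaturePyramidExtractor(nn.Module):
    """Combine features from several intermediate vision-tower layers."""
    def __init__(self, num_levels=3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d_v, d_l) for _ in range(num_levels))
    def forward(self, level_feats):                  # list of (B, N, d_v) tensors
        return torch.stack([p(f) for p, f in zip(self.proj, level_feats)]).mean(0)

class VisionConditionedPrompts(nn.Module):
    """Produce prompt tokens for a language layer, conditioned on visual features."""
    def __init__(self):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, n_prompt, d_l))
        self.attn = nn.MultiheadAttention(d_l, num_heads=8, batch_first=True)
    def forward(self, vis):                          # vis: (B, N, d_l)
        q = self.query.expand(vis.size(0), -1, -1)
        prompts, _ = self.attn(q, vis, vis)
        return prompts                               # (B, n_prompt, d_l)

levels = [torch.randn(2, 197, d_v) for _ in range(3)]   # intermediate ViT layer outputs
fused = FeaturePyramidExtractor()(levels)
prompt_tokens = VisionConditionedPrompts()(fused)        # to be inserted into language layers
```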
ISBN (Print): 9798350353006
Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to a large gap between vision and text, they might not be able to sufficiently utilize the benefits of cross-modality information. In the field of human action recognition, the additional pose modality may bridge the gap between vision and text to improve the effectiveness of cross-modality learning. In this paper, we propose a novel framework, called the Pose-enhanced Vision-Language (PeVL) model, to adapt the VL model with the pose modality to learn effective knowledge of fine-grained human actions. Our PeVL model includes two novel components: an Unsymmetrical Cross-Modality Refinement (UCMR) block and a Semantic-Guided Multi-level Contrastive (SGMC) module. The UCMR block includes Pose-guided Visual Refinement (P2V-R) and Visual-enriched Pose Refinement (V2P-R) for effective cross-modality learning. The SGMC module includes multi-level contrastive associations of vision-text and pose-text at both action and sub-action levels, and a Semantic-Guided Loss, enabling effective contrastive learning with text. Built upon a pre-trained VL foundation model, our model integrates trainable adapters and can be trained end-to-end. Our novel PeVL design over the VL foundation model yields remarkable performance gains on four fine-grained human action recognition datasets, achieving a new SOTA with a remarkably small number of FLOPs for low-cost re-training.
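As a toy rendering of the action-level contrastive associations mentioned above (an InfoNCE-style loss between stand-in vision, pose, and text features; the sub-action level, the semantic-guided loss, and the UCMR refinement are not reproduced):

```python
# Symmetric InfoNCE between vision-text and pose-text pairs at the action level.
import torch
import torch.nn.functional as F

def info_nce(a, b, temp=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

B, d = 16, 512
vis = torch.randn(B, d, requires_grad=True)    # video features from the VL model's vision tower
pose = torch.randn(B, d, requires_grad=True)   # skeleton/pose features
text = torch.randn(B, d)                       # text embeddings of the action descriptions

loss = info_nce(vis, text) + info_nce(pose, text)   # action-level associations
loss.backward()
```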