ISBN (digital): 9798350365474; ISBN (print): 9798350365481
Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.
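The low-rank adaptation idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' CLIP-LoRA implementation; the class name, rank, and scaling are illustrative assumptions, showing only the core mechanism: a frozen pre-trained weight plus a trainable low-rank update that starts at zero.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch (hypothetical, NumPy-only): a frozen weight W
    plus a trainable low-rank update scale * (B @ A). Only A and B would
    be updated during few-shot adaptation; W stays frozen."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                       # frozen weight, shape (out, in)
        out_dim, in_dim = W.shape
        self.A = rng.normal(0, 0.01, size=(r, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, r))                  # zero-init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, in_dim); the second term is the low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.eye(3)
layer = LoRALinear(W)
x = np.ones((2, 3))
# B is zero-initialized, so the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ W.T))  # True
```

Because B starts at zero, adaptation begins exactly at the pre-trained model and only the small matrices A and B (rank r per layer) need gradients, which is what keeps training light.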
We investigate the problem of incremental learning for object counting, where a method must learn to count a variety of object classes from a sequence of datasets. A naïve approach to incremental object counting would suffer from catastrophic forgetting, experiencing a dramatic performance drop on previous tasks. In this paper, we propose a new exemplar-free functional regularization method, called Density Map Distillation (DMD). During training, we introduce a new counter head for each task and introduce a distillation loss to prevent forgetting of previous tasks. Additionally, we introduce a cross-task adaptor that projects the features of the current backbone to the previous backbone. This adaptor allows for the learning of new features while the backbone retains the relevant features for previous tasks. Finally, we set up experiments of incremental learning for counting new objects. Results confirm that our method greatly reduces catastrophic forgetting and outperforms existing methods.
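The functional-regularization idea can be sketched as follows. This is an assumption-laden simplification, not the paper's code: it only shows the shape of a distillation penalty that keeps the current model's density-map predictions for old tasks close to the frozen previous model's predictions.

```python
import numpy as np

def density_map_distillation_loss(current_maps, previous_maps):
    """Hypothetical sketch of density-map distillation: penalize the
    current model when its predicted density maps for previous tasks
    drift from the frozen previous model's predictions (MSE per task,
    averaged over tasks)."""
    per_task = [np.mean((c - p) ** 2) for c, p in zip(current_maps, previous_maps)]
    return float(np.mean(per_task))

# Two previous tasks; the current model still reproduces the old outputs,
# so the regularizer applies no penalty.
old = [np.ones((4, 4)), np.zeros((4, 4))]
new = [np.ones((4, 4)), np.zeros((4, 4))]
print(density_map_distillation_loss(new, old))  # 0.0
```

In training this term would be added to the counting loss of the current task, so gradients trade off new-task accuracy against drift on old tasks.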
Performance analyses based on videos are commonly used by coaches of athletes in various sports disciplines. In individual sports, these analyses mainly comprise the body posture. This paper focuses on the disciplines of triple, high, and long jump, which require fine-grained locations of the athlete’s body. Typical human pose estimation datasets provide only a very limited set of keypoints, which is not sufficient in this case. Therefore, we propose a method to detect arbitrary keypoints on the whole body of the athlete by leveraging the limited set of annotated keypoints and auto-generated segmentation masks of body parts. Evaluations show that our model is capable of detecting keypoints on the head, torso, hands, feet, arms, and legs, including bent elbows and knees. We analyze and compare different techniques to encode desired keypoints as the model’s input and their embedding for the Transformer backbone.
Dynamic Facial Expression Recognition (DFER) has received significant interest in recent years, driven by its pivotal role in enabling empathic and human-compatible technologies. Achieving robustness towards in-the-wild data in DFER is particularly important for real-world applications. One of the directions aimed at improving such models is multimodal emotion recognition based on audio and video data. Multimodal learning in DFER increases the model capabilities by leveraging richer, complementary data representations. Within the field of multimodal DFER, recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders [40]. Another line of research has focused on adapting pre-trained static models for DFER [8]. In this work, we propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders. We identify the main challenges associated with this task, namely, intra-modality adaptation, cross-modal alignment, and temporal adaptation, and propose solutions to each of them. As a result, we demonstrate improvement over the current state-of-the-art on two popular DFER benchmarks, namely DFEW [19] and MAFW [29].
Recently, zero-cost proxies for neural architecture search (NAS) have attracted increasing attention. They allow us to discover top-performing neural networks through architecture scoring without requiring the training of a very large network (i.e., a supernet). Thus, they can save significant computational resources during the search. However, to our knowledge, no single proxy works best for different tasks and scenarios. To consolidate the strength of different proxies and to reduce search bias, we propose a unified proxy neural architecture search framework (UP-NAS) which learns a multi-proxy estimator for predicting a unified score by combining multiple zero-cost proxies. The predicted score is then used for an efficient gradient-ascent architecture search in the embedding space of the neural network architectures. Our approach can not only save computational time required for multiple proxies during architecture search but also gain the flexibility to consolidate the existing proxies on different tasks. We conduct experiments on the search spaces of NAS-Bench-201 and DARTS in different datasets. The results demonstrate the effectiveness of the proposed approach. Code is available at https://***/AI-Application-and-Integration-Lab/UP-NAS.
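The two-stage idea of combining proxies into one score and then searching by gradient ascent can be sketched in a few lines. This is a hypothetical toy, not UP-NAS itself: the synthetic proxy data, the linear least-squares estimator, and the step size are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of the UP-NAS idea: fit a multi-proxy estimator that
# maps several zero-cost proxy scores to a single unified score, then follow
# the estimator's gradient to search. All data below is synthetic.
rng = np.random.default_rng(0)
proxies = rng.normal(size=(100, 3))        # 3 proxy scores for 100 architectures
true_acc = proxies @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.01, 100)

# Least-squares estimator combining the proxies into one unified score.
w, *_ = np.linalg.lstsq(proxies, true_acc, rcond=None)

def unified_score(p):
    return p @ w

# Gradient-ascent search: for this linear estimator, the gradient of the
# unified score with respect to the proxy vector is simply w.
p = np.zeros(3)
for _ in range(50):
    p = p + 0.1 * w          # move toward a higher unified score
```

In the actual framework the estimator is learned over architecture embeddings rather than raw proxy vectors, but the search loop has the same structure: score, take a gradient step, repeat.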
ISBN (print): 9781665448994
Virtually all of the deep learning literature relies on the assumption of large amounts of available training data. Indeed, even the majority of few-shot learning methods rely on a large set of "base classes" for pre-training. This assumption, however, does not always hold. For some tasks, annotating a large number of classes can be infeasible, and even collecting the images themselves can be a challenge in some scenarios. In this paper, we study this problem and call it the "Small Data" setting, in contrast to "Big Data." To unlock the full potential of small data, we propose to augment the models with annotations for other related tasks, thus increasing their generalization abilities. In particular, we use the richly annotated scene parsing dataset ADE20K to construct our realistic Long-tail recognition with Diverse Supervision (LRDS) benchmark, by splitting the object categories into head and tail based on their distribution. Following the standard few-shot learning protocol, we use the head classes for representation learning and the tail classes for evaluation. Moreover, we further subsample the head categories and images to generate two novel settings which we call "Scarce-Class" and "Scarce-Image," respectively corresponding to the shortage of training classes and images. Finally, we analyze the effect of applying various additional supervision sources under the proposed settings. Our experiments demonstrate that densely labeling a small set of images can indeed largely remedy the small data constraints. Our code and benchmark are available at https://***/BinahHu/ADE-FewShot.
Classification of seat occupancy in the vehicle interior remains a significant challenge and is a promising area in the functionality of new-generation cars. As the majority of accidents are related to driver error, the consequences of not wearing, or improperly wearing, a seat belt are clear. The NHTSA reports that 47% of the 22,215 passenger vehicle occupants killed in 2019 were not wearing seat belts. To address this problem we propose a deep learning based framework to classify seat occupancy into the seven most important categories. In this study, we present an interpretable and explainable AI approach that takes advantage of pre-trained networks including ResNet152V2, DenseNet121 and the most recent EfficientNetB0-B5-B7 to calculate the feature vectors, followed by an adjusted densely-connected classifier. Our model provides an interpretation of its results through the identification of object parts without direct supervision and their contribution towards classification. We explore and propose two new statistical metrics, HGD(score) and HGDA(score), which are based on the multivariate Gaussian distribution for assessing heatmaps without using human-annotated object parts, to quantify the interpretability of our network. We demonstrate that the calculated statistical metrics lead to an interpretable model that correlates with the framework accuracy and can flexibly analyze heatmaps at any resolution for different user needs. Furthermore, extensive experiments have been performed on the SVIRO database [7] including 7,500 sceneries for the BMW X5 model, which confirm the ability of the developed framework based on the EfficientNetB5 architecture to classify seat occupancy into seven main categories with 79.87% overall accuracy as well as 95.92% recall and 90.32% specificity for empty seat recognition, which is a state-of-the-art result in this domain.
Accurate identification and localization of anatomical structures of varying size and appearance in laparoscopic imaging are necessary to leverage the potential of computer vision techniques for surgical decision support. Segmentation performance of such models is traditionally reported using metrics of overlap such as IoU. However, imbalanced and unrealistic representation of classes in the training data and suboptimal selection of reported metrics have the potential to skew nominal segmentation performance and thereby ultimately limit clinical translation. In this work, we systematically analyze the impact of class characteristics (i.e., organ size differences), training and test data composition (i.e., representation of positive and negative examples), and modeling parameters (i.e., foreground-to-background class weight) on eight segmentation metrics: accuracy, precision, recall, IoU, F1 score (Dice Similarity Coefficient), specificity, Hausdorff Distance, and Average Symmetric Surface Distance. Our findings support two adjustments to account for data biases in surgical data science: first, training on datasets that are similar to the clinical real-world scenarios in terms of class distribution, and second, class weight adjustments to optimize segmentation model performance with regard to metrics of particular relevance in the respective clinical setting.
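The metric-bias point above can be made concrete with a small example. This sketch (illustrative only; the boundary metrics Hausdorff Distance and ASSD are omitted) computes overlap metrics on binary masks and shows how class imbalance inflates accuracy while IoU and Dice expose the failure.

```python
import numpy as np

def seg_metrics(pred, gt):
    """Overlap metrics from the abstract on binary masks (boolean arrays).
    Boundary-distance metrics are omitted in this sketch."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "IoU": tp / (tp + fp + fn),
        "Dice": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / pred.size,
    }

# A small organ in a large background: even a segmentation that misses half
# the organ gets near-perfect accuracy, because the background dominates.
gt = np.zeros((100, 100), bool); gt[:10, :10] = True
pred = np.zeros((100, 100), bool); pred[:10, :5] = True   # half the organ
m = seg_metrics(pred, gt)
# accuracy = 0.995 despite IoU = 0.5
```

This is exactly the kind of skew the abstract warns about: for small structures, accuracy (and to a lesser degree specificity) is dominated by the background class and should not be the reported headline metric.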
Generative Adversarial Networks (GANs) have shown an outstanding ability to generate high-quality images with visual realism and similarity to real images. This paper presents a new architecture for thermal image enhancement. Precisely, the strengths of vision transformer-based architectures and generative adversarial networks are exploited. The thermal loss introduced in our approach is specifically used to produce high-quality images. Thermal image enhancement also relies on fine-tuning based on visible images, resulting in an overall improvement in image quality. A visual quality metric was used to evaluate the performance of the proposed architecture. Significant improvements were found over the original thermal images and other enhancement methods on a subset of the KAIST dataset. The performance of the proposed enhancement architecture is also verified on detection results, obtaining better performance by a considerable margin with different versions of the YOLO detector.
The potential for zero-shot generalization in vision-language (V-L) models such as CLIP has spurred their widespread adoption in addressing numerous downstream tasks. Previous methods have employed test-time prompt tuning to adapt the model to unseen domains, but they overlooked the issue of imbalanced class distributions. In this study, we explicitly address this problem by employing class-aware prototype alignment weighted by mean class probabilities obtained for the test sample and filtered augmented views. Additionally, we ensure that the class probabilities are as accurate as possible by performing prototype discrimination using contrastive learning. The combination of alignment and discriminative loss serves as a geometric regularizer, preventing the prompt representation from collapsing onto a single class and effectively bridging the distribution gap between the source and test domains. Our method, named PromptSync, synchronizes the prompts for each test sample on both the text and vision branches of the V-L model. In empirical evaluations on the domain generalization benchmark, our method outperforms previous best methods by 2.33% in overall performance, by 1% in base-to-novel generalization, and by 2.84% in cross-dataset transfer tasks.
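The class-aware prototype alignment described above can be sketched as follows. This is a hypothetical simplification, not the PromptSync implementation: the function names, shapes, and the plain squared-distance alignment are illustrative assumptions. It shows only the weighting mechanism, where mean class probabilities over augmented views decide how strongly each view's feature is pulled toward each class prototype.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prototype_alignment_loss(view_feats, view_logits, prototypes):
    """Hypothetical sketch of class-aware prototype alignment: the mean
    class probabilities over the filtered augmented views weight the
    squared distance of each view's feature to each class prototype."""
    probs = softmax(view_logits)                  # (n_views, n_classes)
    mean_probs = probs.mean(axis=0)               # the sample's class weights
    # squared distance of each view feature to each class prototype
    d = ((view_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return float((mean_probs[None, :] * d).mean())

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))       # 4 filtered augmented views, 8-dim features
logits = rng.normal(size=(4, 3))      # predictions over 3 classes
protos = rng.normal(size=(3, 8))      # one prototype per class
loss = prototype_alignment_loss(feats, logits, protos)
```

Weighting by mean class probabilities rather than a hard argmax is what makes the alignment class-aware under imbalanced test distributions: unlikely classes contribute little pull, but no class is ignored outright.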