Artificial neural networks often suffer from catastrophic forgetting, where learning new concepts leads to a complete loss of previously acquired knowledge. We observe that this issue is particularly magnified in visi...
详细信息
ISBN:
(数字)9798350365474
ISBN:
(纸本)9798350365481
Artificial neural networks often suffer from catastrophic forgetting, where learning new concepts leads to a complete loss of previously acquired knowledge. We observe that this issue is particularly magnified in vision transformers (ViTs), where post-pre-training and fine-tuning on new tasks can significantly degrade the model’s original general abilities. For instance, a DINO ViT-Base/16 pre-trained on ImageNet-1k loses over 70% accuracy on ImageNet-1k after just 10 iterations of fine-tuning on CIFAR-100. Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains while preserving their initial knowledge. In this work, we study two new parameter-efficient fine-tuning strategies: (1) Block Expansion, and (2) Low-rank adaptation (LoRA). Our experiments reveal that using either Block Expansion or LoRA on self-supervised pre-trained ViTs surpass fully fine-tuned ViTs in new domains while offering significantly greater parameter efficiency. Notably, we find that Block Expansion experiences only a minimal performance drop in the pre-training domain, thereby effectively mitigating catastrophic forgetting in pre-trained ViTs
1
.
Medical Image Foundation Models have proven to be powerful tools for mask prediction across various datasets. However, accurately assessing the uncertainty of their predictions remains a significant challenge. To addr...
详细信息
In this paper we address multi-target domain adaptation (MTDA), where given one labeled source dataset and multiple unlabeled target datasets that differ in data distributions, the task is to learn a robust predictor ...
详细信息
ISBN:
(纸本)9781665445092
In this paper we address multi-target domain adaptation (MTDA), where given one labeled source dataset and multiple unlabeled target datasets that differ in data distributions, the task is to learn a robust predictor for all the target domains. We identify two key aspects that can help to alleviate multiple domain-shifts in the MTDA: feature aggregation and curriculum learning. To this end, we propose Curriculum Graph Co-Teaching (CGCT) that uses a dual classifier head, with one of them being a graph convolutional network (GCN) which aggregates features from similar samples across the domains. To prevent the classifiers from over-fitting on its own noisy pseudo-labels we develop a co-teaching strategy with the dual classifier head that is assisted by curriculum learning to obtain more reliable pseudo-labels. Furthermore, when the domain labels are available, we propose Domain-aware Curriculum Learning (DCL), a sequential adaptation strategy that first adapts on the easier target domains, followed by the harder ones. We experimentally demonstrate the effectiveness of our proposed frameworks on several benchmarks and advance the state-of-the-art in the MTDA by large margins (e.g. +5.6% on the DomainNet).
Post-mortem iris recognition is an emerging application of iris-based human identification in a forensic setup, able to correctly identify deceased subjects even three weeks post-mortem. This technique thus is conside...
Post-mortem iris recognition is an emerging application of iris-based human identification in a forensic setup, able to correctly identify deceased subjects even three weeks post-mortem. This technique thus is considered as an important component of future forensic toolkits. The current advancements in this field are seriously slowed down by exceptionally difficult data collection, which can happen in mortuary conditions, at crime scenes, or in “body farm” facilities. This paper makes a novel contribution to facilitate progress in post-mortem iris recognition by offering a conditional StyleGAN-based iris synthesis model, trained on the largest-available dataset of post-mortem iris samples acquired from more than 350 subjects, generating – through appropriate exploration of StyleGAN latent space – multiple within-class (same identity) and between-class (different new identities) post-mortem iris images, compliant with ISO/IEC 29794-6, and with decomposition deformations controlled by the requested PMI (post mortem interval). Besides an obvious application to enhance the existing, very sparse, post-mortem iris datasets to advance – among others – iris presentation attack endeavors, we anticipate it may be useful to generate samples that would expose professional forensic human examiners to never-seen-before deformations for various PMIs, increasing their training effectiveness. The source codes and model weights are made available with the paper.
Current state-of-the-art approaches for Semi-supervised Video Object Segmentation (Semi-VOS) propagates information from previous frames to generate segmentation mask for the current frame. This results in high-qualit...
详细信息
ISBN:
(纸本)9781665445092
Current state-of-the-art approaches for Semi-supervised Video Object Segmentation (Semi-VOS) propagates information from previous frames to generate segmentation mask for the current frame. This results in high-quality segmentation across challenging scenarios such as changes in appearance and occlusion. But it also leads to unnecessary computations for stationary or slow-moving objects where the change across frames is minimal. In this work, we exploit this observation by using temporal information to quickly identify frames with minimal change and skip the heavyweight mask generation step. To realize this efficiency, we propose a novel dynamic network that estimates change across frames and decides which path - computing a full network or reusing previous frame's feature - to choose depending on the expected similarity. Experimental results show that our approach significantly improves inference speed without much accuracy degradation on challenging Semi-VOS datasets - DAVIS 16, DAVIS 17, and YouTube-VOS. Furthermore, our approach can be applied to multiple Semi-VOS methods demonstrating its generality.
Scene text recognition (STR) task has a common practice: All state-of-the-art STR models are trained on large synthetic data. In contrast to this practice, training STR models only on fewer real labels (STR with fewer...
详细信息
ISBN:
(纸本)9781665445092
Scene text recognition (STR) task has a common practice: All state-of-the-art STR models are trained on large synthetic data. In contrast to this practice, training STR models only on fewer real labels (STR with fewer labels) is important when we have to train STR models without synthetic data: for handwritten or artistic texts that are difficult to generate synthetically and for languages other than English for which we do not always have synthetic data. However, there has been implicit common knowledge that training STR models on real data is nearly impossible because real data is insufficient. We consider that this common knowledge has obstructed the study of STR with fewer labels. In this work, we would like to reactivate STR with fewer labels by disproving the common knowledge. We consolidate recently accumulated public real data and show that we can train STR models satisfactorily only with real labeled data. Subsequently, we find simple data augmentation to fully exploit real data. Furthermore, we improve the models by collecting unlabeled data and introducing semi- and self-supervised methods. As a result, we obtain a competitive model to state-of-the-art methods. To the best of our knowledge, this is the first study that 1) shows sufficient performance by only using real labels and 2) introduces semi- and self-supervised methods into STR with fewer labels.
3D morphable models are widely used for the shape representation of an object class in computervision and graphics applications. In this work, we focus on deep 3D morphable models that directly apply deep learning on...
详细信息
ISBN:
(纸本)9781665445092
3D morphable models are widely used for the shape representation of an object class in computervision and graphics applications. In this work, we focus on deep 3D morphable models that directly apply deep learning on 3D mesh data with a hierarchical structure to capture information at multiple scales. While great efforts have been made to design the convolution operator, how to best aggregate vertex features across hierarchical levels deserves further attention. In contrast to resorting to mesh decimation, we propose an attention based module to learn mapping matrices for better feature aggregation across hierarchical levels. Specifically, the mapping matrices are generated by a compatibility function of the keys and queries. The keys and queries are trainable variables, learned by optimizing the target objective, and shared by all data samples of the same object class. Our proposed module can be used as a train-only drop-in replacement for the feature aggregation in existing architectures for both downsampling and upsampling. Our experiments show that through the end-to-end training of the mapping matrices, we achieve state-of-the-art results on a variety of 3D shape datasets in comparison to existing morphable models.
The current evaluation protocol of long-tailed visual recognition trains the classification model on the long-tailed source label distribution and evaluates its performance on the uniform target label distribution. Su...
详细信息
ISBN:
(纸本)9781665445092
The current evaluation protocol of long-tailed visual recognition trains the classification model on the long-tailed source label distribution and evaluates its performance on the uniform target label distribution. Such protocol has questionable practicality since the target may also be long-tailed. Therefore, we formulate long-tailed visual recognition as a label shift problem where the target and source label distributions are different. One of the significant hurdles in dealing with the label shift problem is the entanglement between the source label distribution and the model prediction. In this paper, we focus on disentangling the source label distribution from the model prediction. We first introduce a simple but overlooked baseline method that matches the target label distribution by post-processing the model prediction trained by the cross-entropy loss and the Softmax function. Although this method surpasses state-of-the-art methods on benchmark datasets, it can be further improved by directly disentangling the source label distribution from the model prediction in the training phase. Thus, we propose a novel method, LAbel distribution DisEntangling (LADE) loss based on the optimal bound of Donsker-Varadhan representation. LADE achieves state-of-the-art performance on benchmark datasets such as CIFAR-100-LL Places-LT ImageNet-LL and iNaturalist 2018. Moreover LADE outperforms existing methods on various shifted target label distributions, showing the general adaptability of our proposed method.
We introduce a meta-regularization framework for learning-based image registration. Current learning-based image registration methods use high-resolution architectures such as U-Nets to produce spatial transformations...
详细信息
ISBN:
(纸本)9781665445092
We introduce a meta-regularization framework for learning-based image registration. Current learning-based image registration methods use high-resolution architectures such as U-Nets to produce spatial transformations, and impose simple and explicit regularization on the output of the network to ensure that the estimated displacements are smooth. While this approach works well on small deformations, it has been known to struggle when the deformations are large. Our method uses a more advanced form of meta-regularization to increase the generalization ability of learned registration models. We motivate our approach based on Reproducing Kernel Hilbert Space (RKHS) theory, and approximate that framework via a meta-regularization convolutional layer with radially symmetric, positive semi-definite filters that inherent its regularization properties. We then provide a method to learn such regularization filters while also learning to register. Our experiments on synthetic and real datasets as well as ablation analysis show that our method can improve anatomical correspondence compared to competing methods, and reduce the percentage of folding and tear in the large deformation setting, reflecting better regularization and model generalization.
Modern lane detection methods have achieved remarkable performances in complex real-world scenarios, but many have issues maintaining real-time efficiency, which is important for autonomous vehicles. In this work, we ...
详细信息
ISBN:
(纸本)9781665445092
Modern lane detection methods have achieved remarkable performances in complex real-world scenarios, but many have issues maintaining real-time efficiency, which is important for autonomous vehicles. In this work, we propose LaneATT: an anchor-based deep lane detection model, which, akin to other generic deep object detectors, uses the anchors for the feature pooling step. Since lanes follow a regular pattern and are highly correlated, we hypothesize that in some cases global information may be crucial to infer their positions, especially in conditions such as occlusion, missing lane markers, and others. Thus, this work proposes a novel anchor-based attention mechanism that aggregates global information. The model was evaluated extensively on three of the most widely used datasets in the literature. The results show that our method outperforms the current state-of-the-art methods showing both higher efficacy and efficiency. Moreover, an ablation study is performed along with a discussion on efficiency trade-off options that are useful in practice.
暂无评论