ISBN (print): 9781665445092
Unsupervised Domain Adaptation (UDA) can tackle the challenge that convolutional neural network (CNN)-based approaches for semantic segmentation heavily rely on pixel-level annotated data, which is labor-intensive to obtain. However, existing UDA approaches inevitably require full access to the source datasets to reduce the gap between the source and target domains during model adaptation, which is impractical in real scenarios where the source datasets are private and thus cannot be released along with the well-trained source models. To cope with this issue, we propose a source-free domain adaptation framework for semantic segmentation, namely SFDA, in which only a well-trained source model and an unlabeled target domain dataset are available for adaptation. SFDA not only recovers and preserves the source-domain knowledge from the source model via knowledge transfer during model adaptation, but also distills valuable information from the target domain for self-supervised learning. Pixel- and patch-level optimization objectives tailored for semantic segmentation are seamlessly integrated into the framework. Extensive experimental results on numerous benchmark datasets highlight the effectiveness of our framework against existing UDA approaches that rely on source data.
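The abstract does not give the exact form of the pixel- and patch-level objectives; the following PyTorch sketch only illustrates one plausible combination of self-supervised target-domain losses under assumed definitions (a confidence-masked pixel-level pseudo-label loss plus a patch-level entropy term). The function name, confidence threshold, and patch size are all hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

def pixel_and_patch_losses(logits, conf_thresh=0.9, patch_size=8):
    """Illustrative self-supervised objectives on unlabeled target images.

    logits: [B, C, H, W] segmentation logits from the adapted model.
    Pixel level: cross-entropy against confident pseudo-labels.
    Patch level: entropy minimization on patch-averaged predictions.
    """
    probs = torch.softmax(logits, dim=1)                      # [B, C, H, W]
    conf, pseudo = probs.max(dim=1)                           # [B, H, W]

    # Pixel-level: only confident pixels contribute to the pseudo-label loss.
    ce = F.cross_entropy(logits, pseudo, reduction="none")    # [B, H, W]
    mask = (conf > conf_thresh).float()
    pixel_loss = (ce * mask).sum() / mask.sum().clamp(min=1.0)

    # Patch-level: average predictions over patches, then minimize entropy
    # so that each patch commits to a coherent label distribution.
    patch_probs = F.avg_pool2d(probs, kernel_size=patch_size)  # [B, C, H/p, W/p]
    patch_entropy = -(patch_probs * (patch_probs + 1e-8).log()).sum(dim=1)
    patch_loss = patch_entropy.mean()

    return pixel_loss, patch_loss

# Example usage with random logits standing in for model output.
logits = torch.randn(2, 19, 64, 64)
pl, ql = pixel_and_patch_losses(logits)
print(pl.item(), ql.item())
```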
ISBN (print): 9781665445092
This paper studies the problem of semi-supervised video object segmentation (VOS). Multiple works have shown that memory-based approaches can be effective for video object segmentation. They are mostly based on pixel-level matching, both spatially and temporally. The main shortcoming of memory-based approaches is that they do not take into account the sequential order among frames and do not exploit object-level knowledge from the target. To address this limitation, we propose a framework that Learns position and target Consistency for Memory-based video object segmentation, termed LCM. It applies the memory mechanism to retrieve pixels globally, and meanwhile learns position consistency for more reliable segmentation. The learned location response promotes better discrimination between the target and distractors. Besides, LCM introduces an object-level relationship from the target to maintain target consistency, making LCM more robust to error drifting. Experiments show that our LCM achieves state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks. Moreover, we rank first in the DAVIS 2020 challenge semi-supervised VOS task.
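For context, the pixel-level matching that memory-based VOS methods (and hence LCM) build on is a space-time attention read between query and memory features. The sketch below shows that generic read operation only; it is not LCM's position- or target-consistency module, and the tensor layout is an assumption.

```python
import torch

def memory_read(query_key, mem_key, mem_value):
    """Generic pixel-level memory read used by matching-based VOS methods.

    query_key: [B, Ck, H, W]    keys of the current frame
    mem_key:   [B, Ck, T, H, W] keys of memorized frames
    mem_value: [B, Cv, T, H, W] values of memorized frames
    Returns:   [B, Cv, H, W]    value read for every query pixel.
    """
    B, Ck, H, W = query_key.shape
    q = query_key.flatten(2)                          # [B, Ck, HW]
    k = mem_key.flatten(2)                            # [B, Ck, THW]
    v = mem_value.flatten(2)                          # [B, Cv, THW]

    affinity = torch.einsum("bck,bcm->bkm", q, k)     # [B, HW, THW]
    affinity = torch.softmax(affinity / Ck ** 0.5, dim=-1)
    read = torch.einsum("bkm,bcm->bck", affinity, v)  # [B, Cv, HW]
    return read.view(B, -1, H, W)

# Toy example: 2 memory frames, 16x16 feature maps.
qk = torch.randn(1, 64, 16, 16)
mk = torch.randn(1, 64, 2, 16, 16)
mv = torch.randn(1, 128, 2, 16, 16)
print(memory_read(qk, mk, mv).shape)   # torch.Size([1, 128, 16, 16])
```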
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Face recognition systems are widely used in real-world scenarios but are susceptible to physical and digital attacks. Effective methods for the unified detection of both physical and digital face attacks are essential to ensure the reliability of face recognition systems. However, how to obtain a unified face attack detection model with adequate fine-grained perception and cross-domain generalization remains an open challenge. To address this issue, we first propose a two-stage training strategy that utilizes unlabeled face images with masked image modeling and unleashes the potential of vision transformers. Furthermore, we propose a novel method termed Micro Disturbance, which enriches the representation distribution of forged faces and increases the diversity of the training data, thereby addressing the issue of cross-domain generalization. Owing to the effectiveness of the proposed methods, our model wins third place in the 5th Face Anti-Spoofing Challenge@CVPR2024, with an impressive ACER score of 5.511.
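The abstract does not specify how Micro Disturbance is implemented. Purely as an assumed reading of "enriching the representation distribution of forged faces", the sketch below adds a small, bounded random perturbation to attack samples only during training; every name and the perturbation budget are hypothetical stand-ins, not the paper's method.

```python
import torch

def micro_disturbance(images, labels, spoof_label=1, eps=2.0 / 255.0):
    """Hypothetical sketch: add a small, bounded random perturbation to
    forged-face samples only, to diversify their training distribution.

    images: [B, 3, H, W] in [0, 1]; labels: [B] (1 = attack, 0 = bona fide).
    """
    noise = torch.empty_like(images).uniform_(-eps, eps)
    is_spoof = (labels == spoof_label).view(-1, 1, 1, 1).float()
    disturbed = images + noise * is_spoof
    return disturbed.clamp(0.0, 1.0)

# Usage inside a training loop (the detector and loss are assumed to exist elsewhere).
imgs = torch.rand(4, 3, 224, 224)
lbls = torch.tensor([0, 1, 1, 0])
aug = micro_disturbance(imgs, lbls)
print(aug.shape)
```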
ISBN (print): 9781665445092
In this paper, we propose a feature-embedding-based video object segmentation (VOS) method that is simple, fast, and effective. The current VOS task involves two main challenges: object instance differentiation and cross-frame instance alignment. Most state-of-the-art matching-based VOS methods simplify this task into a binary segmentation task and tackle each instance independently. In contrast, we decompose the VOS task into two subtasks: global embedding learning, which segments foreground objects of each frame in a pixel-to-pixel manner, and instance feature embedding learning, which separates instances. The outputs of these two subtasks are fused to obtain the final instance masks quickly and accurately. By using the relations among different instances within each frame as well as the temporal relations across frames, the proposed network learns to differentiate multiple instances and associate them properly in one feed-forward pass. Extensive experimental results on the challenging DAVIS [34] and YouTube-VOS [57] datasets show that our method achieves better performance than most counterparts in each case.
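The fusion of the two subtask outputs is not spelled out in the abstract. One plausible, assumed fusion is sketched below: every pixel predicted as foreground by the global branch is assigned to the reference instance whose embedding is nearest. The function name, threshold, and assignment rule are illustrative assumptions, not the paper's exact design.

```python
import torch

def fuse_masks(fg_prob, pixel_emb, inst_embs, fg_thresh=0.5):
    """Assumed fusion of the global-foreground and instance-embedding outputs.

    fg_prob:   [H, W]     foreground probability from the global branch
    pixel_emb: [D, H, W]  per-pixel instance embeddings
    inst_embs: [N, D]     one reference embedding per target instance
    Returns:   [H, W] int 0 = background, i = instance index (1-based)
    """
    D, H, W = pixel_emb.shape
    flat = pixel_emb.view(D, -1).t()                  # [HW, D]
    # Distance from every pixel embedding to every instance embedding.
    dists = torch.cdist(flat, inst_embs)              # [HW, N]
    nearest = dists.argmin(dim=1).view(H, W) + 1      # 1-based instance ids
    return torch.where(fg_prob > fg_thresh, nearest, torch.zeros_like(nearest))

fg = torch.rand(32, 32)
emb = torch.randn(16, 32, 32)
refs = torch.randn(3, 16)
print(fuse_masks(fg, emb, refs).unique())
```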
ISBN (print): 9781665448994
Image colourisation is an ill-posed problem, with multiple correct solutions that depend on the context and object instances present in the input datum. Previous approaches attacked the problem either by requiring intense user interaction or by exploiting the ability of convolutional neural networks (CNNs) to learn image-level (context) features. However, obtaining human hints is not always feasible, and CNNs alone are not able to learn entity-level semantics unless multiple models pre-trained with supervision are considered. In this work, we propose a single network, named UCapsNet, that takes into consideration the image-level features obtained through convolutions and the entity-level features captured by means of capsules. Then, by skip connections over different layers, we enforce collaboration between the convolutional and entity-level factors to produce high-quality and plausible image colourisation. We pose the problem as a classification task that can be addressed by a fully unsupervised approach, thus requiring no human effort. Experimental results on three benchmark datasets show that our approach outperforms existing methods on standard quality metrics and achieves state-of-the-art performance on image colourisation. A large-scale user study shows that our method is preferred over existing solutions. Code is available at https://***/Riretta/Image_Colourisation_WiCV_2021.
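Posing colourisation as classification is commonly done by quantizing the ab colour channels into bins and training with per-pixel cross-entropy; the sketch below shows that common formulation only, with an assumed 10-unit grid and 22x22 bins, and is not necessarily UCapsNet's exact parameterization.

```python
import torch
import torch.nn.functional as F

def quantize_ab(ab, grid=10, ab_range=110):
    """Map continuous ab values in [-ab_range, ab_range) to discrete bin ids."""
    bins_per_axis = (2 * ab_range) // grid                 # 22 bins per axis
    idx = ((ab + ab_range) / grid).floor().long().clamp(0, bins_per_axis - 1)
    return idx[:, 0] * bins_per_axis + idx[:, 1]           # [B, H, W] class ids

def colourisation_loss(logits, ab_target):
    """logits: [B, K, H, W] over K colour bins; ab_target: [B, 2, H, W]."""
    target = quantize_ab(ab_target)
    return F.cross_entropy(logits, target)

# Toy example: 484 = 22 * 22 bins.
logits = torch.randn(2, 484, 32, 32)
ab = torch.empty(2, 2, 32, 32).uniform_(-110, 110)
print(colourisation_loss(logits, ab).item())
```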
ISBN (print): 9783031048814; 9783031048807
This paper lies at the intersection of three research areas: human action recognition, egocentric vision, and visual event-based sensors. The main goal is the comparison of egocentric action recognition performance under either of two visual sources: conventional images or event-based visual data. In this work, the events, as triggered by asynchronous event sensors or their simulation, are spatio-temporally aggregated into event frames (a grid-like representation). This allows exactly the same neural model to be used for both visual sources, thus easing a fair comparison. Specifically, a hybrid neural architecture combining a convolutional neural network and a recurrent network is used. It is empirically found that this general architecture works for both conventional gray-level frames and event frames. This finding is relevant because it reveals that no modification or adaptation is strictly required to deal with event data for egocentric action classification. Interestingly, action recognition is found to perform better with event frames, suggesting that these data provide discriminative information that helps the neural model learn good features.
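A minimal sketch of the spatio-temporal aggregation step follows: asynchronous events (x, y, t, polarity) are binned into a fixed number of temporal windows and accumulated on a pixel grid, one channel per polarity. The event tuple layout and bin counts are assumptions; the paper's exact event-frame construction may differ.

```python
import numpy as np

def events_to_frames(events, height, width, num_frames, t_start, t_end):
    """Aggregate asynchronous events into a grid-like event-frame tensor.

    events: array of shape [N, 4] with columns (x, y, t, polarity in {-1, +1}).
    Returns frames of shape [num_frames, 2, height, width], one channel per
    polarity, each cell counting the events that fell into that bin.
    """
    frames = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Temporal bin for each event.
    bin_idx = np.clip(((t - t_start) / (t_end - t_start) * num_frames).astype(int),
                      0, num_frames - 1)
    pol_idx = (p > 0).astype(int)                     # 0 = negative, 1 = positive
    np.add.at(frames, (bin_idx, pol_idx, y, x), 1.0)
    return frames

# Toy example: 1000 random events on a 64x64 sensor over 1 second.
ev = np.stack([np.random.randint(0, 64, 1000),
               np.random.randint(0, 64, 1000),
               np.random.uniform(0, 1, 1000),
               np.random.choice([-1, 1], 1000)], axis=1)
print(events_to_frames(ev, 64, 64, num_frames=8, t_start=0.0, t_end=1.0).shape)
```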
ISBN (print): 9781665445092
In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% J&F and 70 FPS on the DAVIS 2017 validation dataset, leading all existing solutions in overall accuracy and speed. We achieve this by elaborately compressing spatiotemporal redundancy in matching-based VOS via Pixel-Adaptive Memory (PAM). Temporally, PAM adaptively triggers memory updates on frames where objects display noteworthy inter-frame variation. Spatially, PAM selectively performs memory update and matching on dynamic pixels while ignoring static ones, significantly reducing redundant computation wasted on segmentation-irrelevant pixels. To promote efficient reference encoding, a light-aggregation encoder is also introduced in SwiftNet, deploying reversed sub-pixel convolution. We hope SwiftNet can set a strong and efficient baseline for real-time VOS and facilitate its application in mobile vision.
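To make the two gating decisions concrete, here is a heavily simplified sketch of frame-level and pixel-level selection using plain frame differencing. PAM's actual criteria operate on learned features and are more elaborate; the thresholds and differencing rule here are assumptions for illustration only.

```python
import torch

def should_update_memory(prev_frame, cur_frame, frame_thresh=0.05):
    """Trigger a memory update only when inter-frame variation is noteworthy.
    Frames: [3, H, W] in [0, 1]; the differencing criterion is illustrative."""
    variation = (cur_frame - prev_frame).abs().mean()
    return variation.item() > frame_thresh

def dynamic_pixel_mask(prev_frame, cur_frame, pixel_thresh=0.1):
    """Select dynamic pixels; static ones are skipped during memory update/match."""
    diff = (cur_frame - prev_frame).abs().mean(dim=0)   # [H, W]
    return diff > pixel_thresh

prev = torch.rand(3, 64, 64)
cur = prev.clone()
cur[:, 20:40, 20:40] += 0.5                             # simulate a moving object
print(should_update_memory(prev, cur), dynamic_pixel_mask(prev, cur).sum().item())
```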
ISBN (digital): 9798350353006
ISBN (print): 9798350353013
Despite significant progress in the field, it is still challenging to create personalized visual representations that align closely with the desires and preferences of individual users. This process requires users to articulate their ideas in words that are both comprehensible to the models and accurately capture their vision, posing difficulties for many users. In this paper, we tackle this challenge by leveraging historical user interactions with the system to enhance user prompts. We propose a novel approach that rewrites user prompts based on a newly collected large-scale text-to-image dataset with over 300k prompts from 3,115 users. Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs. Experimental results demonstrate the superiority of our methods over baseline approaches, as evidenced by our new offline evaluation method and online tests. Our code and dataset are available at https://***/zzjchen/Tailored-visions
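The rewriting model itself is learned from the collected dataset; as a hedged illustration of "leveraging historical interactions", the sketch below merely retrieves a user's most similar past prompts (using crude token overlap as a stand-in for any learned retrieval) and packs them as context for a rewriter. All names and the similarity measure are assumptions, not the paper's pipeline.

```python
def jaccard(a, b):
    """Token-overlap similarity between two prompts (a crude stand-in)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def build_rewrite_input(current_prompt, user_history, top_k=3):
    """Retrieve the user's most similar past prompts and pack them as context
    for a rewriting model (the model itself is assumed to exist elsewhere)."""
    ranked = sorted(user_history, key=lambda p: jaccard(current_prompt, p), reverse=True)
    return {"prompt": current_prompt, "history": ranked[:top_k]}

history = ["a cat in watercolor style", "watercolor landscape at dusk", "pixel art dog"]
print(build_rewrite_input("a dog in watercolor", history))
```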
ISBN (print): 9781665445092
We define the concept of CompositeTasking as the fusion of multiple, spatially distributed tasks for various aspects of image understanding. Learning to perform spatially distributed tasks is motivated by the frequent availability of only sparse labels across tasks and by the desire for a compact multi-tasking network. To facilitate CompositeTasking, we introduce a novel task-conditioning model: a single encoder-decoder network that performs multiple, spatially varying tasks at once. The proposed network takes an image and a set of pixel-wise dense task requests as inputs, and performs the requested prediction task for each pixel. Moreover, we also learn the composition of tasks to be performed according to CompositeTasking rules, which include deciding where to apply which task. This not only gives us a compact network for multi-tasking, but also allows for task editing. Another strength of the proposed method is that it only requires sparse supervision per task. The obtained results are on par with our baselines that use dense supervision and a multi-headed multi-tasking design. The source code will be made publicly available at ***/nikola3794/composite-tasking.
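To illustrate pixel-wise task conditioning in isolation, the sketch below maps each pixel's requested task id to an embedding that modulates the feature at that pixel (a FiLM-style scale-and-shift). This is an assumed mechanism for exposition; the paper's actual conditioning model may differ.

```python
import torch
import torch.nn as nn

class PixelwiseTaskConditioning(nn.Module):
    """Modulate decoder features per pixel according to the requested task."""
    def __init__(self, num_tasks, channels):
        super().__init__()
        self.scale = nn.Embedding(num_tasks, channels)
        self.shift = nn.Embedding(num_tasks, channels)

    def forward(self, feats, task_map):
        # feats: [B, C, H, W]; task_map: [B, H, W] with integer task ids.
        gamma = self.scale(task_map).permute(0, 3, 1, 2)   # [B, C, H, W]
        beta = self.shift(task_map).permute(0, 3, 1, 2)
        return feats * gamma + beta

cond = PixelwiseTaskConditioning(num_tasks=4, channels=32)
feats = torch.randn(1, 32, 16, 16)
tasks = torch.randint(0, 4, (1, 16, 16))
print(cond(feats, tasks).shape)                            # torch.Size([1, 32, 16, 16])
```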
ISBN (print): 9781665445092
The effectiveness of learning-based point cloud upsampling pipelines heavily relies on the upsampling modules and feature extractors used therein. For the point upsampling module, we propose a novel model called NodeShuffle, which uses a Graph Convolutional Network (GCN) to better encode local point information from point neighborhoods. NodeShuffle is versatile and can be incorporated into any point cloud upsampling pipeline. Extensive experiments show that NodeShuffle consistently improves state-of-the-art upsampling methods. For feature extraction, we also propose a new multi-scale point feature extractor, called Inception DenseGCN. By aggregating features at multiple scales, this feature extractor enables further performance gains in the final upsampled point clouds. We combine Inception DenseGCN with NodeShuffle into a new point upsampling pipeline called PU-GCN. PU-GCN sets new state-of-the-art performance with far fewer parameters and more efficient inference. Our code is publicly available at https://***/guochengqian/PU-GCN.
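The shuffle step can be read as the point-cloud analogue of pixel shuffle: expand per-point features r-fold, then fold the extra channels into r-times more points. The sketch below shows only that reshaping idea; NodeShuffle itself performs the expansion with a GCN over point neighborhoods, whereas a shared point-wise MLP (1x1 convolution) stands in here as an assumption for brevity.

```python
import torch
import torch.nn as nn

class NodeShuffleLike(nn.Module):
    """Expand per-point features, then shuffle channels into r-times more points."""
    def __init__(self, in_channels, out_channels, r):
        super().__init__()
        self.r = r
        # Stand-in for the GCN expansion used by the actual NodeShuffle.
        self.expand = nn.Conv1d(in_channels, out_channels * r, kernel_size=1)

    def forward(self, feats):
        # feats: [B, C_in, N] per-point features.
        x = self.expand(feats)                       # [B, C_out * r, N]
        B, _, N = x.shape
        x = x.view(B, -1, self.r, N)                 # [B, C_out, r, N]
        return x.reshape(B, -1, self.r * N)          # [B, C_out, r * N]

up = NodeShuffleLike(in_channels=64, out_channels=64, r=4)
pts_feat = torch.randn(2, 64, 256)
print(up(pts_feat).shape)                            # torch.Size([2, 64, 1024])
```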