ISBN (Print): 9781665488679
In deep learning, unsupervised domain adaptation (UDA) is commonly used when abundant labeled data are not available. Several UDA methods have been proposed to overcome the difficulty of distinguishing between semantically similar classes, such as person vs. rider and road vs. sidewalk. This confusion arises because the domain shift collapses the distance between such classes in the feature space. In this work, we present a versatile approach, text-image correlation-guided domain adaptation (TigDA), which maintains sufficient inter-class distance in the feature space to properly adjust the decision boundaries between classes. In our approach, class-level feature information is extracted through text embeddings of the class names, and the text features are aligned with the image features in a cross-modal manner. The resulting cross-modal features are used to generate pseudo-labels and to compute an auxiliary pixel-wise cross-entropy loss that assists the image encoder in learning the distribution of the cross-modal features. This guiding process widens the distance between similar classes in the feature space, so that an adequate margin for adjusting the decision boundaries is maintained. TigDA achieves the highest performance among UDA methods in both single-resolution and multi-resolution settings, using GTA5 and SYNTHIA as the source domains and Cityscapes as the target domain. Owing to its simplicity and versatility, TigDA is widely applicable for enhancing the self-training capability of most UDA methods.
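To make the guidance step concrete, the following PyTorch sketch shows one plausible reading of the mechanism the abstract describes: per-pixel image features are correlated with class text embeddings, the resulting cross-modal prediction supplies pseudo-labels, and an auxiliary pixel-wise cross-entropy loss pulls the segmentation output toward them. This is an illustrative sketch, not the authors' implementation; the names seg_logits, image_feats, text_embeds and the temperature tau are assumptions.

    import torch
    import torch.nn.functional as F

    def tigda_style_guidance(seg_logits, image_feats, text_embeds, tau=0.07):
        # seg_logits:  (B, C, H, W) prediction of the segmentation head
        # image_feats: (B, D, H, W) per-pixel features from the image encoder
        # text_embeds: (C, D) one text embedding per class name
        img = F.normalize(image_feats, dim=1)
        txt = F.normalize(text_embeds, dim=1)

        # Cross-modal logits: cosine similarity of every pixel to every class prompt.
        xmodal_logits = torch.einsum("bdhw,cd->bchw", img, txt) / tau

        # Pseudo-labels derived from the cross-modal prediction.
        pseudo_labels = xmodal_logits.argmax(dim=1)          # (B, H, W)

        # Auxiliary pixel-wise cross-entropy that pushes the image encoder's
        # prediction toward the distribution of the cross-modal features.
        aux_loss = F.cross_entropy(seg_logits, pseudo_labels)
        return pseudo_labels, aux_loss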
Unsupervised domain adaptation (UDA) is an important research topic in semantic segmentation, where pixel-wise annotations are often difficult to collect in the test environment due to their high labeling cost. Previous UDA-based studies trained their segmentation networks using labeled synthetic data and unlabeled real data as the source and target domains, respectively. However, they often fail to distinguish semantically similar classes, such as person vs. rider and road vs. sidewalk, because these classes are prone to confusion in domain-shifted environments. In this paper, we introduce the Language-Conditioned Masked Segmentation Model (LC-MSM), a new framework that jointly learns context relations and domain-agnostic information for domain-adaptive semantic segmentation. Specifically, we reconstruct semantic labels from a masked image, conditioned on the generalized text embeddings of the corresponding semantic classes from OpenCLIP, which contain domain-invariant knowledge learned from large-scale data. To this end, we correlate the generalized text embeddings with the per-pixel image features of the masked image, which encode the spatial context, thereby injecting domain-agnostic language information into the semantic decoder. This helps our model generalize to the target domain by learning context information within individual training instances while considering cross-domain representations spanning the entire dataset. LC-MSM achieves an unprecedented UDA performance of 71.8 and 62.8 mIoU on GTA→Cityscapes and SYNTHIA→Cityscapes, respectively, corresponding to improvements of +3.5 and +1.9 percentage points over the baseline method.
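The masked, language-conditioned reconstruction idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the released LC-MSM code: it assumes a simple patch-masking routine, an encoder that returns per-pixel features, and frozen OpenCLIP class text embeddings; the names encoder, class_text_embeds, mask_ratio, patch and tau are placeholders.

    import torch
    import torch.nn.functional as F

    def lc_msm_style_loss(encoder, images, labels, class_text_embeds,
                          mask_ratio=0.5, patch=32, tau=0.07):
        # images: (B, 3, H, W) source images; labels: (B, H, W) semantic labels;
        # class_text_embeds: (C, D) frozen OpenCLIP text embeddings of the classes.
        B, _, H, W = images.shape

        # Randomly mask square patches of the input image.
        gh, gw = H // patch, W // patch
        keep = (torch.rand(B, 1, gh, gw, device=images.device) > mask_ratio).float()
        mask = F.interpolate(keep, size=(H, W), mode="nearest")
        masked_images = images * mask

        # Per-pixel features of the masked image; the encoder must use the
        # remaining spatial context to reason about the hidden regions.
        feats = F.normalize(encoder(masked_images), dim=1)         # (B, D, h, w)
        txt = F.normalize(class_text_embeds, dim=1)                # (C, D)

        # Correlate the generalized text embeddings with the per-pixel features
        # to obtain language-conditioned class logits.
        logits = torch.einsum("bdhw,cd->bchw", feats, txt) / tau
        logits = F.interpolate(logits, size=(H, W), mode="bilinear",
                               align_corners=False)

        # Reconstruct the semantic labels of the full image from the masked view.
        return F.cross_entropy(logits, labels, ignore_index=255)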
Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information underlies their remarkable ability to generate images that closely match the text. Viewed from another perspective, these generative models encode clues about the precise correlation between words and pixels. This work proposes a simple but effective method that exploits the attention mechanism in the denoising network of text-to-image diffusion models. Without any additional training or inference-time optimization, the semantic grounding of phrases can be obtained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, where it outperforms prior methods. In addition, the acquired word-pixel correlation generalizes to the learned text embeddings of customized generation methods with only a few modifications. To validate our discovery, we introduce a new practical task called "personalized referring image segmentation" with a new dataset. Experiments in various situations demonstrate the advantages of our method over strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
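A rough sketch of how cross-attention maps from a denoising network can be turned into a phrase mask is shown below. It is an assumption-laden illustration, not the paper's implementation: it presumes the attention maps have already been collected (e.g. as tensors of shape (heads, H*W, num_tokens) from several layers and timesteps), and the hooking of a specific diffusion model is library-specific and omitted; ground_phrase, phrase_token_ids and out_hw are hypothetical names.

    import torch
    import torch.nn.functional as F

    def ground_phrase(cross_attn_maps, phrase_token_ids, out_hw, threshold=0.5):
        # cross_attn_maps: list of (heads, H*W, num_tokens) tensors collected
        #   from several layers/timesteps of the denoising network.
        # phrase_token_ids: indices of the tokens that make up the query phrase.
        acc = None
        for attn in cross_attn_maps:
            heads, hw, _ = attn.shape
            side = int(hw ** 0.5)                 # assumes a square spatial grid
            # Average over heads and over the tokens belonging to the phrase.
            m = attn.mean(dim=0)[:, phrase_token_ids].mean(dim=-1)   # (H*W,)
            m = m.reshape(1, 1, side, side)
            # Upsample to a common resolution before aggregating across maps.
            m = F.interpolate(m, size=out_hw, mode="bilinear", align_corners=False)
            acc = m if acc is None else acc + m
        acc = acc / len(cross_attn_maps)

        # Normalize to [0, 1] and threshold to obtain a binary mask for the phrase.
        acc = (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)
        return acc.squeeze(0).squeeze(0) > threshold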