ISBN (print): 9798350301298
Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, Semantic Image Editing by Disentangling Object and Background (SIEDOB), whose core idea is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects. Then, we feed them into dedicated generators. Finally, all synthesized parts are embedded in their original locations, and a fusion network is used to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose several innovative designs, including a Semantic-Aware Self-Propagation Module, a Boundary-Anchored Patch Discriminator, and a Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on the Cityscapes and ADE20K-Room datasets and show that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds. Code is available at https://***/WuyangLuo/SIEDOB.
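A minimal sketch of the disentangle-then-fuse pipeline the abstract describes. The module names (background generator, object generator, fusion network) and the instance-mask interface below are illustrative assumptions, not the paper's actual API:

```python
# Sketch of the SIEDOB-style idea: generate background and objects with
# dedicated subnetworks, paste objects back, then fuse. All submodule
# signatures are hypothetical placeholders for illustration only.
import torch
import torch.nn as nn

class DisentangledEditSketch(nn.Module):
    def __init__(self, bg_gen: nn.Module, obj_gen: nn.Module, fusion: nn.Module):
        super().__init__()
        self.bg_gen = bg_gen    # dedicated background generator
        self.obj_gen = obj_gen  # dedicated instance-level object generator
        self.fusion = fusion    # fusion network harmonizing the composed image

    def forward(self, image, seg_map, instance_masks):
        # image: [B, 3, H, W]; seg_map: [B, C, H, W]; instance_masks: [B, N, H, W]
        # 1) Disassemble: background = everything outside object instances.
        obj_union = instance_masks.sum(dim=1, keepdim=True).clamp(max=1.0)
        bg_mask = 1.0 - obj_union

        # 2) Generate background and each object with dedicated subnetworks.
        canvas = self.bg_gen(image * bg_mask, seg_map) * bg_mask
        for i in range(instance_masks.shape[1]):
            m = instance_masks[:, i:i + 1]
            obj = self.obj_gen(image * m, seg_map)
            # 3) Embed each synthesized object back at its original location.
            canvas = canvas * (1.0 - m) + obj * m

        # 4) Fuse into a harmonized result.
        return self.fusion(torch.cat([canvas, seg_map], dim=1))
```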
This research project explores a paradigm shift in perceptual enhancement by integrating a Unified Recognition Framework and Vision-Language Pre-Training in three-dimensional image reconstruction. Through the synergy ...
ISBN (print): 9789819752119; 9789819752126
Palmprint biometric identification in contactless scenarios is a fast-expanding topic that uses techniques from computer vision and machine learning to identify and authenticate people. In this study, we utilized a handcrafted video dataset with 60 distinct classes, each labelled as either a left or right hand, to investigate palmprint detection and matching tasks. The dataset exhibits variations in palmprint patterns, such as distance from the sensor, orientation, finger positioning, and deformation, making it an ideal candidate for developing robust and accurate palmprint recognition models. The main goal of the study is to detect palmprints in the video collection and match them to the correct class or pattern. To accomplish this task, different machine learning (ML) and deep learning (DL) models were trained and evaluated, and the accuracy of each model was compared to find the best method for contactless palmprint identification. In conclusion, our study adds to the expanding body of knowledge on biometric palmprint identification and introduces a new handcrafted video dataset that can be used to compare the effectiveness of various models.
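A minimal sketch of the kind of model-comparison protocol described above: sample frames from each palmprint video, turn them into simple feature vectors, and compare classical classifiers by accuracy. The dataset layout, the frame-sampling step, and the pixel-based features are assumptions for illustration:

```python
# Hypothetical evaluation loop over a list of (video_path, class_id) pairs
# for the 60-class palmprint dataset; feature choice is illustrative only.
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def frames_to_features(video_path, size=(64, 64), step=10):
    """Sample every `step`-th frame and flatten it into a feature vector."""
    cap = cv2.VideoCapture(video_path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            feats.append(cv2.resize(gray, size).flatten() / 255.0)
        idx += 1
    cap.release()
    return feats

def evaluate(videos):
    X, y = [], []
    for path, label in videos:
        for f in frames_to_features(path):
            X.append(f)
            y.append(label)
    X_tr, X_te, y_tr, y_te = train_test_split(
        np.array(X), np.array(y), test_size=0.3, stratify=y)
    for name, clf in [("SVM", SVC()), ("RandomForest", RandomForestClassifier())]:
        clf.fit(X_tr, y_tr)
        print(name, accuracy_score(y_te, clf.predict(X_te)))
```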
ISBN (print): 9798350301298
Masked image modeling (MIM) performs strongly in pre-training large Vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, if at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small Vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://***/OliverRensu/TinyMIM.
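A minimal sketch of what "distilling token relations" (as opposed to CLS-token or raw-feature distillation) can look like: match the softmax-normalized token-to-token similarity maps of student and teacher. Pulling the teacher target from an intermediate layer and the temperature value are assumptions for illustration, not the paper's exact loss:

```python
# Token-relation distillation sketch: the student mimics the teacher's
# per-token relation distributions rather than its CLS token or features.
import torch
import torch.nn.functional as F

def token_relation_loss(student_tokens, teacher_tokens, tau=1.0):
    """student_tokens, teacher_tokens: [batch, num_tokens, dim]."""
    def relations(x):
        x = F.normalize(x, dim=-1)
        # Pairwise token similarities, turned into a distribution per token.
        return F.softmax(x @ x.transpose(-2, -1) / tau, dim=-1)

    r_student = relations(student_tokens)
    r_teacher = relations(teacher_tokens).detach()  # teacher is frozen
    # KL divergence between teacher and student relation distributions.
    return F.kl_div(r_student.clamp_min(1e-8).log(), r_teacher,
                    reduction="batchmean")
```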
Image processing is a fundamental technique in the field of low-level vision. However, with the development of deep learning over the past five years, most low-level vision methods have tended to ignore this technique. ...
Modern image captioning systems rely heavily on extracting knowledge from images to capture the concept of a static story. In this paper, we propose a textual visual context dataset for captioning, in which the publi...
ISBN (print): 9798350301298
Masked image modeling (MIM) as pre-training has been shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. That may be the reason why MIM helps Vision Transformers, which have a very large receptive field, to optimize. Using MIM, the model can maintain a large diversity across attention heads in all layers, but for supervised models the diversity across attention heads almost disappears in the last three layers, and less diversity harms fine-tuning performance. From the experiments, we find that MIM models can perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics and on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L could achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). For semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction. Code will be available at https://***/zdaxie/MIM-DarkSecrets.
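A minimal sketch of the kind of diagnostic behind the locality claim above: the average attended distance per head, computed from a layer's attention map over the patch grid. The input shapes and the grid-unit distance are assumptions for illustration:

```python
# Locality probe sketch: expected attention distance per head, a small value
# indicating local attention and a large value indicating global attention.
import torch

def mean_attention_distance(attn, grid_size):
    """attn: [batch, heads, tokens, tokens] over a grid_size x grid_size patch
    grid (any CLS token removed). Returns one locality score per head,
    measured in patch units."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"),
        dim=-1).reshape(-1, 2).float()          # [tokens, 2] patch coordinates
    dist = torch.cdist(coords, coords)          # [tokens, tokens] distances
    # Expected distance under each query's attention distribution, averaged
    # over queries and batch -> [heads].
    return (attn * dist).sum(-1).mean(dim=(0, 2))
```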
ISBN (print): 9798350301298
DreamFusion [31] has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF) [23], achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image-space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate optimization with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.
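A schematic sketch of the two-stage optimization described above. Everything passed in (the coarse scene model, the mesh model, the low- and high-resolution score-distillation losses, the renderers, and the initialization helper) is a hypothetical placeholder; only the control flow is shown:

```python
# Coarse-to-fine text-to-3D sketch: stage 1 optimizes a coarse scene under a
# low-res diffusion prior; stage 2 refines a textured mesh under a high-res
# latent diffusion prior via a differentiable renderer. Placeholders only.
import torch

def two_stage_text_to_3d(coarse_model, mesh_model,
                         sds_loss_lowres, sds_loss_highres,
                         render_coarse, render_mesh,
                         steps_coarse=5000, steps_fine=3000, lr=1e-2):
    # Stage 1: coarse scene (e.g. hash-grid-backed) under the low-res prior.
    opt = torch.optim.Adam(coarse_model.parameters(), lr=lr)
    for _ in range(steps_coarse):
        image = render_coarse(coarse_model)   # low-resolution render
        loss = sds_loss_lowres(image)         # score-distillation-style loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: textured mesh initialized from the coarse result, optimized
    # against the high-resolution prior through a differentiable renderer.
    mesh_model.initialize_from(coarse_model)  # assumed helper
    opt = torch.optim.Adam(mesh_model.parameters(), lr=lr)
    for _ in range(steps_fine):
        image = render_mesh(mesh_model)       # high-resolution render
        loss = sds_loss_highres(image)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mesh_model
```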
ISBN (print): 9789819784899; 9789819784905
Precipitation is crucial for the future development of mankind. However, accurately predicting it remains a formidable challenge. Due to the low efficiency of traditional Numerical Weather Prediction (NWP), deep learning-based methods are increasingly preferred. However, most deep learning methods focus on predicting the spatio-temporal behavior of the single precipitation variable, often ignoring the interplay between other meteorological factors and precipitation. Furthermore, they tend to underestimate precipitation intensity. Therefore, this paper proposes a new neural network model called the Spatio-temporal Perceiving Network Based Vision Transformer (ST-ViT), which integrates spatio-temporal and channel perception mechanisms to model the relationship between precipitation and other meteorological elements. Additionally, an adaptive differential loss function is proposed to accurately capture precipitation intensity. We evaluated ST-ViT on ERA5 data from Southeast Asia for 6-hour prediction. The quantitative results demonstrate that our method achieves superior accuracy and lower errors compared to other deep learning methods. In particular, it shows great potential to alleviate the underestimation of precipitation in the reconstructed prediction images.
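A minimal sketch of an intensity-weighted loss in the spirit of the adaptive differential loss mentioned above: pixels with heavier rainfall receive larger weights, so underestimating them is penalized more. The thresholds and weight values below are illustrative assumptions, not the paper's formulation:

```python
# Intensity-weighted regression loss sketch for precipitation fields.
import torch

def intensity_weighted_loss(pred, target,
                            thresholds=(2.0, 5.0, 10.0, 30.0),
                            weights=(1.0, 2.0, 5.0, 10.0, 30.0)):
    """pred, target: precipitation fields of identical shape (e.g. mm / 6 h)."""
    w = torch.full_like(target, weights[0])
    # Assign a larger weight to pixels exceeding each rainfall threshold.
    for t, wt in zip(thresholds, weights[1:]):
        w = torch.where(target >= t, torch.full_like(target, wt), w)
    return (w * (pred - target) ** 2).mean()
```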
ISBN (print): 9789819984688; 9789819984695
Video face recognition (VFR) has gained significant attention as a promising field combining computer vision and artificial intelligence, revolutionizing identity authentication and verification. Unlike traditional image-based methods, VFR leverages the temporal dimension of video footage to extract comprehensive and accurate facial information. However, VFR heavily relies on robust computing power and advanced noise processing capabilities to ensure optimal recognition performance. This paper introduces a novel length-adaptive VFR framework based on a recurrent-mechanism-driven Vision Transformer, termed TempoViT. TempoViT efficiently captures spatial and temporal information from face videos, enabling accurate and reliable face recognition while mitigating the high GPU memory requirements associated with video processing. By reusing hidden states from previous frames, the framework establishes recurrent links between frames, allowing the modeling of long-term dependencies. Experimental results validate the effectiveness of TempoViT, demonstrating its state-of-the-art performance in video face recognition tasks on benchmark datasets including iQIYI-ViD, YTF, IJB-C, and Honda/UCSD.
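A minimal sketch of the recurrence idea described above: hidden states from the previous frame are cached without gradients and prepended as extra keys/values when encoding the current frame, linking frames into long-term dependencies with bounded memory. The layer sizes and single-block structure are assumptions for illustration, not TempoViT's actual architecture:

```python
# Recurrent frame-to-frame Transformer block sketch (Transformer-XL-style
# hidden-state reuse across video frames).
import torch
import torch.nn as nn

class RecurrentFrameBlock(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens, memory=None):
        # tokens: [batch, num_patches, dim] for the current frame;
        # memory: cached tokens from the previous frame (or None).
        context = tokens if memory is None else torch.cat([memory, tokens], dim=1)
        q, kv = self.norm1(tokens), self.norm1(context)
        tokens = tokens + self.attn(q, kv, kv, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Detach the cache so gradients do not flow across frames.
        return tokens, tokens.detach()
```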