In this paper, we explore reconstructing high-quality clothed 3D humans from a single RGB-D image, assuming that virtual humans can be represented by front-view and back-view depths. Due to the scarcity of captured real RGB-D human images, we employ rendered images to train our method. However, rendered images lack backgrounds and exhibit significant depth variation at silhouettes, which leads to inaccurate and noisy shape predictions. To mitigate this issue, we introduce a pseudo-multi-task framework, which incorporates a Conditional Generative Adversarial Network (CGAN) to infer back-view RGB-D images and a self-supervised masked autoencoder (MAE) to capture latent structural information of the human body. Additionally, we propose a Multi-scale Feature Fusion (MFF) module to effectively merge structural information and conditional features at various scales. Our method surpasses many existing techniques, as demonstrated through evaluations on the Thuman, RenderPeople, and BUFF datasets. Notably, our approach excels at reconstructing high-quality human models, even under challenging conditions such as complex poses and loose clothing, on both rendered and real-world images. Code is available at https://***/Archaic-Atom/MaskRecon.
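To make the multi-scale fusion idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: feature maps from the structural (MAE) branch and the conditional (CGAN) branch are concatenated at each scale, projected by 1x1 convolutions, and summed at a common resolution. The channel counts, number of scales, and fusion rule are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Hedged sketch of a multi-scale feature fusion block (not the paper's MFF code)."""
    def __init__(self, channels=(64, 128, 256), out_channels=64):
        super().__init__()
        # one 1x1 projection per scale, applied to concatenated branch features
        self.proj = nn.ModuleList(
            [nn.Conv2d(2 * c, out_channels, kernel_size=1) for c in channels])

    def forward(self, structural_feats, conditional_feats):
        # both arguments are lists of (B, C_i, H_i, W_i) maps, finest scale first
        target_size = structural_feats[0].shape[-2:]
        fused = 0
        for proj, s, c in zip(self.proj, structural_feats, conditional_feats):
            x = proj(torch.cat([s, c], dim=1))                 # merge the two branches
            fused = fused + F.interpolate(x, size=target_size,
                                          mode="bilinear", align_corners=False)
        return fused
```

A call such as `MultiScaleFusion()([f1, f2, f3], [g1, g2, g3])` would return a single fused map at the finest resolution, which a depth-prediction head could then consume.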
Low-dose computed tomography (LDCT) offers reduced X-ray radiation exposure but at the cost of compromised image quality, characterized by increased noise and artifacts. Recently, transformer models have emerged as a promising avenue for enhancing LDCT image quality. However, the success of such models relies on a large amount of paired noisy and clean images, which are often scarce in clinical settings. In computer vision and natural language processing, masked autoencoders (MAE) have been recognized as a powerful self-pretraining method for transformers, due to their exceptional capability to extract representative features. However, the original pretraining and fine-tuning design fails to work in low-level vision tasks like denoising. In response to this challenge, we redesign the classical encoder-decoder learning model and facilitate a simple yet effective streamlined low-level vision MAE, referred to as LoMAE, tailored to the LDCT denoising problem. Moreover, we introduce an MAE-GradCAM method to shed light on the latent learning mechanisms of the MAE/LoMAE. Additionally, we explore LoMAE's robustness and generalizability across a variety of noise levels. Experimental findings show that the proposed LoMAE enhances the denoising capabilities of the transformer and substantially reduces its dependency on high-quality ground-truth data. It also demonstrates remarkable robustness and generalizability over a spectrum of noise levels. In summary, the proposed LoMAE provides promising solutions to the major issues in LDCT denoising, including interpretability, ground-truth data dependency, and model robustness/generalizability.
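A hedged sketch of the two-stage recipe described above: self-pretraining reconstructs masked LDCT slices without any clean references, and fine-tuning reuses the same encoder-decoder on whatever paired noisy/clean slices are available. `model`, `mask_patches`, and the training setup are hypothetical placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, mask_patches, ldct, optimizer, mask_ratio=0.75):
    """Self-pretraining: reconstruct the original LDCT slice from a masked version of itself."""
    masked = mask_patches(ldct, mask_ratio)      # zero out random patches (placeholder fn)
    loss = F.mse_loss(model(masked), ldct)       # target is the noisy slice itself, no clean data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(model, ldct, ndct, optimizer):
    """Fine-tuning: map a low-dose slice to its paired normal-dose counterpart."""
    loss = F.mse_loss(model(ldct), ndct)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```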
ISBN (print): 9798350323726
Audio classification and restoration are among the major downstream tasks in audio signal processing. However, restoration derives less benefit from pretrained models compared with the overwhelming success of pretrained models in classification tasks. Due to this imbalance, there has been rising interest in how to improve the performance of pretrained models on restoration tasks, e.g., speech enhancement (SE). Previous works have shown that features extracted by pretrained audio encoders are effective for SE tasks, but these speech-specialized encoder-only models usually require extra decoders to become compatible with SE and involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, the audio masked autoencoder (MAE), whose backbone is the autoencoder of Vision Transformers (ViT-AE), is extended from audio classification to SE, a representative restoration task with well-established evaluation standards. ViT-AE learns to restore masked audio signals via a mel-to-mel mapping during pretraining, which is similar to restoration tasks like SE. We propose variations of ViT-AE for better SE performance, where the mel-to-mel variations yield high scores on non-intrusive metrics and the STFT-oriented variation is effective on intrusive metrics such as PESQ. Different variations can be used in accordance with the scenario. Comprehensive evaluations reveal that MAE pretraining is beneficial to SE tasks and helps the ViT-AE generalize better to out-of-domain distortions. We further find that large-scale noisy data of general audio sources, rather than clean speech, is sufficiently effective for pretraining.
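To illustrate the mel-to-mel formulation, here is a small sketch under assumed front-end parameters; `vit_ae` stands in for the pretrained autoencoder and is not part of the original codebase.

```python
import torch
import torchaudio

# 16 kHz log-mel front end; window/hop sizes and mel-band count are illustrative assumptions.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)

def enhance_logmel(noisy_wave, vit_ae):
    """Map a noisy waveform's log-mel spectrogram to an enhanced log-mel estimate."""
    noisy_mel = torch.log(to_mel(noisy_wave) + 1e-6)   # (batch, n_mels, frames)
    return vit_ae(noisy_mel)                           # mel-to-mel enhancement
```

A vocoder or an STFT-oriented variant (as the abstract notes) would then be needed to return from the enhanced mel representation to a waveform.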
Audio-visual representations leverage information from both modalities to produce joint representations. Such representations have demonstrated their usefulness in a variety of tasks. However, both modalities incorporated in the learned model might not necessarily be present at all times during inference. In this work, we study whether and how we can make existing models, trained under pristine conditions, robust to partial modality loss without retraining them. We propose to use a curriculum-trained masked autoencoder to impute the features of missing input segments. We show that fine-tuning the classification heads with the imputed features makes the base models robust on multiple downstream tasks such as emotion recognition and Lombard speech recognition. Among the 12 cases evaluated, our method outperforms strong baselines in 10 instances.
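A minimal sketch of the inference-time use described above, under the assumption that segment-level joint features, a trained MAE imputer, and a fine-tuned classification head are available; all names and tensor shapes are illustrative.

```python
import torch

@torch.no_grad()
def classify_with_imputation(mae, head, features, missing_mask):
    """features: (B, T, D) segment features; missing_mask: (B, T) bool, True where a segment is lost."""
    imputed = mae(features, missing_mask)                    # MAE predicts features for missing segments
    features = torch.where(missing_mask.unsqueeze(-1), imputed, features)
    return head(features.mean(dim=1))                        # pooled features -> class logits
```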
ISBN (print): 9798350390155; 9798350390162
As an efficient self-supervised pre-training approach, the masked autoencoder (MAE) has shown promising improvements across various 3D point cloud understanding tasks. However, the pretext task of existing point-based MAEs is to reconstruct the geometry of masked points only, so they learn features at lower semantic levels that are not appropriate for high-level downstream tasks. To address this challenge, we propose a novel self-supervised approach named Locate while Reconstructing with masked autoencoders (LR-MAE). Specifically, a multi-head decoder is designed to localize the global positions of masked patches while simultaneously reconstructing the masked points, aiming to learn better semantic features that align with downstream tasks. Moreover, we design a random query patch detection strategy for 3D object detection tasks in the pre-training stage, which significantly boosts model performance with faster convergence. Extensive experiments show that our LR-MAE achieves superior performance on various point cloud understanding tasks. By fine-tuning on downstream datasets, LR-MAE outperforms the Point-MAE baseline by 3.65% classification accuracy on the ScanObjectNN dataset and significantly exceeds the 3DETR baseline by 6.1% AP50 on the ScanNetV2 dataset. Code is available at https://***/cathy-ji/LR-MAE.
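The sketch below shows one plausible shape of a multi-head decoder in this spirit: one head reconstructs the xyz coordinates of masked points, the other regresses each masked patch's global centre. Dimensions, the localization target, and the loss weighting are assumptions rather than the released LR-MAE code.

```python
import torch
import torch.nn as nn

class DualHeadDecoder(nn.Module):
    """Hedged sketch: reconstruct masked points and locate their patch centres."""
    def __init__(self, dim=384, points_per_patch=32):
        super().__init__()
        self.recon_head = nn.Linear(dim, points_per_patch * 3)   # masked point xyz
        self.locate_head = nn.Linear(dim, 3)                      # global patch centre xyz

    def forward(self, masked_tokens):                             # (B, M, dim) decoded mask tokens
        B, M, _ = masked_tokens.shape
        points = self.recon_head(masked_tokens).reshape(B, M, -1, 3)
        centres = self.locate_head(masked_tokens)                 # (B, M, 3)
        return points, centres
```

The two outputs would be supervised jointly, e.g. a Chamfer loss on the reconstructed points plus a regression loss on the predicted centres.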
ISBN (digital): 9783031200564
ISBN (print): 9783031200557; 9783031200564
We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoder (MAE) with two core designs: 1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information during BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract features as the BERT prediction targets for masked tokens achieves better pretraining performance. Therefore, we add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representation as the BERT prediction target. In the second design, we feed target-specific information (e.g., pixel values of unmasked patches) from the encoder directly to the decoder to reduce the pressure on the encoder to memorize this information. Thus, the encoder focuses on semantic modeling, which is the goal of BERT pretraining, and does not need to waste its capacity memorizing information about unmasked tokens related to the prediction target. Through extensive experiments, our BootMAE achieves 84.2% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming MAE by +0.8% under the same number of pre-training epochs. BootMAE also gains +1.0 mIoU on semantic segmentation on ADE20K and +1.3 box AP and +1.4 mask AP on object detection and instance segmentation on the COCO dataset. Code is released at https://***/LightDXY/BootMAE.
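A minimal sketch of the momentum-encoder ingredient described above: an exponential-moving-average copy of the online encoder provides the feature targets. The decay value and names are assumptions, not the released BootMAE code.

```python
import copy
import torch

@torch.no_grad()
def update_momentum_encoder(online_encoder, momentum_encoder, decay=0.999):
    """EMA update: momentum weights drift slowly toward the online encoder's weights."""
    for p_online, p_momentum in zip(online_encoder.parameters(),
                                    momentum_encoder.parameters()):
        p_momentum.mul_(decay).add_(p_online, alpha=1.0 - decay)

# Typical use in a pretraining loop (sketch):
#   momentum_encoder = copy.deepcopy(online_encoder)   # target branch, no gradients needed
#   ...train the online branch on one batch...
#   update_momentum_encoder(online_encoder, momentum_encoder)
```

Features produced by `momentum_encoder` on the full image would then serve as regression targets for the masked tokens, alongside the usual pixel targets.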
Unsupervised learning methods have become increasingly important in deep learning due to their demonstrated ability to make greater use of datasets and to achieve higher accuracy in computer vision and natural language processing tasks. T...
ISBN (print): 9783031705359; 9783031705366
This paper introduces Saghog, a self-supervised pretraining strategy for writer retrieval that uses HOG features of the binarized input image. Our preprocessing applies the Segment Anything technique to extract handwriting from various datasets, yielding about 24k documents, after which a vision transformer is trained to reconstruct masked patches of the handwriting. Saghog is then fine-tuned by appending NetRVLAD as an encoding layer to the pretrained encoder. Evaluation of our approach on three historical datasets, Historical-WI, HisFrag20, and GRK-Papyri, demonstrates the effectiveness of Saghog for writer retrieval. Additionally, we provide ablation studies on our architecture and evaluate unsupervised and supervised fine-tuning. Notably, on HisFrag20, Saghog outperforms related work with a mAP of 57.2%, an 11.6% margin over the current state of the art, showcasing its robustness on challenging data, and it remains competitive even on small datasets, e.g., GRK-Papyri, where we achieve a Top-1 accuracy of 58.0%.
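As a rough illustration of how per-patch HOG targets could be built from a binarized handwriting image, here is a sketch using scikit-image; the patch size and HOG parameters are assumptions and not necessarily those used by Saghog.

```python
import numpy as np
from skimage.feature import hog

def hog_targets(binary_image, patch=32):
    """Compute one HOG descriptor per non-overlapping patch of a binarized page."""
    H, W = binary_image.shape
    targets = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            p = binary_image[y:y + patch, x:x + patch].astype(np.float32)
            targets.append(hog(p, orientations=9, pixels_per_cell=(8, 8),
                               cells_per_block=(1, 1)))
    return np.stack(targets)                  # (num_patches, hog_dim) regression targets
```

During pretraining, the vision transformer would be asked to predict these descriptors for the masked patches rather than raw pixel values.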
ISBN (print): 9783031164439; 9783031164422
Self-supervised learning methods based on image patch reconstruction have witnessed great success in training auto-encoders, whose pre-trained weights can be transferred to fine-tune other downstream tasks of image understanding. However, existing methods seldom study the varying importance of reconstructed patches or the symmetry of anatomical structures when applied to 3D medical images. In this paper we propose a novel Attentive Symmetric Auto-encoder (ASA) based on the Vision Transformer (ViT) for 3D brain MRI segmentation tasks. We conjecture that forcing the auto-encoder to recover informative image regions can harvest more discriminative representations than recovering smooth image patches. We therefore adopt a gradient-based metric to estimate the importance of each image patch. In the pre-training stage, the proposed auto-encoder pays more attention to reconstructing the informative patches according to the gradient metric. Moreover, we exploit the structural prior of the brain and develop a Symmetric Position Encoding (SPE) method to better capture the correlations between long-range but spatially symmetric regions and obtain effective features. Experimental results show that our proposed attentive symmetric auto-encoder outperforms state-of-the-art self-supervised learning methods and medical image segmentation models on three brain MRI segmentation benchmarks.
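A hedged sketch of a gradient-based patch-importance score in the spirit described above: the mean intensity-gradient magnitude inside each patch serves as its weight. The exact metric in the paper may differ; the patch size and finite-difference scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_importance(images, patch=16):
    """images: (B, 1, H, W). Returns one gradient-magnitude score per non-overlapping patch."""
    gx = F.pad(images[..., :, 1:] - images[..., :, :-1], (0, 1))      # horizontal differences
    gy = F.pad(images[..., 1:, :] - images[..., :-1, :], (0, 0, 0, 1))  # vertical differences
    grad = (gx.pow(2) + gy.pow(2)).sqrt()
    scores = F.avg_pool2d(grad, kernel_size=patch)                    # (B, 1, H/patch, W/patch)
    return scores.flatten(1)                                           # one score per patch
```

Patches with higher scores would then receive larger weights when the masked-reconstruction loss is accumulated, steering the auto-encoder toward informative regions.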
The development of deep learning models in medical image analysis is largely limited by the lack of large-scale, well-annotated datasets. Unsupervised learning does not require labels and is better suited to solving medical image analysis problems. However, most unsupervised learning methods still need to be applied to large datasets. To make unsupervised learning applicable to small datasets, we propose Swin MAE, a masked autoencoder with the Swin Transformer as its backbone. Even on a dataset of only a few thousand medical images, Swin MAE can still learn useful semantic features purely from the images, without using any pre-trained models. In downstream transfer learning, it equals or even slightly outperforms the supervised model obtained by training a Swin Transformer on ImageNet. Compared to MAE, Swin MAE brought roughly two-fold and five-fold improvements on downstream tasks on BTCV and our parotid dataset, respectively. The code is publicly available at https://***/Zian-Xu/Swin-MAE.
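For context, the core masking operation that a Swin-backbone MAE shares with the original MAE can be sketched as below; the masking ratio and shapes are illustrative and do not reflect the paper's exact window-based configuration.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Per-sample random patch masking. tokens: (B, N, D) patch embeddings."""
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)                         # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                            # 1 = masked, 0 = kept
    ids_restore = ids_shuffle.argsort(dim=1)                   # to unshuffle decoder outputs
    return visible, mask, ids_restore
```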