ISBN:
(Print) 1577358872
Semi-supervised learning (SSL) is a powerful tool for addressing the challenge of insufficient annotated data in medical segmentation problems. However, existing semi-supervised methods mainly rely on internal knowledge for pseudo labeling, which is biased due to the distribution mismatch between the highly imbalanced labeled and unlabeled data. Segmenting the left atrial appendage (LAA) from transesophageal echocardiogram (TEE) images is a typical medical image segmentation task characterized by scarce professional annotations and diverse data distributions, for which existing SSL models cannot achieve satisfactory performance. In this paper, we propose a novel strategy to mitigate the inherent challenge of distribution mismatch in SSL by, for the first time, incorporating a large foundation model (i.e., SAM in our implementation) into an SSL model to improve the quality of pseudo labels. We further propose a new self-reconstruction mechanism that generates both noise-resilient prompts, which demonstrably improve SAM's generalization capability over TEE images, and self-perturbations, which stabilize the training process and reduce the impact of noisy labels. We conduct extensive experiments on an in-house TEE dataset; the results demonstrate that our method achieves better performance than state-of-the-art SSL models.
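To make the pseudo-label refinement concrete, here is a minimal sketch of one plausible rule for fusing an SSL teacher's prediction with a SAM-style foundation-model mask. The fusion rule and the function name are assumptions; the abstract does not specify how the two sources are combined.

```python
import torch

def refine_pseudo_label(teacher_logits, sam_mask, agree_thresh=0.5):
    """Hypothetical fusion of an SSL teacher's soft prediction with a
    SAM-style binary mask: keep pixels where the two sources agree,
    and mark the rest as ignored so they contribute no gradient."""
    teacher_prob = torch.sigmoid(teacher_logits)          # (H, W) soft foreground prob.
    teacher_mask = (teacher_prob > agree_thresh).float()  # binarized teacher prediction
    agree = teacher_mask == sam_mask                      # pixel-wise agreement map
    pseudo = torch.where(agree, teacher_mask, torch.full_like(teacher_mask, -1.0))
    return pseudo  # -1 marks pixels excluded from the unsupervised loss
```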
ISBN:
(Print) 9798350301298
We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression. We represent the different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them. To effectively disentangle each motion factor, we propose a progressive disentangled representation learning strategy that separates the factors in a coarse-to-fine manner: we first extract a unified motion feature from the driving signal, and then isolate each fine-grained motion from the unified feature. We leverage motion-specific contrastive learning and regression for the non-emotional motions, and introduce feature-level decorrelation and self-reconstruction for emotional expression, fully exploiting the inherent properties of each motion factor in unstructured video data to achieve disentanglement. Experiments show that our method provides high-quality speech and lip-motion synchronization along with precise and disentangled control over multiple extra facial motions, which can hardly be achieved by previous methods.
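The feature-level decorrelation used for the expression branch can be illustrated with a generic cross-covariance penalty. The exact loss in the paper is not given in the abstract, so the formulation below is an assumption.

```python
import torch

def decorrelation_loss(expr_feat, motion_feat, eps=1e-6):
    """Generic decorrelation term: penalize cross-correlation between
    expression features and other motion features, both (batch, dim)."""
    e = expr_feat - expr_feat.mean(dim=0, keepdim=True)
    m = motion_feat - motion_feat.mean(dim=0, keepdim=True)
    e = e / (e.std(dim=0, keepdim=True) + eps)   # standardize per dimension
    m = m / (m.std(dim=0, keepdim=True) + eps)
    cross_cov = (e.T @ m) / e.shape[0]           # (dim_e, dim_m) cross-covariance
    return cross_cov.pow(2).mean()               # drive all cross terms toward zero
```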
ISBN:
(Print) 9798350301298
The core of out-of-distribution (OOD) detection is to learn an in-distribution (ID) representation that is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find, surprisingly, that simply using reconstruction-based methods can boost the performance of OOD detection significantly. We investigate the main contributors to OOD detection performance and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which helps the model learn the intrinsic data distribution of the ID dataset. Specifically, we take Masked Image Modeling as the pretext task for our OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms the previous state of the art on one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even beats 10-shot-per-class outlier-exposure OOD detection, although we do not include any OOD samples in our method. Code is available at https://***/lijingyao20010602/MOOD.
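As an illustration of how features from an MIM-pretrained encoder might be scored, the sketch below uses a standard Mahalanobis-distance detector. The abstract does not spell out MOOD's scoring head, so this is a generic recipe rather than the paper's exact method.

```python
import torch

def mahalanobis_ood_score(test_feats, id_feats):
    """Score OOD-ness by Mahalanobis distance to the ID feature distribution.
    Both inputs are (N, dim) feature matrices from the pretrained encoder."""
    mu = id_feats.mean(dim=0)                              # ID feature mean
    centered = id_feats - mu
    cov = centered.T @ centered / (id_feats.shape[0] - 1)  # ID covariance
    cov_inv = torch.linalg.pinv(cov)                       # pseudo-inverse for stability
    diff = test_feats - mu
    return (diff @ cov_inv * diff).sum(dim=1)              # larger = more OOD
```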
ISBN:
(Print) 9798350364200; 9798350364194
Ultrasound, as a safe, cost-effective, real-time imaging diagnostic tool, has been widely utilized in medical examinations. In recent years, research has focused on utilizing reinforcement learning (RL) and 3D vascular reconstruction methods to achieve robot-assisted ultrasound scanning. In robot-assisted scanning, predicting the ultrasound images and features corresponding to the robot's next action helps the agent make better action decisions and achieve scanning goals more efficiently. To this end, we propose an RL framework based on the Advantage Actor-Critic (A2C) algorithm to predict ultrasound images, and incorporate an LSTM module to leverage temporal information from adjacent time points. To validate the algorithm's effectiveness, we constructed virtual and real environments to collect scanning data for agent training. In ultrasound vascular scanning, the focus is often on how the vessel's position and shape in the ultrasound image change as the probe's position changes. To extract this information from ultrasound images, we employ an ellipse-fitting method for feature extraction and train a U-Net in a real environment for vessel segmentation in ultrasound images. By collecting vascular ultrasound scanning data and feeding it into the RL agent network for training, we can predict the ultrasound image information corresponding to the probe's position at the next time point, given the probe's positions and ultrasound images from the previous N time points.
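The ellipse-fitting feature extraction step can be sketched with OpenCV, assuming the vessel mask comes from the trained segmentation network; `vessel_ellipse_features` is a hypothetical helper name.

```python
import cv2
import numpy as np

def vessel_ellipse_features(mask):
    """Extract (cx, cy, major, minor, angle) of the largest vessel region.
    `mask` is a binary uint8 segmentation map, e.g. from a U-Net;
    cv2.fitEllipse needs at least 5 contour points, so tiny regions are rejected."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    if len(largest) < 5:
        return None
    (cx, cy), (major, minor), angle = cv2.fitEllipse(largest)
    return np.array([cx, cy, major, minor, angle], dtype=np.float32)
```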
ISBN:
(纸本)9798350318920;9798350318937
Obtaining accurate 3D object poses is vital for numerous computer vision applications, such as 3D reconstruction and scene understanding. However, annotating real-world objects is time-consuming and challenging. While synthetically generated training data is a viable alternative, the domain shift between real and synthetic data remains a significant challenge. In this work, we aim to narrow the performance gap between models trained on synthetic data and fully supervised models trained on large amounts of real data. We approach the problem from two perspectives: 1) We introduce P3D-Diffusion, a new synthetic dataset with accurate 3D annotations generated by a graphics-guided diffusion model. 2) We propose Cross-domain 3D Consistency (CC3D) for unsupervised domain adaptation of neural mesh models. In particular, we exploit the spatial relationships between features on the mesh surface and a contrastive learning scheme to guide the domain adaptation process. Combined, these two approaches enable our models to perform competitively with state-of-the-art models using only 10% of the respective real training images, and to outperform the SOTA model by a wide margin using only 50% of the real training data. By encouraging the diversity of the synthetic data and generating the images in an OOD-aware manner, our model further demonstrates robust generalization to out-of-distribution scenarios despite being trained with minimal real data. The code is available at https://***/YangYY06/synthetic_3d.
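The contrastive scheme over mesh-surface features can be illustrated with a standard InfoNCE loss between corresponding vertex features from the synthetic and real domains. The choice of positives and negatives here is an assumption, as CC3D's exact scheme is not detailed in the abstract.

```python
import torch
import torch.nn.functional as F

def vertex_contrastive_loss(feats_syn, feats_real, temperature=0.07):
    """InfoNCE over mesh-vertex features: the i-th synthetic-domain vertex
    feature should match the i-th real-domain one and repel all others.
    Both inputs are (V, D) per-vertex feature matrices."""
    a = F.normalize(feats_syn, dim=1)            # unit-norm features
    b = F.normalize(feats_real, dim=1)
    logits = a @ b.T / temperature               # (V, V) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)      # diagonal entries are positives
```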
For NeRF (Neural Radiance Fields), synthesizing new views from sparse inputs poses a challenge, as too few inputs can lead to artifacts in the rendered views. Recent methods have tackled this issue by introducing extern...
In this paper, we propose a DNN-based solution to jointly remosaic and denoise camera raw data in the Quad Bayer pattern. The traditional remosaic problem can be viewed as an interpolation process that converts the Q...
Although deep-network-based methods outperform traditional 3D reconstruction methods, which require multiocular images or class labels to recover the full 3D geometry, they may produce incomplete recovery and unfaithful reconstruction of occluded parts of 3D objects. To address these issues, we propose the Depth-preserving Latent Generative Adversarial Network (DLGAN), which consists of a 3D Encoder-Decoder-based GAN (EDGAN, serving as generator and discriminator) and an Extreme Learning Machine (ELM, serving as a classifier), for 3D reconstruction from a monocular depth image of an object. First, EDGAN decodes a latent vector from the 2.5D voxel grid representation of the input image and generates an initial 3D occupancy grid under the usual GAN losses together with a latent vector loss and a depth loss. For the latent vector loss, we design a 3D deep AutoEncoder (AE) to learn a target latent vector from the ground-truth 3D voxel grid and use that vector to penalize the latent vector encoded from the input 2.5D data. For the depth loss, we use the input 2.5D data to penalize the initial 3D voxel grid from 2.5D views. Afterwards, the ELM converts the float values of the initial 3D voxel grid to binary values under a binary reconstruction loss. Experimental results show that DLGAN not only outperforms several state-of-the-art methods by a large margin on both a synthetic dataset and a real-world dataset, but also predicts occluded parts of 3D objects more accurately without class labels.
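The two auxiliary losses can be sketched as follows. The depth-axis max projection used for the depth loss is an assumption about how the 2.5D view is obtained from the predicted grid; the abstract only states that the input 2.5D data penalizes the grid from 2.5D views.

```python
import torch
import torch.nn.functional as F

def dlgan_aux_losses(pred_grid, input_25d, z_enc, z_target):
    """Sketch of DLGAN's two auxiliary terms as described in the abstract.
    pred_grid: (B, X, Y, Z) occupancy probabilities; input_25d: (B, X, Y)
    occupancy in [0, 1]; z_enc / z_target: encoder and AE target latents."""
    # Latent vector loss: pull the encoder's latent toward the AE's target latent.
    latent_loss = F.mse_loss(z_enc, z_target)
    # Depth loss: max over the depth axis approximates the visible-surface occupancy.
    projected = pred_grid.max(dim=-1).values
    depth_loss = F.binary_cross_entropy(projected.clamp(1e-6, 1 - 1e-6), input_25d)
    return latent_loss, depth_loss
```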
Author:
Pashaie, Ramin, Florida Atlantic University
Electrical and Computer Engineering Department, 777 Glades Rd., Engineering East Building, Room 325, Boca Raton, FL 33432, United States
Tomography is the process of reconstructing three-dimensional images from two-dimensional projections. In general, this process has two separate phases: data acquisition and image reconstruction. This article concentr...
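As a minimal illustration of these two phases, the sketch below simulates acquisition with the Radon transform and reconstructs via filtered back-projection, assuming scikit-image is available; the phantom and angle grid are arbitrary choices, not from the article.

```python
import numpy as np
from skimage.transform import radon, iradon

# Phase 1, acquisition: project a simple synthetic phantom at many angles.
image = np.zeros((128, 128))
image[48:80, 40:88] = 1.0                      # rectangular phantom
theta = np.linspace(0.0, 180.0, 180, endpoint=False)
sinogram = radon(image, theta=theta)           # stack of 1D projections

# Phase 2, reconstruction: filtered back-projection recovers the slice.
recon = iradon(sinogram, theta=theta, filter_name="ramp")
```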
ISBN:
(Print) 9781665405409
Hyperspectral image (HSI) reconstruction is about recovering a 3D HSI from its 2D snapshot measurements, for which deep models have become a promising approach. However, most existing studies train deep models on large amounts of organized data, which can be difficult to collect in many applications. This paper leverages the image priors encoded in untrained neural networks (NNs) to obtain a self-supervised learning method that is free from training datasets while adapting to the statistics of each test sample. To induce better image priors and prevent the NN from overfitting to undesired solutions, we construct an unrolling-based NN equipped with fractional max pooling (FMP). Furthermore, the FMP is used with randomness to enable self-ensembling for improved reconstruction accuracy. In the experiments, our self-supervised learning approach delivers high-quality reconstructions and outperforms recent methods, including supervised ones.
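A sketch of how randomized fractional max pooling can drive a self-ensemble in PyTorch: `nn.FractionalMaxPool2d` resamples its pooling regions on every forward pass, so repeated passes yield different reconstructions that can be averaged. The block structure and the averaging rule are assumptions about the paper's design, not its exact architecture.

```python
import torch
import torch.nn as nn

class FMPBlock(nn.Module):
    """Conv block with fractional max pooling; its pooling regions are
    redrawn at random on each forward pass, which is the source of the
    stochasticity exploited for self-ensembling."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.fmp = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=0.75)

    def forward(self, x):
        return self.fmp(torch.relu(self.conv(x)))

def self_ensemble(model, measurement, n_passes=8):
    """Average reconstructions over several stochastic forward passes
    (a hypothetical realization of the randomized-FMP self-ensemble)."""
    with torch.no_grad():
        outs = [model(measurement) for _ in range(n_passes)]
    return torch.stack(outs).mean(dim=0)
```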