ISBN (print): 9798331314385
Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we investigate whether such priors, derived from a generative process, are suitable for video recognition and, eventually, for joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs best on class-conditioned image-to-video generation, achieving 46.5 and 49.3 FVD scores on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed. Code will be available at https://***/wengzejia1/GenRec.
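As a rough illustration of the random-frame conditioning idea mentioned in the abstract, the sketch below keeps a random subset of frames as the visible condition and masks the rest. This is a minimal PyTorch mock-up under assumptions; the function and tensor names are hypothetical and this is not the released GenRec code.

```python
# A minimal sketch (an assumption, not the released GenRec code) of random-frame
# conditioning: a random subset of frames is kept as the condition and the rest
# are masked out, so the model must both reconstruct missing frames and recognize.
import torch

def random_frame_condition(video: torch.Tensor, keep_prob: float = 0.5):
    """video: (B, T, C, H, W). Returns the masked video and the binary frame mask."""
    b, t = video.shape[:2]
    mask = (torch.rand(b, t, device=video.device) < keep_prob).float()
    # Always keep at least one frame per clip so the condition is never empty.
    mask[mask.sum(dim=1) == 0, 0] = 1.0
    conditioned = video * mask.view(b, t, 1, 1, 1)
    return conditioned, mask

clip = torch.randn(2, 16, 3, 64, 64)       # a dummy 16-frame clip batch
cond, mask = random_frame_condition(clip)
print(mask.sum(dim=1))                      # number of visible frames per clip
```

Varying `keep_prob` (down to a single visible frame) is one way such a training scheme could expose the model to the limited-observation scenarios the abstract refers to.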
Text irregularities pose significant challenges to scene text recognizers. Thin-Plate Spline (TPS)-based rectification is widely regarded as an effective means to deal with them. Currently, the calculation of TPS trans...
Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated. Existing approaches strive to overcome this limitation by reducing less meaningful image regions. However, cu...
The growing interest in generating recipes from food images has drawn substantial research attention in recent years. Existing works for recipe generation primarily utilize a two-stage training method—first predictin...
In this paper, we propose a method to predict the success of primer amplification based on the relationship existing between the sequence of primer and template, which can optimize the primer design and select the pri...
Convolution neural networks (CNNs) and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL). Most current studies on MTL rely solely on CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task dense prediction. This combination offers a simple and efficient solution owing to its powerful and flexible task-specific learning, as well as lower cost, less complexity, and fewer parameters than traditional MTL methods. We introduce the deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture that incorporates convolution and attention mechanisms in a unified network for MTL. It is carefully designed to exploit the advantages of each block and to provide deformable and comprehensive features for all tasks from both local and global perspectives. First, the deformable mixer encoder contains two types of operators: a channel-aware mixing operator that allows communication among different channels, and a spatial-aware deformable operator with deformable convolution that efficiently samples more informative spatial locations. Second, the task-aware gating Transformer decoder performs the task-specific predictions: a task interaction block integrated with self-attention captures task interaction features, and a task query block integrated with gating attention selects the corresponding task-specific features. Further, the experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets (i.e., NYUD-v2, PASCAL-Context, and Cityscapes). For example, by using Swin-L as a backbone, our method achieves 57.55 mIoU segmentation
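To make the two encoder operators described above more concrete, the sketch below pairs a 1x1 channel-mixing convolution with a deformable convolution whose offsets are predicted per location. It is a hedged illustration assuming PyTorch and torchvision; the class and layer names are illustrative and this is not the authors' DeMTG implementation.

```python
# A minimal sketch (not the authors' code) of a channel-aware mixing operator
# followed by a spatial-aware deformable operator, as described in the abstract.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableMixerBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Channel-aware mixing: 1x1 convolution lets different channels communicate.
        self.channel_mix = nn.Conv2d(channels, channels, kernel_size=1)
        # Spatial-aware deformable operator: offsets are predicted per location,
        # then deformable convolution samples the more informative positions.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.channel_mix(x))
        offsets = self.offset_pred(x)
        return self.act(self.deform_conv(x, offsets))

feat = torch.randn(1, 64, 32, 32)            # a dummy feature map
print(DeformableMixerBlock(64)(feat).shape)  # torch.Size([1, 64, 32, 32])
```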
In recent years, many deep learning-based methods have been proposed to tackle the problem of optical flow estimation and achieved promising results. However, they hardly consider that most videos are compressed and t...
Authors: Ling, Yu; Tan, Weimin; Yan, Bo
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai, China
Survival analysis aims at modeling the relationship between covariates and event occurrence with some untracked (censored) samples. In implementation, existing methods model the survival distribution with strong assum...
Recently, unsupervised image denoising methods that learn from paired noisy samples have received increasing attention. These methods build on the idea that the mean of multiple noisy images of the same scene is the ideal clean image. However, they ignore the effect of aleatoric uncertainty in the noisy image (e.g., pixels deviating from the expected distribution). The presence of aleatoric uncertainty degrades the reconstructed target pixels, resulting in high uncertainty for these pixels (i.e., low confidence), which in turn leads to sub-optimal denoising results. To address this problem, we propose a novel uncertainty-aware unsupervised image denoising method named Uncer2Natural (U2N). It dynamically predicts the aleatoric uncertainty for each noisy sample and produces satisfactory denoising results by reducing the effect of that uncertainty. Extensive experimental results show that U2N outperforms state-of-the-art unsupervised image denoising methods in terms of both quantitative metrics and qualitative visual quality.
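The sketch below shows one generic way to down-weight pixels with high predicted aleatoric uncertainty: a heteroscedastic Gaussian negative log-likelihood trained on paired noisy views. It is an assumption-based illustration in PyTorch, not the actual U2N objective; the network and variable names are hypothetical.

```python
# A generic sketch of uncertainty-weighted denoising (heteroscedastic Gaussian NLL),
# not the exact U2N method; architecture and names are illustrative only.
import torch
import torch.nn as nn

class UncertainDenoiser(nn.Module):
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.mean_head = nn.Conv2d(width, channels, 3, padding=1)    # denoised estimate
        self.logvar_head = nn.Conv2d(width, channels, 3, padding=1)  # per-pixel log-variance

    def forward(self, noisy):
        h = self.body(noisy)
        return self.mean_head(h), self.logvar_head(h)

def uncertainty_loss(pred, logvar, target):
    # Pixels with high predicted variance contribute less to the residual term;
    # the log-variance term keeps the model from declaring everything uncertain.
    return ((pred - target) ** 2 * torch.exp(-logvar) + logvar).mean()

model = UncertainDenoiser()
noisy_a, noisy_b = torch.randn(2, 4, 3, 64, 64)   # a dummy paired-noisy-sample batch
pred, logvar = model(noisy_a)
loss = uncertainty_loss(pred, logvar, noisy_b)     # train one noisy view to predict the other
loss.backward()
```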
The visual analysis of retinal data contributes to the understanding of a wide range of eye diseases. For the evaluation of cross-sectional studies, ophthalmologists rely on workflows and toolsets established in their...