ISBN (print): 9798350301298
GAN-based image restoration inverts the generative process to repair images corrupted by known degradations. Existing unsupervised methods must be carefully tuned for each task and degradation level. In this work, we make StyleGAN image restoration robust: a single set of hyperparameters works across a wide range of degradation levels. This makes it possible to handle combinations of several degradations without retuning. Our approach relies on a 3-phase progressive latent space extension and a conservative optimizer, which avoids the need for any additional regularization terms. Extensive experiments demonstrate robustness on inpainting, upsampling, denoising, and deartifacting at varying degradation levels, outperforming other StyleGAN-based inversion techniques. Our approach also compares favorably to diffusion-based restoration, yielding much more realistic inversion results. Code is available at the above URL.
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Automatic Facial Expression Recognition (FER) has attracted increasing attention over the last 20 years, since facial expressions play a central role in human communication. Most FER methodologies utilize Deep Neural Networks (DNNs), which are powerful tools for data analysis. Despite their power, these networks are prone to overfitting, as they often tend to memorize the training data. Moreover, few large in-the-wild (i.e., collected in unconstrained environments) databases currently exist for FER. To alleviate this issue, a number of data augmentation techniques have been proposed. Data augmentation increases the diversity of available data by applying constrained transformations to the original data. One such technique, which has contributed positively to various classification tasks, is Mixup, in which a DNN is trained on convex combinations of pairs of examples and their corresponding labels. In this paper, we examine the effectiveness of Mixup for in-the-wild FER, in which data have large variations in head poses, illumination conditions, backgrounds and contexts. We then propose a new data augmentation strategy based on Mixup, called MixAugment, in which the network is trained concurrently on a combination of virtual examples and real examples; all these examples contribute to the overall loss function. We conduct an extensive experimental study that demonstrates the effectiveness of MixAugment over Mixup and various state-of-the-art methods. We further investigate the combination of dropout with Mixup and MixAugment, as well as the combination of other data augmentation techniques with MixAugment.
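The two training objectives described in this abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the mixing coefficient is drawn from a Beta(alpha, alpha) distribution as in the original Mixup formulation, the real-example and virtual-example losses are summed with equal weight, and the names `mixup_batch`, `cross_entropy`, and `mixaugment_style_loss` are hypothetical. The abstract does not specify these details.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1] (assumed Beta prior)
    perm = rng.permutation(len(x))    # random pairing of examples within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and (soft) targets."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

def mixaugment_style_loss(predict, x, y_onehot, alpha=1.0, rng=None):
    """Loss on virtual (mixed) examples plus loss on real examples.

    Equal weighting of the two terms is an assumption of this sketch.
    """
    x_mix, y_mix, _ = mixup_batch(x, y_onehot, alpha, rng)
    virtual_loss = cross_entropy(predict(x_mix), y_mix)
    real_loss = cross_entropy(predict(x), y_onehot)
    return virtual_loss + real_loss
```

Here `predict` stands for any function that maps a batch of inputs to class probabilities. Plain Mixup would use only the `virtual_loss` term.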
ISBN (print): 9798350301298
Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical to simulating the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. To improve compositional generalization, it is essential to understand the effect of the primitives, including words, image regions, and video frames. In this paper, we explore the effect of primitives on compositional generalization in V&L. Specifically, we present a self-supervised learning based framework that equips existing V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.
ISBN (print): 9798350301298
Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes), as no existing datasets facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor classification, localization, understanding (retrieval, question answering, captioning) and generation (text-to-image synthesis) tasks. We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities. Project page: https://***
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
The research on action understanding has achieved significant progress with the establishment of various benchmark datasets. However, the results of action understanding are far from satisfactory in practice. One reason is that the existing action datasets ignore the existence of many hard negative samples in real-world scenarios, which are usually undefined confusion actions, e.g., holding a pen near the mouth vs. smoking. In this work, we focus on the common actions in our daily life and present a novel Common Daily Action Dataset (CDAD), which consists of 57,824 video clips of 23 well-defined common daily actions with rich manual annotations. Particularly, for each daily action, we collect not only diverse positive samples but also various hard negative samples that have minor differences (share similarities) in action with the positive ones. The established CDAD dataset could not only serve as a benchmark for several important daily action understanding tasks, including multi-label action recognition, temporal action localization, and spatial-temporal action detection, but also provide a testbed for researchers to investigate the influence of highly similar negative samples in learning action understanding models.
ISBN (print): 9798350301298
Fine-tuning large-scale pre-trained vision models to downstream tasks is a standard technique for achieving state-of-the-art performance on computer vision benchmarks. However, fine-tuning the whole model with millions of parameters is inefficient, as it requires storing a new model copy of the same size for each task. In this work, we propose LoRand, a method for fine-tuning large-scale vision models with a better trade-off between task performance and the number of trainable parameters. LoRand generates tiny adapter structures with low-rank synthesis while keeping the original backbone parameters fixed, resulting in high parameter sharing. To demonstrate LoRand's effectiveness, we conduct extensive experiments on object detection, semantic segmentation, and instance segmentation tasks. By training only a small percentage (1% to 3%) of the pre-trained backbone parameters, LoRand achieves comparable performance to standard fine-tuning on COCO and ADE20K and outperforms fine-tuning on the low-resource PASCAL VOC dataset.
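To make the parameter trade-off concrete, the following is a minimal sketch of a generic low-rank residual adapter attached to a frozen linear layer. It is not LoRand's actual adapter design, which the abstract does not describe in detail. The function names, the rank, and the layer sizes are all hypothetical.

```python
import numpy as np

def low_rank_adapter_forward(x, w_frozen, a, b):
    """Frozen linear layer plus a trainable rank-r residual path.

    x: (n, d_in), w_frozen: (d_in, d_out), a: (d_in, r), b: (r, d_out).
    Only a and b would be updated during fine-tuning.
    """
    return x @ w_frozen + (x @ a) @ b

def trainable_fraction(d_in, d_out, rank):
    """Adapter parameter count divided by the frozen layer's parameter count."""
    return rank * (d_in + d_out) / (d_in * d_out)

# Hypothetical sizes: a 1024x1024 layer with a rank-8 adapter.
print(trainable_fraction(1024, 1024, 8))  # 0.015625
```

Because the backbone weights stay fixed, each downstream task only needs to store its own small `a` and `b` factors in a scheme like this, rather than a full model copy.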
Empirical studies have shown that attention-based architectures outperform traditional convolutional neural networks (CNN) in terms of accuracy and robustness. As a result, attention-based architectures are increasing...
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Most existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which is designed specifically for speaker identification and whose frames are of low quality. As a consequence, VFSR models trained on this dataset cannot produce visually pleasing results. In this paper, we develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ), which contains over 16,000 high-fidelity clips of diverse interview scenarios. To verify the necessity of VFHQ, we further conduct experiments and demonstrate that VFSR models trained on our VFHQ dataset can generate results with sharper edges and finer textures than those trained on VoxCeleb1. In addition, we show that temporal information plays a pivotal role in eliminating video consistency issues as well as further improving visual performance. Based on VFHQ, by analyzing the benchmarking study of several state-of-the-art algorithms under bicubic and blind settings.
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Despite the remarkable breakthroughs achieved by Deep Learning (DL) models in numerous computer vision tasks, the need to acquire large amounts of high-quality natural data and fine-grained annotations is a shortcoming that fundamentally increases the cost and time devoted to training these models in real-world applications. Hence, synthetic datasets are considered reliable alternatives that can reduce the data acquisition burden by replacing natural data, being merged with it, or enabling effective pre-training of the models. To this end, in this work, we propose a novel approach to integrate structural data structures with the synthetic noise structures learned by unsupervised models that mimic the noise structures in natural data. Based on the proposed approach, we introduce the Sinusoid Feature Recognition (SFR) dataset, which contains hard-to-detect fixed-period sinusoid waves. While previous works in this area use generative models to sample synthetic data to inflate the training set, we instead apply unsupervised learning models to generate deep synthetic noise, which makes training models on the proposed dataset more challenging. We evaluate the segmentation, image reconstruction, and sinusoid characterization models pre-trained or fully trained on the synthetic SFR dataset on a private dataset of grayscale Acoustic Tele-Viewer (ATV) images. Experimental results show that supervision on our proposed synthetic dataset can improve the accuracy of the models by 3-4% via pre-training, and by 17-27% via ad-hoc training, while dealing with challenging, realistic real-world images.
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
In this paper, we propose a simple but effective architecture for fast and accurate single image super-resolution. Unlike other compact image super-resolution methods based on hand-crafted designs, we first apply coarse-grained pruning for network acceleration, and then introduce collapsible linear blocks to recover the representative ability of the pruned network. Specifically, each collapsible linear block has a multi-branch topology during training, and can be equivalently replaced with a single convolution in the inference stage. Such decoupling of the training-time and inference-time architecture is implemented via a structural re-parameterization technique, leading to improved representation without introducing extra computation costs. Additionally, we adopt a two-stage training mechanism with progressively larger patch sizes to facilitate the optimization procedure. We evaluate the proposed method on the NTIRE 2022 Efficient Image Super-Resolution Challenge and achieve a good trade-off between latency and accuracy. Particularly, under the condition of limited inference time (<= 49.42ms) and parameter amount (<= 0.894M), our solution obtains the best fidelity results in terms of PSNR, i.e., 29.05dB and 28.75dB on the DIV2K validation and test sets, respectively.
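The idea that a multi-branch linear block can be replaced by one convolution at inference time follows from the linearity of convolution. The sketch below illustrates this with a generic three-branch block (a 3x3 convolution, a 1x1 convolution, and an identity shortcut). The branch layout and the names `conv2d_same` and `collapse_branches` are assumptions chosen for illustration and are not taken from the paper's block design.

```python
import numpy as np

def conv2d_same(x, w):
    """Cross-correlate x (C_in, H, W) with w (C_out, C_in, k, k).

    Uses stride 1 and zero padding so the output keeps the input's H and W.
    """
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) to every output pixel.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def collapse_branches(w3, w1):
    """Fold a 3x3 branch, a 1x1 branch, and an identity branch into one 3x3 kernel."""
    c_out, c_in = w1.shape[:2]
    merged = w3.copy()
    merged[:, :, 1, 1] += w1[:, :, 0, 0]       # a 1x1 kernel acts on the centre tap
    merged[:, :, 1, 1] += np.eye(c_out, c_in)  # identity shortcut (needs c_out == c_in)
    return merged

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 5))
w3 = rng.normal(size=(4, 4, 3, 3))
w1 = rng.normal(size=(4, 4, 1, 1))

training_time = conv2d_same(x, w3) + conv2d_same(x, w1) + x  # multi-branch form
inference_time = conv2d_same(x, collapse_branches(w3, w1))   # single convolution
print(np.allclose(training_time, inference_time))  # True
```

The collapsed kernel has the same cost as a single 3x3 convolution, so in a scheme like this the extra branches add capacity only during training.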