ISBN (digital): 9781665490627
ISBN (print): 9781665490627
Transformer-based architectures have become the common choice in natural language processing and are now achieving SOTA performance in computer vision tasks such as image classification and object detection. However, convolutional methods still hold SOTA performance in many approaches to 3D human pose estimation. Inspired by recent developments in vision transformers, we design a heatmap-free structure that uses a standard transformer architecture and learnable object queries to model the relations among human joints within each frame, and then outputs accurate joint positions and types. We also present a transformer-based pose recognition architecture that needs no greedy algorithm for post-processing predicted bones at runtime. In the experiments, we achieve the best performance among methods that directly regress 3D joint positions from a single RGB image, and report competitive results against many 2D-to-3D lifting approaches.
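A minimal sketch of the learnable-query idea described in this abstract, assuming a PyTorch implementation; the class name, dimensions, and the per-query output heads are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class QueryPoseHead(nn.Module):
    """Toy query-based joint regressor: learnable queries cross-attend to
    per-frame image features and are decoded into a joint type and a 3D
    position per query, with no heatmaps or greedy post-processing."""
    def __init__(self, num_queries=17, d_model=256, num_joint_types=17):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.pos_head = nn.Linear(d_model, 3)                 # (x, y, z) per query
        self.type_head = nn.Linear(d_model, num_joint_types)  # joint class per query

    def forward(self, feats):          # feats: (B, N_tokens, d_model) image features
        b = feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, feats)     # queries attend to image tokens
        return self.pos_head(h), self.type_head(h)

# usage: positions, type_logits = QueryPoseHead()(torch.randn(2, 196, 256))
```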
Audio-visual speech recognition (AVSR) is a dynamic field that has emerged at the intersection of computer vision and voice processing. This paper examines, in depth, the challenges, recent advancements, and potential ...
ISBN (print): 9781665445092
We investigate the problem of generating 3D meshes from single free-hand sketches, aiming at fast 3D modeling for novice users. It can be regarded as a single-view reconstruction problem, but with unique challenges, brought by the variation and conciseness of sketches. Ambiguities in poorly-drawn sketches could make it hard to determine how the sketched object is posed. In this paper, we address the importance of viewpoint specification for overcoming such ambiguities, and propose a novel view-aware generation approach. By explicitly conditioning the generation process on a given viewpoint, our method can generate plausible shapes automatically with predicted viewpoints, or with specified viewpoints to help users better express their intentions. Extensive evaluations on various datasets demonstrate the effectiveness of our view-aware design in solving sketch ambiguities and improving reconstruction quality.
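A minimal sketch of the view-aware conditioning described above, assuming the sketch is already encoded into a latent code and the viewpoint is given as an (elevation, azimuth) pair; all names, dimensions, and the template-mesh decoding are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ViewAwareDecoder(nn.Module):
    """Toy view-conditioned generator: the sketch code is fused with a
    viewpoint embedding before decoding, so the same sketch can be
    reconstructed under a predicted or a user-specified viewpoint."""
    def __init__(self, code_dim=256, view_dim=32, num_verts=642):
        super().__init__()
        self.num_verts = num_verts
        self.view_embed = nn.Linear(2, view_dim)   # (elevation, azimuth) -> embedding
        self.decode = nn.Sequential(
            nn.Linear(code_dim + view_dim, 512), nn.ReLU(),
            nn.Linear(512, num_verts * 3),         # vertex offsets of a template mesh
        )

    def forward(self, sketch_code, viewpoint):     # (B, code_dim), (B, 2)
        v = self.view_embed(viewpoint)
        out = self.decode(torch.cat([sketch_code, v], dim=-1))
        return out.view(-1, self.num_verts, 3)
```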
ISBN (print): 9781665445092
Existing rain image editing methods focus on either removing rain from rain images or rendering rain on rain-free images. This paper proposes to realize continuous, bidirectional control of rain intensity, from a clear rain-free image to a downpour, with a single rain image as input and without changing the scene-specific characteristics, e.g., the direction, appearance, and distribution of rain. Specifically, we introduce a Rain Intensity Controlling Network (RICNet) that consists of three sub-networks, a background extraction network, a high-frequency rain-streak elimination network, and a main controlling network, and that controls rain images of different intensities continuously by interpolation in the deep feature space. An HOG loss and an autocorrelation loss are proposed to enhance consistency in orientation and suppress repetitive rain streaks. Furthermore, a decremental learning strategy that trains the network on downpour through drizzle images sequentially is proposed to further improve performance and speed up convergence. Extensive experiments on both rain datasets and real rain images demonstrate the effectiveness of the proposed method.
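A minimal sketch of the deep-feature interpolation used for continuous intensity control, assuming background and rain features have already been extracted by the sub-networks; the function name and the linear interpolation/extrapolation scheme are illustrative assumptions:

```python
import torch

def control_rain_intensity(rain_feat, bg_feat, alpha):
    """Toy version of interpolation in the deep feature space:
    alpha = 0 keeps only the background features (rain-free),
    alpha = 1 reproduces the original rain features, and alpha > 1
    extrapolates toward heavier rain. RICNet performs this inside
    learned sub-networks rather than with a single linear blend."""
    return bg_feat + alpha * (rain_feat - bg_feat)

# usage: feats = control_rain_intensity(torch.randn(1, 64, 32, 32),
#                                        torch.randn(1, 64, 32, 32), alpha=0.5)
```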
ISBN (print): 9781665445092
In this paper, we investigate a new variant of the neural architecture search (NAS) paradigm: searching with random labels (RLNAS). The task sounds counter-intuitive for most existing NAS algorithms, since random labels provide little information on the performance of each candidate architecture. Instead, we propose a novel NAS framework based on an ease-of-convergence hypothesis, which requires only random labels during searching. The algorithm involves two steps: first, we train a SuperNet using random labels; second, from the SuperNet we extract the sub-network whose weights change most significantly during training. Extensive experiments are evaluated on multiple datasets (e.g., NAS-Bench-201 and ImageNet) and multiple search spaces (e.g., DARTS-like and MobileNet-like). Very surprisingly, RLNAS achieves comparable or even better results than state-of-the-art NAS methods such as PC-DARTS and Single Path One-Shot, even though those counterparts utilize full ground-truth labels for searching. We hope our finding can inspire new understanding of the essence of NAS.
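The "weights change most significantly" criterion can be made concrete with a toy score. A minimal PyTorch sketch, assuming the change is measured as the angle between a sub-network's flattened weights at SuperNet initialization and after random-label training (the exact metric RLNAS uses may differ):

```python
import torch

def weight_change_angle(weights_init, weights_trained):
    """Toy ease-of-convergence score: the angle between a candidate
    sub-network's weight vector before and after SuperNet training.
    A larger angle means the weights moved more during training."""
    a = torch.cat([p.flatten() for p in weights_init])
    b = torch.cat([p.flatten() for p in weights_trained])
    cos = torch.dot(a, b) / (a.norm() * b.norm() + 1e-12)
    return torch.acos(cos.clamp(-1.0, 1.0))

# selection step (candidates is an assumed list of (w0, w1) weight snapshots):
# best = max(candidates, key=lambda c: weight_change_angle(c[0], c[1]))
```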
ISBN (print): 9781665445092
This paper introduces the unsupervised learning problem of playable video generation (PVG). In PVG, we aim at allowing a user to control the generated video by selecting a discrete action at every time step, as if playing a video game. The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input. We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos. We employ an encoder-decoder architecture where the predicted action labels act as a bottleneck. The network is constrained to learn a rich action space using, as its main driving loss, a reconstruction loss on the generated video. We demonstrate the effectiveness of the proposed approach on several datasets with a wide variety of environments.
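A minimal PyTorch sketch of a discrete-action bottleneck, assuming a straight-through Gumbel-softmax quantizer; the number of actions, dimensions, and the quantizer choice are assumptions, not necessarily the paper's exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionBottleneck(nn.Module):
    """Toy action bottleneck: a frame-pair feature is quantized into one of
    K discrete action labels, and only the embedding of the chosen action
    is passed on to condition the video decoder."""
    def __init__(self, feat_dim=256, num_actions=7):
        super().__init__()
        self.to_logits = nn.Linear(feat_dim, num_actions)
        self.action_embed = nn.Embedding(num_actions, feat_dim)

    def forward(self, pair_feat):                             # (B, feat_dim)
        logits = self.to_logits(pair_feat)
        onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (B, K), discrete
        return onehot @ self.action_embed.weight               # chosen action embedding
```

Because the decoder only sees the quantized action, the reconstruction loss forces the K labels to capture semantically consistent motions.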
ISBN (print): 9781665445092
Recently, language-guided global image editing has drawn increasing attention with its growing application potential. However, previous GAN-based methods are not only confined to domain-specific, low-resolution data but also lack interpretability. To overcome these collective difficulties, we develop a text-to-operation model that maps a vague editing language request into a series of editing operations, e.g., changes to contrast, brightness, and saturation. Each operation is interpretable and differentiable. Furthermore, the only supervision in the task is the target image, which is insufficient for stable training of sequential decisions. Hence, we propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth. Comparison experiments on the newly collected MA5k-Req dataset and the GIER dataset show the advantages of our methods. Code is available at https://***/jshi31/T2ONet.
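A minimal Python sketch of interpretable, differentiable global editing operations of the kind such a text-to-operation model could emit; the exact operator definitions here are illustrative assumptions:

```python
import torch

# Each operation takes an image in [0, 1] of shape (B, 3, H, W) and a scalar
# parameter, so a predicted operation sequence stays end-to-end trainable.

def brightness(img, b):
    """Shift all pixel values by b."""
    return (img + b).clamp(0.0, 1.0)

def contrast(img, c):
    """Scale deviations from the per-image mean by c."""
    mean = img.mean(dim=(2, 3), keepdim=True)
    return (mean + c * (img - mean)).clamp(0.0, 1.0)

def saturation(img, s):
    """Scale deviations from a crude grayscale version by s."""
    gray = img.mean(dim=1, keepdim=True)
    return (gray + s * (img - gray)).clamp(0.0, 1.0)

# usage: edited = saturation(contrast(brightness(img, 0.1), 1.2), 0.9)
```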
ISBN (print): 9781665445092
We propose a simple yet effective reflection-free cue for robust reflection removal from a pair of flash and ambient (no-flash) images. The reflection-free cue exploits a flash-only image obtained by subtracting the ambient image from the corresponding flash image in raw data space. The flash-only image is equivalent to an image taken in a dark environment with only a flash on. We observe that this flash-only image is visually reflection-free, and thus it can provide robust cues to infer the reflection in the ambient image. Since the flash-only image usually has artifacts, we further propose a dedicated model that not only utilizes the reflection-free cue but also avoids introducing artifacts, which helps accurately estimate the reflection and transmission. Our experiments on real-world images with various types of reflection demonstrate the effectiveness of our model with reflection-free flash-only cues: it outperforms state-of-the-art reflection removal approaches by more than 5.23 dB in PSNR, 0.04 in SSIM, and 0.068 in LPIPS. Our source code and dataset are publicly available at ***/ChenyangLEI/flash-reflection-removal.
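The flash-only cue itself is simple to reproduce. A minimal NumPy sketch, assuming linear, spatially aligned raw images with matching exposure; the rescaling step is an illustrative choice:

```python
import numpy as np

def flash_only_image(flash_raw, ambient_raw, eps=1e-6):
    """Reflection-free cue: subtracting the ambient raw image from the flash
    raw image leaves (approximately) the scene as lit by the flash alone,
    as if captured in a dark environment with only the flash on."""
    diff = flash_raw.astype(np.float64) - ambient_raw.astype(np.float64)
    diff = np.clip(diff, 0.0, None)        # negative residuals are noise
    return diff / (diff.max() + eps)       # rescale for visualization

# usage: cue = flash_only_image(np.random.rand(256, 256), np.random.rand(256, 256) * 0.5)
```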
With the rapid growth of online examination platforms, maintaining high levels of security, integrity, and user authentication is paramount. While existing methods utilize traditional security measures, the integratio...
ISBN (print): 9781665448994
In this paper, we explore the role of Instance Normalization in low-level vision tasks. Specifically, we present a novel block, the Half Instance Normalization Block (HIN Block), to boost the performance of image restoration networks. Based on the HIN Block, we design a simple and powerful multi-stage network named HINet, which consists of two subnetworks. With the help of the HIN Block, HINet surpasses the state of the art (SOTA) on various image restoration tasks. For image denoising, we exceed it by 0.11 dB and 0.28 dB in PSNR on the SIDD dataset, with only 7.5% and 30% of its multiplier-accumulator operations (MACs) and 6.8x and 2.9x speedups, respectively. For image deblurring, we get comparable performance with 22.5% of its MACs and a 3.3x speedup on the REDS and GoPro datasets. For image deraining, we exceed it by 0.3 dB in PSNR on the average result over multiple datasets with a 1.4x speedup. With HINet, we won 1st place on the NTIRE 2021 Image Deblurring Challenge Track 2: JPEG Artifacts, with a PSNR of 29.70.
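A minimal PyTorch sketch of the half-instance-normalization idea, normalizing half of the channels and leaving the rest untouched; the surrounding convolutions and residual path of the full HIN Block are omitted, and the module name is an assumption:

```python
import torch
import torch.nn as nn

class HalfInstanceNorm2d(nn.Module):
    """Core HIN operation: apply Instance Normalization to the first half of
    the channels, keep the second half as-is, and concatenate. This preserves
    some un-normalized features, which helps in low-level restoration tasks."""
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.norm = nn.InstanceNorm2d(self.half, affine=True)

    def forward(self, x):                   # x: (B, C, H, W)
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.norm(a), b], dim=1)

# usage: out = HalfInstanceNorm2d(64)(torch.randn(2, 64, 128, 128))
```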