ISBN (print): 9781665445092
Recently, we have witnessed early progress in automatically learning the association between voice and face, which has brought a new wave of studies to the computer vision community. However, most of the prior arts along this line (a) merely adopt local information to perform modality alignment and (b) ignore the diversity of learning difficulty across different subjects. In this paper, we propose a novel framework to jointly address the above-mentioned issues. Targeting (a), we propose a two-level modality alignment loss where both global and local information are considered. Compared with existing methods, we introduce a global loss into the modality alignment process; the global component of the loss is driven by identity classification. Theoretically, we show that minimizing the loss maximizes the distance between embeddings of different identities while minimizing the distance between embeddings of the same identity, in a global sense (rather than within a mini-batch). Targeting (b), we propose a dynamic reweighting scheme to better explore hard but valuable identities while filtering out unlearnable ones. Experiments show that the proposed method outperforms previous methods in multiple settings, including voice-face matching, verification, and retrieval.
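A minimal PyTorch sketch of the two-level alignment idea follows; the shared identity classifier, the 0.07 temperature, and the unweighted sum of the two terms are illustrative assumptions, and the dynamic reweighting scheme is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def two_level_alignment_loss(voice_emb, face_emb, labels, classifier):
        """voice_emb, face_emb: (B, D) L2-normalized embeddings; labels: (B,) identity ids;
        classifier: shared identity head, e.g. torch.nn.Linear(D, num_identities)."""
        # Global component: identity classification pulls same-identity embeddings
        # together across the whole dataset, not just within the mini-batch.
        global_loss = (F.cross_entropy(classifier(voice_emb), labels)
                       + F.cross_entropy(classifier(face_emb), labels))
        # Local component: in-batch contrastive alignment of paired voice/face embeddings.
        sim = voice_emb @ face_emb.t()                                 # (B, B) similarities
        targets = torch.arange(voice_emb.size(0), device=sim.device)   # diagonal = true pairs
        local_loss = F.cross_entropy(sim / 0.07, targets)
        return global_loss + local_loss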
ISBN (print): 9781665445092
Convolutional neural networks (CNNs) have achieved significant success in the single image dehazing task. Unfortunately, most existing deep dehazing models have high computational complexity, which hinders their application to high-resolution images, especially UHD (ultra-high-definition) or 4K images. To address the problem, we propose a novel network capable of real-time dehazing of 4K images on a single GPU, which consists of three deep CNNs. The first CNN extracts haze-relevant features from a reduced-resolution version of the hazy input and then fits locally-affine models in the bilateral space. Another CNN learns multiple full-resolution guidance maps corresponding to the learned bilateral model. As a result, high-frequency feature maps can be reconstructed by multi-guided bilateral upsampling. Finally, the third CNN fuses the high-quality feature maps into a dehazed image. In addition, we create a large-scale 4K image dehazing dataset to support the training and testing of the compared models. Experimental results demonstrate that the proposed algorithm performs favorably against state-of-the-art dehazing approaches on various benchmarks.
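The upsampling step could be sketched as below, assuming the low-resolution branch predicts per-pixel 3x4 affine color transforms; plain bilinear upsampling stands in for the paper's multi-guided bilateral upsampling, and all shapes and names are illustrative.

    import torch
    import torch.nn.functional as F

    def apply_affine_transforms(full_res, coeffs_lr):
        """full_res: (B, 3, H, W) hazy input; coeffs_lr: (B, 12, h, w) low-res affine maps."""
        B, _, H, W = full_res.shape
        coeffs = F.interpolate(coeffs_lr, size=(H, W), mode='bilinear', align_corners=False)
        A = coeffs[:, :9].reshape(B, 3, 3, H, W)   # per-pixel 3x3 color matrix
        b = coeffs[:, 9:]                          # (B, 3, H, W) per-pixel bias
        # Apply y = A x + b independently at every pixel.
        return torch.einsum('bijhw,bjhw->bihw', A, full_res) + b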
ISBN (print): 9781665448994
Multi-sensor fusion enhances environment perception and 3D reconstruction in self-driving and robot navigation, and calibration between sensors is a precondition of effective multi-sensor fusion. Traditional calibration techniques for Light Detection and Ranging (LiDAR) and camera require laborious manual work and complex environment settings. We propose an online LiDAR-Camera Self-calibration Network (LCCNet), different from previous CNN-based methods. LCCNet can be trained end-to-end and predicts the extrinsic parameters in real time. In LCCNet, we exploit a cost volume layer to express the correlation between the RGB image features and the depth image projected from point clouds. Besides using the smooth L1 loss on the predicted extrinsic calibration parameters as a supervised signal, an additional self-supervised signal, a point cloud distance loss, is applied during training. Instead of directly regressing the extrinsic parameters, we predict the decalibration deviation from the initial calibration to the ground truth. The calibration error decreases further with iterative refinement and a temporal filtering approach at inference. The calibration process executes in 24 ms per iteration on a single GPU. LCCNet achieves a mean absolute calibration error of 0.297 cm in translation and 0.017° in rotation with miscalibration magnitudes of up to ±1.5 m and ±20° on the KITTI-odometry dataset, which is better than state-of-the-art CNN-based calibration methods. The code will be publicly available at https://***/LvXudong-HIT/LCCNet
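A sketch of the iterative refinement loop at inference, under assumed interfaces: `net` takes an RGB image and a projected depth image and returns a 4x4 correction, and `project_fn` renders the depth image; neither is the released implementation.

    def refine_extrinsics(net, project_fn, rgb, lidar_points, T_init, n_iters=5):
        """T_init: (4, 4) initial LiDAR-to-camera extrinsic guess;
        project_fn: renders a depth image from the point cloud under a given extrinsic."""
        T = T_init.clone()
        for _ in range(n_iters):
            depth = project_fn(lidar_points, T)  # depth image under the current estimate
            delta = net(rgb, depth)              # predicted (4, 4) decalibration deviation
            T = delta @ T                        # compose the correction with the estimate
        return T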
ISBN (print): 9781665445092
While recent studies on semi-supervised learning have shown remarkable progress in leveraging both labeled and unlabeled data, most of them presume a basic setting in which the model is randomly initialized. In this work, we consider semi-supervised learning and transfer learning jointly, leading to a more practical and competitive paradigm that can utilize both powerful pre-trained models from a source domain and labeled/unlabeled data in the target domain. To better exploit the value of both pre-trained weights and unlabeled target examples, we introduce adaptive consistency regularization that consists of two complementary components: Adaptive Knowledge Consistency (AKC) on the examples between the source and target models, and Adaptive Representation Consistency (ARC) on the target model between labeled and unlabeled examples. Examples involved in the consistency regularization are adaptively selected according to their potential contributions to the target task. We conduct extensive experiments on popular benchmarks including CIFAR-10, CUB-200, and MURA, by fine-tuning the ImageNet pre-trained ResNet-50 model. Results show that our proposed adaptive consistency regularization outperforms state-of-the-art semi-supervised learning techniques such as Pseudo Label, Mean Teacher, and FixMatch. Moreover, our algorithm is orthogonal to existing methods and can thus gain additional improvements on top of MixMatch and FixMatch. Our code is available at https://***/Walleclipse/SemiSupervised-Transfer-Learning-Paddle.
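The AKC component could be sketched as follows, where a confidence threshold on the frozen source model's predictions stands in for the paper's adaptive selection rule (the threshold value and gating form are assumptions).

    import torch.nn.functional as F

    def akc_loss(source_logits, target_logits, threshold=0.8):
        """source_logits: frozen source-model outputs; target_logits: target-model outputs."""
        p_src = F.softmax(source_logits.detach(), dim=1)    # reference distribution
        conf, _ = p_src.max(dim=1)
        weight = (conf > threshold).float()                 # adaptively select examples
        kl = F.kl_div(F.log_softmax(target_logits, dim=1), p_src, reduction='none').sum(dim=1)
        return (weight * kl).sum() / weight.sum().clamp(min=1.0)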
ISBN (print): 9781665445092
The dual-pixel (DP) hardware works by splitting each pixel in half and creating an image pair in a single snapshot. Several works estimate depth/inverse depth by treating the DP pair as a stereo pair. However, dual-pixel disparity only occurs in image regions with defocus blur, and the heavy defocus blur in DP pairs affects the performance of matching-based depth estimation approaches. Instead of removing the blur effect blindly, we study the formation of the DP pair, which links the blur to the depth information. In this paper, we propose a mathematical DP model under which the blur can benefit depth estimation. These explorations motivate us to propose an end-to-end DDDNet (DP-based Depth and Deblur Network) to jointly estimate the depth and restore the image. Moreover, we define a reblur loss, which reflects the relationship of the DP image formation process with depth information, to regularise our depth estimate during training. To meet the requirement of a large amount of training data, we propose the first DP image simulator, which allows us to create datasets with DP pairs from any existing RGBD dataset. As a side contribution, we collect a real dataset for further research. Extensive experimental evaluation on both synthetic and real datasets shows that our approach achieves competitive performance compared to state-of-the-art approaches.
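A sketch of a reblur-style consistency loss under an assumed differentiable DP formation model `simulate_dp_blur`; the L1 comparison and the interfaces are illustrative, not the paper's exact formulation.

    import torch.nn.functional as F

    def reblur_loss(sharp_pred, depth_pred, dp_input, simulate_dp_blur):
        """sharp_pred, dp_input: (B, 3, H, W); depth_pred: (B, 1, H, W);
        simulate_dp_blur: assumed differentiable DP formation model."""
        reblurred = simulate_dp_blur(sharp_pred, depth_pred)  # re-synthesize defocus blur
        return F.l1_loss(reblurred, dp_input)                 # consistency with the input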
ISBN (print): 9781665445092
The task of point cloud completion aims to predict the missing part of an incomplete 3D shape. A widely used strategy is to generate a complete point cloud from the incomplete one. However, the unordered nature of point clouds degrades the generation of high-quality 3D shapes, as the detailed topology and structure of discrete points are hard to capture by a generative process that uses only a latent code. In this paper, we address the above problem by reconsidering the completion task from a new perspective, where we formulate the prediction as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net, to mimic the behavior of an earth mover. It moves each point of the incomplete input to complete the point cloud, such that the total distance of the point moving paths (PMP) is shortest. Therefore, PMP-Net predicts a unique point moving path for each point according to the constraint of the total point moving distance. As a result, the network learns a strict and unique point-level correspondence, and thus improves the quality of the predicted complete shape. We conduct comprehensive experiments on the Completion3D and PCN datasets, which demonstrate our advantages over state-of-the-art point cloud completion methods. Code will be available at https://***/diviswen/PMP-Net.
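The point-moving objective could be sketched as below, assuming a Chamfer-distance implementation `chamfer` and a weighting factor `lam`; both names are hypothetical.

    def pmp_style_loss(partial, displacements, gt, chamfer, lam=0.1):
        """partial: (B, N, 3) input points; displacements: (B, N, 3) predicted moves;
        gt: (B, M, 3) complete shape; chamfer: assumed Chamfer-distance function."""
        moved = partial + displacements                   # deform rather than regenerate
        completion = chamfer(moved, gt)                   # match the complete shape
        path_length = displacements.norm(dim=-1).mean()   # prefer the shortest moving paths
        return completion + lam * path_length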
ISBN (print): 9781665445092
In this paper, we propose a Point-Voxel Recurrent All-Pairs Field Transforms (PV-RAFT) method to estimate scene flow from point clouds. Since point clouds are irregular and unordered, it is challenging to efficiently extract features from all-pairs fields in 3D space, where all-pairs correlations play an important role in scene flow estimation. To tackle this problem, we present point-voxel correlation fields, which capture both local and long-range dependencies of point pairs. To capture point-based correlations, we adopt a K-Nearest Neighbors search that preserves fine-grained information in the local region. By voxelizing point clouds in a multi-scale manner, we construct pyramid correlation voxels to model long-range correspondences. Integrating these two types of correlations, our PV-RAFT makes use of all-pairs relations to handle both small and large displacements. We evaluate the proposed method on the FlyingThings3D and KITTI Scene Flow 2015 datasets. Experimental results show that PV-RAFT outperforms state-of-the-art methods by remarkable margins.
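The point branch of the correlation lookup might look like the following sketch, using a brute-force K-Nearest Neighbors search; shapes and the gather-based lookup are illustrative assumptions.

    import torch

    def knn_correlation(corr, src, tgt, k=16):
        """corr: (N, M) all-pairs correlation; src: (N, 3) source points; tgt: (M, 3) targets."""
        dist = torch.cdist(src, tgt)                  # (N, M) pairwise distances
        idx = dist.topk(k, largest=False).indices     # (N, k) nearest target points
        return corr.gather(1, idx)                    # (N, k) local correlation features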
ISBN (print): 9781665445092
Learning-based 3D shape segmentation is usually formulated as a semantic labeling problem, assuming that all parts of the training shapes are annotated with a given set of tags. This assumption, however, is impractical for learning fine-grained segmentation. Although most off-the-shelf CAD models are, by construction, composed of fine-grained parts, they usually lack semantic tags, and labeling those fine-grained parts is extremely tedious. We approach the problem with deep clustering, where the key idea is to learn part priors from a shape dataset with fine-grained segmentation but no part labels. Given point-sampled 3D shapes, we model the clustering priors of points with a similarity matrix and achieve part segmentation by minimizing a novel low-rank loss. To handle densely sampled point sets, we adopt a divide-and-conquer strategy: we partition the large point set into a number of blocks, and each block is segmented using a deep-clustering-based part prior network trained in a category-agnostic manner. We then train a graph convolution network to merge the segments of all blocks into the final segmentation result. Our method is evaluated on a challenging benchmark of fine-grained segmentation and shows state-of-the-art performance.
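A generic low-rank surrogate on a learned similarity matrix is sketched below; the sigmoid similarity and the nuclear norm are stand-ins, not necessarily the paper's exact loss.

    import torch

    def low_rank_loss(features):
        """features: (N, D) per-point embeddings from the part prior network."""
        sim = torch.sigmoid(features @ features.t())     # (N, N) soft similarity matrix
        return torch.linalg.matrix_norm(sim, ord='nuc')  # nuclear norm as a rank surrogate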
ISBN (print): 9781665445092
In this paper, we present a novel unpaired point cloud completion network, named Cycle4Completion, to infer the complete geometry of a partial 3D object. Previous unpaired completion methods merely focus on learning the geometric correspondence from incomplete shapes to complete shapes and ignore learning in the reverse direction, which leads to low completion accuracy due to limited 3D shape understanding. To address this problem, we propose two simultaneous cycle transformations between the latent spaces of complete and incomplete shapes. Specifically, the first cycle transforms shapes from the incomplete domain to the complete domain and then projects them back to the incomplete domain. This process learns the geometric characteristics of complete shapes and maintains shape consistency between the complete prediction and the incomplete input. Similarly, the inverse cycle transformation starts from the complete domain, passes through the incomplete domain, and returns to the complete domain to learn the characteristics of incomplete shapes. We experimentally show that our model with the learned bidirectional geometry correspondence outperforms state-of-the-art unpaired completion methods. Code will be available at https://***/diviswen/Cycle4Completion.
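The two latent-space cycles could be sketched as follows, with assumed mappings `f_ic` (incomplete to complete) and `f_ci` (complete to incomplete) and simple L2 cycle-consistency terms.

    import torch.nn.functional as F

    def cycle_losses(z_incomplete, z_complete, f_ic, f_ci):
        """z_*: (B, D) latent codes; f_ic: incomplete-to-complete mapping; f_ci: the reverse."""
        z_i_cycle = f_ci(f_ic(z_incomplete))   # incomplete -> complete -> incomplete
        z_c_cycle = f_ic(f_ci(z_complete))     # complete -> incomplete -> complete
        return F.mse_loss(z_i_cycle, z_incomplete) + F.mse_loss(z_c_cycle, z_complete)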
ISBN (print): 9781665445092
High-quality high-resolution (HR) magnetic resonance (MR) images afford more detailed information for reliable diagnosis and quantitative image analyses. Deep convolutional neural networks (CNNs) have shown promising ability for MR image super-resolution (SR) given low-resolution (LR) MR images. LR MR images usually share some visual characteristics: repeating patterns, relatively simple structures, and less informative background. Most previous CNN-based SR methods treat all spatial pixels (including the background) equally and fail to sense the entire space of the input, which is critical for high-quality MR image SR. To address these problems, we propose squeeze and excitation reasoning attention networks (SERAN) for accurate MR image SR. We propose to squeeze attention from the global spatial information of the input to obtain global descriptors, which enhance the network's ability to focus on the more informative regions and structures in MR images. We further build relationships among these global descriptors and propose primitive relationship reasoning attention, with which the global descriptors are further refined. To make full use of the aggregated information, we adaptively recalibrate feature responses with learned adaptive attention vectors. These attention vectors select a subset of global descriptors to complement each spatial location for accurate detail and texture reconstruction. We propose squeeze and excitation attention with residual scaling, which not only stabilizes training but also makes it flexible to apply to other base networks. Extensive experiments show the effectiveness of our proposed SERAN, which clearly surpasses state-of-the-art methods on benchmarks both quantitatively and visually.
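A minimal sketch of squeezing a global descriptor and recalibrating features with residual scaling, in the squeeze-and-excitation spirit; the single pooled descriptor and the reduction ratio are simplifying assumptions relative to SERAN's reasoning attention.

    import torch.nn as nn

    class SqueezeExciteAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        def forward(self, x):                    # x: (B, C, H, W) feature maps
            b, c = x.shape[:2]
            desc = x.mean(dim=(2, 3))            # squeeze: one global descriptor per channel
            scale = self.fc(desc).view(b, c, 1, 1)
            return x + x * scale                 # excitation with residual scaling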