In this paper, we propose Geometric Neural Network (GeoNet) to jointly predict depth and surface normal maps from a single image. Building on top of two-stream CNNs, our GeoNet incorporates geometric relation between ...
详细信息
ISBN:
(纸本)9781538664209
In this paper, we propose Geometric Neural Network (GeoNet) to jointly predict depth and surface normal maps from a single image. Building on top of two-stream CNNs, our GeoNet incorporates geometric relation between depth and surface normal via the new depth-to-normal and normal to -depth networks. Depth-to-normal network exploits the least square solution of surface normal from depth and improves its quality with a residual module. Normal-to-depth network, contrarily, refines the depth map based on the constraints from the surface normal through a kernel regression module, which has no parameter to learn. These two networks enforce the underlying model to efficiently predict depth and surface normal for high consistency and corresponding accuracy. Our experiments on NYU v2 dataset verify that our GeoNet is able to predict geometrically consistent depth and normal maps. It achieves top performance on surface normal estimation and is on par with state-of-theart depth estimation methods.
This paper provides a very simple yet effective character-level architecture for learning bidirectional retrieval models. Aligning multimodal content is particularly challenging considering the difficulty in finding s...
详细信息
ISBN:
(纸本)9781538664209
This paper provides a very simple yet effective character-level architecture for learning bidirectional retrieval models. Aligning multimodal content is particularly challenging considering the difficulty in finding semantic correspondence between images and descriptions. We introduce an efficient character-level inception module, designed to learn textual semantic embeddings by convolving raw characters in distinct granularity levels. Our approach is capable of explicitly encoding hierarchical information from distinct base-level representations (e.g., characters, words, and sentences) into a shared multimodal space, where it maps the semantic correspondence between images and descriptions via a contrastive pairwise loss function that minimizes order-violations. Models generated by our approach are far more robust to input noise than state-of-the-art strategies based on word-embeddings. Despite being conceptually much simpler and requiring fewer parameters, our models outperform the state-of-the-art approaches by 4.8% in the task of description retrieval and 2.7% (absolute R@ 1 values) in the task of image retrieval in the popular MS COCO retrieval dataset. We also show that our models present solid performance for text classification, specially in multilingual and noisy domains.
Human multimodal emotion recognition involves time-series data of different modalities, such as natural language, visual motions, and acoustic behaviors. Due to the variable sampling rates for sequences from different...
详细信息
ISBN:
(纸本)9781665445092
Human multimodal emotion recognition involves time-series data of different modalities, such as natural language, visual motions, and acoustic behaviors. Due to the variable sampling rates for sequences from different modalities, the collected multimodal streams are usually unaligned. The asynchrony across modalities increases the difficulty on conducting efficient multimodal fusion. Hence, this work mainly focuses on multimodal fusion from unaligned multimodal sequences. To this end, we propose the Progressive Modality Reinforcement (PMR) approach based on the recent advances of crossmodal transformer. Our approach introduces a message hub to exchange information with each modality. The message hub sends common messages to each modality and reinforces their features via crossmodal attention. In turn, it also collects the reinforced features from each modality and uses them to generate a reinforced common message. By repeating the cycle process, the common message and the modalities' features can progressively complement each other. Finally, the reinforced features are used to make predictions for human emotion. Comprehensive experiments on different human multimodal emotion recognition benchmarks clearly demonstrate the superiority of our approach.
The Fisher Linear Discriminant (FLD) is commonly used in patternrecognition. It finds a linear subspace that maximally separates class patterns according to Fisher's Criterion. Several methods of computing the FL...
详细信息
Reconstructing 3D scenes with independently moving objects from uncalibrated monocular image sequences still poses serious challenges. One important problem is to find the relative scales between these different recon...
详细信息
Reconstructing 3D scenes with independently moving objects from uncalibrated monocular image sequences still poses serious challenges. One important problem is to find the relative scales between these different reconstructed objects. The perspective reconstruction of a single object can only be known up to a certain scale which results in a one-parameter family of relative scales and trajectories per moving object. The paper formulates this ambiguity and proposes solutions in the vein of "non-accidentalness". Two instantiations of its use are analyzed: planar motion and the 'heading constraint'.
In this paper, we propose a simple attention mechanism, we call Box-Attention. It enables spatial interaction between grid features, as sampled from boxes of interest, and improves the learning capability of transform...
详细信息
ISBN:
(数字)9781665469463
ISBN:
(纸本)9781665469463
In this paper, we propose a simple attention mechanism, we call Box-Attention. It enables spatial interaction between grid features, as sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. The BoxeR computes attention weights on these boxes by considering its grid structure. Notably, BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye view plane for 3D end-to-end object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves state-of-the-art results on COCO detection and instance segmentation. Besides, BoxeR-3D improves over the end-to-end 3D object detection baseline and already obtains a compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. Code is available at https://***/kienduynguyen/BoxeR.
The paper explores the quadratic filter for low-level vision applications. The quadratic filter is the simplest nonlinear time-invariant filter and corresponds to the second term in the Volterra expansion. We will ela...
详细信息
ISBN:
(纸本)0818638826
The paper explores the quadratic filter for low-level vision applications. The quadratic filter is the simplest nonlinear time-invariant filter and corresponds to the second term in the Volterra expansion. We will elaborate concepts behind the quadratic filter and show that it can be derived from fundamental properties of regions and the principle of model competition. We present two properties of the filter: it has better spatial localization than a linear filter and it is insensitive to blurred boundaries.
Recent progress in deterministic prompt learning has become a promising alternative to various downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-lang...
详细信息
ISBN:
(纸本)9798350301298
Recent progress in deterministic prompt learning has become a promising alternative to various downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-language models. However, this approach results in limited performance for dense prediction tasks that require handling more complex and diverse objects, since a single and deterministic description cannot sufficiently represent the entire image. In this paper, we present a novel probabilistic prompt learning to fully exploit the vision-language knowledge in dense prediction tasks. First, we introduce learnable class-agnostic attribute prompts to describe universal attributes across the object class. The attributes are combined with class information and visual-context knowledge to define the class-specific textual distribution. Text representations are sampled and used to guide the dense prediction task using the probabilistic pixel-text matching loss, enhancing the stability and generalization capability of the proposed method. Extensive experiments on different dense prediction tasks and ablation studies demonstrate the effectiveness of our proposed method.
Post-hoc explanation methods, e.g., Grad-CAM, enable humans to inspect the spatial regions responsible for a particular network decision. However, it is shown that such explanations are not always consistent with huma...
详细信息
ISBN:
(数字)9781665469463
ISBN:
(纸本)9781665469463
Post-hoc explanation methods, e.g., Grad-CAM, enable humans to inspect the spatial regions responsible for a particular network decision. However, it is shown that such explanations are not always consistent with human priors, such as consistency across image transformations. Given an interpretation algorithm, e.g., Grad-CAM, we introduce a novel training method to train the model to produce more consistent explanations. Since obtaining the ground truth for a desired model interpretation is not a well-defined task, we adopt ideas from contrastive self-supervised learning, and apply them to the interpretations of the model rather than its embeddings. We show that our method, Contrastive Grad-CAM Consistency (CGC), results in Grad-CAM interpretation heatmaps that are more consistent with human annotations while still achieving comparable classification accuracy. Moreover;our method acts as a regularizes and improves the accuracy on limited-data, fine-grained classification settings. In addition, because our method does not rely on annotations, it allows for the incorporation of unlabeled data into training, which enables better generalization of the model.
This paper explores the idea of extracting three dimensional features from a previously recorded video, in an attempt to provide three dimensional information about a video clip in order to improve the performance of ...
详细信息
暂无评论