In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their p...
详细信息
ISBN:
(纸本)9798350353006
In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their perfor-mance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) under-score the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://***/CrystalSixone/VLN-GOAT.
Non-isometric shape correspondence remains a fundamental challenge in computervision. Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic s...
详细信息
ISBN:
(纸本)9798350353013;9798350353006
Non-isometric shape correspondence remains a fundamental challenge in computervision. Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic shape changes like bending and creases. We propose a novel approach of combining the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in which we construct functional maps. To this end, we present a theoretical framework to effectively integrate non-orthogonal basis functions into descriptor- and learning-based functional map methods. Our approach can be incorporated easily into existing functional map pipelines across varying applications and can handle complex deformations beyond isometries. We show extensive evaluations across various supervised and unsupervised settings and demonstrate significant improvements. Notably, our approach achieves up to 15% better mean geodesic error for non-isometric correspondence settings and up to 45% improvement in scenarios with topological noise. Code is available at: https://***/
We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjo...
详细信息
ISBN:
(纸本)9798350353006
We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text- to-image diffusion decoder to reconstruct the original input - a process we term De- Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples. Project page: ***.
The advent of vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computervision. ViTs capture the global information of images through self-attention modules, which perform dot product comp...
详细信息
ISBN:
(纸本)9798350353013;9798350353006
The advent of vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computervision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among patchified image tokens. While self- attention modules empower ViTs to capture long-range dependencies, the computational complexity grows quadratically with the number of tokens, which is a major hindrance to the practical application of ViTs. Moreover, the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly, we argue against the necessity of computing the attention scores in every layer, and we propose the Less-Attention vision Transformer (LaViT), which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation, merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover, our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
We present a novel theory that establishes the relationship between light transport in visible and thermal infrared, and heat transport in solids. We show that heat generated due to light absorption can be estimated b...
详细信息
ISBN:
(纸本)9798350353006
We present a novel theory that establishes the relationship between light transport in visible and thermal infrared, and heat transport in solids. We show that heat generated due to light absorption can be estimated by modeling heat transport using a thermal camera. For situations where heat conduction is negligible, we analytically solve the heat transport equation to derive a simple expression relating the change in thermal image intensity to the absorbed light intensity and heat capacity of the material. Next, we prove that intrinsic image decomposition for Lambertian scenes becomes a well-posed problem if one has access to the absorbed light. Our theory generalizes to arbitrary shapes and unstructured illumination. Our theory is based on applying energy conservation principle at each pixel independently. We validate our theory using real-world experiments on diffuse objects made of different materials that exhibit both direct and global components (inter-reflections) of light transport under unknown complex lighting.
In this paper, we present an empirical study on image unfairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We demonstrate that are not equal and unfairness is prevalent for image classifi...
详细信息
ISBN:
(纸本)9798350353006
In this paper, we present an empirical study on image unfairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We demonstrate that are not equal and unfairness is prevalent for image classification models across various datasets, network and model capacities. Moreover, several intriguing properties of fairness are identified. First, the unfairness lies in problematic representation rather than classifier bias distinguished from long-tailed recognition. Second, with the proposed concept of Model Prediction Bias, investigate the origins of problematic representation training optimization. Our findings reveal that models tend to exhibit greater prediction biases for classes that more challenging to recognize. It means that more other will be confused with harder classes. Then the False (FPs) will dominate the learning in optimization, leading to their poor accuracy. Further, we conclude data augmentation and representation learning algorithms improve overall performance by promoting fairness some degree in image classification.
We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects, we first sequentially and progressively model the static back...
详细信息
ISBN:
(纸本)9798350353006
We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects, we first sequentially and progressively model the static background of the entire scene with incremental static 3D Gaussians. We then leverage a composite dynamic Gaussian graph to handle multiple moving objects, individually reconstructing each object and restoring their accurate positions and occlusion relationships within the scene. We further use a LiDAR prior for Gaussian Splatting to reconstruct scenes with greater details and maintain panoramic consistency. DrivingGaussian outperforms existing methods in dynamic driving scene reconstruction and enables photorealistic surround-view synthesis with high-fidelity and multi-camera consistency. Our project page is at: https://***/VDIGPKU/DrivingGaussian.
In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face ...
详细信息
ISBN:
(纸本)9798350353013;9798350353006
In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections, neglecting 3D statuses. To address these issues, we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities, which also achieves a higher rating in the user study.
The exploitation of visible spectrum datasets has led deep networks to show remarkable success. However, real-world tasks include low-lighting conditions which arise performance bottlenecks for models trained on large...
详细信息
ISBN:
(纸本)9798350302493
The exploitation of visible spectrum datasets has led deep networks to show remarkable success. However, real-world tasks include low-lighting conditions which arise performance bottlenecks for models trained on large-scale RGB image datasets. Thermal IR cameras are more robust against such conditions. Therefore, the usage of thermal imagery in real-world applications can be useful. Unsupervised domain adaptation (UDA) allows transferring information from a source domain to a fully unlabeled target domain. Despite substantial improvements in UDA, the performance gap between UDA and its supervised learning counterpart remains significant. By picking a small number of target samples to annotate and using them in training, active domain adaptation tries to mitigate this gap with minimum annotation expense. We propose an active domain adaptation method in order to examine the efficiency of combining the visible spectrum and thermal imagery modalities. When the domain gap is considerably large as in the visible-to-thermal task, we may conclude that the methods without explicit domain alignment cannot achieve their full potential. To this end, we propose a spectral transfer guided active domain adaptation method to select the most informative unlabeled target samples while aligning source and target domains. We used the large-scale visible spectrum dataset MS-COCO as the source domain and the thermal dataset FLIR ADAS as the target domain to present the results of our method. Extensive experimental evaluation demonstrates that our proposed method outperforms the state-of-the-art active domain adaptation methods. The code and models are publicly available.(1)
Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion and 3D geometry. Combining both recognition and reconstruction boils down t...
详细信息
ISBN:
(纸本)9798350307443
Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion and 3D geometry. Combining both recognition and reconstruction boils down to a fusion problem, where appearance and motion features need to be combined for classification and segmentation. In this paper, we present a novel fusion architecture for monocular motion segmentation - M3Former, which leverages the strong performance of transformers for segmentation and multi-modal fusion. As reconstructing motion from monocular video is ill-posed, we systematically analyze different 2D and 3D motion representations for this problem and their importance for segmentation performance. Finally, we analyze the effect of training data and show that diverse datasets are required to achieve SotA performance on Kitti and Davis. Code will be released upon publication.
暂无评论