ISBN (print): 9798350353006
We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input, a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples. Project page: ***.
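As a rough, hedged illustration of the setup this abstract describes, the sketch below trains an image-to-text encoder against a frozen decoder so that reconstruction quality drives the text representation. The encoder, the toy decoder, the Gumbel-softmax token selection, and the MSE objective are all simplifying assumptions for illustration; the actual system uses a pre-trained text-to-image diffusion model and its own training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageToTextEncoder(nn.Module):
    """Maps an image to a sequence of discrete text tokens (as one-hots)."""
    def __init__(self, vocab_size=32000, seq_len=75, dim=512):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.head = nn.Linear(dim, vocab_size)
        self.seq_len = seq_len

    def forward(self, images):                                    # (B, 3, 224, 224)
        feats = self.patchify(images).flatten(2).transpose(1, 2)  # (B, 196, dim)
        logits = self.head(feats[:, : self.seq_len])              # (B, L, vocab)
        # Gumbel-softmax keeps the discrete token choice differentiable
        return F.gumbel_softmax(logits, hard=True, dim=-1)

class ToyTextToImageDecoder(nn.Module):
    """Frozen stand-in for the pre-trained text-to-image diffusion decoder."""
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Linear(vocab_size, dim)   # consumes (soft) one-hot tokens
        self.to_img = nn.Linear(dim, 3 * 224 * 224)

    def forward(self, tokens):
        ctx = self.embed(tokens).mean(dim=1)
        return self.to_img(ctx).view(-1, 3, 224, 224)

encoder, decoder = ImageToTextEncoder(), ToyTextToImageDecoder()
for p in decoder.parameters():                    # decoder stays fixed
    p.requires_grad_(False)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

images = torch.rand(2, 3, 224, 224)
loss = F.mse_loss(decoder(encoder(images)), images)  # reconstruction shapes the "text"
loss.backward()
opt.step()
```

Because only the encoder receives gradients, the text it produces must carry whatever the frozen decoder needs to redraw the image, which is the source of the precision and comprehensiveness claims above.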
ISBN (print): 9798350365474
By using few-shot data and labels, prompt learning obtains optimal prompts that achieve high performance on downstream tasks. Existing prompt learning methods generate high-quality prompts that are suitable for downstream tasks but tend to perform poorly when only very limited data (e.g., one-shot) is available. We address this challenging one-shot scenario and propose a novel architecture for prompt learning, called the Image-Text Feature Alignment Branch (ITFAB). ITFAB pulls text features closer to the centroids of the image features and separates text features of different classes to resolve misalignment in the feature space, thereby facilitating the acquisition of high-quality prompts from very limited data. In the one-shot setting, our method outperforms the existing CoOp and CoCoOp methods and in some cases even surpasses CoCoOp's 16-shot performance. Tests on different datasets and domains show that ITFAB almost matches CoCoOp's effectiveness. It also composes with current prompt learning methods such as MaPLe and PromptSRC, improving their performance in the one-shot setting.
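A minimal sketch of the alignment idea described above, assuming a contrastive-style loss: pull each class's text feature toward the centroid of that class's image features, and keep different classes' text features apart. The loss form, margin, and names are illustrative, not ITFAB's exact formulation.

```python
import torch
import torch.nn.functional as F

def itfab_style_loss(text_feats, image_feats, labels, margin=0.5):
    """
    text_feats : (C, D) one L2-normalized text feature per class
    image_feats: (N, D) L2-normalized image features of the support set
    labels     : (N,)   class index of each image
    """
    C, D = text_feats.shape
    # Centroid of the image features of each class
    centroids = torch.zeros(C, D, device=image_feats.device)
    centroids.index_add_(0, labels, image_feats)
    counts = torch.bincount(labels, minlength=C).clamp(min=1).unsqueeze(1)
    centroids = F.normalize(centroids / counts, dim=-1)

    # (i) Attraction: each text feature moves toward its class centroid
    attract = (1 - F.cosine_similarity(text_feats, centroids, dim=-1)).mean()

    # (ii) Repulsion: keep different classes' text features at least `margin` apart
    sim = text_feats @ text_feats.t()                                   # (C, C)
    off_diag = sim.masked_fill(
        torch.eye(C, dtype=torch.bool, device=sim.device), 0.0)
    repel = F.relu(off_diag - margin).sum() / max(C * (C - 1), 1)

    return attract + repel
```

In a prompt learning setting, `text_feats` would come from the learnable prompts passed through the text encoder, so gradients from this loss shape the prompts directly even with a single image per class.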
ISBN (print): 9798350365474
Point clouds are a critical 3D representation with many emerging applications. Because of point sparsity and irregularity, high-quality rendering of point clouds is challenging and often requires complex computations to recover a continuous surface representation. On the other hand, to avoid visual discomfort, the motion-to-photon latency must be very short, under 10 ms. Existing rendering solutions lack either quality or speed. To tackle these challenges, we present a framework that unlocks interactive, free-viewing and high-fidelity point cloud rendering. We train a generic neural network to estimate 3D elliptical Gaussians from arbitrary point clouds and use differentiable surface splatting to render smooth texture and surface normals for arbitrary views. Our approach requires no per-scene optimization and enables real-time rendering of dynamic point clouds. Experimental results demonstrate that the proposed solution enjoys superior visual quality and speed, as well as generalizability to different scene content and robustness to compression artifacts.
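To make the pipeline concrete, here is a hedged sketch of the per-point prediction step: a small network regresses scale, rotation, and opacity for an elliptical Gaussian at each point, and a helper assembles the covariance that a differentiable splatting rasterizer would consume. The network size and parameterization are assumptions; the rasterizer itself is out of scope here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianPredictor(nn.Module):
    """Regresses per-point elliptical-Gaussian parameters from raw points."""
    def __init__(self, in_dim=3, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 1),        # log-scales, quaternion, opacity
        )

    def forward(self, points):                   # (N, 3)
        out = self.mlp(points)
        scales = out[:, :3].exp()                # positive axis lengths
        quats = F.normalize(out[:, 3:7], dim=-1) # unit quaternion rotation
        opacity = torch.sigmoid(out[:, 7:])
        return scales, quats, opacity

def quat_to_cov(quats, scales):
    """Covariance of each Gaussian: Sigma = R diag(s)^2 R^T."""
    w, x, y, z = quats.unbind(-1)
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], dim=-1).view(-1, 3, 3)
    return R @ torch.diag_embed(scales ** 2) @ R.transpose(-1, -2)

points = torch.randn(1024, 3)
scales, quats, opacity = GaussianPredictor()(points)
cov = quat_to_cov(quats, scales)                 # (1024, 3, 3), ready to splat
```

Because the predictor is a generic network over points rather than per-scene parameters, it matches the abstract's claim of generalizing without per-scene optimization.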
ISBN (print): 9798350353013; 9798350353006
Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present PEEKABOO, a novel masked attention module that seamlessly integrates with current video generation models, offering control without additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that PEEKABOO achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage.
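A minimal sketch of the general masked-attention mechanism the abstract names, assuming a binary spatio-temporal mask that partitions tokens into foreground and background. It illustrates how a user mask can steer attention with no extra parameters or training; it is not PEEKABOO's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, region_mask):
    """
    q, k, v     : (B, T, D) flattened spatio-temporal tokens
    region_mask : (B, T) bool, True where the user wants the subject placed
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)       # (B, T, T)
    # Foreground tokens attend to foreground, background to background;
    # the diagonal is always allowed, so no row is fully masked out.
    allowed = region_mask.unsqueeze(2) == region_mask.unsqueeze(1)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Since the bias is applied inside existing attention layers, this style of control adds no parameters and essentially no latency, consistent with the training-free claim above.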
ISBN (print): 9798350353006
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite its importance, the visual projector has been relatively underexplored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, while achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.
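A hedged sketch of a projector with the two stated properties, assuming convolutions for locality preservation and adaptive pooling for a controllable token count; the layer sizes and structure are illustrative stand-ins for the paper's design.

```python
import torch
import torch.nn as nn

class LocalityAwareProjector(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, out_tokens=144):
        super().__init__()
        side = int(out_tokens ** 0.5)                  # e.g. 144 -> 12x12 grid
        self.conv = nn.Sequential(                     # keeps local context
            nn.Conv2d(vis_dim, vis_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(vis_dim, vis_dim, 3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(side)         # controls token count
        self.proj = nn.Linear(vis_dim, llm_dim)        # into LLM embedding space

    def forward(self, vis_tokens):
        # vis_tokens: (B, N, vis_dim) where N forms a square grid (e.g. 24x24)
        B, N, D = vis_tokens.shape
        side_in = int(N ** 0.5)
        x = vis_tokens.transpose(1, 2).view(B, D, side_in, side_in)
        x = self.pool(self.conv(x))                    # (B, D, side, side)
        return self.proj(x.flatten(2).transpose(1, 2)) # (B, out_tokens, llm_dim)
```

Changing `out_tokens` trades spatial detail for LLM context budget, which is exactly the flexibility-versus-efficiency lever described in the abstract.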
ISBN (print): 9798350365474
Image copy detection is one of the pivotal tools for safeguarding online information integrity. The challenge lies in determining whether a query image is an edited copy, which necessitates identifying candidate source images through a retrieval process. This process requires discriminative features comprising both global descriptors, designed to be augmentation-invariant, and local descriptors that capture salient foreground objects, to assess whether a query image is an edited copy of some source reference image. This work describes an end-to-end solution that leverages a vision Transformer model to learn such discriminative features and perform implicit matching between the query image and the reference image. Experimental results on two benchmark datasets demonstrate that the proposed solution outperforms state-of-the-art methods. Case studies illustrate the effectiveness of our approach in matching reference images from which the query images have been copy-edited.
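As a hedged sketch of how the two descriptor types could be combined at retrieval time: a global cosine similarity plus an average over best-matching local patches. The fusion weight and scoring scheme are assumptions for illustration, not the paper's implicit-matching head.

```python
import torch
import torch.nn.functional as F

def copy_score(q_cls, q_patches, r_cls, r_patches, w=0.5):
    """
    q_cls, r_cls         : (D,)   global descriptors of query / reference
    q_patches, r_patches : (N, D) local patch descriptors (e.g. ViT tokens)
    """
    global_sim = F.cosine_similarity(q_cls, r_cls, dim=0)
    # Local evidence: best reference match per query patch, averaged
    sim = F.normalize(q_patches, dim=-1) @ F.normalize(r_patches, dim=-1).t()
    local_sim = sim.max(dim=1).values.mean()
    return w * global_sim + (1 - w) * local_sim
```

The global term survives whole-image augmentations, while the local term can still fire when only a cropped or pasted foreground object is shared between the pair.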
ISBN (print): 9798350353013; 9798350353006
Automated driving fundamentally requires knowledge of the surrounding scene geometry. Modern approaches use only captured images to predict occupancy maps that represent this geometry. Training these approaches requires accurate data that may be acquired with the help of LiDAR scanners. We show that the techniques used by current benchmarks and training datasets to convert LiDAR scans into occupancy grid maps yield maps of very low quality, and we subsequently present a novel approach using evidence theory that yields more accurate reconstructions. We demonstrate that these are superior by a large margin, both qualitatively and quantitatively, and that we additionally obtain meaningful uncertainty estimates. When converting the occupancy maps back to depth estimates and comparing them with the raw LiDAR measurements, our method yields an MAE improvement of 30% to 52% on nuScenes and 53% on Waymo over other occupancy ground-truth data. Finally, we use the improved occupancy maps to train a state-of-the-art occupancy prediction method and demonstrate that this improves the MAE by 25% on nuScenes.
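For intuition, a minimal sketch of evidence-theory fusion on a single grid cell, using Dempster's rule over masses for free, occupied, and unknown; how rays are traced into cells, and the paper's exact evidence model, are simplified away.

```python
def combine(m1, m2):
    """Dempster's rule for masses m = (m_free, m_occupied, m_unknown)."""
    f1, o1, u1 = m1
    f2, o2, u2 = m2
    conflict = f1 * o2 + o1 * f2                     # contradictory evidence
    norm = 1.0 - conflict
    f = (f1 * f2 + f1 * u2 + u1 * f2) / norm
    o = (o1 * o2 + o1 * u2 + u1 * o2) / norm
    return (f, o, 1.0 - f - o)

# A cell seen as free by one scan and occupied by another stays genuinely
# uncertain instead of being forced toward 0 or 1:
print(combine((0.8, 0.0, 0.2), (0.0, 0.8, 0.2)))    # ~ (0.44, 0.44, 0.11)
```

The leftover "unknown" mass is what provides the meaningful uncertainty estimates the abstract mentions, something a plain probabilistic occupancy update collapses away.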
ISBN (print): 9798350365474
Detecting various types of stresses (nutritional, water, nitrogen, etc.) in agricultural fields is critical for farmers to ensure maximum productivity. However, stresses show up in different shapes and sizes across different crop types and varieties. Hence, this is posed as an anomaly detection task in agricultural images. Accurate anomaly detection in agricultural UAV images is vital for early identification of field irregularities. Traditional supervised learning faces challenges in adapting to diverse anomalies, necessitating extensive annotated data. In this work, we overcome this limitation with self-supervised learning using a masked image modeling approach. Masked Autoencoders (MAE) extract meaningful normal features from unlabeled image samples, which produces high reconstruction errors for abnormal pixels during reconstruction. To remove the need to train on only "normal" data, we use an anomaly suppression loss that effectively minimizes the reconstruction of anomalous pixels and allows the model to handle anomalous areas without explicitly separating "normal" images for training. Evaluation on the Agriculture-Vision data challenge shows a 6.3% mIoU improvement over the prior state of the art among unsupervised and self-supervised methods. A single model generalizes across all the anomaly categories in the Agri-Vision Challenge Dataset [5].
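A hedged sketch of an anomaly-suppression reconstruction loss in the spirit described above: pixels with outlier reconstruction error are assumed anomalous and dropped from the objective, so the autoencoder keeps modeling only typical field appearance even when anomalies are present in the training batch. The z-score gating is an illustrative choice, not the paper's exact mechanism.

```python
import torch

def suppressed_recon_loss(recon, target, tau=2.0):
    err = (recon - target) ** 2                      # per-pixel squared error
    z = (err - err.mean()) / (err.std() + 1e-6)      # standardized error
    weight = (z < tau).float()                       # gate out outlier pixels
    return (weight * err).sum() / weight.sum().clamp(min=1.0)
```

At inference, the unweighted per-pixel reconstruction error itself serves as the anomaly map, since the model has only learned to reproduce normal appearance.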
ISBN (print): 9798350365474
This paper discusses strategies for object detection in marine images from the perspective of a practitioner working with real-world, long-tail-distributed datasets and a large amount of additional unlabeled data on hand. The paper discusses the benefits of separating the localization and classification stages, making the case for robust localization through the amalgamation of additional datasets, inspired by an approach widely used by practitioners in the camera-trap literature. For the classification stage, the paper compares strategies for using the additional unlabeled data: supervised, iteratively supervised, self-supervised, and semi-supervised pre-training. Our findings reveal that semi-supervised pre-training, followed by supervised fine-tuning, yields significantly better balanced performance across the long-tail distribution, albeit occasionally with a trade-off in overall accuracy. These insights are validated through experiments on two real-world long-tailed underwater datasets collected by the Monterey Bay Aquarium Research Institute (MBARI).
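As a concrete reading of the semi-supervised recipe, here is a minimal FixMatch-style training step, a named stand-in rather than necessarily the paper's method: a supervised loss on the labeled batch plus a pseudo-label loss on confident unlabeled predictions. In practice weak/strong augmentation pairs would be used; `model` and the loaders are placeholders.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, x_lab, y_lab, x_unlab, threshold=0.95):
    # Supervised term on the small labeled long-tail set
    sup = F.cross_entropy(model(x_lab), y_lab)

    # Pseudo-label the unlabeled batch; keep only confident predictions
    with torch.no_grad():
        probs = F.softmax(model(x_unlab), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold
    if keep.any():
        unsup = F.cross_entropy(model(x_unlab[keep]), pseudo[keep])
    else:
        unsup = torch.zeros((), device=x_lab.device)
    return sup + unsup
```

Pre-training of this kind lets abundant unlabeled frames shape the representation before fine-tuning on the scarce tail classes, which is where the balanced-performance gains reported above come from.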
ISBN (print): 9798350365474
In this research, a novel approach for autonomous spacecraft navigation, particularly in lunar contexts, is presented, focusing on vision-based techniques. The system combines lunar crater recognition with feature tracking to enhance the accuracy of spacecraft navigation. It underwent comprehensive evaluation in a purpose-built software simulation that replicates lunar conditions, allowing thorough testing and refinement. The methodology integrates established navigational methods with modern artificial intelligence algorithms, resulting in significant navigational accuracy. The system determines the spacecraft position with an average accuracy of approximately 270 m in the absolute navigation mode, while the relative mode exhibits average errors of 27.4 m and 0.8 m in the horizontal and vertical lander displacements relative to the terrain. Initial tests on embedded systems, akin to those on board spacecraft, were also conducted. These tests are pivotal in demonstrating the system's operational viability within the limited-bandwidth and rapid-processing constraints characteristic of space missions. The promising results suggest potential applicability in real-world space missions, enhancing autonomous navigation capabilities in lunar and potentially other extraterrestrial environments.
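Assuming crater detection and catalog matching are done upstream, an absolute position fix of the kind described reduces to a standard perspective-n-point (PnP) solve, sketched below with OpenCV; the function name and frames are illustrative, not the paper's implementation.

```python
import numpy as np
import cv2

def absolute_fix(crater_px, crater_xyz, K):
    """
    crater_px  : (N, 2) image coordinates of matched crater centers
    crater_xyz : (N, 3) the same craters in a lunar-fixed frame, in meters
    K          : (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec = cv2.solvePnP(
        crater_xyz.astype(np.float64), crater_px.astype(np.float64),
        K.astype(np.float64), None, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed; needs >= 4 well-spread matches")
    R, _ = cv2.Rodrigues(rvec)
    return (-R.T @ tvec).ravel()   # spacecraft position in the lunar-fixed frame
```

The relative mode would instead integrate frame-to-frame feature tracks, which explains why its horizontal and vertical displacement errors (27.4 m and 0.8 m) are far smaller than the roughly 270 m absolute fix.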