Search Results - Inner Mongolia University Library

Diverse Consensuses Paired with Motion Estimation-Based Multi-Model Fitting


32nd ACM International Conference on Multimedia, MM 2024

Authors: Yin, Wenyu; Lin, Shuyuan; Lu, Yang; Wang, Hanzi. Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; College of Cyber Security, College of Information Science and Technology, Jinan University, Guangzhou, China

ISBN: (print) 9798400706868

Multi-model fitting aims to robustly estimate the parameters of various model instances in data contaminated by noise and outliers. Most previous works employ only a single type of consensus or implicit fusion model to represent the correlation between data points and model hypotheses. This approach often results in unrealistic and incorrect model fitting in the presence of noise and uncertainty. In this paper, we propose a novel method of diverse Consensuses paired with Motion estimation-based multi-Model Fitting (CMMF), which leverages three types of diverse consensuses along with inter-model collaboration to enhance the effectiveness of multi-model fusion. We design a Tangent Consensus Residual Reconstruction (TCRR) module to capture motion structure information of two points at the pixel level. Additionally, we introduce a Cross Consensus Affinity (CCA) framework to strengthen the correlation between data points and model hypotheses. To address the challenge of multi-body motion estimation, we propose a Nested Consensus Clustering (NCC) strategy, which formulates multi-model fitting as a motion estimation problem. It explicitly establishes motion collaboration between models and ensures that multiple models are well-fitted. Extensive quantitative and qualitative experiments are conducted on four public datasets (i.e., AdelaideRMF-F, Hopkins155, KITTI, MTPV62), and the results demonstrate that our proposed method outperforms several state-of-the-art methods. © 2024 ACM.
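As a rough illustration of the "consensus" between data points and model hypotheses that the abstract above refers to, here is a minimal sketch of a residual-based consensus matrix for line hypotheses. It is not the CMMF method: the line model, the toy points, and the names `point_line_residual`, `consensus_matrix`, and `inlier_threshold` are all assumptions made for this example.

```python
import math


def point_line_residual(point, line):
    """Perpendicular distance from (x, y) to the line a*x + b*y + c = 0."""
    x, y = point
    a, b, c = line
    return abs(a * x + b * y + c) / math.hypot(a, b)


def consensus_matrix(points, hypotheses, inlier_threshold):
    """Entry [i][j] is 1 when point i lies within the threshold of hypothesis j."""
    return [
        [1 if point_line_residual(p, h) <= inlier_threshold else 0 for h in hypotheses]
        for p in points
    ]


# Toy data (assumed): three points near y = x, three near y = 2, one outlier.
points = [
    (0.0, 0.0), (1.0, 1.05), (3.0, 2.95),
    (0.0, 2.0), (1.0, 2.02), (4.0, 1.98),
    (5.0, -4.0),
]
hypotheses = [(1.0, -1.0, 0.0), (0.0, 1.0, -2.0)]  # y = x and y = 2

matrix = consensus_matrix(points, hypotheses, inlier_threshold=0.1)
# Number of points supporting each hypothesis (its consensus set size).
support = [sum(row[j] for row in matrix) for j in range(len(hypotheses))]
print(support)
```

Each column of `matrix` is the consensus set of one hypothesis, and the all-zero row of the last point marks it as an outlier with respect to both lines.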

Keywords: Motion estimation

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models


arXiv, 2024

Authors: Wu, Mingrui; Ji, Jiayi; Huang, Oucheng; Li, Jiale; Wu, Yuhang; Sun, Xiaoshuai; Ji, Rongrong. Affiliation: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China

The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset’s long-tail distribution significantly impacts LVLMs’ understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information. Github: https://***/mrwu-mac/R-Bench. © 2024, CC BY.

Keywords: Visual languages

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings


arXiv, 2024

Authors: Wu, Qiong; Lin, Wenhao; Ye, Weihao; Zhou, Yiyi; Sun, Xiaoshuai; Ji, Rongrong. Affiliation: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China

The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a bunch of benchmarks. The experiment results not only show the effectiveness of our VTE in improving MLLMs' efficiency, but also yield the general modeling patterns of MLLMs, well facilitating the in-depth understanding of MLLMs. Our code is anonymously released at https://***/DoubtedSteam/DyVTE. © 2024, CC BY.
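The idea of removing all visual tokens after a certain layer can be sketched as a toy loop that only tracks how many tokens each layer has to process. This is an illustrative sketch, not the DyVTE implementation: `should_exit` is a hypothetical callback standing in for the learned decision module, and no real transformer layer is applied.

```python
def run_layers_with_visual_exit(tokens, num_layers, should_exit):
    """Return the number of tokens each layer sees.

    tokens: list of (kind, value) pairs, where kind is "visual" or "text".
    should_exit: hypothetical callback(layer_index, text_tokens) -> bool,
    a stand-in for a learned module that inspects the text-token state.
    """
    processed_per_layer = []
    visual_removed = False
    for layer in range(num_layers):
        processed_per_layer.append(len(tokens))
        # A real model would apply the transformer layer to `tokens` here.
        if not visual_removed:
            text_tokens = [t for t in tokens if t[0] == "text"]
            if should_exit(layer, text_tokens):
                tokens = text_tokens  # drop every visual token at once
                visual_removed = True
    return processed_per_layer


# Assumed toy input: 6 visual tokens followed by 2 text tokens.
tokens = [("visual", i) for i in range(6)] + [("text", i) for i in range(2)]
counts = run_layers_with_visual_exit(
    tokens, num_layers=4, should_exit=lambda layer, text: layer >= 1
)
print(counts)
```

With the assumed rule that fires at layer 1, the first two layers process all eight tokens and the remaining layers process only the two text tokens, which is where the compute saving would come from.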

Keywords: Visual BASIC

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM


32nd ACM International Conference on Multimedia, MM 2024

Authors: Gao, Timin; Chen, Peixian; Zhang, Mengdan; Fu, Chaoyou; Shen, Yunhang; Zhang, Yan; Zhang, Shengchuan; Zheng, Xiawu; Sun, Xing; Cao, Liujuan; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; Tencent Youtu Lab, Shanghai, China; State Key Laboratory for Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China

ISBN: (print) 9798400706868

With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, the visual reasoning problem is usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of the potential "determining hallucinations" in decision generation due to insufficient visual information and the limitation of low-level perception tools that fail to provide abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page: https://***/cantor/. © 2024 ACM.

Keywords: Chains

CaM: cache merging for memory-efficient LLMs inference


Proceedings of the 41st International Conference on Machine Learning

Authors: Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, Rongrong Ji. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University and Peng Cheng Laboratory; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; University of Texas at Austin; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University and University of Oxford and Eindhoven University of Technology; Institute of Artificial Intelligence, Xiamen University

Despite the exceptional performance of Large Language Models (LLMs), the substantial volume of key-value (KV) pairs cached during inference presents a barrier to their efficient deployment. To ameliorate this, recent works have aimed to selectively eliminate these caches, informed by the attention scores of associated tokens. However, such cache eviction invariably leads to output perturbation, regardless of the token choice. This perturbation escalates with the compression ratio, which can precipitate a marked deterioration in LLM inference performance. This paper introduces Cache Merging (CaM) as a solution to mitigate this challenge. CaM adaptively merges to-be-evicted caches into the remaining ones, employing a novel sampling strategy governed by the prominence of attention scores within discarded locations. In this manner, CaM enables memory-efficient LLMs to preserve critical token information, even obviating the need to maintain their corresponding caches. Extensive experiments utilizing LLaMA, OPT, and GPT-NeoX across various benchmarks corroborate CaM's proficiency in bolstering the performance of memory-efficient LLMs. Code is released at https://***/zyxxmu/cam.
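To make the contrast between evicting a cache entry and merging it into the remaining ones concrete, the following toy sketch compares the attention output before and after each operation. It is not the CaM algorithm: the paper's sampling strategy is not reproduced, and the blending rule in `merge_then_evict` is an assumption chosen so that the output for this single attention distribution is preserved exactly. All function names are hypothetical.

```python
def attention_output(weights, values):
    """Weighted sum of value vectors; weights are assumed to sum to 1."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def evict(weights, values, j):
    """Drop entry j and renormalize the remaining attention weights."""
    rest = 1.0 - weights[j]
    kept_w = [w / rest for i, w in enumerate(weights) if i != j]
    kept_v = [v for i, v in enumerate(values) if i != j]
    return kept_w, kept_v


def merge_then_evict(weights, values, j):
    """Blend entry j into every kept value vector, then drop it (toy rule)."""
    a_j, v_j = weights[j], values[j]
    kept_w, kept_v = evict(weights, values, j)
    merged_v = [
        [(1.0 - a_j) * x + a_j * y for x, y in zip(v, v_j)] for v in kept_v
    ]
    return kept_w, merged_v


# Assumed toy attention weights and 2-dimensional value vectors.
weights = [0.5, 0.3, 0.2]
values = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]

full = attention_output(weights, values)
evicted = attention_output(*evict(weights, values, 2))
merged = attention_output(*merge_then_evict(weights, values, 2))
print(full, evicted, merged)
```

In this toy setting plain eviction shifts the output, while the merged cache reproduces the full output with one fewer entry. The exact match holds only for the attention weights used at merge time; later queries with different weights would still see some perturbation.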

Keywords:

FocSAM: Delving Deeply into Focused Objects in Segmenting Anything


arXiv, 2024

Authors: Huang, You; Lan, Zongyu; Cao, Liujuan; Lin, Xianming; Zhang, Shengchuan; Jiang, Guannan; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China China

The Segment Anything Model (SAM) marks a notable milestone in segmentation models, highlighted by its robust zero-shot capabilities and ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder, ensuring efficient real-time performance. However, under this pipeline SAM faces stability issues on challenging samples. These issues arise from two main factors. First, the image preprocessing prevents SAM from dynamically using image-level zoom-in strategies to refocus on the target object during interaction. Second, the lightweight decoder struggles to sufficiently integrate interactive information with image embeddings. To address these two limitations, we propose FocSAM with a pipeline redesigned on two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM’s image embeddings on the target object. Dwin-MSA localizes attention computations around the target object, enhancing object-related embeddings with minimal computational overhead. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks that have significant impacts on the overall segmentation results. Experimentally, FocSAM augments SAM’s interactive segmentation performance to match the existing state-of-the-art method in segmentation quality, requiring only about 5.6% of this method’s inference time on CPUs. Code is available at https://***/YouHuang67/focsam. © 2024, CC BY.

Keywords: Pipelines

Any-to-3D Generation via Hybrid Diffusion Supervision


arXiv, 2024

Authors: Fan, Yijun; Ma, Yiwei; Ji, Jiayi; Sun, Xiaoshuai; Ji, Rongrong. Affiliation: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China

Recent progress in 3D object generation has been fueled by the strong priors offered by diffusion models. However, existing models are tailored to specific tasks, accommodating only one modality at a time and necessitating retraining to change modalities. Given an image-to-3D model and a text prompt, a naive approach is to convert text prompts to images and then use the image-to-3D model for generation. This approach is both time-consuming and labor-intensive, resulting in unavoidable information loss during modality conversion. To address this, we introduce XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modalities, including text, images, and audio. We subsequently present a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities. Additionally, Hybrid Diffusion Supervision combined with a Three-Phase Optimization process improves the quality of the generated 3D objects. Extensive experiments showcase XBind’s broad generation capabilities in any-to-3D scenarios. To our knowledge, this is the first method to generate 3D objects from any modality prompts. Project page: https://***/. Copyright © 2024, The Authors. All rights reserved.

Keywords: 3D modeling

GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane


arXiv, 2024

Authors: Qu, Yansong; Dai, Shaohui; Li, Xinyang; Lin, Jianghang; Cao, Liujuan; Zhang, Shengchuan; Ji, Rongrong. Affiliation: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Fujian, China

3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI’s superiority over previous state-of-the-art methods. Our project page is available at https://***/GOI-Hyperplane/ . Copyright © 2024, The Authors. All rights reserved.
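The difference between selecting features with a fixed similarity threshold and selecting them with an adjustable hyperplane can be sketched on toy 2D features. This is only an illustration under stated assumptions, not GOI's procedure: the features, the target labels (standing in for masks that a segmentation model might supply), and the perceptron-style update in `fit_hyperplane` are all assumptions, since the abstract does not specify the optimization used.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_by_threshold(features, query, tau):
    """Fixed rule: keep features whose similarity to the query is at least tau."""
    return [dot(f, query) >= tau for f in features]


def select_by_hyperplane(features, w, b):
    """Keep features on the positive side of the hyperplane w.f + b = 0."""
    return [dot(f, w) + b > 0 for f in features]


def fit_hyperplane(features, labels, w, b, lr=0.1, epochs=50):
    """Perceptron updates: nudge (w, b) whenever a feature is on the wrong side."""
    w = list(w)
    for _ in range(epochs):
        for f, y in zip(features, labels):
            predicted = dot(f, w) + b > 0
            if predicted != y:
                sign = 1.0 if y else -1.0
                w = [wi + lr * sign * fi for wi, fi in zip(w, f)]
                b += lr * sign
    return w, b


# Assumed toy features, query embedding, and target labels.
features = [(0.9, 0.1), (0.8, 0.6), (0.7, -0.2), (0.1, 0.9)]
labels = [True, False, True, False]
query, tau = (1.0, 0.0), 0.75

by_threshold = select_by_threshold(features, query, tau)
# Start from the hyperplane equivalent to the threshold rule, then adjust it.
w, b = fit_hyperplane(features, labels, w=query, b=-tau)
by_hyperplane = select_by_hyperplane(features, w, b)
print(by_threshold, by_hyperplane)
```

On this toy data no single threshold on similarity to the query separates the target features from the rest, whereas the adjusted hyperplane can, because it is free to change its orientation as well as its offset.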

Keywords: Augmented reality

TextRefiner: Internal Visual Feature as efficient Refiner for Vision-Language Models Prompt Tuning


arXiv, 2024

Authors: Xie, Jingjing; Zhang, Yuxin; Peng, Jun; Huang, Zhaohong; Cao, Liujuan. Affiliation: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China

Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, thereby hindering the transferred performance for classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, yet incurring notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method to refine the text prompts of existing methods by leveraging the internal knowledge of VLMs. Particularly, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner can efficiently refine and enrich the learned prompts from existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66% to 76.94% on 11 benchmarks, surpassing CoCoOp which introduces instance-wise features for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance and is efficient in inference. Our code is released at https://***/xjjxmu/TextRefiner. © 2024, CC BY.

Keywords: Modeling languages