Search Results - Inner Mongolia University Library

Not All Attention is Needed: Parameter and Computation efficient Tuning for Multi-modal Large Language Models via Effective Attention Skipping

arXiv, 2024

Authors: Wu, Qiong; Ye, Weihao; Zhou, Yiyi; Sun, Xiaoshuai; Ji, Rongrong (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Institute of Artificial Intelligence, Xiamen University, 361005, China)

Recently, Multi-modal Large Language Models (MLLMs) have garnered an influx of interest from both academia and industry. However, for downstream task applications, MLLMs not only require updating a large number of parameters but also consume excessive computation. In this paper, we propose a novel parameter and computation efficient tuning method for MLLMs, termed Effective Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs) in MLLMs, the primary source of computation, are often redundant for downstream tasks. Based on this observation, EAS evaluates attention redundancy and skips the less important MHAs to speed up inference. Besides, we also propose a novel propagation-of-information adapter (PIA) to serve the attention skipping while maintaining parameter efficiency. More importantly, PIA can be further re-parameterized into feed-forward networks (FFNs) for zero extra latency. To validate EAS, we apply it to a recently proposed MLLM called LaVIN, and conduct extensive experiments on a vision-language benchmark, namely ScienceQA. The experimental results show that EAS can not only retain the high performance of LaVIN but also greatly reduce the scale of updated parameters while substantially speeding up inference. For instance, LaVIN-EAS can obtain 89.98% accuracy while accelerating inference by 2.2 times. Our code is available at https://***/DoubtedSteam/EAS © 2024, CC BY.

Keywords: Visual languages

Cross-Modality Perturbation Synergy Attack for Person Re-identification

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Gong, Yunpeng; Zhong, Zhun; Qu, Yansong; Luo, Zhiming; Ji, Rongrong; Jiang, Min (School of Informatics, Xiamen University, China; School of Computer Science and Information Engineering, Hefei University of Technology, China; The Department of Artificial Intelligence, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, Xiamen University, Fujian, Xiamen 361005, China)

In recent years, there has been significant research focusing on addressing security concerns in single-modal person re-identification (ReID) systems that are based on RGB images. However, the safety of cross-modality scenarios, which are more commonly encountered in practical applications involving images captured by infrared cameras, has not received adequate attention. The main challenge in cross-modality ReID lies in effectively dealing with visual differences between different modalities. For instance, infrared images are typically grayscale, unlike visible images that contain color information. Existing attack methods have primarily focused on the characteristics of the visible image modality, overlooking the features of other modalities and the variations in data distribution among different modalities. This oversight can potentially undermine the effectiveness of these methods in image retrieval across diverse modalities. This study represents the first exploration into the security of cross-modality ReID models and proposes a universal perturbation attack specifically designed for cross-modality ReID. This attack optimizes perturbations by leveraging gradients from diverse modality data, thereby disrupting the discriminator and reinforcing the differences between modalities. We conducted experiments on three widely used cross-modality datasets, namely RegDB, SYSU, and LLCM. The results not only demonstrate the effectiveness of our method but also provide insights for future improvements in the robustness of cross-modality ReID systems. © 2024 Neural information processing systems foundation. All rights reserved.

The dormant neuron phenomenon in multi-agent reinforcement learning value factorization

Proceedings of the 38th International Conference on Neural Information Processing Systems

Authors: Haoyuan Qin; Chennan Ma; Mian Deng; Zhengzhu Liu; Songzhu Mei; Xinwang Liu; Cheng Wang; Siqi Shen (Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China, and Key Laboratory of Multimedia Trusted Perception and Efficient Computing, XMU, China; School of Computer, National University of Defense Technology, China)

ISBN: (print) 9798331314385

In this work, we study the dormant neuron phenomenon in multi-agent reinforcement learning value factorization, where the mixing network suffers from reduced network expressivity caused by an increasing number of inactive neurons. We demonstrate the presence of the dormant neuron phenomenon across multiple environments and algorithms, and show that this phenomenon negatively affects the learning process. We show that dormant neurons correlate with the existence of over-active neurons, which have large activation scores. To address the dormant neuron issue, we propose ReBorn, a simple but effective method that transfers the weights from over-active neurons to dormant neurons. We theoretically show that this method can ensure the learned action preferences are not forgotten after the weight-transferring procedure, which increases learning effectiveness. Our extensive experiments reveal that ReBorn achieves promising results across various environments and improves the performance of multiple popular value factorization approaches. The source code of ReBorn is available at https://***/xmu-rl-3dv/ReBorn.
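
The abstract above refers to dormant neurons and to over-active neurons with large activation scores, but does not give the scoring formula. The following is a minimal sketch under an assumed, commonly used definition: a neuron's score is its mean absolute activation divided by the layer-wide average of that quantity. The function names, the thresholds (0.1 and 3.0), and the sample batch are hypothetical and are not taken from the paper; the weight-transfer step of ReBorn is not shown because the abstract does not specify it.

```python
from statistics import mean


def activation_scores(activations):
    """Return one normalized score per neuron for a single layer.

    activations: list of samples, each a list of per-neuron activations.
    Assumed definition (not from the abstract): mean |activation| of the
    neuron divided by the average of that value over all neurons in the layer.
    """
    n_neurons = len(activations[0])
    mean_abs = [mean(abs(sample[i]) for sample in activations)
                for i in range(n_neurons)]
    layer_avg = mean(mean_abs)
    if layer_avg == 0:
        # Every neuron is silent; report zero scores instead of dividing by zero.
        return [0.0] * n_neurons
    return [m / layer_avg for m in mean_abs]


def classify_neurons(scores, dormant_threshold=0.1, overactive_threshold=3.0):
    """Split neuron indices into dormant and over-active sets.

    Both thresholds are illustrative assumptions.
    """
    dormant = [i for i, s in enumerate(scores) if s <= dormant_threshold]
    overactive = [i for i, s in enumerate(scores) if s >= overactive_threshold]
    return dormant, overactive


# Hypothetical activations: 2 samples, 4 neurons in one layer.
batch = [
    [0.0, 0.5, 4.0, 0.3],
    [0.0, 0.7, 6.0, 0.1],
]
scores = activation_scores(batch)
dormant, overactive = classify_neurons(scores)
print("dormant:", dormant, "over-active:", overactive)
```

Under this normalization the scores of a layer average to 1, so a single threshold can be reused across layers of different widths.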

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

arXiv, 2024

Authors: Ye, Weihao; Wu, Qiong; Lin, Wenhao; Zhou, Yiyi (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Institute of Artificial Intelligence, Xiamen University, 361005, China)

Recent Multimodal Large Language Models (MLLMs) often use large numbers of image tokens to compensate for their visual shortcomings, which not only exhibits obvious redundancy but also greatly exacerbates the already high computation. Token pruning is an effective solution for speeding up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel and training-free approach for the effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for MLLMs according to a predefined budget. Specifically, FitPrune considers token pruning as a statistical problem of MLLM and its objective is to find an optimal pruning scheme that can minimize the divergence of the attention distributions before and after pruning. In practice, FitPrune can be quickly accomplished based on the attention statistics from a small batch of inference data, avoiding the expensive trials of MLLMs. According to the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can greatly reduce the computational complexity while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://***/ywh187/FitPrune. Copyright © 2024, The Authors. All rights reserved.

Keywords: Budget control
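
The FitPrune abstract frames pruning as minimizing the divergence between attention distributions before and after pruning, under a predefined budget, but it does not state which divergence is used or how the recipe is built. The toy sketch below is therefore only an illustration under explicit assumptions: the budget is a number of tokens to keep, the attention statistics are a single hypothetical weight per visual token, and the divergence is the KL divergence from the renormalized kept distribution to the original one, which reduces to the negative log of the kept attention mass. The function name `prune_to_budget` and the sample numbers are invented for this example.

```python
import math


def prune_to_budget(attention, keep):
    """Keep the `keep` most-attended tokens and report the resulting divergence.

    attention: non-negative attention weight per visual token (hypothetical
               statistics, e.g. averaged over a small batch).
    keep:      token budget (assumption: budget expressed as a token count).
    Returns (sorted kept indices, KL(q || p)), where p is the normalized
    original distribution and q is p restricted to kept tokens, renormalized.
    """
    if not 1 <= keep <= len(attention):
        raise ValueError("keep must be between 1 and the number of tokens")
    total = sum(attention)
    p = [a / total for a in attention]
    # Rank tokens from most to least attended.
    ranked = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    kept = sorted(ranked[:keep])
    kept_mass = sum(p[i] for i in kept)
    # With q_i = p_i / kept_mass on kept tokens, KL(q || p) = -log(kept_mass).
    divergence = -math.log(kept_mass)
    return kept, divergence


# Hypothetical attention statistics for five visual tokens.
attention = [0.40, 0.05, 0.30, 0.05, 0.20]
kept, div = prune_to_budget(attention, keep=3)
print(kept, round(div, 4))
```

For a fixed token count, this particular divergence is smallest when the kept attention mass is largest, which is why the sketch simply keeps the top-ranked tokens.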

U-SAM: Upgrade Segment Anything Model With Semantic-Aware and Memory-efficient

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xiaofeng Jin; Jie Hu; Jianghang Lin; Shengchuan Zhang; Liujuan Cao (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, P.R. China; Learning and Vision Lab, National University of Singapore, Singapore)

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Segment Anything Model (SAM) has achieved remarkable success in the field of class-agnostic image segmentation by utilizing points or boxes as prompts. However, we identify two significant limitations when compared to traditional image segmentation models: (1) Trained in a category-agnostic interactive segmentation manner, SAM lacks the ability to discern object granularity and semantics, rendering it ineffective for traditional instance, semantic, and panoptic segmentation tasks. (2) SAM’s inefficient use of instance-independent visual features and tokens necessitates maintaining unique features and tokens for each instance, leading to excessive GPU memory consumption and diminished segmentation efficiency. To address these issues, we propose the Universal Segment Anything Model (U-SAM), a semantic-aware and memory-efficient segmentation model designed to perform both promptable and traditional segmentation tasks within a compact and unified framework. Specifically, U-SAM enhances SAM by integrating the Multi-Scale Semantic-Aware Image Encoder (S2IE), thus providing multi-scale semantic features for achieving traditional image segmentation tasks. Additionally, U-SAM is equipped with a Twin Token Mask Decoder (T2MD) which reduces GPU memory overhead by substituting replicated visual features with replicated tokens. Extensive experiments across interactive, instance, semantic, and panoptic segmentation demonstrate U-SAM’s promising results. Notably, U-SAM is 9× smaller and 10× faster than SAM, showing strong performance in zero-shot segmentation. Moreover, U-SAM surpasses the SOTA object-prompter-based model, RSPrompter, by achieving a 6.2% increase in PQ, operating 14× faster, and cutting training memory usage by 61%.

Keywords: Training; Image segmentation; Visualization; Semantics; Graphics processing units; Signal processing; Rendering (computer graphics); Decoding; Object recognition; Speech processing

Proposal Distillation of Multi-Modal Feature Aggregation Network for Video Object Detection

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zhenyu Qiu; Qiang Qi; Yang Lu; Yan Yan; Hanzi Wang (Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China; The Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China)

Video object detection is a challenging task due to deteriorated object appearances. To bolster per-frame feature representations, one way is to aggregate features from relevant frames. However, relying exclusively on the RGB modality for feature aggregation may limit detection performance owing to a lack of motion robustness. We propose a novel proposal distillation of multi-modal feature aggregation network (PDMAN). Specifically, it initially aligns the feature domain and flow domain via a lightweight flow module (LFM) and then facilitates frame-level feature aggregation. Subsequently, a global-based semantic embedding module (GSEM) is designed to incorporate global semantic features into instance features and introduces a global multi-label classification loss to guide encoding with high class-wise responsiveness. Finally, to alleviate the presence of insufficient and redundant information in multi-modal instance-level feature aggregation, a proposal distilled aggregation module (PDAM) is employed. By distilling the instance set, this approach realizes a fine-grained feature aggregation, ultimately boosting the detection performance. Experimental results demonstrate that the proposed PDMAN achieves a favorable result on the most representative large-scale ImageNet VID dataset.

SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation

arXiv, 2024

Authors: Yang, Danni; Ji, Jiayi; Ma, Yiwei; Guo, Tianyu; Wang, Haowei; Sun, Xiaoshuai; Ji, Rongrong (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, 361005, China; Youtu Lab, Tencent, Shanghai, China)

In this paper, we introduce SemiRES, a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform referring expression segmentation (RES). A significant hurdle in applying semi-supervised techniques to RES is the prevalence of noisy pseudo-labels, particularly at the boundaries of objects. SemiRES incorporates the Segment Anything Model (SAM), renowned for its precise boundary demarcation, to improve the accuracy of these pseudo-labels. Within SemiRES, we offer two alternative matching strategies: IoU-based Optimal Matching (IOM) and Composite Parts Integration (CPI). These strategies are designed to extract the most accurate masks from SAM's output, thus guiding the training of the student model with enhanced precision. In instances where a precise mask cannot be matched from the available candidates, we develop the Pixel-Wise Adjustment (PWA) strategy, guiding the student model's training directly by the pseudo-labels. Extensive experiments on three RES benchmarks (RefCOCO, RefCOCO+, and G-Ref) reveal its superior performance compared to fully supervised methods. Remarkably, with only 1% labeled data, our SemiRES outperforms the supervised baseline by a large margin, e.g. +18.64% gains on RefCOCO val set. The project code is available at https://***/nini0919/SemiRES. © 2024, CC BY-NC-ND.

Keywords: Benchmarking
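
The SemiRES abstract describes IoU-based matching between a noisy pseudo-label and candidate masks from SAM, with a fallback to the pseudo-label when no candidate matches well. The sketch below illustrates that general idea only. It assumes masks are represented as sets of (row, column) pixel coordinates and uses a hypothetical acceptance threshold `min_iou=0.5`; the function names are invented, and the fallback simply returns the pseudo-label unchanged, which is a simplification rather than the paper's Pixel-Wise Adjustment strategy.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of pixel coordinates."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def refine_pseudo_label(pseudo_label, candidates, min_iou=0.5):
    """Pick the candidate mask that best overlaps the pseudo-label.

    Returns (mask, best_iou). If no candidate reaches `min_iou`
    (an assumed threshold), the original pseudo-label is returned instead.
    """
    best_mask, best_score = None, 0.0
    for cand in candidates:
        score = iou(pseudo_label, cand)
        if score > best_score:
            best_mask, best_score = cand, score
    if best_mask is not None and best_score >= min_iou:
        return best_mask, best_score
    return pseudo_label, best_score


# Hypothetical 3-pixel pseudo-label and two candidate masks.
pseudo = {(0, 0), (0, 1), (1, 0)}
candidates = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5)}]
mask, score = refine_pseudo_label(pseudo, candidates)
print(sorted(mask), score)
```

Representing masks as coordinate sets keeps the example dependency-free; with array-based masks the same logic would use element-wise logical operations.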

Edge Guided Network with Motion Enhancement for Few-Shot Action Recognition

IEEE Transactions on Circuits and Systems for Video Technology, 2025, vol. 35, no. 6, pp. 5331-5342

Authors: Du, Kaiwen; Ye, Weirong; Guo, Hanyu; Yan, Yan; Wang, Hanzi (Huawei Technologies, Hangzhou 310051, China; Ministry of Education of China, Xiamen University, Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen 361005, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China)

Existing state-of-the-art methods for few-shot action recognition (FSAR) achieve promising performance by spatial and temporal modeling. However, most current methods ignore the importance of edge information and motion cues, leading to inferior performance. For the few-shot task, it is important to effectively explore limited data. Additionally, effectively utilizing edge information is beneficial for exploring motion cues, and vice versa. In this paper, we propose a novel edge guided network with motion enhancement (EGME) for FSAR. To the best of our knowledge, this is the first work to utilize the edge information as guidance in the FSAR task. Our EGME contains two crucial components, including an edge information extractor (EIE) and a motion enhancement module (ME). Specifically, EIE is used to obtain edge information on video frames. Afterward, the edge information is used as guidance to fuse with the frame features. In addition, ME can adaptively capture motion-sensitive features of videos. It adopts a self-gating mechanism to highlight motion-sensitive regions in videos from a large temporal receptive field. Based on the above designed components, EGME can capture edge information and motion cues, resulting in superior recognition performance. Experimental results on four challenging benchmarks show that EGME performs favorably against recent advanced methods. © 1991-2012 IEEE.

Keywords: Motion capture

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

arXiv, 2024

Authors: Yang, Danni; Dong, Ruohan; Ji, Jiayi; Ma, Yiwei; Wang, Haowei; Sun, Xiaoshuai; Ji, Rongrong (Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China; Youtu Lab., Tencent, Shanghai, China)

Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion’s architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps. The framework initially identifies anchor points using cross-attention mechanisms and subsequently performs segmentation with self-attention to achieve zero-shot PNG. Moreover, we introduce a refinement module based on SAM to enhance the quality of the segmentation masks. Our extensive experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting, conclusively proving the diffusion model’s capability for context-aware, phrase-level understanding. Source code is available at https://***/nini0919/DiffPNG. © 2024, CC BY-NC-ND.

Keywords: Zero-shot learning