Search Results - Inner Mongolia University Library

Adaptively feature matching via joint transformational-spatial clustering

Authors: Wang, Linbo; Tan, Li; Fang, Xianyong; Guo, Yanwen; Wan, Shaohua. Affiliations: MOE Key Laboratory of Intelligent Computing and Signal Processing, School of Computer Science and Technology, Anhui University, Hefei, China; National Key Lab for Novel Software Technology, Nanjing University, Nanjing, China; School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan, China

The transformational and spatial proximities are important cues for identifying inliers from an appearance based match set because correct matches generally stay close in input images and share similar local transformations. However, most existing approaches only check one type of them or both types consecutively with manually set thresholds, and thus their matching accuracy and flexibility in handling large-scale images are limited. In this paper, we present an efficient clustering based approach to identify match inliers with both proximities simultaneously. It first projects the putative matches into a joint transformational-spatial space, where mismatches tend to scatter all around while correct matches gather together. A mode-seeking process based on joint kernel density estimation is then proposed to obtain significant clusters in the joint space, where each cluster contains matches mapping the same object across images with high accuracy. Moreover, kernel bandwidths for measuring match proximities are adaptively set during density estimation, which enhances its applicability for matching different images. Experiments on three standard datasets show that the proposed approach delivers superior performance on a variety of feature matching tasks, including multi-object matching, duplicate object matching and object retrieval. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Keywords: Clustering; Density estimation; Feature matching; Mode-seeking
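
The mode-seeking step described in the abstract can be illustrated with plain Gaussian-kernel mean shift. This is only a minimal sketch, not the authors' algorithm: it assumes a fixed bandwidth (the paper sets bandwidths adaptively), and the 3-D match vectors `(x, y, rotation)`, the function name `mean_shift_labels`, and the minimum cluster size of 3 are all hypothetical choices made for illustration.

```python
import math

def mean_shift_labels(points, bandwidth, iters=50):
    """Label each point by the density mode it climbs to under Gaussian mean shift."""
    def shift(x):
        # Weighted mean of all points; weights come from a Gaussian kernel centred at x.
        weights = [math.exp(-math.dist(x, q) ** 2 / (2 * bandwidth ** 2)) for q in points]
        total = sum(weights)
        return tuple(sum(w * q[d] for w, q in zip(weights, points)) / total
                     for d in range(len(x)))

    modes, labels = [], []
    for p in points:
        x = tuple(p)
        for _ in range(iters):
            x = shift(x)
        # Reuse an existing mode if this point converged close to it.
        for k, m in enumerate(modes):
            if math.dist(x, m) < bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels

# Hypothetical joint-space vectors (x, y, rotation): two tight groups plus two stray mismatches.
matches = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.05), (0.1, 0.1, 0.0),
    (5.0, 5.0, 1.0), (5.1, 5.0, 1.0), (5.0, 5.1, 1.05), (5.1, 5.1, 1.0),
    (10.0, -3.0, 2.5), (-6.0, 8.0, -2.0),
]
labels = mean_shift_labels(matches, bandwidth=0.5)
# Keep only clusters with enough members; isolated mismatches form singleton modes.
significant = [k for k in set(labels) if labels.count(k) >= 3]
```

Matches that gather in the joint space end up sharing a mode, while scattered mismatches stay in singleton clusters and are dropped by the size filter.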

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

arXiv, 2024

Authors: Wang, Guanqun; Wei, Xinyu; Liu, Jiaming; Zhang, Ray; Zhang, Yichi; Zhang, Kevin; Chong, Maurice; Zhang, Shanghang. Affiliations: National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China; Shanghai AI Lab, China

In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal comprehension and vision perception synergistically. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs, like object detection bounding boxes, to capture subtle visual elements, thus enriching the understanding of both visual and textual data. In addition, an innovative perception-embedded prompt generation mechanism is proposed to embed perceptual information into the language model’s prompts, aligning the responses contextually and perceptually for a more accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM’s superior performance in various multimodal comprehension and vision perception tasks, particularly those requiring corner case vision perception and fine-grained language comprehension. © 2024, CC BY.

Keywords: Object detection

Decorrelate Irrelevant, Purify Relevant: Overcome Textual Spurious Correlations from a Feature Perspective

29th International Conference on Computational Linguistics, COLING 2022

Authors: Dou, Shihan; Zheng, Rui; Wu, Ting; Gao, Songyang; Shan, Junjie; Zhang, Qi; Wu, Yueming; Huang, Xuanjing. Affiliations: School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China; KTH Royal Institute of Technology, Stockholm, Sweden; Nanyang Technological University, Singapore

Natural language understanding (NLU) models tend to rely on spurious correlations (i.e., dataset bias) to achieve high performance on in-distribution datasets but perform poorly on out-of-distribution ones. Most existing debiasing methods identify and down-weight samples with biased features (i.e., superficial surface features that cause such spurious correlations). However, down-weighting these samples prevents the model from learning from their non-biased parts. To tackle this challenge, in this paper, we propose to eliminate spurious correlations in a fine-grained manner from a feature space perspective. Specifically, we introduce Random Fourier Features and weighted re-sampling to decorrelate the dependencies between features to mitigate spurious correlations. After obtaining decorrelated features, we further design a mutual-information-based method to purify them, which forces the model to learn features that are more relevant to tasks. Extensive experiments on two well-studied NLU tasks demonstrate that our method is superior to other comparative approaches. © 2022 Proceedings - International Conference on Computational Linguistics, COLING. All rights reserved.
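
For readers unfamiliar with the Random Fourier Features mentioned in the abstract, the sketch below shows only the standard feature map that approximates a Gaussian (RBF) kernel. It does not reproduce the paper's decorrelation or re-sampling procedure; the helper names `make_rff` and `rbf`, the feature count, and the bandwidth `sigma` are assumptions chosen for illustration.

```python
import math
import random

def make_rff(dim, n_features, sigma, seed=0):
    """Build a random feature map phi with phi(x)·phi(y) ≈ exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/sigma^2); phases are uniform on [0, 2*pi).
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]
    return phi

def rbf(x, y, sigma):
    """Exact Gaussian kernel, used here only to check the approximation."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

phi = make_rff(dim=2, n_features=4000, sigma=1.0)
x, y = (0.3, -0.2), (1.0, 0.5)
approx = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product in the random feature space
exact = rbf(x, y, 1.0)
```

Because the map turns a nonlinear kernel into an ordinary inner product, linear statistics computed on `phi(x)` can reflect nonlinear dependence between the original features, which is the usual motivation for using such features in decorrelation objectives.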

Cloud-Device Collaborative Learning for Multimodal Large Language Models

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Guanqun Wang; Jiaming Liu; Chenxuan Li; Yuan Zhang; Junpeng Ma; Xinyu Wei; Kevin Zhang; Maurice Chong; Renrui Zhang; Yijiang Liu; Shanghang Zhang. Affiliations: National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Shanghai AI Lab; Nanjing University

ISBN (digital): 9798350353006

ISBN (print): 9798350353013

The burgeoning field of Multimodal Large Language Models (MLLMs) has exhibited remarkable performance in diverse tasks such as captioning, commonsense reasoning, and visual scene understanding. However, the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters, leading to a notable decline in generalization capabilities when these models are compressed for device deployment. Addressing this challenge, we introduce a Cloud-Device Collaborative Continual Adaptation framework, designed to enhance the performance of compressed, device-deployed MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment. In the uplink phase, we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively filter out-of-distribution tokens, thereby reducing transmission costs and improving training efficiency. On the cloud side, we propose an Adapter-based Knowledge Distillation (AKD) method to transfer refined knowledge from large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic Weight update Compression (DWC) strategy for the downlink, which adaptively selects and quantizes updated weight parameters, enhancing transmission efficiency and reducing the representational disparity between cloud and device models. Extensive experiments on several multimodal benchmarks demonstrate the superiority of our proposed framework over prior Knowledge Distillation and device-cloud collaboration methods. Notably, we also validate the feasibility of our approach in real-world experiments.

Keywords: Performance evaluation; Training; Adaptation models; Visualization; Federated learning; Large language models; Collaboration
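
The abstract does not say how uncertainty is measured in the uplink sampling step, so the following is only one plausible reading, sketched under explicit assumptions: uncertainty is taken to be the Shannon entropy of a per-token predictive distribution, and the most uncertain tokens are the ones selected for upload. The function names and the `keep_ratio` parameter are hypothetical and do not come from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_uncertain_tokens(token_probs, keep_ratio):
    """Return indices of the highest-entropy tokens, in their original order.

    Assumption: high entropy marks tokens the device model handles poorly,
    so those are the ones worth transmitting to the cloud.
    """
    k = max(1, round(len(token_probs) * keep_ratio))
    ranked = sorted(range(len(token_probs)),
                    key=lambda i: entropy(token_probs[i]), reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-token distributions over four classes.
token_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
selected = select_uncertain_tokens(token_probs, keep_ratio=0.5)
```

Sending only the selected subset is what would reduce transmission cost in such a scheme; the actual selection criterion in the paper may differ.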

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

arXiv, 2023

Authors: Zhang, Dong; Li, Shimin; Zhang, Xin; Zhan, Jun; Wang, Pengyu; Zhou, Yaqian; Qiu, Xipeng. Affiliations: School of Computer Science, Fudan University, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown in https://***/***/. © 2023, CC BY.

Keywords: Large dataset

NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario

arXiv, 2023

Authors: Qian, Tianwen; Chen, Jingjing; Zhuo, Linhai; Jiao, Yang; Jiang, Yu-Gang. Affiliations: Academy for Engineering and Technology, Fudan University, China; Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, China

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://***/qiantianwen/NuScenes-QA. Copyright © 2023, The Authors. All rights reserved.

Keywords: Autonomous vehicles

Multijugate Dual Learning for Low-Resource Task-Oriented Dialogue System

arXiv, 2023

Authors: Li, Shimin; Zhang, Xiaotian; Zheng, Yanjun; Li, Linyang; Qiu, Xipeng. Affiliations: School of Computer Science, Fudan University, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China

Dialogue data in real scenarios tend to be sparse, leaving data-starved end-to-end dialogue systems inadequately trained. We discover that data utilization efficiency in low-resource scenarios can be enhanced by mining alignment information between uncertain utterances and deterministic dialogue states. Therefore, we innovatively implement dual learning in task-oriented dialogues to exploit the correlation of heterogeneous data. In addition, the one-to-one duality is converted into a multijugate duality to reduce the influence of spurious correlations in dual training and improve generalization. Without introducing additional parameters, our method can be implemented in arbitrary networks. Extensive empirical analyses demonstrate that our proposed method improves the effectiveness of end-to-end task-oriented dialogue systems under multiple benchmarks and obtains state-of-the-art results in low-resource scenarios. © 2023, CC BY-NC-ND.

Keywords: Speech processing

MFAE: Masked frame-level autoencoder with hybrid-supervision for low-resource music transcription

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Yulun Wu; Jiahao Zhao; Yi Yu; Wei Li. Affiliations: School of Computer Science and Technology, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China

Automatic Music Transcription (AMT) is an essential topic in music information retrieval (MIR), and it aims to transcribe audio recordings into symbolic representations. Recently, large-scale piano datasets with high-quality notations have been proposed for high-resolution piano transcription, which has resulted in domain-specific AMT models achieving state-of-the-art results. However, those methods hardly generalize to the transcription of other 'low-resource' instruments (such as guitar, cello, and clarinet). In this paper, we propose a hybrid-supervised framework, the masked frame-level autoencoder (MFAE), to solve this issue. The proposed MFAE reconstructs the frame-level features of low-resource data to understand generic representations of low-resource instruments and improves low-resource transcription performance. Experimental results on several low-resource datasets (MAPS, MusicNet, and Guitarset) show that our framework achieves state-of-the-art performance in note-wise scores (Note F1 83.4%/64.1%/86.7%, Note-with-offset F1 59.8%/41.4%/71.6%). Moreover, our framework generalizes well to various genres of instrument transcription, in both data-plentiful and data-limited scenarios.

On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

arXiv, 2023

Authors: Gao, Songyang; Dou, Shihan; Zhang, Qi; Huang, Xuanjing; Ma, Jin; Shan, Ying. Affiliations: School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China; Tencent PCG, China

Detecting adversarial samples that are carefully crafted to fool the model is a critical step to socially-secure applications. However, existing adversarial detection methods require access to sufficient training data, which brings noteworthy concerns regarding privacy leakage and generalizability. In this work, we validate that the adversarial sample generated by attack algorithms is strongly related to a specific vector in the high-dimensional inputs. Such vectors, namely UAPs (Universal Adversarial Perturbations), can be calculated without original training data. Based on this discovery, we propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs. Experimental results show that our method achieves competitive detection performance on various text classification tasks, and maintains an equivalent time consumption to normal inference. © 2023, CC BY-SA.

Keywords: Classification (of information)