Search Results - Inner Mongolia University Library

Denoising Diffusion Path: Attribution Noise Reduction with An Auxiliary Diffusion Model

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Lei, Yiming; Li, Zilong; Zhang, Junping; Shan, Hongming. Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, China; Institute of Science and Technology for Brain-Inspired Intelligence, MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, MOE Frontiers Center for Brain Science, Fudan University, China

The explainability of deep neural networks (DNNs) is critical for trust and reliability in AI systems. Path-based attribution methods, such as integrated gradients (IG), aim to explain predictions by accumulating gradients along a path from a baseline to the target image. However, noise accumulated during this process can significantly distort the explanation. While existing methods primarily concentrate on finding alternative paths to circumvent noise, they overlook a critical issue: intermediate-step images frequently diverge from the distribution of training data, further intensifying the impact of noise. This work presents a novel Denoising Diffusion Path (DDPath) to tackle this challenge by harnessing the power of diffusion models for denoising. By exploiting the inherent ability of diffusion models to progressively remove noise from an image, DDPath constructs a piece-wise linear path. Each segment of this path ensures that samples drawn from a Gaussian distribution are centered around the target image. This approach facilitates a gradual reduction of noise along the path. We further demonstrate that DDPath adheres to essential axiomatic properties for attribution methods and can be seamlessly integrated with existing methods such as IG. Extensive experimental results demonstrate that DDPath can significantly reduce noise in the attributions, resulting in clearer explanations, and achieves better quantitative results than traditional path-based methods. © 2024 Neural Information Processing Systems Foundation. All rights reserved.
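
For orientation, the integrated gradients (IG) baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of standard straight-line IG, not of DDPath itself; the toy function `f`, its analytic gradient `grad_f`, and the step count are assumptions chosen only so that the completeness property (attributions sum to the change in output) is easy to check.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(
    grad_f: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 200,
) -> Vector:
    """Midpoint Riemann-sum approximation of IG along the straight line
    from `baseline` to `x` (the standard IG path, not the DDPath path)."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_f(point)
        for i, g in enumerate(grad):
            total[i] += g
    # Scale the averaged gradient by the input difference, per feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy "model": f(v) = v0^2 + 3 * v0 * v1, with its exact gradient.
def f(v: Vector) -> float:
    return v[0] ** 2 + 3.0 * v[0] * v[1]


def grad_f(v: Vector) -> Vector:
    return [2.0 * v[0] + 3.0 * v[1], 3.0 * v[0]]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

Here the attributions sum to `f(x) - f(baseline)`, which is the completeness axiom that path-based methods are expected to satisfy.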

TOWARDS GENERATIVE ABSTRACT REASONING: COMPLETING RAVEN’S PROGRESSIVE MATRIX VIA RULE ABSTRACTION AND SELECTION

arXiv, 2024

Authors: Shi, Fan; Li, Bin; Xue, Xiangyang. Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, China

Endowing machines with abstract reasoning ability has been a long-term research topic in artificial intelligence. Raven’s Progressive Matrix (RPM) is widely used to probe abstract visual reasoning in machine intelligence, where models analyze the underlying rules and select one image from candidates to complete the image matrix. Participants in RPM tests can show powerful reasoning ability by inferring and combining attribute-changing rules and imagining the missing images at arbitrary positions of a matrix. However, existing solvers can hardly manifest such an ability in realistic RPM tests. In this paper, we propose a deep latent variable model for answer generation problems through Rule AbstractIon and SElection (RAISE). RAISE can encode image attributes into latent concepts and abstract atomic rules that act on the latent concepts. When generating answers, RAISE selects one atomic rule out of the global knowledge set for each latent concept to constitute the underlying rule of an RPM. In the experiments of bottom-right and arbitrary-position answer generation, RAISE outperforms the compared solvers in most configurations of realistic RPM datasets. In the odd-one-out task and two held-out configurations, RAISE can leverage acquired latent concepts and atomic rules to find the rule-breaking image in a matrix and handle problems with unseen combinations of rules and attributes. Copyright © 2024, The Authors. All rights reserved.

Keywords: Abstracting

TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning

arXiv, 2024

Authors: Li, Jinglun; Zhou, Xinyu; Jiang, Kaixun; Hong, Lingyi; Guo, Pinxue; Chen, Zhaoyu; Ge, Weifeng; Zhang, Wenqiang. Affiliations: Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, Shanghai, China; Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China; Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China

Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose TagOOD, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of in-distribution (IND) object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks. Code is available at: https://***/Jarvisgivemeasuit/tagood. Copyright © 2024, The Authors. All rights reserved.
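
The final scoring step described in the abstract, a distance between a test feature and learned class centers, can be illustrated with a minimal sketch. The Euclidean metric, the two-dimensional features, and the `centers` dictionary below are assumptions made for illustration; the abstract does not state which distance TagOOD uses or how the centers are parameterized.

```python
import math
from typing import Dict, List


def ood_score(feature: List[float], centers: Dict[str, List[float]]) -> float:
    """Distance from `feature` to the nearest class center.

    A larger value suggests the sample lies farther from every
    in-distribution class, i.e. it is more likely to be OOD.
    """
    return min(math.dist(feature, center) for center in centers.values())


# Hypothetical class centers, standing in for centers learned from object features.
centers = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

near_score = ood_score([0.9, 0.1], centers)  # close to the "cat" center
far_score = ood_score([5.0, 5.0], centers)   # far from both centers
```

In practice a threshold on the score (chosen on validation data) would separate in-distribution from OOD samples; that threshold is outside the scope of this sketch.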

Keywords: Semantics

Optimizing V-information for Self-Supervised Pre-training Data-Effective Medical Foundation Models

arXiv, 2024

Authors: Yang, Wenxuan; Zhang, Hanyu; Tan, Weimin; Sun, Yuqi; Yan, Bo. Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, China

Self-supervised pre-training medical foundation models on large-scale datasets demonstrate exceptional performance. However, recent research questions this traditional notion, exploring whether an increase in pre-training data always leads to enhanced model performance. To address this issue, data-effective learning approaches have been introduced to select valuable samples for foundation model pre-training. Notably, current methods in this area lack a clear standard for sample selection, and the underlying theoretical foundation remains unknown. As the first attempt to address this limitation, we leverage V-information in self-supervised pre-training of foundation models. Our theoretical derivation confirms that by optimizing V-information, sample selection can be framed as an optimization problem where choosing diverse and challenging samples enhances model performance even under limited training data. Under this guidance, we develop an optimal data-effective learning method (OptiDEL) to optimize V-information in real-world medical domains. The OptiDEL method generates more diverse and harder samples to achieve or even exceed the performance of models trained on the full dataset while using substantially less data. We compare the OptiDEL method with state-of-the-art approaches, finding that OptiDEL consistently outperforms existing approaches across eight different datasets, with foundation models trained on only 5% of the pre-training data surpassing the performance of those trained on the full dataset. The code can be accessed at GitHub Repository. Copyright © 2024, The Authors. All rights reserved.

Keywords: Self-supervised learning

X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

32nd ACM International Conference on Multimedia, MM 2024

Authors: Guo, Pinxue; Li, Wanyun; Huang, Hao; Hong, Lingyi; Zhou, Xinyu; Chen, Zhaoyu; Li, Jinglun; Jiang, Kaixun; Zhang, Wei; Zhang, Wenqiang. Affiliations: Shanghai Engineering Research Center of AI & Robotics, Academy for Engineering & Technology, Fudan University, China; Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, China; Engineering Research Center of AI & Robotics, Ministry of Education, Academy for Engineering & Technology, China

ISBN: (print) 9798400706868

Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse with the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality of the prompt to adapt it to downstream multi-modal tasks with limited data. Within the X-Prompt framework, we introduce the Multi-modal Visual Prompter (MVP), which allows prompting the foundation model with the various modalities to segment objects precisely. We further propose the Multi-modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality-specific knowledge without compromising the generalization capacity. To evaluate the effectiveness of the X-Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X-Prompt framework consistently outperforms the full fine-tuning paradigm and achieves state-of-the-art performance. Code: https://***/PinxueGuo/*** © 2024 ACM.

Keywords: Image segmentation

EAFormer: Scene Text Segmentation with Edge-Aware Transformers

arXiv, 2024

Authors: Yu, Haiyang; Fu, Teng; Li, Bin; Xue, Xiangyang. Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, China

Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training. The code and datasets are available at https://***/EAFormer/. © 2024, CC BY.

Keywords: Image segmentation

Unsupervised Learning of Global Object-Centric Representations for Compositional Scene Understanding

IEEE Transactions on Visualization and Computer Graphics, 2025, vol. PP, p. PP

Authors: Chen, Tonglin; Huang, Yinxuan; Huang, Jinghao; Li, Bin; Xue, Xiangyang. Affiliations: Fudan University, Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Shanghai 200433, China

The ability to extract invariant visual features of objects from complex scenes and identify the same objects in different scenes is inborn for humans. To endow AI systems with such capability, we introduce a novel compositional scene understanding method known as Compositional Scene understanding via Global Object-centric representations (CSGO). CSGO achieves comprehensive scene understanding, including the discovery and identification of objects, by leveraging a set of learnable global object-centric representations in an unsupervised manner. CSGO comprises three components: 1) Local Object-Centric Learning, which is responsible for extracting localized and scene-specific object-centric representations to discover objects; 2) Image Decoding, facilitating the reconstruction of object and scene images using the obtained object-centric representation as input; and 3) Global Object-Centric Learning, identifying the object across diverse scenes according to a set of learnable global object-centric representations, which indicates the scene-free intrinsic attributes (i.e., appearance and shape) of objects. Experimental results on three synthetic datasets and one real-world scene dataset demonstrate that CSGO has excellent object identification and attribute disentanglement abilities. Furthermore, the scene decomposition performance (indicating object discovery performance) of CSGO is superior to comparison methods. © 1995-2012 IEEE.

Keywords: Unsupervised learning

Improving Viewpoint-Independent Object-Centric Representations through Active Viewpoint Selection

arXiv, 2024

Authors: Huang, Yinxuan; Gao, Chengmin; Li, Bin; Xue, Xiangyang. Affiliations: Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, China

Given the complexities inherent in visual scenes, such as object occlusion, a comprehensive understanding often requires observation from multiple viewpoints. Existing multi-viewpoint object-centric learning methods typically employ random or sequential viewpoint selection strategies. While applicable across various scenes, these strategies may not always be ideal, as certain scenes could benefit more from specific viewpoints. To address this limitation, we propose a novel active viewpoint selection strategy. This strategy predicts images from unknown viewpoints based on information from observation images for each scene. It then compares the object-centric representations extracted from both viewpoints and selects the unknown viewpoint with the largest disparity, indicating the greatest gain in information, as the next observation viewpoint. Through experiments on various datasets, we demonstrate the effectiveness of our active viewpoint selection strategy, significantly enhancing segmentation and reconstruction performance compared to random viewpoint selection. Moreover, our method can accurately predict images from unknown viewpoints. Copyright © 2024, The Authors. All rights reserved.
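
The selection rule described in the abstract (pick the unknown viewpoint whose predicted representation differs most from the observed one) can be sketched in a few lines. The vector representations, the Euclidean disparity measure, and the viewpoint names are all hypothetical; the abstract does not specify how disparity is computed.

```python
import math
from typing import Dict, List


def select_next_viewpoint(
    observed_repr: List[float],
    predicted_reprs: Dict[str, List[float]],
) -> str:
    """Return the candidate viewpoint whose predicted representation is
    farthest from the observed one (largest disparity, assumed Euclidean)."""
    return max(
        predicted_reprs,
        key=lambda view: math.dist(observed_repr, predicted_reprs[view]),
    )


# Hypothetical object-centric representations, flattened to short vectors.
observed = [0.0, 0.0]
candidates = {
    "left": [1.0, 0.0],   # disparity 1.0
    "top": [3.0, 4.0],    # disparity 5.0
    "right": [0.0, 2.0],  # disparity 2.0
}

next_view = select_next_viewpoint(observed, candidates)
```

With these toy values the rule selects `"top"`, the candidate with the greatest disparity and therefore, under the abstract's reasoning, the greatest expected information gain.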

Keywords: Active learning

Deep-OCTA: Ensemble Deep Learning Approaches for Diabetic Retinopathy Analysis on OCTA Images

25th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2022

Authors: Hou, Junlin; Xiao, Fan; Xu, Jilan; Zhang, Yuejie; Zou, Haidong; Feng, Rui. Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China; Academy for Engineering and Technology, Fudan University, Shanghai, China; Department of Ophthalmology, Shanghai General Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China

ISBN: (print) 9783031336577

Ultra-wide optical coherence tomography angiography (OCTA) has become an important imaging modality in diabetic retinopathy (DR) diagnosis. However, few studies focus on automatic DR analysis using ultra-wide OCTA. In this paper, we present novel and practical deep-learning solutions based on ultra-wide OCTA for the Diabetic Retinopathy Analysis Challenge (DRAC). In the first task of segmentation of DR lesions, we utilize UNet and UNet++ to segment three lesions with strong data augmentation and model ensemble. In the second task of image quality assessment, we create an ensemble of Inception-V3, SE-ResNeXt, and Vision Transformer models. Pre-training on the large dataset as well as the hybrid MixUp and CutMix strategy are both adopted to boost the generalization ability of our models. In the third task of DR grading, we build a Vision Transformer and find that the model pre-trained on color fundus images serves as a useful substrate for OCTA images. Extensive ablation studies demonstrate the effectiveness of each designed component in our solutions. The proposed methods rank 4th, 3rd, and 5th on the three leaderboards of DRAC, respectively. Our code is publicly available at https://***/FDU-VTS/DRAC. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
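
As a reminder of what the MixUp half of the "hybrid MixUp and CutMix strategy" does, the sketch below blends two samples and their one-hot labels with a single mixing weight. It is a generic illustration, not the authors' pipeline: the fixed `lam` value, the flat-list "images", and the function name are assumptions (MixUp normally draws `lam` from a Beta distribution, and CutMix, which pastes a rectangular patch instead of blending, is not shown).

```python
from typing import List, Tuple


def mixup(
    x1: List[float],
    y1: List[float],
    x2: List[float],
    y2: List[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Blend two samples and their one-hot labels with weight `lam` in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y


# Two hypothetical two-pixel "images" with one-hot labels for two classes.
mixed_x, mixed_y = mixup(
    x1=[1.0, 0.0], y1=[1.0, 0.0],
    x2=[0.0, 1.0], y2=[0.0, 1.0],
    lam=0.75,
)
```

The mixed label remains a valid probability vector, so the usual cross-entropy loss can be applied to it unchanged.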

Keywords: Optical tomography