Search Results - Inner Mongolia University Library

Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences

arXiv, 2025

Authors: Liu, Xiangyang; He, Junliang; Qiu, Xipeng. Affiliations: School of Computer Science, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China

Large language models (LLMs) can perform complex reasoning by generating intermediate thoughts under zero-shot or few-shot settings. However, zero-shot prompting always encounters low performance, and the superior performance of few-shot prompting hinges on the manual-crafted demonstrations. In this paper, we present RoSE (Reasoning with Orchestrated Streaming Experiences), a general framework for solving reasoning tasks that can self-improve without complex external efforts. To enable RoSE, we describe an architecture that extends an LLM to store all answered questions and their thoughts in a streaming experience pool then orchestrates helpful questions from the pool to assist in answering new questions. To set up a question-aware orchestration mechanism, RoSE first calculates the similarity of each question in the pool with a new test question. Since the solution to each answered question is not always correct, RoSE will sort the questions according to their similarity with the new question, and then uniformly divide them into multiple buckets. It finally extracts one question from each bucket to make these extracted questions more diverse. To make these extracted questions help RoSE answer new questions as much as possible, we introduce two other attributes of uncertainty and complexity for each question. RoSE will preferentially select the questions with low uncertainty and high complexity from each bucket. We evaluate the versatility of RoSE in various reasoning tasks, LLMs, and CoT methods. Copyright © 2025, The Authors. All rights reserved.
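The orchestration step described in this abstract (sort by similarity, split into uniform buckets, take one question per bucket while preferring low uncertainty and high complexity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Experience` class, the `orchestrate` function, the precomputed `similarity` scores, and the tie-breaking rule (uncertainty first, then complexity) are all assumptions, since the abstract does not say how the two attributes are combined.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    """One answered question in the pool (hypothetical structure)."""
    question: str
    similarity: float   # similarity to the new question, assumed precomputed
    uncertainty: float  # lower is preferred
    complexity: float   # higher is preferred


def orchestrate(pool, num_buckets):
    """Pick one experience from each similarity bucket."""
    if not pool or num_buckets <= 0:
        return []
    # Sort the pool by similarity to the new question, most similar first.
    ranked = sorted(pool, key=lambda e: e.similarity, reverse=True)
    num_buckets = min(num_buckets, len(ranked))
    # Split the ranked list into near-equal contiguous buckets.
    base, extra = divmod(len(ranked), num_buckets)
    picked, start = [], 0
    for i in range(num_buckets):
        size = base + (1 if i < extra else 0)
        bucket = ranked[start:start + size]
        start += size
        # Assumed rule: lowest uncertainty wins, ties go to higher complexity.
        picked.append(min(bucket, key=lambda e: (e.uncertainty, -e.complexity)))
    return picked


pool = [
    Experience("a", 0.9, 0.5, 1.0),
    Experience("b", 0.8, 0.1, 2.0),
    Experience("c", 0.3, 0.2, 5.0),
    Experience("d", 0.2, 0.2, 3.0),
]
selected = [e.question for e in orchestrate(pool, num_buckets=2)]
```

Taking one question per bucket is what spreads the selected demonstrations across the similarity range instead of clustering them at the top.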

Keywords: Complex networks

Retrieval Augmented Recipe Generation

IEEE Workshop on Applications of Computer Vision (WACV)

Authors: Guoshan Liu; Hailong Yin; Bin Zhu; Jingjing Chen; Chong-Wah Ngo; Yu-Gang Jiang. Affiliations: Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University; Shanghai Collaborative Innovation Center on Intelligent Visual Computing; Singapore Management University

ISBN (digital): 9798331510831

ISBN (print): 9798331510848

Generating recipes from food images has drawn substantial research attention in recent years. Existing works for recipe generation primarily utilize a two-stage training method: first predicting ingredients from a food image and then generating instructions from both the image and ingredients. Large Multi-modal Models (LMMs), which have achieved notable success across a variety of vision and language tasks, shed light on generating both ingredients and instructions directly from images. Nevertheless, LMMs still face the common issue of hallucinations during recipe generation, leading to suboptimal performance. To tackle this issue, we propose a retrieval augmented large multimodal model for recipe generation. We first introduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieve recipes semantically related to the image from an existing datastore as a supplement, integrating them into the prompt to add diverse and rich context to the input image. Additionally, a Self-Consistency Ensemble Voting mechanism is proposed to determine the most confident predicted recipe as the final output. It calculates the consistency among generated recipe candidates, each of which uses different retrieved recipes as context for generation. Extensive experiments validate the effectiveness of our proposed method, which demonstrates state-of-the-art (SOTA) performance in recipe generation on the Recipe1M dataset.
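The voting idea in this abstract (score each generated candidate by how consistent it is with the others, then keep the most consistent one) might look roughly like the sketch below. The abstract does not say how consistency is measured, so the token-level Jaccard overlap used here is purely an assumption, and `jaccard` and `most_consistent` are hypothetical helper names.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts (assumed consistency measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def most_consistent(candidates):
    """Return the candidate with the highest mean agreement with the rest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def score(i):
        # Average similarity of candidate i to every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(jaccard(candidates[i], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=score)
    return candidates[best]


# Each candidate is imagined to come from a different retrieved context.
candidates = [
    "mix flour and eggs",
    "mix flour and eggs then bake",
    "mix flour eggs and sugar",
    "grill the fish",
]
winner = most_consistent(candidates)
```

An outlier such as the last candidate agrees with nothing else and scores lowest, which is the intuition behind using agreement as a proxy for confidence.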

Keywords: Training; Computer vision; Accuracy; Computational modeling; Stochastic processes; Predictive models; Reliability; Faces

FOCUS: Towards Universal Foreground Segmentation

arXiv, 2025

Authors: You, Zuyao; Kong, Lingyu; Meng, Lingchen; Wu, Zuxuan. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China

Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, they primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating the contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics. © 2025, CC BY-NC-SA.

Keywords: Contrastive Learning

DuMo: Dual Encoder Modulation Network for Precise Concept Erasure

arXiv, 2025

Authors: Han, Feng; Chen, Kai; Gong, Chao; Wei, Zhipeng; Chen, Jingjing; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of Computer Science, Fudan University, China; Shanghai Collaborative Innovation Center on Intelligent Visual Computing, China

The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these models alter the parameters of the backbone network and exert considerable influences on the structural (low-frequency) components of the image, which undermines the model’s ability to retain non-target concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, the phenomenon is observed that distinct erasing preferences for the image structure and details are demonstrated by the EPR at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module’s outputs across different layers and timesteps, automatically balancing the erasure effects and model’s generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure (detecting only 34 nude parts), Cartoon Concept Removal (with an average LPIPSda of 0.428, 0.113 higher than SOTA at 0.315), and Artistic Style Erasure (with an average LPIPSda of 0.387, 0.088 higher than SOTA at 0.299), clearly outperforming alternative methods. Code is available at https://***/Maplebb/DuMo Copyright © 2025, The Author

Keywords: HTTP

BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

arXiv, 2025

Authors: Zhang, Hui; Gao, Tingwei; Shao, Jie; Wu, Zuxuan. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; ByteDance Intelligent Creation, China

Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality. Copyright © 2025, The Authors. All rights reserved.
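The caching idea in this abstract (reuse the output of certain blocks across adjacent denoising steps, mainly in the later stage) can be illustrated with a toy loop. This is only a sketch under stated assumptions: `run_with_reuse` and its parameters `reuse_upto`, `reuse_start`, and `interval` are hypothetical, the choice of the leading blocks as the "structure-focused" ones is assumed, and the blocks are plain Python callables rather than transformer layers.

```python
def run_with_reuse(blocks, x, num_steps, reuse_upto, reuse_start, interval):
    """Toy denoising loop with feature reuse.

    blocks:      list of callables f(h, step) -> h
    reuse_upto:  number of leading blocks whose output is cached (assumed)
    reuse_start: first step at which cached features may be reused (assumed)
    interval:    recompute the cached feature every `interval` steps (assumed)
    Returns (final value, number of block evaluations).
    """
    cache = None
    calls = 0
    for step in range(num_steps):
        reuse = (
            cache is not None
            and step >= reuse_start
            and (step - reuse_start) % interval != 0
        )
        if reuse:
            # Skip the leading blocks and reuse the feature from an earlier step.
            h = cache
        else:
            h = x
            for block in blocks[:reuse_upto]:
                h = block(h, step)
                calls += 1
            cache = h
        # The remaining blocks always run on the (fresh or cached) feature.
        for block in blocks[reuse_upto:]:
            h = block(h, step)
            calls += 1
        x = h
    return x, calls


toy_blocks = [lambda h, s: 0.5 * h + 1.0] * 4
_, reused_calls = run_with_reuse(toy_blocks, 0.0, num_steps=6,
                                 reuse_upto=2, reuse_start=2, interval=2)
_, full_calls = run_with_reuse(toy_blocks, 0.0, num_steps=6,
                               reuse_upto=2, reuse_start=6, interval=2)
```

With these toy settings the reuse schedule evaluates fewer blocks than the full schedule; the actual savings and quality trade-off in the paper depend on details the abstract does not give.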

Keywords: Decision making

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

arXiv, 2025

Authors: Chen, Haoran; Wang, Ping; Zhou, Zihan; Zhang, Xu; Wu, Zuxuan; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; APUS AI Lab, China

Class-incremental learning (CIL) enables models to learn new classes progressively while preserving knowledge of previously learned ones. Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques, with many approaches building upon the framework that maintains a pool of learnable prompts. Although effective, these methods introduce substantial computational overhead, primarily due to prompt pool querying and increased input sequence lengths from prompt concatenation. In this work, we present a novel prompt-based approach that addresses this limitation. Our method trains a single set of shared prompts across all tasks and, rather than concatenating prompts to the input, directly modifies the CLS token’s attention computation by adding the prompts to it. This simple and lightweight design not only significantly reduces computational complexity—both in terms of inference costs and the number of trainable parameters—but also eliminates the need to optimize prompt lengths for different downstream tasks, offering a more efficient yet powerful solution for rehearsal-free class-incremental learning. Extensive experiments across a diverse range of CIL benchmarks demonstrate the effectiveness of our approach, highlighting its potential to establish a new prompt-based CIL paradigm. Furthermore, experiments on general recognition benchmarks beyond the CIL setting also show strong performance, positioning our method as a promising candidate for a general parameter-efficient fine-tuning approach. Copyright © 2025, The Authors. All rights reserved.

Keywords: Contrastive Learning

Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

arXiv, 2025

Authors: You, Zuyao; Wang, Junke; Kong, Lingyu; He, Bo; Wu, Zuxuan. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; University of Maryland, College Park, United States

We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains in CIDEr (+1.4%), ROUGE (+0.4%), and SPICE (+0.5%) on the Visual Genome dataset, and strengthens its region understanding ability on the ViP-Bench, with an overall improvement of +5.1%, including notable increases in recognition accuracy (+11.2%) and language generation quality (+22.2%). Code is available at https://***/geshang777/pix2cap. © 2025, CC BY-NC-SA.

Keywords: Pixels

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

arXiv, 2025

Authors: Zhang, Zihao; Chen, Haoran; Zhao, Haoyu; Lu, Guansong; Fu, Yanwei; Xu, Hang; Wu, Zuxuan. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; Noah’s Ark Lab, Huawei, Canada

Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD. Copyright © 2025, The Authors. All rights reserved.

Keywords: Optical flows

Human-like conceptual representations emerge from language prediction

arXiv, 2025

Authors: Xu, Ningyu; Zhang, Qi; Du, Chao; Luo, Qiang; Qiu, Xipeng; Huang, Xuanjing; Zhang, Menghan. Affiliations: School of Computer Science, Fudan University, Shanghai, China; Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China; Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Shanghai, China; Ministry of Education Key Laboratory of Contemporary Anthropology, Fudan University, Shanghai, China

People acquire concepts through rich physical and social experiences and use them to understand the world. In contrast, large language models (LLMs), trained exclusively through next-token prediction over language data, exhibit remarkably human-like behaviors. Are these models developing concepts akin to humans, and if so, how are such concepts represented and organized? To address these questions, we reframed the classic reverse dictionary task to simulate human concept inference in context and investigated the emergence of human-like conceptual representations within LLMs. Our results demonstrate that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts. The derived representations converged towards a shared, context-independent structure that effectively predicted human behavior across key psychological phenomena, including computation of similarities, categories and semantic scales. Moreover, these representations aligned well with neural activity patterns in the human brain, even in response to visual rather than linguistic stimuli, providing evidence for biological plausibility. These findings establish that structured, human-like conceptual representations can naturally emerge from language prediction without real-world grounding. More broadly, our work positions LLMs as promising computational tools for understanding complex human cognition and paves the way for better alignment between artificial and human intelligence. Copyright © 2025, The Authors. All rights reserved.

Keywords: Semantics