Search Results - Inner Mongolia University Library

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Jiao, Yang; Chen, Shaoxiang; Jie, Zequn; Chen, Jingjing; Ma, Lin; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center on Intelligent Visual Computing, China; Meituan, China

The Large Multimodal Model (LMM) is a hot research topic in computer vision and has also demonstrated remarkable potential across multiple disciplines. A recent trend is to further extend and enhance the perception capabilities of LMMs. Current methods follow the paradigm of adapting visual task outputs to language-oriented formats. This adaptation makes such LMMs convenient to develop with minimal modifications; however, it overlooks the inductive biases within diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, which decouples the learning of perception capabilities into task-agnostic and task-specific stages. First, Lumen promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks. The output of the task-agnostic stage is thus a shared representation for all vision-centric tasks we address in this paper. Afterward, task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders with negligible training effort. Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model not only matches or surpasses the performance of existing LMM-based approaches on a range of vision-centric tasks but also maintains general visual understanding and instruction-following capabilities. © 2024 Neural Information Processing Systems Foundation. All rights reserved.

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Wang, Junke; Jiang, Yi; Yuan, Zehuan; Peng, Binyue; Wu, Zuxuan; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center on Intelligent Visual Computing, China; Bytedance Inc., China

The tokenizer, serving as a translator that maps intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to either image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window attention for spatial modeling and causal attention for temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop its spatial encoding capacity and then jointly trained on image and video data at multiple resolutions to learn temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. We also show that, when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method. © 2024 Neural Information Processing Systems Foundation. All rights reserved.

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Meng, Lingchen; Yang, Jianwei; Tian, Rui; Dai, Xiyang; Wu, Zuxuan; Gao, Jianfeng; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; Microsoft Corporation, United States

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture, DeepStack, for LMMs. Considering N layers in the language and vision transformers of LMMs, we stack the visual tokens into N groups and feed each group to its aligned transformer layer from bottom to top, as illustrated in Fig. 1. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers, with minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs, and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same context length, our DeepStack models with 7B and 13B parameters surpass their counterparts by 2.7 and 2.9 on average across 9 benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals the counterparts that use the full context length. These gains are particularly pronounced on high-resolution tasks, e.g., 4.2, 11.0, and 4.0 improvements on TextVQA, DocVQA, and InfoVQA compared to LLaVA-1.5-7B, respectively. We further apply DeepStack to vision transformer layers, which brings a similar improvement, 3.8 on average compared with LLaVA-1.5-7B. © 2024 Neural Information Processing Systems Foundation. All rights reserved.
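The layer-wise feeding described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names `split_into_groups` and `deepstack_forward`, the strided grouping, the residual addition used to inject each group, and the identity "layers" are all assumptions made for illustration. The abstract itself only states that visual tokens are stacked into groups and that each group is fed to its aligned layer from bottom to top.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def split_into_groups(tokens: List[Vector], num_groups: int) -> List[List[Vector]]:
    """Split visual tokens into num_groups equally sized groups."""
    if len(tokens) % num_groups != 0:
        raise ValueError("token count must be divisible by num_groups")
    # Strided grouping is an assumption; the abstract does not specify the scheme.
    return [tokens[g::num_groups] for g in range(num_groups)]


def deepstack_forward(groups: List[List[Vector]], layers: List[Layer]) -> List[Vector]:
    """Use group 0 as the input sequence; add group i to the hidden states before layer i."""
    hidden = [list(v) for v in groups[0]]
    for i, layer in enumerate(layers):
        if 0 < i < len(groups):
            # Hypothetical injection: element-wise residual addition at aligned positions.
            hidden = [[h + x for h, x in zip(hv, gv)]
                      for hv, gv in zip(hidden, groups[i])]
        hidden = layer(hidden)
    return hidden


def identity(hidden: List[Vector]) -> List[Vector]:
    """Stand-in for a transformer layer (an assumption, for illustration only)."""
    return hidden


tokens = [[float(i)] for i in range(8)]          # 8 one-dimensional visual tokens
groups = split_into_groups(tokens, 4)            # 4 groups of 2 tokens each
out = deepstack_forward(groups, [identity] * 4)  # 4 stand-in layers
print(len(tokens), len(out))                     # total tokens vs. tokens per layer
```

In this sketch every layer processes a sequence of 2 tokens rather than all 8, which is one way to read the abstract's claim that the approach works with a much shorter context length at minimal additional cost.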

GenRec: Unifying Video Generation and Recognition with Diffusion Models

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Weng, Zejia; Yang, Xitong; Xing, Zhen; Wu, Zuxuan; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; Department of Computer Science, University of Maryland, United States

Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we investigate whether such priors derived from a generative process are suitable for video recognition and, eventually, for joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports both generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs the best on class-conditioned image-to-video generation, achieving 46.5 and 49.3 FVD scores on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed. Code will be available at https://***/wengzejia1/GenRec. © 2024 Neural Information Processing Systems Foundation. All rights reserved.

Lumen: unleashing versatile vision-centric capabilities of large multimodal models

Proceedings of the 38th International Conference on Neural Information Processing Systems

Authors: Yang Jiao; Shaoxiang Chen; Zequn Jie; Jingjing Chen; Lin Ma; Yu-Gang Jiang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University and Shanghai Collaborative Innovation Center on Intelligent Visual Computing and Meituan; Meituan; Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University and Shanghai Collaborative Innovation Center on Intelligent Visual Computing

ISBN: (Print) 9798331314385

Spk2ImgMamba: Spiking Camera Image Reconstruction with Multi-Scale State Space Models

2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025

Authors: Yin, Jiaoyang; Fan, Bin; Xu, Chao; Huang, Tiejun; Shi, Boxin. Affiliations: School of Computer Science, Peking University, State Key Lab of Multimedia Info. Processing, China; School of Computer Science, Peking University, Nat'l Eng. Research Ctr. of Visual Technology, China; School of Intelligence Science and Technology, Peking University, Nat'l Key Lab of General AI, China

ISBN: (Print) 9798331510831

As a bio-inspired vision sensor, the spiking camera has showcased remarkable capability in high-speed imaging with a sampling rate of 40,000 Hz. Reconstructing clear images from continuous spike streams, which are obtained by each photosensor continuously detecting photons and firing spikes asynchronously, has garnered significant attention. Despite promising results, existing spike-to-image reconstruction methods face challenges in balancing global receptive fields and efficient computation due to the inherent limitations of their backbones. Recently, owing to its powerful long-range modeling and linear complexity, the state space model (SSM) has emerged as a competitive alternative to CNNs and Transformers. In this paper, we propose a lightweight spike-to-image reconstruction network that harnesses Mamba as the backbone. Our approach sequentially executes three core modules: temporal information integration, spatial feature enhancement, and progressive image reconstruction. The first accumulates cues across diverse temporal windows to explore both long-term and short-term contexts. Subsequently, to model global dependencies while heightening local detail perception, we develop a multi-scale SSM block characterized by multi-scale, multi-direction scanning, which effectively boosts spatial feature representations. Finally, intensity images are decoded progressively from the enhanced light-intensity features. Extensive experiments on both synthetic and real-captured data demonstrate that our approach achieves state-of-the-art performance, with only 10% of the network parameters and nearly two orders of magnitude less computational effort. The code will be available at https://***/interstellarH/Spk2ImgMamba. © 2025 IEEE.

Keywords: Bioimaging

FOCUS: Towards Universal Foreground Segmentation

arXiv, 2025

Authors: You, Zuyao; Kong, Lingyu; Meng, Lingchen; Wu, Zuxuan. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China

Foreground segmentation is a fundamental task in computer vision, encompassing various subtasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, these methods primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework, which can handle multiple foreground tasks. We develop a multi-scale semantic network that uses the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method that integrates a contrastive learning strategy to refine the prediction mask in a multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms state-of-the-art task-specific models on most metrics. © 2025, CC BY-NC-SA.

Keywords: Contrastive Learning

DeepStack: deeply stacking visual tokens is surprisingly simple and effective for LMMs

Proceedings of the 38th International Conference on Neural Information Processing Systems

Authors: Lingchen Meng; Jianwei Yang; Rui Tian; Xiyang Dai; Zuxuan Wu; Jianfeng Gao; Yu-Gang Jiang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Microsoft Corporation

ISBN: (Print) 9798331314385

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture, DeepStack, for LMMs. Considering N layers in the language and vision transformers of LMMs, we stack the visual tokens into N groups and feed each group to its aligned transformer layer from bottom to top, as illustrated in Fig. 1. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers, with minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs, and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same context length, our DeepStack models with 7B and 13B parameters surpass their counterparts by 2.7 and 2.9 on average across 9 benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals the counterparts that use the full context length. These gains are particularly pronounced on high-resolution tasks, e.g., 4.2, 11.0, and 4.0 improvements on TextVQA, DocVQA, and InfoVQA compared to LLaVA-1.5-7B, respectively. We further apply DeepStack to vision transformer layers, which brings a similar improvement, 3.8 on average compared with LLaVA-1.5-7B.

FOCUS: Towards Universal Foreground Segmentation

39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025

ISBN: (Print) 157735897X

Keywords: Semantic Segmentation

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Zhenxin Li; Shiyi Lan; Jose M. Alvarez; Zuxuan Wu. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; NVIDIA

ISBN: (Digital) 9798350353006

ISBN: (Print) 9798350353013

Recently, the rise of query-based Transformer decoders has been reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of existing dense BEV-based 3D object detectors by introducing enhanced components, including a CRF-modulated depth estimation module enforcing object-level consistencies, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a “modernized” dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set.

Keywords: Location awareness; Three-dimensional displays; Estimation; Object detection; Detectors; Benchmark testing; Transformers
