Search results - Inner Mongolia University Library

arXiv, 2025

Authors: Chen, Boyu; Yue, Zhengrong; Chen, Siran; Wang, Zikang; Liu, Yang; Li, Peng; Wang, Yali. Affiliations: Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Tsinghua University, Beijing, China; Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, China; Shanghai Jiao Tong University, China

Existing Multimodal Large Language Models (MLLMs) face significant challenges in modeling the temporal context of long videos. Mainstream agent-based methods currently use external tools (e.g., search engines, memory banks, OCR, and retrieval models) to assist a single MLLM in answering questions about long videos. Despite such tool support, a single MLLM still offers only a partial understanding of long videos, which limits performance. To better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our method consists of four key steps: 1) Selection: we pre-select appropriate agents from the model library to form the optimal agent team for each task. 2) Perception: we design an effective retrieval scheme for long videos that improves the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: agents answer questions about the long video and exchange their reasoning. 4) Reflection: we evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. Through this multi-round dynamic collaboration, the agents iteratively refine their answers. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% compared with the SOTA. © 2025, CC BY-NC-SA.

Keywords: Open systems
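
The four steps above can be pictured as a loop of answering, voting and pruning. The following is a minimal, hypothetical sketch in Python, not LVAgent's implementation: each agent is a plain function that sees the other agents' answers from the previous round, `run_rounds` takes a majority vote, and the "reflection" step simply drops agents that disagreed with the majority. The agent interface, the voting rule and the pruning rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Hypothetical agent interface: (question, previous round's answers) -> answer.
Agent = Callable[[str, Dict[str, str]], str]


def run_rounds(agents: Dict[str, Agent], question: str, max_rounds: int = 3):
    """Toy multi-round collaboration: answer, share, vote, prune.

    Returns (final_answer, sorted names of remaining agents, rounds used).
    """
    team = dict(agents)            # "selection": start from a pre-selected team
    previous: Dict[str, str] = {}  # answers shared with everyone next round
    majority: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        # "Action": every remaining agent answers, seeing last round's answers.
        answers = {name: agent(question, previous) for name, agent in team.items()}
        majority, votes = Counter(answers.values()).most_common(1)[0]
        if votes == len(answers):
            return majority, sorted(team), round_idx  # consensus reached
        # "Reflection" (assumed rule): keep only agents that agreed with the majority.
        team = {name: a for name, a in team.items() if answers[name] == majority}
        previous = answers
    return majority, sorted(team), max_rounds


if __name__ == "__main__":
    stubs = {
        "a": lambda q, prev: "B",
        "b": lambda q, prev: "B",
        "c": lambda q, prev: "A",
    }
    print(run_rounds(stubs, "Which event happens first?"))  # ('B', ['a', 'b'], 2)
```

With the three stub agents in the demo, agent `c` is dropped after round 1 and the remaining pair agrees in round 2. A real system would presumably score agents with a model-based evaluation rather than a bare majority vote.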

Revisiting the Generalization Problem of Low-level vision Models Through the Lens of Image Deraining

arXiv, 2025

Authors: Hu, Jinfan; You, Zhiyuan; Gu, Jinjin; Zhu, Kaiwen; Xue, Tianfan; Dong, Chao. Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; University of Chinese Academy of Sciences, Beijing 100049, China; The Chinese University of Hong Kong, 999077, Hong Kong; The University of Sydney, NSW 2006, Australia; Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China; Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Shenzhen University of Advanced Technology, Shenzhen 518055, China

Generalization remains a significant challenge for low-level vision models, which often struggle with unseen degradations in real-world scenarios despite their success in controlled benchmarks. In this paper, we revisit the generalization problem in low-level vision models. Image deraining is selected as a case study due to its well-defined and easily decoupled structure, allowing for more effective observation and analysis. Through comprehensive experiments, we reveal that the generalization issue is not primarily due to limited network capacity but rather the failure of existing training strategies, which lead networks to overfit specific degradation patterns. Our findings show that guiding networks to focus on learning the underlying image content, rather than the degradation patterns, is key to improving generalization. We demonstrate that balancing the complexity of background images and degradations in the training data helps networks better fit the image distribution. Furthermore, incorporating content priors from pre-trained generative models significantly enhances generalization. Experiments on both image deraining and image denoising validate the proposed strategies. We believe the insights and solutions will inspire further research and improve the generalization of low-level vision models. Copyright © 2025, The Authors. All rights reserved.

Keywords: Image denoising

Tile selection method based on error minimization for photomosaic image creation

Frontiers of Computer Science, 2021, Vol. 15, No. 3, pp. 165-172

Authors: Hongbo ZHANG; Xin GAO; Jixiang DU; Qing LEI; Lijie YANG. Affiliations: Department of Computer Science and Technology, Huaqiao University, Xiamen 361021, China; Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University, Xiamen 361021, China; Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen 361021, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China

Photomosaic images are composite images composed of many small images called *** its overall visual effect, a photomosaic image is similar to the target image, and photomosaics are also called "montage art". Noisy blocks and the loss of local information are the major obstacles in most methods or programs that create photomosaic *** solve these problems and generate a photomosaic image in this study, we propose a tile selection method based on error minimization. A photomosaic image can be generated by partitioning the target image in a rectangular pattern, selecting appropriate tile images, and then adding them with a weight *** on the principles of montage art, the quality of the generated photomosaic image can be evaluated by both global and local *** the proposed framework, via an error function analysis, the results show that selecting a tile image using a global minimum distance minimizes both the global error and the local error ***, the weight coefficient of the image superposition can be used to adjust the ratio of the global and local ***, to verify the proposed method, we built a new photomosaic creation dataset during this *** experimental results show that the proposed method achieves a low mean absolute error and that the generated photomosaic images have a more artistic effect than do the existing approaches.

Keywords: photomosaic image; tile image; target image; error minimization; mean absolute error
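
A minimal sketch of the pipeline described above (partition, select the closest tile, blend with a weight), assuming grayscale images stored as nested Python lists, a block size that divides the image, the sum of absolute differences as the tile-selection distance, and a blending rule `weight * tile + (1 - weight) * target`. These choices and the helper names are illustrative assumptions; the paper's exact error function and superposition formula may differ.

```python
from typing import List, Sequence

Image = List[List[float]]


def block(img: Image, r: int, c: int, size: int) -> Image:
    """Extract the size x size block whose top-left corner is (r, c)."""
    return [row[c:c + size] for row in img[r:r + size]]


def l1_distance(a: Image, b: Image) -> float:
    """Sum of absolute pixel differences between two equally sized images."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def make_mosaic(target: Image, tiles: Sequence[Image], size: int, weight: float) -> Image:
    """Replace each block by its closest tile, blended with the target block.

    Assumed blending rule: out = weight * tile + (1 - weight) * target.
    """
    h, w = len(target), len(target[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(0, h, size):
        for c in range(0, w, size):
            tgt = block(target, r, c, size)
            best = min(tiles, key=lambda t: l1_distance(t, tgt))  # minimum-distance tile
            for i, row in enumerate(tgt):
                for j, value in enumerate(row):
                    out[r + i][c + j] = weight * best[i][j] + (1 - weight) * value
    return out


def mean_absolute_error(a: Image, b: Image) -> float:
    """Average absolute difference per pixel."""
    return l1_distance(a, b) / (len(a) * len(a[0]))


if __name__ == "__main__":
    target = [[0, 0, 10, 10], [0, 0, 10, 10]]
    tiles = [[[0, 0], [0, 0]], [[8, 8], [8, 8]]]
    for w in (1.0, 0.5):
        mosaic = make_mosaic(target, tiles, size=2, weight=w)
        print(w, mean_absolute_error(mosaic, target))  # 1.0 1.0, then 0.5 0.5
```

Lowering `weight` lets more of the target show through and lowers the error against it, which is one way to read the abstract's remark that the weight coefficient trades off global against local error.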

Automatic motion-guided video stylization and personalization

19th ACM International Conference on Multimedia (ACM Multimedia 2011, MM'11)

Authors: Cao, Chen; Chen, Shifeng; Zhang, Wei; Tang, Xiaoou. Affiliations: Shenzhen Key Laboratory for Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Department of Information Engineering, Chinese University of Hong Kong, Hong Kong

ISBN (print): 9781450306164

Video stylization transfers a source video into an artistic version while maintaining temporal coherence between adjacent frames. In this paper, we formulate unsupervised example-based video stylization with a Markov random field model. In our algorithm, we implement an improved optical flow algorithm that maintains temporal coherence while improving the accuracy of estimation along motion boundaries. We also extend our algorithm to video personalization, in which human faces remain clear and distinguishable. A series of techniques are fused in video personalization, including face detection and alignment, motion flow, skin detection, and illumination blending. Given a source video and a style template image, our algorithm produces the stylized and/or personalized video(s) automatically. Experimental results demonstrate that our algorithm performs excellently in both video stylization and personalization. Copyright 2011 ACM.

Keywords: Face recognition

Edge-preserving single image super-resolution

19th ACM International Conference on Multimedia (ACM Multimedia 2011, MM'11)

Authors: Zhou, Qiang; Chen, Shifeng; Liu, Jianzhuang; Tang, Xiaoou. Affiliations: Shenzhen Key Laboratory for Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Department of Information Engineering, Chinese University of Hong Kong, Hong Kong

ISBN (print): 9781450306164

This paper proposes a novel approach to single image super-resolution. First, we propose an image up-sampling scheme that takes advantage of both bilateral filtering and mean shift image segmentation. Then we use a shock filter to enhance strong edges in the initial up-sampling result and obtain an intermediate high-resolution image. Finally, we enforce a reconstruction constraint on the high-resolution image so that fine details can be inferred by back projection. Since strong edges in the intermediate result are enhanced, ringing artifacts can be suppressed in the back projection step. We compare our algorithm with several state-of-the-art image super-resolution algorithms. Qualitative and quantitative experimental results demonstrate that our approach performs the best. Copyright 2011 ACM.

Keywords: Image segmentation
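
The final back-projection step enforces a reconstruction constraint: downsampling the high-resolution estimate should reproduce the observed low-resolution image. The 1-D sketch below assumes a toy degradation model (averaging pairs of samples, scale factor 2) and uses sample repetition as the back-projection operator; the paper's actual blur kernel, scale factor and operators are not given in the abstract. The initial estimate stands in for the shock-filtered intermediate image.

```python
from typing import List


def downsample(x: List[float]) -> List[float]:
    """Assumed degradation model: average each pair of samples (factor 2)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def upsample(y: List[float]) -> List[float]:
    """Simple back-projection operator: repeat each sample twice."""
    return [v for v in y for _ in range(2)]


def back_project(x: List[float], y: List[float], iterations: int = 5) -> List[float]:
    """Iterative back projection: x <- x + U(y - D(x)).

    Pushes the high-resolution estimate x toward consistency with the observed
    low-resolution signal y. With this operator pair D(U(e)) == e, so the
    constraint is met after one iteration and later iterations change nothing.
    """
    for _ in range(iterations):
        residual = [obs - sim for obs, sim in zip(y, downsample(x))]
        x = [xi + ui for xi, ui in zip(x, upsample(residual))]
    return x


if __name__ == "__main__":
    initial = [0, 0, 0, 10, 10, 10]  # sharp step, e.g. after edge enhancement
    observed = [1.0, 5.0, 9.0]
    print(back_project(initial, observed))  # [1.0, 1.0, 0.0, 10.0, 9.0, 9.0]
```

In the demo the sharp 0-to-10 step in the middle of the initial estimate survives, while the flat regions shift to match the observations.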

Learning to Predict Context-Adaptive Convolution for Semantic Segmentation

16th European Conference on Computer Vision, ECCV 2020

Authors: Liu, Jianbo; He, Junjun; Qiao, Yu; Ren, Jimmy S.; Li, Hongsheng. Affiliations: CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong, Hong Kong; Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, China; SenseTime Research, Hong Kong

ISBN (print): 9783030585945

Long-range contextual information is essential for achieving high-performance semantic segmentation. Previous feature re-weighting methods demonstrate that using global context for re-weighting feature channels can effectively improve the accuracy of semantic segmentation. However, a globally shared feature re-weighting vector might not be optimal for regions of different classes in the input image. In this paper, we propose a Context-adaptive Convolution Network (CaC-Net) to predict a spatially-varying feature weighting vector for each spatial location of the semantic feature maps. In CaC-Net, a set of context-adaptive convolution kernels is predicted from the global contextual information in a parameter-efficient manner. When convolved with the semantic feature maps, the predicted kernels generate spatially-varying feature weighting factors that capture both global and local contextual information. Comprehensive experimental results show that CaC-Net achieves superior segmentation performance on three public datasets: PASCAL Context, PASCAL VOC 2012 and ADE20K. © 2020, Springer Nature Switzerland AG.

Keywords: Convolution
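
To make the idea of a spatially varying weighting vector concrete, here is a toy sketch in Python. It assumes features stored as a list of per-location channel vectors, a 1x1 kernel, and a parameter-free `predict_kernel` (an outer product of the pooled context vector) standing in for CaC-Net's learned, parameter-efficient kernel predictor; none of these details come from the paper, and all function names are hypothetical.

```python
import math
from typing import List

Features = List[List[float]]  # [num_locations][channels]


def global_context(feats: Features) -> List[float]:
    """Global average pooling over all spatial locations."""
    n = len(feats)
    return [sum(loc[c] for loc in feats) / n for c in range(len(feats[0]))]


def predict_kernel(g: List[float]) -> List[List[float]]:
    """Hypothetical stand-in for a learned kernel predictor: a C x C
    1x1-convolution kernel built as the outer product of the context vector."""
    return [[go * gi for gi in g] for go in g]


def adaptive_weights(feats: Features) -> Features:
    """Per-location weighting vectors: 1x1 conv with the predicted kernel, then sigmoid."""
    kernel = predict_kernel(global_context(feats))
    weights = []
    for loc in feats:
        logits = [sum(k * x for k, x in zip(row, loc)) for row in kernel]
        weights.append([1 / (1 + math.exp(-z)) for z in logits])
    return weights


def context_adaptive_reweight(feats: Features) -> Features:
    """Re-weight each location's features with its own weighting vector."""
    return [[w * x for w, x in zip(ws, loc)]
            for ws, loc in zip(adaptive_weights(feats), feats)]


if __name__ == "__main__":
    feats = [[2.0, 0.0], [1.0, 1.0]]
    print([[round(w, 3) for w in ws] for ws in adaptive_weights(feats)])
    # [[0.989, 0.818], [0.953, 0.731]]
```

Because the predicted kernel is shared but applied to each location's own features, the weights differ from location to location, unlike a single globally shared re-weighting vector.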

Efficient Image Super-Resolution Using Vast-Receptive-Field Attention

17th European Conference on Computer Vision, ECCV 2022

Authors: Zhou, Lin; Cai, Haoming; Gu, Jinjin; Li, Zheyuan; Liu, Yingqi; Chen, Xiangyu; Qiao, Yu; Dong, Chao. Affiliations: Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Shanghai AI Laboratory, Shanghai, China; The University of Sydney, Sydney, Australia; University of Macau, Zhuhai, China

ISBN (print): 9783031250620

The attention mechanism plays a pivotal role in designing advanced super-resolution (SR) networks. In this work, we design an efficient SR network by improving the attention mechanism. We start from a simple pixel attention module and gradually modify it to achieve better super-resolution performance with fewer parameters. The specific approaches include: (1) increasing the receptive field of the attention branch, (2) replacing large dense convolution kernels with depthwise separable convolutions, and (3) introducing pixel normalization. These approaches paint a clear evolutionary roadmap for the design of attention mechanisms. Based on these observations, we propose VapSR, the Vast-receptive-field Pixel attention network. Experiments demonstrate the superior performance of VapSR: it outperforms current lightweight networks with even fewer parameters. The light version of VapSR uses only 21.68% and 28.18% of the parameters of IMDB and RFDN, respectively, while achieving performance similar to those networks. The code and models are available at https://***/zhoumumu/VapSR. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Keywords: Convolution
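
Step (2) in the abstract swaps large dense kernels for depthwise separable convolutions. The parameter arithmetic behind that choice can be checked with a few lines of Python; the kernel size 9 and 48 channels in the demo are hypothetical values, not VapSR's configuration.

```python
def dense_conv_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Weights of a standard k x k convolution, plus an optional bias per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """k x k depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    dense = dense_conv_params(9, 48, 48)
    separable = depthwise_separable_params(9, 48, 48)
    print(dense, separable)  # 186672 6288
```

For these sizes the separable version needs about 3.4% of the dense layer's parameters, which illustrates why a large receptive field can stay cheap when the kernel is factorized this way.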

EfficientFCN: Holistically-Guided Decoding for Semantic Segmentation

16th European Conference on Computer Vision, ECCV 2020

Authors: Liu, Jianbo; He, Junjun; Zhang, Jiawei; Ren, Jimmy S.; Li, Hongsheng. Affiliations: CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong, Shatin, Hong Kong; Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, China; SenseTime Research, Beijing, China

ISBN (electronic): 9783030585747

ISBN (print): 9783030585730

Both performance and efficiency are important to semantic segmentation. State-of-the-art semantic segmentation algorithms are mostly based on dilated Fully Convolutional Networks (dilatedFCN), which adopt dilated convolutions in the backbone networks to extract high-resolution feature maps for high segmentation performance. However, because many convolution operations are conducted on the high-resolution feature maps, such dilatedFCN-based methods incur large computational complexity and memory consumption. To balance performance and efficiency, there also exist encoder-decoder structures that gradually recover the spatial information by combining multi-level feature maps from the encoder. However, the performance of existing encoder-decoder methods is far from comparable with that of dilatedFCN-based methods. In this paper, we propose EfficientFCN, whose backbone is a common ImageNet-pretrained network without any dilated convolution. A holistically-guided decoder is introduced to obtain high-resolution, semantic-rich feature maps via the multi-scale features from the encoder. The decoding task is converted into a novel codebook generation and codeword assembly task, which takes advantage of the high-level and low-level features from the encoder. This framework achieves comparable or even better performance than state-of-the-art methods with only 1/3 of the computational cost. Extensive experiments on PASCAL Context, PASCAL VOC and ADE20K validate the effectiveness of the proposed EfficientFCN. © 2020, Springer Nature Switzerland AG.

Keywords: Semantic Segmentation
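
The codebook-generation and codeword-assembly idea can be sketched as two attention-weighted sums. This is a toy version under stated assumptions: features are nested lists, the attention and assembly logits are given as inputs (in EfficientFCN they would be predicted from the encoder's features), and both are normalized with a softmax. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def generate_codebook(feats: Matrix, attn_logits: Matrix) -> Matrix:
    """feats: [N low-res locations][C]; attn_logits: [K codewords][N].

    Each codeword is an attention-weighted average of the low-resolution features.
    """
    channels = len(feats[0])
    codebook = []
    for logits in attn_logits:
        a = softmax(logits)  # spatial attention map, sums to 1 over N
        codebook.append([sum(ai * f[c] for ai, f in zip(a, feats)) for c in range(channels)])
    return codebook


def assemble(codebook: Matrix, assembly_logits: Matrix) -> Matrix:
    """assembly_logits: [M high-res locations][K].

    Each output feature is a convex combination of the K codewords.
    """
    channels = len(codebook[0])
    out = []
    for logits in assembly_logits:
        w = softmax(logits)
        out.append([sum(wk * cw[c] for wk, cw in zip(w, codebook)) for c in range(channels)])
    return out


if __name__ == "__main__":
    low_res = [[1.0, 0.0], [0.0, 1.0]]                       # N = 2 locations, C = 2
    codebook = generate_codebook(low_res, [[2.0, 0.0], [0.0, 2.0]])  # K = 2
    print([[round(v, 3) for v in cw] for cw in codebook])    # [[0.881, 0.119], [0.119, 0.881]]
    high_res = assemble(codebook, [[0.0, 0.0]] * 4)          # M = 4 output locations
    print(len(high_res), len(high_res[0]))                   # 4 2
```

The N low-resolution features are first summarized into K codewords, and each of the M output locations (M can exceed N) is then rebuilt from those codewords, which suggests how a decoder of this kind can produce a dense map without dilated convolutions in the backbone.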

Rapid disparity prediction for dynamic scenes

9th International Symposium on Advances in Visual Computing, ISVC 2013

Authors: Jiang, Jun; Cheng, Jun; Chen, Baowen. Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Chinese University of Hong Kong, Hong Kong; Shenzhen Institute of Information Technology, China; Guangdong Provincial Key Laboratory of Robotics and Intelligent System, China; Shenzhen Key Laboratory of Computer Vision and Pattern Recognition, China

ISBN (print): 9783642419133

Real-time 3D sensing plays a critical role in robotic navigation, video surveillance, human-computer interaction and other applications. When computing 3D structures of dynamic scenes from stereo sequences, spatiotemporal stereo and scene flow methods can produce temporally coherent disparity. However, most existing methods do not make sufficient use of the previous disparity map when computing the next one, and the correspondence search space limits the speed of disparity computation for each image pair. This paper proposes an effective scheme to predict disparity maps from stereo sequences. In particular, we apply a robust 3D registration algorithm based on the angular-invariant feature to estimate the ego-motion of the stereo rig between consecutive frames, and formulate the transformation between consecutive disparity maps. The scheme can rapidly produce a sequence of temporally coherent disparity maps. We apply the new scheme to real outdoor scenes, and thorough empirical studies indicate its effectiveness for practical applications. © 2013 Springer-Verlag.

Keywords: Human computer interaction
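
The transformation between consecutive disparity maps follows from standard rectified-stereo geometry: disparity d gives depth Z = fB/d, the 3D point is moved by the estimated ego-motion (R, t), and it is re-projected into the next frame. The sketch below assumes (R, t) is already known (the paper estimates it with an angular-invariant-feature registration step that is not reproduced here), a pinhole camera with focal length f in pixels, baseline B in metres, principal point (cx, cy), and d > 0. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def predict_disparity(u: float, v: float, d: float, f: float, baseline: float,
                      cx: float, cy: float, R: Sequence[Sequence[float]],
                      t: Sequence[float]) -> Tuple[float, float, float]:
    """Warp one pixel's disparity from frame k to frame k+1 given ego-motion (R, t).

    Returns (u', v', d') in frame k+1 for a rectified stereo pair.
    """
    z = f * baseline / d            # depth from disparity
    x = (u - cx) * z / f            # back-project to 3D camera coordinates
    y = (v - cy) * z / f
    point = (x, y, z)
    # Rigid motion of the scene point relative to the moved camera: P' = R P + t.
    x2, y2, z2 = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Re-project and convert depth back to disparity.
    return f * x2 / z2 + cx, f * y2 / z2 + cy, f * baseline / z2


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    # Rig moves 1 m forward, so a point at 5 m ends up 4 m away.
    print(predict_disparity(0.0, 0.0, 10.0, f=500.0, baseline=0.1, cx=0.0, cy=0.0,
                            R=identity, t=(0.0, 0.0, -1.0)))  # (0.0, 0.0, 12.5)
```

Moving the rig 1 m forward brings a point at 5 m to 4 m, so its predicted disparity rises from 10 to 12.5 px; restricting the correspondence search around such predictions is one plausible way a scheme like this saves computation.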

Automatic object segmentation from large scale 3D urban point clouds through manifold embedded mode seeking

19th ACM International Conference on Multimedia (ACM Multimedia 2011, MM'11)

Authors: Yu, Zhiding; Xu, Chunjing; Liu, Jianzhuang; Au, Oscar C.; Tang, Xiaoou. Affiliations: Shenzhen Key Laboratory for Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong; Department of Information Engineering, Chinese University of Hong Kong, Hong Kong

ISBN (print): 9781450306164

This paper presents a system that can automatically segment objects in large scale 3D point clouds obtained from urban ranging images. The system consists of three steps: The first one involves a ground detection process that can detect relatively complex terrain and separate it from other objects. The second step superpixelizes the remaining objects to speed up the segmentation process. In the final step, a manifold embedded mode seeking method is adopted to segment the point clouds. Even though the segmentation of urban objects is a challenging problem in terms of accuracy and problem scale, our system can efficiently generate very good segmentation results. The proposed manifold learning effectively improves the segmentation performance due to the fact that continuous artificial objects often have manifold-like structures. Copyright 2011 ACM.

Keywords: Image segmentation
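
Mode seeking can be illustrated with plain flat-kernel mean shift. Note that this is a swapped-in, simpler technique: it works directly in Euclidean space, whereas the paper performs mode seeking in a manifold embedding after ground detection and superpixelization. The radius and merge tolerance are hypothetical parameters, and the function names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def mean_shift_mode(start: Point, points: List[Point], radius: float,
                    iters: int = 50) -> Point:
    """Flat-kernel mean shift: repeatedly move to the mean of neighbours within radius."""
    x = start
    for _ in range(iters):
        nbrs = [p for p in points if math.dist(p, x) <= radius]
        new = tuple(sum(coord) / len(nbrs) for coord in zip(*nbrs))
        if math.dist(new, x) < 1e-9:  # converged to a mode
            break
        x = new
    return x


def segment(points: List[Point], radius: float, merge_tol: float = 1e-3) -> List[int]:
    """Label each point by the mode it converges to; points sharing a mode share a label."""
    modes: List[Point] = []
    labels: List[int] = []
    for p in points:
        m = mean_shift_mode(p, points, radius)
        for k, existing in enumerate(modes):
            if math.dist(existing, m) < merge_tol:
                labels.append(k)
                break
        else:  # no existing mode matched: start a new segment
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
    print(segment(pts, radius=1.0))  # [0, 0, 0, 1, 1]
```

Every point climbs to the density peak of its neighbourhood, so two well-separated groups of points end up with two labels without specifying the number of segments in advance.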
