Search Results - Inner Mongolia University Library

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Fu, Shenghao; Yan, Junkai; Yang, Qize; Wei, Xihan; Xie, Xiaohua; Zheng, Wei-Shi. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Peng Cheng Laboratory, Shenzhen 518055, China; Tongyi Lab, Alibaba Group, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China; Guangdong, Guangzhou 510555, China

Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application on object detection is largely overlooked, especially without fine-tuning them. In this work, we show that frozen foundation models can be a versatile feature enhancer, even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in the following two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector's decoder by providing a compact context. Additionally, the patch tokens in foundation models can enrich the features in the detector's encoder by providing semantic details. Utilizing frozen foundation models as plug-and-play modules rather than the commonly used backbone can significantly enhance the detector's performance while preventing the problems caused by the architecture discrepancy between the detector's backbone and the foundation model. With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone. © 2024 Neural information processing systems foundation. All rights reserved.
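The reported gains can be checked with simple arithmetic. The sketch below is illustrative only: the function name `ap_gain` and the constant names are hypothetical, and the AP values are the ones quoted in the abstract.

```python
def ap_gain(baseline_ap: float, new_ap: float) -> float:
    """Return the absolute AP improvement in percentage points, rounded to one decimal."""
    return round(new_ap - baseline_ap, 1)


# AP values quoted in the abstract (COCO validation set, 12 epochs, R50 backbone).
BASELINE_AP = 49.0    # DINO baseline
ONE_MODEL_AP = 51.9   # with one frozen foundation model
TWO_MODELS_AP = 53.8  # with two frozen foundation models

gain_one = ap_gain(BASELINE_AP, ONE_MODEL_AP)
gain_two = ap_gain(BASELINE_AP, TWO_MODELS_AP)
print(f"+{gain_one} AP with one model, +{gain_two} AP with two models")
```

The rounding step only guards against floating-point noise in the subtraction; it does not change the reported figures.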

View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Quan Zhang; Lei Wang; Vishal M. Patel; Xiaohua Xie; Jianhuang Lai. Affiliations: School of Computer Science and Engineering, Sun Yat-Sen University, China; Department of Electrical and Computer Engineering, Johns Hopkins University, USA; Pazhou Lab (HuangPu), Guangdong, China; Guangdong Province Key Laboratory of Information Security Technology, Guangdong, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China

ISBN: (digital) 9798350353006

ISBN: (print) 9798350353013

Existing person re-identification methods have achieved remarkable advances in appearance-based identity association across homogeneous cameras, such as ground-ground matching. However, as a more practical scenario, aerial-ground person re-identification (AGPReID) among heterogeneous cameras has received minimal attention. To alleviate the disruption of discriminative identity representation by dramatic view discrepancy as the most significant challenge in AGPReID, the view-decoupled transformer (VDT) is proposed as a simple yet effective framework. Two major components are designed in VDT to decouple view-related and view-unrelated features, namely hierarchical subtractive separation and orthogonal loss, where the former separates these two features inside the VDT, and the latter constrains these two to be independent. In addition, we contribute a large-scale AGPReID dataset called CARGO, consisting of five/eight aerial/ground cameras, 5,000 identities, and 108,563 images. Experiments on two datasets show that VDT is a feasible and effective solution for AGPReID, surpassing the previous method on mAP/Rank1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID, keeping the same magnitude of computational complexity. Our project is available at https://***/LinlyAC/VDT-AGPReID.
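One common way to constrain two feature sets to be independent is to penalize their cosine similarity. The sketch below is a minimal illustration under that assumption, not the paper's implementation: the abstract does not give the exact form of the orthogonal loss, and the names `cosine_similarity` and `orthogonal_loss` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def orthogonal_loss(view_related: Sequence[Sequence[float]],
                    view_unrelated: Sequence[Sequence[float]]) -> float:
    """Mean absolute cosine similarity over a batch of feature pairs.

    Assumed formulation: the loss is 0 when every pair is orthogonal
    and 1 when every pair is parallel.
    """
    sims = [abs(cosine_similarity(v, u))
            for v, u in zip(view_related, view_unrelated)]
    return sum(sims) / len(sims)


# Orthogonal pair gives zero loss; parallel pair gives the maximum loss.
loss_orthogonal = orthogonal_loss([[1.0, 0.0]], [[0.0, 1.0]])
loss_parallel = orthogonal_loss([[1.0, 0.0]], [[2.0, 0.0]])
print(loss_orthogonal, loss_parallel)
```

Minimizing such a term pushes the view-related and view-unrelated features toward orthogonal directions, which is one reading of "constrains these two to be independent".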

Keywords: Computer vision; Cameras; Transformers; Pattern recognition; Computational complexity; Identification of persons

Rotation Exploration Transformer for Aerial Person Re-identification

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Lei Wang; Quan Zhang; Junyang Qiu; Jianhuang Lai. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Guangdong Province Key Laboratory of Information Security Technology, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Pazhou Lab (HuangPu), Guangzhou, China

ISBN: (digital) 9798350390155

ISBN: (print) 9798350390162

Aerial person re-identification (AReID) focuses on accurately matching target person images within a UAV camera network. Challenges arise due to the broad field of view and arbitrary movement of UAVs, leading to foreground target rotation and background style variation. Existing AReID methods have provided limited solutions for the former, while the latter remains largely unexplored. This paper proposes a Rotation Exploration Vision Transformer (RoExViT) to tackle the aforementioned dual challenges. Specifically, we design Multiple Rotation Tokens (MRT) to explore diverse rotational representations at the feature level, addressing foreground target rotation. To handle background style variation, we propose a Cross-Camera Similarity (CCS) loss to effectively minimize the view gap among different cameras. Furthermore, we propose an Iteratively Adaptive Batch Construction (IABC) strategy to mitigate overfitting on small datasets. Extensive experiments show that our method outperforms the state-of-the-art methods on PRAI-1581 and UAV-Human while also exhibiting outstanding performance on Market1501.

Keywords: Computer vision; Transformers; Cameras; Autonomous aerial vehicles; Identification of persons

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Yanzuo Lu; Manlin Zhang; Andy J Ma; Xiaohua Xie; Jianhuang Lai. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Guangdong Province Key Laboratory of Information Security Technology, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Pazhou Lab (HuangPu), Guangzhou, China

ISBN: (digital) 9798350353006

ISBN: (print) 9798350353013

The diffusion model is a promising approach to image generation and has been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose, they are prone to overfitting due to the lack of a high-level semantic understanding of the source person image. In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages, thus circumventing the potential overfitting problem. To generate more realistic texture details, a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the art for PGPIS. Code is available at https://***/YanzuoLu/CFLD.

Keywords: Training; Image synthesis; Semantics; Text to image; Process control; Diffusion models; Generators

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

arXiv, 2025

Authors: Fu, Shenghao; Yang, Qize; Mo, Qijie; Yan, Junkai; Wei, Xihan; Meng, Jingke; Xie, Xiaohua; Zheng, Wei-Shi. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Tongyi Lab, Alibaba Group, China; Peng Cheng Laboratory, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China; China

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating an image-level detailed caption for each image, can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://***/iSEE-laboratory/LLMDet. Copyright © 2025, The Authors. All rights reserved.
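A training objective that combines a grounding loss with caption generation losses can be sketched as a weighted sum. This is a minimal illustration under stated assumptions: the abstract does not say how the terms are weighted, so the function `total_loss` and the parameters `region_weight` and `image_weight` are hypothetical.

```python
def total_loss(grounding_loss: float,
               region_caption_loss: float,
               image_caption_loss: float,
               region_weight: float = 1.0,
               image_weight: float = 1.0) -> float:
    """Weighted sum of a grounding loss and two caption generation losses.

    The weights are assumptions for illustration; the abstract only states
    that the objectives include a grounding loss and a caption generation loss.
    """
    caption_loss = (region_weight * region_caption_loss
                    + image_weight * image_caption_loss)
    return grounding_loss + caption_loss


# With unit weights the terms are simply added.
example = total_loss(grounding_loss=0.5,
                     region_caption_loss=0.25,
                     image_caption_loss=0.25)
print(example)
```

Setting either weight to zero disables the corresponding caption term, which is a convenient way to ablate region-level or image-level captioning in such a setup.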

Keywords: Object detection

Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models

arXiv, 2024

Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application on object detection is largely overlooked, especially without fine-tuning them. In this work, we show that frozen foundation models can be a versatile feature enhancer, even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in the following two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector’s decoder by providing a compact context. Additionally, the patch tokens in foundation models can enrich the features in the detector’s encoder by providing semantic details. Utilizing frozen foundation models as plug-and-play modules rather than the commonly used backbone can significantly enhance the detector’s performance while preventing the problems caused by the architecture discrepancy between the detector’s backbone and the foundation model. With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models respectively, on the COCO validation set after training for 12 epochs with R50 as the detector’s backbone. Copyright © 2024, The Authors. All rights reserved.

Keywords: Decoding

ViSpeak: Visual Instruction Feedback in Streaming Videos

arXiv, 2025

Authors: Fu, Shenghao; Yang, Qize; Li, Yuan-Ming; Peng, Yi-Xing; Lin, Kun-Yu; Wei, Xihan; Hu, Jian-Fang; Xie, Xiaohua; Zheng, Wei-Shi. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Tongyi Lab, Alibaba Group, China; Peng Cheng Laboratory, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China; China

Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research. Copyright © 2025, The Authors. All rights reserved.

Keywords: Video streaming

Frozen-DETR: enhancing DETR with image understanding from frozen foundation models

Proceedings of the 38th International Conference on Neural Information Processing Systems

Authors: Shenghao Fu; Junkai Yan; Qize Yang; Xihan Wei; Xiaohua Xie; Wei-Shi Zheng. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China and Tongyi Lab, Alibaba Group and Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Tongyi Lab, Alibaba Group; School of Computer Science and Engineering, Sun Yat-sen University, China and Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China and Guangdong Province Key Laboratory of Information Security Technology, China; School of Computer Science and Engineering, Sun Yat-sen University, China and Peng Cheng Laboratory, Shenzhen, China and Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China and Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China

ISBN: (print) 9798331314385

ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation

arXiv, 2023

Authors: Fu, Shenghao; Yan, Junkai; Gao, Yipeng; Xie, Xiaohua; Zheng, Wei-Shi. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Pengcheng Lab, China; Guangdong Province Key Laboratory of Information Security Technology, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China

Recent sparse detectors with multiple (e.g., six) decoder layers achieve promising performance but require much inference time due to complex heads. Previous works have explored using dense priors as initialization and built one-decoder-layer detectors. Although they gain remarkable acceleration, their performance still lags behind their six-decoder-layer counterparts by a large margin. In this work, we aim to bridge this performance gap while retaining fast speed. We find that the architecture discrepancy between dense and sparse detectors leads to feature conflict, hampering the performance of one-decoder-layer detectors. Thus, we propose the Adaptive Sparse Anchor Generator (ASAG), which predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem. For each image, ASAG dynamically selects which feature maps and which locations to predict, forming a fully adaptive way to generate image-specific anchors. Further, a simple and effective Query Weighting method eases the training instability from adaptiveness. Extensive experiments show that our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off. The code is available at https://***/iSEE-laboratory/ASAG. Copyright © 2023, The Authors. All rights reserved.
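Selecting "which feature maps and which locations to predict" can be pictured as a top-k choice over scores pooled from several feature-map levels. The sketch below is only one plausible reading, not the ASAG implementation: the abstract does not describe the selection rule, and the function `select_anchor_locations`, the per-location scores, and the level names `p3` and `p4` are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Candidate = Tuple[float, str, int, int]  # (score, level name, row, column)


def select_anchor_locations(score_maps: Dict[str, List[List[float]]],
                            k: int) -> List[Candidate]:
    """Return the k highest-scoring (score, level, row, col) entries.

    Scores from all feature-map levels compete in a single pool, so both
    the level and the location are chosen per image (an assumed rule).
    """
    candidates = [
        (score, level, row_idx, col_idx)
        for level, grid in score_maps.items()
        for row_idx, row in enumerate(grid)
        for col_idx, score in enumerate(row)
    ]
    # nlargest returns results sorted from highest to lowest score.
    return heapq.nlargest(k, candidates, key=lambda item: item[0])


# Toy scores for two hypothetical feature-map levels.
toy_scores = {
    "p3": [[0.1, 0.9], [0.2, 0.3]],
    "p4": [[0.8]],
}
selected = select_anchor_locations(toy_scores, k=2)
print(selected)
```

Because the pool is shared across levels, an image whose strongest responses sit on a coarse level would draw most of its anchors from that level, which matches the image-specific behavior the abstract describes.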

Keywords: Decoding

ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation

International Conference on Computer Vision (ICCV)

Authors: Shenghao Fu; Junkai Yan; Yipeng Gao; Xiaohua Xie; Wei-Shi Zheng. Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Guangdong Province Key Laboratory of Information Security Technology, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Pengcheng Lab, China

Recent sparse detectors with multiple (e.g., six) decoder layers achieve promising performance but require much inference time due to complex heads. Previous works have explored using dense priors as initialization and built one-decoder-layer detectors. Although they gain remarkable acceleration, their performance still lags behind their six-decoder-layer counterparts by a large margin. In this work, we aim to bridge this performance gap while retaining fast speed. We find that the architecture discrepancy between dense and sparse detectors leads to feature conflict, hampering the performance of one-decoder-layer detectors. Thus, we propose the Adaptive Sparse Anchor Generator (ASAG), which predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem. For each image, ASAG dynamically selects which feature maps and which locations to predict, forming a fully adaptive way to generate image-specific anchors. Further, a simple and effective Query Weighting method eases the training instability from adaptiveness. Extensive experiments show that our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off. The code is available at https://***/iSEE-laboratory/ASAG.
