ISBN (digital): 9798350390155
ISBN (print): 9798350390162
Aerial person re-identification (AReID) focuses on accurately matching target person images within a UAV camera network. Challenges arise from the broad field of view and arbitrary movement of UAVs, leading to foreground target rotation and background style variation. Existing AReID methods have provided limited solutions for the former, while the latter remains largely unexplored. This paper proposes a Rotation Exploration Vision Transformer (RoExViT) to tackle these dual challenges. Specifically, we design Multiple Rotation Tokens (MRT) to explore diverse rotational representations at the feature level, addressing foreground target rotation. To handle background style variation, we propose a Cross-Camera Similarity (CCS) loss that effectively minimizes the view gap among different cameras. Furthermore, we propose an Iteratively Adaptive Batch Construction (IABC) strategy to mitigate overfitting on small datasets. Extensive experiments show that our method outperforms state-of-the-art methods on PRAI-1581 and UAV-Human while also exhibiting outstanding performance on Market1501.
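The idea of pulling per-camera feature statistics together can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual CCS loss: the per-camera centroid averaging and pairwise squared-distance penalty are assumptions made for the sketch.

```python
import numpy as np

def cross_camera_similarity_loss(features, camera_ids):
    """Toy cross-camera similarity penalty: L2-normalize features,
    average them per camera, and take the mean squared distance
    over all pairs of camera centroids."""
    feats = np.asarray(features, dtype=np.float64)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cam_arr = np.asarray(camera_ids)
    cams = sorted(set(camera_ids))
    centroids = np.stack([feats[cam_arr == c].mean(axis=0) for c in cams])
    loss, pairs = 0.0, 0
    for i in range(len(cams)):
        for j in range(i + 1, len(cams)):
            loss += float(np.sum((centroids[i] - centroids[j]) ** 2))
            pairs += 1
    return loss / max(pairs, 1)
```

Driving such a loss toward zero encourages features from different cameras, and hence different background styles, to occupy the same region of the embedding space.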
Pseudo bounding box supervision is a promising approach for weakly supervised object localization (WSOL) with only image-level labels. However, the generated pseudo bounding boxes may be inaccurate or even completely ...
Incremental learning aims to overcome catastrophic forgetting when learning deep networks from sequential tasks. With impressive learning efficiency and performance, prompt-based methods adapt a fixed backbone to sequential tasks by learning task-specific prompts. However, existing prompt-based methods rely heavily on strong pretraining (typically on ImageNet-21k), and we find that their models can become trapped if the gap between the pretraining task and unknown future tasks is large. In this work, we develop a learnable Adaptive Prompt Generator (APG). The key is to unify the prompt retrieval and prompt learning processes into a learnable prompt generator, so that the whole prompting process can be optimized to effectively reduce the negative effects of the gap between tasks. To keep our APG from learning ineffective knowledge, we maintain a knowledge pool that regularizes APG with the feature distribution of each class. Extensive experiments show that our method significantly outperforms advanced methods in exemplar-free incremental learning without (strong) pretraining. Moreover, under strong pretraining, our method performs comparably to existing prompt-based models, showing that it can still benefit from pretraining. Code can be found at https://***/TOM-tym/APG
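The prompt-generation idea described above can be illustrated with a toy version: a generator maps a query feature to prompt tokens, which are then prepended to the ViT token sequence. The single linear layer, the dimensions, and the class name are illustrative assumptions, not the APG architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class AdaptivePromptGenerator:
    """Toy prompt generator: one linear map from a query feature
    to a set of prompt tokens (this map would be learnable and
    jointly optimized in a real model)."""

    def __init__(self, dim, n_prompts):
        self.dim = dim
        self.n_prompts = n_prompts
        self.W = rng.normal(0.0, 0.02, size=(dim, n_prompts * dim))

    def __call__(self, query):
        # query: (dim,) -> prompts: (n_prompts, dim)
        return (query @ self.W).reshape(self.n_prompts, self.dim)

def prepend_prompts(tokens, prompts):
    """Prepend generated prompts to a token sequence (seq_len, dim)."""
    return np.concatenate([prompts, tokens], axis=0)
```

Because the prompts are produced by a trainable function of the input rather than retrieved from a fixed pool, the whole prompting path stays differentiable, which is the property the abstract highlights.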
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating im...
Human skin can accurately sense subtle changes of both normal and shear forces. However, tactile sensors applied to robots are challenging in decoupling 3D forces due to the inability to develop adaptive models for co...
Strong demand exists for customizing a pretrained large text-to-image model, e.g. Stable Diffusion, to generate novel concepts, such as the users themselves. However, the newly added concept from previous customization methods often shows weaker combination abilities than the original ones, even when given several images during training. We thus propose a new personalization method that seamlessly integrates a unique individual into the pretrained diffusion model using just one facial photograph and only 1024 learnable parameters, in under 3 minutes. We can then effortlessly generate stunning images of this person in any pose or position, interacting with anyone and doing anything imaginable from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pretrained large text encoder. Then, given one facial photo as the target identity, we generate its embedding by optimizing the weights of this basis while locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model shows a better concept combination ability than previous personalization methods. Moreover, our model can learn several new identities at once and have them interact with each other, where previous customization models fail. Project page is at: http://***. Code is at: https://***/ygtxr1997/CelebBasis.
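The basis-weight optimization described above can be illustrated with a toy linear version: the identity embedding is a weighted combination of fixed basis vectors, and only the weights are updated. The basis size, dimensions, objective, and learning rate below are stand-ins for illustration, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "celeb basis": K fixed directions in a D-dim
# text-encoder embedding space (sizes are illustrative).
K, D = 16, 64
basis = rng.normal(size=(K, D))        # frozen basis vectors
target = rng.normal(size=D)            # stand-in target identity embedding
weights = np.zeros(K)                  # the only learnable parameters

def embed(w):
    # Identity embedding as a weighted combination of basis vectors.
    return w @ basis

# Gradient descent on 0.5 * ||embed(w) - target||^2, updating
# only the basis weights and nothing else.
lr = 0.005
for _ in range(200):
    err = embed(weights) - target
    weights -= lr * (basis @ err)      # gradient w.r.t. the weights
```

Because only the K weights move while the rest of the model stays frozen, the original concepts remain intact, which is the property the abstract credits for the improved concept combination ability.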
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-m...
Internet of Things (IoT) is growing with various applications linked in, and node failures are becoming more common as a result of malicious strikes and other issues. The cascading collapse induced by local node failu...
In unsupervised 3D face reconstruction, existing methods that model the canonical face typically exclude the skip connections between encoder-decoder pairs. Consequently, they have difficulty capturing the appearance details necessary for the task. However, directly applying the original skip connections merely causes these methods to degrade into a trivial 2D texture reconstruction algorithm. In this paper, we propose novel Reprogramming Skip Connections (RSCs), which avoid this degradation and improve 3D face reconstruction quality. Specifically, the proposed method filters out the inappropriate information causing degradation by aggregating the encoder features over spatial dimensions into several prototypes. These prototypes, which preserve beneficial information, are then combined with the corresponding decoder features with the help of expansion masks. Furthermore, we design a mask reconstruction consistency loss to improve the quality of the expansion masks. Our experiments verify the superiority of our method over other competitors.
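The aggregate-then-expand idea behind reprogrammed skip connections can be sketched roughly as follows. The softmax-attention aggregation, the soft assignment mask, and the additive injection are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def aggregate_prototypes(enc_feat, queries):
    """Aggregate flattened encoder features (HW, C) into a few
    prototypes via softmax attention against query vectors (K, C),
    which would be learnable in a real model."""
    attn = queries @ enc_feat.T                    # (K, HW) similarity scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1
    return attn @ enc_feat                         # (K, C) prototypes

def inject_prototypes(dec_feat, prototypes, mask):
    """Expand prototypes back to spatial positions with a soft
    assignment mask (HW, K) and add them to decoder features (HW, C)."""
    return dec_feat + mask @ prototypes
```

Routing the skip connection through a handful of prototypes instead of the raw feature map is what blocks the decoder from simply copying the input image.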