Search Results - Inner Mongolia University Library

AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction

39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025

Authors: Zou, Pufan; Zhao, Shijia; Huang, Weijie; Xia, Qiming; Wen, Chenglu; Li, Wei; Wang, Cheng. Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China; Inceptio, United States

ISBN: (print) 157735897X

Recently, Visual Foundation Models (VFMs) have shown a remarkable generalization performance in 3D perception tasks. However, their effectiveness in large-scale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross-modal Label Generation Module (CLGM), providing cross-modal supervision with the formidable interpretive capabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise Corrector (ANC), updating and adjusting the noisy samples within this supervision iteratively during training. Moreover, we develop an Adaptive Robust Loss (ARL) function to modulate each sample's sensitivity to noisy supervision, preventing potential underfitting issues associated with robust loss. Our proposed AdaCo can effectively mitigate the performance limitations of label-free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method. Copyright © 2025, Association for the Advancement of Artificial Intelligence (***). All rights reserved.

Keywords: Semantic Segmentation

Semi-Supervised Panoptic Narrative Grounding

arXiv, 2023

Authors: Yang, Danni; Ji, Jiayi; Sun, Xiaoshuai; Wang, Haowei; Li, Yinan; Ma, Yiwei; Ji, Rongrong. Affiliation: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Fujian, Xiamen, China

Despite considerable progress, the advancement of Panoptic Narrative Grounding (PNG) remains hindered by costly annotations. In this paper, we introduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG) learning scheme, capitalizing on a smaller set of labeled image-text pairs and a larger set of unlabeled pairs to achieve competitive performance. Unlike visual segmentation tasks, PNG involves one pixel belonging to multiple open-ended nouns. As a result, existing multi-class based semi-supervised segmentation frameworks cannot be directly applied to this task. To address this challenge, we first develop a novel SS-PNG Network (SS-PNG-NW) tailored to the SS-PNG setting. We thoroughly investigate strategies such as Burn-In and data augmentation to determine the optimal generic configuration for the SS-PNG-NW. Additionally, to tackle the issue of imbalanced pseudo-label quality, we propose a Quality-Based Loss Adjustment (QLA) approach to adjust the semi-supervised objective, resulting in an enhanced SS-PNG-NW+. Employing our proposed QLA, we improve BCE loss and Dice loss at the pixel and mask levels, respectively. We conduct extensive experiments on PNG datasets, with our SS-PNG-NW+ demonstrating promising results comparable to fully-supervised models across all data ratios. Remarkably, our SS-PNG-NW+ outperforms fully-supervised models with only 30% and 50% supervision data, exceeding their performance by 0.8% and 1.1% respectively. This highlights the effectiveness of our proposed SS-PNG-NW+ in overcoming the challenges posed by limited annotations and enhancing the applicability of PNG tasks. The source code is available at https://***/nini0919/SSPNG. © 2023, CC BY-NC-ND.

Keywords: Pixels

StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization

arXiv, 2024

Authors: Zhang, Jinlu; Tang, Jiji; Zhang, Rongsheng; Lv, Tangjie; Sun, Xiaoshuai. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Fuxi AI Lab, Netease Inc

Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle with maintaining a balance between character identity preservation and text-semantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle ... Copyright © 2024, The Authors. All rights reserved.

Keywords: Knowledge graph

UniPTS: A Unified Framework for Proficient Post-Training Sparsity

arXiv, 2024

Authors: Xie, Jingjing; Zhang, Yuxin; Lin, Mingbao; Lin, Zhihang; Cao, Liujuan; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China; Tencent Youtu Lab, China

Post-training Sparsity (PTS) is a recently emerged avenue that chases efficient network sparsity with limited data in need. Existing PTS methods, however, undergo significant performance degradation compared with traditional methods that retrain the sparse networks via the whole dataset, especially at high sparsity ratios. In this paper, we attempt to reconcile this disparity by transposing three cardinal factors that profoundly alter the performance of conventional sparsity into the context of PTS. Our endeavors particularly comprise (1) A base-decayed sparsity objective that promotes efficient knowledge transferring from dense network to the sparse counterpart. (2) A reducing-regrowing search algorithm designed to ascertain the optimal sparsity distribution while circumventing overfitting to the small calibration set in PTS. (3) The employment of dynamic sparse training predicated on the preceding aspects, aimed at comprehensively optimizing the sparsity structure while ensuring training stability. Our proposed framework, termed UniPTS, is validated to be much superior to existing PTS methods across extensive benchmarks. As an illustration, it amplifies the performance of POT, a recently proposed recipe, from 3.9% to 68.6% when pruning ResNet-50 at 90% sparsity ratio on ImageNet. We release the code of our paper at https://***/xjjxmu/UniPTS. © 2024, CC0.

Keywords: (none listed)

DMAD: Dual Memory Bank for Real-World Anomaly Detection

arXiv, 2024

Authors: Hu, Jianlong; Chen, Xu; Gan, Zhenye; Peng, Jinlong; Zhang, Shengchuan; Zhang, Jiangning; Wang, Yabiao; Wang, Chengjie; Cao, Liujuan; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China; Youtu Lab, Tencent, China

Training a unified model is considered to be more suitable for practical industrial anomaly detection scenarios due to its generalization ability and storage efficiency. However, this multi-class setting, which exclusively uses normal data, overlooks the few but important accessible annotated anomalies in the real world. To address the challenge of real-world anomaly detection, we propose a new framework named Dual Memory bank enhanced representation learning for Anomaly Detection (DMAD). This framework handles both unsupervised and semi-supervised scenarios in a unified (multi-class) setting. DMAD employs a dual memory bank to calculate feature distance and feature attention between normal and abnormal patterns, thereby encapsulating knowledge about normal and abnormal instances. This knowledge is then used to construct an enhanced representation for anomaly score learning. We evaluated DMAD on the MVTec-AD and VisA datasets. The results show that DMAD surpasses current state-of-the-art methods, highlighting DMAD’s capability in handling the complexities of real-world anomaly detection scenarios. The code will be made available. Copyright © 2024, The Authors. All rights reserved.

Keywords: Anomaly detection

DSMNet: Deep High-Precision 3-D Surface Modeling from Sparse Point Cloud Frames

IEEE Geoscience and Remote Sensing Letters, 2023, vol. 20, pp. 1-1

Authors: Qiu, Changjie; Wang, Zhiyong; Lin, Xiuhong; Zang, Yu; Wang, Cheng; Liu, Weiquan. Affiliation: Xiamen University, Fujian Key Laboratory of Sensing and Computing for Smart Cities, The Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen 361005, China

Existing point cloud modeling datasets primarily express the modeling precision by pose or trajectory precision rather than the point cloud modeling effect itself. To meet this demand, we first independently construct a LiDAR system with an optical stage, and then use it to build HPMB, a High-Precision, Multi-Beam, real-world dataset. Second, we propose a modeling evaluation method based on HPMB for object-level modeling to overcome this limitation. In addition, the existing point cloud modeling methods tend to generate continuous skeletons of the global environment, hence lacking attention to the shape of complex objects. To tackle this challenge, we propose a novel learning-based joint framework, DSMNet, for high-precision 3-D surface modeling from sparse point cloud frames. DSMNet comprises density-aware point cloud registration (PCR) and geometry-aware point cloud sampling (PCS) to effectively learn the implicit structure feature of sparse point clouds. Extensive experiments demonstrate that DSMNet outperforms the state-of-the-art methods in PCS and PCR on the Multi-View Partial Point Cloud (MVP) database. Furthermore, the experiments on the open-source KITTI and our proposed HPMB datasets show that DSMNet can be generalized as a postprocessing step for simultaneous localization and mapping (SLAM), thereby improving modeling precision in environments with sparse point clouds. © 2004-2012 IEEE.

Keywords: Optical radar

Improving Multilingual Sign Language Translation with Automatically Clustered Language Family Information

31st International Conference on Computational Linguistics, COLING 2025

Authors: Zhang, Ruiquan; Hu, Cong; Yu, Pei; Chen, Yidong. Affiliations: Department of Artificial Intelligence, School of Informatics, Xiamen University, 361005, China; Ministry of Culture and Tourism, 361005, China; National Language Resources Monitoring and Research Center for Education and Teaching Media, Xiamen University, 361005, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China

ISBN: (print) 9798891761964

Sign Language Translation (SLT) bridges the communication gap between deaf and hearing individuals by converting sign language videos into spoken language texts. While most SLT research has focused on bilingual translation models, the recent surge in interest has led to the exploration of Multilingual Sign Language Translation (MSLT). However, MSLT presents unique challenges due to the diversity of sign languages across nations. This diversity can lead to cross-linguistic conflicts and hinder translation accuracy. To exploit the similarity of actions and semantics between sign languages and alleviate these conflicts, we propose a novel approach that leverages sign language families to improve MSLT performance. Sign languages are clustered into families automatically based on their language distribution in the MSLT network. We compare the results of our proposed family clustering method with the analysis conducted by sign language linguists and then train dedicated translation models for each family in the many-to-one translation scenario. Our experiments on the SP-10 dataset demonstrate that our approach can achieve a balance between translation accuracy and computational cost by regulating the number of language families. The source codes and models are available at FamilyCluST. © 2025 Association for Computational Linguistics.

Keywords: Semantics

Efficient Infrared Image Super-Resolution Reconstruction via Guided Filter Coefficients Estimation with Parallax Attention Mechanism

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qingyao Wu; Bosheng Chen; Chen Li; Xiaotong Tu; Xinghao Ding; Yue Huang. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, P.R. China; National Key Laboratory of Infrared Detection Technologies, Shanghai, China

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Due to the spectral range mismatch between the images, building an efficient infrared (IR) image super-resolution algorithm suitable for embedded devices remains a significant challenge. Given that visible images possess more abundant high-frequency information than infrared images, we use visible light to guide infrared image super-resolution reconstruction. Specifically, we recast the reconstruction task as a guided filter learning process, whose coefficients are estimated by joint learning on visible and infrared images to complete the reconstruction through homologous constraints. To predict the guided filter coefficients efficiently, we design a lightweight network which incorporates reparameterized differential convolution blocks and a feature fusion strategy. To enhance the performance of the fusion strategy, we use a parallax attention mechanism to solve the non-pixel registration problem between infrared and visible images. Extensive experiments on two challenging IR image datasets show that our method outperforms current state-of-the-art approaches in terms of PSNR, SSIM and LPIPS, while demonstrating its effectiveness and practicality on the RK3588 edge platform.

Keywords: Measurement; Attention mechanisms; Convolution; Image edge detection; Superresolution; Network architecture; Filtering algorithms; Information filters; Speech processing; Image reconstruction
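
The guided-filter formulation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the classical closed-form guided filter (output = a * guide + b) on a 1-D signal with a single global window; the paper instead estimates the coefficients with a lightweight network, which is not reproduced here. The function names `guided_filter_coeffs` and `apply_guided_filter` are hypothetical.

```python
def guided_filter_coeffs(guide, target, eps=1e-4):
    """Closed-form linear coefficients (a, b) so that a * guide + b fits target.

    Assumption: one global window over a 1-D signal. The abstract describes
    learned coefficients; this classical estimate is only a stand-in.
    """
    n = len(guide)
    if n == 0 or n != len(target):
        raise ValueError("guide and target must be non-empty and equal length")
    mean_g = sum(guide) / n
    mean_t = sum(target) / n
    # Covariance between guide and target, and variance of the guide.
    cov = sum((g - mean_g) * (t - mean_t) for g, t in zip(guide, target)) / n
    var = sum((g - mean_g) ** 2 for g in guide) / n
    a = cov / (var + eps)       # eps regularizes flat regions of the guide
    b = mean_t - a * mean_g
    return a, b


def apply_guided_filter(guide, a, b):
    """Reconstruct the output as a linear transform of the guide signal."""
    return [a * g + b for g in guide]


if __name__ == "__main__":
    visible = [0.0, 1.0, 2.0, 3.0]    # hypothetical guide (visible) samples
    infrared = [1.0, 3.0, 5.0, 7.0]   # hypothetical target (infrared) samples
    a, b = guided_filter_coeffs(visible, infrared, eps=0.0)
    print(a, b, apply_guided_filter(visible, a, b))
```

In this toy case the target is an exact linear function of the guide, so the coefficients recover that relation; on real images the coefficients vary per window, which is the part the paper learns.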

What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

arXiv, 2025

Authors: Jiang, Yutao; Wu, Qiong; Lin, Wenhao; Yu, Wei; Zhou, Yiyi. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Institute of Artificial Intelligence, Xiamen University, 361005, China

Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after the iterations, which can be foreground or background, are kept for MLLMs. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce the FLOPs of LLaVA-NeXT by 63.57% on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively. Our code is available at https://***/jytmelon/G-Prune. © 2025, CC0.

Keywords: Visual BASIC
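
The graph-based pruning idea in the abstract above (tokens as nodes, similarity-weighted links, iterative propagation, keep the top-scoring tokens) can be sketched roughly as follows. The abstract does not specify the propagation rule, so this sketch assumes a PageRank-style update with cosine-similarity edge weights; the function `prune_tokens` and its `damping` and `iterations` parameters are hypothetical and not taken from G-Prune.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def prune_tokens(tokens, keep, iterations=20, damping=0.85):
    """Return sorted indices of the `keep` tokens with the highest propagated score.

    Assumption: PageRank-style propagation stands in for the unspecified
    information-flow rule described in the abstract.
    """
    n = len(tokens)
    if n == 0:
        return []
    # Edge weights: non-negative cosine similarity, no self-loops.
    w = [[max(cosine(tokens[i], tokens[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Score flowing into node j, normalized by each sender's total weight.
            incoming = sum(scores[i] * w[i][j] / out_sum[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Hypothetical 2-D token embeddings: three similar tokens and one outlier.
    toks = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
    print(prune_tokens(toks, keep=3))
```

Because the update only ranks tokens by how much similarity-weighted score reaches them, weakly connected tokens are dropped first; how G-Prune balances foreground and background tokens is not captured by this simplification.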