Search Results - Inner Mongolia University Library

Self-Supervised Pretraining for Fine-Grained Plankton Recognition

arXiv, 2025

Authors: Kareinen, Joona; Eerola, Tuomas; Kraft, Kaisa; Lensu, Lasse; Suikkanen, Sanna; Kälviäinen, Heikki

Affiliations: LUT University, Computer Vision and Pattern Recognition Laboratory, Lappeenranta, Finland; Finnish Environment Institute, Helsinki, Finland

Plankton recognition is an important computer vision problem due to plankton’s essential role in ocean food webs and carbon capture, highlighting the need for species-level monitoring. However, this task is challenging due to its fine-grained nature and dataset shifts caused by different imaging instruments and varying species distributions. As new plankton image datasets are collected at an increasing pace, there is a need for general plankton recognition models that require minimal expert effort for data labeling. In this work, we study large-scale self-supervised pretraining for fine-grained plankton recognition. We first employ masked autoencoding and a large volume of diverse plankton image data to pretrain a general-purpose plankton image encoder. Then we utilize fine-tuning to obtain accurate plankton recognition models for new datasets with a very limited number of labeled training images. Our experiments show that self-supervised pretraining with diverse plankton data clearly increases plankton recognition accuracy compared to standard ImageNet pretraining when the amount of training data is limited. Moreover, the accuracy can be further improved when unlabeled target data is available and utilized during the pretraining. © 2025, CC BY.

Keywords: Carbon sequestration

Open-Set Plankton Recognition

arXiv, 2025

Authors: Kareinen, Joona; Skyttä, Annaliina; Eerola, Tuomas; Kraft, Kaisa; Lensu, Lasse; Suikkanen, Sanna; Lehtiniemi, Maiju; Kälviäinen, Heikki

Affiliations: Computer Vision and Pattern Recognition Laboratory, Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland; Finnish Environment Institute, Helsinki, Finland

This paper considers open-set recognition (OSR) of plankton images. Plankton include a diverse range of microscopic aquatic organisms that have an important role in marine ecosystems as primary producers and as a base of food webs. Given their sensitivity to environmental changes, fluctuations in plankton populations offer valuable information about oceans’ health and climate change, motivating their monitoring. Modern automatic plankton imaging devices enable the collection of large-scale plankton image datasets, facilitating species-level analysis. Plankton species recognition can be seen as an image classification task and is typically solved using deep learning-based image recognition models. However, data collection in real aquatic environments results in imaging devices capturing a variety of non-plankton particles and plankton species not present in the training set. This creates a challenging fine-grained OSR problem, characterized by subtle differences between taxonomically close plankton species. We address this challenge by conducting extensive experiments on three OSR approaches using both phyto- and zooplankton images, also analyzing the effect of the rejection thresholds for OSR. The results demonstrate that high OSR accuracy can be obtained, promoting the use of these methods in operational plankton research. We have made the data publicly available to the research community. © 2025, CC BY.

Keywords: Biotic

Multi-stage query-based feature generating and encoding for robust early action recognition

Visual Computer, 2025, pp. 1-17

Authors: Chen, Jie; Pan, Wei-Xiang; Zhang, Hong-Bo; Lin, Ming-Xuan; Lei, Qing; Liu, Jing-Hua

Affiliations: Department of Computer Science and Technology, Huaqiao University, Xiamen 361021, China; Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University, Xiamen 361021, China; Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen 361021, China

This paper proposes a novel early action recognition (EAR) method based on multi-stage query-based feature generation and encoding. Existing EAR approaches often struggle with the similarity of initial action features, making it challenging to accurately extract discriminative information. To address this, our method divides the unobserved feature reconstruction process into multiple sub-stages, with each stage focusing on generating and restoring a small segment of action information. This segmentation strategy ensures that the reconstructed action information is more consistent with real-world scenarios. Furthermore, we introduce a query encoding network to model the relationships between sub-stage action features, effectively integrating them to enrich feature representations, enhance model generalization, and improve sequence coherence. Experimental results on public datasets HMDB51 and UCF101 demonstrate that our method significantly outperforms existing methods, achieving robust and accurate early action recognition. In particular, its ability to handle action progression under different observation rates reflects the robustness of the proposed method. The code of this work is publicly available at https://***/Chenjie0921/Multiple-stage. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2025.

Keywords: Early action recognition; Multi-stage query; Query encoding; Unobserved feature reconstruction

CodePhys: Robust Video-Based Remote Physiological Measurement Through Latent Codebook Querying

IEEE Journal of Biomedical and Health Informatics, 2025, vol. PP, pp. PP

Authors: Chu, Shuyang; Xia, Menghan; Yuan, Mengyao; Liu, Xin; Seppanen, Tapio; Zhao, Guoying; Shi, Jingang

Affiliations: Xi'an Jiaotong University, School of Software Engineering, Xi'an, China; Tencent AI Lab, Shenzhen, China; Lappeenranta-Lahti University of Technology LUT, Computer Vision and Pattern Recognition Laboratory, Lappeenranta 53850, Finland; University of Oulu, Center for Machine Vision and Signal Analysis, Finland

Remote photoplethysmography (rPPG) aims to measure non-contact physiological signals from facial videos, which has shown great potential in many applications. Most existing methods directly extract video-based rPPG features by designing neural networks for heart rate estimation. Although they can achieve acceptable results, the recovery of rPPG signal faces intractable challenges when interference from real-world scenarios takes place on facial video. Specifically, facial videos are inevitably affected by non-physiological factors (e.g., camera device noise, defocus, and motion blur), leading to the distortion of extracted rPPG signals. Recent rPPG extraction methods are easily affected by interference and degradation, resulting in noisy rPPG signals. In this paper, we propose a novel method named CodePhys, which innovatively treats rPPG measurement as a code query task in a noise-free proxy space (i.e., codebook) constructed by ground-truth PPG signals. We consider noisy rPPG features as queries and generate high-fidelity rPPG features by matching them with noise-free PPG features from the codebook. Our approach also incorporates a spatial-aware encoder network with a spatial attention mechanism to highlight physiologically active areas and uses a distillation loss to reduce the influence of non-periodic visual interference. Experimental results on four benchmark datasets demonstrate that CodePhys outperforms state-of-the-art methods in both intra-dataset and cross-dataset settings. © 2025 IEEE.

Keywords: Heart

CodePhys: Robust Video-based Remote Physiological Measurement through Latent Codebook Querying

arXiv, 2025

Authors: Chu, Shuyang; Xia, Menghan; Yuan, Mengyao; Liu, Xin; Seppanen, Tapio; Zhao, Guoying; Shi, Jingang

Affiliations: The School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; The Tencent AI Lab, Shenzhen, China; The Computer Vision and Pattern Recognition Laboratory, Lappeenranta-Lahti University of Technology LUT, Lappeenranta 53850, Finland; The Center for Machine Vision and Signal Analysis, University of Oulu, Finland

Keywords: Heart

SIR-HCL: Semantic-Inconsistency Reasoning and Hybrid Contrastive Learning for Efficient Cross-Emotion Anomaly Detection

IEEE Transactions on Cognitive and Developmental Systems, 2025

Authors: Liu, Xin; Chen, Qiyan; Cheung, Yiu-Ming; Peng, Shu-Juan

Affiliations: Huaqiao University, Department of Computer Science, Xiamen 361021, China; Hong Kong Baptist University, Department of Computer Science, Hong Kong SAR, Hong Kong; Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Xiamen 361021, China; Huaqiao University, Fujian Key Laboratory of Big Data Intelligence and Security, Xiamen 361021, China; Huaqiao University, Department of Artificial Intelligence, Xiamen, China; Fujian Province University Key Laboratory of Computer Vision and Machine Learning, Huaqiao University, Xiamen 361021, China

Cross-emotion anomaly detection is an emerging and challenging research topic in the cognitive analysis field, which aims at identifying abnormal emotion pairs whose semantic patterns are inconsistent across different emotional modalities. To the best of our knowledge, this topic has yet to be well studied, although it could benefit many valuable cognitive applications such as autistic children diagnosis and criminal deception detection. To this end, this paper proposes an efficient cross-emotion anomaly detection approach via semantic-inconsistency reasoning and hybrid contrastive learning (SIR-HCL), which is the first attempt to detect anomalous emotional pairs across audio-visual emotions. First, the proposed framework utilizes a dual-branch network to obtain the deep emotional features in each modality, and then employs a shared residual block to derive the semantically compatible features. Subsequently, an efficient hybrid contrastive learning approach is designed to enlarge the semantic inconsistency among abnormal emotional pairs with different affective classes, while enhancing the semantic consistency and increasing the feature correlation between normal emotional pairs from the same affective class. At the same time, an efficient bidirectional learning scheme is employed to significantly improve the data utilization, and a two-component Beta Mixture Model is adaptively utilized to reason about the anomalous emotion pairs. Extensive experiments on two benchmark datasets show that the proposed SIR-HCL method can reliably detect anomalous emotional pairs across audio-visual emotional data, and brings substantial improvements over the state-of-the-art competing methods. © 2016 IEEE.

Keywords: Contrastive Learning

Zero-Shot Audio Captioning Using Soft and Hard Prompts

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 2045-2058

Authors: Yiming Zhang; Xuenan Xu; Ruoyi Du; Haohe Liu; Yuan Dong; Zheng-Hua Tan; Wenwu Wang; Zhanyu Ma

Affiliations: Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.; Department of Electronic Systems, Aalborg University, Aalborg, Denmark

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test set from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, and this issue has received little attention. To address these issues, we propose a new zero-shot method for audio captioning. Our method is built on the contrastive language-audio pre-training (CLAP) model. During training, the model reconstructs the ground-truth caption using the CLAP text encoder. In the inference stage, the model generates text descriptions from the CLAP audio embeddings of given audio inputs. To enhance the ability of the model in transitioning from text-to-text generation to audio-to-text generation, we propose to use the mixed-augmentations-based soft prompt to learn more robust latent representations, leveraging instance replacement and embedding augmentation. Additionally, we introduce the retrieval-based acoustic-aware hard prompt to improve the cross-domain performance of the model by employing the domain-agnostic label information of sound events. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.

Keywords: Training; Decoding; Semantics; Data models; Acoustics; Electronic mail; Benchmark testing; Transformers; Robustness; Perturbation methods

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

arXiv, 2025

Authors: Chen, Boyu; Yue, Zhengrong; Chen, Siran; Wang, Zikang; Liu, Yang; Li, Peng; Wang, Yali

Affiliations: Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Tsinghua University, Beijing, China; Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, China; Shanghai Jiao Tong University, China

Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools (e.g., search engines, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video-related questions and exchange reasons. 4) Reflection: We evaluate each agent’s performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through this multi-round dynamic collaboration. LVAgent is the first agent system that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% compared with SOTA. © 2025, CC BY-NC-SA.

Keywords: Open systems

Revisiting the Generalization Problem of Low-Level Vision Models Through the Lens of Image Deraining

arXiv, 2025

Authors: Hu, Jinfan; You, Zhiyuan; Gu, Jinjin; Zhu, Kaiwen; Xue, Tianfan; Dong, Chao

Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; University of Chinese Academy of Sciences, Beijing 100049, China; The Chinese University of Hong Kong, 999077, Hong Kong; The University of Sydney, NSW 2006, Australia; Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China; Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Shenzhen University of Advanced Technology, Shenzhen 518055, China

Generalization remains a significant challenge for low-level vision models, which often struggle with unseen degradations in real-world scenarios despite their success in controlled benchmarks. In this paper, we revisit the generalization problem in low-level vision models. Image deraining is selected as a case study due to its well-defined and easily decoupled structure, allowing for more effective observation and analysis. Through comprehensive experiments, we reveal that the generalization issue is not primarily due to limited network capacity but rather the failure of existing training strategies, which lead networks to overfit specific degradation patterns. Our findings show that guiding networks to focus on learning the underlying image content, rather than the degradation patterns, is key to improving generalization. We demonstrate that balancing the complexity of background images and degradations in the training data helps networks better fit the image distribution. Furthermore, incorporating content priors from pre-trained generative models significantly enhances generalization. Experiments on both image deraining and image denoising validate the proposed strategies. We believe the insights and solutions will inspire further research and improve the generalization of low-level vision models. Copyright © 2025, The Authors. All rights reserved.

Keywords: Image denoising