Wide-range multiscale object detection for multispectral scene perception from a drone perspective is challenging. Previous RGB-T perception methods directly reuse backbones pretrained on RGB images for thermal-infrared feature extraction, leading to an unexpected domain shift. We propose a novel multimodal feature-guided masked reconstruction pretraining method, named M2FP, which learns transferable representations for drone-based RGB-T environmental perception tasks without domain bias. This article makes two key contributions. 1) We design a cross-modal feature interaction module in M2FP that encourages the modality-specific backbones to actively learn cross-modal feature representations and avoids modality bias. 2) We design a global-aware feature interaction and fusion module suitable for various downstream tasks, which enhances the model's environmental perception from a global perspective in wide-range drone-based scenes. We fine-tune M2FP on a drone-based object detection dataset (DroneVehicle) and a semantic segmentation dataset (Kust4K). On these two tasks, M2FP achieves state-of-the-art performance, surpassing the second-best methods by 1.8% in mean average precision and 0.9% in mean intersection over union, respectively.
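The mask-and-reconstruct core that pretraining methods such as M2FP build on can be sketched in a few lines of numpy. This is an illustrative sketch only: the function names are hypothetical, and the cross-modal interaction and fusion modules specific to M2FP are omitted.

```python
import numpy as np

def random_mask_patches(tokens, mask_ratio, rng):
    """Randomly hide a fraction of patch tokens, MAE-style.
    tokens: (N, D). Returns visible tokens and a boolean mask
    (True = masked)."""
    n = tokens.shape[0]
    n_masked = int(round(n * mask_ratio))
    order = rng.permutation(n)
    mask = np.zeros(n, dtype=bool)
    mask[order[:n_masked]] = True
    return tokens[~mask], mask

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed only on the masked tokens:
    the decoder must reconstruct what the encoder never saw."""
    return (((pred - target) ** 2)[mask]).mean()

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))   # 16 patches, 8-dim each
visible, mask = random_mask_patches(tokens, 0.75, rng)
```

With a 0.75 mask ratio, only 4 of the 16 tokens reach the encoder; the loss is evaluated exclusively on the 12 hidden ones.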
Accurately segmenting tubular structures, such as blood vessels or nerves, has significant clinical implications across various medical applications. However, existing methods often fall short in topological performance, particularly in preserving connectivity. To address this challenge, we propose a novel deep-learning approach, termed Deep Closing, inspired by the well-established classic closing operation. Deep Closing first leverages an autoencoder trained in the masked image modeling (MIM) paradigm, enhanced with digital-topology knowledge, to learn the inherent shape prior of tubular structures and indicate potentially disconnected regions. Subsequently, a Simple Components Erosion module is employed to generate topology-focused outcomes, refining the preceding segmentation results and ensuring that all generated regions are topologically significant. To evaluate the efficacy of Deep Closing, we conduct comprehensive experiments on four datasets: DRIVE, CHASE_DB1, DCA1, and CREMI. The results demonstrate that our approach yields considerable improvements in topological performance compared with existing methods. Furthermore, Deep Closing generalizes and transfers knowledge from external datasets, showcasing its robustness and adaptability. The code for this paper is available at: https://***/5k5000/DeepClosing.
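The classic closing operation that inspires Deep Closing is dilation followed by erosion, which bridges small gaps in a binary mask. A minimal numpy sketch with a 3x3 square structuring element (border treated as foreground for erosion, a common convention):

```python
import numpy as np

def dilate(img):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = img.shape
    p = np.pad(img, 1, constant_values=False)
    out = np.zeros_like(img)
    for di in range(3):
        for dj in range(3):
            out |= p[di:di + h, dj:dj + w]
    return out

def erode(img):
    """Binary erosion with a 3x3 square (border treated as foreground)."""
    h, w = img.shape
    p = np.pad(img, 1, constant_values=True)
    out = np.ones_like(img)
    for di in range(3):
        for dj in range(3):
            out &= p[di:di + h, dj:dj + w]
    return out

def closing(img):
    """Classic closing = dilation then erosion; bridges small gaps,
    the behavior Deep Closing generalizes with a learned shape prior."""
    return erode(dilate(img))
```

Applied to a one-pixel-thick segment with a one-pixel gap, the gap is bridged while all original foreground pixels are kept (closing is extensive).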
Existing image semantic segmentation models have low accuracy when detecting tiny targets or multiple targets in overlapping regions. This work proposes a hybrid vision transformer with a unified-perceptual-parsing network (ViT-UperNet) for medical image segmentation. A self-attention mechanism is embedded in a vision transformer to extract multi-level features. Image features are extracted hierarchically from low to high dimensions using four groups of Transformer blocks of different depths. A unified-perceptual-parsing network based on a feature pyramid network (FPN) and a pyramid pooling module (PPM) then fuses the multi-scale contextual features and performs semantic segmentation. The FPN naturally exploits hierarchical features and generates strong semantic information at all scales; the PPM uses global prior knowledge to understand complex scenes and extracts features with global context to improve segmentation. During training, a scalable self-supervised learner, the masked autoencoder, is used for pre-training, which strengthens visual representation ability and improves the efficiency of feature learning. Experiments are conducted on cardiac magnetic resonance image segmentation, where the left and right atria and ventricles are segmented. The pixel accuracy is 93.85%, the Dice coefficient is 92.61%, and the Hausdorff distance is 11.16, all improved over the compared methods. The results show the superiority of ViT-UperNet in medical image segmentation, especially for hard-to-recognize and heavily occluded targets.
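The pyramid pooling idea used here can be sketched with plain numpy: pool the feature map at several grid sizes, upsample each pooled grid back (nearest neighbor), and stack the branches with the input. A single-channel sketch, assuming the standard PSPNet bin sizes (1, 2, 3, 6); the 1x1 convolutions and concatenation of a full PPM are omitted.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool an (H, W) feature map into a (bins, bins) grid."""
    h, w = feat.shape
    out = np.zeros((bins, bins))
    for i in range(bins):
        for j in range(bins):
            r0, r1 = i * h // bins, (i + 1) * h // bins
            c0, c1 = j * w // bins, (j + 1) * w // bins
            out[i, j] = feat[r0:r1, c0:c1].mean()
    return out

def pyramid_pooling(feat, scales=(1, 2, 3, 6)):
    """Pool at several grid sizes, upsample back (nearest neighbor),
    and stack with the input: the context branches of a PPM."""
    h, w = feat.shape
    branches = [feat]
    for s in scales:
        pooled = adaptive_avg_pool(feat, s)
        rows = np.arange(h) * s // h
        cols = np.arange(w) * s // w
        branches.append(pooled[np.ix_(rows, cols)])
    return np.stack(branches)
```

The scale-1 branch is simply the global average broadcast over the map, which is how the PPM injects whole-image context into every pixel.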
Detecting anomalies in manufacturing processes is crucial for ensuring safety. However, noise significantly undermines the reliability of data-driven anomaly detection models. To address this challenge, we propose a slow feature-constrained decomposition autoencoder (SFC-DAE) for anomaly detection in noisy scenarios. Considering that the process can exhibit both long-term trends and periodic properties, the process data is decomposed into trends and cycles. The repetitive information is mitigated by slicing and randomly masking certain trends and cycles. Dependencies among slices are constructed to extract intrinsic information, while high-frequency noise is reduced using a slow feature-constrained loss. Anomalies are detected and localized through a reconstruction error strategy. The effectiveness of SFC-DAE is demonstrated using data from a sugar factory and a secure water treatment system.
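Two of the building blocks described above, trend/cycle decomposition and reconstruction-error scoring, can be sketched in numpy. This is a simplified stand-in: a moving-average trend replaces SFC-DAE's learned decomposition, and the slow feature-constrained loss is not shown.

```python
import numpy as np

def decompose(x, window):
    """Split a 1-D series into a moving-average trend and the cyclic
    remainder (window must be odd so the output aligns with the input)."""
    pad = window // 2
    xp = np.pad(x, pad, mode='edge')
    trend = np.convolve(xp, np.ones(window) / window, mode='valid')
    return trend, x - trend

def anomaly_scores(x, reconstruction):
    """Pointwise reconstruction error: large values both detect and
    localize anomalies, as in a reconstruction-error strategy."""
    return (x - reconstruction) ** 2
```

On a series with a long-term trend plus a cycle, the decomposition is lossless by construction, and an injected spike is localized at the index with the largest error.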
Deep convolutional neural networks (DCNNs) are widely used in content-based image retrieval (CBIR) because of their advantages in image feature extraction. However, training deep neural networks requires a large amount of labeled data, which limits their application. Self-supervised learning is a more general approach in unlabeled scenarios. A method of fine-tuning feature extraction networks based on masked learning is proposed: masked autoencoders (MAE) are used to fine-tune the vision transformer (ViT) model. In addition, a scheme for extracting image descriptors is presented. The encoder of the MAE uses the ViT to extract global features and performs self-supervised fine-tuning by reconstructing masked-area pixels. The method works well on category-level image retrieval datasets, with marked improvements on instance-level datasets. On the instance-level datasets Oxford5k and Paris6k, the retrieval accuracy of the base model is improved by 7% and 17%, respectively, compared to the original model.
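Once a global descriptor has been extracted per image, retrieval itself reduces to cosine-similarity ranking. A minimal sketch, assuming descriptors are already computed (e.g. a ViT global feature); function names are illustrative:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """L2-normalize descriptors along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def rank_gallery(query, gallery):
    """Rank gallery images by cosine similarity between the query's
    global descriptor and each gallery descriptor."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    return np.argsort(-sims), sims
```

Because descriptors are normalized first, the dot product is exactly the cosine similarity, and a gallery item pointing in the same direction as the query ranks first regardless of magnitude.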
ISBN (print): 9789819985456; 9789819985463
Muscle atrophy is a widespread condition that can reduce quality of life and increase morbidity and mortality. Developing a noninvasive method to evaluate muscle atrophy is of great practical value; however, obtaining accurate evaluation criteria under noninvasive conditions is extremely difficult. This paper proposes a self-supervised temporal ultrasound reconstruction method based on a masked autoencoder to explore the dynamic process of muscle atrophy, with a score-position embedding designed to enable quantitative evaluation. Ultrasound images of the hind-limb muscles of six macaque monkeys were acquired consecutively during 38 days of head-down bed-rest experiments. Given an ultrasound image sequence, an asymmetric encoder-decoder structure reconstructs the randomly masked images in order to model the dynamic atrophy process. We demonstrate the feasibility of using the position indicator as a muscle atrophy score, which can be used to predict the degree of atrophy. This study achieves quantitative evaluation of muscle atrophy in the absence of accurate evaluation criteria.
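One plausible way to embed a scalar score as a "position" is the transformer-style sinusoidal encoding, with the atrophy score playing the role of a continuous position on the atrophy trajectory. This is an illustrative stand-in only: the paper's actual score-position embedding is not specified here, and the construction below is an assumption.

```python
import numpy as np

def score_position_embedding(score, dim):
    """Sinusoidal embedding of a scalar score, transformer-style:
    sin/cos pairs at geometrically spaced frequencies. The score acts
    as a continuous 'position' (hypothetical construction)."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000.0 ** (2.0 * i / dim))
    angles = score * freqs
    emb = np.empty(dim)
    emb[0::2] = np.sin(angles)
    emb[1::2] = np.cos(angles)
    return emb
```

Distinct scores map to distinct embeddings, and nearby scores map to nearby embeddings, which is what lets a decoder read the degree of atrophy back out of the position signal.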
ISBN (print): 9789819755745; 9789819755752
Anomaly detection on attributed networks has wide practical application in many domains, such as business and cybersecurity. Existing methods mainly rely on graph neural networks (GNNs) that aggregate information from neighbors to learn node representations for detecting anomalies. However, they may ignore information beyond the immediate neighbors, such as community associations. Furthermore, naively stacking multiple GNN layers may lead to the over-smoothing problem, making node representations more similar and anomalies indistinguishable. In this paper, we propose a novel method, named CARD, to tackle these issues. Specifically, we propose different augmentation strategies to offer diverse-scale information for CARD. Then, to better capture community associations, we establish a community-guided contrastive learning module that also captures structure information at different scales. To capture multiple kinds of attribute information and aid anomaly detection, we design an anomaly-aware masked autoencoder that effectively makes anomalies more distinguishable. Extensive experiments on nine datasets show the superiority of CARD. Our code is available at https://***/scu-kdde/OAM-CARD-2024.
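The intuition behind reconstruction-based anomaly scoring on an attributed graph can be shown with a deliberately simple stand-in for a masked autoencoder: predict each node's attributes from the mean of its neighbors', and score nodes by the residual. Attribute anomalies are exactly the nodes their neighborhood cannot explain. This sketch is not CARD's architecture, only the underlying principle.

```python
import numpy as np

def neighbor_reconstruction_scores(adj, attrs):
    """Score each node by how poorly its attributes are reconstructed
    from the mean of its neighbors' attributes.
    adj: (n, n) adjacency matrix, attrs: (n, d) node attributes."""
    deg = adj.sum(axis=1, keepdims=True)
    recon = adj @ attrs / np.maximum(deg, 1)   # mean over neighbors
    return np.linalg.norm(attrs - recon, axis=1)
```

On a small complete graph where one node's attributes are injected far from the rest, that node gets the largest score even though its neighbors' estimates of the normal nodes are also slightly polluted.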
ISBN (print): 9781450394086
This paper presents a novel approach to representation learning in recommender systems by integrating generative self-supervised learning (SSL) with a graph transformer architecture. We highlight the importance of high-quality data augmentation with relevant self-supervised pretext tasks for improving performance. Towards this end, we propose a new approach that automates the self-supervision augmentation process through rationale-aware generative SSL that distills informative user-item interaction patterns. The proposed recommender with Graph TransFormer (GFormer) offers parameterized collaborative rationale discovery for selective augmentation while preserving global-aware user-item relationships. In GFormer, the rationale-aware SSL inspires graph collaborative filtering with task-adaptive invariant rationalization in the graph transformer. The experimental results reveal that GFormer consistently improves performance over baselines on different datasets. Several in-depth experiments further investigate the invariant rationale-aware augmentation from various aspects. The source code for this work is publicly available at: https://***/HKUDS/GFormer.
ISBN (print): 9798350387780; 9798350387797
3D point clouds are widely used in robotics and autonomous driving systems. With the development of deep learning, an increasing number of models have been proposed for 3D point cloud processing tasks, including shape classification and 3D object detection. However, training these models requires large amounts of labeled data, which is expensive to obtain. Self-supervised learning methods for 3D point clouds, which train models on unlabeled data by designing pretext tasks, have therefore recently gained significant attention. This paper reviews these methods. Based on the pretext tasks they design, we divide the more than 30 existing methods into three categories: reconstruction-based, contrastive-based, and MAE (masked autoencoder)-based methods. We then introduce the research motivations, implementations, and characteristics of these methods one by one. Finally, two performance evaluation criteria are introduced, and the performance of each self-supervised learning method is assessed and analyzed against them.
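The input corruption used by the MAE-based category can be sketched in numpy: partition the cloud into local groups, then mask whole groups rather than individual points. The nearest-random-center grouping below is a simplification of the usual farthest-point-sampling + k-NN patch construction.

```python
import numpy as np

def mask_point_groups(points, n_groups, mask_ratio, rng):
    """Partition a point cloud into groups by nearest-random-center
    assignment, then mask a fraction of whole groups (simplified
    MAE-style corruption for point clouds).
    points: (N, 3). Returns visible points and a per-point mask."""
    centers = points[rng.choice(len(points), n_groups, replace=False)]
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    group = d.argmin(axis=1)                 # nearest-center assignment
    n_masked = int(round(n_groups * mask_ratio))
    masked_groups = rng.choice(n_groups, n_masked, replace=False)
    point_mask = np.isin(group, masked_groups)
    return points[~point_mask], point_mask
```

Masking whole local patches, rather than scattered points, is what makes the reconstruction pretext task non-trivial: the model must infer missing geometry from surrounding structure.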
ISBN (print): 9798350353006
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transfer while failing to capture the temporal correlation of affective cues in the video. Inspired by psychology research and empirical theory, we verify that the degree of emotion may vary across different segments of a video, and thus introduce the sentiment complementarity and emotion intrinsics among temporal segments. We propose an MAE-style method, termed MART, for learning robust affective representations of videos via masking. First, we extract the affective cues of the lexicon and verify the extracted cues by computing their matching scores with the video content, in terms of sentiment and emotion scores along the temporal dimension. Then, with the verified cues, we propose masked affective modeling to recover the temporal emotion distribution. We present temporal affective complementary learning that pulls the complementary parts and pushes the intrinsic parts of masked multimodal features, where the constraint is set with cross-modal attention among features to mask the video and recover the degree of emotion among segments. Extensive experiments on five benchmarks show the superiority of our method in video sentiment analysis, video emotion recognition, multimodal sentiment analysis, and multimodal emotion recognition.