With the rapid progress of deep learning techniques, convolutional neural network-based and transformer-based methods have yielded impressive performance on hyperspectral image (HSI) classification tasks. However, pixel-level manual annotation is time-consuming and laborious, and the small amount of labeled HSI data poses challenges for deep learning methods. Existing methods use carefully designed network architectures combined with self-supervised or semi-supervised learning to deal with the lack of training samples. However, those methods were designed for specific datasets and often require careful hyperparameter tuning on new datasets. To tackle this problem, a unified HSI masked autoencoder framework was proposed for HSI classification. Different from existing works, the hyperspectral image masked autoencoder (HSIMAE) framework was pretrained on a large-scale unlabeled HSI dataset, named HSIHybrid, which contains a large amount of HSI data acquired by different sensors. First, to handle the different spectral ranges of HSIs, a group-wise PCA was applied to extract features of HSI spectra and transform them into fixed-length vectors. Then, a modified masked autoencoder was proposed for large-scale pretraining; it utilizes separate spatial-spectral encoders followed by fusion blocks to learn the spatial and spectral correlations of HSI data. Finally, to leverage the unlabeled data of the target dataset, a dual-branch finetuning framework that uses an extra unlabeled branch for masked modeling was introduced. Extensive experiments were conducted on four HSI datasets from different hyperspectral sensors. The results demonstrate the superiority of the proposed HSIMAE framework over state-of-the-art methods, even with very few training samples.
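The group-wise PCA step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the HSIMAE implementation: it assumes the spectrum is split into contiguous band groups and each group is reduced with an independent PCA, so HSIs with different band counts map to the same fixed-length vector; the function name and group sizes are placeholders.

```python
# Minimal sketch of group-wise PCA for variable-band hyperspectral pixels.
# Hypothetical illustration; the actual HSIMAE grouping and dimensionality may differ.
import numpy as np
from sklearn.decomposition import PCA

def groupwise_pca(pixels, n_groups=8, dims_per_group=4):
    """Reduce (n_pixels, n_bands) spectra to a fixed (n_pixels, n_groups * dims_per_group) matrix."""
    n_pixels, n_bands = pixels.shape
    # Split the spectral axis into contiguous groups (sizes differ by at most one band).
    band_groups = np.array_split(np.arange(n_bands), n_groups)
    features = []
    for bands in band_groups:
        pca = PCA(n_components=dims_per_group)
        features.append(pca.fit_transform(pixels[:, bands]))
    return np.concatenate(features, axis=1)

# Two sensors with different band counts end up with the same feature length.
sensor_a = np.random.rand(1000, 224)
sensor_b = np.random.rand(1000, 103)
print(groupwise_pca(sensor_a).shape, groupwise_pca(sensor_b).shape)  # (1000, 32) (1000, 32)
```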
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is reaching its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose the Hierarchical Contrastive masked autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior art in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike previous methods, which focus exclusively on top-layer representations while neglecting explicit guidance for intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. First, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and bolster masked audio-visual reconstruction. Second, hierarchical cross-modal contrastive learning is also exerted on intermediate representations to progressively narrow the audio-visual modality gap and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods.
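The hierarchical cross-modal contrastive idea can be sketched as a layer-wise InfoNCE-style loss. This is a minimal sketch under the assumption that one paired audio/visual feature vector is available per intermediate layer; HiCMAE's actual loss weighting and layer selection are not reproduced here, and all names are placeholders.

```python
# Minimal sketch of hierarchical cross-modal contrastive learning: an InfoNCE-style loss
# applied at several intermediate layers, not only the top one. Hypothetical illustration.
import torch
import torch.nn.functional as F

def info_nce(audio_feat, visual_feat, temperature=0.07):
    """Symmetric contrastive loss between paired audio/visual features of shape (B, D)."""
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(visual_feat, dim=-1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(audio_layers, visual_layers):
    """Average the cross-modal loss over matched intermediate layers (lists of (B, D) tensors)."""
    losses = [info_nce(a, v) for a, v in zip(audio_layers, visual_layers)]
    return torch.stack(losses).mean()

# Toy usage with three intermediate layers and a batch of 8 paired clips.
audio_layers = [torch.randn(8, 256) for _ in range(3)]
visual_layers = [torch.randn(8, 256) for _ in range(3)]
print(hierarchical_contrastive_loss(audio_layers, visual_layers))
```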
Authors: Liu, Jiaming; Wu, Yue; Gong, Maoguo; Liu, Zhixiao; Miao, Qiguang; Ma, Wenping
Affiliations: Xidian Univ, Sch Comp Sci & Technol, Key Lab Collaborat Intelligence Syst, Minist Educ, Xian 710071, Peoples R China; Xidian Univ, Sch Elect Engn, Key Lab Collaborat Intelligence Syst, Minist Educ, Xian 710071, Peoples R China; Harbin Engn Univ, Yantai Res Inst, Yantai 264006, Peoples R China; Xidian Univ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Minist Educ, Xian 710071, Peoples R China
The masked autoencoder (MAE) is a widely used self-supervised learning method that has recently achieved great success in NLP and computer vision. However, the potential advantages of masked pre-training for point cloud understanding have not been fully explored. Preliminary work on MAE-based point clouds uses the Transformer architecture to explore low-level geometric representations in 3D space, which is insufficient for fine-grained decoding completion and downstream tasks. Inspired by multimodality, we propose Inter-MAE, an inter-modal MAE method for self-supervised learning on point clouds. Specifically, we first use Point-MAE as a baseline to partition point clouds into a random low percentage of visible point patches and a high percentage of masked point patches. Then, a standard Transformer-based autoencoder is built with an asymmetric design and shifting mask operations, and latent features are learned from the visible point patches with the aim of recovering the masked point patches. In addition, we generate image features with a ViT after point cloud rendering to form inter-modal contrastive learning with the decoded features of the completed point patches. Extensive experiments show that the proposed Inter-MAE generates pre-trained models that are effective and exhibit superior results on various downstream tasks. For example, an accuracy of 85.4% is achieved on ScanObjectNN and 86.3% on ShapeNetPart, outperforming other state-of-the-art self-supervised learning methods. Notably, our work establishes for the first time the feasibility of applying the image modality to masked point clouds.
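The visible/masked partition described above can be sketched as a random patch-masking routine. This is a minimal sketch assuming the point cloud has already been grouped into patch embeddings (the FPS/kNN grouping of Point-MAE is omitted), and the mask ratio and names are placeholders rather than the Inter-MAE settings.

```python
# Minimal sketch of random point-patch masking (Point-MAE-style). Hypothetical illustration.
import torch

def random_patch_mask(patch_tokens, mask_ratio=0.6):
    """Split (B, N, D) patch tokens into visible tokens and a boolean mask of masked patches."""
    B, N, D = patch_tokens.shape
    num_mask = int(N * mask_ratio)
    # Independent random permutation per sample; the first `num_mask` indices are masked.
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), ids_shuffle[:, :num_mask]] = True
    visible = patch_tokens[~mask].reshape(B, N - num_mask, D)
    return visible, mask

tokens = torch.randn(2, 64, 384)           # 2 clouds, 64 patches, 384-dim embeddings
visible, mask = random_patch_mask(tokens)  # only the visible tokens go through the encoder
print(visible.shape, mask.sum(dim=1))      # torch.Size([2, 26, 384]) tensor([38, 38])
```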
An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder-decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech on the task of reconstructing randomly masked audiovisual speech tokens, combined with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can subsequently be leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.
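The discretization step mentioned above can be sketched as a nearest-neighbour codebook lookup. The snippet is a minimal, hypothetical illustration of vector quantization in general (no commitment loss or codebook updates), not the VQ-MAE-AV implementation; the codebook size and feature shapes are placeholders.

```python
# Minimal sketch of the vector-quantization step that turns continuous features into
# discrete token indices via a nearest-neighbour codebook lookup. Hypothetical illustration.
import torch

def quantize(features, codebook):
    """features: (B, T, D) continuous frames; codebook: (K, D) embeddings.
    Returns (token_ids, quantized_features)."""
    B, T, D = features.shape
    flat = features.reshape(-1, D)                   # (B*T, D)
    dists = torch.cdist(flat, codebook)              # distance to every codebook entry, (B*T, K)
    token_ids = dists.argmin(dim=-1).reshape(B, T)   # discrete tokens fed to the masked autoencoder
    quantized = codebook[token_ids]                  # (B, T, D) quantized features
    return token_ids, quantized

codebook = torch.randn(512, 64)          # 512-entry codebook of 64-dim embeddings
audio_frames = torch.randn(4, 100, 64)   # 4 clips, 100 frames each
ids, q = quantize(audio_frames, codebook)
print(ids.shape, q.shape)                # torch.Size([4, 100]) torch.Size([4, 100, 64])
```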
This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial for learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust to input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
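The upper-band masking idea can be sketched on a spectrogram-like input: frequency bins above a cutoff are replaced by a learnable mask token before reconstruction. This is a minimal sketch of the concept under assumed shapes and names, not the Fre-Painter code; sampling the cutoff ratio per batch would mimic the mix-ratio strategy.

```python
# Minimal sketch of upper-band masking for audio super-resolution. Hypothetical illustration.
import torch
import torch.nn as nn

class UpperBandMasker(nn.Module):
    def __init__(self, n_freq_bins=128):
        super().__init__()
        # One learnable mask value per frequency bin, broadcast over time.
        self.mask_token = nn.Parameter(torch.zeros(n_freq_bins))

    def forward(self, spec, cutoff_ratio=0.5):
        """spec: (B, F, T) spectrogram; bins above cutoff_ratio * F are masked."""
        num_bins = spec.shape[1]
        cutoff = int(num_bins * cutoff_ratio)
        masked = spec.clone()
        masked[:, cutoff:, :] = self.mask_token[cutoff:].view(1, -1, 1)
        return masked

masker = UpperBandMasker(n_freq_bins=128)
spec = torch.randn(2, 128, 400)              # 2 utterances, 128 frequency bins, 400 frames
print(masker(spec, cutoff_ratio=0.5).shape)  # torch.Size([2, 128, 400])
```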
Wafer bin map (WBM) automatic classification is one of the critical challenges for semiconductor intelligent manufacturing. Many deep learning-based classification models have performed well in WBM classification, but all require a large amount of labeled data for training. Since real-world WBMs are highly complex and can be labeled correctly only by seasoned engineers, such requirements undermine the practical value of those methods. Several self-supervised learning methods have recently been proposed for WBMs to improve classification performance. However, they still require much labeled data for fine-tuning and are only adapted to binary WBMs with a single gross failure area. To address these limitations, this study introduces a self-supervised framework based on the masked autoencoder (MAE) for complex WBMs with mixed bin signatures and multiple gross failure area patterns. A patchMC encoder is proposed to improve the MAE's representation ability for complex WBMs with mixed bin signatures. Moreover, the pre-trained MAE encoder with a multi-label classifier, fine-tuned on labeled WBMs, enables few-shot classification of complex WBMs with multiple gross failure areas. Experimental validation of the proposed method is performed on a real-world complex WBM dataset from Intel Corporation. The results demonstrate that the proposed method can make good use of unlabeled WBMs and reduce the demand for labeled data to a few-shot level while guaranteeing a classification accuracy of more than 90%. Comparisons with other self-supervised learning methods show that MAE outperforms the existing self-supervised alternatives on WBM data.
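The fine-tuning stage described above can be sketched as a multi-label head on top of a pre-trained encoder, trained with binary cross-entropy so several failure patterns can co-occur on one wafer. This is a minimal, hypothetical sketch: the stand-in encoder below merely replaces the paper's patchMC encoder so the snippet runs, and all shapes and names are placeholders.

```python
# Minimal sketch of few-shot fine-tuning with a multi-label classifier on a pre-trained encoder.
# Hypothetical illustration, not the paper's implementation.
import torch
import torch.nn as nn

class MultiLabelWBMClassifier(nn.Module):
    def __init__(self, pretrained_encoder, embed_dim=256, num_defect_types=8):
        super().__init__()
        self.encoder = pretrained_encoder
        self.head = nn.Linear(embed_dim, num_defect_types)

    def forward(self, wafer_patches):
        tokens = self.encoder(wafer_patches)    # (B, N, D) patch embeddings
        return self.head(tokens.mean(dim=1))    # pooled logits, one per failure pattern

# Stand-in encoder so the sketch runs; in practice this is loaded from MAE pre-training.
encoder = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(16 * 16, 256))
model = MultiLabelWBMClassifier(encoder)
patches = torch.randn(4, 36, 16, 16)            # 4 wafer maps split into 36 patches of 16x16
labels = torch.randint(0, 2, (4, 8)).float()    # multi-hot labels: several patterns may co-occur
loss = nn.BCEWithLogitsLoss()(model(patches), labels)
print(loss)
```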
Deep learning methods have shown significant advantages in polarimetric synthetic aperture radar (PolSAR) image classification. However, their performance relies on a large amount of labeled data. To alleviate this problem, this paper proposes a PolSAR image classification method with a masked autoencoder based on Position prediction and Memory tokens (MAPM). First, MAPM designs a masked autoencoder (MAE) based on the transformer for pre-training, which boosts feature learning and improves classification results for a given number of labeled samples. Second, since the transformer is relatively insensitive to the order of the input tokens, a position prediction strategy is introduced in the encoder part of the MAE. It can effectively capture subtle differences and discriminate complex, blurry boundaries in PolSAR images. In the fine-tuning stage, the addition of learnable memory tokens can improve classification performance. In addition, an L1 loss is used for MAE optimization to enhance the robustness of the model to outliers in PolSAR data. Experimental results show the effectiveness and advantages of the proposed MAPM in PolSAR image classification. Specifically, MAPM achieves performance gains of about 1% in classification accuracy compared with existing methods.
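The two training signals mentioned above can be sketched as an L1 reconstruction term plus a position-prediction head that classifies each token's original grid index. This is a minimal, hypothetical sketch with placeholder shapes and an arbitrary loss weight, not the MAPM implementation.

```python
# Minimal sketch of L1 patch reconstruction combined with a position-prediction objective.
# Hypothetical illustration.
import torch
import torch.nn as nn

num_tokens, dim = 49, 128                 # e.g. a 7x7 patch grid from a PolSAR window
pos_head = nn.Linear(dim, num_tokens)     # predicts which grid position a token came from
recon_head = nn.Linear(dim, 9 * 2)        # reconstructs a flattened 3x3 patch with 2 channels

tokens = torch.randn(4, num_tokens, dim)                 # encoder outputs
true_positions = torch.arange(num_tokens).repeat(4, 1)   # (4, 49) ground-truth indices
target_patches = torch.randn(4, num_tokens, 9 * 2)       # ground-truth patch values

pos_loss = nn.CrossEntropyLoss()(pos_head(tokens).transpose(1, 2), true_positions)
recon_loss = nn.L1Loss()(recon_head(tokens), target_patches)  # L1 is robust to outliers
loss = recon_loss + 0.1 * pos_loss        # the 0.1 weight is an arbitrary placeholder
print(loss)
```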
Authors: He, Yuan; Hu, Guyue; Yu, Shan
Affiliations: Chinese Acad Sci, Inst Automat, Lab Brain Atlas & Brain Inspired Intelligence, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; Anhui Univ, Sch Artificial Intelligence, Hefei 230039, Peoples R China; Chinese Acad Sci, State Key Lab Brain Cognit & Brain Inspired Intell, Beijing 100049, Peoples R China; Univ Chinese Acad Sci, Sch Future Technol, Beijing 100049, Peoples R China
The masked autoencoder (MAE) has shown remarkable potential in self-supervised representation learning for 3D point clouds. However, existing MAE-based methods primarily rely on point-level or low-level feature reconstruction, forcing the model to focus on local regions while lacking sufficient global discriminability in the feature representation. Moreover, conventional masking strategies randomly mask some point patches, thereby neglecting the semantic structure of the point cloud and hindering a holistic understanding of global information and geometric structures. To address these challenges, we propose a Contrastive Semantic-aware masked autoencoder (Point-CSMAE), which is equipped with a semantic-aware masking (SAM) strategy and a contrastive regularization (CR) mechanism. Specifically, the semantic-aware masking strategy adaptively selects patches with richer semantic information for masking and reconstruction, enhancing the understanding of global geometric structure. Furthermore, the contrastive regularization mechanism adaptively aligns the global information between the masked and visible parts, thus improving the learned global semantic representation. Meanwhile, the CR mechanism assists the SAM strategy with effective global semantic representations. Extensive experiments on various downstream tasks, including shape classification, few-shot classification, and part segmentation, demonstrate the superiority of the proposed approach.
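The semantic-aware masking strategy can be sketched as score-driven sampling: patches with higher semantic scores are masked with higher probability instead of uniformly at random. This is a minimal sketch under the assumption that a per-patch score (e.g. from attention maps) is already available; Point-CSMAE's actual scoring and selection rule may differ, and all names are placeholders.

```python
# Minimal sketch of semantic-aware masking driven by per-patch scores. Hypothetical illustration.
import torch

def semantic_aware_mask(semantic_scores, mask_ratio=0.6):
    """semantic_scores: (B, N) per-patch scores. Returns a (B, N) boolean mask (True = masked)."""
    B, N = semantic_scores.shape
    num_mask = int(N * mask_ratio)
    probs = torch.softmax(semantic_scores, dim=-1)
    # Sample patches to mask without replacement, biased toward high-score patches.
    masked_ids = torch.multinomial(probs, num_mask, replacement=False)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), masked_ids] = True
    return mask

scores = torch.rand(2, 64)          # e.g. the mean attention each of the 64 patches receives
mask = semantic_aware_mask(scores)
print(mask.shape, mask.sum(dim=1))  # torch.Size([2, 64]) tensor([38, 38])
```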
Cancer survival prediction requires exploiting related multimodal information (e.g., pathological, clinical and genomic features, etc.), and it is even more challenging in clinical practice due to the incompleteness of patients' multimodal data. Furthermore, existing methods lack sufficient intra- and inter-modal interactions and suffer from significant performance degradation caused by missing modalities. This manuscript proposes a novel hybrid graph convolutional network, entitled HGCN, which is equipped with an online masked autoencoder paradigm for robust multimodal cancer survival prediction. In particular, we pioneer modeling a patient's multimodal data as flexible and interpretable multimodal graphs with modality-specific preprocessing. HGCN integrates the advantages of graph convolutional networks (GCNs) and a hypergraph convolutional network (HCN) through node message passing and a hyperedge mixing mechanism to facilitate intra-modal and inter-modal interactions between multimodal graphs. With HGCN, the potential for multimodal data to create more reliable predictions of a patient's survival risk is dramatically increased compared to prior methods. Most importantly, to compensate for missing patient modalities in clinical scenarios, we incorporate an online masked autoencoder paradigm into HGCN, which can effectively capture intrinsic dependence between modalities and seamlessly generate missing hyperedges for model inference. Extensive experiments and analysis on six cancer cohorts from TCGA show that our method significantly outperforms the state of the art in both complete and missing-modality settings. Our code is made available at https://***/lin-lcx/HGCN.
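The online masked-autoencoder idea can be sketched in its simplest form: one modality's embedding is dropped during training and reconstructed from the remaining modalities, so the same network can impute a truly missing modality at inference time. This is a minimal, hypothetical sketch using plain MLPs on pooled per-patient embeddings; it omits the graph and hypergraph machinery of HGCN, and all names are placeholders.

```python
# Minimal sketch of reconstructing a masked modality from the remaining ones.
# Hypothetical illustration, not the HGCN implementation.
import torch
import torch.nn as nn

dim = 128
recon_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# Pooled per-patient embeddings for three modalities (e.g. pathology, genomics, clinical).
patho, geno, clin = torch.randn(8, dim), torch.randn(8, dim), torch.randn(8, dim)

# Training step: pretend genomics is missing and reconstruct it from the other two modalities.
recon_geno = recon_net(torch.cat([patho, clin], dim=-1))
recon_loss = nn.MSELoss()(recon_geno, geno)

# Inference with genomics truly missing: the reconstruction stands in for the absent modality.
geno_hat = recon_net(torch.cat([patho, clin], dim=-1))
print(recon_loss, geno_hat.shape)
```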
Wafer Map Pattern Recognition (WMPR) is a critical aspect of semiconductor manufacturing. It indicates how to improve manufacturing yields as we probe into the failure issues of the processes. In the literature, researchers often use balanced datasets with ample data points to address WMPR tasks; however, in real-world manufacturing, novel defects often emerge with few previous observations. Unfortunately, efforts to solve WMPR problems in few-shot scenarios remain scarce. To bridge this gap, we define a new task, Few-Shot Wafer Map Pattern Recognition (FSWMPR), which attempts to learn a classifier that distinguishes unseen classes with only a few labeled instances available. In such a task, expeditiously learning transferable feature embeddings is extremely challenging. In this paper, we propose an innovative two-stage strategy to wrestle with the problem of FSWMPR. In the first stage, we leverage a masked autoencoder to obtain efficacious representations of defect wafer map images by reconstructing pixel values of masked patches based on a smooth-L1 loss. In the second stage, we create a novel finetuning mechanism, the "Dynamic Multi-Loss Adaptation Mechanism", which utilizes three cooperative losses to accelerate fast feature transfer for few-shot scenarios. Surprisingly, even if the three losses are reduced to one comparative loss, we still achieve more competitive accuracy than meta-learning or finetuning methods; it is worth noting that our two stages involve no label information at all. Extensive experiments and analyses are conducted on the WM811K dataset. Compared with other algorithms, our method offers a fresh solution by creatively integrating a self-supervised masked autoencoder with a novel finetuning mechanism that is efficacious for FSWMPR.
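The first-stage objective described above can be sketched as a smooth-L1 (Huber) reconstruction loss computed only on the masked patches. This is a minimal, hypothetical sketch with placeholder shapes and names, not the paper's implementation; the second-stage multi-loss mechanism is not shown.

```python
# Minimal sketch of smooth-L1 reconstruction restricted to masked wafer-map patches.
# Hypothetical illustration.
import torch
import torch.nn.functional as F

def masked_smooth_l1(pred_patches, target_patches, mask):
    """pred/target: (B, N, P) per-patch pixel values; mask: (B, N) bool, True = masked patch."""
    per_patch = F.smooth_l1_loss(pred_patches, target_patches, reduction='none').mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)  # average over masked patches only

pred = torch.randn(4, 196, 256)      # decoder outputs for 196 patches of 16x16 pixels
target = torch.randn(4, 196, 256)    # original pixel values of the same patches
mask = torch.rand(4, 196) > 0.25     # roughly 75% of patches were masked during pre-training
print(masked_smooth_l1(pred, target, mask))
```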