With the rapid progress of deep learning techniques, convolutional neural network-based and transformer-based methods have yielded impressive performance on hyperspectral image (HSI) classification tasks. However, pixel-level manual annotation is time-consuming and laborious, and the small amount of labeled HSI data poses challenges for deep learning methods. Existing methods use carefully designed network architectures combined with self-supervised or semi-supervised learning to deal with the lack of training samples. However, those methods were designed for specific datasets and often require careful hyperparameter tuning on new datasets. To tackle this problem, a unified HSI masked autoencoder framework was proposed for HSI classification. Different from existing works, the hyperspectral image masked autoencoder (HSIMAE) framework was pretrained on a large-scale unlabeled HSI dataset, named HSIHybrid, which contains a large amount of HSI data acquired by different sensors. First, to handle the different spectral ranges of HSIs, a group-wise PCA was applied to extract features of HSI spectra and transform them into fixed-length vectors. Then, a modified masked autoencoder was proposed for large-scale pretraining; it utilizes separate spatial-spectral encoders followed by fusion blocks to learn the spatial and spectral correlations of HSI data. Finally, to leverage the unlabeled data of the target dataset, a dual-branch finetuning framework that uses an extra unlabeled branch for masked modeling was introduced. Extensive experiments were conducted on four HSI datasets from different hyperspectral sensors. The results demonstrate the superiority of the proposed HSIMAE framework over state-of-the-art methods, even with very few training samples.
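The group-wise PCA step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the HSIMAE implementation: it assumes the spectrum is split into contiguous band groups and each group is reduced with an independent PCA, so HSIs with different band counts map to the same fixed-length vector; the function name and group sizes are placeholders.

```python
# Minimal sketch of group-wise PCA for variable-band hyperspectral pixels.
# Hypothetical illustration; the actual HSIMAE grouping and dimensionality may differ.
import numpy as np
from sklearn.decomposition import PCA

def groupwise_pca(pixels, n_groups=8, dims_per_group=4):
    """Reduce (n_pixels, n_bands) spectra to a fixed (n_pixels, n_groups * dims_per_group) matrix."""
    n_pixels, n_bands = pixels.shape
    # Split the spectral axis into contiguous groups (sizes differ by at most one band).
    band_groups = np.array_split(np.arange(n_bands), n_groups)
    features = []
    for bands in band_groups:
        pca = PCA(n_components=dims_per_group)
        features.append(pca.fit_transform(pixels[:, bands]))
    return np.concatenate(features, axis=1)

# Two sensors with different band counts end up with the same feature length.
sensor_a = np.random.rand(1000, 224)
sensor_b = np.random.rand(1000, 103)
print(groupwise_pca(sensor_a).shape, groupwise_pca(sensor_b).shape)  # (1000, 32) (1000, 32)
```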
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is reaching its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose the Hierarchical Contrastive masked autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior art in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike previous methods, which focus exclusively on top-layer representations while neglecting explicit guidance for intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. First, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and bolster masked audio-visual reconstruction. Second, hierarchical cross-modal contrastive learning is also exerted on intermediate representations to progressively narrow the audio-visual modality gap and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods.
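The hierarchical cross-modal contrastive idea can be sketched as a layer-wise InfoNCE-style loss. This is a minimal sketch under the assumption that one paired audio/visual feature vector is available per intermediate layer; HiCMAE's actual loss weighting and layer selection are not reproduced here, and all names are placeholders.

```python
# Minimal sketch of hierarchical cross-modal contrastive learning: an InfoNCE-style loss
# applied at several intermediate layers, not only the top one. Hypothetical illustration.
import torch
import torch.nn.functional as F

def info_nce(audio_feat, visual_feat, temperature=0.07):
    """Symmetric contrastive loss between paired audio/visual features of shape (B, D)."""
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(visual_feat, dim=-1)
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(audio_layers, visual_layers):
    """Average the cross-modal loss over matched intermediate layers (lists of (B, D) tensors)."""
    losses = [info_nce(a, v) for a, v in zip(audio_layers, visual_layers)]
    return torch.stack(losses).mean()

# Toy usage with three intermediate layers and a batch of 8 paired clips.
audio_layers = [torch.randn(8, 256) for _ in range(3)]
visual_layers = [torch.randn(8, 256) for _ in range(3)]
print(hierarchical_contrastive_loss(audio_layers, visual_layers))
```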
Authors: Liu, Jiaming; Wu, Yue; Gong, Maoguo; Liu, Zhixiao; Miao, Qiguang; Ma, Wenping
Affiliations: Xidian Univ, Sch Comp Sci & Technol, Key Lab Collaborat Intelligence Syst, Minist Educ, Xian 710071, Peoples R China; Xidian Univ, Sch Elect Engn, Key Lab Collaborat Intelligence Syst, Minist Educ, Xian 710071, Peoples R China; Harbin Engn Univ, Yantai Res Inst, Yantai 264006, Peoples R China; Xidian Univ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Minist Educ, Xian 710071, Peoples R China
The masked autoencoder (MAE) is a widely used self-supervised learning method that has recently achieved great success in NLP and computer vision. However, the potential advantages of masked pre-training for point cloud understanding have not been fully explored. Preliminary work on MAE-based point clouds uses the Transformer architecture to explore low-level geometric representations in 3D space, which is insufficient for fine-grained decoding completion and downstream tasks. Inspired by multimodality, we propose Inter-MAE, an inter-modal MAE method for self-supervised learning on point clouds. Specifically, we first use Point-MAE as a baseline to partition point clouds into a random low percentage of visible point patches and a high percentage of masked point patches. Then, a standard Transformer-based autoencoder is built with an asymmetric design and shifting mask operations, and latent features are learned from the visible point patches with the aim of recovering the masked point patches. In addition, we generate image features with a ViT after point cloud rendering to form inter-modal contrastive learning with the decoded features of the completed point patches. Extensive experiments show that the proposed Inter-MAE generates pre-trained models that are effective and exhibit superior results on various downstream tasks. For example, an accuracy of 85.4% is achieved on ScanObjectNN and 86.3% on ShapeNetPart, outperforming other state-of-the-art self-supervised learning methods. Notably, our work establishes for the first time the feasibility of applying the image modality to masked point clouds.
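The visible/masked partition described above can be sketched as a random patch-masking routine. This is a minimal sketch assuming the point cloud has already been grouped into patch embeddings (the FPS/kNN grouping of Point-MAE is omitted), and the mask ratio and names are placeholders rather than the Inter-MAE settings.

```python
# Minimal sketch of random point-patch masking (Point-MAE-style). Hypothetical illustration.
import torch

def random_patch_mask(patch_tokens, mask_ratio=0.6):
    """Split (B, N, D) patch tokens into visible tokens and a boolean mask of masked patches."""
    B, N, D = patch_tokens.shape
    num_mask = int(N * mask_ratio)
    # Independent random permutation per sample; the first `num_mask` indices are masked.
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), ids_shuffle[:, :num_mask]] = True
    visible = patch_tokens[~mask].reshape(B, N - num_mask, D)
    return visible, mask

tokens = torch.randn(2, 64, 384)           # 2 clouds, 64 patches, 384-dim embeddings
visible, mask = random_patch_mask(tokens)  # only the visible tokens go through the encoder
print(visible.shape, mask.sum(dim=1))      # torch.Size([2, 26, 384]) tensor([38, 38])
```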
An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder-decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech on the task of reconstructing randomly masked audiovisual speech tokens, combined with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can subsequently be leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.
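The discretization step mentioned above can be sketched as a nearest-neighbour codebook lookup. The snippet is a minimal, hypothetical illustration of vector quantization in general (no commitment loss or codebook updates), not the VQ-MAE-AV implementation; the codebook size and feature shapes are placeholders.

```python
# Minimal sketch of the vector-quantization step that turns continuous features into
# discrete token indices via a nearest-neighbour codebook lookup. Hypothetical illustration.
import torch

def quantize(features, codebook):
    """features: (B, T, D) continuous frames; codebook: (K, D) embeddings.
    Returns (token_ids, quantized_features)."""
    B, T, D = features.shape
    flat = features.reshape(-1, D)                   # (B*T, D)
    dists = torch.cdist(flat, codebook)              # distance to every codebook entry, (B*T, K)
    token_ids = dists.argmin(dim=-1).reshape(B, T)   # discrete tokens fed to the masked autoencoder
    quantized = codebook[token_ids]                  # (B, T, D) quantized features
    return token_ids, quantized

codebook = torch.randn(512, 64)          # 512-entry codebook of 64-dim embeddings
audio_frames = torch.randn(4, 100, 64)   # 4 clips, 100 frames each
ids, q = quantize(audio_frames, codebook)
print(ids.shape, q.shape)                # torch.Size([4, 100]) torch.Size([4, 100, 64])
```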
This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial for learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust to input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
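The upper-band masking idea can be sketched on a spectrogram-like input: frequency bins above a cutoff are replaced by a learnable mask token before reconstruction. This is a minimal sketch of the concept under assumed shapes and names, not the Fre-Painter code; sampling the cutoff ratio per batch would mimic the mix-ratio strategy.

```python
# Minimal sketch of upper-band masking for audio super-resolution. Hypothetical illustration.
import torch
import torch.nn as nn

class UpperBandMasker(nn.Module):
    def __init__(self, n_freq_bins=128):
        super().__init__()
        # One learnable mask value per frequency bin, broadcast over time.
        self.mask_token = nn.Parameter(torch.zeros(n_freq_bins))

    def forward(self, spec, cutoff_ratio=0.5):
        """spec: (B, F, T) spectrogram; bins above cutoff_ratio * F are masked."""
        num_bins = spec.shape[1]
        cutoff = int(num_bins * cutoff_ratio)
        masked = spec.clone()
        masked[:, cutoff:, :] = self.mask_token[cutoff:].view(1, -1, 1)
        return masked

masker = UpperBandMasker(n_freq_bins=128)
spec = torch.randn(2, 128, 400)              # 2 utterances, 128 frequency bins, 400 frames
print(masker(spec, cutoff_ratio=0.5).shape)  # torch.Size([2, 128, 400])
```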
Wafer bin map (WBM) automatic classification is one of the critical challenges for semiconductor intelligent manufacturing. Many deep learning-based classification models have performed well in WBM classification, but all require a large amount of labeled data for training. Since real-world WBMs are highly complex and can be labeled correctly only by seasoned engineers, such requirements undermine the practical value of those methods. Several self-supervised learning methods have recently been proposed for WBMs to improve classification performance. However, they still require much labeled data for fine-tuning and are only adapted to binary WBMs with a single gross failure area. To address these limitations, this study introduces a self-supervised framework based on the masked autoencoder (MAE) for complex WBMs with mixed bin signatures and multiple gross failure area patterns. A patchMC encoder is proposed to improve the MAE's representation ability for complex WBMs with mixed bin signatures. Moreover, the pre-trained MAE encoder with a multi-label classifier, fine-tuned on labeled WBMs, enables few-shot classification of complex WBMs with multiple gross failure areas. Experimental validation of the proposed method is performed on a real-world complex WBM dataset from Intel Corporation. The results demonstrate that the proposed method can make good use of unlabeled WBMs and reduce the demand for labeled data to a few-shot level while guaranteeing a classification accuracy of more than 90%. Comparisons with other self-supervised learning methods show that MAE outperforms the existing self-supervised alternatives on WBM data.
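The fine-tuning stage described above can be sketched as a multi-label head on top of a pre-trained encoder, trained with binary cross-entropy so several failure patterns can co-occur on one wafer. This is a minimal, hypothetical sketch: the stand-in encoder below merely replaces the paper's patchMC encoder so the snippet runs, and all shapes and names are placeholders.

```python
# Minimal sketch of few-shot fine-tuning with a multi-label classifier on a pre-trained encoder.
# Hypothetical illustration, not the paper's implementation.
import torch
import torch.nn as nn

class MultiLabelWBMClassifier(nn.Module):
    def __init__(self, pretrained_encoder, embed_dim=256, num_defect_types=8):
        super().__init__()
        self.encoder = pretrained_encoder
        self.head = nn.Linear(embed_dim, num_defect_types)

    def forward(self, wafer_patches):
        tokens = self.encoder(wafer_patches)    # (B, N, D) patch embeddings
        return self.head(tokens.mean(dim=1))    # pooled logits, one per failure pattern

# Stand-in encoder so the sketch runs; in practice this is loaded from MAE pre-training.
encoder = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(16 * 16, 256))
model = MultiLabelWBMClassifier(encoder)
patches = torch.randn(4, 36, 16, 16)            # 4 wafer maps split into 36 patches of 16x16
labels = torch.randint(0, 2, (4, 8)).float()    # multi-hot labels: several patterns may co-occur
loss = nn.BCEWithLogitsLoss()(model(patches), labels)
print(loss)
```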
Deep learning methods have shown significant advantages in polarimetric synthetic aperture radar (PolSAR) image classification. However, their performance relies on a large amount of labeled data. To alleviate this problem, this paper proposes a PolSAR image classification method with a masked autoencoder based on Position prediction and Memory tokens (MAPM). First, MAPM designs a masked autoencoder (MAE) based on the transformer for pre-training, which boosts feature learning and improves classification results for a given number of labeled samples. Second, since the transformer is relatively insensitive to the order of the input tokens, a position prediction strategy is introduced in the encoder part of the MAE. It can effectively capture subtle differences and discriminate complex, blurry boundaries in PolSAR images. In the fine-tuning stage, the addition of learnable memory tokens can improve classification performance. In addition, an L1 loss is used for MAE optimization to enhance the robustness of the model to outliers in PolSAR data. Experimental results show the effectiveness and advantages of the proposed MAPM in PolSAR image classification. Specifically, MAPM achieves performance gains of about 1% in classification accuracy compared with existing methods.
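The two training signals mentioned above can be sketched as an L1 reconstruction term plus a position-prediction head that classifies each token's original grid index. This is a minimal, hypothetical sketch with placeholder shapes and an arbitrary loss weight, not the MAPM implementation.

```python
# Minimal sketch of L1 patch reconstruction combined with a position-prediction objective.
# Hypothetical illustration.
import torch
import torch.nn as nn

num_tokens, dim = 49, 128                 # e.g. a 7x7 patch grid from a PolSAR window
pos_head = nn.Linear(dim, num_tokens)     # predicts which grid position a token came from
recon_head = nn.Linear(dim, 9 * 2)        # reconstructs a flattened 3x3 patch with 2 channels

tokens = torch.randn(4, num_tokens, dim)                 # encoder outputs
true_positions = torch.arange(num_tokens).repeat(4, 1)   # (4, 49) ground-truth indices
target_patches = torch.randn(4, num_tokens, 9 * 2)       # ground-truth patch values

pos_loss = nn.CrossEntropyLoss()(pos_head(tokens).transpose(1, 2), true_positions)
recon_loss = nn.L1Loss()(recon_head(tokens), target_patches)  # L1 is robust to outliers
loss = recon_loss + 0.1 * pos_loss        # the 0.1 weight is an arbitrary placeholder
print(loss)
```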
Authors: He, Yuan; Hu, Guyue; Yu, Shan
Affiliations: Chinese Acad Sci, Inst Automat, Lab Brain Atlas & Brain Inspired Intelligence, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; Anhui Univ, Sch Artificial Intelligence, Hefei 230039, Peoples R China; Chinese Acad Sci, State Key Lab Brain Cognit & Brain Inspired Intell, Beijing 100049, Peoples R China; Univ Chinese Acad Sci, Sch Future Technol, Beijing 100049, Peoples R China
The masked autoencoder (MAE) has shown remarkable potential in self-supervised representation learning for 3D point clouds. However, existing MAE-based methods primarily rely on point-level or low-level feature reconstruction, forcing the model to focus on local regions while lacking sufficient global discriminability in the feature representation. Moreover, conventional masking strategies randomly mask some point patches, thereby neglecting the semantic structure of the point cloud and hindering a holistic understanding of global information and geometric structures. To address these challenges, we propose a Contrastive Semantic-aware masked autoencoder (Point-CSMAE), which is equipped with a semantic-aware masking (SAM) strategy and a contrastive regularization (CR) mechanism. Specifically, the semantic-aware masking strategy adaptively selects patches with richer semantic information for masking and reconstruction, enhancing the understanding of global geometric structure. Furthermore, the contrastive regularization mechanism adaptively aligns the global information between the masked and visible parts, thus improving the learned global semantic representation. Meanwhile, the CR mechanism assists the SAM strategy with effective global semantic representations. Extensive experiments on various downstream tasks, including shape classification, few-shot classification, and part segmentation, demonstrate the superiority of the proposed approach.
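The semantic-aware masking strategy can be sketched as score-driven sampling: patches with higher semantic scores are masked with higher probability instead of uniformly at random. This is a minimal sketch under the assumption that a per-patch score (e.g. from attention maps) is already available; Point-CSMAE's actual scoring and selection rule may differ, and all names are placeholders.

```python
# Minimal sketch of semantic-aware masking driven by per-patch scores. Hypothetical illustration.
import torch

def semantic_aware_mask(semantic_scores, mask_ratio=0.6):
    """semantic_scores: (B, N) per-patch scores. Returns a (B, N) boolean mask (True = masked)."""
    B, N = semantic_scores.shape
    num_mask = int(N * mask_ratio)
    probs = torch.softmax(semantic_scores, dim=-1)
    # Sample patches to mask without replacement, biased toward high-score patches.
    masked_ids = torch.multinomial(probs, num_mask, replacement=False)
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), masked_ids] = True
    return mask

scores = torch.rand(2, 64)          # e.g. the mean attention each of the 64 patches receives
mask = semantic_aware_mask(scores)
print(mask.shape, mask.sum(dim=1))  # torch.Size([2, 64]) tensor([38, 38])
```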
Cancer survival prediction requires exploiting related multimodal information (e.g., pathological, clinical and genomic features, etc.), and it is even more challenging in clinical practice due to the incompleteness of patients' multimodal data. Furthermore, existing methods lack sufficient intra- and inter-modal interactions and suffer from significant performance degradation caused by missing modalities. This manuscript proposes a novel hybrid graph convolutional network, entitled HGCN, which is equipped with an online masked autoencoder paradigm for robust multimodal cancer survival prediction. In particular, we pioneer modeling a patient's multimodal data as flexible and interpretable multimodal graphs with modality-specific preprocessing. HGCN integrates the advantages of graph convolutional networks (GCNs) and a hypergraph convolutional network (HCN) through node message passing and a hyperedge mixing mechanism to facilitate intra-modal and inter-modal interactions between multimodal graphs. With HGCN, the potential for multimodal data to create more reliable predictions of a patient's survival risk is dramatically increased compared to prior methods. Most importantly, to compensate for missing patient modalities in clinical scenarios, we incorporate an online masked autoencoder paradigm into HGCN, which can effectively capture intrinsic dependence between modalities and seamlessly generate missing hyperedges for model inference. Extensive experiments and analysis on six cancer cohorts from TCGA show that our method significantly outperforms the state of the art in both complete and missing-modality settings. Our code is made available at https://***/lin-lcx/HGCN.
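The online masked-autoencoder idea can be sketched in its simplest form: one modality's embedding is dropped during training and reconstructed from the remaining modalities, so the same network can impute a truly missing modality at inference time. This is a minimal, hypothetical sketch using plain MLPs on pooled per-patient embeddings; it omits the graph and hypergraph machinery of HGCN, and all names are placeholders.

```python
# Minimal sketch of reconstructing a masked modality from the remaining ones.
# Hypothetical illustration, not the HGCN implementation.
import torch
import torch.nn as nn

dim = 128
recon_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# Pooled per-patient embeddings for three modalities (e.g. pathology, genomics, clinical).
patho, geno, clin = torch.randn(8, dim), torch.randn(8, dim), torch.randn(8, dim)

# Training step: pretend genomics is missing and reconstruct it from the other two modalities.
recon_geno = recon_net(torch.cat([patho, clin], dim=-1))
recon_loss = nn.MSELoss()(recon_geno, geno)

# Inference with genomics truly missing: the reconstruction stands in for the absent modality.
geno_hat = recon_net(torch.cat([patho, clin], dim=-1))
print(recon_loss, geno_hat.shape)
```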
Wafer Map Pattern Recognition (WMPR) is a critical aspect of semiconductor manufacturing. It indicates how to improve manufacturing yields as we probe into the failure issues of the processes. In the literature, researchers often use balanced datasets with ample data points to address WMPR tasks; however, in real-world manufacturing, novel defects often emerge with few previous observations. Unfortunately, efforts to solve WMPR problems in few-shot scenarios remain scarce. To bridge this gap, we define a new task, Few-Shot Wafer Map Pattern Recognition (FSWMPR), which attempts to learn a classifier that distinguishes unseen classes with only a few labeled instances available. In such a task, expeditiously learning transferable feature embeddings is extremely challenging. In this paper, we propose an innovative two-stage strategy to wrestle with the problem of FSWMPR. In the first stage, we leverage a masked autoencoder to obtain efficacious representations of defect wafer map images by reconstructing pixel values of masked patches based on a smooth-L1 loss. In the second stage, we create a novel finetuning mechanism, the "Dynamic Multi-Loss Adaptation Mechanism", which utilizes three cooperative losses to accelerate fast feature transfer for few-shot scenarios. Surprisingly, even if the three losses are reduced to one comparative loss, we still achieve more competitive accuracy than meta-learning or finetuning methods; it is worth noting that our two stages involve no label information at all. Extensive experiments and analyses are conducted on the WM811K dataset. Compared with other algorithms, our method offers a fresh solution by creatively integrating a self-supervised masked autoencoder with a novel finetuning mechanism that is efficacious for FSWMPR.
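The first-stage objective described above can be sketched as a smooth-L1 (Huber) reconstruction loss computed only on the masked patches. This is a minimal, hypothetical sketch with placeholder shapes and names, not the paper's implementation; the second-stage multi-loss mechanism is not shown.

```python
# Minimal sketch of smooth-L1 reconstruction restricted to masked wafer-map patches.
# Hypothetical illustration.
import torch
import torch.nn.functional as F

def masked_smooth_l1(pred_patches, target_patches, mask):
    """pred/target: (B, N, P) per-patch pixel values; mask: (B, N) bool, True = masked patch."""
    per_patch = F.smooth_l1_loss(pred_patches, target_patches, reduction='none').mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)  # average over masked patches only

pred = torch.randn(4, 196, 256)      # decoder outputs for 196 patches of 16x16 pixels
target = torch.randn(4, 196, 256)    # original pixel values of the same patches
mask = torch.rand(4, 196) > 0.25     # roughly 75% of patches were masked during pre-training
print(masked_smooth_l1(pred, target, mask))
```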