Existing speaker verification (SV) systems mainly consist of a frontend deep embedding network pre-trained for speaker identification (SID), followed by a backend network fine-tuned to provide a similarity measure. Despite their success, performance may degrade markedly under domain mismatch. In this paper, we present a novel SV framework based on a dual-branch prototypical masked autoencoder (DB-PMAE). Specifically, teacher and student branches with siamese encoders are pre-trained to jointly learn patch-level features and prototypes. A multi-task learning framework is then used for fine-tuning on the SID and SV tasks, where similarity is measured by finding local correspondences between patch-level features to improve domain robustness. Experiments on the CNCeleb corpus demonstrate the superiority of DB-PMAE.
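The abstract does not specify how the local-correspondence similarity is computed; the sketch below is a minimal illustration under assumptions of my own: patch embeddings of shape (num_patches, dim) for the enrollment and test utterances, cosine similarity between patches, and max-then-mean pooling over correspondences. It is not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def local_correspondence_score(enroll_patches: torch.Tensor,
                               test_patches: torch.Tensor) -> torch.Tensor:
    """Hypothetical patch-level similarity: for each enrollment patch,
    find its best-matching test patch and average the matches.

    enroll_patches: (P, D) patch embeddings of the enrollment utterance
    test_patches:   (Q, D) patch embeddings of the test utterance
    """
    e = F.normalize(enroll_patches, dim=-1)   # unit-norm patch features
    t = F.normalize(test_patches, dim=-1)
    sim = e @ t.T                             # (P, Q) cosine similarities
    best_match = sim.max(dim=1).values        # best correspondence per enrollment patch
    return best_match.mean()                  # scalar verification score

# toy usage with random embeddings
score = local_correspondence_score(torch.randn(32, 256), torch.randn(40, 256))
```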
ISBN: 9798350302615 (Print)
Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
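As a rough illustration of the "masked autoencoder over discrete VQ-VAE tokens" idea, the sketch below masks a fraction of pre-computed codebook indices and trains a small Transformer to predict them with cross-entropy. The codebook size, mask ratio, model depth, and the absence of positional encodings are simplifying assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskedTokenMAE(nn.Module):
    """Toy masked autoencoder over discrete VQ-VAE token indices."""
    def __init__(self, codebook_size=512, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Embedding(codebook_size + 1, dim)   # last id acts as [MASK]
        self.mask_id = codebook_size
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, tokens):                               # tokens: (B, T) int64 VQ indices
        B, T = tokens.shape
        n_mask = max(1, int(T * self.mask_ratio))
        idx = torch.rand(B, T).argsort(dim=1)[:, :n_mask]    # random positions to hide
        corrupted = tokens.clone()
        corrupted.scatter_(1, idx, self.mask_id)              # replace with [MASK]
        logits = self.head(self.encoder(self.embed(corrupted)))
        masked_logits = logits.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, logits.size(-1))).transpose(1, 2)
        loss = nn.functional.cross_entropy(masked_logits, tokens.gather(1, idx))
        return loss                                           # predict only masked tokens
```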
ISBN: 9781450394086 (Print)
While powerful neural network architectures (e.g., Transformers, Graph Neural Networks) have improved sequential recommendation through high-order item dependency modeling, they may suffer from poor representation capability in label-scarcity scenarios. To address the issue of insufficient labels, Contrastive Learning (CL) has attracted much attention in recent methods, which perform data augmentation through embedding contrasting for self-supervision. However, owing to their hand-crafted contrastive view generation strategies, existing CL-enhanced models (i) can hardly yield consistent performance on diverse sequential recommendation tasks and (ii) may not be immune to noise in user behavior data. In light of this, we propose a simple yet effective Graph masked autoencoder-enhanced sequential Recommender system (MAERec) that adaptively and dynamically distills global item-transition information for self-supervised augmentation. This naturally avoids the heavy reliance on constructing high-quality contrastive embedding views. Instead, an adaptive data reconstruction paradigm is integrated with long-range item dependency modeling to provide informative augmentation for sequential recommendation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines and learns more accurate representations under data noise and sparsity. Our implementation is available at https://***/HKUDS/MAERec.
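To make the "mask and reconstruct item transitions" idea concrete, here is a minimal sketch: a subset of item-to-item transition edges is held out and the item embeddings are trained to reconstruct them with a dot-product decoder. The adaptive masking policy and the GNN encoder described in the abstract are omitted, and the uniform negative sampling is an illustrative choice of mine.

```python
import torch
import torch.nn as nn

def edge_mask_reconstruction_loss(item_emb: nn.Embedding,
                                  edges: torch.Tensor,        # (E, 2) item -> item transitions
                                  mask_ratio: float = 0.3) -> torch.Tensor:
    """Hold out a fraction of transition edges and reconstruct them."""
    E = edges.size(0)
    perm = torch.randperm(E)
    masked = edges[perm[: int(E * mask_ratio)]]                # held-out positive edges
    negatives = torch.randint(0, item_emb.num_embeddings, masked.shape)  # random negatives

    def score(pairs):
        src, dst = item_emb(pairs[:, 0]), item_emb(pairs[:, 1])
        return (src * dst).sum(dim=-1)                         # dot-product edge decoder

    pos, neg = score(masked), score(negatives)
    # logistic loss: push positive-edge scores up, negative-edge scores down
    return (nn.functional.softplus(-pos) + nn.functional.softplus(neg)).mean()
```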
ISBN: 9798400704314 (Print)
Group recommendation aims to suggest items that are suitable for a group of users. Although some powerful deep learning models have achieved improved performance, several aspects remain unexplored: (1) most existing models using contrastive learning rely on high-quality data augmentation, which requires precise contrastive view generation; (2) group recommendation involves multifaceted natural noise, and additional noise is introduced during data augmentation; (3) most existing hypergraph neural network-based models over-entangle the information of members and items, ignoring their unique characteristics. In light of this, we propose a highly effective Disentangled Hypergraph masked autoencoder-enhanced method for group recommendation (DHMAE), which combines a disentangled hypergraph neural network with a graph masked autoencoder. This approach creates self-supervised signals without data augmentation by masking the features of some nodes and hyperedges and then reconstructing them. To address the noise problem, we design a masking strategy that relies on pre-computed degree-sensitive probabilities when masking features. Furthermore, we propose a disentangled hypergraph neural network for group recommendation that extracts the common messages of members and items and disentangles them during the convolution process. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art models and effectively addresses the noise issue.
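A minimal sketch of degree-sensitive masking is given below: nodes are selected for masking with probability proportional to their pre-computed degree. The abstract only states that the probabilities are degree-sensitive, so the choice to mask high-degree nodes more often, and the mask ratio, are assumptions.

```python
import torch

def degree_sensitive_mask(degrees: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """Pick nodes to mask with probability proportional to their degree.

    degrees: (N,) pre-computed node degrees in the (hyper)graph.
    Returns a boolean mask of shape (N,) with ~mask_ratio * N entries set to True.
    Degree-proportional sampling is an assumed instantiation of the
    'degree-sensitive probabilities' mentioned in the abstract.
    """
    probs = degrees.float()
    probs = probs / probs.sum()                                  # normalize to a distribution
    n_mask = max(1, int(mask_ratio * degrees.numel()))
    idx = torch.multinomial(probs, n_mask, replacement=False)    # sample without replacement
    mask = torch.zeros(degrees.numel(), dtype=torch.bool)
    mask[idx] = True
    return mask
```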
ISBN: 9798400707650 (Print)
Malicious traffic classification is crucial for Intrusion Detection Systems (IDS). However, traditional machine learning approaches require expert knowledge and a significant amount of well-labeled data. Although recent studies have employed pre-training models from the Natural Language Processing domain, such as ET-BERT, for traffic classification, their effectiveness is impeded by a limited input length and fixed Byte Pair Encoding. To address these challenges, this paper presents Flow-MAE, a pre-training model that employs masked autoencoders (MAE) from the Computer Vision domain to achieve accurate, efficient, and robust malicious network traffic classification. Flow-MAE overcomes these challenges by using bursts (a generic representation of network traffic) together with patch embedding to accommodate long traffic sequences. Moreover, Flow-MAE introduces a self-supervised pre-training task, the Masked Patch Model, which captures unbiased representations from bursts of varying lengths and patterns. Experimental results on six datasets show that Flow-MAE achieves new state-of-the-art accuracy (>0.99), efficiency (>900 samples/s), and robustness across diverse network traffic types. Compared with the state-of-the-art ET-BERT, Flow-MAE improves accuracy by 0.41%-1.93% and speed by 7.8x-10.3x, while requiring only 0.2% of the FLOPs and 44% of the memory overhead. The efficacy of the core designs is validated through few-shot learning and ablation experiments. The code is publicly available at https://***/NLear/Flow-MAE.
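The following sketch illustrates one plausible burst-to-patch pipeline: a burst's byte sequence is padded, cut into fixed-size patches, linearly embedded, and a high fraction of patches is dropped MAE-style before encoding. The patch size, mask ratio, and embedding dimension are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

PATCH_SIZE = 64          # bytes per patch (assumed)
MASK_RATIO = 0.75        # MAE-style high masking ratio (assumed)

def patchify_burst(burst_bytes: torch.Tensor) -> torch.Tensor:
    """burst_bytes: (L,) uint8 tensor; returns (num_patches, PATCH_SIZE) floats in [0, 1]."""
    pad = (-burst_bytes.numel()) % PATCH_SIZE
    padded = torch.cat([burst_bytes, burst_bytes.new_zeros(pad)])
    return padded.view(-1, PATCH_SIZE).float() / 255.0

patch_proj = nn.Linear(PATCH_SIZE, 256)         # patch embedding layer (assumed width)

def embed_and_mask(burst_bytes: torch.Tensor):
    patches = patchify_burst(burst_bytes)
    tokens = patch_proj(patches)                 # (N, 256) patch embeddings
    n_keep = max(1, int(patches.size(0) * (1 - MASK_RATIO)))
    keep = torch.randperm(patches.size(0))[:n_keep]
    # visible tokens go to the encoder; the full patch set serves as reconstruction target
    return tokens[keep], keep, patches
```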
ISBN: 9798350342734 (Print)
Learning high-quality video representations has significant applications in computer vision and remains challenging. Previous work based on masked autoencoders, such as ImageMAE [10] and VideoMAE [23], has proven the effectiveness of learning image and video representations through a reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly when extracting features solely from the visual modality is difficult, such as for low-resolution and blurry videos. Motivated by this, we propose AV-MaskEnhancer, which learns high-quality video representations by combining visual and audio information. Our approach addresses the challenge by exploiting the complementary nature of audio and video features in cross-modal content. On the video classification task on the UCF101 [21] dataset, our method outperforms existing work and reaches the state of the art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.
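The abstract does not describe the fusion mechanism; as one possible reading, the sketch below simply concatenates video-patch tokens and audio-spectrogram tokens (already projected to a shared dimension, which is itself an assumption) and encodes them jointly with a single Transformer. This is an illustrative guess, not the AV-MaskEnhancer architecture.

```python
import torch
import torch.nn as nn

dim = 256
layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
joint_encoder = nn.TransformerEncoder(layer, num_layers=4)

def encode_av(video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
    """video_tokens: (B, Nv, dim), audio_tokens: (B, Na, dim).
    Concatenate along the sequence axis so attention can mix modalities."""
    tokens = torch.cat([video_tokens, audio_tokens], dim=1)
    return joint_encoder(tokens)                 # (B, Nv + Na, dim) fused representations
```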
ISBN: 9798350344868; 9798350344851 (Print)
In this paper, we present a novel general-purpose audio representation learning method named Dual-Path masked autoencoder (DP-MAE) for the anomalous sound detection (ASD) task. Existing methods mainly focus on frame-level generative methods or clip-level discriminative methods, which generally ignore the local information where anomalies are usually easier to find. Moreover, they apply multiple systems to a single ASD task, which limits generalizability. To tackle this, our method extracts patch-level features through self-supervised representation learning, yielding a unified audio representation that generalizes well and models the local information that helps detect anomalies under domain shifts; it further optimizes the informativeness of clip-level representations during fine-tuning. Concretely, the input spectrograms are randomly split into two patch-level subsets, which are fed into DP-MAE to predict each other. Meanwhile, the output of one path is also used as the prediction target of the other path, providing regularization from a self-distillation perspective. In the fine-tuning stage, a linear classifier is applied to the encoder features to obtain a more compact representation of normal sounds. Experiments on the DCASE 2022 Challenge Task 2 development dataset show the effectiveness of our method.
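Below is a minimal sketch of the dual-path objective: the patches are split into two disjoint subsets, each path tries to predict the other subset, and a self-distillation term pulls the two paths' outputs together. The MSE losses, the stop-gradient on the distillation targets, and the loss weight are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def split_patches(patches: torch.Tensor):
    """Randomly split spectrogram patches (N, D) into two disjoint subsets."""
    perm = torch.randperm(patches.size(0))
    half = patches.size(0) // 2
    return patches[perm[:half]], patches[perm[half:]]

def dual_path_loss(pred_a_from_b, target_a, pred_b_from_a, target_b,
                   out_a, out_b, distill_weight=0.1):
    """Cross-prediction of the two subsets plus a self-distillation term."""
    recon = F.mse_loss(pred_a_from_b, target_a) + F.mse_loss(pred_b_from_a, target_b)
    distill = F.mse_loss(out_a, out_b.detach()) + F.mse_loss(out_b, out_a.detach())
    return recon + distill_weight * distill
```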
ISBN: 9789819916443; 9789819916450 (Print)
Generally speaking, abnormal images are distinguished from normal images in terms of content or semantics. Image anomaly detection is the task of identifying anomalous images that deviate from normal ones. Reconstruction-based methods detect anomalies using the difference between the original image and its reconstruction, assuming that the model will be unable to properly reconstruct anomalous images. In practice, however, anomalous regions are often reconstructed well due to the network's generalization ability. Recent methods reduce this effect by turning the generative task into an inpainting problem: by conditioning on the neighborhood of the masked part, small anomalies do not contribute to the reconstructed image. However, it is hard to reconstruct the masked regions when the neighborhood contains much anomalous information. We argue that inpainting should draw on more global information from the image. Inspired by the masked autoencoder (MAE), we propose a new anomaly detection method called MAE-AD. Its architecture can learn global information about the image and avoids being affected by large anomalous regions. We evaluate our method on the MVTec AD dataset, and the results outperform the previous inpainting-based approach. Compared with methods that use pre-trained models, MAE-AD also achieves competitive performance.
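The scoring step of reconstruction-based detection can be summarized as below: compare the input with its reconstruction and reduce the residual to an image-level score. Using the per-pixel squared error and taking its maximum are illustrative choices, not necessarily the scoring used in MAE-AD.

```python
import torch

def anomaly_score(original: torch.Tensor, reconstruction: torch.Tensor) -> torch.Tensor:
    """Image-level anomaly score from the reconstruction residual.

    original, reconstruction: (C, H, W) tensors in the same value range.
    """
    residual = (original - reconstruction) ** 2   # per-pixel squared error
    per_pixel = residual.mean(dim=0)              # (H, W) anomaly map over channels
    return per_pixel.max()                        # image-level score: worst pixel
```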
ISBN: 9789819985425; 9789819985432 (Print)
Predictor-based Neural Architecture Search (NAS) offers a promising solution for enhancing the efficiency of traditional NAS methods. However, it is non-trivial to train the predictor with limited architecture evaluations for efficient NAS. While current approaches typically focus on better utilizing the labeled architectures, the valuable knowledge contained in unlabeled data remains unexplored. In this paper, we propose a self-supervised transformer-based model that effectively leverages unlabeled data to learn meaningful representations of neural architectures, reducing the reliance on labeled data to train a high-performance predictor. Specifically, the predictor is pre-trained with a masking strategy to reconstruct input features in both latent and raw data spaces. To further enhance its representative capability, we introduce a multi-head attention-masking mechanism that guides the model to attend to different representation subspaces from both explicit and implicit perspectives. Extensive experimental results on NAS-Bench-101, NAS-Bench-201 and NAS-Bench-301 demonstrate that our predictor requires less labeled data and achieves superior performance compared to existing predictors. Furthermore, when combined with search strategies, our predictor exhibits promising capability in discovering high-quality architectures.
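A minimal sketch of the masked pre-training objective follows: a fraction of an architecture's operation features is hidden, and reconstruction is supervised in both the raw encoding space and a latent feature space on the masked positions. The feature layout, the masking rate, and the weighted-sum combination of the two losses are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def mask_operation_features(op_feats: torch.Tensor, mask_ratio: float = 0.3):
    """Randomly zero out a fraction of an architecture's operation features.

    op_feats: (N_ops, D) one-hot or embedded operation descriptors (assumed encoding).
    Returns the corrupted features and a boolean mask of the hidden positions.
    """
    mask = torch.rand(op_feats.size(0)) < mask_ratio
    corrupted = op_feats.clone()
    corrupted[mask] = 0.0
    return corrupted, mask

def dual_space_loss(raw_pred, raw_target, latent_pred, latent_target, mask,
                    latent_weight=1.0):
    """Reconstruct masked positions in both the raw and latent spaces."""
    raw_loss = F.mse_loss(raw_pred[mask], raw_target[mask])
    latent_loss = F.mse_loss(latent_pred[mask], latent_target[mask])
    return raw_loss + latent_weight * latent_loss
```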
ISBN: 9798350344868; 9798350344851 (Print)
Despite advances in deep learning techniques, accurate identification with face recognition (FR) systems remains challenging owing to changes in face angle, poor lighting, and occlusions. To address these problems, we propose an optimized approach to improve the robustness of the feature extraction models used in FR systems. The proposed method leverages an angle-aware loss function, inspired by ArcFace, that provides a larger margin for significantly rotated faces. Additionally, pre-trained weights derived from a masked autoencoder are used to initialize the model, enhancing its ability to cope with various adverse conditions. Experimental results indicate that the proposed method outperforms existing face recognition methods in both normal and adverse environments.
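One way to read "a larger margin for significantly rotated faces" is an ArcFace-style additive angular margin that grows with the face's yaw; the sketch below implements that reading. The linear dependence on |yaw|/90, the margin values, and the scale are guesses of this sketch, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def angle_aware_arcface_logits(features, weights, labels, yaw_deg,
                               scale=64.0, base_margin=0.5, extra_margin=0.2):
    """ArcFace-style logits with a pose-dependent additive angular margin.

    features: (B, D) embeddings, weights: (C, D) class centers, labels: (B,) int64,
    yaw_deg: (B,) absolute yaw angles of the faces in degrees (hypothetical input).
    """
    cos = F.normalize(features) @ F.normalize(weights).T            # (B, C) cosines
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    margin = base_margin + extra_margin * (yaw_deg.abs() / 90.0)    # per-sample margin
    target = F.one_hot(labels, num_classes=weights.size(0)).bool()
    theta = torch.where(target, theta + margin.unsqueeze(1), theta)  # margin on true class only
    return scale * torch.cos(theta)   # feed to cross-entropy as classification logits
```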