Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range ...
详细信息
Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are finetuned from ImageNet pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated the research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.
Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low...
详细信息
ISBN:
(纸本)9798350349405;9798350349399
Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low-shot learning capability of these models, across several low-shot downstream tasks, has been largely under explored. We perform a system level study of different self supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling for their low-shot capabilities by comparing the pretrained models. In addition we also study the effects of collapse avoidance methods, namely centring, ME-MAX, sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework involving both mask image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks, including multi-class classification, multi-label classification and semantic segmentation. Furthermore, when testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
Vision transformers (ViTs) have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined local, or long range gl...
详细信息
ISBN:
(纸本)9781728198354
Vision transformers (ViTs) have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined local, or long range global. However, they are known to be data hungry and therefore often pretrained on large-scale datasets, e.g. JFT-300M or ImageNet. An ideal learning method would perform best regardless of the size of the dataset, a property lacked by current learning methods, with merely a few existing works studying ViTs with limited data. We propose group masked model learning (GMML), a self-supervised learning (SSL) method that is able to train ViTs and achieve state-of-the-art (SOTA) performance when pre-trained with limited data. The GMML uses the information conveyed by all concepts in the image. This is achieved by manipulating randomly groups of connected tokens, successively covering different meaningful parts of the image content, and then recovering the hidden information from the visible part of the concept. Unlike most of the existing SSL approaches, GMML does not require momentum encoder, nor relies on careful implementation details such as large batches and gradient stopping. Pretraining, finetuning, and evaluation codes are available under: https://***/GMML.
The availability of large scale data with high quality ground truth labels is a challenge when developing supervised machine learning solutions for healthcare domain. Although, the amount of digital data in clinical w...
详细信息
ISBN:
(纸本)9783031167607;9783031167591
The availability of large scale data with high quality ground truth labels is a challenge when developing supervised machine learning solutions for healthcare domain. Although, the amount of digital data in clinical workflows is increasing, most of this data is distributed on clinical sites and protected to ensure patient privacy. Radiological readings and dealing with large-scale clinical data puts a significant burden on the available resources, and this is where machine learning and artificial intelligence play a pivotal role. Magnetic Resonance Imaging (MRI) for musculoskeletal (MSK) diagnosis is one example where the scans have a wealth of information, but require a significant amount of time for reading and labeling. Self-supervised learning (SSL) can be a solution for handling the lack of availability of ground truth labels, but generally requires a large amount of training data during the pretraining stage. Herein, we propose a slice-based self-supervised deep learning framework (SB-SSL), a novel slice-based paradigm for classifying abnormality using knee MRI scans. We show that for a limited number of cases (<1000), our proposed framework is capable to identify anterior cruciate ligament tear with an accuracy of 89.17% and an AUC of 0.954, outperforming state-of-the-art without usage of external data during pretraining. This demonstrates that our proposed framework is suited for SSL in the limited data regime.
暂无评论