Search Results - Inner Mongolia University Library

32nd European Signal Processing Conference (EUSIPCO)

Authors: Ohta, Takezo; Bando, Yoshiaki; Imoto, Keisuke; Onishi, Masaki
Affiliations: Univ Tsukuba, Grad Sch Syst & Informat Engn, Tsukuba, Ibaraki, Japan; Natl Inst Adv Ind Sci & Technol, Tokyo, Japan; Doshisha Univ, Dept Informat Syst Design, Kyoto, Japan

ISBN: (print) 9789464593617; 9798331519773

In this paper, we propose an audio spectrogram transformer (AST) for sequential inference and evaluate its real-time performance. ASTs are pre-trained in a self-supervised manner, for example by masked autoencoding, and the pre-trained models perform well in sound event detection. However, the existing architectures are designed for offline inference, wherein the entire signal serves as the input, and are unsuitable for sequential inference as they require the input sequence to be split into short chunks. In this study, we design a sequential AST based on a memory token (MT-AST) and its training method and conduct comprehensive experiments regarding the chunk length configuration. Specifically, we extend the offline AST with special tokens that memorize past signal information so that the network avoids repetitive inference of the same signal. While our model has limited inference capability, we train it using knowledge distillation from BEATs, a large-scale pre-trained model. Compared to the offline architecture, our model achieved higher performance by pre-training with AudioSet and fine-tuning for the URBAN-SED and DESED datasets. In addition, we conducted experiments to investigate the input chunk length considering performance-latency trade-offs and revealed the optimal configurations. We revealed that our model requires at least one extra second of input to maintain the performance.

Keywords: Sound event detection; audio spectrogram transformer; sequential inference


FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

25th Interspeech Conference

Authors: Behera, Swarup Ranjan; Dhiman, Abhishek; Gowda, Karthik; Narayani, Aalekhya Satya
Affiliations: Reliance Jio AICoE, Hyderabad, India

Audio classification models, particularly the audio spectrogram transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we introduce FastAST, a framework that integrates Token Merging (ToMe) into the AST framework. FastAST enhances inference speed without requiring extensive retraining by merging similar tokens in audio spectrograms. Furthermore, during training, FastAST brings about significant speed improvements. The experiments indicate that FastAST can increase audio classification throughput with minimal impact on accuracy. To mitigate the accuracy impact, we integrate Cross-Model Knowledge Distillation (CMKD) into the FastAST framework. Integrating ToMe and CMKD into AST results in improved accuracy compared to AST while maintaining faster inference speeds. FastAST represents a step towards real-time, resource-efficient audio analysis.
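
The idea of merging similar tokens can be illustrated with a minimal sketch. It is a simplified bipartite matching in plain Python, not the FastAST or ToMe implementation: the function names, the parameter `r` (number of tokens to remove), and the unweighted averaging are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_tokens(tokens, r):
    """Remove up to r tokens by averaging them into their most similar partner.

    tokens: list of feature vectors (lists of floats).
    r: hypothetical parameter, the number of tokens to merge away.
    The output order is: unmerged tokens from set A, then all tokens of set B.
    """
    if len(tokens) < 2 or r <= 0:
        return [list(t) for t in tokens]
    # Split tokens into two alternating sets A and B.
    set_a, set_b = tokens[0::2], tokens[1::2]
    # For each token in A, find its most similar token in B.
    matches = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        matches.append((sims[j], i, j))
    # Only the r most similar pairs are merged.
    matches.sort(reverse=True)
    to_merge = matches[:r]
    merged_a = {i for _, i, _ in to_merge}
    # Accumulate merged tokens into B as sums plus counts, then average.
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for _, i, j in to_merge:
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged_a]
    merged_b = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    return kept_a + merged_b
```

Each call shortens the token sequence by at most `r`, so the later transformer layers process fewer tokens; this sketch does not attempt to reproduce how the paper schedules merging across layers.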

Keywords: audio spectrogram transformer; token merging; cross model knowledge distillation


Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification

Interspeech Conference

Authors: Bae, Sangmin; Kim, June-Woo; Cho, Won-Yang; Baek, Hyerim; Son, Soyoun; Lee, Byungjo; Ha, Changwan; Tae, Kyongpil; Kim, Sungnyun; Yun, Se-Young
Affiliations: KAIST AI, Daejeon, South Korea; Kyungpook Natl Univ, Dept AI, Daegu, South Korea; SmartSound, Seoul, South Korea; Dongguk Univ, Seoul, South Korea; MODULABS, Seoul, South Korea

Respiratory sound contains crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been a growing interest in contact-free medical care based on electronic stethoscopes. To this end, cutting-edge deep learning models have been developed to diagnose lung diseases; however, it is still challenging due to the scarcity of medical data. In this study, we demonstrate that a model pretrained on large-scale visual and audio datasets can be generalized to the respiratory sound classification task. In addition, we introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with audio spectrogram transformer (AST). We further propose a novel and effective Patch-Mix Contrastive Learning to distinguish the mixed representations in the latent space. Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by an improvement of 4.08%.
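
A rough sketch of patch-level mixing between two samples follows. It assumes patches are swapped position for position and that the fraction of swapped patches is set by a `mix_ratio` argument; both choices, and all names, are illustrative assumptions rather than the authors' implementation.

```python
import random

def patch_mix(patches_a, patches_b, mix_ratio, rng=None):
    """Replace a random subset of patches in sample A with patches from B.

    patches_a, patches_b: equal-length sequences of patches (any objects).
    mix_ratio: assumed fraction of positions taken from sample B.
    Returns the mixed sequence and a per-position flag that is True
    where the patch came from sample B.
    """
    if len(patches_a) != len(patches_b):
        raise ValueError("both samples must have the same number of patches")
    if not 0.0 <= mix_ratio <= 1.0:
        raise ValueError("mix_ratio must lie in [0, 1]")
    rng = rng or random.Random()
    n = len(patches_a)
    n_swap = round(mix_ratio * n)
    # Choose which positions are overwritten with patches from B.
    swap_idx = set(rng.sample(range(n), n_swap))
    mixed = [patches_b[i] if i in swap_idx else patches_a[i] for i in range(n)]
    from_b = [i in swap_idx for i in range(n)]
    return mixed, from_b
```

The returned flags record which sample each patch came from, which is the kind of information a contrastive objective over mixed representations would need.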

Keywords: Respiratory Sound Classification; ICBHI; audio spectrogram transformer; Patch-Mix Contrastive Learning


Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers

Interspeech Conference

Authors: Selvaraj, Nithish Muthuchamy; Guo, Xiaobao; Kong, Adams; Shen, Bingquan; Kot, Alex
Affiliations: Nanyang Technol Univ, Rapid Rich Object Search ROSE Lab, Singapore, Singapore; Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore; DSO Natl Labs, Singapore, Singapore

Efficient tuning of neural networks for continual learning with minimal computational resources remains a challenge. In this paper, we propose continual learning of audio classifiers with parameter- and compute-efficient audio spectrogram transformers (AST). To reduce the trainable parameters without performance degradation, we propose AST with Convolutional Adapter, which has less than 5% of the trainable parameters of full fine-tuning. To reduce the computational complexity of self-attention, we introduce a novel Frequency-Time factorized Attention (FTA) method that achieves competitive performance with only a factor of the computations. Finally, we formulate our method, called Adapter Incremental Continual Learning (AI-CL), as a combination of the parameter-efficient Convolutional Adapter and the compute-efficient FTA. Experiments on ESC-50, SpeechCommandsV2, and AudioVisual Event benchmarks show that our proposed method efficiently learns new tasks and prevents catastrophic forgetting. Code is available at https://***/NMS05/Adapter-Incremental-Continual-Learning-AST.
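
To give a feel for why factorizing attention along the frequency and time axes saves computation, the sketch below counts query-key pairs. It assumes that in a factorized scheme each token attends only to tokens sharing its frequency index or its time index; that reading, the function name, and the 12 x 101 patch grid used in the usage note are illustrative assumptions, not details from the abstract.

```python
def attention_pairs(n_freq, n_time):
    """Count query-key pairs for full vs. axis-factorized self-attention.

    n_freq, n_time: number of patches along the frequency and time axes.
    Returns (full_pairs, factorized_pairs).
    """
    n_tokens = n_freq * n_time
    # Full self-attention: every token attends to every token.
    full_pairs = n_tokens * n_tokens
    # Factorized (assumed): each token attends to its own frequency column
    # and its own time row; the token itself is counted once.
    factorized_pairs = n_tokens * (n_freq + n_time - 1)
    return full_pairs, factorized_pairs
```

For a hypothetical grid of 12 frequency patches by 101 time patches, `attention_pairs(12, 101)` gives 1,468,944 pairs for full attention against 135,744 under this factorized assumption, roughly an elevenfold reduction.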

Keywords: Continual Learning; audio spectrogram transformer; Adapter; Self-Attention


PARAMETER-EFFICIENT TRANSFER LEARNING OF AUDIO SPECTROGRAM TRANSFORMERS

34th International Workshop on Machine Learning for Signal Processing

Authors: Cappellazzo, Umberto; Falavigna, Daniele; Brutti, Alessio; Ravanelli, Mirco
Affiliations: Univ Trento, Trento, Italy; Fdn Bruno Kessler, Trento, Italy; Concordia Univ, Montreal, PQ, Canada

ISBN: (print) 9798350372267; 9798350372250

Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They train only a few extra parameters for each downstream task, without sacrificing performance, and dispense with the need to store a copy of the pre-trained model for each task. For audio classification tasks, the audio spectrogram transformer (AST) model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common PETL methods for the adaptation of the AST model to audio/speech tasks. Furthermore, we propose a new adapter design that exploits the convolution module of the Conformer model, leading to superior performance over the standard PETL approaches and surpassing or achieving performance parity with full fine-tuning by updating only 0.29% of the parameters. Finally, we provide ablation studies revealing that our proposed adapter: 1) proves to be effective in few-shot efficient transfer learning, 2) attains optimal results regardless of the amount of the allocated parameters, and 3) can be applied to other pre-trained models. Our code is available at https://***/umbertocappellazzo/PETL_AST.

Keywords: Parameter-Efficient Transfer Learning; audio spectrogram transformer; LoRA; Adapters; Depthwise Convolution


Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters

25th Interspeech Conference

Authors: Cappellazzo, Umberto; Falavigna, Daniele; Brutti, Alessio
Affiliations: Univ Trento, Trento, Italy; Fdn Bruno Kessler, Trento, Italy

Mixture of Experts (MoE) architectures have recently started burgeoning due to their ability to scale a model's capacity while keeping the computational cost affordable, leading to state-of-the-art results in numerous fields. While MoE has been mostly investigated for the pre-training stage, its use in parameter-efficient transfer learning (PETL) settings is under-explored. To narrow this gap, this paper attempts to demystify the use of MoE for PETL of audio spectrogram transformers to audio and speech downstream tasks. Specifically, we propose Soft Mixture of Adapters (Soft-MoA). It exploits adapters as the experts and, leveraging the recent Soft MoE method, it relies on a soft assignment between the input tokens and experts to keep the computational time limited. Extensive experiments across 4 benchmarks demonstrate that Soft-MoA outperforms the single adapter method and performs on par with the dense MoA counterpart. We finally present ablation studies on key elements of Soft-MoA. Our code is available at https://***/umbertocappellazzo/PETL_AST.
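
The soft assignment between tokens and experts can be sketched as follows. This is a minimal plain-Python illustration of Soft MoE style dispatch and combine weights with one slot per expert; the function names, the `slot_params` argument, and the use of arbitrary callables in place of trained adapters are assumptions for illustration, not the released PETL_AST code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe_layer(tokens, slot_params, experts):
    """Soft mixture layer with one slot per expert (assumed setup).

    tokens: m input vectors of dimension d.
    slot_params: s learnable vectors of dimension d, one per slot.
    experts: s callables mapping a d-vector to a d-vector; in Soft-MoA
             these would be adapters, here any function is accepted.
    """
    m, s, d = len(tokens), len(slot_params), len(tokens[0])
    # Token-slot affinity logits, shape m x s.
    logits = [[sum(x * p for x, p in zip(tok, phi)) for phi in slot_params]
              for tok in tokens]
    # Dispatch weights: softmax over tokens for each slot (column-wise).
    dispatch = [softmax([logits[i][j] for i in range(m)]) for j in range(s)]
    # Each slot input is a convex combination of all tokens.
    slot_in = [[sum(dispatch[j][i] * tokens[i][k] for i in range(m))
                for k in range(d)] for j in range(s)]
    # Each expert processes only its own slot, not every token.
    slot_out = [experts[j](slot_in[j]) for j in range(s)]
    # Combine weights: softmax over slots for each token (row-wise).
    combine = [softmax(row) for row in logits]
    return [[sum(combine[i][j] * slot_out[j][k] for j in range(s))
             for k in range(d)] for i in range(m)]
```

Because each expert runs once per slot rather than once per token, the expert cost depends on the number of slots, which is how a soft assignment keeps computation bounded.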

Keywords: audio spectrogram transformer; Efficient Fine-tuning; Adapters; Mixture of Experts; Soft Mixture of Adapters


Dynamic Convolutional Neural Networks as Efficient Pre-Trained Audio Models

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, vol. 32, pp. 2227-2241

Authors: Schmid, Florian; Koutini, Khaled; Widmer, Gerhard
Affiliations: Johannes Kepler Univ Linz, Inst Computat Percept CP JKU, A-4040 Linz, Austria; Johannes Kepler Univ Linz, LIT Artificial Intelligence Lab, A-4040 Linz, Austria

The introduction of large-scale audio datasets, such as AudioSet, paved the way for transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio spectrogram transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular audio spectrogram transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks constructed of dynamic convolutions, a dynamic ReLU activation function, and Coordinate Attention. We show that these dynamic CNNs outperform traditional efficient CNNs, such as MobileNets, in terms of the performance-complexity trade-off at the task of audio tagging on the large-scale AudioSet. Our experiments further indicate that the proposed dynamic CNNs achieve competitive performance with transformer-based models for end-to-end fine-tuning on downstream tasks while being much more computationally efficient.

Keywords: Dynamic convolutional neural networks; dynamic convolution; dynamic ReLU; coordinate attention; audio spectrogram transformer; audio classification; pre-trained audio models; knowledge distillation


Sound Tagging in Infant-centric Home Soundscapes


9th IEEE/ACM International Conference on Connected Health - Applications, Systems and Engineering Technologies (CHASE)

Authors: Khan, Mohammad Nur Hossain; Li, Jialu; McElwain, Nancy L.; Hasegawa-Johnson, Mark; Islam, Bashima
Affiliations: Worcester Polytech Inst, Dept Elect & Comp Engn, Worcester, MA 01609, USA; Univ Illinois, Dept Elect & Comp Engn, Champaign, IL, USA; Univ Illinois, Dept Human Dev & Family Studies, Champaign, IL, USA; Univ Illinois, Beckman Inst Adv Sci & Technol, Champaign, IL, USA

ISBN: (print) 9798350345025; 9798350345018

Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (audio spectrogram transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.
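
The two reported metrics, F1-score and Cohen's Kappa, can be computed from confusion counts as in the sketch below. It covers only the binary case with hypothetical counts; how the paper aggregates the metrics over its sound classes is not stated in the abstract and is not assumed here.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginals.
    """
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: all items fall in a single class
    return (observed - expected) / (1.0 - expected)
```

With hypothetical counts of 40 true positives, 10 false positives, 10 false negatives, and 40 true negatives, F1 is 0.8 and kappa is about 0.6. Kappa discounts agreement that would occur by chance, which is why it can sit near zero even when some predictions are correct.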

Keywords: Infant-centric soundscape; audio spectrogram transformer; domestic sound event detection; pretrained model


AVR: Synergizing Foundation Models for Audio-Visual Humor Detection

25th Interspeech Conference

Authors: Sharma, Sarthak; Phukan, Orchid Chetia; Singh, Drishti; Buduru, Arun Balaji; Sharma, Rajesh
Affiliations: IIIT Delhi, Delhi, India; Univ Tartu, Tartu, Estonia

In this work, we present the AVR application for audio-visual humor detection. While humor detection has traditionally centered around textual analysis, recent advancements have spotlighted multimodal approaches. However, these methods lean on textual cues as a modality, necessitating the use of ASR systems for transcribing the audio data. This heavy reliance on ASR accuracy can pose challenges in real-world applications. To address this bottleneck, we propose an innovative audio-visual humor detection system that circumvents textual reliance, eliminating the need for ASR models. Instead, the proposed approach hinges on the intricate interplay between audio and visual content for effective humor detection.

Keywords: Audio-Visual Humor Detection; Multimodal System; VideoMAE; audio spectrogram transformer; LanguageBind


Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor

25th Interspeech Conference

Authors: Si, Yongjie; Li, Yanxiong; Li, Jialong; Tan, Jiaxin; He, Qianhua
Affiliations: South China Univ Technol, Sch Elect & Informat Engn, Guangzhou, Peoples R China

It is assumed that training data is sufficient in the base session of few-shot class-incremental audio classification. However, it is difficult to collect abundant samples for model training in the base session in some practical scenarios due to the data scarcity of some classes. This paper explores a new problem of fully few-shot class-incremental audio classification with few training samples in all sessions. Moreover, we propose a method using an expandable dual-embedding extractor to solve it. The proposed model consists of an embedding extractor and an expandable classifier. The embedding extractor consists of a pretrained audio spectrogram transformer (AST) and a finetuned AST. The expandable classifier consists of prototypes, and each prototype represents a class. Experiments are conducted on three datasets (LS-100, NSynth-100 and FSC-89). Results show that our method outperforms seven baseline methods in average accuracy with statistical significance. Code is at: https://***/YongjieSi/EDE.
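
A prototype-based classifier that grows as new classes arrive might look like the sketch below. It assumes each prototype is the mean of a class's embeddings and that prediction picks the prototype with the highest cosine similarity; the class and method names are hypothetical, the embeddings are taken as given, and none of this is drawn from the EDE code.

```python
import math

def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class PrototypeClassifier:
    """Expandable classifier: one prototype per class, added incrementally."""

    def __init__(self):
        self.prototypes = {}  # label -> prototype vector

    def add_class(self, label, embeddings):
        """Register a class from a few embeddings (assumed: mean prototype)."""
        if not embeddings:
            raise ValueError("at least one embedding is required")
        self.prototypes[label] = _mean(embeddings)

    def predict(self, embedding):
        """Return the label whose prototype is most similar to the embedding."""
        if not self.prototypes:
            raise ValueError("no classes registered")
        return max(self.prototypes,
                   key=lambda lab: _cosine(embedding, self.prototypes[lab]))
```

New sessions only call `add_class`, so earlier prototypes are left untouched; that property is what makes a prototype classifier a natural fit for class-incremental settings with few samples per class.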

Keywords: Few-shot learning; incremental learning; audio classification; expandable dual-embedding extractor; audio spectrogram transformer
