Search Results - Inner Mongolia University Library

49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Liu, Miao; Wang, Jing; Qian, Xinyuan; Xie, Xiang (Beijing Inst Technol, Beijing, Peoples R China; Univ Sci & Technol Beijing, Beijing, Peoples R China)

ISBN: (print) 9798350344868; 9798350344851

Binaural audio delivers an immersive spatial auditory experience to human listeners, but most existing videos lack binaural audio due to the expertise required for recording environments. Recent studies have been dedicated to converting monaural audio into binaural audio conditioned on visual inputs. In this paper, we propose a novel audio-visual spatialization network with two added audio decoders, which rely on carefully designed visual features to generate audio outputs for the left and right channels, respectively. In addition, we propose an audio-visual matching loss to further explore the correlation between binaural audio and the visual input of the scene. Experimental results show that the proposed method outperforms several state-of-the-art binaural audio generation methods on two benchmark datasets, FAIR-Play and MUSICStereo. Qualitative results are also presented to demonstrate the effectiveness of the proposed method.

Keywords: binaural audio generation; audio-visual learning; cross-modal consistency


Motion Based Audio-Visual Segmentation

25th Interspeech Conference

Authors: Li, Jiahao; Liu, Miao; Yang, Shu; Wang, Jing; Xie, Xiang (Beijing Inst Technol, Beijing, Peoples R China; Tsinghua Univ, Beijing, Peoples R China)

Recently, a novel task called audio-visual segmentation (AVS) has emerged, focusing on pixel-wise segmentation of sounding objects in videos. This task is particularly challenging as it involves segmenting individual pixels based on objects in video frames accompanied by sound. We propose a Motion Based Audio-Visual Segmentation model, which incorporates optical flow maps with motion information into the AVS task for the first time. The Motion-Vision Attention Module (MVA) is proposed to facilitate the fusion of motion and visual features to exploit motion information. Additionally, the Cross-Modal Bilateral-Attention Module (CMBA) is introduced to integrate multimodal features through cross-modal attention. The proposed model is evaluated on two distinct datasets, S4 and MS3, where its superior performance demonstrates its effectiveness and feasibility in addressing the AVS task.

Keywords: audio-visual learning; audio-visual segmentation; multi-modal learning


TIM: A Time Interval Machine for Audio-Visual Action Recognition

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Chalk, Jacob; Huh, Jaesung; Kazakos, Evangelos; Zisserman, Andrew; Damen, Dima (Univ Bristol, Bristol, Avon, England; Univ Oxford, VGG, Oxford, England; Czech Tech Univ, Prague, Czech Republic)

ISBN: (print) 9798350353006

Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://***/JacobChalk/TIM.

Keywords: action detection; action recognition; audio-visual learning; egocentric videos; video understanding


AV-SUPERB: A MULTI-TASK EVALUATION BENCHMARK FOR AUDIO-VISUAL REPRESENTATION MODELS

49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Tseng, Yuan; Berry, Layne; Chen, Yi-Ting; Chiu, I-Hsiang; Lin, Hsuan-Hao; Liu, Max; Peng, Puyuan; Shih, Yi-Jen; Wang, Hung-Yu; Wu, Haibin; Huang, Po-Yao; Lai, Chun-Mao; Li, Shang-Wen; Harwath, David; Tsao, Yu; Mohamed, Abdelrahman; Feng, Chi-Luen; Lee, Hung-Yi (Natl Taiwan Univ, Taipei, Taiwan; Univ Texas Austin, Austin, TX, USA; Acad Sinica, Taipei, Taiwan; Meta AI, Toronto, ON, Canada; Rembrand, Palo Alto, CA, USA)

ISBN: (print) 9798350344868; 9798350344851

Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and that audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.

Keywords: audio-visual learning; Representation learning; Evaluation; Self-Supervised learning


LEARNING SOUND LOCALIZATION BETTER FROM SEMANTICALLY SIMILAR SAMPLES

47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Senocak, Arda; Ryu, Hyeonggon; Kim, Junsik; Kweon, In So (Korea Adv Inst Sci & Technol, Daejeon, South Korea; Harvard Univ, Cambridge, MA 02138, USA)

ISBN: (print) 9781665405409

The objective of this work is to localize the sound sources in visual scenes. Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Thus, these semantically correlated pairs, "hard positives", are mistakenly grouped as negatives. Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs. Our approach incorporates these hard positives by adding their response maps into a contrastive learning objective directly. We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets, showing favorable performance to the state-of-the-art methods.

Keywords: audio-visual learning; audio-visual sound localization; audio-visual correspondence; self-supervised


T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Mahmud, Tanvir; Tian, Yapeng; Marculescu, Diana (Univ Texas Austin, Austin, TX 78712, USA; Univ Texas Dallas, Dallas, TX 75080, USA)

ISBN: (print) 9798350353006

Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://***/enyac-group/T-VSL/tree/main.

Keywords: audio-visual learning; CLIP; Multi-modal Foundation Model; Sound Source Localization


FOLEYGEN: VISUALLY-GUIDED AUDIO GENERATION

34th International Workshop on Machine Learning for Signal Processing

Authors: Mei, Xinhao; Nagaraj, Varun; Le Lant, Gael; Ni, Zhaoheng; Chang, Ernie; Shi, Yangyang; Chandrakumar, Vikas (Meta, Menlo Pk, CA 94025, USA; Univ Surrey, Guildford, Surrey, England)

ISBN: (print) 9798350372267; 9798350372250

Recent advancements in audio generation tasks, such as text-to-audio and text-to-music generation, have been spurred by the evolution of deep learning models and large-scale datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. FoleyGen features two distinct versions, differentiated by how the visual features extracted by the visual encoder interact with the Transformer model. FoleyGen-C employs a cross-attention module that enables audio tokens to attend to visual features. In contrast, FoleyGen-P appends visual features directly to the audio tokens, allowing interactions within the self-attention mechanism of the Transformer. A significant challenge in V2A generation is the misalignment of generated audio with corresponding visual actions. To address this, we develop three visual attention mechanisms to assess their impact on audio-visual synchronization. Additionally, we undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. The experimental results on the VGGSound dataset show that our proposed FoleyGen models outperform previous systems across all objective metrics and human evaluations. The audio samples can be found on our demo page: https://***/foleygen_demo/.

Keywords: Sound generation; audio-visual learning; video-to-audio generation; multimodal learning


Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, vol. 43, no. 5, pp. 1605-1619

Authors: Senocak, Arda; Oh, Tae-Hyun; Kim, Junsik; Yang, Ming-Hsuan; Kweon, In So (Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon 34141, South Korea; POSTECH, Dept Elect Engn, Pohang 37673, South Korea; Univ Calif, Dept Elect Engn & Comp Sci, Merced, CA 95343, USA)

Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene and sound, as well as localize the sound source, only by observing them as humans do? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to achieve this goal, a two-stream network structure which handles each modality with an attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. Thereby, we show that this false conclusion cannot be fixed without human prior knowledge due to the well-known correlation and causality mismatch misconception. To fix this issue, we extend our network to the supervised and semi-supervised network settings via a simple modification due to the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., semi-supervised setup. Furthermore, we present the versatility of the learned audio and visual embeddings on the cross-modal content alignment and we extend this proposed algorithm to a new application, sound-saliency-based automatic camera view panning in 360-degree videos.

Keywords: Visualization; Videos; Task analysis; Correlation; Deep learning; Network architecture; Unsupervised learning; audio-visual learning; sound localization; self-supervision; multi-modal learning; cross-modal retrieval


Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Interspeech Conference

Authors: Liu, Xubo; Huang, Qiushi; Mei, Xinhao; Liu, Haohe; Kong, Qiuqiang; Sun, Jianyuan; Li, Shengchen; Ko, Tom; Zhang, Yu; Tang, Lilian H.; Plumbley, Mark D.; Kilic, Volkan; Wang, Wenwu (Univ Surrey, Guildford, Surrey, England; ByteDance, Beijing, Peoples R China; Xian Jiaotong Liverpool Univ, Xian, Peoples R China; Southern Univ Sci & Technol, Shenzhen, Peoples R China; Izmir Katip Celebi Univ, Izmir, Turkiye)

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.

Keywords: audio captioning; audio-visual learning; attention mechanism; multimodal learning


PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data

36th Annual ACM Symposium on User Interface Software and Technology (UIST)

Authors: Ning, Zheng; Zhang, Zheng; Xu, Chenliang; Tian, Yapeng; Li, Toby Jia-Jun (Univ Notre Dame, Notre Dame, IN 46556, USA; Univ Rochester, Rochester, NY, USA; Univ Texas Dallas, Richardson, TX 75083, USA)

ISBN: (print) 9798400701320

Audio-visual learning seeks to enhance the computer's multi-modal perception by leveraging the correlation between the auditory and visual modalities. Despite their many useful downstream tasks, such as video retrieval, AR/VR, and accessibility, the performance and adoption of existing audio-visual models have been impeded by the limited availability of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed an efficient audio-visual annotation tool called Peanut. Peanut's human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and utilizes state-of-the-art object detection and sound-tagging models to reduce the annotators' effort to process each frame and the number of manually-annotated frames needed. A within-subject user study with 20 participants found that Peanut can significantly accelerate the audio-visual data annotation process while maintaining high annotation accuracy.

Keywords: human-AI collaboration; data annotation; data labeling; audio-visual learning; interactive machine learning
