ISBN (print): 9783031723377; 9783031723384
A more fine-grained video spatial localization task, audio-visual segmentation (AVS), has recently been proposed; it aims to generate masks of the sounding objects in a given video. In this paper, we propose a novel network for AVS. Specifically, to capture comprehensive global auditory semantic information and facilitate its interaction with the visual frames, our model incorporates a transformer-based audio-visual context encoder, which is designed to generate a pixel-level score map enriched with auditory contextual information for the decoder. In addition, to address the vague boundaries of sounding objects in videos, we introduce a refinement module together with a structural similarity (SSIM) loss to encourage accurate boundary predictions. Extensive experiments on the AVSBench dataset show that our proposed method surpasses the AVSBench baseline as well as several advanced methods adapted from other tasks.
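Since the abstract does not spell out the loss, here is a minimal sketch of what a window-based structural similarity (SSIM) loss between a predicted mask and its ground truth typically looks like; the 11x11 window, the constants, and the (B, 1, H, W) tensor layout are assumptions for illustration, not details taken from the paper.

```python
# Sketch of a window-based SSIM loss for mask prediction (assumed form, not
# the paper's implementation). `pred` and `target` are (B, 1, H, W) in [0, 1].
import torch
import torch.nn.functional as F

def ssim_loss(pred: torch.Tensor, target: torch.Tensor,
              window: int = 11, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    pad = window // 2
    # Local means via average pooling over a sliding window.
    mu_p = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_t = F.avg_pool2d(target, window, stride=1, padding=pad)
    # Local variances and covariance.
    sigma_p = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_p ** 2
    sigma_t = F.avg_pool2d(target * target, window, stride=1, padding=pad) - mu_t ** 2
    sigma_pt = F.avg_pool2d(pred * target, window, stride=1, padding=pad) - mu_p * mu_t
    ssim_map = ((2 * mu_p * mu_t + c1) * (2 * sigma_pt + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (sigma_p + sigma_t + c2))
    # Higher SSIM means more structural agreement, so the loss is 1 - SSIM.
    return 1.0 - ssim_map.mean()
```

Because SSIM compares local structure rather than per-pixel values, such a term penalizes blurry or shifted object boundaries more strongly than a plain cross-entropy loss would.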
ISBN (print): 9798350372267; 9798350372250
Recent advancements in audio generation tasks, such as text-to-audio and text-to-music generation, have been spurred by the evolution of deep learning models and large-scale datasets. However, video-to-audio (V2A) generation remains challenging, principally because of the intricate relationship between high-dimensional visual and auditory data and the difficulty of temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language-modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. Audio tokens are generated by a single Transformer model conditioned on visual features extracted by a visual encoder. FoleyGen comes in two distinct versions, differentiated by how those visual features interact with the Transformer. FoleyGen-C employs a cross-attention module that enables audio tokens to attend to visual features, whereas FoleyGen-P appends visual features directly to the audio tokens, allowing interaction within the Transformer's self-attention mechanism. A significant challenge in V2A generation is the misalignment of generated audio with the corresponding visual actions; to address this, we develop three visual attention mechanisms and assess their impact on audio-visual synchronization. We also undertake an exhaustive evaluation of multiple visual encoders, each pretrained on single-modal or multi-modal tasks. Experimental results on the VGGSound dataset show that our proposed FoleyGen models outperform previous systems across all objective metrics and human evaluations. Audio samples can be found on our demo page: https://***/foleygen_demo/.
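To make the two conditioning schemes concrete, the sketch below contrasts cross-attention conditioning with concatenating visual features into the token sequence, using generic PyTorch layers; the dimensions, layer choices, token ordering, and absence of causal masking are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the two conditioning styles described in the abstract (assumed
# shapes and layers; a real decoder would also apply a causal mask).
import torch
import torch.nn as nn

d_model, nhead = 512, 8
audio_tokens = torch.randn(2, 100, d_model)   # (batch, audio token steps, dim)
visual_feats = torch.randn(2, 30, d_model)    # (batch, video frames, dim)

# FoleyGen-C style: audio tokens attend to visual features via cross-attention.
cross_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
out_cross = cross_layer(tgt=audio_tokens, memory=visual_feats)

# FoleyGen-P style: visual features are concatenated with the audio tokens, so
# they interact through ordinary self-attention (the ordering here is illustrative).
self_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
out_concat = self_layer(torch.cat([visual_feats, audio_tokens], dim=1))
out_audio = out_concat[:, visual_feats.size(1):]  # keep only the audio positions
```

A practical difference between the two: cross-attention keeps the audio sequence length unchanged, while concatenation lengthens the sequence the self-attention must process but needs no separate cross-attention module.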
ISBN (print): 9798350353006
Visual sound source localization poses a significant challenge: identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, we propose incorporating the text modality as an intermediate feature guide, using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the classes of the sounding entities in a mixture. The textual representation of each sounding source is then used as guidance to disentangle fine-grained audio-visual source correspondence from the multi-source mixture, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes at test time. Extensive experiments on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://***/enyac-group/T-VSL/tree/main.
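As a rough illustration of the text-as-bridge idea, the sketch below compares a class-name embedding from a shared embedding space against per-location visual features to obtain a localization map for one sounding source; the projection into a common 512-dimensional space and the feature shapes are assumptions rather than details from the paper.

```python
# Sketch: a text embedding of a predicted class is matched against per-location
# visual features in a shared space to yield a heatmap for that source.
# Shapes and the shared 512-d space are assumptions for illustration.
import torch
import torch.nn.functional as F

def text_guided_heatmap(visual_feats: torch.Tensor,  # (B, C, H, W), projected to the shared space
                        text_embed: torch.Tensor      # (B, C), embedding of the predicted class name
                        ) -> torch.Tensor:
    v = F.normalize(visual_feats, dim=1)              # unit-norm feature at each location
    t = F.normalize(text_embed, dim=1)                # unit-norm text vector
    # Cosine similarity at every spatial location -> one heatmap per source.
    return torch.einsum("bchw,bc->bhw", v, t)

# Example: one map would be produced per predicted class in a multi-source mixture.
vis = torch.randn(1, 512, 14, 14)
txt = torch.randn(1, 512)
print(text_guided_heatmap(vis, txt).shape)  # torch.Size([1, 14, 14])
```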
Audio-visual associative learning - at least when linguistic stimuli are employed - is known to rely on core linguistic skills such as phonological awareness. Here we ask whether this is also the case in a task that does not manipulate linguistic information. Another question of interest is whether executive skills, often found to support learning, play a larger role in a non-linguistic audio-visual associative task than in a linguistic one. We present a new task that measures learning when non-linguistic auditory signals must be associated with novel visual shapes. Importantly, our novel task shares with linguistic processes such as reading acquisition the need to associate sounds with arbitrary shapes; yet, rather than phonemes or syllables, it uses novel environmental sounds, thereby limiting direct reliance on linguistic abilities. Five-year-old French-speaking children (N = 76, 39 girls) were assessed individually on our novel audio-visual associative task, as well as on a number of other cognitive tasks evaluating linguistic abilities and executive functions. We found phonological awareness and language comprehension to be related to scores on the audio-visual associative task, while no correlation with executive functions was observed. These results underscore a key relation between foundational language competencies and audio-visual associative learning, even in the absence of linguistic input in the associative task.