ISBN (print): 9783031723377; 9783031723384
A more fine-grained video spatial localization task, audio-visual segmentation (AVS), has recently been proposed; it aims to generate masks of the sounding objects in a given video. In this paper, we propose a novel network for AVS. Specifically, to capture comprehensive global auditory semantic information and facilitate its interaction with the visual frames, our model incorporates a transformer-based audio-visual context encoder, which is designed to generate a pixel-level score map enriched with auditory contextual information for the decoder. In addition, to address the vague boundaries of sounding objects in videos, we introduce a refinement module together with a structural similarity (SSIM) loss to encourage accurate boundary predictions. Extensive experiments on the AVSBench dataset show that our proposed method surpasses the AVSBench baseline as well as several advanced methods adapted from other tasks.
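Since the abstract does not spell out the loss, here is a minimal sketch of what a window-based structural similarity (SSIM) loss between a predicted mask and its ground truth typically looks like; the 11x11 window, the constants, and the (B, 1, H, W) tensor layout are assumptions for illustration, not details taken from the paper.

```python
# Sketch of a window-based SSIM loss for mask prediction (assumed form, not
# the paper's implementation). `pred` and `target` are (B, 1, H, W) in [0, 1].
import torch
import torch.nn.functional as F

def ssim_loss(pred: torch.Tensor, target: torch.Tensor,
              window: int = 11, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    pad = window // 2
    # Local means via average pooling over a sliding window.
    mu_p = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_t = F.avg_pool2d(target, window, stride=1, padding=pad)
    # Local variances and covariance.
    sigma_p = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_p ** 2
    sigma_t = F.avg_pool2d(target * target, window, stride=1, padding=pad) - mu_t ** 2
    sigma_pt = F.avg_pool2d(pred * target, window, stride=1, padding=pad) - mu_p * mu_t
    ssim_map = ((2 * mu_p * mu_t + c1) * (2 * sigma_pt + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (sigma_p + sigma_t + c2))
    # Higher SSIM means more structural agreement, so the loss is 1 - SSIM.
    return 1.0 - ssim_map.mean()
```

Because SSIM compares local structure rather than per-pixel values, such a term penalizes blurry or shifted object boundaries more strongly than a plain cross-entropy loss would.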
ISBN (print): 9798350372267; 9798350372250
Recent advancements in audio generation tasks, such as text-to-audio and text-to-music generation, have been spurred by the evolution of deep learning models and large-scale datasets. However, video-to-audio (V2A) generation remains challenging, principally because of the intricate relationship between high-dimensional visual and auditory data and the difficulty of temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language-modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. Audio tokens are generated by a single Transformer model conditioned on visual features extracted by a visual encoder. FoleyGen comes in two distinct versions, differentiated by how those visual features interact with the Transformer. FoleyGen-C employs a cross-attention module that enables audio tokens to attend to visual features, whereas FoleyGen-P appends visual features directly to the audio tokens, allowing interaction within the Transformer's self-attention mechanism. A significant challenge in V2A generation is the misalignment of generated audio with the corresponding visual actions; to address this, we develop three visual attention mechanisms and assess their impact on audio-visual synchronization. We also undertake an exhaustive evaluation of multiple visual encoders, each pretrained on single-modal or multi-modal tasks. Experimental results on the VGGSound dataset show that our proposed FoleyGen models outperform previous systems across all objective metrics and human evaluations. Audio samples can be found on our demo page: https://***/foleygen_demo/.
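To make the two conditioning schemes concrete, the sketch below contrasts cross-attention conditioning with concatenating visual features into the token sequence, using generic PyTorch layers; the dimensions, layer choices, token ordering, and absence of causal masking are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the two conditioning styles described in the abstract (assumed
# shapes and layers; a real decoder would also apply a causal mask).
import torch
import torch.nn as nn

d_model, nhead = 512, 8
audio_tokens = torch.randn(2, 100, d_model)   # (batch, audio token steps, dim)
visual_feats = torch.randn(2, 30, d_model)    # (batch, video frames, dim)

# FoleyGen-C style: audio tokens attend to visual features via cross-attention.
cross_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
out_cross = cross_layer(tgt=audio_tokens, memory=visual_feats)

# FoleyGen-P style: visual features are concatenated with the audio tokens, so
# they interact through ordinary self-attention (the ordering here is illustrative).
self_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
out_concat = self_layer(torch.cat([visual_feats, audio_tokens], dim=1))
out_audio = out_concat[:, visual_feats.size(1):]  # keep only the audio positions
```

A practical difference between the two: cross-attention keeps the audio sequence length unchanged, while concatenation lengthens the sequence the self-attention must process but needs no separate cross-attention module.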
ISBN (print): 9798350353006
Visual sound source localization poses a significant challenge: identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, we propose incorporating the text modality as an intermediate feature guide, using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the classes of the sounding entities in a mixture. The textual representation of each sounding source is then used as guidance to disentangle fine-grained audio-visual source correspondence from the multi-source mixture, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes at test time. Extensive experiments on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://***/enyac-group/T-VSL/tree/main.
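As a rough illustration of the text-as-bridge idea, the sketch below compares a class-name embedding from a shared embedding space against per-location visual features to obtain a localization map for one sounding source; the projection into a common 512-dimensional space and the feature shapes are assumptions rather than details from the paper.

```python
# Sketch: a text embedding of a predicted class is matched against per-location
# visual features in a shared space to yield a heatmap for that source.
# Shapes and the shared 512-d space are assumptions for illustration.
import torch
import torch.nn.functional as F

def text_guided_heatmap(visual_feats: torch.Tensor,  # (B, C, H, W), projected to the shared space
                        text_embed: torch.Tensor      # (B, C), embedding of the predicted class name
                        ) -> torch.Tensor:
    v = F.normalize(visual_feats, dim=1)              # unit-norm feature at each location
    t = F.normalize(text_embed, dim=1)                # unit-norm text vector
    # Cosine similarity at every spatial location -> one heatmap per source.
    return torch.einsum("bchw,bc->bhw", v, t)

# Example: one map would be produced per predicted class in a multi-source mixture.
vis = torch.randn(1, 512, 14, 14)
txt = torch.randn(1, 512)
print(text_guided_heatmap(vis, txt).shape)  # torch.Size([1, 14, 14])
```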
Audio-visual associative learning - at least when linguistic stimuli are employed - is known to rely on core linguistic skills such as phonological awareness. Here we ask whether this is also the case in a task that does not manipulate linguistic information. Another question of interest is whether executive skills, often found to support learning, play a larger role in a non-linguistic audio-visual associative task than in a linguistic one. We present a new task that measures learning when non-linguistic auditory signals must be associated with novel visual shapes. Importantly, our novel task shares with linguistic processes such as reading acquisition the need to associate sounds with arbitrary shapes; yet, rather than phonemes or syllables, it uses novel environmental sounds, thereby limiting direct reliance on linguistic abilities. Five-year-old French-speaking children (N = 76, 39 girls) were assessed individually on our novel audio-visual associative task, as well as on a number of other cognitive tasks evaluating linguistic abilities and executive functions. We found phonological awareness and language comprehension to be related to scores on the audio-visual associative task, while no correlation with executive functions was observed. These results underscore a key relation between foundational language competencies and audio-visual associative learning, even in the absence of linguistic input in the associative task.