Search Results - Inner Mongolia University Library

Multisensory Congruency Enhances Explicit Awareness in a Sequence Learning Task

MULTISENSORY RESEARCH, 2017, Vol. 30, No. 7-8, pp. 681-689

Authors: Silva, Andrew E.; Barakat, Brandon K.; Jimenez, Luis O.; Shams, Ladan (Univ Calif Los Angeles, Los Angeles, CA 90095, USA)

We examined the effect of audiovisual training on learning a repeated sequence of motor responses. Participants were trained with either congruent or incongruent audiovisual cues to produce motor responses. Learning was tested by comparing reaction times to untrained sequences and by asking participants to recreate the trained sequence. A strong association was found between the two measures and the majority of high-scoring participants belonged to the congruent audiovisual condition. Because the second measure requires explicit knowledge of the trained sequence, we conclude that audiovisual congruency facilitates explicit learning.

Keywords: Sequence learning; audio-visual learning; serial reaction time

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, Vol. 43, No. 5, pp. 1605-1619

Authors: Senocak, Arda; Oh, Tae-Hyun; Kim, Junsik; Yang, Ming-Hsuan; Kweon, In So (Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon 34141, South Korea; POSTECH, Dept Elect Engn, Pohang 37673, South Korea; Univ Calif, Dept Elect Engn & Comp Sci, Merced, CA 95343, USA)

Visual events are usually accompanied by sounds in our daily lives. However, can the machines learn to correlate the visual scene and sound, as well as localize the sound source only by observing them like humans? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to achieve this goal, a two-stream network structure which handles each modality with attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. Thereby, we show that this false conclusion cannot be fixed without human prior knowledge due to the well-known correlation and causality mismatch misconception. To fix this issue, we extend our network to the supervised and semi-supervised network settings via a simple modification due to the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., semi-supervised setup. Furthermore, we present the versatility of the learned audio and visual embeddings on the cross-modal content alignment and we extend this proposed algorithm to a new application, sound saliency based automatic camera view panning in 360 degree videos.

Keywords: Visualization; Videos; Task analysis; Correlation; Deep learning; Network architecture; Unsupervised learning; audio-visual learning; sound localization; self-supervision; multi-modal learning; cross-modal retrieval
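
The attention-based localization idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' network: it assumes the two streams have already produced an audio embedding and a grid of visual features, and the function names `cosine` and `localization_map` are hypothetical. It only shows how comparing one audio vector against every spatial location can yield a localization map without annotation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def localization_map(audio_emb, visual_feats):
    """Softmax attention over spatial locations (illustrative assumption).

    audio_emb:    list of D floats, one embedding for the sound clip
    visual_feats: H x W grid, each cell a list of D floats
    Returns an H x W grid of weights that sum to 1.
    """
    sims = [[cosine(audio_emb, cell) for cell in row] for row in visual_feats]
    peak = max(max(row) for row in sims)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in sims]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

# Toy 2x2 feature grid; the top-right cell is aligned with the audio embedding.
audio = [1.0, 0.0]
grid = [[[0.0, 1.0], [1.0, 0.0]],
        [[0.0, 1.0], [-1.0, 0.0]]]
attention = localization_map(audio, grid)
```

In this toy input the largest weight falls on the cell whose feature points in the same direction as the audio embedding, which is the behavior an attention map of this kind is meant to show.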

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Kim, Dongjin; Um, Sung Jin; Lee, Sangmin; Kim, Jung Uk (Kyung Hee Univ, Seoul, South Korea; Univ Illinois, Urbana, IL 61801, USA)

ISBN: (print) 9798350353006

The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://***/visualAIKHU/NoPrior_MultiSSL.

Keywords: audio-visual learning; Multimodal learning; Sound Source Localization

LEARNING SOUND LOCALIZATION BETTER FROM SEMANTICALLY SIMILAR SAMPLES

47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Senocak, Arda; Ryu, Hyeonggon; Kim, Junsik; Kweon, In So (Korea Adv Inst Sci & Technol, Daejeon, South Korea; Harvard Univ, Cambridge, MA 02138, USA)

ISBN: (print) 9781665405409

The objective of this work is to localize the sound sources in visual scenes. Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Thus, these semantically correlated pairs, "hard positives", are mistakenly grouped as negatives. Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs. Our approach incorporates these hard positives by adding their response maps into a contrastive learning objective directly. We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets, showing favorable performance to the state-of-the-art methods.

Keywords: audio-visual learning; audio-visual sound localization; audio-visual correspondence; self-supervised
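
The effect of treating hard positives as positives, as described in the abstract above, can be sketched with a simplified InfoNCE-style loss. This is an illustration under stated assumptions, not the paper's objective: it works on scalar similarity scores instead of response maps, and the function name `contrastive_loss`, the `temperature` value, and the toy scores are all hypothetical.

```python
import math

def contrastive_loss(sim_row, positive_idx, hard_positive_idx=(), temperature=0.1):
    """InfoNCE-style loss for one audio clip against a batch of images.

    sim_row:           similarity of this audio clip to every image in the batch
    positive_idx:      index of the truly paired image
    hard_positive_idx: indices of semantically similar images that would
                       otherwise be counted as negatives (assumed to be given)
    """
    positives = {positive_idx, *hard_positive_idx}
    exps = [math.exp(s / temperature) for s in sim_row]
    numerator = sum(exps[i] for i in positives)
    return -math.log(numerator / sum(exps))

# Image 1 is semantically close to the paired image 0.
sims = [0.9, 0.8, 0.1, -0.2]
plain = contrastive_loss(sims, positive_idx=0)
with_hard = contrastive_loss(sims, positive_idx=0, hard_positive_idx=[1])
```

With the hard positive moved into the numerator, the loss no longer penalizes the model for scoring the semantically matched image highly, so `with_hard` is smaller than `plain` for these scores.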

Learning to See and Hear without Human Supervision

Author: Maravilha Morgado, Pedro Miguel (University of California San Diego)

Degree: Ph.D., Doctor of Philosophy

Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound serves as a bridge to connect multiple instances of a visual scene. It can group scenes that 'go together' and set apart the ones that do not. Co-occurring sensory signals can thus be used as a target to learn powerful representations for visual inputs without relying on costly human annotations. In this thesis, I introduce effective self-supervised learning methods that curb the need for human supervision. I discuss several tasks that benefit from audio-visual learning, including representation learning for action and audio recognition, visually-driven sound source localization, and spatial sound generation. I introduce an effective contrastive learning framework that learns audio-visual models by answering multiple-choice audio-visual association questions. I also discuss critical challenges we face when learning from audio supervision related to noisy audio-visual associations, and the lack of spatial grounding of sound signals in common videos.

Keywords: audio-visual learning

Leveraging the Video-Level Semantic Consistency of Event for Audio-Visual Event Localization

IEEE TRANSACTIONS ON MULTIMEDIA, 2024, Vol. 26, pp. 4617-4627

Authors: Jiang, Yuanyuan; Yin, Jianqin; Dang, Yonghao (Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing 100876, Peoples R China)

Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods are often limited to independently encoding and classifying each video segment separated from the full video (which can be regarded as the segment-level representations of events). However, they ignore the semantic consistency of the event within the same full video (which can be considered as the video-level representations of events). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module to explore video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE is proposed to obtain the event semantic information at the video level. Furthermore, ISCE takes video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative pair filter loss to encourage the network to filter out the irrelevant segment pairs and a new smooth loss to further increase the gap between different categories of events in the weakly-supervised setting. We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings, thus verifying the effectiveness of our method.

Keywords: audio-visual learning; event localization; video understanding; weakly-supervised learning

Semantic and Relation Modulation for Audio-Visual Event Localization

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, Vol. 45, No. 6, pp. 7711-7725

Authors: Wang, Hao; Zha, Zheng-Jun; Li, Liang; Chen, Xuejin; Luo, Jiebo (Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230052, Anhui, Peoples R China; Chinese Acad Sci, Inst Comp Technol, Beijing 100045, Peoples R China; Univ Rochester, Dept Comp Sci, Rochester, NY 14627, USA)

We study the problem of localizing audio-visual events that are both audible and visible in a video. Existing works focus on encoding and aligning audio and visual features at the segment level while neglecting informative correlation between segments of the two modalities and between multi-scale event proposals. We propose a novel Semantic and Relation Modulation Network (SRMN) to learn the above correlation and leverage it to modulate the related auditory, visual, and fused features. In particular, for semantic modulation, we propose intra-modal normalization and cross-modal normalization. The former modulates features of a single modality with the event-relevant semantic guidance of the same modality. The latter modulates features of two modalities by establishing and exploiting the cross-modal relationship. For relation modulation, we propose a multi-scale proposal modulating module and a multi-alignment segment modulating module to introduce multi-scale event proposals and enable dense matching between cross-modal segments, which strengthen correlations between successive segments within one proposal and between all segments. With the features modulated by the correlation information regarding audio-visual events, SRMN performs accurate event localization. Extensive experiments conducted on the public AVE dataset demonstrate that our method outperforms the state-of-the-art methods in both supervised event localization and cross-modality localization tasks.

Keywords: Visualization; Location awareness; Correlation; Proposals; Semantics; Task analysis; Modulation; audio-visual learning; event localization; normalization

An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, Vol. 46, No. 10, pp. 6637-6651

Authors: Li, Kai; Xie, Fenghua; Chen, Hang; Yuan, Kexin; Hu, Xiaolin (Tsinghua Univ, Inst Artificial Intelligence, IDG McGovern Inst Brain Res, Tsinghua Lab Brain & Intelligence THBI, Dept Comp S, Beijing 100084, Peoples R China; Tsinghua Univ, IDG McGovern Inst Brain Res, Sch Med, Tsinghua Lab Brain & Intelligence THBI, Dept Biomed, Beijing 100084, Peoples R China; Chinese Inst Brain Res CIBR, Beijing 100010, Peoples R China)

Audio-visual approaches involving visual inputs have laid the foundation for recent progress in speech separation. However, the optimization of the concurrent usage of auditory and visual inputs is still an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. The results of experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks.

Keywords: audio-visual learning; brain-inspired model; cortico-thalamo-cortical circuit; speech separation

Contrastive Positive Sample Propagation Along the Audio-Visual Event Line

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, Vol. 45, No. 6, pp. 7239-7257

Authors: Zhou, Jinxing; Guo, Dan; Wang, Meng (Hefei Univ Technol HFUT, Sch Comp Sci & Informat Engn, Sch Artificial Intelligence, Key Lab Knowledge Engn Big Data HFUT, Minist Educ, Hefei 230601, Peoples R China; Hefei Univ Technol HFUT, Intelligent Interconnected Syst Lab Anhui Prov, Hefei 230601, Peoples R China; Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230601, Peoples R China)

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize video segments containing an AVE and identify its category. It is pivotal to learn the discriminative features for each video segment. Unlike existing work focusing on audio-visual feature fusion, in this paper, we propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning. The contribution of CPSP is to introduce the available full or weak label as a prior that constructs the exact positive-negative samples for contrastive learning. Specifically, the CPSP involves comprehensive contrastive constraints: pair-level positive sample propagation (PSP), segment-level and video-level positive sample activation (PSAS and PSAV). Three new contrastive objectives are proposed (i.e., L-avpsp, L-spsa, and L-vpsa) and introduced into both the fully and weakly supervised AVE localization. To draw a complete picture of the contrastive learning in AVE localization, we also study the self-supervised positive sample propagation (SSPSP). As a result, CPSP is more helpful to obtain the refined audio-visual features that are distinguishable from the negatives, thus benefiting the classifier prediction. Extensive experiments on the AVE and the newly collected VGGSound-AVEL100k datasets verify the effectiveness and generalization ability of our method.

Keywords: Visualization; Task analysis; Image segmentation; Synchronization; Roads; Aggregates; Representation learning; audio-visual event; audio-visual learning; contrastive learning; positive sample propagation

Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, Vol. 132, No. 11, pp. 5308-5329

Authors: Zhou, Jinxing; Guo, Dan; Zhong, Yiran; Wang, Meng (Hefei Univ Technol, Hefei, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China; Hefei Comprehens Natl Sci Ctr, Hefei, Peoples R China; Anhui Zhonghuitong Technol Co Ltd, Hefei, Peoples R China)

The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. It often performs in a weakly-supervised manner, where only video event labels are provided, i.e., the modalities and the timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video event labels for each modality. However, the labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that can explicitly assign labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit the large-scale pretrained models, namely CLIP and CLAP, to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function to exploit these pseudo labels by taking into account their category-richness and segment-richness. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. We perform extensive experiments on the LLP dataset and demonstrate the effectiveness of each proposed design and we achieve state-of-the-art video parsing performance on all types of event parsing, i.e., audio event, visual event, and audio-visual event. Furthermore, our experiments verify that the high-quality segment-level pseudo labels provided by our method can be flexibly combined with other audio-visual video parsing backbones and consistently improve their performances. We also examine the proposed pseudo label generation strategy on a relevant weakly-supervised audio-visual event localization task and the experimental results again verify

Keywords: audio-visual video parsing; audio-visual event localization; Pseudo labeling; Label denoising; audio-visual learning
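
Segment-wise pseudo labeling, as outlined in the abstract above, can be illustrated with a small sketch. It does not call CLIP or CLAP and is not the authors' procedure: the per-segment scores are made-up stand-ins for similarity scores from a pretrained model, and the fixed `threshold` and the masking by video-level labels are assumptions chosen for illustration. The function name `segment_pseudo_labels` is hypothetical.

```python
def segment_pseudo_labels(scores, video_labels, threshold=0.5):
    """Assign event labels to each video segment.

    scores:       list over segments; each item maps event name -> score in [0, 1]
                  (stand-in for similarity scores from a pretrained model)
    video_labels: set of event names known to occur somewhere in the video
    Returns a list over segments of sets of event names.
    """
    labels = []
    for seg_scores in scores:
        # Keep an event only if it is a known video-level label and scores high enough.
        labels.append({event for event, s in seg_scores.items()
                       if event in video_labels and s >= threshold})
    return labels

# Three one-second segments with made-up scores for three candidate events.
scores = [
    {"dog": 0.9, "speech": 0.2, "car": 0.7},
    {"dog": 0.6, "speech": 0.8, "car": 0.1},
    {"dog": 0.1, "speech": 0.9, "car": 0.6},
]
pseudo = segment_pseudo_labels(scores, video_labels={"dog", "speech"})
```

Here "car" is never assigned even where it scores highly, because it is not among the video-level labels; the remaining labels now carry the per-segment timing that the weak labels lack.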
