Search results - Inner Mongolia University Library

Audio-Visual Segmentation with Semantics

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, Vol. 133, No. 4, pp. 1644-1664

Authors: Zhou, Jinxing; Shen, Xuyang; Wang, Jianyuan; Zhang, Jiayi; Sun, Weixuan; Zhang, Jing; Birchfield, Stan; Guo, Dan; Kong, Lingpeng; Wang, Meng; Zhong, Yiran

Affiliations: Hefei Univ Technol, Hefei, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China; Univ Oxford, Oxford, England; Beihang Univ, Beijing, Peoples R China; Australian Natl Univ, Canberra, Australia; Nvidia, Santa Clara, CA, USA; Univ Hong Kong, Hong Kong, Peoples R China

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects that indicate the pixels corresponding to the audio, while the third setting further requires generating semantic maps that indicate the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://***/OpenNLPLab/AVSBench.
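
To make the idea of audio-guided pixel-wise interaction more concrete, the following is a minimal single-frame sketch, not the authors' module. It assumes the audio clip has already been encoded into one feature vector and each pixel into a feature vector of the same length; the function names `audio_guided_attention` and `binarize`, the sigmoid scoring, and the 0.5 threshold are all illustrative assumptions. The temporal dimension mentioned in the abstract is left out.

```python
import math


def audio_guided_attention(visual, audio):
    """Score every pixel by its scaled dot product with the audio vector.

    visual: H x W grid of C-dimensional feature vectors (nested lists).
    audio:  one C-dimensional feature vector for the clip.
    Returns an H x W grid of scores in (0, 1) after a sigmoid.
    Hypothetical scoring rule, chosen only for illustration.
    """
    scale = math.sqrt(len(audio))
    scores = []
    for row in visual:
        score_row = []
        for pixel in row:
            z = sum(v * a for v, a in zip(pixel, audio)) / scale
            score_row.append(1.0 / (1.0 + math.exp(-z)))
        scores.append(score_row)
    return scores


def binarize(score_map, threshold=0.5):
    """Turn the score map into a binary mask of 'sounding' pixels."""
    return [[1 if s > threshold else 0 for s in row] for row in score_map]


# Toy 2 x 2 frame with 2-dimensional features (made-up numbers).
visual = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [-1.0, 0.0]],
]
audio = [2.0, 0.0]
mask = binarize(audio_guided_attention(visual, audio))
print(mask)
```

Pixels whose features point in the same direction as the audio vector end up in the mask; orthogonal or opposite features are suppressed. A binary mask of this kind corresponds to the output required by the first two settings.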

Keywords: Audio-visual segmentation; Multi-modal segmentation; Audio-visual learning; AVSBench; Semantic segmentation; Video segmentation

Cross-Modal Cognitive Consensus Guided Audio-Visual Segmentation

IEEE TRANSACTIONS ON MULTIMEDIA, 2025, Vol. 27, pp. 209-223

Authors: Shi, Zhaofeng; Wu, Qingbo; Meng, Fanman; Xu, Linfeng; Li, Hongliang

Affiliations: Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China

Audio-visual segmentation (AVS) aims to extract the sounding object from a video frame, which is represented by a pixel-wise segmentation mask for application scenarios such as multi-modal video editing, augmented reality, and intelligent robot systems. The pioneering work conducts this task through dense feature-level audio-visual interaction, which ignores the dimension gap between different modalities. More specifically, the audio clip can only provide a global semantic label in each sequence, but the video frame covers multiple semantic objects across different local regions, which leads to mislocalization of objects that are representationally similar but semantically different. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) to align the audio-visual semantics from the global dimension and progressively inject them into the local regions via an attention mechanism. Firstly, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence and similarities of modality-agnostic label embeddings. Then, we feed the unified-modal label back to the visual backbone as explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the object of interest. Extensive experiments on the Single Sound Source Segmentation (S4) setting and Multiple Sound Source Segmentation (MS3) setting of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
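
As a rough illustration of how classification confidences and label-embedding similarities could be combined into one consensus label, consider the sketch below. It is not the C3IM module: the scoring rule (product of the two confidences and the cosine similarity of the label embeddings), the function names, and all numbers are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consensus_label(audio_conf, visual_conf, embeddings):
    """Pick the (audio label, visual label) pair with the highest joint score.

    audio_conf / visual_conf: dicts mapping label -> classifier confidence.
    embeddings: dict mapping every label -> embedding vector in a shared space.
    Hypothetical score: audio confidence * visual confidence * cosine similarity.
    """
    best_pair, best_score = None, float("-inf")
    for a_label, a_conf in audio_conf.items():
        for v_label, v_conf in visual_conf.items():
            sim = cosine(embeddings[a_label], embeddings[v_label])
            score = a_conf * v_conf * sim
            if score > best_score:
                best_pair, best_score = (a_label, v_label), score
    return best_pair, best_score


# Made-up confidences and 2-dimensional label embeddings.
audio_conf = {"barking": 0.8, "engine": 0.2}
visual_conf = {"dog": 0.5, "car": 0.5}
embeddings = {
    "barking": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "engine": [0.0, 1.0],
    "car": [0.1, 0.9],
}
pair, score = consensus_label(audio_conf, visual_conf, embeddings)
print(pair)
```

Although the visual classifier is undecided between "dog" and "car", the confident audio label and the embedding similarity settle the pair on the dog, which is the kind of global agreement the abstract describes before any local attention is applied.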

Keywords: Visualization; Semantics; Feature extraction; Object segmentation; Location awareness; Data mining; Feeds; Attention mechanisms; Transformers; Synchronization; Audio-visual segmentation; Cross-modal cognitive consensus; Semantic-level consistency

Transformer-Prompted Network: Efficient Audio-Visual Segmentation via Transformer and Prompt Learning

IEEE SIGNAL PROCESSING LETTERS, 2025, Vol. 32, pp. 516-520

Authors: Wang, Yusen; Qian, Xiaohong; Zhou, Wujie

Affiliations: Zhejiang Univ Sci & Technol, Sch Informat & Elect Engn, Hangzhou 310023, Peoples R China

Audio-visual segmentation (AVS) is a challenging task that focuses on segmenting sound-producing objects within video frames by leveraging audio signals. Existing convolutional neural networks (CNNs) and Transformer-based methods extract features separately from modality-specific encoders and then use fusion modules to integrate the visual and auditory features. We propose an effective Transformer-prompted network, TPNet, which utilizes prompt learning with a Transformer to guide the CNN in addressing AVS tasks. Specifically, during feature encoding, we incorporate a frequency-based prompt-supplement module to fine-tune and enhance the encoded features through frequency-domain methods. Furthermore, during audio-visual fusion, we integrate a self-supplementing cross-fusion module that uses self-attention, two-dimensional selective scanning, and cross-attention mechanisms to merge and enhance audio-visual features effectively. The prompt features undergo the same processing in cross-modal fusion, further refining the fused features to achieve more accurate segmentation results. Finally, we apply self-knowledge distillation to the network, further enhancing the model performance. Extensive experiments on the AVSBench dataset validate the effectiveness of TPNet.

Keywords: Transformers; Feature extraction; Frequency-domain analysis; Europe; Visualization; Location awareness; Convolution; Computer vision; Pattern recognition; Image segmentation; Audio-visual segmentation; Transformer; CNN; Prompt learning; Self-knowledge distillation

Consistency-Queried Transformer for Audio-Visual Segmentation

IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, Vol. 34, pp. 2616-2627

Authors: Lv, Ying; Liu, Zhi; Chang, Xiaojun

Affiliations: Shanghai Univ, Shanghai Inst Adv Commun & Data Sci, Sch Commun & Informat Engn, Joint Int Res Lab Specialty Fiber Opt & Adv Commun, Shanghai 200444, Peoples R China; Shanghai Univ, Wenzhou Inst, Wenzhou 325000, Peoples R China; Univ Technol Sydney, Australian Artificial Intelligence Inst, Fac Engn & Informat Technol, Sydney, NSW 2007, Australia

Audio-visual segmentation (AVS) aims to segment objects in audio-visual content. The effective interaction between audio and visual features has garnered significant attention from the multimodal domain. Despite significant advancements, most existing AVS methods are hampered by multimodal inconsistencies. These inconsistencies primarily manifest as a mismatch between audio and visual information guided by audio cues, wherein visual features often dominate the audio modality. To address this issue, we propose the Consistency-Queried Transformer (CQFormer), a novel framework for AVS tasks that leverages the transformer architecture. This framework features a Consistency Query Generator (CQG) and a Query-Aligned Matching (QAM) module. The Noise Contrastive Estimation (NCE) loss function enhances modality matching and consistency by minimizing the distributional differences between audio and visual features, facilitating effective fusion and interaction between these features. Additionally, introducing the consistency query during the decoding stage enhances consistency constraints and object-level semantic information, further improving the accuracy and stability of audio-visual segmentation. Extensive experiments on the popular benchmark of the audio-visual segmentation dataset demonstrate that the proposed CQFormer achieves state-of-the-art performance.
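
The abstract does not give the exact form of its NCE loss. As a hedged illustration, the sketch below implements the widely used InfoNCE variant, in which the i-th audio feature should match the i-th visual feature and every other visual feature in the batch acts as a negative. The function name `info_nce`, the temperature of 0.1, and the toy features are assumptions, not details from the paper.

```python
import math


def info_nce(audio_feats, visual_feats, temperature=0.1):
    """InfoNCE-style contrastive loss over paired audio/visual features.

    audio_feats[i] and visual_feats[i] form a positive pair; all other
    visual features in the batch serve as negatives for audio_feats[i].
    Features are L2-normalized, so similarities are cosine similarities.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    audio = [normalize(a) for a in audio_feats]
    visual = [normalize(v) for v in visual_feats]

    total = 0.0
    for i, a in enumerate(audio):
        logits = [sum(p * q for p, q in zip(a, v)) / temperature for v in visual]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(audio)


# Aligned pairs give a small loss; swapped pairs give a large one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing a loss of this shape pulls matched audio and visual features together and pushes mismatched ones apart, which is one way to reduce the modality mismatch the abstract describes.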

Keywords: Visualization; Transformers; Decoding; Semantics; Semantic segmentation; Quadrature amplitude modulation; Generators; Accuracy; Training; Correlation; Audio-visual segmentation; Multimodal segmentation; Consistency aligned matching

Bootstrapping Audio-Visual Video Segmentation by Strengthening Audio Cues

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, Vol. 35, No. 3, pp. 2398-2409

Authors: Chen, Tianxiang; Tan, Zhentao; Gong, Tao; Chu, Qi; Wu, Yue; Liu, Bin; Yu, Nenghai; Lu, Le; Ye, Jieping

Affiliations: Univ Sci & Technol China, Sch Cyber Sci & Technol, Hefei 230026, Peoples R China; Chinese Acad Sci, Key Lab Electromagnet Space Informat, Hefei 230022, Peoples R China; Alibaba Grp, Alibaba Cloud, Hangzhou 310024, Peoples R China; Alibaba Grp, DAMO Acad, New York, NY 10014, USA

How to make audio and vision interact effectively has garnered considerable interest within the multi-modality research field. Recently, a novel audio-visual video segmentation (AVS) task has been proposed, aiming to segment the sounding objects in video frames under the guidance of audio cues. However, most existing AVS methods are hindered by a modality imbalance where the visual features tend to dominate those of the audio modality, due to a unidirectional and insufficient integration of audio cues. This imbalance skews the feature representation towards the visual aspect, impeding the learning of joint audio-visual representations and potentially causing segmentation inaccuracies. To address this issue, we propose AVSAC. Our approach features a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing audio cues and fostering continuous interplay between audio and visual modalities. This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations. Additionally, we present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD. This strategy enhances the share of auditory components in visual features, contributing to more balanced audio-visual representation learning. Extensive experiments show that our method has state-of-the-art performance on several AVS public benchmarks.

Keywords: Visualization; Decoding; Circuits and systems; Transformers; Feature extraction; Bridge circuits; Benchmark testing; Representation learning; Computer vision; Attention mechanisms; Audio-visual segmentation; Audio-visual modality imbalance; Audio cue enhancement

Audio-visual segmentation based on robust principal component analysis

EXPERT SYSTEMS WITH APPLICATIONS, 2024, Vol. 256

Authors: Fang, Shun; Zhu, Qile; Wu, Qi; Wu, Shiqian; Xie, Shoulie

Affiliations: Wuhan Univ Sci & Technol, Sch Informat Sci & Engn, Wuhan, Peoples R China; Wuhan Univ Sci & Technol, Inst Robot & Intelligent Syst, Wuhan, Peoples R China; Jiangxi Univ Finance & Econ, Sch Software & Internet things Engn, Nanchang, Peoples R China; Henan Acad Sci, Inst Adv Displays & Imaging, Zhengzhou, Peoples R China; RF & Opt Dept, Inst Infocomm Res, A STAR, Signal Proc, Singapore, Singapore

Audio-visual segmentation (AVS) aims to extract the sounding objects from a video. Current learning-based AVS methods are often supervised, relying on task-specific data annotations and expensive model training. Recognizing that the video background captured by a static camera is represented as a low-rank matrix, we introduce non-convex robust principal component analysis into the AVS task in this paper. This approach is unsupervised and relies only on input data patterns. Specifically, the proposed method decomposes each modality into the sum of two parts: the low-rank part that represents the background audio and visual information, and the sparse part that represents the foreground information. Furthermore, CUR decomposition is employed at each iteration to reduce the computational complexity in optimization. The experimental results also show that the proposed AVS method outperforms the supervised methods on the AVS-Bench Single-Source datasets.
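
The low-rank plus sparse split can be illustrated with a deliberately simplified stand-in. The sketch below does not implement the paper's non-convex RPCA or its CUR acceleration; it swaps in a per-pixel temporal median as a rank-1 background estimate for a static camera and keeps only the large residuals as the sparse foreground. The function name, the threshold, and the toy frames are assumptions.

```python
from statistics import median


def split_background_foreground(frames, threshold=0.5):
    """Split flattened video frames into a static background and sparse residuals.

    frames: list of T frames, each a flat list of pixel intensities.
    The background is the per-pixel median over time, so the implied
    background matrix (the same frame repeated T times) has rank 1.
    Residuals with magnitude <= threshold are zeroed to keep the
    foreground part sparse. This is a simple stand-in, not RPCA.
    """
    n_pixels = len(frames[0])
    background = [median(frame[p] for frame in frames) for p in range(n_pixels)]
    sparse = []
    for frame in frames:
        row = []
        for p in range(n_pixels):
            residual = frame[p] - background[p]
            row.append(residual if abs(residual) > threshold else 0.0)
        sparse.append(row)
    return background, sparse


# Five 4-pixel frames from a static scene; two frames contain a bright object.
frames = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 9.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 8.0, 4.0],
]
background, sparse = split_background_foreground(frames)
print(background)
print(sparse[2], sparse[4])
```

Each frame equals the background plus its sparse row, mirroring the decomposition of a modality into a low-rank background part and a sparse foreground part. A real RPCA solver estimates both parts jointly rather than fixing the background first.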

Keywords: Audio-visual segmentation; Robust principal component analysis; Unsupervised learning

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

31st ACM International Conference on Multimedia (MM)

Authors: Liu, Chen; Li, Peike Patrick; Qi, Xingqun; Zhang, Hu; Li, Lincheng; Wang, Dadong; Yu, Xin

Affiliations: Univ Queensland, Brisbane, Qld, Australia; Univ Technol Sydney, Sydney, NSW, Australia; Matrix Verse, Sydney, NSW, Australia; Netease Fuxi AI Lab, Hangzhou, Peoples R China; CSIRO DATA61, Marsfield, Australia

ISBN: (print) 9798400701085

The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to obtain sounding object masks. However, we observed that prior methods are prone to segmenting a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video with an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network, as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks, and these scores highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish the audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased toward salient objects and also achieves state-of-the-art performance in both the single-source and multi-source scenarios.
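
The step of attending audio category scores to instance masks can be sketched as follows. This is an illustrative simplification rather than the authors' implementation: it assumes every candidate instance carries a single class label, weights its binary mask by the audio score for that class, and merges instances with a pixel-wise maximum. The function name and all values are hypothetical.

```python
def attend_masks(instance_masks, instance_classes, audio_scores):
    """Weight each instance mask by the audio score of its class.

    instance_masks:   list of H x W binary masks (nested lists of 0/1).
    instance_classes: class label of each instance, in the same order.
    audio_scores:     dict mapping class label -> predicted audio score.
    Returns an H x W map; overlapping instances are merged by maximum.
    """
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    attended = [[0.0] * width for _ in range(height)]
    for mask, label in zip(instance_masks, instance_classes):
        weight = audio_scores.get(label, 0.0)  # classes without a score are silenced
        for y in range(height):
            for x in range(width):
                attended[y][x] = max(attended[y][x], weight * mask[y][x])
    return attended


# Two candidate instances in a 2 x 2 frame: a guitar (top) and a person (bottom).
instance_masks = [
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],
]
instance_classes = ["guitar", "person"]
audio_scores = {"guitar": 0.9, "person": 0.1}  # the audio sounds like a guitar

attended = attend_masks(instance_masks, instance_classes, audio_scores)
binary = [[1 if v > 0.5 else 0 for v in row] for row in attended]
print(attended)
print(binary)
```

The silent person is suppressed even though it may be visually salient, which is the behavior the abstract aims for. During training, a map like `attended` would be compared against the ground-truth mask.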

Keywords: Audio-visual segmentation; Sound localization; Semantic-aware sounding objects localization

Audio-Visual Segmentation via Unlabeled Frame Exploitation

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Liu, Jinxiang; Liu, Yikun; Zhang, Fei; Ju, Chen; Zhang, Ya; Wang, Yanfeng

Affiliations: Shanghai Jiao Tong Univ, Cooperat Medianet Innovat Ctr, Shanghai, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China

ISBN: (print) 9798350353006

Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods obtain only marginal performance gains from the unlabeled frames, leading to an underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frames (NFs) and distant frames (DFs). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. In contrast to NFs, DFs have long temporal distances from the labeled frame and share semantically similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as dynamic guidance to improve objectness localization. In addition, we exploit the semantic cues in DFs by treating them as valid augmentations of the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
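
The division of unlabeled frames into neighboring and distant frames can be sketched in a few lines. The abstract does not state how far a frame may be from the labeled one and still count as neighboring, so the `neighbor_radius` parameter below is a hypothetical choice, as is the function name.

```python
def split_unlabeled_frames(num_frames, labeled_index, neighbor_radius=1):
    """Split unlabeled frame indices into neighboring and distant frames.

    A frame counts as neighboring (NF) when its temporal distance to the
    labeled frame is at most neighbor_radius; every other unlabeled frame
    is distant (DF). neighbor_radius is an assumed parameter.
    """
    neighboring, distant = [], []
    for t in range(num_frames):
        if t == labeled_index:
            continue  # the labeled frame itself is not part of either group
        if abs(t - labeled_index) <= neighbor_radius:
            neighboring.append(t)
        else:
            distant.append(t)
    return neighboring, distant


# A 5-frame clip in which only the first frame is labeled.
nfs, dfs = split_unlabeled_frames(num_frames=5, labeled_index=0)
print(nfs, dfs)
```

Under the framework described in the abstract, frames in `nfs` would supply motion cues for localization, while frames in `dfs` would be treated as augmentations of the labeled frame during self-training.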

Keywords: Audio-visual segmentation; Audio-visual understanding; Video understanding

Audio-Visual Segmentation

17th European Conference on Computer Vision (ECCV)

Authors: Zhou, Jinxing; Wang, Jianyuan; Zhang, Jiayi; Sun, Weixuan; Zhang, Jing; Birchfield, Stan; Guo, Dan; Kong, Lingpeng; Wang, Meng; Zhong, Yiran

Affiliations: Hefei Univ Technol, Hefei, Peoples R China; SenseTime Res, Hangzhou, Peoples R China; Australian Natl Univ, Canberra, ACT, Australia; Beihang Univ, Beijing, Peoples R China; NVIDIA, Santa Clara, CA, USA; Univ Hong Kong, Pok Fu Lam, Hong Kong, Peoples R China; Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China

ISBN: (print) 9783031198359; 9783031198366

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a new method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics.

Keywords: Audio-visual segmentation; Benchmarking; AVSBench

Enhance audio-visual segmentation with hierarchical encoder and audio guidance

NEUROCOMPUTING, 2024, Vol. 594

Authors: Guo, Cunhan; Huang, Heyan; Zhou, Yanghao

Affiliations: Univ Chinese Acad Sci, Sch Emergency Management Sci & Engn, 1 Yanqihu East Rd, Beijing 101400, Peoples R China; Beijing Inst Technol, Southeast Acad informat Technol, 1998 Licheng Middle Ave, Putian 351100, Fujian, Peoples R China; Beijing Inst Technol, Sch Comp Sci & Technol, 5 Zhongguancun South St, Beijing 101400, Peoples R China

As one of the pivotal technologies leading towards embodied intelligence, audio-visual segmentation is geared towards achieving precise segmentation of sounding objects, offering vast application prospects in scenarios such as emergency rescue and natural exploration. Nevertheless, the performance of audio-visual segmentation technology encounters limitations stemming from challenges related to the adaptation and fusion of cross-modal information encoding, as well as the decoding and generation of masks. To address these issues, this paper explores the adaptation of multi-modal information based on a shared encoder by employing a neural architecture search method to design a hierarchical encoder cooperation module for enhanced information interaction. An intermediate loss is leveraged to help the encoder preserve spatial knowledge. Furthermore, an audio-guided class-aware decoder is devised to guide the generation of masks. Our approach has yielded competitive experimental results across multiple datasets, thus substantiating its effectiveness.

Keywords: Audio-visual segmentation; Hierarchical encoder; Neural architecture search; Audio guidance
