Search Results - Inner Mongolia University Library

An IoT-enhanced automatic music composition system integrating audio-visual learning with transformer and SketchVAE

ALEXANDRIA ENGINEERING JOURNAL, 2025, vol. 113, pp. 378-390

Authors: Zhang, Yifei. Affiliation: Shanghai Conservatory Mus, Dept Composit & Conducting, Shanghai 200031, Peoples R China

With the rapid development of artificial intelligence and Internet of Things (IoT) technology, automatic music composition systems have become an active research topic. This paper presents the TransVAE-Music composition system to achieve efficient multimodal data perception and fusion. By introducing IoT technology, the system can collect and process audio, video, and other data in real time, improving the diversity and artistry of music generation. In addition, a Bayesian optimization mechanism is used to fine-tune the system's hyperparameters and further improve model performance. Experimental results show that TransVAE-Music achieves reconstruction errors of 1.10 and 1.12 on the POP909 and FMA datasets, respectively, significantly outperforming other mainstream automatic music generation models. The model also reached 4.8 and 4.9 in perceived quality score (PQS), and 4.4 and 4.5 in user satisfaction score (USS), respectively. These results demonstrate that the proposed system has significant advantages in terms of the accuracy of music generation and the user experience. This study not only provides an effective method for automatic music generation, but also offers important references for future studies on multimodal data fusion and high-quality music generation.

Keywords: Automatic music composition; Music generation; Deep learning; audio-visual learning; Internet of things (IoT); Multimodal perception

Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Zeng, Donghuo; Ikeda, Kazushi. Affiliation: KDDI Research Inc., Saitama, Japan

ISBN: (print) 9798350368741

Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments, that is, probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used to automatically generate soft-alignment labels for the remaining audio-visual samples. These soft-alignment labels are used to construct soft cross-modal triplets, which in turn are employed to fine-tune the model's parameters. Experimental results on two audio-visual benchmark datasets demonstrate the effectiveness of our proposed method in the cross-modal retrieval task, achieving state-of-the-art performance with improvements of 2.13% and 1.82% on the AVE and VEGAS datasets, respectively, in terms of average MAP metrics. © 2025 IEEE.

Keywords: audio-visual learning; cross-modal retrieval; self-distillation; Triplet loss
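
The abstract above names a cross-modal triplet loss whose triplets are built from soft-alignment labels. As a rough illustration only, the sketch below shows a standard margin-based triplet loss with an audio anchor and visual positive/negative embeddings. The function names, the Euclidean distance, the margin value, and the `weight` argument (a stand-in for a soft-alignment score) are all assumptions for illustration, not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_modal_triplet_loss(anchor_audio, positive_visual, negative_visual,
                             margin=1.0, weight=1.0):
    """Margin-based triplet loss with an audio anchor and visual samples.

    `weight` is a hypothetical soft-alignment score in [0, 1]; with the
    default of 1.0 this reduces to the ordinary (hard-label) triplet loss.
    """
    d_pos = euclidean(anchor_audio, positive_visual)
    d_neg = euclidean(anchor_audio, negative_visual)
    # The loss is zero once the negative is farther than the positive
    # by at least `margin`.
    return weight * max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    audio = [0.0, 0.0]
    # Positive is farther than the negative, so the triplet is violated.
    print(cross_modal_triplet_loss(audio, [3.0, 4.0], [0.0, 1.0]))
    # The same triplet, down-weighted by a soft-alignment score of 0.5.
    print(cross_modal_triplet_loss(audio, [3.0, 4.0], [0.0, 1.0], weight=0.5))
```

Scaling the hinge term by a per-triplet weight is one simple way to let uncertain alignments contribute less; the paper may combine soft labels with the triplets differently.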

A novel task and methods to evaluate inter-individual variation in audio-visual associative learning

COGNITION, 2024, vol. 242, p. 105658

Authors: Pasqualotto, Angela; Cochrane, Aaron; Bavelier, Daphne; Altarelli, Irene. Affiliations: Univ Geneva, Fac Psychol & Educ Sci FPSE, Geneva, Switzerland; Campus Biotech, Geneva, Switzerland; Univ Paris Cite, LaPsyDE, CNRS, Paris, France; Univ Geneva, Fac Psychol & Educ Sci FPSE, Campus Biotech, Geneva, Switzerland

Learning audio-visual associations is foundational to a number of real-world skills, such as reading acquisition or social communication. Characterizing individual differences in such learning has therefore been of interest to researchers in the field. Here, we present a novel audio-visual associative learning task designed to efficiently capture inter-individual differences in learning, with the added feature of using non-linguistic stimuli, so as to unconfound language and reading proficiency of the learner from their more domain-general learning capability. By fitting trial-by-trial performance in our novel learning task using simple-to-use statistical tools, we demonstrate the expected inter-individual variability in learning rate as well as high precision in its estimation. We further demonstrate that such measured learning rate is linked to working memory performance in Italian-speaking (N = 58) and French-speaking (N = 51) adults. Finally, we investigate the extent to which learning rate in our task, which measures cross-modal audio-visual associations while mitigating familiarity confounds, predicts reading ability across participants with different linguistic backgrounds. The present work thus introduces a novel non-linguistic audio-visual associative learning task that can be used across languages. In doing so, it brings a new tool to researchers in the various domains that rely on multisensory integration, from reading to social cognition or socio-emotional learning.

Keywords: audio-visual learning; Associative learning; learning rate; Cognitive correlates; Working memory; Individual differences

Multi-modal spiking tensor regression network for audio-visual zero-shot learning

NEUROCOMPUTING, 2025, vol. 629

Authors: Yang, Zhe; Li, Wenrui; Hou, Jinxiu; Cheng, Guanghui. Affiliations: Univ Elect Sci & Technol China, Sch Math Sci, Chengdu 611731, Sichuan, Peoples R China; Harbin Inst Technol, Dept Comp Sci & Technol, Harbin 150001, Peoples R China

Recently, convolutional neural networks have received significant attention, particularly in the field of audio-visual zero-shot learning. They can accurately perceive and capture local features, which allows the model to effectively obtain the corresponding attributes. However, the original multilinear structure is disrupted when the tensor is flattened as it passes through the fully connected layers. Inspired by this, we introduce a multi-modal spiking tensor regression network (MSTR). MSTR incorporates tensor regression networks with tensor contractions and spiking neural networks featuring threshold adjustments, thus effectively handling temporal and spatial information. It facilitates fine-grained feature extraction while retaining high-dimensional spatial information. Specifically, we use Spiking Neural Networks (SNN) to encode temporal features, and Tensor Regression Networks (TRN) to encode spatial features. Our proposed Temporal-Spatial-Semantic Fusion block combines temporal, spatial, and semantic features for each modality. Finally, the fused audio and visual features pass through a series of cross-modal transformers, further exploring the inner relationship between the modalities. Experimental results on three benchmark datasets, ActivityNet, VGGSound, and UCF, show that MSTR outperforms state-of-the-art models, with significant improvements in harmonic mean (HM) scores on the three datasets of 6.0%, 6.8%, and 2.2%, respectively. The code and pre-trained models are available at https://***/xia-zhe/MSTR.

Keywords: audio-visual learning; Spiking neural network; Tensor regression network
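
The abstract above reports results as harmonic mean (HM) scores. In generalized zero-shot evaluation, HM is commonly computed from the accuracy on seen classes and the accuracy on unseen classes. The abstract does not spell out its inputs, so the sketch below assumes that common definition; the function name and the sample accuracies are illustrative only.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen-class and unseen-class accuracy.

    Returns 0.0 when both accuracies are zero to avoid dividing by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


if __name__ == "__main__":
    # A model that is strong on seen classes but weak on unseen ones is
    # penalized much more than an arithmetic mean would suggest.
    print(harmonic_mean(0.8, 0.2))
    print(harmonic_mean(0.5, 0.5))
```

The harmonic mean rewards balanced performance: it drops to zero if either accuracy is zero, which is why it is preferred over a plain average in this setting.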

Audio-visual self-supervised representation learning: A survey

NEUROCOMPUTING, 2025, vol. 634

Authors: Alsuwat, Manal; Al-Shareef, Sarah; Alghamdi, Manal. Affiliation: Umm Al Qura Univ, Dept Comp Sci & Artificial Intelligence, Mecca, Saudi Arabia

Artificial intelligence developers leverage the inherent relationships among video, text, and audio to create enhanced representations of the world, mirroring the way humans use multiple senses to understand their environment. As such, multimodal learning, which integrates various data input modalities to augment the learning of intrinsic features, has been gaining traction. While applications in multimodal understanding have made strides with deep learning, they often rely heavily on supervised learning and extensive human annotation. This paper provides a comprehensive review of audio-visual self-supervised learning, a promising alternative that uses vast amounts of unlabeled data. It holds the potential to reshape areas like computer vision and speech recognition. We begin by explaining the concept of audio-visual modalities in machine learning and then move into their role within self-supervised learning by discussing terminology, general pipelines, and underlying motivations. This is followed by an exploration of common pretext tasks in audio-visual self-supervised learning, along with the evaluation methods, datasets, and downstream tasks associated with it. We then highlight prevailing challenges in both audio-visual and self-supervised learning realms. The paper concludes by presenting open challenges and suggesting avenues for future research in this dynamic domain.

Keywords: Multimodal; Self-supervised learning; Deep learning; Pretext tasks; Data representation; audio-visual learning

A Survey of Multimodal Learning: Methods, Applications, and Future

ACM COMPUTING SURVEYS, 2025, vol. 57, no. 7, pp. 1-34

Authors: Yuan, Yuan; Li, Zhaojian; Zhao, Bin. Affiliation: Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian, Peoples R China

The multimodal interplay of the five fundamental senses (sight, hearing, smell, taste, and touch) provides humans with superior environmental perception and learning skills. Adapted from the human perceptual system, multimodal machine learning tries to incorporate different forms of input, such as image, audio, and text, and determine their fundamental connections through joint modeling. Because multimodal machine learning is one of the future development forms of artificial intelligence, it is necessary to summarize its progress. In this article, we start with the form of a multimodal combination and provide a comprehensive survey of the emerging subject of multimodal machine learning, covering representative research approaches, the most recent advancements, and their applications. Specifically, this article analyzes the relationship between different modalities in detail and sorts out the key issues in multimodal research from the application scenarios. In addition, we thoroughly review state-of-the-art methods and datasets covered in multimodal learning research. We then identify the substantial challenges and potential developing directions in this field. Finally, given the comprehensive nature of this survey, both modality-specific and task-specific researchers can benefit from it and advance the field.

Keywords: Multimodal; cross-modal; audio-visual learning; text-visual; touch-visual

Audio-Visual Segmentation with Semantics

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, vol. 133, no. 4, pp. 1644-1664

Authors: Zhou, Jinxing; Shen, Xuyang; Wang, Jianyuan; Zhang, Jiayi; Sun, Weixuan; Zhang, Jing; Birchfield, Stan; Guo, Dan; Kong, Lingpeng; Wang, Meng; Zhong, Yiran. Affiliations: Hefei Univ Technol, Hefei, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China; Univ Oxford, Oxford, England; Beihang Univ, Beijing, Peoples R China; Australian Natl Univ, Canberra, Australia; Nvidia, Santa Clara, CA, USA; Univ Hong Kong, Hong Kong, Peoples R China

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings need to generate binary masks of sounding objects indicating pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://***/OpenNLPLab/AVSBench.

Keywords: audio-visual segmentation; Multi-modal segmentation; audio-visual learning; AVSBench; Semantic segmentation; Video segmentation
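
The first two settings in the abstract above produce binary masks of sounding objects. The abstract does not say which metric is used to score them, so the sketch below simply assumes a widely used one, the Jaccard index (intersection over union), to show how a predicted binary mask can be compared with a pixel-wise annotation. The function name and the tiny 2x2 masks are illustrative.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists.

    Both masks must have the same shape; any truthy pixel counts as
    foreground. Two empty masks are treated as a perfect match (1.0).
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [[1, 1],
                 [0, 0]]
    annotated = [[1, 0],
                 [0, 0]]
    # One shared foreground pixel out of two foreground pixels overall.
    print(mask_iou(predicted, annotated))
```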

STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking

IEEE TRANSACTIONS ON MULTIMEDIA, 2025, vol. 27, pp. 1835-1847

Authors: Li, Yidi; Liu, Hong; Yang, Bing. Affiliations: Taiyuan Univ Technol, Coll Comp Sci & Technol, Taiyuan 030024, Peoples R China; Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Beijing 100871, Peoples R China; Westlake Univ, Westlake Inst Adv Study, Hangzhou 310024, Peoples R China

Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals has not been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features further interacts within the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.

Keywords: visualization; Feature extraction; Acoustics; Acoustic measurements; Target tracking; Location awareness; Direction-of-arrival estimation; Cameras; Robustness; Fuses; audio-visual fusion; speaker tracking; audio-visual learning; cross-modal attention

Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, vol. 133, no. 4, pp. 2136-2157

Authors: Zhang, Yunhua; Doughty, Hazel; Snoek, Cees G. M. Affiliations: Univ Amsterdam, Amsterdam, Netherlands; Leiden Univ, Leiden, Netherlands

This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards lower color contrast at test time. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes easy-to-obtain unlabeled and task-irrelevant dark videos to improve an activity recognizer in low light. As the lower color contrast results in visual information loss, we further propose to incorporate the complementary activity information within audio, which is invariant to illumination. Since the usefulness of audio and visual features differs depending on the amount of illumination, we introduce our 'darkness-adaptive' audio-visual recognizer. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate that our proposals are superior to image enhancement, domain adaptation, and alternative audio-visual fusion methods, and can even improve robustness to local darkness caused by occlusions. Project page: https://***/Day2Dark/.

Keywords: Video analysis; audio-visual learning; Activity recognition

CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, vol. 133, no. 5, pp. 2581-2598

Authors: Jiang, Yuanyuan; Yin, Jianqin. Affiliation: Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Xitucheng Rd 10, Beijing 100876, Peoples R China

While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge and, like most AVQA methods, treated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that uses the pretrained knowledge of the CLIP model by exploiting the natural audio-visual matching characteristic. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, the TSG+ module transfers the image-text matching knowledge from CLIP models to the required region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the image-text matching knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. In addition, we propose a simple yet effective preprocessing strategy to optimize accuracy-efficiency trade-offs. Extensive experiments conducted on the MUSIC-AVQA benchmark verified the effectiveness of our proposed method over existing state-of-the-art methods. The code is available at https://***/Bravo5542/CLIP-TASS.

Keywords: audio-visual question answering; audio-visual learning; Scene understanding; Spatial-temporal reasoning; Pretrained knowledge
