Author Affiliations: Center for Art & Science and Presentation & Communication, Frontier Institute of Science and Technology, Xi'an Jiaotong University, Xi'an, China; MOE KLINNS Laboratory, Faculty of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an, China; Department of Automation, Center for Intelligent and Networked Systems, Tsinghua University, Beijing, China; Department of AI Music and Music Information Technology, Central Conservatory of Music, Beijing, China
Publication: IEEE Transactions on Audio, Speech and Language Processing
Year/Volume: 2025, Vol. 33
Pages: 613-626
Funding: National Natural Science Foundation of China; Fundamental Research Funds for the Central Universities
Keywords: Instruments; Music; Feature extraction; Accuracy; Speech processing; Timbre; Harmonic analysis; Deep learning; Training; Power harmonic filters
Abstract: Tone quality is of pivotal importance in the auditory perception of musical performance. Depending on the performer and the instrument, tone quality evaluation is subjective and time-consuming, with inherent difficulties stemming from the absence of precise measurement methods. In this study, we develop a novel method for tone quality evaluation that uses an adversarial domain-invariant learning strategy to construct a representation invariant to changes in pitch, volume, and duration. Wide-band Mel frequency cepstral coefficients are employed for pitch-invariant feature extraction, and instance normalization for volume invariance. An adversarially trained time-delay neural network encoder is developed to enhance pitch and duration invariance via random pitch shift and temporal segmentation. Experiments conducted on our curated dataset and the Good-sound dataset show that the new method achieves significant improvements in evaluating tone quality attributed to performers and instruments, yielding increases in classification accuracy of 15.3% and 9.5%, respectively, compared to classical feature-based techniques. Remarkably, the class-wise outcomes exhibit F-score improvements of 33.6% and 9.8% on the respective datasets. Ablation studies on pitch, volume, and duration invariance further underscore the efficacy of our approach. This substantial improvement over existing methods presents a novel perspective on tone quality representation and offers a practical resource for music performance analysis.
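The abstract's volume-invariance step can be illustrated with a minimal sketch. The idea is standard instance normalization applied per utterance: each cepstral channel is normalized to zero mean and unit variance over time, so a uniform gain change (a constant offset in the log/cepstral domain) cancels out. This is an assumption-laden toy illustration, not the authors' implementation; the function name `instance_norm` and the toy feature shapes are invented for the example.

```python
import numpy as np

def instance_norm(feats, eps=1e-8):
    """Per-utterance instance normalization.

    feats: array of shape (n_channels, n_frames), e.g. MFCCs.
    Normalizes each channel to zero mean and unit variance over time.
    """
    mean = feats.mean(axis=1, keepdims=True)
    std = feats.std(axis=1, keepdims=True)
    return (feats - mean) / (std + eps)

# A global volume change becomes an additive constant in the
# log/cepstral domain, which the per-channel mean subtraction removes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(13, 200))   # toy 13-channel cepstral features
louder = feats + 3.0                 # uniform log-domain gain offset
assert np.allclose(instance_norm(feats), instance_norm(louder))
```

In the paper's pipeline this normalization would sit between MFCC extraction and the adversarially trained encoder; the toy example only verifies the invariance property itself.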