Search Results - Inner Mongolia University Library

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Han-Jie Guo, Hui-Peng Du, Zheng-Yan Sheng, Li-Ping Chen, Yang Ai, Zhen-Hua Ling (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China)

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Cross-lingual voice conversion (XVC) is a technology that modifies speaker identity while preserving linguistic content in scenarios where the source and target speakers use different languages. Previous non-parallel disentanglement-based methods face severe training-testing inconsistency issues in XVC tasks due to language mismatch and the lack of multilingual parallel data, which inevitably compromise the quality of the synthesized speech. In this paper, we propose CASC-XVC, a zero-shot XVC method incorporating content accordant (CA) and speaker contrastive (SC) losses. Specifically, this method adopts the framework of FreeVC-s as the backbone. We design a cross-lingual fine-tuning process employing pairs of utterances from speakers in different languages to update the modules used in the inference stage. A CA loss and an SC loss are introduced to deal with the lack of true parallel targets in the fine-tuning process. Moreover, we use shared self-supervised learning (SSL) representations across different languages along with information perturbation for content disentanglement. Both subjective and objective results on a bilingual (English and Chinese) dataset demonstrate that our approach achieves significant improvements in XVC tasks.

Keywords: Perturbation methods; Self-supervised learning; Signal processing; Acoustics; Multilingual; Speech processing; Faces

Recursive Feature Learning from Pre-Trained Models for Spoofing Speech Detection

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Yu Guan, Yang Ai, Zuoliang Li, Shengyu Peng, Wu Guo (National Engineering Research Center for Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China)

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

It was recently revealed that using features extracted from pre-trained models can achieve much better performance than using conventional hand-crafted acoustic features for spoofing speech detection. In this paper, we therefore enhance the features from a pre-trained model based on recursive learning. Specifically, we modify the pre-trained model by feeding the features from the topmost transformer layer to the bottom layers recursively, and the recursive features obtained from the bottom layers are fused with those from the topmost layer. The fused features are then fed into the backend classifiers. Experiments are carried out on two benchmark datasets (i.e., ASVspoof 2019 LA and ASVspoof 2021 LA), which show the superiority of the proposed method over state-of-the-art systems.

Keywords: Voice activity detection; Representation learning; Linear regression; Signal processing; Benchmark testing; Feature extraction; Transformers; Excavation; Acoustics

A Machine Learning Approach for MIDI to Guitar Tablature Conversion

19th Sound and Music Computing Conference, SMC 2022

Authors: Kaliakatsos-Papakostas, Maximos; Bastas, Grigoris; Makris, Dimos; Herremans, Dorrien; Katsouros, Vassilis; Maragos, Petros (Institute for Language and Speech Processing, Athena R.C., Athens, Greece; School of Electrical and Computer Engineering, NTUA, Athens, Greece; Department of Computer Science and Design Pillar, SUTD, Singapore)

ISBN (print): 9782958412609

Guitar tablature transcription consists in deducing the string and the fret number on which each note should be played to reproduce the actual musical part. This assignment should lead to playable string-fret combinations throughout the entire track and, in general, preserve parsimonious motion between successive combinations. Throughout the history of guitar playing, specific chord fingerings have been developed across different musical styles that facilitate common idiomatic voicing combinations and motion between them. This paper presents a method for assigning guitar tablature notation to a given MIDI-based musical part (possibly consisting of multiple polyphonic tracks), i.e. no information about guitar-idiomatic expressional characteristics is involved (e.g. bending). The current strategy is based on machine learning and requires a basic assumption about how far fingers can stretch on a fretboard; only standard 6-string guitar tuning is examined. The proposed method also examines the transcription of music pieces that were not meant to be played or could not possibly be played by a guitar (e.g. potentially a symphonic orchestra part), employing a rudimentary method for augmenting musical information and training/testing the system with artificial data. The results present interesting aspects about what the system can achieve when trained on the initial and augmented dataset, showing that the training with augmented data improves the performance even in simple, e.g. monophonic, cases. Results also indicate weaknesses and lead to useful conclusions about possible improvements. Copyright: © 2022 First author et al.

Keywords: Machine learning
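
The string-fret assignment problem described in this abstract can be illustrated with a small sketch. It is not the paper's machine-learning method; it only enumerates the candidate positions such a method would have to choose among. The open-string MIDI pitches are those of standard 6-string tuning; the fret count (`max_fret=20`) and the stretch limit (`max_span=4`) are illustrative assumptions, as are the function names.

```python
# Open-string MIDI pitches in standard tuning, string 6 (low E) to string 1 (high E).
STANDARD_TUNING = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}


def candidate_positions(midi_pitch, max_fret=20):
    """Return every (string, fret) pair that produces the given MIDI pitch.

    max_fret is an assumed fretboard length, not a value from the paper.
    """
    positions = []
    for string, open_pitch in STANDARD_TUNING.items():
        fret = midi_pitch - open_pitch
        if 0 <= fret <= max_fret:
            positions.append((string, fret))
    return positions


def span_ok(positions, max_span=4):
    """Check an assumed finger-stretch limit over the fretted (non-open) notes."""
    fretted = [fret for _, fret in positions if fret > 0]
    return not fretted or max(fretted) - min(fretted) <= max_span


if __name__ == "__main__":
    # E4 (MIDI 64) can be played on five different strings within 20 frets.
    print(candidate_positions(64))
```

A single pitch usually has several candidate positions, which is why the assignment needs a model or search procedure to pick playable combinations with little hand motion.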

A Fresh Review on Chinese Pronunciation Acquisition: Insights and Recommendations for L2 Foreign Children

International Symposium on Chinese Spoken Language Processing

Authors: Mewlude Nijat, Dong Wang, Askar Hamdulla (School of Computer Science and Technology, Xinjiang University; Center for Speech and Language Technologies, BNRist, Tsinghua University)

ISBN (digital): 9798331516826

ISBN (print): 9798331516833

This review paper offers a brief summary of recent research on Chinese pronunciation acquisition, with a particular focus on children learning Chinese as a second language (L2). After a concise introduction to the Chinese pronunciation system, the paper reviews studies on native children's pronunciation development. The primary emphasis is on the unique challenges encountered by L2 learners, particularly at the initial stages of language acquisition. Drawing from these findings, the paper presents targeted recommendations designed to enhance effective pronunciation learning for young L2 learners of Chinese.

Keywords: Reviews; Generative AI

BS-PLCNet: Band-Split Packet Loss Concealment Network with Multi-Task Learning Framework and Multi-Discriminators

IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

Authors: Zihan Zhang, Jiayao Sun, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie (Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; ByteDance, China)

ISBN (digital): 9798350374513

ISBN (print): 9798350374520

Packet loss is a common and unavoidable problem in voice over Internet Protocol (VoIP) systems. To deal with the problem, we propose a band-split packet loss concealment network (BS-PLCNet). Specifically, we split the full-band signal into wide-band (0-8 kHz) and high-band (8-24 kHz). The wide-band signals are processed by a gated convolutional recurrent network (GCRN), while the high-band counterpart is processed by a simple GRU network. To ensure high speech quality and automatic speech recognition (ASR) compatibility, a multi-task learning (MTL) framework including fundamental frequency (f0) prediction, linguistic awareness, and multi-discriminators is used. The proposed approach tied for 1st place in the ICASSP 2024 PLC Challenge.

Keywords: Convolution; Conferences; Packet loss; Logic gates; Linguistics; Multitasking; Acoustics
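
The band split mentioned in this abstract (0-8 kHz and 8-24 kHz) can be made concrete by mapping the two bands onto frequency bins of a short-time spectrum. The sketch below is only an illustration under stated assumptions: a 24 kHz upper edge implies a 48 kHz sampling rate, but the FFT size (`n_fft=960`), the choice to place the 8 kHz bin in the high band, and the function name are hypothetical and not taken from the paper.

```python
def split_bins(sample_rate=48000, n_fft=960, cutoff_hz=8000):
    """Return (wide_band_bins, high_band_bins) as ranges of one-sided FFT bin indices.

    n_fft and the handling of the cutoff bin are assumptions for illustration.
    """
    n_bins = n_fft // 2 + 1            # one-sided spectrum, DC through Nyquist
    hz_per_bin = sample_rate / n_fft   # frequency resolution of each bin
    cutoff_bin = round(cutoff_hz / hz_per_bin)
    return range(0, cutoff_bin), range(cutoff_bin, n_bins)


if __name__ == "__main__":
    wide, high = split_bins()
    print(len(wide), len(high))
```

Under these assumptions each bin covers 50 Hz, so the wide band holds far fewer bins than the high band, even though the abstract assigns it the heavier network.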

A Study of Multi-Scale Feature Learning From Pre-Trained Models on Speaker Verification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Shengyu Peng, Wu Guo, Jie Zhang, Zuoliang Li, Yu Guan, Bin Gu, Yang Ai (National Engineering Research Center for Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China)

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

In this paper, a multi-scale feature fusion paradigm is proposed to fully exploit the power of the pre-trained models for text-independent speaker verification. It contains a front-end feature extractor and an enhanced ECAPA-TDNN backend in a cascade manner. The feature extractor incorporates local representations of the CNN layers as well as the global clues of the Transformer layers of the pre-trained models, which are combined to construct the multi-scale discriminative features. The outputs of the feature extractor are then fed into the back-end model (tailored from ECAPA-TDNN) to obtain the final speaker embedding. Results on VoxCeleb datasets validate the superiority of the proposed method with equal error rates of 0.633% and 0.457% on the official trials of Vox1-O using the base and large pre-trained models, respectively.

Keywords: Representation learning; Error analysis; Signal processing; Feature extraction; Transformers; Acoustics; Speech processing
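
The equal error rate (EER) quoted in this abstract is the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. The following is a minimal sketch of how it can be estimated from trial scores, assuming higher scores mean "same speaker". The function name and the toy scores are illustrative; official evaluations typically use dedicated scoring tools.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Trials scoring >= threshold are accepted. Returns the mean of the false
    acceptance and false rejection rates where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: genuine (target) trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor (non-target) trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.7, 0.6]
    nontargets = [0.65, 0.3, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
```

With these toy scores one impostor trial outranks one genuine trial, so a quarter of each class is misclassified at the crossing point.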

DUALSEP: A LIGHT-WEIGHT DUAL-ENCODER CONVOLUTIONAL RECURRENT NETWORK FOR REAL-TIME IN-CAR SPEECH SEPARATION

arXiv, 2024

Authors: Wang, Ziqian; Sun, Jiayao; Zhang, Zihan; Li, Xingchen; Liu, Jie; Xie, Lei (Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Huawei Cloud)

Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with a spatial encoder capturing spatial cues and a spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83M parameters and a real-time factor (RTF) of 0.39 on an Intel Core i7 (2.6 GHz) CPU, it effectively separates speech into distinct speech zones. Our demos are available at https://***/DualSep/. © 2024, CC BY-NC-ND.

Keywords: Microphone array
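
The real-time factor (RTF) reported in this abstract is the ratio of processing time to the duration of the audio processed, so a value below 1 means the system runs faster than real time. A minimal sketch of the calculation follows. The helper names and the `process` callable are hypothetical stand-ins, not part of the DualSep system.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio that was processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process, samples, sample_rate):
    """Time a hypothetical `process(samples)` callable and return its RTF."""
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # At an RTF of 0.39, ten seconds of audio take about 3.9 seconds to process.
    print(real_time_factor(3.9, 10.0))
```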

KhmerFormer: Multi-Scale CNNs-Transformer with External Attention for Ancient Khmer Palm Leaf Isolated Glyph Classification

Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)

Authors: Nimol Thuon, Jun Du (National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China, Hefei, China)

ISBN (digital): 9798350367331

ISBN (print): 9798350367348

Ancient Khmer palm leaf manuscripts are invaluable cultural artifacts in Southeast Asia, especially in Cambodia. The preservation and study of these manuscripts are hindered by their complex glyph structures and the scarcity of resources for ancient languages. This paper introduces KhmerFormer, a hybrid model that combines Multi-Scale Convolutional Neural Networks (MS-CNNs) and Vision Transformers (ViTs) with External Attention (EA) mechanisms to enhance glyph classification tasks. Our approach includes a preprocessing algorithm for Khmer glyph enhancements (IEPalmV2) and leverages multi-scale feature extraction with EfficientNet, followed by integration into ViTs where traditional self-attention is replaced by external attention. This model is tailored to address the unique and intricate challenges presented by ancient Khmer manuscripts. Evaluations on the ICFHR 2018 and newly extracted datasets show significant performance enhancements, particularly in handling low-resource and imbalanced datasets. Our results highlight the potential of hybrid architectures in advancing the analysis of historical documents and providing robust solutions for their preservation.

Keywords: Training; Computer vision; Text analysis; Computational modeling; Asia; Computer architecture; Linguistics; Transformers; Feature extraction; Robustness

A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 2312-2325

Authors: Jie Zhang, Haoyin Yan, Xiaofei Li (National Engineering Research Center for Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China; School of Engineering, Westlake University, Hangzhou, China)

It is promising to design a single model that can suppress various distortions and improve speech quality, i.e., universal speech enhancement (USE). Compared to supervised learning-based predictive methods, diffusion-based generative models have shown greater potential due to the generative capacities from degraded speech with severely damaged information. However, artifacts may be introduced in highly adverse conditions, and diffusion models often suffer from a heavy computational burden due to many steps for inference. In order to jointly leverage the superiority of prediction and generation and overcome the respective defects, in this work we propose a universal speech enhancement model called PGUSE by combining predictive and generative modeling. Our model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative branch optimizes the denoising objective of diffusion models. We utilize the output fusion and truncated diffusion scheme to effectively integrate predictive and generative modeling, where the former directly combines results from both branches and the latter modifies the reverse diffusion process with initial estimates from the predictive branch. Extensive experiments on several datasets verify the superiority of the proposed model over state-of-the-art baselines, demonstrating the complementarity and benefits of combining predictive and generative modeling.

Keywords: Predictive models; Diffusion models; Computational modeling; Speech enhancement; Training; Diffusion processes; Stochastic processes; Standards; Noise reduction; Image reconstruction

MUSA: Multi-Lingual Speaker Anonymization via Serial Disentanglement

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 1664-1674

Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie (Audio Speech and Language Processing Group, School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China; Department of Electronic & Computer Engineering, Hong Kong University of Science and Technology, Hong Kong SAR, China; Kyushu University, Fukuoka, Japan)

Speaker anonymization is an effective privacy protection solution designed to conceal the speaker's identity while preserving the linguistic content and para-linguistic information of the original speech. While most prior studies focus solely on a single language, an ideal speaker anonymization system should be capable of handling multiple languages. This paper proposes MUSA, a MUlti-lingual Speaker Anonymization approach that employs a serial disentanglement strategy to perform a step-by-step disentanglement from a global time-invariant representation to a temporal time-variant representation. By utilizing semantic distillation and self-supervised speaker distillation, the serial disentanglement strategy can avoid strong inductive biases and exhibit superior generalization performance across different languages. Meanwhile, we propose a straightforward anonymization strategy that employs an empty embedding with zero values to simulate the speaker identity concealment process, eliminating the need for conversion to a pseudo-speaker identity and thereby reducing the complexity of the speaker anonymization process. Experimental results on VoicePrivacy official datasets and multi-lingual datasets demonstrate that MUSA can effectively protect speaker privacy while preserving linguistic content and para-linguistic information.

Keywords: Data privacy; Information integrity; Information filtering; Speech processing; Training; Semantics; Multilingual; Vectors; Protection; Privacy
