Search Results - Inner Mongolia University Library

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Zeng, Xiao-Min; Song, Yan; Zhuo, Zhu; Zhou, Yu; Li, Yu-Hong; Xue, Hui; Dai, Li-Rong; McLoughlin, Ian. Affiliations: Alibaba Group, China; University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China; Singapore Institute of Technology, ICT Cluster, Singapore

ISBN: (print) 9781728163277

In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE, together with the original normal samples, is used for supervised contrastive representation learning in a multi-task framework. Besides the cross-entropy loss between classes, a contrastive loss is used to separate the PAE output from the original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism in the PAE model. Furthermore, GeCo combines generative and contrastive learning, from which we aim to yield more effective and informative representations than existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods. © 2023 IEEE.

Keywords: anomalous sound detection; contrastive learning; predictive autoencoder; representation learning


DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 1337-1348

Authors: Qijie Shao; Linhao Dong; Kun Wei; Sining Sun; Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP), School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China; Bytedance Speech, Beijing Bytedance Technology Company Ltd., Beijing, China

Data2vec is a self-supervised learning (SSL) approach that employs a teacher-student architecture for contextual representation learning via masked prediction, demonstrating remarkable performance in monolingual ASR. Previous studies have revealed that data2vec's shallow layers capture speaker and language information, middle layers encode phoneme and word features, while deep layers are responsible for reconstruction. Language and phoneme features are crucial for multilingual ASR. However, data2vec's masked representation generation relies on multi-layer averaging, inevitably coupling these features. To address this limitation, we propose a decoupling quantization based data2vec (DQ-Data2vec) for multilingual ASR, which includes a data2vec backbone and two improved online K-means quantizers. Our core idea is using the K-means quantizer with specified cluster numbers to decouple language and phoneme information for masked prediction. Specifically, in the language quantization, considering that the number of languages is significantly different from other irrelevant features (e.g., speakers), we assign the cluster number to match the number of languages, explicitly decoupling shallow layers' language-related information from irrelevant features. This strategy is also applied to decoupling the middle layers' phoneme and word features. In a self-supervised scenario, experiments on the CommonVoice dataset demonstrate that DQ-Data2vec achieves a relative reduction of 9.51% in phoneme error rate (PER) and 11.58% in word error rate (WER) compared to data2vec and UniData2vec. Moreover, in a weakly-supervised scenario incorporating language labels and high-resource language text labels, the relative reduction is 18.09% and 1.55%, respectively.

Keywords: Multilingual; Quantization (signal); Training; Feature extraction; Representation learning; Data mining; Error analysis; Vectors; Transformers; Sun
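The abstract above describes online K-means quantizers whose cluster count is fixed in advance (for example, to the number of languages). The following is only a minimal sketch of that general idea, not the authors' implementation: the class name `OnlineKMeans`, the running-mean update rule, and the toy 2-D vectors are all assumptions made for illustration.

```python
class OnlineKMeans:
    """Toy online K-means quantizer with a fixed, user-chosen number of clusters.

    Hypothetical sketch: the cluster count equals len(init_centroids), and each
    centroid is kept as the running mean of the vectors assigned to it so far.
    """

    def __init__(self, init_centroids):
        self.centroids = [list(c) for c in init_centroids]
        self.counts = [0] * len(self.centroids)

    def assign(self, vec):
        """Return the index of the nearest centroid (squared Euclidean distance)."""
        dists = [
            sum((c - v) ** 2 for c, v in zip(centroid, vec))
            for centroid in self.centroids
        ]
        return dists.index(min(dists))

    def update(self, vec):
        """Assign vec to a cluster, then move that centroid toward vec."""
        k = self.assign(vec)
        self.counts[k] += 1
        step = 1.0 / self.counts[k]  # running-mean step size
        self.centroids[k] = [
            c + step * (v - c) for c, v in zip(self.centroids[k], vec)
        ]
        return k


if __name__ == "__main__":
    # Two clusters, standing in for "two languages" in this toy example.
    quantizer = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
    for frame in ([1.0, 1.0], [9.0, 9.0], [3.0, 3.0]):
        print(frame, "->", quantizer.update(frame))
```

Fixing the number of centroids up front is what makes the cluster index usable as a discrete target; how the paper chooses, initializes, and trains its quantizers is not specified in the abstract.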


Vec-Tok Speech: Speech Vectorization and Tokenization for Neural Speech Generation

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 1243-1254

Authors: Xinfa Zhu; Yuanjun Lv; Yi Lei; Tao Li; Wendi He; Hongbin Zhou; Heng Lu; Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Ximalaya Inc., Shanghai, China

Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-quality texts and images in various tasks. While current speech LMs have made significant progress, there are still challenges to overcome in terms of achieving optimal speech quality and broad task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50,000 hours of speech, performs better than other SOTA models.

Keywords: Speech coding; Codecs; Vectors; Semantics; Linguistics; Speech processing; Bit rate; Speech; Decoding; Acoustics
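The abstract above mentions Byte-Pair Encoding (BPE) as a way to shorten token sequences. As a rough illustration of how BPE merging shortens a sequence of discrete token IDs, here is a minimal sketch; the function names, the integer token IDs, and the choice of `first_new_id` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id, left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe_compress(tokens, num_merges, first_new_id):
    """Apply up to num_merges BPE merges; return the shortened sequence and merge table."""
    tokens = list(tokens)
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        tokens = merge_pair(tokens, pair, new_id)
    return tokens, merges


if __name__ == "__main__":
    semantic_tokens = [5, 5, 7, 5, 5, 7, 5, 5]  # made-up token IDs
    shorter, table = bpe_compress(semantic_tokens, num_merges=1, first_new_id=100)
    print(shorter)  # [100, 7, 100, 7, 100]
    print(table)    # {(5, 5): 100}
```

Each merge trades a larger vocabulary for a shorter sequence, which is the mechanism behind the reduced token length the abstract refers to.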


RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation

arXiv, 2025

Authors: Yan, Shi-Qi; Ling, Zhen-Hua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce the Retrieval Preference Optimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, solving the problem that previous methods necessitate the additional procedure to assess the retrieval quality. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization. Copyright © 2025, The Authors. All rights reserved.

Keywords: Content based retrieval


Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency

arXiv, 2023

Authors: Shen, Lingfeng; Tan, Weiting; Zheng, Boyuan; Khashabi, Daniel. Affiliations: Center for Language and Speech Processing and Computer Science Department, Johns Hopkins University, Baltimore, MD, United States

With the growing capabilities of large language models, prompting them has become the dominant way to access them. This has motivated the development of strategies for automatically selecting effective language prompts. In this paper, we introduce PFLAT (prompt flatness), a new metric to quantify the expected utility of a language prompt. This metric is inspired by flatness regularization in statistical learning that quantifies the robustness of the model towards its parameter perturbations. We provide theoretical foundations for this metric and its relationship with other prompt selection metrics, providing a comprehensive understanding of existing methods. Empirically, we show that combining PFLAT with existing metrics improves both performance and sample efficiency. Our metric outperforms the previous prompt selection metrics with an average increase of 10% in Pearson correlation across 6 classification benchmarks, and the prompt selected by our metric gains 5% higher accuracy than previous metrics across the benchmarks. © 2023, CC BY.

Keywords: Efficiency
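The abstract above describes flatness as the robustness of a model to perturbations of its parameters. The sketch below illustrates only that generic notion, not the PFLAT metric itself, whose exact definition is not given in the abstract. The function name `perturbation_sensitivity`, the Gaussian noise model, and the toy quadratic losses are all assumptions.

```python
import random


def perturbation_sensitivity(loss_fn, params, sigma=0.01, num_samples=100, seed=0):
    """Average absolute change in loss under small Gaussian parameter noise.

    Lower values indicate a flatter loss around params. Hypothetical helper for
    illustration; not the metric defined in the paper.
    """
    rng = random.Random(seed)
    base = loss_fn(params)
    total = 0.0
    for _ in range(num_samples):
        perturbed = [p + rng.gauss(0.0, sigma) for p in params]
        total += abs(loss_fn(perturbed) - base)
    return total / num_samples


if __name__ == "__main__":
    # Two toy losses with the same minimum but different curvature.
    flat_loss = lambda p: sum(0.1 * x * x for x in p)
    sharp_loss = lambda p: sum(100.0 * x * x for x in p)
    theta = [0.0, 0.0]
    print("flat :", perturbation_sensitivity(flat_loss, theta))
    print("sharp:", perturbation_sensitivity(sharp_loss, theta))
```

With the same seed both calls see identical perturbations, so the sharper loss always reports the larger sensitivity in this toy setting.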


ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

arXiv, 2024

Authors: Zheng, Rui-Chen; Du, Hui-Peng; Jiang, Xiao-Hang; Ai, Yang; Ling, Zhen-Hua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China

Current neural audio codecs typically use residual vector quantization (RVQ) to discretize speech signals. However, they often experience codebook collapse, which reduces the effective codebook size and leads to suboptimal performance. To address this problem, we introduce ERVQ, Enhanced Residual Vector Quantization, a novel enhancement strategy for the RVQ framework in neural audio codecs. ERVQ mitigates codebook collapse and boosts codec performance through both intra- and inter-codebook optimization. Intra-codebook optimization incorporates an online clustering strategy and a code balancing loss to ensure balanced and efficient codebook utilization. Inter-codebook optimization improves the diversity of quantized features by minimizing the similarity between successive quantizations. Our experiments show that ERVQ significantly enhances audio codec performance across different models, sampling rates, and bitrates, achieving superior quality and generalization capabilities. It also achieves 100% codebook utilization on one of the most advanced neural audio codecs. Further experiments indicate that audio codecs improved by the ERVQ strategy can improve unified speech-and-text large language models (LLMs). Specifically, there is a notable improvement in the naturalness of generated speech in downstream zero-shot text-to-speech tasks. Audio samples are available here. Copyright © 2024, The Authors. All rights reserved.

Keywords: Vector quantization
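For readers unfamiliar with the baseline that ERVQ builds on, the following is a minimal sketch of plain residual vector quantization (RVQ) plus a simple codebook-utilization measure. It does not implement ERVQ's intra- or inter-codebook optimization. The function names and the tiny hand-written codebooks are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(codebooks, vec):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def utilization(codebook_size, used_indices):
    """Fraction of codewords selected at least once; low values suggest codebook collapse."""
    return len(set(used_indices)) / codebook_size


if __name__ == "__main__":
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],      # stage 1: coarse codewords
        [[0.0, 0.0], [0.25, -0.25]],   # stage 2: refines the stage-1 residual
    ]
    codes = rvq_encode(codebooks, [1.2, 0.8])
    print(codes)                          # [1, 1]
    print(rvq_decode(codebooks, codes))   # [1.25, 0.75]
    print(utilization(4, [0, 0, 2]))      # 0.5
```

The utilization figure is the quantity the abstract reports reaching 100%: when only a few codewords are ever selected, the effective codebook size shrinks.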


SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features

International Symposium on Chinese Spoken Language Processing

Authors: Yu-Fei Shi; Yang Ai; Ye-Xin Lu; Hui-Peng Du; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei

ISBN: (digital) 9798331516826

ISBN: (print) 9798331516833

Assessing the naturalness of speech using mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum of speech as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations from speech for MOS prediction. These methods utilized limited aspects of speech information for MOS prediction, resulting in restricted prediction accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages both Semantic and Acoustic information of the speech to be assessed. Specifically, the proposed SAMOS leverages a pretrained wav2vec2 to extract semantic representations and uses the feature extractor of a pretrained BiVocoder to extract acoustic features. These two types of features are then fed into the prediction network, which includes multitask heads and an aggregation layer, to obtain the final MOS score. Experimental results demonstrate that the proposed SAMOS outperforms current state-of-the-art MOS prediction models on the BVCC dataset and achieves comparable performance on the BC2019 dataset, according to the results of system-level evaluation metrics.

Keywords: Measurement; Training; Accuracy; Semantics; Predictive models; Feature extraction; Acoustics; Speech synthesis


APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

International Symposium on Chinese Spoken Language Processing

Authors: Hui-Peng Du; Yang Ai; Rui-Chen Zheng; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei

ISBN: (digital) 9798331516826

ISBN: (print) 9798331516833

This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. The APCodec+ takes the audio amplitude and phase spectra as the coding object, and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that our proposed APCodec+ at low bitrates achieves comparable performance with baseline codecs at higher bitrates, thanks to the proposed staged training paradigm.

Keywords: Training; Codecs; Quantization (signal); Bit rate; Training data; Encoding; Decoding
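The abstract above says the codec operates on amplitude and phase spectra. As a small illustration of what those two quantities are, the sketch below splits one frame of samples into amplitude and phase with a naive discrete Fourier transform and then recombines them. The single-frame DFT, the function names, and the four-sample test frame are assumptions for illustration; the abstract does not specify the paper's analysis settings.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of a list of real samples."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def amplitude_phase(frame):
    """Split the spectrum of a frame into amplitude and phase (radians) per bin."""
    spectrum = dft(frame)
    return [abs(z) for z in spectrum], [cmath.phase(z) for z in spectrum]


def reconstruct(amplitude, phase):
    """Rebuild the time-domain frame from amplitude and phase via the inverse DFT."""
    n = len(amplitude)
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [
        (sum(z * cmath.exp(2j * math.pi * k * t / n) for k, z in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


if __name__ == "__main__":
    frame = [0.0, 1.0, 0.0, -1.0]  # one period of a sampled sine wave
    amp, ph = amplitude_phase(frame)
    print([round(a, 6) for a in amp])
    print([round(x, 6) for x in reconstruct(amp, ph)])
```

Amplitude and phase together carry the full signal, so a codec that encodes both can in principle reconstruct the waveform without a separate phase-recovery step.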


DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qing Wang; Jixun Yao; Zhaokai Sun; Pengcheng Guo; Lei Xie; John H.L. Hansen. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xian, China; Center for Robust Speech Systems (CRSS), The University of Texas, Dallas, USA

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Being a form of biometric identification, the security of the speaker identification (SID) system is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are challenging for humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach, that exploits the capability of a diffusion-based voice conversion (DiffVC) model to generate adversarial fake audio with distinct target speaker attribution. By introducing adversarial constraints into the diffusion-based voice conversion model’s generative process, we aim to craft fake samples that effectively mislead target models while preserving the speaker-wised characteristics. Specifically, inspired by the utilization of randomly sampled Gaussian noise in conventional adversarial attack and diffusion processes, we incorporate adversarial constraints into the reverse diffusion process. As a result, these adversarial constraints subtly guide the reverse diffusion process toward aligning with the target speaker distribution. Our experiments on the LibriTTS dataset indicate that our proposed DiffAttack significantly improves the attack success rate compared to vanilla DiffVC or other methods. Furthermore, objective and subjective evaluations demonstrate that introducing adversarial constraints does not compromise the speech quality generated by the DiffVC model.

Keywords: Gaussian noise; Diffusion processes; Signal processing; Biometric identification; Robustness; Acoustics; Timbre; Security; Speech processing


CSDNet: cross-sketch with dual gated attention for fine-grained image captioning network

Multimedia Tools and Applications, 2024, pp. 1-28

Authors: Hossain, Md. Shamim; Aktar, Shamima; Hossen, Md. Bipul; Hossain, Mohammad Alamgir; Gu, Naijie; Huang, Zhangjin. Affiliations: School of Computer Science and Technology, University of Science and Technology of China, Anhui, Hefei 230027, China; Deqing Alpha Innovation Institute, Huzhou 313299, China; Department of Mathematics, Jashore University of Science and Technology, Jashore 7408, Bangladesh; Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh; National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Anhui, Hefei 230027, China

In the realm of extracting inter- and intra-modal interactions, contemporary models often face challenges such as reduced computational efficiency, particularly when dealing with lengthy visual sequences. To address these issues, this study introduces an innovative model, the Cross-Sketch with Dual Gated Attention Network (CSDNet), designed to handle second-order intra- and inter-modal interactions by integrating a couple of attention modules. Leveraging bilinear pooling to effectively capture these second-order interactions typically requires substantial computational resources due to the processing of large-dimensional tensors. Due to these resource demands, the first module, Cross-Sketch Attention (CSA), is proposed, which employs Cross-Tensor Sketch Pooling on attention features to reduce dimensionality while preserving crucial information without sacrificing caption quality. Furthermore, to enhance captions, another novel attention module, Dual Gated Attention (DGA), is integrated, which contributes additional spatial and channel-wise attention distributions to improve caption generation performance. Our method demonstrates significant computational efficiency improvements, reducing computation time per epoch by an average of 13.54% compared to the base model, which leads to expedited convergence and improved performance metrics. Additionally, we observe a 0.07% enhancement in the METEOR score compared to the base model. Through the application of reinforcement learning optimization, our model achieves a remarkable CIDEr-D score of 132.2% on the MS-COCO dataset. This consistently outperforms baseline performance across a comprehensive range of evaluation metrics. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.

Keywords: Tensors
