Refine Search Results

Document Type

  • 288 journal articles
  • 221 conference papers

Collection Scope

  • 509 electronic documents
  • 0 print holdings

Date Distribution

Discipline Classification

  • 318 engineering
    • 263 computer science and technolog...
    • 224 software engineering
    • 67 information and communication engineering
    • 47 bioengineering
    • 31 control science and engineering
    • 24 electronic science and technology (...
    • 21 electrical engineering
    • 21 chemical engineering and technology
    • 17 optical engineering
    • 16 biomedical engineering (...
    • 9 mechanical engineering
    • 6 mechanics (...
    • 6 civil engineering
    • 5 instrument science and technology
    • 5 materials science and engineering (...
    • 5 power engineering and engineering therm...
  • 211 science
    • 115 physics
    • 67 mathematics
    • 57 biology
    • 20 chemistry
    • 18 statistics (...
    • 6 systems science
    • 4 geology
  • 65 management
    • 45 library, information and archival manag...
    • 21 management science and engineering (...
    • 8 business administration
  • 13 medicine
    • 13 basic medicine (...
    • 12 clinical medicine
    • 10 pharmacy (...
  • 12 law
    • 12 sociology
  • 2 economics
  • 1 education
  • 1 literature

Topics

  • 28 speech recogniti...
  • 26 semantics
  • 23 training
  • 18 signal processin...
  • 14 speech enhanceme...
  • 12 acoustics
  • 12 machine learning
  • 12 embeddings
  • 11 computational li...
  • 11 adaptation model...
  • 10 computational mo...
  • 10 syntactics
  • 10 neural machine t...
  • 9 speech processin...
  • 9 feature extracti...
  • 9 degradation
  • 9 robustness
  • 8 self-supervised ...
  • 8 decoding
  • 7 object detection

Institutions

  • 153 moe key lab of a...
  • 131 department of co...
  • 60 key laboratory o...
  • 53 moe key lab of a...
  • 32 department of co...
  • 28 department of co...
  • 28 x-lance lab depa...
  • 23 suzhou laborator...
  • 22 x-lance lab depa...
  • 16 key lab. of shan...
  • 16 research center ...
  • 15 aispeech co. ltd...
  • 15 ji hua laborator...
  • 15 shanghai jiao to...
  • 10 shanghai jiao to...
  • 10 auditory cogniti...
  • 9 kyoto
  • 8 department of co...
  • 8 aispeech ltd
  • 8 microsoft resear...

Authors

  • 106 yu kai
  • 93 zhao hai
  • 61 chen lu
  • 56 qian yanmin
  • 40 zhang zhuosheng
  • 39 yan junchi
  • 38 yanmin qian
  • 36 chen xie
  • 32 li zuchao
  • 28 wu mengyue
  • 23 zhu su
  • 22 guo yiwei
  • 20 kai yu
  • 19 yang xiaokang
  • 18 chen zhengyang
  • 17 xu hongshen
  • 17 du chenpeng
  • 17 junchi yan
  • 16 cao ruisheng
  • 16 ma ziyang

Language

  • 464 English
  • 45 other
  • 1 Chinese
Search query: "Institution=Dep. of Computer Science and Engineering & MoE Key Lab of AI"
509 records; showing 201–210
CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
arXiv, 2023
Authors: Zhang, Hanchong Li, Jieyu Chen, Lu Cao, Ruisheng Zhang, Yunyan Huang, Yu Zheng, Yefeng Yu, Kai X-LANCE Lab Department of Computer Science and Engineering MoE Key Lab of Artificial Intelligence SJTU AI Institute Shanghai Jiao Tong University Shanghai China Tencent Jarvis Lab Shenzhen China
The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, while the single-domain text-to-SQL task evaluates performance on identical databases. Bo...
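As a minimal toy sketch of the cross-domain setting this entry describes (the schema, question, and SQL below are invented for illustration and are not drawn from the CSS dataset): at evaluation time the parser must emit SQL for a database schema it never saw during training.

```python
import sqlite3

# Hypothetical unseen (medical-style) schema, built in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER, diagnosis TEXT, admit_year INTEGER);
    INSERT INTO patient VALUES (1, 'flu', 2020), (2, 'cold', 2019), (3, 'flu', 2020);
""")

question = "How many patients were admitted in 2020?"
# What a text-to-SQL parser would be expected to output for this question;
# executing it against the database checks the prediction end to end.
predicted_sql = "SELECT COUNT(*) FROM patient WHERE admit_year = 2020"
count = conn.execute(predicted_sql).fetchone()[0]
print(count)  # 2
```

Cross-domain evaluation simply repeats this check over many such held-out schemas, so the parser cannot memorize table or column names from training.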
Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation
arXiv, 2025
Authors: Huang, Wen Gu, Yanmei Wang, Zhiming Zhu, Huijia Qian, Yanmin Auditory Cognition and Computational Acoustics Lab MoE Key Lab of Artificial Intelligence AI Institute Department of Computer Science and Engineering Shanghai Jiao Tong University Shanghai China SJTU Paris Elite Institute of Technology China Ant Group Shanghai China
Advances in speech synthesis technologies, like text-to-speech (TTS) and voice conversion (VC), have made detecting deepfake speech increasingly challenging. Spoofing countermeasures often struggle to generalize effec...
FRONT-END ADAPTER: ADAPTING FRONT-END INPUT OF SPEECH BASED SELF-SUPERVISED LEARNING FOR SPEECH RECOGNITION
arXiv, 2023
Authors: Chen, Xie Ma, Ziyang Tang, Changli Wang, Yujin Zheng, Zhisheng MoE Key Lab of Artificial Intelligence AI Institute X-LANCE Lab Department of Computer Science and Engineering Shanghai Jiao Tong University Shanghai China Department of Electronic Engineering Tsinghua University Beijing China
Recent years have witnessed a boom in self-supervised learning (SSL) in various areas including speech processing. Speech-based SSL models present promising performance in a range of speech-related tasks. However, the...
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
arXiv, 2024
Authors: Ma, Da Chen, Lu Zhang, Situo Miao, Yuxun Zhu, Su Chen, Zhi Xu, Hongshen Li, Hanqi Fan, Shuai Pan, Lei Yu, Kai X-LANCE Lab Department of Computer Science and Engineering MoE Key Lab of Artificial Intelligence SJTU AI Institute Shanghai Jiao Tong University Shanghai China AISpeech Co. Ltd. Suzhou China ByteDance China
The increasing context window size in Large Language Models (LLMs), such as the GPT and LLaMA series, has improved their ability to tackle complex, long-text tasks, but at the cost of inference efficiency, particularl...
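For background on why long contexts strain inference, here is a minimal single-head sketch of generic KV-cache mechanics (this illustrates only the standard caching that the entry above sets out to compress, not the inter-layer similarity method it proposes; dimensions and inputs are invented):

```python
import numpy as np

d = 4                      # toy head dimension
rng = np.random.default_rng(0)
k_cache, v_cache = [], []  # grow by one entry per decoded token

def attend(q, k_new, v_new):
    """Append the new token's key/value, then attend over the whole cache."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)            # (t, d): memory grows linearly with t
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                     # softmax over all cached positions
    return w @ V                     # (d,) context vector

for _ in range(5):                   # decode 5 tokens
    out = attend(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
print(len(k_cache))  # 5: one cached key per generated token
```

Because every layer keeps such a cache for every head, cache size scales with layers × heads × context length, which is the memory cost compression methods target.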
Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Authors: Qi Chen Ziyang Ma Tao Liu Xu Tan Qu Lu Kai Yu Xie Chen Department of Computer Science and Engineering MoE Key Lab of Artificial Intelligence AI Institute X-LANCE Lab Shanghai Jiao Tong University China Microsoft Research Asia Shanghai Media Tech
Audio-driven talking face generation has recently attracted broad interest from academia and industry. However, data acquisition and labeling for audio-driven talking face are labor-intensive and costly. The lack of data resource...
DDTSE: DISCRIMINATIVE DIFFUSION MODEL FOR TARGET SPEECH EXTRACTION
arXiv, 2023
Authors: Zhang, Leying Qian, Yao Yu, Linfeng Wang, Heming Yang, Hemin Liu, Shujie Zhou, Long Qian, Yanmin Auditory Cognition and Computational Acoustics Lab MoE Key Lab of Artificial Intelligence AI Institute Department of Computer Science and Engineering Shanghai Jiao Tong University Shanghai China Microsoft United States
Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, research on target speech extraction under multi-speaker noisy conditions r...
Large Language Models Are Semi-Parametric Reinforcement Learning Agents
arXiv, 2023
Authors: Zhang, Danyang Chen, Lu Zhang, Situo Xu, Hongshen Zhao, Zihan Yu, Kai X-LANCE Lab Department of Computer Science and Engineering MoE Key Lab of Artificial Intelligence SJTU AI Institute Shanghai Jiao Tong University Shanghai China Suzhou Laboratory Suzhou China
Inspired by insights from cognitive science on human memory and reasoning mechanisms, a novel evolvable LLM-based (Large Language Model) agent framework, REMEMBERER, is proposed. By equipping the LLM wi...
TARGET SOUND EXTRACTION WITH VARIABLE CROSS-MODALITY CLUES
arXiv, 2023
Authors: Li, Chenda Qian, Yao Chen, Zhuo Wang, Dongmei Yoshioka, Takuya Liu, Shujie Qian, Yanmin Zeng, Michael MoE Key Lab of Artificial Intelligence AI Institute China X-LANCE Lab Department of Computer Science and Engineering Shanghai Jiao Tong University China Microsoft Redmond WA United States
Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources. It often uses a model cond...
Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing
IEEE Workshop on Automatic Speech Recognition and Understanding
Authors: Wangyou Zhang Lei Yang Yanmin Qian Department of Computer Science and Engineering MoE Key Lab of Artificial Intelligence AI Institute Shanghai Jiao Tong University Shanghai China Samsung Research China – Beijing (SRC-B)
In recent years, target speaker extraction (TSE) has drawn increasing interest as an alternative to speech separation in realistic applications. While time-domain methods have been widely used in recent studies to ach...
ONE-SHOT SENSITIVITY-AWARE MIXED SPARSITY PRUNING FOR LARGE LANGUAGE MODELS
arXiv, 2023
Authors: Shao, Hang Liu, Bei Xiao, Bo Zeng, Ke Wan, Guanglu Qian, Yanmin Auditory Cognition and Computational Acoustics Lab MoE Key Lab of Artificial Intelligence AI Institute Department of Computer Science and Engineering Shanghai Jiao Tong University Shanghai China Meituan Beijing China
Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performances in a wide range of text generation tasks. However, the enormous model sizes have hind...
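As background for this entry, here is a sketch of plain one-shot magnitude pruning at a single fixed sparsity level (the paper proposes a sensitivity-aware *mixed*-sparsity variant; this shows only the generic baseline, with an invented toy weight matrix):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights,
    in one shot (no retraining)."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value over the flattened matrix is the cutoff.
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

w = np.array([[0.9, -0.05],
              [0.02, -1.2]])
pruned = magnitude_prune(w, 0.5)
print(pruned)  # the two smallest-magnitude weights (0.02, -0.05) become 0
```

Mixed-sparsity schemes generalize this by assigning each layer its own sparsity level instead of one global ratio, keeping more weights in layers where pruning hurts accuracy most.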