Search results - Inner Mongolia University Library

arXiv, 2024

Authors: Wang, Ziqian; Sun, Jiayao; Zhang, Zihan; Li, Xingchen; Liu, Jie; Xie, Lei. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Huawei Cloud

Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide spatial prior. We employ dual encoders for dual-branch modeling, with spatial encoder capturing spatial cues and spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83M parameters and 0.39 real-time factor (RTF) on an Intel Core i7 (2.6GHz) CPU, it effectively separates speech into distinct speech zones. Our demos are available at https://***/DualSep/. © 2024, CC BY-NC-ND.

Keywords: Microphone array


A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 2312-2325

Authors: Jie Zhang, Haoyin Yan, Xiaofei Li. Affiliations: National Engineering Research Center for Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China (USTC), Hefei, China; School of Engineering, Westlake University, Hangzhou, China

It is promising to design a single model that can suppress various distortions and improve speech quality, i.e., universal speech enhancement (USE). Compared to supervised learning-based predictive methods, diffusion-based generative models have shown greater potential due to the generative capacities from degraded speech with severely damaged information. However, artifacts may be introduced in highly adverse conditions, and diffusion models often suffer from a heavy computational burden due to many steps for inference. In order to jointly leverage the superiority of prediction and generation and overcome the respective defects, in this work we propose a universal speech enhancement model called PGUSE by combining predictive and generative modeling. Our model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative branch optimizes the denoising objective of diffusion models. We utilize the output fusion and truncated diffusion scheme to effectively integrate predictive and generative modeling, where the former directly combines results from both branches and the latter modifies the reverse diffusion process with initial estimates from the predictive branch. Extensive experiments on several datasets verify the superiority of the proposed model over state-of-the-art baselines, demonstrating the complementarity and benefits of combining predictive and generative modeling.

Keywords: Predictive models; Diffusion models; Computational modeling; Speech enhancement; Training; Diffusion processes; Stochastic processes; Standards; Noise reduction; Image reconstruction


Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation


IEEE Spoken Language Technology Workshop

Authors: Ziqian Wang, Jiayao Sun, Zihan Zhang, Xingchen Li, Jie Liu, Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Huawei Cloud

ISBN (digital): 9798350392258

ISBN (print): 9798350392265

Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide spatial prior. We employ dual encoders for dual-branch modeling, with spatial encoder capturing spatial cues and spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83M parameters and 0.39 real-time factor (RTF) on an Intel Core i7 (2.6 GHz) CPU, it effectively separates speech into distinct speech zones. Our demos are available at https://***/DualSep/.

Keywords: Measurement; Deep learning; Digital signal processing; Real-time systems; Vectors; Microphone arrays; Human vehicle systems; Systems support; Speech processing; Low latency communication
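The abstract above reports a real-time factor (RTF) of 0.39. As a minimal sketch of what that figure means, the snippet below uses the common definition of RTF as processing time divided by audio duration. The function names and the 10-second example are illustrative assumptions, not values taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the usual RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def processing_time_for(audio_seconds: float, rtf: float) -> float:
    """Estimate how long a system with the given RTF needs for a clip."""
    return audio_seconds * rtf


# Hypothetical measurement: 3.9 s of compute for a 10 s clip.
rtf = real_time_factor(processing_seconds=3.9, audio_seconds=10.0)
# An RTF below 1.0 means the system keeps up with incoming audio.
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
```

Under this definition, a system with RTF 0.39 would need roughly 23.4 seconds of compute for one minute of audio on the same hardware.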


KhmerFormer: Multi-Scale CNNs-Transformer with External Attention for Ancient Khmer Palm Leaf Isolated Glyph Classification


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)

Authors: Nimol Thuon, Jun Du. Affiliations: National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP), University of Science and Technology of China, Hefei, China

ISBN (digital): 9798350367331

ISBN (print): 9798350367348

Ancient Khmer palm leaf manuscripts are invaluable cultural artifacts in Southeast Asia, especially in Cambodia. The preservation and study of these manuscripts are hindered by their complex glyph structures and the scarcity of resources for ancient languages. This paper introduces KhmerFormer, a hybrid model that combines Multi-Scale Convolutional Neural Networks (MS-CNNs) and Vision Transformers (ViTs) with External Attention (EA) mechanisms to enhance glyph classification tasks. Our approach includes a preprocessing algorithm for Khmer glyph enhancements (IEPalmV2) and leverages multi-scale feature extraction with EfficientNet, followed by integration into ViTs where traditional self-attention is replaced by external attention. This model is tailored to address the unique and intricate challenges presented by ancient Khmer manuscripts. Evaluations on the ICFHR 2018 dataset and on newly extracted datasets show significant performance improvements, particularly in handling low-resource and imbalanced datasets. Our results highlight the potential of hybrid architectures in advancing the analysis of historical documents and providing robust solutions for their preservation.

Keywords: Training; Computer vision; Text analysis; Computational modeling; Asia; Computer architecture; Linguistics; Transformers; Feature extraction; Robustness


MUSA: Multi-Lingual Speaker Anonymization via Serial Disentanglement

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 1664-1674

Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Yuguang Yang, Yu Pan, Lei Xie. Affiliations: Audio Speech and Language Processing Group, School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China; Department of Electronic & Computer Engineering, Hong Kong University of Science and Technology, Hong Kong SAR, China; Kyushu University, Fukuoka, Japan

Speaker anonymization is an effective privacy protection solution designed to conceal the speaker's identity while preserving the linguistic content and para-linguistic information of the original speech. While most prior studies focus solely on a single language, an ideal speaker anonymization system should be capable of handling multiple languages. This paper proposes MUSA, a MUlti-lingual Speaker Anonymization approach that employs a serial disentanglement strategy to perform a step-by-step disentanglement from a global time-invariant representation to a temporal time-variant representation. By utilizing semantic distillation and self-supervised speaker distillation, the serial disentanglement strategy can avoid strong inductive biases and exhibit superior generalization performance across different languages. Meanwhile, we propose a straightforward anonymization strategy that employs an empty embedding with zero values to simulate the speaker identity concealment process, eliminating the need for conversion to a pseudo-speaker identity and thereby reducing the complexity of the speaker anonymization process. Experimental results on VoicePrivacy official datasets and multi-lingual datasets demonstrate that MUSA can effectively protect speaker privacy while preserving linguistic content and para-linguistic information.

Keywords: Data privacy; Information integrity; Information filtering; Speech processing; Training; Semantics; Multilingual; Vectors; Protection; Privacy


On Language Spaces, Scales and Cross-Lingual Transfer of UD Parsers


26th Conference on Computational Natural Language Learning, CoNLL 2022, collocated and co-organized with EMNLP 2022

Authors: Samardžić, Tanja; Gutierrez-Vasque, Ximena; Van Der Goot, Rob; Müller-Eberstein, Max; Pelloni, Olga; Plank, Barbara. Affiliations: Text Group, URPP Language and Space, University of Zurich, Switzerland; Department of Computer Science, IT University of Copenhagen, Denmark; Center for Information and Language Processing, LMU Munich, Germany

ISBN (print): 9781959429074

Cross-lingual transfer of parsing models has been shown to work well for several closely related languages, but predicting the success in other cases remains hard. Our study is a comprehensive analysis of the impact of linguistic distance on the transfer of Universal Dependencies (UD) parsers. As an alternative to syntactic typological distances extracted from URIEL, we propose three text-based feature spaces and show that they can be more precise predictors, especially on a more local scale, when only shorter distances are taken into account. Our analysis also reveals that good coverage in typological databases is not among the factors that explain good transfer. © 2022 Association for Computational Linguistics.

Keywords: Syntactics


LONGEMBED: Extending Embedding Models for Long Context Retrieval


2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024

Authors: Zhu, Dawei; Wang, Liang; Yang, Nan; Song, Yifan; Wu, Wenhao; Wei, Furu; Li, Sujian. Affiliations: School of Computer Science, Peking University, China; National Key Laboratory for Multimedia Information Processing, Peking University, China; Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University, China; Microsoft Corporation, United States

ISBN (print): 9798891761643

Embedding models play a pivotal role in modern NLP applications such as document retrieval. However, existing embedding models are limited to encoding short documents of typically 512 tokens, restrained from application scenarios requiring long inputs. This paper explores context window extension of existing embedding models, pushing their input length to a maximum of 32,768. We begin by evaluating the performance of existing embedding models using our newly constructed LONGEMBED benchmark, which includes two synthetic and four real-world tasks, featuring documents of varying lengths and dispersed target information. The benchmarking results highlight huge opportunities for enhancement in current models. Via comprehensive experiments, we demonstrate that training-free context window extension strategies can effectively increase the input length of these models by several folds. Moreover, comparison of models using Absolute Position Encoding (APE) and Rotary Position Encoding (RoPE) reveals the superiority of RoPE-based embedding models in context window extension, offering empirical guidance for future models. Our benchmark, code and trained models will be released to advance the research in long context embedding models. © 2024 Association for Computational Linguistics.

Keywords: Encoding (symbols)
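The abstract above contrasts Rotary Position Encoding (RoPE) with absolute encodings and mentions training-free context window extension. The sketch below shows the standard RoPE rotation on a single vector, plus linear position interpolation as one example of a training-free strategy. The abstract does not name the specific strategies evaluated, so the `scale` parameter and the choice of interpolation are assumptions made for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE).

    `scale` < 1 compresses positions, which is linear position interpolation
    (an assumed example of a training-free extension strategy).
    """
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    dim = len(vec)
    pos = position * scale
    rotated = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # frequency for the pair starting at index i
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.5, -0.3, 2.0]
key = [0.2, -1.0, 0.7, 0.1]

# The attention score depends only on the relative offset (here 2).
near = dot(rope_rotate(query, 5), rope_rotate(key, 3))
far = dot(rope_rotate(query, 12), rope_rotate(key, 10))
print(abs(near - far) < 1e-9)

# With scale=0.5, position 1024 is mapped onto position 512, which lies
# inside a hypothetical 512-token training window.
print(rope_rotate(query, 1024, scale=0.5) == rope_rotate(query, 512))
```

Because the score depends only on relative offsets, rescaling positions keeps inputs within the range of angles seen during training, which is one plausible reason RoPE-based models adapt well to longer inputs.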


Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis


International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

This paper proposes an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method, dubbed IDEA-TTS, that can synthesize speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To effectively disentangle the environment, speaker, and text factors, we propose an incremental disentanglement process, where an environment estimator is designed to first decompose the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram facilitates the subsequent disentanglement of the speaker and text factors with the condition of the speaker embeddings, which are extracted from the environmental speech using a pretrained environment-robust speaker encoder. Finally, both the speaker and environment embeddings are conditioned into the decoder for environment-aware speech generation. Experimental results demonstrate that IDEA-TTS achieves superior performance in the environment-aware TTS task, excelling in speech quality, speaker similarity, and environmental similarity. Additionally, IDEA-TTS is also capable of the acoustic environment conversion task and achieves state-of-the-art performance.

Keywords: Speech enhancement; Acoustics; Text to speech; Decoding; Data mining; Spectrogram


Towards robust one-shot voice conversion with cycle phonetic posteriorgrams and multi-scale speaker representations


24th International Congress on Acoustics, ICA 2022

Authors: Chen, Yannian; Liu, Lijuan; Hu, Yajun; Ling, Zhenhua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; IFLYTEK Research, IFLYTEK Co. Ltd., China

One-shot voice conversion (VC) aims to convert the voice across arbitrary speakers even unseen during training, with only one reference utterance from the target speaker. It is still a challenging task as both content and speaker representations estimated from speech are required to be reliable. In this paper, we propose a novel method which combines phonetic posteriorgrams (PPGs) and multi-scale speaker representations to achieve robust one-shot VC. PPGs are extracted by a pretrained automatic speech recognition (ASR) model and contain robust linguistic information. Cycle PPGs which are generated from a cycle conversion process are used for training to eliminate the influence of residual speaker information in PPGs. Furthermore, multi-scale speaker representations composed of global and local ones are utilized. Global speaker representations are modeled by an advanced speaker embedding network which integrates squeeze-excitation blocks and attentive statistics pooling to get utterance-level vectors. In order to extract time-varying and content-dependent local speaker representations, an attention mechanism is adopted to select the most suitable features depending on each content frame, which is expected to refine the coarse speaker information given by utterance-level speaker representations. Experimental results showed that the proposed method outperformed baseline methods on one-shot VC. © 2022 Proceedings of the International Congress on Acoustics. All Rights Reserved.

Keywords: Speech recognition
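The abstract above mentions attentive statistics pooling as the step that turns frame-level features into an utterance-level speaker vector. The sketch below shows the general technique: an attention-weighted mean and standard deviation over frames, concatenated into one vector. The attention scores are passed in directly as a hypothetical input; in a real system they would come from a small learned network, and nothing here reflects the specific architecture used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, scores):
    """Pool T frames of D features into a 2*D vector of weighted mean and std.

    frames: list of T lists, each with D floats.
    scores: list of T unnormalized attention scores (assumed given).
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("need one attention score per frame")
    weights = softmax(scores)
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    return mean + std


frames = [[1.0, 2.0], [3.0, 4.0]]
# Equal scores reduce to the ordinary mean and standard deviation.
print(attentive_stats_pooling(frames, [0.0, 0.0]))
```

When one frame's score dominates, the pooled mean approaches that frame and the standard deviation approaches zero, so the attention decides which frames shape the speaker vector.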


Bitext Mining for Low-Resource Languages via Contrastive Learning

arXiv, 2022

Authors: Tan, Weiting; Koehn, Philipp. Affiliations: Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University, United States

Mining high-quality bitexts for low-resource languages is challenging. This paper shows that sentence representation of language models fine-tuned with multiple negatives ranking loss, a contrastive objective, helps retrieve clean bitexts. Experiments show that parallel data mined from our approach substantially outperform the previous state-of-the-art method on low resource languages Khmer and Pashto. © 2022, CC BY.
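The abstract above names multiple negatives ranking loss as the contrastive objective. The following sketch shows the general form of that loss: each source sentence should score highest against its own translation, with the other targets in the batch acting as negatives. The cosine similarity, the `scale` value of 20, and the toy two-dimensional embeddings are illustrative assumptions, not settings reported by the authors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def multiple_negatives_ranking_loss(sources, targets, scale=20.0):
    """Mean cross-entropy where targets[i] is the positive for sources[i].

    All other targets in the batch serve as in-batch negatives.
    """
    if len(sources) != len(targets) or not sources:
        raise ValueError("need equally sized, non-empty batches")
    losses = []
    for i, src in enumerate(sources):
        logits = [scale * cosine(src, tgt) for tgt in targets]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])  # negative log-softmax of the positive
    return sum(losses) / len(losses)


sources = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]       # each target matches its source
misaligned = [[0.1, 0.9], [0.9, 0.1]]    # pairs swapped

print(multiple_negatives_ranking_loss(sources, aligned)
      < multiple_negatives_ranking_loss(sources, misaligned))
```

The loss is small when true pairs are the nearest neighbours within the batch, which is the property that makes the resulting embeddings useful for retrieving parallel sentences.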
