Search Results - Inner Mongolia University Library

arXiv, 2023

Authors: Zheng, Rui-Chen; Ai, Yang; Ling, Zhen-Hua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve lip-based AV-SE systems’ performance. Knowledge distillation is employed at the training stage to address the challenge of acquiring ultrasound tongue images during inference, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to the traditional audio-lip speech enhancement baselines. Further analysis using phone error rates (PER) of automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images. Copyright © 2023, The Authors. All rights reserved.

Keywords: Ultrasonics

Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion

IEEE Spoken Language Technology Workshop

Authors: Yu-Fei Shi; Yang Ai; Ye-Xin Lu; Hui-Peng Du; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China

ISBN (digital): 9798350392258

ISBN (print): 9798350392265

We participated in track 2 of the VoiceMOS Challenge 2024, which aimed to predict the mean opinion score (MOS) of singing samples. Our submission secured the first place among all participating teams, excluding the official baseline. In this paper, we further improve our submission and propose a novel Pitch-and-Spectrum-aware Singing Quality Assessment (PS-SQA) method. The PS-SQA is designed based on the self-supervised-learning (SSL) MOS predictor, incorporating singing pitch and spectral information, which are extracted using pitch histogram and non-quantized neural codec, respectively. Additionally, the PS-SQA introduces a bias correction strategy to address prediction biases caused by low-resource training samples, and employs model fusion technology to further enhance prediction accuracy. Experimental results confirm that our proposed PS-SQA significantly outperforms all competing systems across all system-level metrics, confirming its strong singing quality assessment capabilities.

Keywords: Training; Measurement; Histograms; Accuracy; Conferences; Training data; Predictive models; Data models; Quality assessment; Data mining

Automatic Channel Selection and Spatial Feature Integration for Multi-Channel Speech Recognition Across Various Array Topologies

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Bingshen Mu; Pengcheng Guo; Dake Guo; Pan Zhou; Wei Chen; Lei Xie. Affiliations: Audio Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Space AI, Li Auto

Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies, each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing across various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. First of all, we propose two attention-based automatic channel selection modules to select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Furthermore, we introduce inter-channel spatial features to augment the effectiveness of multi-frame cross-channel attention, improving its awareness of spatial information. Finally, we propose a multi-layer convolution fusion module drawing inspiration from the U-Net architecture to integrate the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation demonstrate that the improvements introduced in our proposed ASR system lead to a relative reduction of 40.1% in the Macro Diarization Attributed Word Error Rates (DA-WER) when compared to the baseline ASR system on the Eval sets.

MDCTCodec: A Lightweight MDCT-Based Neural Audio Codec Towards High Sampling Rate and Low Bitrate Scenarios

IEEE Spoken Language Technology Workshop

Authors: Xiao-Hang Jiang; Yang Ai; Rui-Chen Zheng; Hui-Peng Du; Ye-Xin Lu; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China

ISBN (digital): 9798350392258

ISBN (print): 9798350392265

In this paper, we propose MDCTCodec, an efficient lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of audio as input, encoding it into a continuous latent code which is then discretized by a residual vector quantizer (RVQ). Subsequently, the decoder decodes the MDCT spectrum from the quantized latent code and reconstructs audio via inverse MDCT. During the training phase, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to discriminate the natural or decoded MDCT spectrum for adversarial training. Experimental results confirm that, in scenarios with high sampling rates and low bitrates, the MDCTCodec exhibited high decoded audio quality, improved training and generation efficiency, and compact model size compared to baseline codecs. Specifically, the MDCTCodec achieved a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus.
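
The residual vector quantizer (RVQ) step mentioned in this abstract can be illustrated with a minimal, dependency-free sketch. It is not the MDCTCodec implementation: the function names, the toy two-stage codebooks, and the example latent vector are all assumptions made for illustration. Each stage picks the nearest codebook entry for whatever the previous stages left unexplained, and decoding sums the selected entries.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(latent, codebooks):
    """Quantize stage by stage; each stage codes the previous residual."""
    residual = list(latent)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entry of every stage to rebuild the latent."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: a coarse first stage, a finer second stage.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]
codes = rvq_encode([1.4, 0.6], toy_codebooks)
approx = rvq_decode(codes, toy_codebooks)
```

In a real codec the codebooks are learned and the indices are what gets transmitted, so the bitrate grows with the number of stages and the codebook size.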

Keywords: Training; Codecs; Codes; Quantization (signal); Audio coding; Conferences; Bit rate; Vectors; Decoding; Discrete cosine transforms

Implicit Neural Representations for Robust Joint Sparse-View CT Reconstruction

arXiv, 2024

Authors: Shi, Jiayang; Zhu, Junyi; Pelt, Daniël M.; Batenburg, K. Joost; Blaschko, Matthew B. Affiliations: Leiden Institute of Advanced Computer Science, Leiden University, Netherlands; Center for Processing Speech and Images, KU Leuven, Belgium

Computed Tomography (CT) is pivotal in industrial quality control and medical diagnostics. Sparse-view CT, offering reduced ionizing radiation, faces challenges due to its under-sampled nature, leading to ill-posed reconstruction problems. Recent advancements in Implicit Neural Representations (INRs) have shown promise in addressing sparse-view CT reconstruction. Recognizing that CT often involves scanning similar subjects, we propose a novel approach to improve reconstruction quality through joint reconstruction of multiple objects using INRs. This approach can potentially utilize the advantages of INRs and the common patterns observed across different objects. While current INR joint reconstruction techniques primarily focus on speeding up the learning process, they are not specifically tailored to enhance the final reconstruction quality. To address this gap, we introduce a novel INR-based Bayesian framework integrating latent variables to capture the common patterns across multiple objects under joint reconstruction. The common patterns then assist in the reconstruction of each object via latent variables, thereby improving the individual reconstruction. Extensive experiments demonstrate that our method achieves higher reconstruction quality with sparse views and remains robust to noise in the measurements as indicated by common numerical metrics. The obtained latent variables can also serve as network initialization for the new object and speed up the learning process. © 2024, CC BY.

Keywords: computerized tomography

Leveraging Prompt Learning and Pause Encoding for Alzheimer's Disease Detection

International Symposium on Chinese Spoken Language Processing

Authors: Yin-Long Liu; Rui Feng; Jia-Hong Yuan; Zhen-Hua Ling. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei; Interdisciplinary Research Center for Linguistic Sciences, University of Science and Technology of China, Hefei

ISBN (digital): 9798331516826

ISBN (print): 9798331516833

Compared to other clinical screening techniques, speech-and-language-based automated Alzheimer's disease (AD) detection methods are characterized by their non-invasiveness, cost-effectiveness, and convenience. Previous studies have demonstrated the efficacy of fine-tuning pre-trained language models (PLMs) for AD detection. However, the objective of this traditional fine-tuning method, which involves inputting only transcripts, is inconsistent with the masked language modeling (MLM) task used during the pre-training phase of PLMs. In this paper, we investigate prompt-based fine-tuning of PLMs, converting the classification task into an MLM task by inserting prompt templates into the transcript inputs. We also explore the impact of incorporating pause information from forced alignment into manual transcripts. Additionally, we compare the performance of various automatic speech recognition (ASR) models and select the Whisper model to generate ASR-based transcripts for comparison with manual transcripts. Furthermore, majority voting and ensemble techniques are applied across different PLMs (BERT and RoBERTa) using different random seeds. Ultimately, we obtain a maximum detection accuracy of 95.8% (with mean 87.9%, std 3.3%) using manual transcripts, achieving state-of-the-art performance for AD detection using only transcripts on the ADReSS test set.
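
The majority-voting step named in this abstract can be sketched as follows. This is a generic illustration rather than the authors' code: the function name, the 0/1 label encoding, and the three hypothetical runs are assumptions. An odd number of runs is used so that binary votes cannot tie.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    predictions: list of equal-length lists, one list per model or seed.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions)
        # most_common(1) returns the label with the highest vote count.
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical outputs of three fine-tuning runs (1 = AD, 0 = control).
runs = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
final_labels = majority_vote(runs)
```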

Keywords: Accuracy; Manuals; Bidirectional control; Linguistics; Feature extraction; Encoding; Alzheimer's disease; Speech processing; Automatic speech recognition

CoUDA: Coherence Evaluation via Unified Data Augmentation

2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024

Authors: Zhu, Dawei; Wu, Wenhao; Song, Yifan; Zhu, Fangwei; Cao, Ziqiang; Li, Sujian. Affiliations: School of Computer Science, Peking University, China; National Key Laboratory for Multimedia Information Processing, Peking University, China; Institute of Artificial Intelligence, Soochow University, China; Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University, China

ISBN (print): 9798891761148

Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used for training coherence evaluation models. However, previous augmentations for this task primarily rely on heuristic rules, lacking design criteria as guidance. In this paper, we take inspiration from the linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA. CoUDA breaks down discourse coherence into global and local aspects, and designs augmentation strategies for both aspects, respectively. Especially for local coherence, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two controlling mechanisms to control the difficulty of generated samples. During inference, CoUDA also jointly evaluates both global and local aspects to comprehensively assess the overall coherence of a discourse. Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5 and GPT-4 based metrics. © 2024 Association for Computational Linguistics.

PNP-RKD: A Positive-Negative Pair based Relational Knowledge Distillation Method for Cross-Domain Speaker Verification

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Qing Gu; Yan Song; Nan Jiang; Pengfei Cai; Ian McLoughlin. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; ICT Cluster, Singapore Institute of Technology, Singapore

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Existing deep embedding learning based speaker verification (SV) methods suffer from performance degradation under domain shift conditions. This can be alleviated through unsupervised domain adaptation (UDA) techniques. While UDA improves global statistical consistency across domains, discriminative information may be overlooked or misaligned in the process. To combat this, we propose PNP-RKD, a relational knowledge distillation method that utilizes positive and negative pairs from both the source and target domains within a multitask learning framework. Two auxiliary tasks are conducted separately in the source and target domains to support PNP-RKD. Embeddings are learned in a supervised fashion from the labeled source domain, providing a robust foundation of prior knowledge. For the unlabeled target domain, we apply contrastive learning based on swapped prediction, a key component that enhances noise robustness and improves the quality of learned prototypes. More importantly, it facilitates reliable sampling in PNP-RKD, thereby enhancing the alignment of discriminative knowledge across domains. Extensive experiments conducted on the NIST SRE16 and SRE18 datasets demonstrate the superior performance of the proposed PNP-RKD method, achieving EERs of 6.83% and 8.28%, respectively.

Keywords: Degradation; Prototypes; Contrastive learning; NIST; Signal processing; Multitasking; Acoustics; Noise robustness; Reliability; Speech processing

Corrective Retrieval Augmented Generation

arXiv, 2024

Authors: Yan, Shi-Qi; Gu, Jia-Chen; Zhu, Yun; Ling, Zhen-Hua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; Department of Computer Science, University of California, Los Angeles, United States; Google DeepMind, United Kingdom

Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return suboptimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches. Copyright © 2024, The Authors. All rights reserved.
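
The idea of a confidence degree that triggers different retrieval actions can be sketched as a simple threshold rule. The abstract does not give the thresholds or the action names, so the values `upper` and `lower` and the three labels below are hypothetical placeholders, not the CRAG implementation.

```python
def choose_action(confidence, upper=0.5, lower=-0.5):
    """Map a retrieval-evaluator confidence score to a retrieval action.

    The thresholds and the action names are illustrative assumptions.
    """
    if confidence >= upper:
        # Retrieved documents look relevant: refine and use them.
        return "use_retrieved"
    if confidence <= lower:
        # Retrieved documents look irrelevant: fall back to web search.
        return "web_search"
    # Uncertain case: combine refined documents with web search results.
    return "combine_both"


actions = [choose_action(score) for score in (0.9, 0.0, -0.8)]
```

Because the rule only consumes a score, a sketch like this can sit in front of any generator, which matches the plug-and-play claim in the abstract.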

Keywords: Structured Query Language

HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement with Low-Quality Video

IEEE Journal on Selected Topics in Signal Processing, 2025

Authors: Chen, Hang; Zhang, Chen-Yue; Wang, Qing; Du, Jun; Siniscalchi, Sabato Marco; Xiong, Shi-Fu; Wan, Gen-Shun. Affiliations: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, Anhui, China; University of Palermo, Palermo, Italy; IFlytek Research, Hefei, Anhui, China

To advance audio-visual speech enhancement (AVSE) research in low-quality video settings, we introduce the multimodal information-based speech processing-low quality video (MISP-LQV) benchmark, which includes a 120-hour real-world Mandarin audio-visual dataset, two video degradation simulation methods, and benchmark results from several well-known AVSE models. We also propose a novel hybrid pixel and contour network (HPCNet), incorporating a lip reconstruction and distillation (LRD) module and a contour graph convolution (CGConv) layer. Specifically, the LRD module reconstructs high-quality lip frames from low-quality audio-visual data, utilizing knowledge distillation from a teacher model trained on high-quality data. The CGConv layer employs spatio-temporal and semantic-contextual graphs to capture complex relationships among lip landmark points. Extensive experiments on the MISP-LQV benchmark reveal the performance degradation caused by low-quality video across various AVSE models. Notably, including real/simulated low-quality videos in AVSE training enhances its robustness to low-quality videos but degrades the performance of high-quality *** proposed HPCNet demonstrates strong robustness against video quality degradation, which can be attributed to (1) the reconstructed lip frames closely aligning with high-quality frames and (2) the contour features exhibiting consistency across different video quality levels. The generalizability of HPCNet has also been validated through experiments on the 2nd COG-MHEAR AVSE Challenge dataset. Dataset and source code will be available at https://***/coal-boss/HPCNet. © 2007-2012 IEEE.

Keywords: Contour followers
