Search Results - Inner Mongolia University Library

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

arXiv, 2023

Authors: Zheng, Rui-Chen; Ai, Yang; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most. Copyright © 2023, The Authors. All rights reserved.

Keywords: Ultrasonics
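
The lip-to-tongue retrieval mentioned in the abstract above can be illustrated with a generic key-value memory read. This is a minimal sketch, not the authors' network: the cosine-similarity addressing, the `temperature` parameter, the slot count, and all feature dimensions are assumptions chosen for illustration, and the memory contents are random rather than learned.

```python
import numpy as np

def retrieve_tongue_features(lip_query, keys, values, temperature=1.0):
    """Address memory slots with lip features; return a weighted sum of tongue values.

    lip_query: (frames, d_lip) lip features (available at inference time)
    keys:      (slots, d_lip)  key memory, assumed to live in the lip feature space
    values:    (slots, d_tongue) value memory, assumed to hold tongue features
    """
    # Cosine similarity between every query frame and every key slot.
    q = lip_query / (np.linalg.norm(lip_query, axis=-1, keepdims=True) + 1e-8)
    k = keys / (np.linalg.norm(keys, axis=-1, keepdims=True) + 1e-8)
    scores = q @ k.T / temperature                    # (frames, slots)

    # Numerically stable softmax over slots gives the addressing weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Retrieved tongue features are a convex combination of the value slots.
    return weights @ values, weights

# Hypothetical sizes: 8 memory slots, 16-dim lip features, 32-dim tongue features.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))
values = rng.normal(size=(8, 32))
lip = rng.normal(size=(5, 16))
tongue_hat, w = retrieve_tongue_features(lip, keys, values)
```

With a small temperature the read becomes nearly one-hot, so querying with a stored key returns approximately its paired value; a larger temperature blends several slots.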

Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

arXiv, 2023

Authors: Ai, Yang; Lu, Ye-Xin; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Speech phase prediction, a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction of short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction of long-frame-shift phase spectra from long-frame-shift log amplitude spectra. The proposed method consists of three stages: interpolation, prediction and decimation. The short-frame-shift log amplitude spectra are first constructed from long-frame-shift ones through frequency-by-frequency interpolation to enhance the spectral continuity, and then employed to predict short-frame-shift phase spectra using an NSPP model, thereby compensating for interpolation errors. Finally, the long-frame-shift phase spectra are obtained from short-frame-shift ones through frame-by-frame decimation. Experimental results show that the proposed LFS-NSPP method predicts long-frame-shift phase spectra with higher quality than the original NSPP model and other signal-processing-based phase estimation algorithms. Copyright © 2023, The Authors. All rights reserved.

Keywords: Error compensation
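
The three stages named in the abstract above (interpolation, prediction, decimation) can be sketched as a small pipeline. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which interpolation is used, so linear interpolation via `np.interp` is assumed, and `dummy_predictor` is a hypothetical placeholder standing in for a trained NSPP model.

```python
import numpy as np

def interpolate_frames(log_amp_long, factor):
    """Interpolate along time, one frequency bin at a time (linear, by assumption).

    (T, F) long-frame-shift spectra -> ((T - 1) * factor + 1, F) short-frame-shift spectra.
    """
    n_frames, n_bins = log_amp_long.shape
    t_long = np.arange(n_frames) * factor              # positions of the known frames
    t_short = np.arange((n_frames - 1) * factor + 1)   # positions of all short-shift frames
    return np.stack(
        [np.interp(t_short, t_long, log_amp_long[:, k]) for k in range(n_bins)],
        axis=1,
    )

def decimate_frames(phase_short, factor):
    """Keep every `factor`-th frame to return to the long frame shift."""
    return phase_short[::factor]

def lfs_phase_prediction(log_amp_long, factor, predict_phase):
    """Interpolation -> prediction -> decimation."""
    log_amp_short = interpolate_frames(log_amp_long, factor)
    phase_short = predict_phase(log_amp_short)
    return decimate_frames(phase_short, factor)

def dummy_predictor(log_amp):
    # Hypothetical placeholder, NOT a trained NSPP model:
    # it only maps its input into the wrapped-phase range.
    return np.angle(np.exp(1j * log_amp))

rng = np.random.default_rng(0)
log_amp_long = np.log(np.abs(rng.normal(size=(6, 5))) + 1e-3)
phase_long = lfs_phase_prediction(log_amp_long, factor=4, predict_phase=dummy_predictor)
```

Because the interpolated grid contains the original frames at every `factor`-th position, decimation lands exactly on the original long-frame-shift time instants.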

Exploring Language-Agnostic Speech Representations Using Domain Knowledge for Detecting Alzheimer’s Dementia

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zehra Shah, Shi-Ang Qi, Fei Wang, Mahtab Farrokh, Mashrura Tasnim, Eleni Stroulia, Russell Greiner, Manos Plitsis, Athanasios Katsamanis. Department of Computing Science, University of Alberta, Edmonton, Canada; Institute for Language and Speech Processing, Athena Research Center, Greece

We explore ways to use speech data to screen for indications of Alzheimer’s dementia (AD). In particular, we describe our approach to the ICASSP 2023 Signal Processing Grand Challenge, which involves extrapolating from models learned on English speech samples to Greek speech samples in order to determine which subjects have AD. By using acoustic and linguistic features inspired by clinical research on AD, our top-performing classification model achieves 69% accuracy in distinguishing AD patients from healthy controls, and our regression model attains an RMSE of 4.8 for inferring cognitive testing scores. These outcomes underscore the potential of our explainable model for detecting cognitive decline in AD patients via speech, and its applicability in clinical settings.

Keywords: Signal processing; Linguistics; Acoustics; Speech processing; Alzheimer's disease; Testing

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

arXiv, 2023

Authors: Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua. The National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder aims to encode the input distorted magnitude and phase spectra into time-frequency representations, which are further fed into time-frequency Transformers for alternately capturing time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that our proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase by explicit phase estimation, elevating the perceptual quality of enhanced speech. Remarkably, for the speech denoising task, the proposed MP-SENet yields a PESQ of 3.60 on the VoiceBank+DEMAND dataset and 3.62 on the DNS challenge dataset. Copyright © 2023, The Authors. All rights reserved.

Keywords: Speech intelligibility
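
The abstract above describes enhancing the magnitude through a mask and the phase as a wrapped quantity, then working with the short-time complex spectrum. A rough sketch of that final recombination step follows. It is not the MP-SENet decoder: the function names are hypothetical, and the mask and phase values below are made-up arrays rather than network outputs.

```python
import numpy as np

def wrap_phase(phase):
    """Map arbitrary phase values to their principal values in (-pi, pi]."""
    return np.angle(np.exp(1j * phase))

def recombine(noisy_mag, mask, enhanced_phase):
    """Apply a magnitude mask and attach the (wrapped) phase to form a complex spectrum."""
    enhanced_mag = noisy_mag * mask                    # magnitude masking
    return enhanced_mag * np.exp(1j * wrap_phase(enhanced_phase))

# Made-up (frames, bins) inputs purely for illustration.
noisy_mag = np.array([[1.0, 2.0], [0.5, 4.0]])
mask = np.array([[0.5, 1.0], [0.2, 0.25]])
phase = np.array([[0.0, 1.5 * np.pi], [-0.3, 7.0]])
spec = recombine(noisy_mag, mask, phase)
```

The magnitude of `spec` equals `noisy_mag * mask`, and its angle is the principal value of the supplied phase, which is why phases such as `1.5 * np.pi` and `-0.5 * np.pi` describe the same spectrum.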

Language-Independent Prosody-Enhanced Speech Representations For Multilingual Speech Synthesis

IEEE Spoken Language Technology Workshop

Authors: Chang Liu, Zhen-Hua Ling, Ya-Jun Hu. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; iFLYTEK Co. Ltd., China

ISBN (digital): 9798350392258

ISBN (print): 9798350392265

This paper proposes language-independent prosody-enhanced speech representations to improve the naturalness of speech synthesis for target languages that lack prosodic labels. To build text-to-speech (TTS) systems for low-resource languages, recent studies have employed the representations extracted from self-supervised learning (SSL) speech models, such as wav2vec 2.0, as intermediate representations in TTS models. However, they have generally focused only on the linguistic and phonetic information in SSL representations, disregarding the prosodic information. This paper investigates the prosodic information contained in the multilingual wav2vec 2.0 model through layer-wise probing tests utilizing acoustic prosodic features and prosodic labels. Furthermore, we propose a language-independent prosody enhancement approach to improve the prosodic properties of SSL models. The proposed method introduces a prosodic label prediction loss to fine-tune the wav2vec 2.0 model with multilingual prosody-annotated corpora. From the fine-tuned wav2vec 2.0 model, the language-independent prosody-enhanced speech representations are extracted and serve as intermediate representations of our acoustic model in the downstream TTS task. The experimental results on six target languages demonstrate that our proposed prosody-enhanced speech representations outperform the original wav2vec 2.0 representations without enhancement.

Keywords: Correlation; Conferences; Self-supervised learning; Speech enhancement; Predictive models; Phonetics; Transformers; Acoustics; Multilingual; Text to speech

MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank

arXiv, 2024

Authors: Blaschke, Verena; Kovačić, Barbara; Peng, Siyao; Schütze, Hinrich; Plank, Barbara. Center for Information and Language Processing, LMU Munich, Germany; Munich, Germany; Department of Computer Science, IT University of Copenhagen, Denmark

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in ‘within-language breadth’: most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers’ orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available. © 2024, CC BY-NC-SA.

Keywords: Syntactics

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

arXiv, 2024

Authors: Cai, Pengfei; Song, Yan; Jiang, Nan; Gu, Qing; McLoughlin, Ian. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China; ICT Cluster, Singapore Institute of Technology, Singapore

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique. © 2024, CC BY.

Keywords: Self-supervised learning
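
The abstract above motivates binary cross-entropy by noting that it gives independent loss contributions from different prototypes. The sketch below illustrates only that property and is not the PMAM training code: the function name, the shapes, and the assumption that pseudo labels are per-prototype targets in [0, 1] are all hypothetical.

```python
import numpy as np

def bce_per_prototype(logits, pseudo_labels, eps=1e-7):
    """Element-wise binary cross-entropy, one independent sigmoid per prototype.

    logits, pseudo_labels: (frames, prototypes); labels assumed to lie in [0, 1].
    Returns the unreduced (frames, prototypes) loss matrix.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, eps, 1.0 - eps)             # avoid log(0)
    return -(pseudo_labels * np.log(probs)
             + (1.0 - pseudo_labels) * np.log(1.0 - probs))

# Hypothetical example: 3 frames, 4 prototypes; a frame may activate several prototypes.
labels = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0]])
logits = np.zeros((3, 4))
loss = bce_per_prototype(logits, labels)

# Changing the logit of prototype 0 leaves the loss terms of other prototypes untouched.
shifted = logits.copy()
shifted[:, 0] += 2.0
loss_shifted = bce_per_prototype(shifted, labels)
```

A softmax-based contrastive loss would instead couple all prototypes through a shared normalizer, so raising one logit would change every term for that frame.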

Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang. Center for Speech and Language Technologies, BNRist, Beijing; Department of Computer Science and Technology, Tsinghua University, Beijing, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowdsourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the underrepresented Oromo language. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://***/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.

Keywords: Crowdsourcing; Error analysis; Signal processing; Phonetics; Audio recording; Acoustics; Noise measurement; Speech processing; Research and development; Automatic speech recognition
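
The Word Error Rate figures quoted in the abstract above follow the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch of that computation is given below; whitespace tokenization and the absence of any text normalization are simplifying assumptions, and the example strings are placeholders rather than Oromo data.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Placeholder tokens: one substitution (b -> x) and one deletion (d) over 4 reference words.
wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% for very long hypotheses.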

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-Supervised Contrastive Learning

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie. Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Mashang Consumer Finance Co. Ltd

ISBN (digital): 9798350390155

ISBN (print): 9798350390162

This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker’s speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.

Keywords: Training; Representation learning; Contrastive learning; Speech; Boosting; Speech synthesis