Search Results - Inner Mongolia University Library

Integrating Time-Frequency Domain Shallow and Deep Features for Speech-EEG Match-Mismatch of Auditory Attention Decoding

Journal of Shanghai Jiaotong University (Science), 2025, pp. 1-7

Authors: Zhang, Yubang; Zhu, Qiushi; Xu, Qingtian; Zhang, Jie. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China

Electroencephalogram (EEG) signals provide an important pathway to reflect brain activations, from which auditory attention clues of the listener can be decoded, termed auditory attention decoding (AAD). However, existing AAD methods primarily rely on temporal or frequency features of audio and shallow features of EEG. In this work, we propose a new model-fusion-based AAD method with residual dilated convolution blocks, which considers both shallow and deep attention mechanisms as well as time-frequency domain features. In addition, EEG data from different utterances are mixed with the selected EEG segment for augmentation to increase the sample diversity. The effectiveness of our approach is verified on the match-mismatch task of the ICASSP 2024 Auditory EEG Challenge, which is a typical example of AAD. It performs much better than the baseline and state-of-the-art methods. © Shanghai Jiao Tong University 2025.

Keywords: Electroencephalography

APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024

Authors: Du, Hui-Peng; Ai, Yang; Zheng, Rui-Chen; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

ISBN: (Print) 9798331516826

This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. The APCodec+ takes the audio amplitude and phase spectra as the coding object, and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that our proposed APCodec+ at low bitrates achieves comparable performance with baseline codecs at higher bitrates, thanks to the proposed staged training paradigm. ©2024 IEEE.

Keywords: Discriminators

APNet2: High-Quality and High-Efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

Authors: Du, Hui-Peng; Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

ISBN: (Digital) 9789819706013

ISBN: (Print) 9789819706006

In our previous work, we have proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). The APNet vocoder demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vocoder but with a considerably improved inference speed. However, the performance of the APNet vocoder is constrained by the waveform sampling rate and spectral frame shift, limiting its practicality for high-quality speech synthesis. Therefore, this paper proposes an improved iteration of APNet, named APNet2. The proposed APNet2 vocoder adopts ConvNeXt v2 as the backbone network for amplitude and phase predictions, expecting to enhance the modeling capability. Additionally, we introduce a multi-resolution discriminator (MRD) into the GAN-based losses and optimize the form of certain losses. At a common configuration with a waveform sampling rate of 22.05 kHz and spectral frame shift of 256 points (i.e., approximately 11.6 ms), our proposed APNet2 vocoder outperforms the original APNet and Vocos in terms of synthesized speech quality. The synthesized speech quality of APNet2 is also comparable to that of HiFi-GAN and iSTFTNet, while offering a significantly faster inference speed. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Discriminators
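
The APNet2 abstract above quotes a spectral frame shift of 256 points at 22.05 kHz as approximately 11.6 ms, and the original APNet used a 5 ms shift at 16 kHz. The conversion can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `frame_shift_ms` is a hypothetical helper, not something taken from the paper.

```python
def frame_shift_ms(hop_points: int, sample_rate_hz: float) -> float:
    """Convert a frame shift given in samples to milliseconds."""
    return 1000.0 * hop_points / sample_rate_hz


# APNet2 configuration quoted in the abstract: 256 points at 22.05 kHz.
apnet2_shift_ms = frame_shift_ms(256, 22050)

# Original APNet configuration: 5 ms at 16 kHz, expressed in samples.
apnet_shift_points = round(0.005 * 16000)

print(f"APNet2 frame shift: {apnet2_shift_ms:.2f} ms")
print(f"APNet frame shift: {apnet_shift_points} points")
```

The first value rounds to the 11.6 ms stated in the abstract, and the second shows that a 5 ms shift at 16 kHz corresponds to 80 samples.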

LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights

6th Natural Legal Language Processing Workshop 2024, NLLP 2024, co-located with the 2024 Conference on Empirical Methods in Natural Language Processing

Authors: Chlapanis, Odysseas S.; Galanis, Dimitrios; Androutsopoulos, Ion. Department of Informatics, Athens University of Economics and Business, Greece; Institute for Language and Speech Processing, Athena Research Center, Greece; Archimedes Unit, Athena Research Center, Greece

ISBN: (Print) 9798891761834

We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple-choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly than LegalBench, and (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems. ©2024 Association for Computational Linguistics.

Keywords: Case based reasoning

Multilingual Synthesis of Depictions through Structured Descriptions of Sign: An Initial Case Study

11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024

Authors: McDonald, John; Efthimiou, Eleni; Fotinea, Stavroula-Evita; Wolfe, Rosalee. School of Computing, DePaul University, Chicago, IL, United States; Institute for Language and Speech Processing, ATHENA Research Center, Athens, Greece

ISBN: (Print) 9782493814302

Sign language synthesis systems must contend with an enormous variety of possible target languages across the world, and in many locations, such as Europe, the number of sign languages that can be found in a relatively limited geographical area can be surprising. For such a synthesis system to be widely useful, it must not be limited to only one target language. This presents challenges both for the linguistic models and the animation systems that drive these displays. This paper presents a case study for animating discourse in three target languages, French, Greek and German, generated directly from the same base linguistic description. The case study exploits non-lexical constructs in sign, which are more common among sign languages, while providing a first step for synthesizing those aspects that are different. Further, it suggests a possible path forward to exploring whether linguistic structures in one sign language can be exploited in other sign languages, which might be particularly helpful in under-resourced languages. © 2024 ELRA Language Resources Association: CC BY-NC 4.0.

Keywords: Animation

Zero-Shot Personalized Lip-To-Speech Synthesis with Face Image Based Voice Control

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Sheng, Zheng-Yan; Ai, Yang; Ling, Zhen-Hua. University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China

ISBN: (Print) 9781728163277

Lip-to-speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under the zero-shot condition, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to promote the ability of face-based speaker embeddings (FSE) on voice control. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods. To the best of our knowledge, this paper makes the first attempt at zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio to control voice characteristics. © 2023 IEEE.

Keywords: Speech synthesis

Speech Reconstruction from Silent Tongue and Lip Articulation by Pseudo Target Generation and Domain Adversarial Training

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Zheng, Rui-Chen; Ai, Yang; Ling, Zhen-Hua. University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China

ISBN: (Print) 9781728163277

This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model. When using an automatic speech recognition (ASR) model to measure intelligibility, the word error rate (WER) of our proposed method decreases by over 15% compared to the baseline. In addition, our proposed method also outperforms the baseline on the intelligibility of the speech reconstructed in vocalized articulating mode, reducing the WER by approximately 10%. © 2023 IEEE.

Keywords: Iterative methods
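
The abstract above reports intelligibility as word error rate (WER) computed from ASR transcripts. As a reminder of what that metric measures, here is a minimal sketch of the standard word-level edit-distance definition of WER. It is not the authors' evaluation script; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.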

Deepfake Algorithm Recognition System with Augmented Data for ADD 2023 Challenge

2023 Workshop on Deepfake Audio Detection and Analysis, DADA 2023

Authors: Zeng, Xiao-Min; Zhang, Jian-Tao; Li, Kang; Liu, Zhuo-Li; Xie, Wei-Lin; Song, Yan. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

In this paper, we describe our submitted systems to the ADD2023 Challenge Track 3–Deepfake algorithm recognition (AR). This task requires not only identifying known deepfake algorithms in closed-set but also distinguishing unknown algorithms. By closed-set classification experiments, we select the output of the pre-trained wav2vec2.0-base model as acoustic features. Then, we apply the ECAPA-TDNN model to recognize different deepfake algorithms and determine whether the samples belong to the unknown algorithms by threshold. Besides, we adopt data augmentation to improve the generalization and robustness of our model. We evaluate our system on the ADD2023 Challenge Track 3 and achieve a 75.41% F1-score. Our submission ranked third in the deepfake algorithm recognition track of the ADD2023 Challenge. © 2023 CEUR-WS. All rights reserved.

Keywords: data augmentation; deepfake algorithm recognition; open-set recognition
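
The abstract above describes recognizing known deepfake algorithms with a closed-set classifier and flagging samples from unknown algorithms "by threshold". A minimal sketch of that general idea follows. The abstract does not say which score is thresholded, so the use of the maximum softmax probability, the threshold value of 0.5, and the names `softmax` and `classify_open_set` are all assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(logits, labels, threshold=0.5, unknown_label="unknown"):
    """Return the top closed-set label, or unknown_label if confidence is low.

    Assumption: confidence is the maximum softmax probability. The paper may
    threshold a different score.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return unknown_label
    return labels[best]


algorithms = ["algo_A", "algo_B", "algo_C"]  # hypothetical known classes
print(classify_open_set([5.0, 0.0, 0.0], algorithms))
print(classify_open_set([0.1, 0.0, 0.05], algorithms))
```

In practice the threshold would be tuned on held-out data; a higher value rejects more samples as unknown at the cost of closed-set recall.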

Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Ai, Yang; Ling, Zhen-Hua. University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China

ISBN: (Print) 9781728163277

This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. Experimental results show that our proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in terms of both reconstructed speech quality and generation speed. © 2023 IEEE.

Keywords: Group delay
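
Two ideas in the abstract above lend themselves to a small numeric illustration: computing a phase from two outputs treated as real and imaginary parts, and applying an anti-wrapping function to a phase error so that errors near a multiple of 2π are not exaggerated. The abstract gives no formulas, so the sketch below is an assumption: `phase_from_parts` uses the two-argument arctangent, and `anti_wrap` uses one plausible form, |x − 2π·round(x / 2π)|. Both function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def phase_from_parts(pseudo_real: float, pseudo_imag: float) -> float:
    """Phase in the principal value interval [-pi, pi] via two-argument arctangent."""
    return math.atan2(pseudo_imag, pseudo_real)


def anti_wrap(error: float) -> float:
    """Map a raw phase error to its shortest angular distance, in [0, pi].

    Assumed form: |x - 2*pi*round(x / (2*pi))|.
    """
    return abs(error - TWO_PI * round(error / TWO_PI))


predicted = 3.1   # just below +pi
natural = -3.1    # just above -pi
raw_error = predicted - natural

print(f"raw error: {raw_error:.4f}")
print(f"anti-wrapped error: {anti_wrap(raw_error):.4f}")
```

The two phases are almost the same angle on the circle, yet the raw difference is 6.2 rad; after anti-wrapping, the error reflects the short angular gap between them.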

Within- and Between-Class Sample Interpolation Based Supervised Metric Learning for Speaker Verification

18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

Authors: Zhang, Jian-Tao; Song, Hao-Yu; Guo, Wu; Song, Yan; Dai, Li-Rong. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; The Australian National University, Canberra, Australia

ISBN: (Digital) 9789819706013

ISBN: (Print) 9789819706006

Metric learning aims to pull together the samples belonging to the same class and push apart those from different classes in the embedding space. Existing methods may suffer from inadequate and low-quality sample pairs, resulting in unsatisfactory speaker verification (SV) performance. To address this issue, we propose data augmentation methods in the embedding space to guarantee sufficient and high-quality negative points for metric learning, termed within-class and between-class points interpolation generation (WBIG). Furthermore, the strategy of hard negative pair mining (HDPM) is also considered in WBIG. It is shown that WBIG is simple and flexible enough to be incorporated into existing metric learning methods, such as supervised contrastive loss (SCL). Experiments on CNCeleb and VoxCeleb demonstrate the superiority of WBIG, which achieves relative EER improvements of 9.74% and 9.95%, respectively, over the baseline system. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Interpolation
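
The abstract above generates additional points for metric learning by interpolating in the embedding space. The exact WBIG procedure is not given in the abstract, so the following is only a generic sketch of linear interpolation between two embeddings followed by L2 normalization. The function names, the mixing weight `lam`, and the toy two-dimensional vectors are all assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def interpolate_embeddings(a, b, lam):
    """Return the unit-length point lam * a + (1 - lam) * b.

    Generic linear interpolation; not necessarily the scheme used by WBIG.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return l2_normalize(mixed)


speaker_a = [1.0, 0.0]  # toy embedding for one speaker
speaker_b = [0.0, 1.0]  # toy embedding for another speaker
print(interpolate_embeddings(speaker_a, speaker_b, 0.5))
```

Interpolating two embeddings of the same speaker would yield a within-class point, while mixing embeddings of different speakers yields a between-class point that lies nearer to one class as `lam` approaches 0 or 1.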
