Search Results - Inner Mongolia University Library

Integrating Time-Frequency Domain Shallow and Deep Features for speech-EEG Match-Mismatch of Auditory Attention Decoding

Journal of Shanghai Jiaotong University (Science), 2025, pp. 1-7

Authors: Zhang, Yubang; Zhu, Qiushi; Xu, Qingtian; Zhang, Jie (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China)

Electroencephalogram (EEG) signals provide an important pathway to reflect brain activations, from which auditory attention clues of the listener can be decoded, termed as auditory attention decoding (AAD). However, existing AAD methods primarily rely on temporal or frequency features of audio and shallow features of EEG. In this work, we propose a new model fusion based AAD method with residual dilated convolution blocks, which considers both shallow and deep attention mechanisms as well as time-frequency domain features. Besides, EEG data from different utterances are mixed with the selected EEG segment for augmentation to increase the sample diversity. The effectiveness of our approach is verified by the match-mismatch task of ICASSP2024 Auditory EEG Challenge, which is a typical example of AAD. It performs much better than the baseline and state-of-the-art methods. © Shanghai Jiao Tong University 2025.

Keywords: Electroencephalography

APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024

Authors: Du, Hui-Peng; Ai, Yang; Zheng, Rui-Chen; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China)

ISBN (print): 9798331516826

This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. The APCodec+ takes the audio amplitude and phase spectra as the coding object, and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that our proposed APCodec+ at low bitrates achieves comparable performance with baseline codecs at higher bitrates, thanks to the proposed staged training paradigm. ©2024 IEEE.

Keywords: Discriminators

APNet2: High-Quality and High-Efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

Authors: Du, Hui-Peng; Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China)

ISBN (electronic): 9789819706013

ISBN (print): 9789819706006

In our previous work, we have proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). The APNet vocoder demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vocoder but with a considerably improved inference speed. However, the performance of the APNet vocoder is constrained by the waveform sampling rate and spectral frame shift, limiting its practicality for high-quality speech synthesis. Therefore, this paper proposes an improved iteration of APNet, named APNet2. The proposed APNet2 vocoder adopts ConvNeXt v2 as the backbone network for amplitude and phase predictions, expecting to enhance the modeling capability. Additionally, we introduce a multi-resolution discriminator (MRD) into the GAN-based losses and optimize the form of certain losses. At a common configuration with a waveform sampling rate of 22.05 kHz and spectral frame shift of 256 points (i.e., approximately 11.6 ms), our proposed APNet2 vocoder outperforms the original APNet and Vocos in terms of synthesized speech quality. The synthesized speech quality of APNet2 is also comparable to that of HiFi-GAN and iSTFTNet, while offering a significantly faster inference speed. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Discriminators
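
The APNet2 abstract above gives the frame shift both in samples (256 points at 22.05 kHz) and in milliseconds (approximately 11.6 ms), and the original APNet as 5 ms at 16 kHz. The conversion between the two can be sketched as follows; the function names are hypothetical and chosen only for illustration, not taken from the paper.

```python
def frame_shift_ms(hop_samples: int, sample_rate_hz: float) -> float:
    """Convert a frame shift given in samples to milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def frame_shift_samples(shift_ms: float, sample_rate_hz: float) -> int:
    """Convert a frame shift given in milliseconds to the nearest sample count."""
    return round(shift_ms * sample_rate_hz / 1000.0)


if __name__ == "__main__":
    # APNet2 configuration from the abstract: 256 points at 22.05 kHz.
    print(f"{frame_shift_ms(256, 22050):.1f} ms")
    # APNet configuration from the abstract: 5 ms at 16 kHz.
    print(frame_shift_samples(5, 16000), "samples")
```

With these figures, the APNet2 setting works out to roughly 11.6 ms per frame and the APNet setting to 80 samples per frame.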

Which is more faithful, seeing or saying? Multimodal sarcasm detection exploiting contrasting sentiment knowledge

CAAI Transactions on Intelligence Technology, 2025, vol. 10, no. 2, pp. 375-386

Authors: Yutao Chen; Shumin Shi; Heyan Huang (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing, China)

Using sarcasm on social media platforms to express negative opinions towards a person or object has become increasingly ***, detecting sarcasm in various forms of communication can be difficult due to conflicting *** this paper, we introduce a contrasting sentiment-based model for multimodal sarcasm detection (CS4MSD), which identifies inconsistent emotions by leveraging the CLIP knowledge module to produce sentiment features in both text and ***, five external sentiments are introduced to prompt the model learning sentimental preferences among ***, we highlight the importance of verbal descriptions embedded in illustrations and incorporate additional knowledge-sharing modules to fuse such imagelike *** results demonstrate that our model achieves state-of-the-art performance on the public multimodal sarcasm dataset.

Keywords: CLIP; image-text classification; knowledge fusion; multi-modal sarcasm detection

Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study

4th International Workshop on Computational Approaches to Historical Language Change, LChange 2023

Authors: Chang, Kalvin; Robinson, Nathaniel R.; Cai, Anna; Chen, Ting; Zhang, Annie; Mortensen, David R. (School of Computer Science, Carnegie Mellon University, United States; Center for Language and Speech Processing, Johns Hopkins University, United States)

ISBN (print): 9798891760431

We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes. We train a neural network on these sound change data to weight articulatory distances between phones and predict intermediate sound change steps between historical protoforms and their modern descendants, replacing a linguistic expert in part of a parsimony-based phylogenetic inference algorithm. In our best experiments on Tukanoan languages, this method produces trees with a Generalized Quartet Distance of 0.12 from a tree that used expert annotations, a significant improvement over other semi-automated baselines. We discuss potential benefits and drawbacks to our neural approach and parsimony-based tree prediction. We also experiment with a minimal generalization learner for automatic sound law induction, finding it less effective than sound laws from expert annotation. Our code is publicly available. © 2023 Association for Computational Linguistics.

Keywords: Inference engines

System 1 Description of BV-SLP for Sindhi-English Machine Translation in MultiIndic22MT 2024 Shared Task

9th Conference on Machine Translation, WMT 2024

Authors: Joshi, Nisheeth; Katyayan, Pragya; Arora, Palak; Nathani, Bharti (Speech and Language Processing Lab, Banasthali Vidyapith, Rajasthan, India; School of Computer Science, University of Petroleum and Energy Studies, Uttrakhand, India)

ISBN (print): 9798891761797

This paper presents our machine translation system that was developed for the WAT2024 MultiIndic MT shared task. We built our system for the Sindhi-English language pair. We developed two MT systems. The first system was our baseline system, where Sindhi was translated into English. In the second system, we used Hindi as a pivot for the translation of text. In both cases, we identified the named entities and translated them into English as a preprocessing step. Once this was done, the standard NMT process was followed to train and generate MT outputs for the task. The systems were tested on the hidden dataset of the shared task. © 2024 Association for Computational Linguistics.

Keywords: Machine translation

Improved G723.1 Codec Speech Quality Under Burst Packet Loss Conditions

5th International Conference on Electrical Engineering and Control Applications, ICEECA 2022

Authors: Bakri, Adil; Mahdjane, Karima; Amrouche, Abderrahmane; Krobba, Ahmed (Scientific Research and Technical Center for the Development of Arabic Language (CRSTDLA), Algiers, Algeria; Speech Communication and Signal Processing Laboratory, Faculty of Electronics and Computer Science, USTHB, Algiers, Algeria)

ISBN (print): 9789819747757

In this paper, a Packet Loss Concealment (PLC) algorithm is proposed for G.723.1 CELP-type speech coders in order to improve the quality of decoded speech in VoIP under burst packet loss. The original PLC method implemented in this codec is based on the generation of excitations. The generated excitations are used as input to the LP synthesis filter based on past LP coefficients. Put simply, it relies on the last correctly received frame to generate a new frame that replaces the lost frame. The proposed PLC is designed as a PLC algorithm for G.723.1. The performance of the proposed method is compared with that of the PLC algorithm employed in G.723.1 using the Perceptual Evaluation of Speech Quality (PESQ). The experiments show that the proposed PLC algorithm provides significantly better speech quality than the PLC of G.723.1, especially in the presence of burst packet loss and voice onset conditions. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Packet loss

Zero-Shot Personalized Lip-To-Speech Synthesis with Face Image Based Voice Control

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Sheng, Zheng-Yan; Ai, Yang; Ling, Zhen-Hua (University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China)

ISBN (print): 9781728163277

Lip-to-speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to promote the ability of face-based speaker embeddings (FSE) on voice control. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods. To the best of our knowledge, this paper makes the first attempt at zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio to control voice characteristics. © 2023 IEEE.

Keywords: Speech synthesis

Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Ai, Yang; Ling, Zhen-Hua (University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China)

ISBN (print): 9781728163277

This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra by neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. Experimental results show that our proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in terms of both reconstructed speech quality and generation speed. © 2023 IEEE.

Keywords: Group delay
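
The abstract above describes two ideas without giving formulas: computing a phase from real and imaginary parts so that it stays in the principal value interval, and passing phase errors through an anti-wrapping function before they enter a loss. The following is only a minimal sketch under stated assumptions. The two-argument arctangent and the "distance to the nearest multiple of 2π" form of the anti-wrapping function are assumed here, not quoted from the abstract, and all function names are hypothetical.

```python
import math


def wrapped_phase(real: float, imag: float) -> float:
    """Phase of a complex value from its real and imaginary parts.

    atan2 always returns a value in [-pi, pi], i.e. the principal value interval.
    """
    return math.atan2(imag, real)


def anti_wrap(x: float) -> float:
    """Assumed anti-wrapping function: distance from x to the nearest multiple of 2*pi.

    The result lies in [0, pi], so errors that differ by full turns are treated alike.
    """
    two_pi = 2.0 * math.pi
    return abs(x - two_pi * round(x / two_pi))


def instantaneous_phase_loss(predicted, natural):
    """Mean anti-wrapped error between two equal-length sequences of phases."""
    errors = [anti_wrap(p - n) for p, n in zip(predicted, natural)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Phases 3.1 and -3.1 rad are almost the same angle; the raw difference is 6.2,
    # but the anti-wrapped error is only about 0.083.
    print(instantaneous_phase_loss([3.1], [-3.1]))
```

Under these assumptions, a prediction just below π and a target just above -π give a small loss rather than one close to 2π, which is the error expansion that the abstract says the anti-wrapping losses are designed to avoid.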

Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A Case Study for Modern Greek

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, vol. 32, pp. 286-299

Authors: Paraskevopoulos, Georgios; Kouzelis, Theodoros; Rouvalis, Georgios; Katsamanis, Athanasios; Katsouros, Vassilis; Potamianos, Alexandros (National Technical University of Athens, Graduate School of Electrical and Computer Engineering, Athens 10682, Greece; Athena Research Center, Institute for Speech and Language Processing, Marousi 15125, Greece; National Technical University of Athens, Faculty of Electrical and Computer Engineering, Athens 10682, Greece)

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work, we propose M2DS2, a simple and sample-efficient fine-tuning strategy for large pre-trained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments, we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem in a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines. © 2014 IEEE.

Keywords: Speech recognition
