Search Results - Inner Mongolia University Library

Integrating Time-Frequency Domain Shallow and Deep Features for Speech-EEG Match-Mismatch of Auditory Attention Decoding

Journal of Shanghai Jiaotong University (Science), 2025, pp. 1-7

Authors: Zhang, Yubang; Zhu, Qiushi; Xu, Qingtian; Zhang, Jie (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China)

Electroencephalogram (EEG) signals provide an important pathway to reflect brain activations, from which auditory attention clues of the listener can be decoded, a task termed auditory attention decoding (AAD). However, existing AAD methods primarily rely on temporal or frequency features of audio and shallow features of EEG. In this work, we propose a new model-fusion-based AAD method with residual dilated convolution blocks, which considers both shallow and deep attention mechanisms as well as time-frequency domain features. In addition, EEG data from different utterances are mixed with the selected EEG segment for augmentation to increase the sample diversity. The effectiveness of our approach is verified on the match-mismatch task of the ICASSP 2024 Auditory EEG Challenge, which is a typical example of AAD. It performs much better than the baseline and state-of-the-art methods. © Shanghai Jiao Tong University 2025.

Keywords: Electroencephalography

APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

14th International Symposium on Chinese Spoken Language Processing, ISCSLP 2024

Authors: Du, Hui-Peng; Ai, Yang; Zheng, Rui-Chen; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China)

ISBN: (print) 9798331516826

This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. The APCodec+ takes the audio amplitude and phase spectra as the coding object, and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that our proposed APCodec+ at low bitrates achieves comparable performance with baseline codecs at higher bitrates, thanks to the proposed staged training paradigm. ©2024 IEEE.

Keywords: Discriminators

APNet2: High-Quality and High-Efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

Authors: Du, Hui-Peng; Lu, Ye-Xin; Ai, Yang; Ling, Zhen-Hua (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China)

ISBN: (digital) 9789819706013

ISBN: (print) 9789819706006

In our previous work, we have proposed a neural vocoder called APNet, which directly predicts speech amplitude and phase spectra with a 5 ms frame shift in parallel from the input acoustic features, and then reconstructs the 16 kHz speech waveform using inverse short-time Fourier transform (ISTFT). The APNet vocoder demonstrates the capability to generate synthesized speech of comparable quality to the HiFi-GAN vocoder but with a considerably improved inference speed. However, the performance of the APNet vocoder is constrained by the waveform sampling rate and spectral frame shift, limiting its practicality for high-quality speech synthesis. Therefore, this paper proposes an improved iteration of APNet, named APNet2. The proposed APNet2 vocoder adopts ConvNeXt v2 as the backbone network for amplitude and phase predictions, expecting to enhance the modeling capability. Additionally, we introduce a multi-resolution discriminator (MRD) into the GAN-based losses and optimize the form of certain losses. At a common configuration with a waveform sampling rate of 22.05 kHz and spectral frame shift of 256 points (i.e., approximately 11.6 ms), our proposed APNet2 vocoder outperforms the original APNet and Vocos in terms of synthesized speech quality. The synthesized speech quality of APNet2 is also comparable to that of HiFi-GAN and iSTFTNet, while offering a significantly faster inference speed. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Discriminators
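
The frame-shift figures quoted in the APNet2 abstract above can be checked with a small unit conversion. The following is only an illustrative sketch: the helper names `frame_shift_ms` and `hop_samples_for` are hypothetical and are not taken from the APNet or APNet2 code.

```python
def frame_shift_ms(hop_samples: int, sample_rate_hz: float) -> float:
    """Convert a spectral frame shift given in samples to milliseconds."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return 1000.0 * hop_samples / sample_rate_hz


def hop_samples_for(shift_ms: float, sample_rate_hz: float) -> int:
    """Convert a frame shift in milliseconds to a whole number of samples."""
    return round(shift_ms * sample_rate_hz / 1000.0)


# APNet2 configuration from the abstract: 256-point shift at 22.05 kHz.
apnet2_shift = frame_shift_ms(256, 22050)   # roughly 11.6 ms

# Original APNet configuration from the abstract: 5 ms shift at 16 kHz.
apnet_hop = hop_samples_for(5.0, 16000)     # 80 samples
```

A longer frame shift means fewer spectral frames per second of audio for the vocoder to predict.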

Which is more faithful, seeing or saying? Multimodal sarcasm detection exploiting contrasting sentiment knowledge

CAAI Transactions on Intelligence Technology, 2025, vol. 10, no. 2, pp. 375-386

Authors: Yutao Chen; Shumin Shi; Heyan Huang (School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing, China)

Using sarcasm on social media platforms to express negative opinions towards a person or object has become increasingly ***, detecting sarcasm in various forms of communication can be difficult due to conflicting *** this paper, we introduce a contrasting sentiment-based model for multimodal sarcasm detection (CS4MSD), which identifies inconsistent emotions by leveraging the CLIP knowledge module to produce sentiment features in both text and ***, five external sentiments are introduced to prompt the model learning sentimental preferences among ***, we highlight the importance of verbal descriptions embedded in illustrations and incorporate additional knowledge-sharing modules to fuse such imagelike *** results demonstrate that our model achieves state-of-the-art performance on the public multimodal sarcasm dataset.

Keywords: CLIP; image-text classification; knowledge fusion; multi-modal sarcasm detection

Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study

4th International Workshop on Computational Approaches to Historical Language Change, LChange 2023

Authors: Chang, Kalvin; Robinson, Nathaniel R.; Cai, Anna; Chen, Ting; Zhang, Annie; Mortensen, David R. (School of Computer Science, Carnegie Mellon University, United States; Center for Language and Speech Processing, Johns Hopkins University, United States)

ISBN: (print) 9798891760431

We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes. We train a neural network on these sound change data to weight articulatory distances between phones and predict intermediate sound change steps between historical protoforms and their modern descendants, replacing a linguistic expert in part of a parsimony-based phylogenetic inference algorithm. In our best experiments on Tukanoan languages, this method produces trees with a Generalized Quartet Distance of 0.12 from a tree that used expert annotations, a significant improvement over other semi-automated baselines. We discuss potential benefits and drawbacks to our neural approach and parsimony-based tree prediction. We also experiment with a minimal generalization learner for automatic sound law induction, finding it less effective than sound laws from expert annotation. Our code is publicly available. © 2023 Association for Computational Linguistics.

Keywords: Inference engines

System 1 Description of BV-SLP for Sindhi-English Machine Translation in MultiIndic22MT 2024 Shared Task

9th Conference on Machine Translation, WMT 2024

Authors: Joshi, Nisheeth; Katyayan, Pragya; Arora, Palak; Nathani, Bharti (Speech and Language Processing Lab, Banasthali Vidyapith, Rajasthan, India; School of Computer Science, University of Petroleum and Energy Studies, Uttrakhand, India)

ISBN: (print) 9798891761797

This paper presents our machine translation system that was developed for the WAT2024 MultiIndic MT shared task. We built our system for the Sindhi-English language pair. We developed two MT systems. The first system was our baseline system, where Sindhi was translated directly into English. In the second system, we used Hindi as a pivot for the translation of text. In both cases, we identified the named entities and translated them into English as a preprocessing step. Once this was done, the standard NMT process was followed to train and generate MT outputs for the task. The systems were tested on the hidden dataset of the shared task. ©2024 Association for Computational Linguistics.

Keywords: Machine translation

Improved G723.1 Codec Speech Quality Under Burst Packet Loss Conditions

5th International Conference on Electrical Engineering and Control Applications, ICEECA 2022

Authors: Bakri, Adil; Mahdjane, Karima; Amrouche, Abderrahmane; Krobba, Ahmed (Scientific Research and Technical Center for the Development of Arabic Language CRSTDLA, Algiers, Algeria; Speech Communication and Signal Processing Laboratory, Faculty of Electronics and Computer Science, USTHB, Algiers, Algeria)

ISBN: (print) 9789819747757

In this paper, a Packet Loss Concealment (PLC) algorithm is proposed for G.723.1 CELP-type speech coders in order to improve the quality of decoded speech in VoIP under burst packet loss. The original PLC method implemented in this codec is based on the generation of excitations. The generated excitations are used as input to the LP synthesis filter based on past LP coefficients. Put simply, it relies on the last correctly received frame to generate a new frame that replaces the lost frame. The proposed PLC is designed as a PLC algorithm for G.723.1. The performance of the proposed method is compared with that of the PLC algorithm employed in G.723.1 using the Perceptual Evaluation of Speech Quality (PESQ). The experiments have shown that the proposed PLC algorithm significantly improves speech quality over the PLC of G.723.1, especially in the presence of burst packet loss and voice onset conditions. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Packet loss

Zero-Shot Personalized Lip-To-Speech Synthesis with Face Image Based Voice Control

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Sheng, Zheng-Yan; Ai, Yang; Ling, Zhen-Hua (University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China)

ISBN: (print) 9781728163277

Lip-to-speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under the zero-shot condition, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to promote the ability of face-based speaker embeddings (FSE) on voice control. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods. To the best of our knowledge, this paper makes the first attempt at zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio to control voice characteristics. © 2023 IEEE.

Keywords: Speech synthesis

Speech Reconstruction from Silent Tongue and Lip Articulation by Pseudo Target Generation and Domain Adversarial Training

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Zheng, Rui-Chen; Ai, Yang; Ling, Zhen-Hua (University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China)

ISBN: (print) 9781728163277

This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model. When using an automatic speech recognition (ASR) model to measure intelligibility, the word error rate (WER) of our proposed method decreases by over 15% compared to the baseline. In addition, our proposed method also outperforms the baseline on the intelligibility of the speech reconstructed in vocalized articulating mode, reducing the WER by approximately 10%. © 2023 IEEE.

Keywords: Iterative methods
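
The abstract above reports intelligibility as word error rate (WER). As a reminder of how a single WER value is usually obtained, here is a minimal sketch using the standard word-level edit distance. It is not the authors' evaluation script, and the function name `word_error_rate` is a hypothetical helper; the abstract also does not say whether the reported reductions are absolute or relative, so the sketch computes only the raw rate.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One word of a six-word reference is missing from the ASR output.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

In an evaluation of this kind, the reference would be the ground-truth transcript and the hypothesis the ASR output for the reconstructed speech.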

Deepfake Algorithm Recognition System with Augmented Data for ADD 2023 Challenge

2023 Workshop on Deepfake Audio Detection and Analysis, DADA 2023

Authors: Zeng, Xiao-Min; Zhang, Jian-Tao; Li, Kang; Liu, Zhuo-Li; Xie, Wei-Lin; Song, Yan (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China)

In this paper, we describe our submitted systems to the ADD2023 Challenge Track 3–Deepfake algorithm recognition (AR). This task requires not only identifying known deepfake algorithms in closed-set but also distinguishing unknown algorithms. By closed-set classification experiments, we select the output of the pre-trained wav2vec2.0-base model as acoustic features. Then, we apply the ECAPA-TDNN model to recognize different deepfake algorithms and determine whether the samples belong to the unknown algorithms by threshold. Besides, we adopt data augmentation to improve the generalization and robustness of our model. We evaluate our system on the ADD2023 Challenge Track 3 and achieve a 75.41% F1-score. Our submission ranked third in the deepfake algorithm recognition track of the ADD2023 Challenge. © 2023 CEUR-WS. All rights reserved.

Keywords: data augmentation; deepfake algorithm recognition; open-set recognition
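
The abstract above says only that samples are assigned to unknown algorithms "by threshold". One common way to turn a closed-set classifier into an open-set one is to reject the top-scoring class when its score falls below a threshold. The sketch below illustrates that general idea under stated assumptions: the score type, the threshold value, the class names and the helper `open_set_label` are all hypothetical and are not taken from the submitted system.

```python
from typing import Dict

UNKNOWN = "unknown"


def open_set_label(scores: Dict[str, float], threshold: float) -> str:
    """Return the best-scoring known class, or UNKNOWN if its score is too low."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else UNKNOWN


# Hypothetical per-class scores for two test utterances.
confident = {"algo_a": 0.91, "algo_b": 0.06, "algo_c": 0.03}
uncertain = {"algo_a": 0.40, "algo_b": 0.35, "algo_c": 0.25}

label_confident = open_set_label(confident, threshold=0.5)  # a known class
label_uncertain = open_set_label(uncertain, threshold=0.5)  # rejected
```

The threshold trades off false rejections of known algorithms against false acceptances of unknown ones, so in practice it would be tuned on held-out data.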
