Search Results - Inner Mongolia University Library

39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025

Authors: Wang, Chenxu; Jian, Ping; Yang, Zhen. Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing Institute of Technology, Beijing, China

ISBN: (print) 157735897X

Logical reading comprehension is a challenging task that involves understanding the underlying semantics of text and applying reasoning to deduce the correct answer. Prior research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Additionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into rationales. Subsequently, we employ multi-step prompts with identified premises to construct counterfactual contexts. To help the model better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares reasoning paths between the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method can improve the baselines substantially across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0). © 2025, Association for the Advancement of Artificial Intelligence (***). All rights reserved.
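The abstract does not state the form of the thought-path contrastive objective. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss, which is an assumption and not the authors' formulation: an anchor reasoning-path embedding is pulled toward a matching path and pushed away from paths taken from counterfactual samples. The function names, the temperature value, and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Generic InfoNCE loss: -log softmax score of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


# Toy embeddings: the loss is small when the positive path matches the anchor
# and large when a counterfactual path matches it instead.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

With these toy vectors, `aligned` is close to zero and `misaligned` is close to `1 / temperature`, which is the behavior a contrastive objective of this family is meant to produce.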

Keywords: Contrastive Learning

CSDNet: cross-sketch with dual gated attention for fine-grained image captioning network

Multimedia Tools and Applications, 2024, pp. 1-28

Authors: Hossain, Md. Shamim; Aktar, Shamima; Hossen, Md. Bipul; Hossain, Mohammad Alamgir; Gu, Naijie; Huang, Zhangjin. Affiliations: School of Computer Science and Technology, University of Science and Technology of China, Anhui, Hefei 230027, China; Deqing Alpha Innovation Institute, Huzhou 313299, China; Department of Mathematics, Jashore University of Science and Technology, Jashore 7408, Bangladesh; Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh; National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Anhui, Hefei 230027, China

In the realm of extracting inter- and intra-modal interactions, contemporary models often face challenges such as reduced computational efficiency, particularly when dealing with lengthy visual sequences. To address these issues, this study introduces an innovative model, the Cross-Sketch with Dual Gated Attention Network (CSDNet), designed to handle second-order intra- and inter-modal interactions by integrating two attention modules. Leveraging bilinear pooling to effectively capture these second-order interactions typically requires substantial computational resources due to the processing of large-dimensional tensors. Because of these resource demands, the first module, Cross-Sketch Attention (CSA), is proposed, which employs Cross-Tensor Sketch Pooling on attention features to reduce dimensionality while preserving crucial information without sacrificing caption quality. Furthermore, caption generation is enhanced by another novel attention module, Dual Gated Attention (DGA), which contributes additional spatial and channel-wise attention distributions. Our method demonstrates significant computational efficiency improvements, reducing computation time per epoch by an average of 13.54% compared to the base model, which leads to expedited convergence and improved performance metrics. Additionally, we observe a 0.07% enhancement in the METEOR score compared to the base model. Through the application of reinforcement learning optimization, our model achieves a remarkable CIDEr-D score of 132.2% on the MS-COCO dataset. This consistently outperforms baseline performance across a comprehensive range of evaluation metrics. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
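The abstract does not describe how Cross-Tensor Sketch Pooling is built, so the following is only a minimal sketch of the generic tensor sketch idea that such pooling is commonly based on, under the assumption that the paper uses something of this family. The count sketch of an outer product, which has len(x) * len(y) entries, can be obtained from two d-dimensional count sketches by circular convolution, so the full second-order tensor never has to be formed. The convolution is written directly here for clarity; practical implementations usually compute it with an FFT. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Hash = Tuple[List[int], List[int]]  # (bucket index, sign) for each input dimension


def make_hash(n: int, d: int, rng: random.Random) -> Hash:
    """Random bucket in [0, d) and random sign in {-1, +1} for each of n inputs."""
    buckets = [rng.randrange(d) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return buckets, signs


def count_sketch(x: Sequence[float], buckets: Sequence[int],
                 signs: Sequence[int], d: int) -> List[float]:
    """Project x to d dimensions by adding signed entries into hashed buckets."""
    out = [0.0] * d
    for value, bucket, sign in zip(x, buckets, signs):
        out[bucket] += sign * value
    return out


def circular_convolve(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Direct O(d^2) circular convolution of two length-d vectors."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out


def tensor_sketch(x, y, hx: Hash, hy: Hash, d: int) -> List[float]:
    """Sketch of the outer product of x and y without forming the product."""
    return circular_convolve(count_sketch(x, *hx, d), count_sketch(y, *hy, d))


def sketch_of_outer_product(x, y, hx: Hash, hy: Hash, d: int) -> List[float]:
    """Reference: count sketch of the explicit outer product, combined hash."""
    (bx, sx), (by, sy) = hx, hy
    out = [0.0] * d
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[(bx[i] + by[j]) % d] += sx[i] * sy[j] * xi * yj
    return out
```

The two functions `tensor_sketch` and `sketch_of_outer_product` return the same vector up to floating-point rounding, which is the identity that makes sketch-based bilinear pooling cheaper than working with the explicit outer product.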

Keywords: Tensors

Anchored Monotonic Alignment and Representation Substitution for Rare Spontaneous Behaviors in Spontaneous Speech Synthesis

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Wu, Ning-Qian; Hu, Ya-Jun; Chen, Liping; Ling, Zhen-Hua. Affiliations: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; iFLYTEK Research, iFLYTEK Co. Ltd., China; MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, China

ISBN: (print) 9798350368741

Spontaneous behaviors in speech pose significant challenges for speech synthesis. Existing research has not adequately addressed these behaviors, with most studies relying on specially recorded datasets. In contrast, real-world data more accurately reflects the natural, spontaneous speaking styles in everyday life and encompasses a wider range of spontaneous behaviors. However, such data is often of lower quality, and the distribution of spontaneous behaviors is highly imbalanced. In this study, we explore spontaneous speech synthesis using real-world data within the VITS2 framework. To overcome these challenges, we introduce two techniques: anchored monotonic alignment and spontaneous hidden representation substitution. Experimental results demonstrate that these methods enhance model alignment and improve the naturalness of the generated speech. Our proposed approach successfully addresses the challenge of synthesizing rare spontaneous behaviors and offers users flexible control over the synthesized speech. © 2025 IEEE.

Keywords: real-world data; speech synthesis; spontaneous behaviors; spontaneous speech synthesis; VITS

MULTI-CROSSRE: A Multi-Lingual Multi-Domain Dataset for Relation Extraction

24th Nordic Conference on Computational Linguistics, NoDaLiDa 2023

Authors: Bassignana, Elisa; Ginter, Filip; Pyysalo, Sampo; van der Goot, Rob; Plank, Barbara. Affiliations: Department of Computer Science, IT University of Copenhagen, Denmark; TurkuNLP, Department of Computing, University of Turku, Finland; MaiNLP, Center for Information and Language Processing, LMU Munich, Germany

ISBN: (print) 9789916219997

Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose MULTI-CROSSRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. MULTI-CROSSRE is a machine-translated version of CrossRE (Bassignana and Plank, 2022a), with a sub-portion including more than 200 sentences in seven diverse languages checked by native speakers. We run a baseline model over the 26 new datasets and, as a sanity check, over the 26 back-translations to English. Results on the back-translated data are consistent with the ones on the original English CrossRE, indicating high quality of the translation and the resulting dataset. © 2023 Association for Computational Linguistics.

Keywords: Data assimilation

A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication

arXiv, 2025

Authors: Jiang, Xiao-Hang; Ai, Yang; Zheng, Rui-Chen; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

This paper proposes StreamCodec, a streamable neural audio codec designed for real-time communication. StreamCodec adopts a fully causal, symmetric encoder-decoder structure and operates in the modified discrete cosine transform (MDCT) domain, aiming for low-latency inference and real-time efficient generation. To improve codebook utilization efficiency and compensate for the audio quality loss caused by structural causality, StreamCodec introduces a novel residual scalar-vector quantizer (RSVQ). The RSVQ sequentially connects scalar quantizers and improved vector quantizers in a residual manner, constructing coarse audio contours and refining acoustic details, respectively. Experimental results confirm that the proposed StreamCodec achieves decoded audio quality comparable to advanced non-streamable neural audio codecs. Specifically, on the 16 kHz LibriTTS dataset, StreamCodec attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of only 20 ms and achieves a generation speed nearly 20 times real-time on a CPU, with a lightweight model size of just 7M parameters, making it highly suitable for real-time communication applications. Copyright © 2025, The Authors. All rights reserved.
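The residual connection between a scalar quantizer and a vector quantizer described above can be sketched as follows. This is a minimal two-stage illustration under stated assumptions, not the RSVQ from the paper: it uses a single uniform scalar stage and a single plain nearest-neighbor codebook, whereas the abstract mentions multiple quantizers and an improved vector quantizer whose details are not given. The function names, the step size, and the codebook values are hypothetical.

```python
from typing import List, Sequence, Tuple


def scalar_quantize(x: Sequence[float], step: float) -> List[float]:
    """Round each element to the nearest multiple of `step` (coarse contour)."""
    return [round(v / step) * step for v in x]


def nearest_code(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook vector closest to x in squared Euclidean distance."""
    def squared_distance(code: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, code))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def rsvq_encode_decode(
    x: Sequence[float], step: float, codebook: Sequence[Sequence[float]]
) -> Tuple[List[float], int, List[float]]:
    """Quantize x coarsely, then vector-quantize what the coarse stage missed."""
    coarse = scalar_quantize(x, step)
    # The residual is the detail that the scalar stage could not represent.
    residual = [a - b for a, b in zip(x, coarse)]
    index = nearest_code(residual, codebook)
    # Decoding adds the selected codebook vector back onto the coarse value.
    reconstruction = [c + r for c, r in zip(coarse, codebook[index])]
    return coarse, index, reconstruction


# Hypothetical toy values for illustration.
toy_codebook = [[0.0, 0.0], [-0.1, -0.25], [0.2, 0.2]]
coarse, index, reconstruction = rsvq_encode_decode([0.9, -0.3], 1.0, toy_codebook)
```

In this toy case the scalar stage alone yields `[1.0, 0.0]`, and adding the selected residual code moves the reconstruction closer to the input, which is the refinement role the abstract assigns to the vector quantizers.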

Keywords: Discrete cosine transforms

Subject Disentanglement Neural Network for Speech Envelope Reconstruction from EEG

IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Authors: Li Zhang; Jiyao Liu; Lei Xie. Affiliation: Audio, Speech and Language Processing Group (ASLP), School of Computer Science, Northwestern Polytechnical University (NPU), Xi’an, China

ISBN: (digital) 9798350386226

ISBN: (print) 9798350386233

Reconstructing speech envelopes from EEG signals is essential for exploring neural mechanisms underlying speech perception. Yet, EEG variability across subjects and physiological artifacts complicate accurate reconstruction. To address this problem, we introduce Subject Disentangling Neural Network (SDN-Net), which disentangles subject identity information from reconstructed speech envelopes to enhance cross-subject reconstruction accuracy. SDN-Net integrates three key components: MLA-Codec, MPN-MI, and CTA-MTDNN. The MLA-Codec, a fully convolutional neural network, decodes EEG signals into speech envelopes. The CTA-MTDNN module, a multi-scale time-delay neural network with channel and temporal attention, extracts subject identity features from EEG signals. Lastly, the MPN-MI module, a mutual information estimator with a multilayer perceptron, supervises the removal of subject identity information from the reconstructed speech envelope. Experiments on the Auditory EEG Decoding Dataset demonstrate that SDN-Net achieves superior performance in inner- and cross-subject speech envelope reconstruction compared to recent state-of-the-art methods.

Keywords: Accuracy; Speech coding; Neural networks; Speech enhancement; Feature extraction; Electroencephalography; Physiology; Decoding; Convolutional neural networks; Mutual information

STAGE-WISE AND PRIOR-AWARE NEURAL SPEECH PHASE PREDICTION

arXiv, 2024

Authors: Liu, Fei; Ai, Yang; Du, Hui-Peng; Lu, Ye-Xin; Zheng, Rui-Chen; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from the input amplitude spectrum using two-stage neural networks. In the initial prior-construction stage, we preliminarily predict a rough prior phase spectrum from the amplitude spectrum. The subsequent refinement stage transforms the amplitude spectrum into a refined high-quality phase spectrum conditioned on the prior phase. Networks in both stages use ConvNeXt v2 blocks as the backbone and adopt adversarial training by innovatively introducing a phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, we also incorporate a time-frequency integrated difference (TFID) loss in the refinement stage. Experimental results confirm that, compared to neural network-based no-prior phase prediction methods, the proposed SP-NSPP achieves higher phase prediction accuracy, thanks to introducing the coarse phase priors and diverse training criteria. Compared to iterative phase estimation algorithms, our proposed SP-NSPP does not require multiple rounds of staged iterations, resulting in higher generation efficiency. Copyright © 2024, The Authors. All rights reserved.

Keywords: Prediction models

APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

arXiv, 2024

Authors: Du, Hui-Peng; Ai, Yang; Zheng, Rui-Chen; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. The APCodec+ takes the audio amplitude and phase spectra as the coding object, and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that our proposed APCodec+ at low bitrates achieves comparable performance with baseline codecs at higher bitrates, thanks to the proposed staged training paradigm. Copyright © 2024, The Authors. All rights reserved.

Keywords: Discriminators

MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios

arXiv, 2024

Authors: Jiang, Xiao-Hang; Ai, Yang; Zheng, Rui-Chen; Du, Hui-Peng; Lu, Ye-Xin; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

In this paper, we propose MDCTCodec, an efficient lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of audio as input, encoding it into a continuous latent code which is then discretized by a residual vector quantizer (RVQ). Subsequently, the decoder decodes the MDCT spectrum from the quantized latent code and reconstructs audio via inverse MDCT. During the training phase, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to discriminate the natural or decoded MDCT spectrum for adversarial training. Experimental results confirm that, in scenarios with high sampling rates and low bitrates, the MDCTCodec exhibited high decoded audio quality, improved training and generation efficiency, and compact model size compared to baseline codecs. Specifically, the MDCTCodec achieved a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus. Copyright © 2024, The Authors. All rights reserved.
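The MDCT and inverse MDCT that bracket the codec's encoder and decoder can be illustrated with a small reference implementation. This is a sketch of the textbook transform only, assuming a sine window and 50% overlap; the abstract does not give the paper's window, frame size, or normalization, and the neural encoder, RVQ, and decoder are omitted. Function names are hypothetical. Each frame of 2N samples yields N coefficients, and overlap-adding the windowed inverse transforms cancels the time-domain aliasing.

```python
import math
from typing import List, Sequence


def sine_window(n_coeffs: int) -> List[float]:
    """Sine window of length 2N; it satisfies w[n]^2 + w[n+N]^2 = 1."""
    length = 2 * n_coeffs
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def _basis(n: int, k: int, n_coeffs: int) -> float:
    """MDCT cosine basis value for time index n and frequency index k."""
    return math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))


def mdct(frame: Sequence[float], window: Sequence[float]) -> List[float]:
    """Windowed MDCT: 2N time samples to N coefficients."""
    n_coeffs = len(frame) // 2
    return [
        sum(window[n] * frame[n] * _basis(n, k, n_coeffs)
            for n in range(2 * n_coeffs))
        for k in range(n_coeffs)
    ]


def imdct(coeffs: Sequence[float], window: Sequence[float]) -> List[float]:
    """Windowed inverse MDCT: N coefficients to 2N aliased time samples."""
    n_coeffs = len(coeffs)
    return [
        window[n] * (2.0 / n_coeffs)
        * sum(coeffs[k] * _basis(n, k, n_coeffs) for k in range(n_coeffs))
        for n in range(2 * n_coeffs)
    ]


def analysis_synthesis(signal: Sequence[float], n_coeffs: int) -> List[float]:
    """Transform overlapping frames (hop N) and overlap-add the inverses."""
    window = sine_window(n_coeffs)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_coeffs + 1, n_coeffs):
        frame = signal[start:start + 2 * n_coeffs]
        rebuilt = imdct(mdct(frame, window), window)
        for n, value in enumerate(rebuilt):
            out[start + n] += value
    return out
```

Samples covered by two overlapping frames are reconstructed up to floating-point error; the first and last N samples are covered by only one frame and would need padding in a real pipeline. In a codec of the kind described, the coefficients returned by `mdct` would be encoded and quantized before `imdct` is applied to the decoded spectrum.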

Keywords: Cosine transforms

Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control

arXiv, 2024

Authors: Lu, Ye-Xin; Ai, Yang; Sheng, Zheng-Yan; Ling, Zhen-Hua. Affiliation: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

The majority of existing speech bandwidth extension (BWE) methods operate under the constraint of fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose a multi-stage speech BWE model named MS-BWE, which can handle a set of source and target sampling rate pairs and achieve flexible extensions of frequency bandwidth. The proposed MS-BWE model comprises a cascade of BWE blocks, with each block featuring a dual-stream architecture to realize amplitude and phase extension, progressively painting the speech frequency bands stage by stage. The teacher-forcing strategy is employed to mitigate the discrepancy between training and inference. Experimental results demonstrate that our proposed MS-BWE is comparable to state-of-the-art speech BWE methods in speech quality. Regarding generation efficiency, the one-stage generation of MS-BWE can achieve over one thousand times real-time on GPU and about sixty times on CPU. Copyright © 2024, The Authors. All rights reserved.

Keywords: Bandwidth
