Search Results - Inner Mongolia University Library

arXiv, 2025

Authors: Wang, Tianrui; Ge, Meng; Gong, Cheng; Qiang, Chunyu; Wang, Haoyu; Huang, Zikang; Jiang, Yu; Wang, Xiaobao; Chen, Xie; Wang, Longbiao; Dang, Jianwu. Affiliations: Tianjin Key Laboratory of Cognitive Computing and Application College of Intelligence and Computing Tianjin University Tianjin China Guangdong Laboratory of Artificial Intelligence and Digital Economy Guangdong China China Telecom Beijing China MoE Key Lab of Artificial Intelligence AI Institute Shanghai Jiao Tong University Shanghai China Co. Ltd Tianjin China Shenzhen Institute of Advanced Technology Chinese Academy of Sciences Guangdong China

Recently, emotional speech generation and speaker cloning have garnered significant interest in text-to-speech (TTS). With the open-sourcing of codec language TTS models trained on massive datasets with large-scale parameters, adapting these general pre-trained TTS models to generate speech with specific emotional expressions and target speaker characteristics has become a topic of great attention. Common approaches, such as full and adapter-based fine-tuning, often overlook the specific contributions of model parameters to emotion and speaker control. Treating all parameters uniformly during fine-tuning, especially when the target data has limited content diversity compared to the pre-training corpus, results in slow training speed and an increased risk of catastrophic forgetting. To address these challenges, we propose a characteristic-specific partial fine-tuning strategy, abbreviated as CSP-FT. First, we use a weighted-sum approach to analyze the contributions of different Transformer layers in a pre-trained codec language TTS model for emotion and speaker control in the generated speech. We then selectively fine-tune the layers with the highest and lowest characteristic-specific contributions to generate speech with target emotional expression and speaker identity. Experimental results demonstrate that our method achieves performance comparable to, or even surpassing, full fine-tuning in generating speech with specific emotional expressions and speaker identities. Additionally, CSP-FT delivers approximately 2× faster training speeds, fine-tunes only around 8% of parameters, and significantly reduces catastrophic forgetting. Furthermore, we show that codec language TTS models perform competitively with self-supervised models in speaker identification and emotion classification tasks, offering valuable insights for developing universal speech processing models. © 2025, CC BY.
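The layer-selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax normalization of the weighted-sum coefficients, the function name `select_layers`, the example logits, and the counts `n_high` and `n_low` are all assumptions made for illustration. The abstract only states that layers with the highest and lowest characteristic-specific contributions are fine-tuned.

```python
import math

def softmax(xs):
    """Normalize raw per-layer coefficients into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_layers(layer_logits, n_high, n_low):
    """Return (sorted layer indices to fine-tune, normalized weights).

    Picks the n_high layers with the largest weights and the n_low
    layers with the smallest weights. Both counts are hypothetical knobs.
    """
    weights = softmax(layer_logits)
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    lowest = order[:n_low]
    highest = order[-n_high:] if n_high > 0 else []
    return sorted(set(lowest) | set(highest)), weights

# Hypothetical weighted-sum coefficients learned for a 6-layer model.
logits = [0.1, 2.0, -1.5, 0.3, 0.0, 1.2]
trainable, weights = select_layers(logits, n_high=1, n_low=1)

# Every layer not selected would stay frozen during fine-tuning.
frozen = [i for i in range(len(logits)) if i not in trainable]
```

In a real model, the indices in `trainable` would decide which Transformer layers keep gradients enabled while the rest are frozen.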

Keywords: C (programming language)

DAT: Dual-branch Adapter-Tuning for Few-shot Recognition

IEEE Transactions on Circuits and Systems for Video Technology, 2025

Authors: Chen, Junxi; Wu, Guangxing; Li, Hongxiang; Chen, Jiankang; Zhang, Wentao; Zheng, Weishi; Wang, Ruixuan. Affiliations: Sun Yat-sen University College of Computer Science and Engineering Guangzhou 510275 China Peking University School of Electronic and Computer Engineering Shenzhen 518055 China Ministry of Education Key Laboratory of Machine Intelligence and Advanced Computing Guangzhou 510275 China Sun Yat-sen University School of Computer Science and Engineering Guangzhou 510275 China Peng Cheng Laboratory Shenzhen 518066 China

Parameter-Efficient Fine-Tuning methods based on vision-language models (such as CLIP) for few-shot learning have recently received considerable attention. However, previous works only fine-tune either the image or text branch, breaking the alignment of the original two branches, meanwhile fine-tuning both branches of the CLIP would inevitably introduce more trainable parameters and likely cause more severe over-fitting due to the limited training data. In this study, we propose a novel Dual-branch Adapter-Tuning framework (DAT), which collaboratively trains the visual adapter and textual adapter added to the two branches of the original CLIP with multiple consistency constraints. By effectively utilizing the semantically detailed class-specific prompts and outputs of the original CLIP to guide the fine-tuning of both branches, our method gains exceptional adaptation ability to the downstream few-shot learning tasks and alleviates the over-fitting issue, meanwhile maximally preserving the generalization ability of the original CLIP model. Our proposed framework has achieved superior performance on diverse datasets under various few-shot learning settings compared to the existing approaches. The source code is available at https://***/SandyXi/DAT. © 1991-2012 IEEE.

Keywords: Visual languages

SPARK: Simple and Parameter-free Knowledge Embedding with Fuzzy Cognitive Maps for Class Incremental Learning

IEEE Transactions on Fuzzy Systems, 2025

Authors: Wang, Yu; Xie, Jiabo; Zheng, Junyan; Lu, Bingxu; Bi, Yanxian; Hu, Qinghua. Affiliations: Tianjin University College of Intelligence and Computing Tianjin 300350 China Key Laboratory for Machine Learning of Tianjin China Haihe Laboratory of Information Technology Application Innovation Tianjin China CETC Academy of Electronics and Information Technology Group Co. Ltd. China Academy of Electronic and Information Technology China

Class incremental learning (CIL) aims to mitigate catastrophic forgetting of previously learned classes when integrating new knowledge. A primary challenge contributing to forgetting is the absence of data from earlier classes. Researchers have designed a variety of methods to solve the problem, among which topology-preserving methods show tremendous potential. However, two problems remain: (1) A large hyperparameter search space for constructing and utilizing a complex topology hinders efficient performance optimization;(2) Constraining the network to preserve the topology in the objective makes it difficult to optimize. This paper proposes SPARK, a simple and parameter-free method by embedding fuzzy cognitive maps, to address the problems. First, we construct a fuzzy cognitive map with nodes representing class prototypes and edges representing inter-class similarities. Then, we exploit the fuzzy cognitive map to obtain class-level embedding by aggregating features of other classes for each class. Finally, the class- and sample-level embeddings are fused and fed to the classifier. The proposed method can be easily optimized without introducing additional loss terms and hyperparameters. We theoretically prove that such a simple fuzzy cognitive map embedding can efficiently preserve the structural information of the fuzzy cognitive map. Experimental results indicate that SPARK achieves up to 5.85% higher average accuracy and 8.69% reduction in forgetting compared to baseline methods. © 1993-2012 IEEE.
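The three steps named in this abstract (build a map over class prototypes, aggregate other classes' features into a class-level embedding, fuse with the sample-level embedding) can be sketched as follows. This is an illustrative reading only: the use of cosine similarity for edge weights, the normalization by the sum of absolute edge weights, and fusion by element-wise addition are assumptions, and the names `class_embeddings` and `fuse` are hypothetical. The abstract does not specify these details.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def class_embeddings(prototypes):
    """For each class, aggregate the prototypes of the other classes,
    weighted by the similarity edges of the map (no learned parameters)."""
    n = len(prototypes)
    dim = len(prototypes[0])
    out = []
    for i in range(n):
        # Edge weights from class i to every other class; no self-loop.
        w = [cosine(prototypes[i], prototypes[j]) if j != i else 0.0
             for j in range(n)]
        norm = sum(abs(v) for v in w) or 1.0
        out.append([sum(w[j] * prototypes[j][d] for j in range(n)) / norm
                    for d in range(dim)])
    return out

def fuse(sample_feature, class_embedding):
    """Placeholder fusion rule (assumption): element-wise addition."""
    return [s + c for s, c in zip(sample_feature, class_embedding)]

# Three toy class prototypes in a 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb = class_embeddings(prototypes)
fused = fuse([0.5, 0.5], emb[0])
```

Because the aggregation uses only similarities computed from the prototypes themselves, the sketch adds no loss terms or tunable hyperparameters, which matches the "parameter-free" description in spirit.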

Keywords: Fuzzy Cognitive Maps

CroPrompt: Cross-task Interactive Prompting for Zero-shot Spoken Language Understanding

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Libo Qin, Fuxuan Wei, Qiguang Chen, Jingxuan Zhou, Shijue Huang, Jiasheng Si, Wenpeng Lu, Wanxiang Che. Affiliations: School of Computer Science and Engineering Central South University China Key Laboratory of Data Intelligence and Advanced Computing in Provincial Universities Soochow University China Research Center for Social Computing and Information Retrieval Harbin Institute of Technology China Harbin Institute of Technology Shenzhen China Key Laboratory of Computing Power Network and Information Security Ministry of Education Qilu University of Technology (Shandong Academy of Sciences)

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Slot filling and intent detection are two highly correlated tasks in spoken language understanding (SLU). Recent SLU research attempts to explore zero-shot prompting techniques in large language models to alleviate the data scarcity problem. Nevertheless, the existing prompting work ignores the cross-task interaction information for SLU, which leads to sub-optimal performance. To solve this problem, we present the pioneering work of Cross-task Interactive Prompting (CroPrompt) for SLU, which enables the model to interactively leverage the information exchange across the correlated tasks in SLU. Additionally, we further introduce a multi-task self-consistency mechanism to mitigate the error propagation caused by the intent information injection. We conduct extensive experiments on the standard SLU benchmark and the results reveal that CroPrompt consistently outperforms the existing prompting approaches. In addition, the multi-task self-consistency mechanism can effectively ease the error propagation issue, thereby enhancing the performance. We hope this work can inspire more research on cross-task prompting for SLU.
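One plausible reading of a multi-task self-consistency step is a majority vote over several sampled model outputs, taken jointly over the intent and its slots so that the two tasks stay consistent with each other. The sketch below shows that idea only. The abstract does not describe the voting rule, so the joint vote, the function name `joint_self_consistency`, and the sample data are all assumptions.

```python
from collections import Counter

def joint_self_consistency(samples):
    """Return the (intent, slots) pair that occurs most often.

    `samples` is a list of (intent, slots_dict) pairs, e.g. parsed from
    several sampled LLM responses to the same utterance (hypothetical).
    """
    # Dicts are not hashable, so vote on a canonical tuple form.
    keyed = [(intent, tuple(sorted(slots.items()))) for intent, slots in samples]
    (intent, slot_items), _count = Counter(keyed).most_common(1)[0]
    return intent, dict(slot_items)

# Three hypothetical sampled predictions for one utterance.
samples = [
    ("BookFlight", {"to": "Tokyo"}),
    ("BookFlight", {"to": "Tokyo"}),
    ("GetWeather", {"city": "Tokyo"}),
]
intent, slots = joint_self_consistency(samples)
```

Voting on the pair rather than on each task separately avoids combining an intent from one sample with slots that were generated under a different intent.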

Keywords: Large language models; Signal processing; Benchmark testing; Multitasking; Filling; Acoustics; Speech processing; Information exchange; Standards

Differential-Trust-Mechanism Based Trade-off Method Between Privacy and Accuracy in Recommender Systems

IEEE Transactions on Information Forensics and Security, 2025, vol. 20, pp. 5054-5068

Authors: Xu, Guangquan; Feng, Shicheng; Xi, Hao; Yan, Qingyang; Li, Wenshan; Wang, Cong; Wang, Wei; Liu, Shaoying; Tian, Zhihong; Zheng, Xi. Affiliations: Qingdao Huanghai University School of Big Data Qingdao China Tianjin University College of Intelligence and Computing Tianjin 300350 China KLISS and School of Software Beijing 100084 China Sichuan University School of Cyber Science and Engineering Chengdu 610207 China Xi’an Jiaotong University School of Cyber Science and Engineering Xi’an 710049 China East China Normal University Shanghai 200062 China Hiroshima University School of Informatics and Data Science Higashihiroshima 739-8511 Japan Guangzhou University Cyberspace Institute of Advanced Technology Guangdong Key Laboratory of Industrial Control System Security Huangpu Research School of Guangzhou University China Macquarie University School of Computing Sydney NSW 2109 Australia

In the era where Web3.0 values data security and privacy, adopting groundbreaking methods to enhance privacy in recommender systems is crucial. Recommender systems need to balance privacy and accuracy while also overcoming cold-start problems. The Differential Trust Mechanism (DTM) introduced in this paper is such an approach. DTM uses Gaussian distributions to model trust relationships within data, offering a novel way to balance recommendation accuracy with user privacy. The mechanism applies differential privacy principles, adding Gaussian noise to protect individual user data from inference attacks while maintaining the integrity and utility of the overall dataset. Unlike traditional anonymization techniques, which often compromise data utility or remain vulnerable to reverse engineering, DTM provides a robust solution by dynamically adjusting privacy levels based on the trustworthiness of data requests. When DTM is combined with existing mainstream recommendation algorithms, prediction accuracy as measured by MAE and RMSE improves by at least 6.60% and 2.69%, respectively. This dual benefit positions DTM as a significant advancement in secure data processing, especially relevant for online businesses and platforms where personalized recommendations are crucial yet privacy concerns are paramount. © 2005-2012 IEEE.
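The idea of adding Gaussian noise whose scale depends on the trustworthiness of a request can be sketched with the classical Gaussian mechanism from differential privacy, where the noise scale is sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon for epsilon below 1. This is not the paper's DTM: the linear mapping from a trust score in [0, 1] to epsilon, the bounds `eps_min` and `eps_max`, the assumption that higher trust means less noise, and all function names are hypothetical.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def epsilon_for_trust(trust, eps_min=0.1, eps_max=0.9):
    """Hypothetical rule: more trusted requests get a larger privacy budget."""
    trust = min(max(trust, 0.0), 1.0)  # clamp to [0, 1]
    return eps_min + trust * (eps_max - eps_min)

def perturb_rating(rating, trust, rng, sensitivity=1.0, delta=1e-5):
    """Return the rating plus Gaussian noise scaled by the requester's trust."""
    sigma = gaussian_sigma(sensitivity, epsilon_for_trust(trust), delta)
    return rating + rng.gauss(0.0, sigma)

rng = random.Random(0)  # seeded so that runs are repeatable
low_trust_sigma = gaussian_sigma(1.0, epsilon_for_trust(0.0), 1e-5)
high_trust_sigma = gaussian_sigma(1.0, epsilon_for_trust(1.0), 1e-5)
noisy = perturb_rating(4.0, trust=0.9, rng=rng)
```

Under these assumptions, an untrusted request sees ratings with nine times the noise scale of a fully trusted one, which is the kind of privacy and accuracy trade-off the abstract describes.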

Keywords: Sensitive data

Visual Class Incremental Learning with Textual Priors Guidance based on an Adapted Vision-Language Model

IEEE Transactions on Multimedia, 2025

Authors: Zhang, Wentao; Yu, Tong; Wang, Ruixuan; Xie, Jianhui; Trucco, Emanuele; Zheng, Wei-Shi; Yang, Xiaobo. Affiliations: Sun Yat-sen University School of Computer Science and Engineering Guangzhou 510275 China Peng Cheng Laboratory Shenzhen 518066 China MOE Key Laboratory of Machine Intelligence and Advanced Computing Guangzhou 510275 China Second Affiliated Hospital Guangzhou University of Chinese Medicine State Key Laboratory of Dampness Syndrome of Chinese Medicine Guangzhou 510260 China Guangdong Provincial Key Laboratory of Clinical Research on Traditional Chinese Medicine Syndrome Guangzhou 510120 China University of Dundee School of Science and Engineering Dundee DD1 4HN United Kingdom

An ideal artificial intelligence (AI) system should have the capability to continually learn like humans. However, when learning new knowledge, AI systems often suffer from catastrophic forgetting of old knowledge. Although many continual learning methods have been proposed, they often ignore the issue of misclassifying similar classes and make insufficient use of textual priors of visual classes to improve continual learning performance. In this study, we propose a continual learning framework based on a pre-trained vision-language model (VLM) that does not require storing old class data. This framework utilizes parameter-efficient fine-tuning of the VLM's text encoder for constructing a shared and consistent semantic textual space throughout the continual learning process. The textual priors of visual classes are encoded by the adapted VLM's text encoder to generate discriminative semantic representations, which are then used to guide the learning of visual classes. Additionally, fake out-of-distribution (OOD) images constructed from each training image further assist in the learning of visual classes. Extensive empirical evaluations on three natural datasets and one medical dataset demonstrate the superiority of the proposed framework. The source code is available at https://***/OpenMedIA/CIL_Adapterd_VLM. © 1999-2012 IEEE.

Keywords: Contrastive Learning

Augmenting Short Enrollment Speech via Synthesis for Target Speaker Extraction

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zikang Huang, Jingru Lin, Meng Ge, Yu Jiang, Xiaobao Wang, Longbiao Wang, Jianwu Dang. Affiliations: Tianjin Key Laboratory of Cognitive Computing and Application College of Intelligence and Computing Tianjin University Tianjin China Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) Shenzhen China Department of Electrical and Computer Engineering National University of Singapore Huiyan Technology (Tianjin) Co. Ltd. Tianjin China Shenzhen Institute of Advanced Technology Chinese Academy of Sciences Shenzhen China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

A high-quality enrollment speech is crucial to target speaker extraction (TSE), since it provides essential cues for identifying the target speaker in the mixture. However, real applications usually only permit a short enrollment speech, e.g. a wakeup word for a mobile device, that provides limited cues. To address this issue, we propose an enrollment augmentation strategy that allows us to enrich the limited enrollment speech with massive text data through speech synthesis. By doing so, the extended enrollment speech contains enhanced speaker timbre and phonetic content which leads to better extraction quality. Furthermore, we propose a training data augmentation strategy to improve the model’s robustness and generalization in short enrollment speech scenarios. Experiments on Libri2Mix demonstrate that our proposed strategies bring a significant improvement in extreme scenarios where only 0.5s and 1-word enrollment speech is provided. We also release our code at https://***/HuangZikang-TJU/Aug4TSE.

Keywords: Training data; Speech enhancement; Phonetics; Signal processing; Speech; Robustness; Mobile handsets; Data models; Timbre; Speech synthesis

Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation

arXiv, 2025

Authors: Huang, Guohong; Zeng, Ling-An; Zheng, Zexin; Gu, Shengbo; Zheng, Wei-Shi. Affiliations: Sun Yat-sen University China Key Laboratory of Machine Intelligence and Advanced Computing Ministry of Education China

We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available here. Copyright © 2025, The Authors. All rights reserved.

Keywords: Geometry

DDNet: Deformable Convolution and Dense FPN for Surface Defect Detection in Recycled Books

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Jun Yu, WenJian Wang. Affiliations: School of Computer and Information Technology (School of Big Data) Shanxi University Taiyuan China Taihang Laboratory In Shanxi Province (Advanced Computing Laboratory In Shanxi Province) Taiyuan China Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education Shanxi University Taiyuan China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Recycled and recirculated books, such as ancient texts and reused textbooks, hold significant value in the secondhand goods market, with their worth largely dependent on surface preservation. However, accurately assessing surface defects is challenging due to the wide variations in shape, size, and the often imprecise detection of defects. To address these issues, we propose DDNet, an innovative detection model designed to enhance defect localization and classification. DDNet introduces a surface defect feature extraction module based on a deformable convolution operator (DC) and a densely connected FPN module (DFPN). The DC module dynamically adjusts the convolution grid to better align with object contours, capturing subtle shape variations and improving boundary delineation and prediction accuracy. Meanwhile, DFPN leverages dense skip connections to enhance feature fusion, constructing a hierarchical structure that generates multi-resolution, high-fidelity feature maps, thus effectively detecting defects of various sizes. In addition to the model, we present a comprehensive dataset specifically curated for surface defect detection in recycled and recirculated books. This dataset encompasses a diverse range of defect types, shapes, and sizes, making it ideal for evaluating the robustness and effectiveness of defect detection models. Through extensive evaluations, DDNet achieves precise localization and classification of surface defects, recording a mAP value of 46.7% on our proprietary dataset—an improvement of 14.2% over the baseline model—demonstrating its superior detection capabilities.

Keywords: Location awareness; Accuracy; Convolution; Shape; Semisupervised learning; Feature extraction; Robustness; Speech processing; Surface treatment; Defect detection

ThicknessVAE: Learning a Lateral Prior for Clothed Human Body Reconstruction

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xiaotao Wu, Zhaoxin Fan, Huiguang He, Dinggang Shen. Affiliations: School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices ShanghaiTech University Shanghai China NeuBCI Group State Key Laboratory of Brain Cognition and Brain-Inspired Intelligence Technology Institute of Automation Chinese Academy of Sciences Beijing China Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing Institute of Artificial Intelligence Beihang University Beijing China Beijing Academy of Blockchain and Edge Computing China Shanghai United Imaging Intelligence Co. Ltd. Shanghai China Shanghai Clinical Research and Trial Center Shanghai China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Sandwich-like structures have shown remarkable efficacy in clothed human reconstruction. However, these approaches often generate unrealistic side geometries due to inadequate handling of lateral regions. This paper addresses this limitation by incorporating the side geometry of clothed humans as a prior. We propose ThicknessVAE, a novel two-stage method that makes two key contributions: (1) We learn a prototype from point clouds for the lateral regions of clothed humans to extract common and detailed geometric features. (2) We utilize this prototype as a prior to transform geometric features into a thickness map associated with clothed human images, enabling refined normal integration for sandwich-like reconstruction methods. By seamlessly integrating our model into the sandwich-like reconstruction pipeline, we achieve highly realistic side views. Both qualitative and quantitative experiments demonstrate that our approach is comparable to state-of-the-art methods in terms of side-view realism.

Keywords: Geometry; Surface reconstruction; Three-dimensional displays; Pipelines; Prototypes; Transforms; Feature extraction; Speech processing; Image reconstruction; Surface treatment
