Search Results - Inner Mongolia University Library

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zhao Yang, Yue Heng Yeo, Rui Jiang, Xiao Fu, Weiguang Chen, Wei Xi, Jizhong Zhao. Affiliations: School of Computer Science and Technology, Xi’an Jiaotong University, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore; College of Computer Science and Electronic Engineering, Hunan University, China

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Audio-visual speech recognition (AVSR) aims to enhance the robustness of automatic speech recognition (ASR) systems by incorporating visual information from lip movements, especially in challenging noisy environments. Nevertheless, most current approaches either involve training from scratch or fully finetuning a pre-trained model, both of which incur significant computational costs and are often impractical for large-scale speech foundation models. This gap highlights the need for more efficient methods to leverage visual and acoustic information in AVSR tasks. To address this challenge, we propose AVWhisper, a parameter-efficient model that integrates visual and acoustic representations by injecting visual features from the AV-HuBERT encoder into the pre-trained Whisper model. Our approach leverages the existing attention mechanisms in Whisper to facilitate cross-modal interaction and integrates auxiliary visual information through lightweight adapters based on Low-Rank Adaptation (LoRA) and prompt-based techniques. Furthermore, a two-phase training strategy is adopted to handle cross-domain differences and visual information injection, respectively. Extensive experiments on the LRS3-TED dataset demonstrate that AVWhisper consistently outperforms state-of-the-art methods across various noise conditions, offering a more efficient and scalable solution for audio-visual speech recognition.

Keywords: Training; Visualization; Adaptation models; Foundation models; Lips; Noise; Acoustics; Robustness; Noise measurement; Speech processing
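
The abstract above mentions lightweight adapters based on Low-Rank Adaptation (LoRA). As a rough illustration of the general LoRA idea only, the sketch below adds a trainable low-rank update `B @ A`, scaled by `alpha / rank`, to a frozen weight matrix. The class name `LoRALinear`, the helper `matvec`, and all sizes and values are hypothetical and are not taken from AVWhisper, whose adapter placement and hyperparameters are not given in this record.

```python
# Minimal LoRA-style linear layer in pure Python (illustrative only).
# All names, sizes, and values are hypothetical, not taken from AVWhisper.
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight W (out_dim x in_dim)
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapter is a no-op at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        """Return W x + scale * B (A x)."""
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Count only the adapter parameters (A and B); W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 0.0, -1.0, 2.0]
print(layer.forward(x))              # equals W x while B is still zero
print(layer.trainable_parameters())  # adapter parameters vs. 16 in the full matrix
```

With rank 1 the adapter in this toy case trains 8 values instead of the 16 in the full matrix; the saving grows quickly as the matrix dimensions increase relative to the rank.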

Robust Multi-Dialect End-to-End ASR Model Jointly with Beam Search Threshold Pruning and LLM

SN Computer Science, 2025, vol. 6, no. 4, pp. 1-12

Authors: Shunmuga Priya, M.C.; Karthika Renuka, D.; Ashok Kumar, L. Affiliations: Department of Computer Science and Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham, Coimbatore, India; Department of Information Technology, PSG College of Technology, Coimbatore, India; Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, Madurai, India

This paper aims to develop a novel, robust multi-dialect end-to-end ASR system with beam search threshold pruning. The efficacy of the proposed model is evaluated using word error rate (WER). The key contributions are: (1) developing an end-to-end ASR system using an attention-based neural network architecture and analyzing the effectiveness of two features, MFCC and log Mel filter bank energies, on multiple speech dialect corpora including American, British, and Indian accents; (2) integrating beam search threshold pruning as a decoding mechanism to reduce decoding time; (3) conducting an experimental analysis to test model performance and compare the results against a baseline system; and (4) carrying out post-processing with a Llama2-7B-based large language model (LLM) to enhance the performance of the proposed ASR system. The proposed model significantly improves performance by 1.91% and 4.29% on clean and noisy speech, respectively, in the LibriSpeech corpus. Similarly, for the Indian accented speech, the model attains an average WER of about 6.6%. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2025.

Keywords: Automatic speech recognition; Beam search; Decoding; LLM; Log Mel filter bank energies
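
The abstract above relies on two generic ideas: word error rate (WER) as the evaluation metric, and threshold pruning during beam search. The sketch below shows one common form of each, assuming WER is word-level edit distance divided by reference length and that pruning keeps hypotheses whose log-probability lies within a fixed margin of the best one. The function names `word_error_rate` and `prune_beam`, and the sample hypotheses, are hypothetical; the paper's exact pruning rule is not given in this record.

```python
# Illustrative sketch only; not the paper's implementation.


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def prune_beam(hypotheses, threshold, beam_width):
    """Keep (text, log_prob) pairs within `threshold` of the best score, at most `beam_width`."""
    if not hypotheses:
        return []
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best_score = ranked[0][1]
    kept = [h for h in ranked if best_score - h[1] <= threshold]
    return kept[:beam_width]


# Hypothetical partial hypotheses with log-probabilities.
beam = [("the cat sat", -1.0), ("the cat sad", -1.5), ("a cat sat", -4.0)]
print(prune_beam(beam, threshold=1.0, beam_width=5))
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Dropping low-scoring hypotheses early means fewer candidates are expanded at each decoding step, which is where the reduction in decoding time would come from.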

Comprehensive multimodal approach for Parkinson’s disease classification using artificial intelligence: insights and model explainability

Soft Computing, 2025, vol. 29, no. 3, pp. 1845-1877

Authors: Balaha, Hossam Magdy; Hassan, Asmaa El-Sayed; Ahmed, Rawan Ayman; Balaha, Magdy Hassan. Affiliations: Bioengineering Department, J.B. Speed School of Engineering, University of Louisville, Louisville, KY, United States; Computer Science and Control Systems Engineering Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt; Mathematics and Engineering Physics Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt; Department of Obstetrics and Gynecology, Faculty of Medicine, Tanta University, Tanta, Egypt

Parkinson’s disease (PD) is a debilitating neurodegenerative disorder affecting millions worldwide. Early detection is vital for effective management, yet remains challenging. In this study, we investigated four distinct datasets for PD detection. Through comprehensive experimentation employing ensemble methods and feature selection, we achieved high classification accuracies across the datasets. For the Oxford Parkinson’s Disease Detection Dataset, an accuracy of 95.67%, precision of 97.59%, recall of 84.5%, specificity of 99.32%, and F1-score of 90.57% were achieved. For the Alzheimer Parkinson Diseases 3 Class Dataset, the "Stacking" approach surpasses individual models, reaching an accuracy of 99.85%, precision of 99.81%, recall of 99.81%, specificity of 99.86%, and F1 of 99.81%. For the NewHandPD dataset, regarding the Spiral category, the "Base-P32-384" model surpasses others with an accuracy of 97.35%, precision of 96.50%, recall of 98.57%, and F1-score of 97.53%. The collective "Stacking" approach proves highly effective regarding the Circle category, achieving 100% across all performance metrics. Regarding the Meander category, the "Base-P16-224" model achieves an accuracy of 97.35%, precision of 99.26%, recall of 95.71%, specificity of 99.19%, and F1 of 97.45%. The Mobile Device Voice Recordings at King’s College London (MDVR-KCL) dataset contains two datasets. Regarding the "SpontaneousDialogue" dataset, accuracy, BAC, precision, recall, specificity, and F1-score were computed, resulting in values of 94.03%, 92.83%, 90.78%, 100.0%, and 85.67%, respectively. Regarding the "ReadText" dataset, accuracy, BAC, precision, recall, specificity, and F1-score were computed, resulting in values of 91.89%, 90.62%, 87.5%, 100.0%, and 81.25%, respectively. Our findings highlight the efficacy of leveraging diverse data sources and advanced machine learning techniques to enhance PD detection accuracy. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germa

Keywords: Neurodegenerative diseases
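
The abstract above reports accuracy, balanced accuracy (BAC), precision, recall, specificity, and F1-score. As a reminder of how these relate, the sketch below computes all six from the four cells of a binary confusion matrix using their standard definitions. The function names and the counts (`tp=8, fp=2, tn=9, fn=1`) are made up for illustration and do not come from the paper's datasets.

```python
# Standard binary classification metrics from confusion-matrix counts.
# The counts used below are hypothetical, not results from the paper.


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)        # also called sensitivity
    specificity = safe_ratio(tn, tn + fp)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "bac": (recall + specificity) / 2,  # balanced accuracy
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": safe_ratio(2 * precision * recall, precision + recall),
    }


scores = binary_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in scores.items():
    print(f"{name}: {value:.2%}")
```

Balanced accuracy averages recall and specificity, so it is less flattering than plain accuracy when one class dominates the test set.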

AngLPCM: An Enhanced Similarity Measure

International Symposium on Advanced Computing and Communication (ISACC)

Authors: Pallabi Patowary, Dhruba K Bhattacharyya. Affiliations: Department of Computer Science & Engineering, Golaghat Engineering College, Golaghat, Assam, India; Department of Computer Science and Engineering, Tezpur University, Assam, India

ISBN: (digital) 9798331523893

ISBN: (print) 9798331523909

Gene Co-Expression Network (GCN) analysis is fundamental for understanding gene-gene interactions and cellular processes. A co-expressed gene pair may exhibit patterns such as absolute, alternate, shifting, scaling, and shifting-scaling. To identify co-expressed genes among several genes, these patterns must be identified first. Among the existing similarity measures, LPCM is one that is robust to noise when analyzing gene expression data. However, challenges remain in capturing subtle co-expression patterns in highly noisy datasets. In this paper, we propose an enhancement to LPCM by introducing angular deviation-based transformations. This modified measure further reduces noise sensitivity and improves the detection of co-expression patterns. Experiments demonstrate that the proposed measure consistently outperforms traditional approaches under varying noise conditions.

Keywords: Sensitivity; Correlation; Noise; Noise reduction; Noise measurement; Gene expression; Intelligent systems

Hate Speech Detection: Leveraging LLM-GPT2 with Fine-Tuning and Multi-Shot Techniques

Procedia Computer Science, 2025, vol. 258, pp. 2817-2825

Authors: Mahima Choudhary, Basant Agarwal, Vishnu Goyal. Affiliations: Department of Computer Science and Engineering, Central University of Rajasthan, Kishangarh (Ajmer)-305817, India; Computer Engineering Dept., Government Mahila Polytechnic College, Sanganer

Hate speech refers to any type of communication that degrades, discriminates against, shows prejudice toward, or incites violence against groups or individuals based on factors such as religion, race, nationality, skin color, or gender. Detecting hate speech is crucial to stop harm or violence against targeted individuals or groups and to create a safe and inclusive environment. In this paper, the performance of two large language model-based approaches was investigated. In the first approach, a GPT-2 model was fine-tuned on a hate-speech dataset, and the fine-tuned model was then evaluated for hate speech detection. In the second approach, n-shot learning was used with n set to zero, one, and two: a prompt was designed first, and the GPT model was then asked to detect, based on that prompt, whether a given text from the test data expresses hate. All the experiments were carried out on the publicly available ‘HatEval’ dataset. Experimental results show that few(n) shot learning does not necessarily surpass lesser(

Keywords: Hate-Speech; LLM; GPT2; Fine-Tuning; Multiple Shot Technique

Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation

arXiv, 2025

Authors: Chang, Laibin; Wang, Yunke; Huang, JiaXing; Deng, Longxiang; Du, Bo; Xu, Chang. Affiliations: School of Computer Science, Wuhan University, China; School of Computer Science, The University of Sydney, Australia; School of Computer Science and Engineering, Nanyang Technological University, Singapore

Marine Saliency Segmentation (MSS) plays a pivotal role in various vision-based marine exploration tasks. However, existing marine segmentation techniques face the dilemma of object mislocalization and imprecise boundaries due to the complex underwater environment. Meanwhile, despite the impressive performance of diffusion models in visual segmentation, there remains potential to further leverage contextual semantics to enhance feature learning of region-level salient objects, thereby improving segmentation outcomes. Building on this insight, we propose DiffMSS, a novel marine saliency segmenter based on the diffusion model, which utilizes semantic knowledge distillation to guide the segmentation of marine salient objects. Specifically, we design a region-word similarity matching mechanism to identify salient terms at the word level from the text descriptions. These high-level semantic features guide the conditional feature learning network in generating salient and accurate diffusion conditions with semantic knowledge distillation. To further refine the segmentation of fine-grained structures in unique marine organisms, we develop the dedicated consensus deterministic sampling to suppress overconfident missegmentations. Comprehensive experiments demonstrate the superior performance of DiffMSS over state-of-the-art methods in both quantitative and qualitative evaluations. © 2025, CC BY.

Keywords: Semantic Segmentation

Efficient Hierarchical Feature Collaboration Transformer for Image Inpainting

IEEE Transactions on Multimedia, 2025

Authors: Zhang, Dengyong; Fu, Nuo; Liao, Xin; Chen, Jiaxin; Yang, Hengfu; Yang, Gaobo. Affiliations: Changsha University of Science and Technology, School of Computer and Communication Engineering, Changsha 410004, China; Hunan University, College of Computer Science and Electronic Engineering, Changsha 410082, China; Hunan First Normal University, School of Computer Science, Changsha 410205, China

Existing image inpainting methods face limitations in detail restoration. Although transformer-based models have made certain progress recently, the lack of hierarchical feature interaction and insufficient consideration of the importance of features at different network levels lead to semantic ambiguity in image reconstruction. To enhance the visual quality and accuracy of image inpainting, we adopt a multi-level feature fusion approach and propose a novel, efficient hierarchical feature collaboration transformer (HFCT). Our approach comprises two modules: dual stream gated feature fusion (DSGF) and region-separated attention module (RSAM), effectively capturing features at different levels of the network and enhancing inter-level information exchange. The DSGF module uses soft gating to fuse primary and advanced features, strengthening the connection from local to global consistency and reducing artifacts. The RSAM module resolves attention isolation issues in feature fusion through region-separated attention, strengthening the understanding of feature relationships, capturing more image semantics, and improving restoration accuracy. Extensive experiments on the Paris StreetView, CelebA-HQ, and Places2 benchmark datasets demonstrate that our proposed method achieves superior image inpainting quality compared to several state-of-the-art inpainting algorithms. Please refer to the project page: https://***/csfunuo/HFCT. © 2024 IEEE.

Keywords: Restoration

A review of multimodal learning for text to images

Multimedia Tools and Applications, 2025, vol. 84, no. 10, pp. 8205-8245

Authors: Chen, Wei; Yang, Yuqing; Tian, Zijian; Chen, Qiteng; Liu, Jueting. Affiliations: School of Computer Science & Technology, China University of Mining and Technology, Xuzhou 221116, China; Engineering Research Center of Mine Digitization, Ministry of Education, China University of Mining and Technology, Xuzhou 221116, China; Beijing 100083, China

Information exists in various forms in the real world, and the effective interaction and fusion of multimodal information plays a key role in computer vision and deep learning research. Generating an image that matches a given text description is a multimodal task that requires a strong generative model and cross-modal understanding. This paper provides a comprehensive analysis of recent advances in text-to-image generation and a taxonomy based on model architecture and characteristics. We classify text-to-image methods by underlying framework: generative adversarial networks, transformers, and diffusion models. For each class of methods, the paper describes the network structure, advantages and disadvantages, benchmark datasets, and corresponding evaluation metrics, and summarizes application progress and experimental results. Finally, we provide insights into current research challenges and possible future research directions and applications. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.

Keywords: Generative adversarial networks

Unraveling Dementia Severity: A Deep Learning Approach for Brain MR Image-Based Prediction and Misclassification Analysis

Procedia Computer Science, 2025, vol. 258, pp. 1199-1208

Authors: Rekha R Nair, Tina Babu, Tripty Singh. Affiliations: Department of Computer Science and Engineering, Alliance School of Advanced Computing, Alliance University, Bengaluru, India; Department of Computer Science and Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

Alzheimer’s disease is the most prevalent cause of dementia, and its early diagnosis is crucial to prevent the progression to severe stages where cognitive abilities are severely impaired. This research paper presents an innovative approach to predict the severity of dementia through classification and grading. The research introduces an innovative adaptation of the DEMNET model, referred to as the DEMENtia network model. The research implements a novel methodology leveraging Convolutional Neural Networks (CNNs) to identify significant patterns within unorganized web-based data collections. The investigation employs a dataset comprising four categories, obtained from the Kaggle platform. The developed model demonstrates exceptional performance, achieving 99.9% accuracy during training, 97.4% accuracy in testing, and an overall precision of 0.975. The DEMENtia network model successfully categorizes individuals into four groups: those without dementia, and those with moderate, mild, or very mild dementia. The model achieves a remarkable accuracy of 99.20% in classifying the Moderate demented class, a significant advantage over existing approaches. To understand this behavior, we conducted an in-depth analysis by visualizing the pixel intensity distribution over the space. The proposed model's validity has been confirmed through validation by a team of neurologists, ensuring its potential for real-world clinical applications. By accurately predicting dementia severity, the proposed model can aid in early diagnosis and treatment planning, contributing to improved patient care and management.

Keywords: Alzheimer’s disease; Dementia; DEMENtia network model; Convolutional Neural Network (CNN)

Sub-1 nm misalignment sensing with cascaded interference in polar coordinate for lithography

Optics Express, 2025, vol. 33, no. 1, pp. 189-198

Authors: Wang, Nan; Tang, Baiyang; Jiang, Wei. Affiliations: School of Physical Science and Technology, Southwest Jiaotong University, Chengdu 611756, China; Department of Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China; School of Intelligent Manufacturing, Chengdu Technological University, Chengdu 611730, China

In previous work, we introduced a structured illumination strategy using linear gratings to achieve sub-nanometer misalignment sensing, which significantly enhanced accuracy and sensitivity. However, the approach was limited to linear gratings, as maintaining consistent fringe patterns during interference and modulation is essential for precise alignment. To overcome this limitation, we propose what we believe to be a novel misalignment sensing method based on cascaded interference in polar coordinates, enabling the use of sub-wavelength circular gratings for sub-nanometer alignment. Experimental results demonstrate that this method achieves 2D misalignment measurement with an impressive accuracy of 0.62 nm across a 50 µm range, providing a robust solution for wide-range, high-precision 2D alignment. © 2025 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement.

Keywords: Optical depth
