ISBN: 9798350368741 (digital), 9798350368758 (print)
Audio-visual speech recognition (AVSR) aims to enhance the robustness of automatic speech recognition (ASR) systems by incorporating visual information from lip movements, especially in challenging noisy environments. Nevertheless, most current approaches either train from scratch or fully fine-tune a pre-trained model, both of which incur significant computational costs and are often impractical for large-scale speech foundation models. This gap highlights the need for more efficient methods to leverage visual and acoustic information in AVSR tasks. To address this challenge, we propose AVWhisper, a parameter-efficient model that integrates visual and acoustic representations by injecting visual features from the AV-HuBERT encoder into the pre-trained Whisper model. Our approach leverages the existing attention mechanisms in Whisper to facilitate cross-modal interaction and integrates auxiliary visual information through lightweight adapters based on Low-Rank Adaptation (LoRA) and prompt-based techniques. Furthermore, a two-phase training strategy is adopted to handle cross-domain differences and visual information injection, respectively. Extensive experiments on the LRS3-TED dataset demonstrate that AVWhisper consistently outperforms state-of-the-art methods across various noise conditions, offering a more efficient and scalable solution for audio-visual speech recognition.
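The abstract above mentions lightweight adapters based on Low-Rank Adaptation (LoRA). As a rough illustration of the general LoRA idea only, and not of AVWhisper's actual implementation, the sketch below adds a scaled low-rank product to a frozen weight matrix. The function names, the matrix shapes, and the `alpha` scaling convention are assumptions chosen for this example.

```python
# Minimal, dependency-free sketch of a LoRA-style linear layer.
# Assumption: y = x W + (alpha / r) * (x A) B, with W frozen and only
# A (d_in x r) and B (r x d_out) trainable. Names are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_forward(x, w, a, b, alpha):
    """Apply the frozen weight w plus the scaled low-rank update a @ b."""
    r = len(b)                       # rank of the update
    base = matmul(x, w)              # frozen path
    delta = matmul(matmul(x, a), b)  # low-rank trainable path
    scale = alpha / r
    return [[p + scale * q for p, q in zip(row_b, row_d)]
            for row_b, row_d in zip(base, delta)]

def count_params(d_in, d_out, r):
    """Compare trainable parameters: full fine-tuning vs. a rank-r update."""
    return {"full": d_in * d_out, "lora": r * (d_in + d_out)}

# Tiny example: with B initialised to zeros the adapted layer reproduces
# the frozen layer exactly, which is the usual LoRA starting point.
x = [[1.0, 2.0]]
w = [[0.5, -1.0], [2.0, 0.0]]
a = [[0.1], [0.2]]   # d_in x r, r = 1
b = [[0.0, 0.0]]     # r x d_out, zero-initialised
frozen_out = matmul(x, w)
adapted_out = lora_forward(x, w, a, b, alpha=16)
```

For a hypothetical 1024 x 1024 projection, `count_params(1024, 1024, 8)` shows the rank-8 update training far fewer parameters than the full matrix, which is the source of the efficiency the abstract refers to.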
This paper aims to develop a novel robust multi-dialect end-to-end ASR system with beam search threshold pruning. The efficacy of our proposed model is evaluated using word error rate (WER). Our key contributions are:...
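The entry above evaluates its model with word error rate (WER). A minimal sketch of the standard definition follows: the word-level edit distance between reference and hypothesis divided by the number of reference words. The function name and the whitespace tokenisation are assumptions for illustration; the paper's own scoring pipeline is not described in the excerpt.

```python
# Word error rate as (substitutions + deletions + insertions) / reference length,
# computed with a word-level Levenshtein distance. Tokenisation by whitespace
# is an assumption made for this sketch.

def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.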
Parkinson’s disease (PD) is a debilitating neurodegenerative disorder affecting millions worldwide. Early detection is vital for effective management, yet remains challenging. In this study, we investigated four dist...
ISBN: 9798331523893 (digital), 9798331523909 (print)
Gene co-expression network (GCN) analysis is fundamental for understanding gene-gene interactions and cellular processes. A co-expressed gene pair may exhibit patterns such as absolute, alternate, shifting, scaling, and shifting-scaling. To identify co-expressed genes among many candidates, these patterns must be identified first. Among the existing similarity measures, LPCM is robust to noise when analyzing gene expression data. However, challenges remain in capturing subtle co-expression patterns in highly noisy datasets. In this paper, we propose an enhancement to LPCM by introducing angular deviation-based transformations. This modified measure further reduces noise sensitivity and improves the detection of co-expression patterns. Experiments demonstrate that the proposed measure consistently outperforms traditional approaches under varying noise conditions.
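To make the shifting, scaling, and shifting-scaling patterns named above more concrete, the sketch below uses plain Pearson correlation as a stand-in similarity measure. This is not LPCM or the proposed angular-deviation variant, neither of which is specified in the abstract. It assumes the common reading that a shifted profile adds a constant, a scaled profile multiplies by a positive constant, and a shifting-scaling profile does both; the expression values are made up for illustration.

```python
import math

# Pearson correlation, used here only as a familiar stand-in measure.
def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical expression profile of one gene across five conditions.
gene = [1.0, 3.0, 2.0, 5.0, 4.0]

shifted = [v + 10.0 for v in gene]             # shifting pattern
scaled = [3.0 * v for v in gene]               # scaling pattern
shift_scaled = [3.0 * v + 10.0 for v in gene]  # shifting-scaling pattern

scores = {
    "shifting": pearson(gene, shifted),
    "scaling": pearson(gene, scaled),
    "shifting-scaling": pearson(gene, shift_scaled),
}
```

All three scores are 1 up to floating-point error, since Pearson correlation is unchanged by adding a constant or multiplying by a positive one. Noise in the profiles lowers these scores, which is the sensitivity that noise-robust measures try to reduce.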
Hate speech can be defined as any type of communication that degrades, discriminates against, or incites prejudice or violence against groups or individuals based on factors such as religion, race, nationality, skin color, or gender. Detecting hate speech is crucial to prevent harm or violence against targeted individuals or groups and to create a safe and inclusive environment. In this paper, the performance of two large language model-based approaches was investigated. In the first approach, a GPT-2 model was fine-tuned on a hate-speech dataset, and the fine-tuned model was then evaluated for hate speech detection. In the second approach, n-shot learning was used with n set to zero, one, and two: a prompt was designed first, and the GPT model was then asked to decide, based on that prompt, whether a given text from the test data expresses hate. All experiments were carried out on the publicly available ‘HatEval’ dataset. Experimental results show that few(n) shot learning does not necessarily surpass lesser(
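The second approach described above builds a prompt and supplies zero, one, or two labelled examples before the text to classify. A minimal sketch of such prompt construction follows. The instruction wording, the `build_prompt` helper, and the placeholder examples are all hypothetical; the excerpt does not give the authors' actual prompts.

```python
# Sketch of n-shot prompt construction for a yes/no classification task.
# The instruction text and examples below are illustrative assumptions.

INSTRUCTION = ("Decide whether the following text expresses hate speech. "
               "Answer 'yes' or 'no'.")

def build_prompt(text, examples, n):
    """Return a prompt containing the first n labelled examples, then the query."""
    if n < 0 or n > len(examples):
        raise ValueError("n must be between 0 and the number of examples")
    lines = [INSTRUCTION, ""]
    for sample, label in examples[:n]:
        lines += [f"Text: {sample}", f"Answer: {label}", ""]
    lines += [f"Text: {text}", "Answer:"]
    return "\n".join(lines)

# Placeholder demonstrations standing in for labelled training items.
demo_examples = [
    ("<labelled example 1>", "yes"),
    ("<labelled example 2>", "no"),
]

zero_shot = build_prompt("<test item>", demo_examples, 0)
one_shot = build_prompt("<test item>", demo_examples, 1)
two_shot = build_prompt("<test item>", demo_examples, 2)
```

Each prompt would then be sent to the language model, with the generated continuation after the final `Answer:` read as the predicted label.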
Marine Saliency Segmentation (MSS) plays a pivotal role in various vision-based marine exploration tasks. However, existing marine segmentation techniques face the dilemma of object mislocalization and imprecise bound...
Existing image inpainting methods face limitations in detail restoration. Although transformer-based models have made certain progress recently, the lack of hierarchical feature interaction and insufficient considerat...
Information exists in various forms in the real world, and the effective interaction and fusion of multimodal information plays a key role in the research of computer vision and deep learning. Generating an image that...
Alzheimer’s disease is the most prevalent cause of dementia, and its early diagnosis is crucial to prevent progression to severe stages where cognitive abilities are severely impaired. This research paper presents an approach to predicting the severity of dementia through classification and grading. It introduces an adaptation of the DEMNET model, referred to as the DEMENtia network model. The methodology leverages Convolutional Neural Networks (CNNs) to identify significant patterns within unorganized web-based data collections. The investigation employs a dataset comprising four categories, obtained from the Kaggle platform. The developed model achieves 99.9% accuracy during training, 97.4% accuracy in testing, and an overall precision of 0.975. The DEMENtia network model successfully categorizes individuals into four groups: those without dementia, and those with moderate, mild, or very mild dementia. The model achieves an accuracy of 99.20% in classifying the Moderate Demented class, a significant advantage over existing approaches. To understand this behavior, we conducted an in-depth analysis by visualizing the pixel intensity distribution over the space. The validity of the proposed model has been confirmed by a team of neurologists, supporting its potential for real-world clinical applications. By accurately predicting dementia severity, the proposed model can aid in early diagnosis and treatment planning, contributing to improved patient care and management.
In previous work, we introduced a structured illumination strategy using linear gratings to achieve sub-nanometer misalignment sensing, which significantly enhanced accuracy and sensitivity. However, the approach was ...