The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in the complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. In particular, we present a simple but effective proposal-free framework, the video grounding transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are manifested as follows. (1) The token is unrelated to the video or the query and avoids data bias toward the original video and query. (2) The token simultaneously performs global context aggregation from video and query ***, we employed a shared feature encoder to project both video and query into a joint feature space before performing cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Furthermore, we concatenated a learnable regression token [REG] with the video and query features as the input of a vision-language transformer. Finally, we utilized the token [REG] to predict the target moment and the visual features to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performed well on three public datasets: ANet-Captions, TACoS, and YouCookⅡ. Extensive ablation studies and qualitative analysis further validated the interpretability of ViGT.
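The regression-token idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single attention step, the center/width parameterization of the boundary, the random features, and every function name below are assumptions made for illustration. The sketch only shows how a [REG] vector, concatenated with video and query features, can aggregate the whole sequence and then be mapped to a normalized (start, end) span.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend_from_reg(reg, tokens):
    """One scaled dot-product attention step with [REG] as the query.

    [REG] attends over itself plus every video and query token, so the
    result is a weighted average of the whole multimodal sequence.
    """
    sequence = [reg] + tokens
    scale = math.sqrt(len(reg))
    weights = softmax([dot(reg, tok) / scale for tok in sequence])
    return [sum(w * tok[i] for w, tok in zip(weights, sequence))
            for i in range(len(reg))]


def predict_boundary(reg_state, w_center, w_width):
    """Map the aggregated [REG] state to a normalized (start, end) span.

    The center/width head is a hypothetical choice, not taken from the paper.
    """
    center = sigmoid(dot(reg_state, w_center))
    width = sigmoid(dot(reg_state, w_width))
    start = max(0.0, center - width / 2)
    end = min(1.0, center + width / 2)
    return start, end


random.seed(0)
dim = 4
# Stand-ins for encoded features: 6 video clips and 3 query words.
video = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
query = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
# In a trained model these would be learned parameters.
reg = [random.uniform(-1, 1) for _ in range(dim)]
w_center = [random.uniform(-1, 1) for _ in range(dim)]
w_width = [random.uniform(-1, 1) for _ in range(dim)]

reg_state = attend_from_reg(reg, video + query)
start, end = predict_boundary(reg_state, w_center, w_width)
print(f"predicted span: {start:.3f} to {end:.3f} (fraction of video length)")
```

Because the boundary is read from the [REG] position alone, the prediction head never consumes an individual video or query feature directly, which is the property the abstract emphasizes.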
Knee osteoarthritis (KOA) is a widespread global condition, impacting over 300 million individuals as per the World Health Organization (WHO). Particularly prevalent among older adults, knee OA is a prominent cause of...
Document-level relation extraction aims at extracting relational facts between two entities in a document. Existing approaches mainly focus on target entities, utilizing techniques such as graph neural networks to enh...
Generating realistic handwritten word images that closely resemble a target style remains a challenging task in document image analysis. In recent years, deep learning techniques, such as Latent Diffusion Models (LDM)...
Data-driven business models imply the inter-organisational exchange of data or similar value objects. Data science methods enable organisations to discover patterns and eventually knowledge from data. Further, by trai...
The utilization of Artificial Intelligence in automatically generating radiology reports presents a promising solution for enhancing the efficiency of the diagnostic process and reducing human error. However, existing...
Data clustering is an essential technique for analyzing complex datasets and continues to be a central research topic in data *** clustering algorithms, such as K-means, are widely used due to their simplicity and *** paper proposes a novel Spiral Mechanism-Optimized Phasmatodea Population Evolution Algorithm (SPPE) to improve clustering *** SPPE algorithm introduces several enhancements to the standard Phasmatodea Population Evolution (PPE) ***, a Variable Neighborhood Search (VNS) factor is incorporated to strengthen the local search capability and foster population ***, a position update model, incorporating a spiral mechanism, is designed to improve the algorithm’s global exploration and convergence ***, a dynamic balancing factor, guided by fitness values, adjusts the search process to balance exploration and exploitation *** performance of SPPE is first validated on CEC2013 benchmark functions, where it demonstrates excellent convergence speed and superior optimization results compared to several state-of-the-art metaheuristic *** further verify its practical applicability, SPPE is combined with the K-means algorithm for data clustering and tested on seven *** results show that SPPE-K-means improves clustering accuracy, reduces dependency on initialization, and outperforms other clustering *** study highlights SPPE’s robustness and efficiency in solving both optimization and clustering challenges, making it a promising tool for complex data analysis tasks.
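The abstract does not say how SPPE and K-means are coupled, so the following is only a sketch of one common way to pair a population-based optimizer with K-means. It assumes each candidate solution encodes a full set of centroids, that fitness is the sum of squared errors (SSE), and that the best candidate seeds a standard K-means update. The random population stands in for the optimizer; none of the SPPE operators (VNS factor, spiral update, balancing factor) are reproduced here, and all names are hypothetical.

```python
import random


def squared_distance(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))


def sse(points, centroids):
    """Fitness of a candidate: each point contributes its squared distance
    to the nearest centroid. Lower is better."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def lloyd_step(points, centroids):
    """One K-means update: assign points to the nearest centroid, then
    move each centroid to the mean of its assigned points."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
        groups[nearest].append(p)
    updated = []
    for old, group in zip(centroids, groups):
        if group:
            updated.append([sum(col) / len(group) for col in zip(*group)])
        else:
            updated.append(old)  # an empty cluster keeps its old centroid
    return updated


random.seed(1)
# Two synthetic 2-D blobs around (0, 0) and (5, 5).
points = ([[random.gauss(0, 0.5), random.gauss(0, 0.5)] for _ in range(20)]
          + [[random.gauss(5, 0.5), random.gauss(5, 0.5)] for _ in range(20)])

# Stand-in for the optimizer's population: 10 random candidates, k = 2.
population = [[[random.uniform(-1, 6), random.uniform(-1, 6)] for _ in range(2)]
              for _ in range(10)]

best = min(population, key=lambda cand: sse(points, cand))
refined = lloyd_step(points, best)
print(f"SSE of best candidate: {sse(points, best):.3f}")
print(f"SSE after one K-means step: {sse(points, refined):.3f}")
```

Selecting the initial centroids by fitness rather than at random is one plausible reading of the reduced dependency on initialization reported in the abstract.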
We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our data augmentation algorithm borrows from the literature on causality and extends the recently proposed Anchor regressio...
Vision-language models (VLMs) have emerged as formidable tools, showing their strong capability in handling various open-vocabulary tasks in image recognition, text-driven visual content generation, and visual chatbots, to name a few. In recent years, considerable efforts and resources have been devoted to adaptation methods for improving the downstream performance of VLMs, particularly parameter-efficient fine-tuning methods such as prompt learning. However, a crucial aspect that has been largely overlooked is the confidence calibration problem in fine-tuned VLMs, which could greatly reduce reliability when deploying such models in the real world. This paper bridges the gap by systematically investigating the confidence calibration problem in the context of prompt learning and reveals that existing calibration methods are insufficient to address the problem, especially in the open-vocabulary setting. To solve the problem, we present a simple and effective approach called Distance-Aware Calibration (DAC), which scales the temperature guided by the distance between predicted text labels and base classes. Experiments with 7 distinct prompt learning methods applied across 11 diverse downstream datasets demonstrate the effectiveness of DAC, which achieves high efficacy without sacrificing inference speed. Our code is available at https://***/mlstat-Sustech/CLIP Calibration. Copyright 2024 by the author(s)
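Temperature scaling, the mechanism DAC builds on, can be sketched in a few lines. The abstract does not give DAC's actual rule, so `distance_aware_temperature` below is a hypothetical linear rule invented for illustration, and the logits and distances are made-up values. The sketch only shows the general effect: dividing logits by a larger temperature flattens the softmax and lowers the reported confidence without changing the predicted class.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distance_aware_temperature(base_temperature, distance, strength=1.0):
    """Hypothetical rule (not from the paper): the temperature grows
    linearly with the distance of the predicted label from the base classes."""
    return base_temperature * (1.0 + strength * distance)


logits = [4.0, 1.0, 0.5]  # made-up class scores for one image

# distance = 0.0: predicted label coincides with a base class.
near = softmax_with_temperature(logits, distance_aware_temperature(1.0, 0.0))
# distance = 1.0: predicted label lies far from every base class.
far = softmax_with_temperature(logits, distance_aware_temperature(1.0, 1.0))

print("confidence near base classes:", round(max(near), 3))
print("confidence far from base classes:", round(max(far), 3))
```

Because the adjustment is a single division applied to the logits, it adds negligible cost at inference time, in line with the abstract's statement about inference speed.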
Alzheimer’s disease is a neurological disorder characterized by functional and structural atrophy, leading to symptoms like memory loss and cognitive decline. This study seeks to analyze the disruptions of functional...