Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQ...
Graph-based methods, pivotal for label inference over interconnected objects in many real-world applications, often encounter generalization challenges when the graph used for model training differs significantly from the graph used for testing. This work delves into Graph Domain Adaptation (GDA) to address the unique complexities of distribution shifts over graph data, where interconnected data points experience shifts in features, labels, and, in particular, connecting patterns. We propose a novel, theoretically principled method, Pairwise Alignment (Pair-Align), to counter graph structure shift by mitigating conditional structure shift (CSS) and label shift (LS). Pair-Align uses edge weights to recalibrate the influence among neighboring nodes to handle CSS and adjusts the classification loss with label weights to handle LS. Our method demonstrates superior performance in real-world applications, including node classification with region shift in social networks and the pileup mitigation task in particle collision experiments. For the first application, we also curate the largest dataset to date for GDA studies. Our method also shows strong performance on synthetic and other existing benchmark datasets. Copyright 2024 by the author(s)
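The two reweighting ideas named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-edge weight table, and the assumption that a target label distribution is already available are all hypothetical, and the abstract does not say how Pair-Align estimates either set of weights.

```python
import math
from collections import Counter

def label_weights(source_labels, target_dist):
    """Per-class ratio of an ASSUMED target class frequency to the
    empirical source class frequency (hypothetical helper)."""
    counts = Counter(source_labels)
    n = len(source_labels)
    return {c: target_dist[c] / (counts[c] / n) for c in counts}

def weighted_nll(probs, labels, weights):
    """Mean negative log-likelihood, with each node's loss scaled by
    the weight of its true class."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

def aggregate(node, neighbors, features, edge_w):
    """Weighted mean of neighbor features. edge_w maps (node, neighbor)
    to a weight; missing edges default to 1.0 (an assumption)."""
    total = [0.0] * len(features[node])
    norm = 0.0
    for u in neighbors[node]:
        w = edge_w.get((node, u), 1.0)
        norm += w
        for k, x in enumerate(features[u]):
            total[k] += w * x
    return [x / norm for x in total] if norm > 0 else total
```

With source labels `["a", "a", "a", "b"]` and an assumed uniform target distribution, class `b` is under-represented in the source, so its loss terms are scaled up and class `a` terms are scaled down.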
Image captioning is an emerging research topic in the domain of artificial intelligence (AI). It integrates computer vision (CV) and natural language processing (NLP) for generating the image *** use in several application areas, namely recommendation in editing applications, utilization in virtual assistance, *** development of NLP and deep learning (DL) models find useful to derive a bridge among the visual details and textual *** this view, this paper introduces an Oppositional Harris Hawks Optimization with Deep Learning based Image Captioning (OHHO-DLIC) *** OHHO-DLIC technique involves the design of distinct levels of ***, the feature extraction of the images is carried out by the use of EfficientNet ***, the image captioning is performed by a bidirectional long short-term memory (BiLSTM) model, comprising encoder as well as *** last, the oppositional Harris Hawks optimization (OHHO) based hyperparameter tuning process is performed for effectively adjusting the hyperparameters of the EfficientNet and BiLSTM *** experimental analysis of the OHHO-DLIC technique is carried out on the Flickr 8k dataset, and a comprehensive comparative analysis showed better performance than recent approaches.
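The "oppositional" part of the tuner can be illustrated with the standard opposition-based learning step, in which a candidate x within bounds [lo, hi] is paired with its opposite lo + hi - x and the better of the two is kept. The sketch below assumes this standard definition applies; the abstract does not spell out how OHHO uses it, the Harris Hawks update itself is not shown, and the hyperparameter names and scoring function are hypothetical.

```python
def opposite(candidate, bounds):
    """Opposition-based learning: reflect each coordinate within its
    (low, high) bounds, giving low + high - x."""
    return [lo + hi - x for x, (lo, hi) in zip(candidate, bounds)]

def better_of_pair(candidate, bounds, score):
    """Keep whichever of the candidate and its opposite has the lower
    score (score is a hypothetical validation-loss callback)."""
    opp = opposite(candidate, bounds)
    return candidate if score(candidate) <= score(opp) else opp

# Hypothetical search space: (dropout rate, number of BiLSTM layers).
bounds = [(0.0, 1.0), (1, 10)]
start = [0.25, 8]
```

Evaluating both a candidate and its opposite is a cheap way to cover the search space more evenly before the main optimizer runs.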
Medical report generation is crucial for clinical diagnosis and patient management, summarizing diagnoses and recommendations based on medical imaging. However, existing work often overlooks the clinical pipeline invol...
Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment fro...
Knowledge Graphs (KGs) are potent frameworks for knowledge representation and reasoning. Nevertheless, KGs are inherently incomplete, leaving numerous uncharted relationships and facts awaiting discovery. Deep learning methodologies have proven effective in enhancing KG completion by framing it as a link prediction task, where the goal is to discern the validity of a triple comprising a head, relation, and tail. The significance of structural information in assessing the validity of a triple within a KG is well-established. However, quantifying this structural information poses a challenge. We need to pinpoint the metric that encapsulates the structural information of a triple and smoothly incorporate this metric into the link prediction learning process. In this study, we recognize the critical importance of the intersection among the k-hop neighborhoods of the head, relation, and tail when determining the validity of a triple. To address this, we introduce a novel randomized algorithm designed to efficiently generate intersection features for candidate triples. Our experimental results demonstrate that a straightforward fully-connected network leveraging these intersection features can surpass the performance of established KG embedding models and even outperform graph neural network baselines. Additionally, we highlight the substantial training time efficiency gains achieved by our network trained on intersection features. Copyright 2024 by the author(s)
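To make the notion of intersection features concrete, here is a minimal sketch that computes them exactly with breadth-first search. It is a stand-in, not the randomized algorithm from the abstract, which is not described there. It also covers only the head and tail entities, since the abstract does not say how a relation's neighborhood is defined. All names are hypothetical.

```python
from collections import deque

def k_hop(adj, start, k):
    """Nodes reachable from start in at most k hops (start included)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

def intersection_features(adj, head, tail, max_k):
    """One feature per k: size of the overlap between the k-hop
    neighborhoods of head and tail."""
    return [len(k_hop(adj, head, k) & k_hop(adj, tail, k))
            for k in range(1, max_k + 1)]

# Toy undirected path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

A feature vector like this could be fed to a small fully-connected network to score a candidate triple. The exact computation grows costly on large graphs, which is presumably what motivates a randomized estimator.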
Deep learning-based hyperspectral image (HSI) compression has recently attracted great attention in remote sensing due to the growth of hyperspectral data archives. Most of the existing models achieve either spectral ...
Cervical cancer is a prevalent malignancy that affects women worldwide. This disease, which is currently the fourth leading cause of death for women, shows no symptoms when it first arises. Cells that cau...
Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target-search feature correlation by self- and/or cross-attention operations; thus the model complexity grows quadratically with the number of input *** alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregate temporal dynamics and spatial appearance information of multi-frame templates along space-time *** aims to capture multi-scale pooled and correlated contextual features, which is followed by MAPT that aggregates multi-scale features into a unified feature representation for tracking *** tracker achieves AUC score of 69.5 on LaSOT and precision score of 82.8 on TrackingNet while maintaining a shorter sequence length of attention tokens, fewer parameters and FLOPs compared to existing state-of-the-art (SOTA) Transformer tracking *** experiments demonstrate that DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
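The link between pooling and attention cost can be shown with a small arithmetic sketch. It assumes plain non-overlapping average pooling over a flat token sequence, which is only an illustration; the abstract does not describe the pooling design used in MCPT or MAPT, and the function names are hypothetical.

```python
def avg_pool_tokens(tokens, window):
    """Average non-overlapping windows of token vectors, shortening
    the sequence by roughly a factor of `window`."""
    pooled = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        dim = len(chunk[0])
        pooled.append([sum(t[k] for t in chunk) / len(chunk)
                       for k in range(dim)])
    return pooled

def attention_score_count(n_query, n_key):
    """Number of query-key scores one attention layer computes."""
    return n_query * n_key

tokens = [[float(i), 1.0] for i in range(8)]   # 8 toy tokens, dim 2
pooled = avg_pool_tokens(tokens, window=2)     # 4 tokens remain
full_cost = attention_score_count(len(tokens), len(tokens))
pooled_cost = attention_score_count(len(pooled), len(pooled))
```

Halving the sequence length cuts the number of self-attention scores to a quarter, which is the quadratic effect the abstract refers to.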
The maximal coding rate reduction (MCR2) objective for learning structured and compact deep representations is drawing increasing attention, especially after its recent usage in the derivation of fully explainable and highly effective deep network architectures. However, it lacks a complete theoretical justification: only the properties of its global optima are known, and its global landscape has not been studied. In this work, we give a complete characterization of the properties of all its local and global optima, as well as other types of critical points. Specifically, we show that each (local or global) maximizer of the MCR2 problem corresponds to a low-dimensional, discriminative, and diverse representation, and furthermore, each critical point of the objective is either a local maximizer or a strict saddle point. Such a favorable landscape makes MCR2 a natural choice of objective for learning diverse and discriminative representations via first-order optimization methods. To validate our theoretical findings, we conduct extensive experiments on both synthetic and real data sets. Copyright 2024 by the author(s)
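For orientation, the coding rate reduction objective is commonly written as follows in the MCR2 literature. The abstract does not state the formula, so the notation here is an assumption: Z is a d x n matrix of n learned features, each Pi_j is a diagonal membership matrix for class j, and epsilon is the allowed distortion. The exact variant analyzed in this work (for example, any added regularization) may differ.

```latex
\Delta R(Z, \Pi, \epsilon)
  = \frac{1}{2} \log\det\!\left( I + \frac{d}{n\epsilon^{2}} Z Z^{\top} \right)
  - \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
    \log\det\!\left( I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}} Z \Pi_j Z^{\top} \right)
```

The first term measures the coding rate of all features together and the sum measures the rate of each class separately, so maximizing the difference favors features that are spread out overall yet compact within each class.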