ISBN: (Print) 9789819786848; 9789819786855
With the development of Artificial Intelligence Generated Content (AIGC), fake image detection has become increasingly challenging. Leveraging the advanced capabilities of large language models (LLMs) in sequence prediction, we propose a novel perspective on fake image detection by fine-tuning pure LLMs. We introduce Fake-GPT, an LLM with 7 billion parameters that can differentiate between real and fake images. Unlike conventional image processing models, our approach directly processes RGB pixel values without relying on any position embedding or visual-language feature alignment, thereby reducing model complexity and processing steps. Our research demonstrates the effective application of LLMs to detecting fake images, thereby expanding their use in non-textual domains. Extensive experiments conducted on various deepfake datasets show that Fake-GPT achieves competitive results compared with conventional image processing models, underscoring its potential as a new paradigm in the realm of image authentication.
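The abstract does not specify how pixels are serialized or classified; as a rough illustration of feeding raw RGB values to a position-embedding-free sequence model, here is a minimal toy sketch (all module names, dimensions, and the pooling head are assumptions, not the authors' 7B architecture):

```python
import torch
import torch.nn as nn

# Hypothetical toy analogue of Fake-GPT's input pipeline: raw 8-bit RGB values
# become tokens, and no position embedding is added, per the abstract.
class PixelSequenceClassifier(nn.Module):
    def __init__(self, vocab_size=256, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)    # one token per 8-bit value
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)                 # real vs. fake logits

    def forward(self, images):                            # images: (B, 3, H, W) uint8
        tokens = images.flatten(1).long()                 # row-major RGB serialization
        h = self.backbone(self.embed(tokens))             # note: no position embedding
        return self.head(h.mean(dim=1))                   # pool over sequence, classify

model = PixelSequenceClassifier()
batch = torch.randint(0, 256, (2, 3, 16, 16), dtype=torch.uint8)
logits = model(batch)                                     # shape (2, 2)
```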
ISBN: (Print) 9789819785049; 9789819785056
Vision Mamba (VMamba) has recently attracted great research attention due to its ability to obtain a global receptive field with linear computational complexity. However, similar to the Vision Transformer (ViT), its patch-division mechanism leaves it with insufficient ability to describe local details. To address this issue, we design a dual-stream network that combines VMamba and CNN, aiming to give the network both the global receptive field of VMamba and the local detail description capability of CNN; both characteristics are crucial for remote sensing image semantic segmentation. The two streams are supervised and trained through independent loss functions. To enable sufficient information exchange between the two branches, we introduce an auto-scaling fusion module that bridges the semantic gap between VMamba and CNN. Experiments demonstrate that the proposed method outperforms state-of-the-art methods on multiple remote sensing semantic segmentation datasets.
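A minimal sketch of how such a dual-stream design with independent supervision and a gated fusion could look; the sigmoid-gated blend and the plain-conv stand-in for the VMamba branch below are assumptions, since the abstract does not detail the auto-scaling fusion:

```python
import torch
import torch.nn as nn

class AutoScalingFusion(nn.Module):
    """Assumed form of the auto-scaling fusion module: a learned sigmoid gate
    blends the two streams to bridge their semantic gap."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, f_vmamba, f_cnn):
        g = self.gate(torch.cat([f_vmamba, f_cnn], dim=1))
        return g * f_vmamba + (1 - g) * f_cnn

class DualStreamSeg(nn.Module):
    def __init__(self, channels=64, n_classes=6):
        super().__init__()
        # A plain conv stands in for the VMamba branch; the real model would
        # use selective-scan blocks to obtain the global receptive field.
        self.vmamba = nn.Sequential(nn.Conv2d(3, channels, 7, padding=3), nn.ReLU())
        self.cnn = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.fuse = AutoScalingFusion(channels)
        self.head_vm = nn.Conv2d(channels, n_classes, 1)    # per-stream heads allow the
        self.head_cnn = nn.Conv2d(channels, n_classes, 1)   # independent loss functions
        self.head_fused = nn.Conv2d(channels, n_classes, 1)

    def forward(self, x):
        fv, fc = self.vmamba(x), self.cnn(x)
        return self.head_fused(self.fuse(fv, fc)), self.head_vm(fv), self.head_cnn(fc)
```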
ISBN: (Print) 9789819786190; 9789819786206
Currently, large vision-language models have made promising progress on many downstream tasks. However, they still face many challenges in fine-grained visual understanding tasks such as object attribute comprehension. Moreover, while there have been growing efforts to evaluate large vision-language models, there is a lack of in-depth study of attribute comprehension and of the visual-language fine-tuning process. In this paper, we propose to evaluate the attribute comprehension ability of large vision-language models from two perspectives: attribute recognition and attribute hierarchy understanding. We evaluate three vision-language interactions: visual question answering, image-text matching (ITM), and image-text cosine similarity (ITC). Furthermore, we explore the factors affecting attribute comprehension during fine-tuning. Through a series of quantitative and qualitative experiments, we report three main findings: (1) Large vision-language models possess good attribute recognition ability, but their hierarchical understanding ability is relatively limited. (2) Compared to ITC, ITM exhibits a superior capability for capturing finer details, making it more suitable for attribute understanding tasks. (3) The attribute information in the captions used for fine-tuning plays a crucial role in attribute understanding. We hope this work can help guide future progress in fine-grained visual understanding of large vision-language models. The code will be available at Attribute-Comprehension-of-VLMs.
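For readers unfamiliar with the ITC/ITM distinction the findings rest on, a schematic sketch follows: ITC scores a pair by cosine similarity of independently encoded embeddings, while ITM fuses the two modalities with cross-attention before a binary match head, which is why it can capture finer details. Dimensions and module choices below are illustrative, not the evaluated models':

```python
import torch
import torch.nn.functional as F

def itc_score(img_emb, txt_emb):
    """ITC: cosine similarity of independently encoded modalities (no fusion)."""
    return F.cosine_similarity(img_emb, txt_emb, dim=-1)

class ITMHead(torch.nn.Module):
    """ITM: a cross-modal encoder attends text tokens to image tokens, then a
    binary head scores match/no-match -- the joint encoding is what lets ITM
    pick up finer attribute details than ITC."""
    def __init__(self, d=256):
        super().__init__()
        self.cross = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = torch.nn.Linear(d, 2)

    def forward(self, img_tokens, txt_tokens):   # (B, N, d), (B, T, d)
        fused, _ = self.cross(txt_tokens, img_tokens, img_tokens)
        return self.head(fused.mean(dim=1))      # logits for (no-match, match)
```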
ISBN: (Print) 9789819785100; 9789819785117
Species diversity is one of the major differences between animal action recognition and human action recognition, resulting in a series of challenges, e.g., action manifestation diversity, concurrent actions, and long-tailed distributions in datasets. As the same action can manifest very differently across animal species due to their physiological differences, it is crucial for models to learn to distinguish the varied visual content under the same label from species-aware perspectives. However, previous works mainly applied single-species recognition methods to animal datasets, without considering species diversity in animal action recognition. To fill this gap, we propose a novel animal action recognition approach, Species-Aware Guidance (SAG), which provides species-specific guidance by exploiting pre-trained vision-language knowledge. First, we add word-level species semantics to visual embeddings as guidance, leading the model to focus on relevant regions of target animals in subsequent visual understanding. Then, we apply spatiotemporal modeling at both global and local granularity via a two-branch module to obtain a cross-modal video representation. Finally, sentence-level species-aware semantics is fused with action labels as an overall query, guiding the video representation to output the final action label via the decoder. On two widely used public benchmarks of animal action recognition, covering both single-label and multi-label scenarios, SAG achieves state-of-the-art performance, e.g., on Animal Kingdom (↑5.0%) and MammalNet (↑27.0%) compared with existing methods, and it particularly alleviates the long-tailed distribution problem, demonstrating the effectiveness of species guidance under limited training data.
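A minimal sketch of the first step as described (word-level species guidance injected into visual embeddings); the broadcast-add below is an assumed realization, and the paper may fuse the two differently:

```python
import torch

def species_guided_tokens(visual_tokens, species_word_emb):
    """Word-level species guidance: inject the species word embedding into the
    visual embeddings so later layers focus on the target animal. The
    broadcast-add is an assumed fusion operator."""
    # visual_tokens: (B, N, D) patch embeddings; species_word_emb: (B, D)
    return visual_tokens + species_word_emb.unsqueeze(1)

tokens = torch.randn(2, 196, 512)                 # e.g., 14x14 patches from a frame
species = torch.randn(2, 512)                     # text embedding of "zebra", "otter", ...
guided = species_guided_tokens(tokens, species)   # (2, 196, 512)
```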
ISBN: (Print) 9789819784950; 9789819784967
The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods have limitations in leveraging extensive pre-trained knowledge from broader datasets, and they often depend on manual prompt construction by expert radiologists. By automating prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in producing class-specific prompts for unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on the prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach over leading medical vision-language and multi-label prompt learning methods. The source code is available at https://***/fallingnight/PsPG.
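A minimal sketch of the autoregressive pseudo-prompt idea, assuming a GRU cell conditioned on a fused multi-modal feature; the real PsPG decoder's conditioning, dimensions, and stopping rule are not given in the abstract:

```python
import torch
import torch.nn as nn

class PseudoPromptDecoder(nn.Module):
    """Sketch of PsPG's core idea: an RNN autoregressively emits a sequence of
    pseudo-prompt embeddings conditioned on multi-modal features. The GRU cell,
    dimensions, and conditioning below are assumptions."""
    def __init__(self, d=512, n_ctx=8):
        super().__init__()
        self.n_ctx = n_ctx
        self.cell = nn.GRUCell(d, d)
        self.proj = nn.Linear(d, d)

    def forward(self, cond):                     # cond: (B, D) fused image/class feature
        h, tok, prompts = cond, torch.zeros_like(cond), []
        for _ in range(self.n_ctx):
            h = self.cell(tok, h)                # autoregressive step
            tok = self.proj(h)                   # next pseudo-prompt vector
            prompts.append(tok)
        return torch.stack(prompts, dim=1)       # (B, n_ctx, D), fed to the text encoder
```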
ISBN: (Print) 9789819785049; 9789819785056
Partially observable multi-agent cooperation (POMAC) is a popular task in multi-agent systems, where recognized environments such as the StarCraft Multi-Agent Challenge play a vital role in algorithm development and testing. However, POMAC in the real world often involves situations beyond the simulation scope of current environments, such as asynchronous cooperation, which largely limits the development of multi-agent cooperation algorithms. To close this gap, we propose the WarGame Challenge (WGC), which provides four sub-environments reflecting reality-inspired characteristics of POMAC: cooperation with asynchronous actions, strongly stochastic environments, changeable agents, and asymmetric opponents. Along with the benchmark, we integrate the PyMARL package and provide baseline multi-agent reinforcement learning algorithms for researchers' use. The code and projects will be released after the paper review process.
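Since WGC ships with PyMARL baselines, its environments presumably expose an SMAC-style stepping interface; the rollout loop below is a guess at that interface (method names like get_avail_agent_actions and the step return signature are assumptions), illustrating how asynchronous cooperation can surface as per-tick action-availability masks:

```python
import random

def rollout(env, n_agents, max_steps=100):
    """Random-policy episode against an assumed SMAC-style environment.
    Per-tick availability masks are one way to express asynchronous
    cooperation: an agent with no available actions sits out that step."""
    env.reset()
    total_reward, done, t = 0.0, False, 0
    while not done and t < max_steps:
        actions = []
        for i in range(n_agents):
            avail = env.get_avail_agent_actions(i)            # assumed method
            choices = [a for a, ok in enumerate(avail) if ok]
            actions.append(random.choice(choices) if choices else 0)
        reward, done, info = env.step(actions)                # assumed return triple
        total_reward += reward
        t += 1
    return total_reward
```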
ISBN: (Print) 9789819787913; 9789819787920
In the field of autonomous driving, environmental perception is crucial for driving safety. Addressing the limitations of existing visual perception methods in complex scenarios, this study proposes a deformable depth visual perception framework based on a multi-camera system. The framework processes multi-camera data through a feature extraction network to generate and fuse multi-scale features. A deformable depth prediction mechanism incorporating ego-vehicle temporal difference features is then introduced to improve the model's depth prediction accuracy. Experimental results show that on the NuScenes dataset, our method achieves a detection accuracy (mAP) of 0.508 using only 5 random cameras out of 6, surpassing existing techniques such as Lift-Splat (0.446), RC-BEVFusion (0.476), and SOGDet-SE (0.474). Future research will focus on improving prediction accuracy for distant vehicles to further enhance model performance.
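One plausible reading of the deformable depth prediction mechanism, sketched with torchvision's deform_conv2d: the ego-vehicle temporal difference of features drives the sampling offsets before a categorical depth head. This is an assumption throughout; the abstract does not detail the mechanism:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class TemporalDiffDepthHead(nn.Module):
    """Assumed mechanism: the frame-to-frame feature difference predicts the
    sampling offsets of a 3x3 deformable convolution, and the resampled
    features feed a categorical (binned) depth head."""
    def __init__(self, c=256, n_depth_bins=64):
        super().__init__()
        self.offset = nn.Conv2d(c, 18, 3, padding=1)           # 2 * 3 * 3 offset channels
        self.weight = nn.Parameter(torch.randn(c, c, 3, 3) * 0.01)
        self.depth = nn.Conv2d(c, n_depth_bins, 1)

    def forward(self, feat_t, feat_prev):                      # (B, C, H, W) each
        offsets = self.offset(feat_t - feat_prev)              # temporal difference
        sampled = deform_conv2d(feat_t, offsets, self.weight, padding=1)
        return self.depth(sampled).softmax(dim=1)              # per-pixel depth bins
```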
ISBN: (Print) 9789819784950; 9789819784967
Accurate polyp segmentation is crucial for the early detection of colorectal cancer. However, existing polyp detection methods sometimes ignore multi-directional features and the drastic scale changes of concealed targets. To address these challenges, we design an Orthogonal Direction Enhancement and Scale-Aware Network (ODC-SA Net) for polyp segmentation. The Orthogonal Direction Convolutional (ODC) block extracts multi-directional features using transposed rectangular convolution kernels that form sets of orthogonal feature vector bases, which resolves the issue of random feature direction changes. Additionally, a Multi-scale Fusion Attention (MSFA) mechanism is proposed to emphasize scale changes in both spatial and channel dimensions, enhancing segmentation accuracy for polyps of varying sizes. An Extraction with Re-attention (ERA) module is used to recombine effective features, and a Shallow Reverse Attention (SRA) mechanism is used to enhance polyp edges with low-level information. Extensive experiments conducted on public datasets demonstrate that this model outperforms state-of-the-art methods.
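A minimal sketch of the orthogonal-direction idea, assuming paired 1×k and k×1 (transposed rectangular) kernels whose outputs are mixed by a 1×1 convolution; the exact composition used in ODC-SA Net is not given in the abstract:

```python
import torch
import torch.nn as nn

class ODCBlock(nn.Module):
    """Paired 1xk and kx1 (transposed rectangular) kernels act as an orthogonal
    direction basis; a 1x1 convolution mixes the two directional responses.
    The residual connection and mixing rule are assumptions."""
    def __init__(self, c, k=7):
        super().__init__()
        self.horizontal = nn.Conv2d(c, c, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0))
        self.mix = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        directional = torch.cat([self.horizontal(x), self.vertical(x)], dim=1)
        return self.mix(directional) + x
```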
ISBN: (Print) 9789819786190; 9789819786206
Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully exploit the knowledge of the text modality and demonstrate strong zero-shot recognition performance, which makes them naturally suited to various open-set problems. More recently, some research has focused on fine-tuning such models for downstream tasks. Prompt tuning methods have achieved huge improvements by learning context vectors on few-shot data. However, by evaluating under an open-set adaptation setting where the test data includes new classes, we find a dilemma: learned prompts generalize worse than hand-crafted prompts. In this paper, we combine the advantages of both and propose a test-time prompt tuning approach that leverages maximum concept matching (MCM) scores as dynamic weights to generate an input-conditioned prompt for each image at test time. Through extensive experiments on 11 different datasets, we show that our proposed method outperforms all comparison methods on average, considering both base and new classes. The code is available at https://***/gaozhengqing/TTPT.
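A minimal sketch of using MCM scores as dynamic weights, assuming MCM is the maximum softmax probability over class-wise cosine similarities and that the learned and hand-crafted context vectors are blended convexly per image (the paper's exact combination rule is not stated in the abstract):

```python
import torch
import torch.nn.functional as F

def mcm_weight(img_emb, class_text_embs, temperature=0.01):
    """MCM score: maximum softmax probability over class-wise cosine similarities."""
    # img_emb: (B, D); class_text_embs: (C, D)
    sims = F.cosine_similarity(img_emb.unsqueeze(1), class_text_embs, dim=-1)  # (B, C)
    return sims.div(temperature).softmax(dim=-1).max(dim=-1).values            # (B,)

def blended_prompt(learned_ctx, handcrafted_ctx, w):
    """Assumed combination rule: per-image convex blend of learned and
    hand-crafted context vectors, weighted by MCM confidence w."""
    # learned_ctx, handcrafted_ctx: (n_ctx, D); w: (B,)
    w = w.view(-1, 1, 1)                          # broadcast to (B, n_ctx, D)
    return w * learned_ctx + (1 - w) * handcrafted_ctx
```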
ISBN: (Print) 9789819784981; 9789819784998
Local feature matching, which identifies correspondences between image pairs, remains a fundamental challenge in computer vision. Current methods usually utilize multi-scale feature fusion to refine reference areas and filter out irrelevant features. However, relying solely on an agent loss to supervise upper-level features can reduce refinement accuracy. In addition, the variance in significance among features within the reference region is often overlooked. In this paper, we propose an approach termed Cascaded Supervision-Neighborhood Consistency Probabilistic Modeling, which generates more accurate reference ranges for feature matching. Specifically, the proposed method first applies cascaded supervision to the matching results at various scales, enabling more precise refinement of regions. Then, it aggregates the matching results at each scale to maintain neighborhood consistency. Finally, probabilistic modeling of the refined reference region is employed, focusing more on relevant features. Extensive experiments conducted on four popular benchmarks demonstrate that our method achieves state-of-the-art or comparable performance.
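A minimal sketch of what cascaded supervision over a matching pyramid could look like: each scale's correspondence scores are supervised against pooled ground truth instead of relying on a single agent loss at the top level. The loss form and pooling below are assumptions:

```python
import torch
import torch.nn.functional as F

def cascaded_matching_loss(match_logits_per_scale, gt_matches):
    """Each scale of the matching pyramid is supervised directly against
    (max-pooled) ground-truth correspondences, rather than propagating one
    agent loss from the finest level. Loss form and pooling are assumptions."""
    # match_logits_per_scale: list of (B, 1, Hi, Wi) correspondence score maps
    # gt_matches: (B, 1, H, W) binary (float) ground-truth correspondence map
    loss = 0.0
    for logits in match_logits_per_scale:
        gt = F.adaptive_max_pool2d(gt_matches, logits.shape[-2:])
        loss = loss + F.binary_cross_entropy_with_logits(logits, gt)
    return loss / len(match_logits_per_scale)
```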