With the increasing prevalence of digital documents in various domains, the demand for efficient and accurate question-answering (QA) systems has grown significantly. Traditional QA models primarily focus on text-base...
Currently, car parking assistance is limited to the small angle rear view imaging of the reversing radar, which can only provide the driver with a limited range of vision and is prone to safety hazards. It is necessar...
ISBN: (print) 9783031278174; 9783031278181
Fine-grained visual categorization (FGVC) is a challenging image analysis task that requires comprehensive extraction and representation of discriminative features. To address this, previous works focus on designing complex modules, the so-called necks and heads, on top of simple backbones, at the cost of a heavy computational burden. In this paper, we offer a new insight: the vision Transformer (ViT) is itself an all-in-one FGVC framework, consisting of a basic Backbone for feature extraction, a Neck for further feature enhancement and a Head for selecting discriminative features. We delve into the feature extraction and representation patterns of ViT for FGVC and empirically show that simply recombining the original ViT structure to leverage multi-level semantic representations, without introducing any additional parameters, achieves higher performance. Based on this insight, we propose RecViT, a simple recombination and modification of the original ViT that captures multi-level semantic features and facilitates fine-grained recognition. In RecViT, the deep layers of the original ViT serve as the Head, a few middle layers as the Neck and the shallow layers as the Backbone. In addition, we adopt an optional Feature processing Module to enhance the discriminative feature representation at each semantic level and align the levels for final recognition. With these simple modifications, RecViT obtains significant accuracy improvements on the FGVC benchmarks CUB-200-2011, Stanford Cars and Stanford Dogs.
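The recombination described in this abstract, with shallow layers as Backbone, a few middle layers as Neck and deep layers as Head, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_vit_layers`, the 12-block depth and the group sizes are hypothetical choices made only for illustration, since the abstract does not state how many layers go into each group.

```python
def split_vit_layers(depth, n_neck, n_head):
    """Partition ViT block indices 0..depth-1 into three contiguous groups.

    Hypothetical helper: the group sizes are assumptions, not values
    taken from the RecViT abstract.
    """
    if n_neck < 1 or n_head < 1 or n_neck + n_head >= depth:
        raise ValueError("each group needs at least one layer")
    n_backbone = depth - n_neck - n_head
    layers = list(range(depth))
    return {
        "backbone": layers[:n_backbone],                     # shallow layers
        "neck": layers[n_backbone:n_backbone + n_neck],      # a few middle layers
        "head": layers[n_backbone + n_neck:],                # deep layers
    }


# Example with an assumed 12-block ViT, 2 neck layers and 3 head layers.
groups = split_vit_layers(depth=12, n_neck=2, n_head=3)
print(groups)
```

Because the split only reassigns roles to existing blocks, it adds no parameters, which matches the abstract's claim about the recombination itself; the optional Feature processing Module is not modeled here.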
Solder cracks are caused by repeated expansion and contraction due to temperature changes. When designing a new electronic board, a heat shock test is performed on the electronic board to identify areas where solder c...
In this work, we propose a no-reference image quality evaluation approach that addresses the limited ability of traditional convolutional neural networks to express the global information of the image. ...
Plant diseases, particularly those affecting fruit crops, pose a significant challenge to the worldwide supply of fresh food due to their direct impact on the quality of fruits, resulting in an overall decline in agricultur...
Texture image classification is a fundamental and challenging visual task with a wide range of applications. Binary pattern methods play an important role in texture feature extraction due to their ease of implementati...
Jewelry recognition is a complex task due to the different styles and designs of accessories. Precise description of the various accessories is something that today can only be achieved by experts in the field of jew...
Due to its wide applications in psychology, healthcare, and safety, facial emotion recognition is a crucial machine vision problem that merits study. Emotion detection from facial expressions is considered a challe...
Cross-modality crowd counting is one of the most essential tasks in multimedia and image processing, which usually uses multi-sensor information as input in neural networks. Various approaches have been proposed to ex...