ISBN:
(Print) 9798350353006
The rapid increase in cases of non-alcoholic fatty liver disease (NAFLD) in recent years has raised significant public concern. Accurately identifying tissue alteration regions is crucial for the diagnosis of NAFLD, but this task presents challenges in pathology image analysis, particularly with small-scale datasets. Recently, the paradigm shift from full fine-tuning to prompting in adapting vision foundation models has offered a new perspective for small-scale data analysis. However, existing prompting methods based on task-agnostic prompts are mainly developed for generic image recognition and fall short in providing instructive cues for complex pathology images. In this paper, we propose Quantitative Attribute-based Prompting (QAP), a novel prompting method specifically for liver pathology image analysis. QAP is based on two quantitative attributes, namely K-function-based spatial attributes and histogram-based morphological attributes, which are aimed at the quantitative assessment of tissue states. Moreover, a conditional prompt generator is designed to turn these instance-specific attributes into visual prompts. Extensive experiments on three diverse tasks demonstrate that our task-specific prompting method achieves better diagnostic performance as well as better interpretability. Code is available at https://***/7LFB/QAP.
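The abstract does not specify how the K-function-based spatial attributes are computed; as a rough illustration of the underlying statistic, a naive Ripley's K estimate over 2D point locations (e.g., nuclei centroids) can be sketched as follows. The estimator shown here omits edge correction, and the function name and inputs are assumptions, not the paper's implementation:

```python
import numpy as np

def ripley_k(points, radii, area):
    """Naive Ripley's K estimate (no edge correction) for 2D points.

    K(r) ~ (area / n^2) * number of ordered point pairs within distance r.
    Values above pi * r^2 suggest clustering relative to spatial randomness.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances between all points, shape (n, n).
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-pairs from the count
    return np.array([area / n**2 * np.sum(d < r) for r in radii])

# Tightly clustered points yield K(r) well above the pi * r^2 baseline.
rng = np.random.default_rng(0)
clustered = rng.normal(0.5, 0.05, size=(100, 2))
k = ripley_k(clustered, radii=[0.1, 0.2], area=1.0)
```

A vector of K(r) values at several radii could then serve as a fixed-length spatial descriptor of a tissue region.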
ISBN:
(Print) 9798350353006
This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown success in transfer learning, task-specific sub-architectures still need to be appended for different downstream tasks, which cannot enjoy the benefits of large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, both the pre-training pretext task and the downstream tasks are modeled as "query-to-answer" problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID maintains the pre-trained encoder-decoder and queries, only replacing the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.
ISBN:
(Print) 9798350307443
Facial expression recognition (FER) is an important task in computer vision, having practical applications in areas such as human-computer interaction, education, healthcare, and online monitoring. In this challenging FER task, three key issues are especially prevalent: inter-class similarity, intra-class discrepancy, and scale sensitivity. While existing works typically address some of these issues, none have fully addressed all three challenges in a unified framework. In this paper, we propose a two-stream Pyramid crOss-fuSion TransformER network (POSTER) that aims to holistically solve all three issues. Specifically, we design a transformer-based cross-fusion method that enables effective collaboration of facial landmark features and image features to maximize proper attention to salient facial regions. Furthermore, POSTER employs a pyramid structure to promote scale invariance. Extensive experimental results demonstrate that our POSTER achieves new state-of-the-art results on RAF-DB (92.05%), FERPlus (91.62%), as well as AffectNet 7 class (67.31%) and 8 class (63.34%). Code is available at https://***/zczcwh/POSTER.
In this paper, we learn to classify visual object instances, incrementally and via self-supervision (self-incremental). Our learner observes a single instance at a time, which is then discarded from the dataset. Incre...
ISBN:
(Print) 9798350353006
This paper addresses the critical challenges of sparsity and occlusion in LiDAR-based 3D object detection. Current methods often rely on supplementary modules or specific architectural designs, potentially limiting their applicability to new and evolving architectures. To our knowledge, we are the first to propose a versatile technique that seamlessly integrates into any existing framework for 3D object detection, marking the first instance of Weak-to-Strong generalization in 3D computer vision. We introduce a novel framework, X-Ray Distillation with Object-Complete Frames, suitable for both supervised and semi-supervised settings, that leverages the temporal aspect of point cloud sequences. This method extracts crucial information from both previous and subsequent LiDAR frames, creating Object-Complete frames that represent objects from multiple viewpoints, thus addressing occlusion and sparsity. Given that Object-Complete frames cannot be generated during online inference, we utilize Knowledge Distillation within a Teacher-Student framework. This technique encourages the strong Student model to emulate the behavior of the weaker Teacher, which processes simple and informative Object-Complete frames, effectively offering a comprehensive view of objects as if seen through X-ray vision. Our proposed methods surpass the state of the art in semi-supervised learning by 1-1.5 mAP and enhance the performance of five established supervised models by 1-2 mAP on standard autonomous driving datasets, even with default hyperparameters. Code for Object-Complete frames is available here: https://***/sakharok13/X-Ray-TeacherPatching-Tools.
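The abstract does not state the exact distillation objective; as a generic illustration of the Teacher-Student mechanism it describes, a softened-logit distillation loss in the style of standard knowledge distillation can be sketched as follows (the temperature value, function names, and logit shapes are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    averaged over the batch and scaled by T^2 (standard KD convention)."""
    t = softmax(np.asarray(teacher_logits) / temperature)
    s = softmax(np.asarray(student_logits) / temperature)
    return temperature**2 * np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1))

# A student matching the teacher incurs (near-)zero loss; divergence raises it.
teacher = np.array([[2.0, 0.5, -1.0]])
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(teacher + np.array([[0.0, 3.0, 0.0]]), teacher)
```

In the paper's setting, the Teacher's inputs (Object-Complete frames) are easier than the Student's raw frames, so this loss transfers the Teacher's more complete view of occluded objects.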
ISBN:
(Digital) 9781665487399
ISBN:
(Print) 9781665487399
Distribution shift can have fundamental consequences, such as signaling a change in the operating environment or significantly reducing the accuracy of downstream models. Thus, understanding such distribution shifts is critical for examining and hopefully mitigating their effects. Most prior work has focused on either natively handling distribution shift (e.g., Domain Generalization) or merely detecting a shift while assuming any detected shift can be understood and handled appropriately by a human operator. For the latter, we hope to aid in these manual mitigation tasks by explaining the distribution shift to an operator. To this end, we suggest two methods: providing a set of interpretable mappings from the original distribution to the shifted one, or providing a set of distributional counterfactual examples. We provide preliminary experiments on these two methods, and discuss important concepts and challenges for moving towards a better understanding of image-based distribution shifts.
ISBN:
(Digital) 9781665487399
ISBN:
(Print) 9781665487399
We present a keypoint-based activity recognition framework, built upon pre-trained human pose estimation and facial feature detection models. Our method extracts complex static and movement-based features from key frames in videos, which are used to predict a sequence of key-frame activities. Finally, a merge procedure is employed to identify robust activity segments while ignoring outlier frame activity predictions. We analyze the different components of our framework via a wide array of experiments and draw conclusions with regard to the utility of the model and ways it can be improved. Results show our model is competitive, taking 11th place out of 27 teams submitting to Track 3 of the 2022 AI City Challenge.
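The merge procedure is only described at a high level; one plausible sketch of the idea is a run-length smoothing that collapses per-frame predictions into segments and drops short runs as outliers (the minimum-length threshold and the exact rule are assumptions, not the paper's algorithm):

```python
from itertools import groupby

def merge_segments(frame_labels, min_len=3):
    """Collapse per-frame activity labels into (label, start, end) segments,
    discarding runs shorter than min_len as outlier predictions."""
    segments, i = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))  # length of this run of identical labels
        if n >= min_len:
            segments.append((label, i, i + n - 1))
        i += n
    return segments

# A 2-frame blip of "B" inside a long "A" stretch is ignored as an outlier.
segs = merge_segments(["A"] * 5 + ["B"] * 2 + ["A"] * 4)
```

A real implementation might additionally merge the adjacent same-label segments left behind after a blip is removed.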
ISBN:
(Digital) 9781665487399
ISBN:
(Print) 9781665487399
The search for interpretable directions in the latent spaces of pre-trained Generative Adversarial Networks (GANs) has become a topic of interest. These directions can be utilized to perform semantic manipulations on the GAN-generated images. The discovery of such directions is performed either in a supervised way, which requires manual annotation or pre-trained classifiers, or in an unsupervised way, which requires the user to interpret what these directions represent. In this work, we propose a framework that finds a specific manipulation direction using only a single simple sketch drawn on an image. Our method finds directions consisting of channels in the style space of the StyleGAN2 architecture responsible for the desired edits and performs image manipulations comparable with state-of-the-art methods.
ISBN:
(Print) 9798331536626
The proceedings contain 166 papers. The topics discussed include: applying computer vision to analyze self-injurious behaviors in children with autism spectrum disorder; underwater image enhancement and object detection: are poor object detection results on enhanced images due to missing human labels?; enhancing weakly-supervised object detection on static images through (hallucinated) motion; a zero-shot learning approach for ephemeral gully detection from remote sensing using vision language models; Attrivision: advancing generalization in pedestrian attribute recognition using CLIP; human gaze improves vision transformers by token masking; SSTAR: skeleton-based spatio-temporal action recognition for intelligent video surveillance and suicide prevention in metro stations; and offline signature verification in the banking domain.
ISBN:
(Digital) 9781665487399
ISBN:
(Print) 9781665487399
The semantic segmentation of agricultural aerial images is very important for the recognition and analysis of farmland anomaly patterns, such as drydown, endrow, nutrient deficiency, etc. Methods for general semantic segmentation such as Fully Convolutional Networks can extract rich semantic features but struggle to exploit long-range information. Recently, vision Transformer architectures have achieved outstanding performance in image segmentation tasks, but transformer-based models have not been fully explored in the field of ***. We propose a novel architecture called Agricultural Aerial Transformer (AAFormer) to solve the semantic segmentation of aerial farmland images. We adopt Mix Transformer (MiT) in the encoder stage to enhance the ability of field anomaly pattern recognition and leverage the Squeeze-and-Excitation (SE) module in the decoder stage to improve the effectiveness of key channels. The boundary maps of farmland are introduced into the decoder. Evaluated on the Agriculture-Vision validation set, the mIoU of our proposed model reaches 45.44%.
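The Squeeze-and-Excitation step mentioned above reweights feature channels by a learned gate; a minimal NumPy sketch of the standard SE block follows. The hidden width (reduction ratio), weight shapes, and use of plain arrays instead of a deep-learning framework are assumptions for illustration, and the paper's exact placement in the decoder may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation over a feature map x of shape (C, H, W).

    Squeeze: global average pool per channel.
    Excitation: two FC layers (ReLU then sigmoid) produce channel gates in (0, 1).
    """
    z = x.mean(axis=(1, 2))          # squeeze        -> (C,)
    h = np.maximum(w1 @ z, 0.0)      # FC + ReLU      -> (C // r,)
    s = sigmoid(w2 @ h)              # FC + sigmoid   -> (C,)
    return x * s[:, None, None]      # rescale each channel by its gate

C, r = 8, 2  # channels and reduction ratio (illustrative values)
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 4, 4))
out = se_block(x, rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
```

Because each gate lies in (0, 1), the block can only attenuate channels, which is what lets it emphasize "key channels" relative to the rest.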