Multi-label Pedestrian Attribute Recognition (PAR) involves identifying a series of semantic attributes in person images. Existing PAR solutions typically rely on a CNN backbone to extract pedestrian features. Unfortunately, CNNs process only a local neighborhood at a time, so long-range relations between different attribute-specific regions are lost. To address this limitation, we adopt the Vision Transformer (ViT) instead of a CNN as the backbone for PAR, aiming to build long-range relations and extract more robust features. However, PAR suffers from an inherent attribute imbalance issue, which causes ViT to focus more on attributes that appear frequently in the training set and to ignore attributes that appear less often. The native features extracted by ViT cannot cope with this imbalanced attribute distribution. To tackle this issue, we propose a novel component and a dual-level loss: the Selective Feature Activation Method (SFAM), the Orthogonal Feature Activation Loss (OFALoss), and the Orthogonal Weight Regularization Loss (OWRLoss). SFAM selectively suppresses the more informative attribute-specific features, compelling the PAR model to pay greater attention to attribute-specific regions that are often overlooked. OFALoss enforces an orthogonality constraint between the original features extracted by ViT and the suppressed features from SFAM, promoting a more comprehensive feature representation in each attribute-specific region. Furthermore, OWRLoss decreases correlations among the entries of the last shared classification layer, alleviating the high correlation among weight vectors caused by the non-uniform attribute distribution. This prevents excessive mutual interference among different attributes during recognition. Our model-agnostic approach is plug-and-play and introduces no additional trainable parameters during training. We conduct experiments on several benchmark P...
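The three ingredients described in this abstract can be illustrated with generic stand-ins. The sketch below is not the authors' implementation: `suppress_top_k`, `orthogonal_feature_loss` and `orthogonal_weight_loss` are hypothetical names, and it assumes that suppression means zeroing the k largest-magnitude feature entries, that the feature-level orthogonality penalty is the mean squared cosine similarity between paired features, and that the weight-level penalty is the sum of squared pairwise cosine similarities between per-attribute classifier weight vectors (the off-diagonal part of a row-normalized W Wᵀ). None of these functions has trainable parameters, which is consistent with the plug-and-play claim.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def suppress_top_k(feature: Vector, k: int) -> List[float]:
    """Hypothetical SFAM-style step (assumption): zero the k largest-magnitude
    entries so the model must rely on the remaining, less dominant ones."""
    order = sorted(range(len(feature)), key=lambda i: abs(feature[i]), reverse=True)
    out = list(feature)
    for i in order[:k]:
        out[i] = 0.0
    return out


def cosine(u: Vector, v: Vector, eps: float = 1e-12) -> float:
    """Cosine similarity; eps guards against division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonal_feature_loss(original: Sequence[Vector], suppressed: Sequence[Vector]) -> float:
    """Assumed OFALoss-like penalty: mean squared cosine similarity between each
    original feature and its suppressed counterpart (0 when all pairs are orthogonal)."""
    pairs = list(zip(original, suppressed))
    return sum(cosine(f, g) ** 2 for f, g in pairs) / len(pairs)


def orthogonal_weight_loss(weights: Sequence[Vector]) -> float:
    """Assumed OWRLoss-like penalty: sum of squared cosine similarities between
    every pair of distinct per-attribute weight vectors (rows of W)."""
    k = len(weights)
    return sum(
        cosine(weights[i], weights[j]) ** 2
        for i in range(k)
        for j in range(k)
        if i != j
    )


if __name__ == "__main__":
    print(suppress_top_k([3.0, 0.5, -2.0, 0.1], k=2))  # [0.0, 0.5, 0.0, 0.1]
    print(orthogonal_feature_loss([[1.0, 0.0]], [[0.0, 1.0]]))  # 0.0
    print(orthogonal_weight_loss([[1.0, 0.0], [0.0, 1.0]]))  # 0.0
```

Both penalties reach zero only when the compared vectors are mutually orthogonal, so minimizing them pushes suppressed features away from the originals and decorrelates the per-attribute weight vectors.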
In noisy label learning, estimating noisy class posteriors plays a fundamental role in developing consistent classifiers, as it forms the basis for estimating clean class posteriors and the transition matrix. Existin...
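Under the common class-conditional noise model (an assumption here, since the abstract is truncated), the transition matrix is defined by T[i][j] = P(noisy label = j | clean label = i), and the noisy class posterior satisfies p_noisy(x) = Tᵀ p_clean(x). The pure-Python sketch below, with hypothetical function names, shows this forward relation and how a clean posterior can be recovered from a noisy one when T is known and invertible; it is not the estimator proposed in this work.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def noisy_posterior(T: Matrix, clean: Sequence[float]) -> List[float]:
    """p_noisy[j] = sum_i T[i][j] * clean[i], i.e. p_noisy = T^T p_clean,
    where T[i][j] = P(noisy label j | clean label i)."""
    n = len(clean)
    return [sum(T[i][j] * clean[i] for i in range(n)) for j in range(n)]


def clean_posterior(T: Matrix, noisy: Sequence[float]) -> List[float]:
    """Recover p_clean by solving T^T p = p_noisy with Gauss-Jordan elimination
    and partial pivoting. Assumes T is invertible."""
    n = len(noisy)
    # Row j of the augmented system encodes sum_i T[i][j] * p[i] = noisy[j].
    A = [[T[i][j] for i in range(n)] + [noisy[j]] for j in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][n] / A[r][r] for r in range(n)]


if __name__ == "__main__":
    T = [[0.8, 0.2], [0.3, 0.7]]  # illustrative values, not from the paper
    p_noisy = noisy_posterior(T, [0.9, 0.1])
    print(p_noisy)  # approximately [0.75, 0.25]
    print(clean_posterior(T, p_noisy))  # approximately [0.9, 0.1]
```

In practice T is unknown and must itself be estimated, which is why accurate noisy-posterior estimates matter for both quantities.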
Image inpainting has made remarkable progress and inspired numerous methods; its critical bottleneck has been identified as how to complete the high-frequency structure and low-frequency texture information on the ...
Emerging technologies in sixth-generation (6G) wireless communications, such as terahertz communication and ultra-massive multiple-input multiple-output (MIMO), present promising prospects. Despite the high data rate pote...
ISBN (electronic): 9798350368741
ISBN (print): 9798350368758
Dual-view gaze target estimation in classroom environments has not been thoroughly explored. Existing methods largely ignore depth information: they focus primarily on 2D image cues and neglect the latent 3D spatial context, which can lead to suboptimal transformations and cause the gaze cone to intersect an incorrect object. This paper introduces a novel dual-view gaze target estimation method tailored to classroom settings that leverages depth-enhanced spatial transformations. By formulating a depth-enhanced 2D space, our method uses a depth-enhanced spatial transformation to accurately project students’ gaze cones onto the teacher-oriented image. Additionally, we collected a dataset named DVSGE specifically for student gaze target estimation in dual-view classroom images. Experimental results show that our method surpasses existing methods, with improvements of 9.8% in AUC and 19.9% in L2-Distance.
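The abstract does not give the exact form of the depth-enhanced spatial transformation. As a point of reference only, the sketch below shows the standard pinhole-camera way to move a pixel with known depth from one view to another: back-project with the source intrinsics, apply a rigid transform (R, t) between the camera frames, and project with the target intrinsics. The `Intrinsics` class, the function names and the calibration values are hypothetical, and lens distortion is ignored.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole intrinsics in pixels (assumed camera model, no lens distortion)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth: float, K: Intrinsics) -> Point3:
    """Pixel (u, v) with depth along the optical axis -> 3D point in that camera's frame."""
    return ((u - K.cx) * depth / K.fx, (v - K.cy) * depth / K.fy, depth)


def rigid_transform(p: Point3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Point3:
    """Apply p' = R p + t, mapping a point from the source to the target camera frame."""
    x, y, z = (sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p: Point3, K: Intrinsics) -> Tuple[float, float]:
    """3D point in a camera frame -> pixel coordinates; the point must lie in front of the camera."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point is behind the target camera")
    return (K.fx * x / z + K.cx, K.fy * y / z + K.cy)


def reproject(u, v, depth, K_src, K_dst, R, t):
    """Map a pixel with known depth from the source view into the target view."""
    return project(rigid_transform(backproject(u, v, depth, K_src), R, t), K_dst)


if __name__ == "__main__":
    K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)  # hypothetical calibration
    I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Pure translation of 0.5 units along x between the two camera frames.
    print(reproject(400.0, 300.0, 2.0, K, K, I3, [0.5, 0.0, 0.0]))  # approximately (525.0, 300.0)
```

Under these assumptions, a gaze cone could be carried into the teacher-oriented view by reprojecting its apex and a few points sampled along its axis; the abstract does not state whether the proposed method works this way.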
Open-vocabulary video action recognition is a promising research direction that aims to recognize previously unseen actions from an arbitrary set of categories. Existing methods typically adapt pretrained image-te...
The vulnerability of 3D point cloud analysis to unpredictable rotations poses an open yet challenging problem: orientation-aware 3D domain generalization. Cross-domain robustness and adaptability of 3D representations...
Structure from motion has attracted considerable research attention in recent years, with new state-of-the-art approaches appearing almost every year. One of its advantages over 3D reconstruction is that it can be used for any cameras...
Sequential pattern mining (SPM) with gap constraints (also called repetitive SPM, or tandem repeat discovery in bioinformatics) finds frequent repetitive subsequences that satisfy gap constraints, which are called positive seq...
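To make the notion of a gap constraint concrete: a pattern p1 p2 ... pm occurs at positions i1 < i2 < ... < im if each character matches and the number of characters skipped between consecutive positions lies in [min_gap, max_gap]. The dynamic-programming sketch below counts all such occurrences. It is an illustration only: the function name is hypothetical, and repetitive SPM methods often count support under stricter occurrence conditions (for example, non-overlapping occurrences), which this sketch does not enforce.

```python
def count_gap_occurrences(seq: str, pattern: str, min_gap: int, max_gap: int) -> int:
    """Count every occurrence of `pattern` in `seq` whose consecutive matched
    positions are separated by between min_gap and max_gap (inclusive)
    skipped characters. No non-overlap condition is applied."""
    if not pattern:
        return 0
    n = len(seq)
    # ways[i] = number of ways to match the pattern prefix so far, ending at position i.
    ways = [1 if seq[i] == pattern[0] else 0 for i in range(n)]
    for ch in pattern[1:]:
        nxt = [0] * n
        for i in range(n):
            if seq[i] != ch:
                continue
            # The previous match j must satisfy min_gap <= i - j - 1 <= max_gap.
            lo = max(0, i - max_gap - 1)
            hi = i - min_gap - 1
            nxt[i] = sum(ways[j] for j in range(lo, hi + 1))  # empty range when hi < lo
        ways = nxt
    return sum(ways)


if __name__ == "__main__":
    print(count_gap_occurrences("ACAC", "AC", 0, 1))  # 2
    print(count_gap_occurrences("ACAC", "AC", 0, 2))  # 3
```

With gap [0, 2] the pair at positions (0, 3) also qualifies, which is why the count rises from 2 to 3.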
Generalized Category Discovery (GCD) is a crucial task that aims to recognize both known and novel categories in a set of unlabeled data by using a small amount of labeled data that covers only known categories. Due to the lack of...