We introduce, XoFTR, a cross-modal cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather...
详细信息
ISBN:
(数字)9798350365474
ISBN:
(纸本)9798350365481
We introduce, XoFTR, a cross-modal cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather conditions but present difficulties in matching due to significant texture and intensity differences. Current hand-crafted and learning-based methods for visible-TIR matching fall short in handling viewpoint, scale, and texture diversities. To address this, XoFTR incorporates masked image modeling pre-training and fine-tuning with pseudo-thermal image augmentation to handle the modality differences. Additionally, we introduce a refined matching pipeline that adjusts for scale discrepancies and enhances match reliability through sub-pixel level refinement. To validate our approach, we collect a comprehensive visible-thermal dataset, and show that our method outperforms existing methods on many benchmarks. Code and dataset at https://***/OnderT/XoFTR.
The 3D world limits the human body pose and the human body pose conveys information about the surrounding objects. Indeed, from a single image of a person placed in an indoor scene, we as humans are adept at resolving...
详细信息
ISBN:
(纸本)9781665445092
The 3D world limits the human body pose and the human body pose conveys information about the surrounding objects. Indeed, from a single image of a person placed in an indoor scene, we as humans are adept at resolving ambiguities of the human pose and room layout through our knowledge of the physical laws and prior perception of the plausible object and human poses. However, few computervision models fully leverage this fact. In this work, we propose a holistically trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a set of comprehensive and sophisticated losses on all aspects of the estimations, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level, and performs joint optimization on the scene and human poses.
Randomized smoothing (RS) is a well known certified defense against adversarial attacks, which creates a smoothed classifier by predicting the most likely class under random noise perturbations of inputs during infere...
Randomized smoothing (RS) is a well known certified defense against adversarial attacks, which creates a smoothed classifier by predicting the most likely class under random noise perturbations of inputs during inference. While initial work focused on robustness to ℓ 2 norm perturbations using noise sampled from a Gaussian distribution, subsequent works have shown that different noise distributions can result in robustness to other ℓ p norm bounds as well. In general, a specific noise distribution is optimal for defending against a given ℓ p norm based attack. In this work, we aim to improve the certified adversarial robustness against multiple perturbation bounds simultaneously. Towards this, we firstly present a novel certification scheme, that effectively combines the certificates obtained using different noise distributions to obtain optimal results against multiple perturbation bounds. We further propose a novel training noise distribution along with a regularized training scheme to improve the certification within both ℓ 1 and ℓ 2 perturbation norms simultaneously. Contrary to prior works, we compare the certified robustness of different training algorithms across the same natural (clean) accuracy, rather than across fixed noise levels used for training and certification. We also empirically invalidate the argument that training and certifying the classifier with the same amount of noise gives the best results. The proposed approach achieves improvements on the ACR (Average Certified Radius) metric across both ℓ 1 and ℓ 2 perturbation bounds. Code available at https://***/valiisc/NU-Certified-Robustness
As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. It differentiates from the original VQA task as Tex...
详细信息
ISBN:
(纸本)9781665401913
As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. It differentiates from the original VQA task as Text-VQA requires large amounts of scene-text relationship understanding, in addition to the cross-modal grounding capability. In this paper, we propose Localize, Group, and Select (LOGOS), a novel model which attempts to tackle this problem from multiple aspects. LOGOS leverages two grounding tasks to better localize the key information of the image, utilizes scene text clustering to group individual OCR tokens, and learns to select the best answer from different sources of OCR (Optical Character recognition) texts. Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data. Ablation studies and analysis demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.
Adversarial examples have pointed out Deep Neural Network's vulnerability to small local noise. It has been shown that constraining their Lipschitz constant should enhance robustness, but make them harder to learn...
详细信息
ISBN:
(纸本)9781665445092
Adversarial examples have pointed out Deep Neural Network's vulnerability to small local noise. It has been shown that constraining their Lipschitz constant should enhance robustness, but make them harder to learn with classical loss functions. We propose a new framework for binary classification, based on optimal transport, which integrates this Lipschitz constraint as a theoretical requirement. We propose to learn 1-Lipschitz networks using a new loss that is an hinge regularized version of the Kantorovich-Rubinstein dual formulation for the Wasserstein distance estimation. This loss function has a direct interpretation in terms of adversarial robustness together with certifiable robustness bound. We also prove that this hinge regularized version is still the dual formulation of an optimal transportation problem, and has a solution. We also establish several geometrical properties of this optimal solution, and extend the approach to multi-class problems. Experiments show that the proposed approach provides the expected guarantees in terms of robustness without any significant accuracy drop. The adversarial examples, on the proposed models, visibly and meaningfully change the input providing an explanation for the classification.
Most existing distance metric learning approaches use fully labeled data to learn the sample similarities in an embedding space. We present a self-training framework, SLADE, to improve retrieval performance by leverag...
详细信息
ISBN:
(纸本)9781665445092
Most existing distance metric learning approaches use fully labeled data to learn the sample similarities in an embedding space. We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data. We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate final feature embeddings. We use self-supervised representation learning to initialize the teacher model. To better deal with noisy pseudo labels generated by the teacher network, we design a new feature basis learning component for the student network, which learns basis functions of feature representations for unlabeled data. The learned basis vectors better measure the pairwise similarity and are used to select high-confident samples for training the student network. We evaluate our method on standard retrieval benchmarks: CUB-200, Cars196 and In-shop. Experimental results demonstrate that with additional unlabeled data, our approach significantly improves the performance over the state-of-the-art methods.
Face presentation attack detection (PAD) plays a vital role in face recognition systems. Many previous face anti-spoofing methods mainly focus on the 2D face representation attacks, which however, suffer from great pe...
详细信息
ISBN:
(纸本)9781665401913
Face presentation attack detection (PAD) plays a vital role in face recognition systems. Many previous face anti-spoofing methods mainly focus on the 2D face representation attacks, which however, suffer from great performance degradation when facing high-fidelity 3D mask attacks. To address this issue, we propose a novel dual-stream framework consisting of the vanilla convolution stream and the central difference convolution stream. These two streams complement each other and learn more comprehensive features for 3D mask attacks detection. Moreover, we extend 3D PAD to a multi-classification task that contains real face, plaster attack and transparent attack, and utilize various data augmentations and label smoothing techniques to improve the generalizability on unseen attacks. The proposed method achieved the second place in the Chalearn 3D High-Fidelity Mask Face Presentation Attack Detection Challenge@ICCV2021 with a score of 3.15 (ACER).
Negative emotions have been identified as significant factors influencing driver behavior, easily leading to extremely serious traffic accidents. Hence, there is a pressing need to develop an automatic emotion classif...
详细信息
ISBN:
(数字)9798350365474
ISBN:
(纸本)9798350365481
Negative emotions have been identified as significant factors influencing driver behavior, easily leading to extremely serious traffic accidents. Hence, there is a pressing need to develop an automatic emotion classification method for driver health monitoring and road safety improvement. Most of the existing methods predominantly focus on single modalities, resulting in suboptimal classification performance due to the underutilization of heterogeneous information. In this work, we propose a novel non-contacting dual-modality driver emotion classification network (DECNet) to address these limitations. DECNet consists of three key modules: 1) facial video modality processing module; 2) driving behavior modality processing module; 3) fusion decision module. Meanwhile, we introduce a combined multi-task learning strategy within DECNet to improve the efficacy in the driver emotion classification task. To evaluate the effectiveness of the proposed DECNet, we conducted experiments on the PPB-Emo dataset, the experimental results showcase the superiority in terms of accuracy (⩾ 6.12% Acc-7) and F1-score (⩾ 7.25% F1-7) compared to existing state-of-the art methods. The model and code will be available at https://***/fqfqngxhs/***
We study the problem of unsupervised discovery and segmentation of object parts, which, as an intermediate local representation, are capable of finding intrinsic object structure and providing more explainable recogni...
详细信息
ISBN:
(纸本)9781665445092
We study the problem of unsupervised discovery and segmentation of object parts, which, as an intermediate local representation, are capable of finding intrinsic object structure and providing more explainable recognition results. Recent unsupervised methods have greatly relaxed the dependency on annotated data which are costly to obtain, but still rely on additional information such as object segmentation mask or saliency map. To remove such a dependency and further improve the part segmentation performance, we develop a novel approach by disentangling the appearance and shape representations of object parts followed with reconstruction losses without using additional object mask information. To avoid degenerated solutions, a bottleneck block is designed to squeeze and expand the appearance representation, leading to a more effective disentanglement between geometry and appearance. Combined with a self-supervised part classification loss and an improved geometry concentration constraint, we can segment more consistent parts with semantic meanings. Comprehensive experiments on a wide variety of objects such as face, bird, and PASCAL VOC objects demonstrate the effectiveness of the proposed method.
Prior work in scene graph generation requires categorical supervision at the level of triplets-subjects and objects, and predicates that relate them, either with or without bounding box information. However, scene gra...
详细信息
ISBN:
(纸本)9781665445092
Prior work in scene graph generation requires categorical supervision at the level of triplets-subjects and objects, and predicates that relate them, either with or without bounding box information. However, scene graph generation is a holistic task: thus holistic, contextual supervision should intuitively improve performance. In this work, we explore how linguistic structures in captions can benefit scene graph generation. Our method captures the information provided in captions about relations between individual triplets, and context for subjects and objects (e.g. visual properties are mentioned). Captions are a weaker type of supervision than triplets since the alignment between the exhaustive list of human-annotated subjects and objects in triplets, and the nouns in captions, is weak. However, given the large and diverse sources of multimodal data on the web (e.g. blog posts with images and captions), linguistic supervision is more scalable than crowdsourced triplets. We show extensive experimental comparisons against prior methods which leverage instance- and image-level supervision, and ablate our method to show the impact of leveraging phrasal and sequential context, and techniques to improve localization of subjects and objects.
暂无评论